System and Method for Analyzing Trends in a Categorical Dataset Using Semantic Infusion
20230315995 · 2023-10-05
CPC classification: G06F16/3335 (Physics)
Abstract
A method for detecting semantic trends within categorical datasets from text-based documents includes using a processing module to obtain a plurality of text-based documents and perform a basic cleaning of each of the plurality of text-based documents. A semantic infusion module may generate an infused sentence in each of the plurality of text-based documents by inserting a word “A_ class (C.sub.i) _time (T.sub.j)” based on a computed infusion frequency value (I.sub.freq). A pattern generation module is configured to generate semantic trends by extracting trending items from the word vector representation created by a word vector module for each word of each infused sentence of each of the plurality of text-based documents.
Claims
1. A method for detecting semantic trends within the categorical datasets from text-based documents, the method comprising: obtaining, with a processing module, a plurality of text-based documents; removing, with the processing module, at least one of symbols, special characters, and predefined stop words from each sentence of each of the plurality of text-based documents; identifying, with the processing module, a category class (C.sub.i) and a time class (T.sub.j) to which each of the plurality of text-based documents is associated; determining, with a semantic infusion module, a number of words (L) in the respective sentence; computing, with the semantic infusion module, an infusion frequency value (I.sub.freq) based on the number of words (L) and determining the I.sub.freq number of non-consecutive random numbers from a set [0, (L−1)]; generating, with the semantic infusion module, an infused sentence by inserting a word “A_ class (C.sub.i) _time (T.sub.j)” prior to a word at a position in the respective sentence indicated by each of the determined non-consecutive random numbers; generating, with a word vector module, a word vector for each word of each infused sentence of each of the plurality of text-based documents; and generating, with a pattern generation module, semantic trends by extracting the trending items from the word vector representation generated for each word of each infused sentence of each of the plurality of text-based documents.
2. The method as claimed in claim 1, wherein the generating of the word vector includes using a Word2Vec technique to generate the word vector.
3. The method as claimed in claim 1, wherein the infusion frequency value (I.sub.freq) is computed as one of ceil{(log.sub.2 L)/2}, ceil{√(L)}, and ceil{√(L/2)}.
4. The method as claimed in claim 1, wherein the category class (C.sub.i) is indicative of a context to which text of an associated text-based document of the plurality of text-based documents is associated.
5. The method as claimed in claim 1, wherein the time class (T.sub.j) is indicative of a year, month, or date to which text of an associated text-based document of the plurality of text-based documents is associated.
6. A system for detecting semantic trends within the categorical datasets from text-based documents, the system comprising: a processing module configured to: obtain a plurality of text-based documents; remove at least one of symbols, special characters, and predefined stop words from each sentence of each of the plurality of text-based documents; and identify a category class (C.sub.i) and a time class (T.sub.j) to which each of the plurality of text-based documents is associated; a semantic infusion module configured to: determine, for each sentence in each of the plurality of text-based documents, a number of words (L) in the respective sentence; compute an infusion frequency value (I.sub.freq) based on the number of words (L); determine the I.sub.freq number of non-consecutive random numbers from a set [0, (L−1)]; and generate an infused sentence by inserting a word “A_ class (C.sub.i)_time (T.sub.j)” prior to a word at a position in the respective sentence indicated by each of the determined non-consecutive random numbers; a word vector module configured to generate a word vector for each word of each infused sentence of each of the plurality of text-based documents; and a pattern generation module configured to generate semantic trends by extracting the trending items from the word vector representation generated by said word vector module for each word of each infused sentence of each of the plurality of text-based documents.
7. The system as claimed in claim 6, wherein the word vector module is configured to generate the word vector using a Word2Vec technique.
8. The system as claimed in claim 6, wherein the semantic infusion module is configured to compute the infusion frequency value (I.sub.freq) as one of ceil{(log.sub.2 L)/2}, ceil{√(L)}, and ceil{√(L/2)}.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The detailed description is provided with reference to the accompanying figures.
DETAILED DESCRIPTION
[0009] The present subject matter describes example methods and systems for detecting semantic trends within categorical datasets from text-based documents for context trend analysis. The example methods and systems described herein overcome the frequency-based bias associated with traditional trend analysis and detect semantically meaningful trends for a given time-sliced categorical plurality of text-based documents.
[0010] The present subject matter is further described with reference to the accompanying figures. Wherever possible, the same reference numerals are used in the figures and the following description to refer to the same or similar parts. It should be noted that the description and figures merely illustrate principles of the present subject matter. It is thus understood that various arrangements may be devised that, although not explicitly described or shown herein, encompass the principles of the present subject matter. Moreover, all statements herein reciting principles, aspects, and examples of the present subject matter, as well as specific examples thereof, are intended to encompass equivalents thereof.
[0011] The manner in which the methods and systems are implemented is explained in detail with respect to
[0013] The system 100 may include a processing module 102. The processing module 102 may include microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any other devices that manipulate signals and data based on computer-readable instructions. Further, functions of the various elements shown in the figures, including any functional blocks labelled as “processor(s)”, may be provided through the use of dedicated hardware as well as hardware capable of executing computer-readable instructions.
[0014] Further, the system 100 may include a semantic infusion module 104, a word vector module 106 and a pattern generation module 108, coupled to the processing module 102. The modules 104, 106 and 108 may be implemented as a combination of hardware and programming, for example, programmable instructions to implement a variety of functionalities of the modules 104, 106 and 108. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the pattern generation module 108 may be executable instructions. Such instructions may be stored on a non-transitory machine-readable storage medium which may be coupled either directly with the system 100 or indirectly (for example, through networked means). In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processor, implement modules 104, 106 and 108. In other examples, the modules 104, 106 and 108 may be implemented as electronic circuitry.
[0015] The modules 104, 106 and 108, amongst other things, includes routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The modules 104, 106 and 108, may also be implemented as, signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the modules 104, 106 and 108, can be implemented by hardware, by computer-readable instructions executed by a processing unit, or by a combination thereof.
[0016] Further, the system 100 includes a storage device 110. The storage device 110 may include any non-transitory computer-readable medium including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read-only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The storage device 110 may store activity data 103. In an example, the activity data 103 includes a plurality of text-based documents D={d.sub.k}.sub.{k=1}.sup.N, with each document d.sub.k belonging to a class c.sub.i in the set of M classes C={c.sub.i}.sub.{i=1}.sup.M.
[0017] In an example, the system 100 includes interface(s) 112. The interface(s) 112 may include a variety of interfaces, for example, interface(s) 112 for users. The interface(s) 112 may include data output devices. The interface(s) 112 may facilitate the communication of the system 100 with various communication and electronic devices. In an example, the interface(s) 112 may enable wireless communications between the system 100, such as a laptop, and one or more other computing devices (not shown).
[0018] The description hereinafter describes how detecting semantic trends within categorical datasets from text-based documents is performed by the system 100. The processing module 102 obtains a plurality of text-based documents D (101), where d.sub.k indicates the k.sup.th text-based document of the N text-based documents, where N is any positive number. In an example, N=1000. Each text-based document d.sub.k is a document that has text, such as names of vehicle parts, program source code, or batch files, and is readable by the processing module 102. In an example, the plurality of text-based documents D (101) may be obtained by converting a plurality of voice notes into text documents. Such a conversion may be performed by an automatic speech recognition technique. Each text-based document d.sub.k includes a set of sentences j.sub.dk and belongs to a class c.sub.i from amongst a set of M classes defined by C={c.sub.i}.sub.{i=1}.sup.M. In case the value of M is 3, the set of M classes is {c.sub.1, c.sub.2, c.sub.3}. The class c.sub.i is a technical field to which the text-based document d.sub.k belongs. Each text-based document d.sub.k belongs to a category class c.sub.i∈C such that C={c.sub.i}.sup.M.sub.i=1 and a time class t.sub.j∈T such that T={t.sub.j}.sup.L.sub.j=1. The system 100 detects top-k semantic trending items for each pair of C.sub.iT.sub.j. For example, in automobiles, all text-based documents referring to a seat belt can be considered of one class.
[0019] Further, at least one of symbols, special characters, and predefined stop words are removed from each sentence of the set of sentences j.sub.dk of each document d.sub.k of the plurality of text-based documents D (101) by the processing module 102, for basic cleansing of the plurality of text-based documents D. A symbol may represent an idea, object, or relationship using a mark or a sign. A special character is a character that is not an alphabetic or numeric character; for example, punctuation marks are considered special characters. A stop word is a word that is not related to the context of the document d.sub.k. For each class c.sub.i, a list of stop words may be predefined, and based on the class c.sub.i of the document d.sub.k, the predefined stop words are removed from each sentence of the set of sentences j.sub.dk of each document d.sub.k. In an example, if the plurality of text-based documents D belongs to a class of the English language, articles and prepositions may be considered predefined stop words and can be removed from the plurality of text-based documents D for basic cleansing.
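As an illustrative sketch only (the specification does not prescribe an implementation), the basic cleansing step described above might be realized as follows; the function name and regular expression are assumptions:

```python
import re

def clean_sentence(sentence, stop_words):
    # Remove symbols and special characters (keep alphanumerics,
    # hyphens, and whitespace), then drop the predefined stop words
    # associated with the document's class.
    text = re.sub(r"[^A-Za-z0-9\- ]+", " ", sentence)
    words = [w for w in text.lower().split() if w not in stop_words]
    return " ".join(words)
```

For instance, with the stop-word list {"the"}, the sentence "The right, front wheel locked!" would be cleaned to "right front wheel locked".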
[0020] The processing module 102 is configured to identify a category class (C.sub.i) and a time class (T.sub.j) to which each of the plurality of text-based documents (101) belongs. The category class c.sub.i may be indicative of a context to which text of a text-based document d.sub.k is associated. In an example of the automobile domain, if the plurality of text-based documents D (101) relate to an engine assembly, then the class c.sub.i may be identified as “engine assembly”. Further, the time class (T.sub.j) may be indicative of a year, month, or date to which text of a text-based document d.sub.k is associated.
[0021] Further, the semantic infusion module 104 is configured to determine, for each sentence in each of the plurality of text-based documents d.sub.k, a number of words (L) in the respective sentence. Upon identifying the class c.sub.i of each cleaned text-based document d.sub.k, the processing module 102 transmits this information to the semantic infusion module 104 for determining a number of words (L) for each sentence in each text-based document d.sub.k of the plurality of text-based documents D (101). The determined number of words (L) is indicative of a cleaned length of each sentence.
[0022] Furthermore, the semantic infusion module 104 is configured to compute an infusion frequency value (I.sub.freq) based on the number of words (L) determined for each sentence of each text-based document d.sub.k of the plurality of text-based documents D (101). In an example, the infusion frequency value (I.sub.freq) is computed as one of ceil{(log.sub.2 L)/2}, ceil{√(L)}, and ceil{√(L/2)}. The ceil(p) function returns the smallest integer that is greater than or equal to the value of p. In the present case, if the infusion frequency value (I.sub.freq) is computed as ceil{(log.sub.2 L)/2}, the value of (log.sub.2 L)/2 is rounded up to the closest integer that is greater than or equal to it. Computing the infusion frequency value (I.sub.freq) as one of ceil{(log.sub.2 L)/2}, ceil{√(L)}, and ceil{√(L/2)} ensures that the infusion frequency value (I.sub.freq) is not proportional to the determined number of words (L), i.e., the cleaned length of each sentence. In an example, the infusion frequency value (I.sub.freq) is considered as 1 for each sentence of each text-based document d.sub.k of the plurality of text-based documents D (101).
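The three candidate formulas above can be sketched as follows; this is an illustrative reading, and the function name and `variant` parameter are assumptions not present in the specification:

```python
import math

def infusion_frequency(L, variant="log"):
    # I_freq grows sub-linearly with the cleaned sentence length L,
    # so it is not proportional to L (keeping infusion near-lossless).
    if variant == "log":
        return math.ceil(math.log2(L) / 2)   # ceil{(log2 L)/2}
    if variant == "sqrt":
        return math.ceil(math.sqrt(L))       # ceil{sqrt(L)}
    return math.ceil(math.sqrt(L / 2))       # ceil{sqrt(L/2)}
```

For a cleaned sentence of L=4 words, the default variant yields ceil(2/2) = 1, matching the worked example in paragraph [0024].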
[0023] In one embodiment, the semantic infusion module 104 performs the semantic infusion technique. The purpose of using this technique is to infuse additional meta-data (referred to as anchors) within the clean sentences so that the vector space (as generated by the Word2Vec module 106) can be partitioned into labeled regions. Given a clean sentence of length len of a document d.sub.k with category class C.sub.i and time class T.sub.j, the semantic infusion technique defines the infusion frequency (I.sub.freq), a positive integer, as the count of anchors to be infused in the clean sentence. Equation (1) ensures that I.sub.freq is not proportional to len, which helps in making this technique near-lossless in nature.
I.sub.freq=⌈(log.sub.2(len))/2⌉  equation (1)
[0024] Further, the semantic infusion module 104 determines the I.sub.freq number of non-consecutive random numbers from the set [0, (L−1)]. In an example, if the value of L is 4, the infusion frequency value (I.sub.freq) is computed as 1 using ceil{(log.sub.2 L)/2}. Thereafter, 1 (equal to the I.sub.freq number) random number is determined from the set [0, (4−1)], i.e., [0, 3]. In said example, the random number is determined as one of 0, 1, 2, and 3. In case I.sub.freq is computed as 2, then 2 non-consecutive random numbers are determined from the set [0, (L−1)].
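One way to draw non-consecutive positions, sketched here as an assumption (the specification does not fix a sampling procedure), is to remove each chosen position and its neighbors from the candidate pool:

```python
import random

def non_consecutive_positions(L, i_freq, rng=random):
    # Draw i_freq positions from [0, L-1] such that no two drawn
    # positions are adjacent (non-consecutive).
    positions = []
    candidates = set(range(L))
    while len(positions) < i_freq and candidates:
        p = rng.choice(sorted(candidates))
        positions.append(p)
        # Exclude the chosen position and its immediate neighbors.
        candidates -= {p - 1, p, p + 1}
    return sorted(positions)
```

Any two returned positions differ by at least 2, so the infused anchors never land on adjacent words.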
[0025] Upon determining the I.sub.freq number of non-consecutive random numbers from the set [0, (L−1)], the semantic infusion module 104 generates an infused sentence by inserting a word “A_ class (C.sub.i) _time (T.sub.j)” prior to a word at a position in the respective sentence indicated by each of the determined non-consecutive random numbers, where in “A_C.sub.iT.sub.j”, C.sub.i is the category class and T.sub.j is the time class to which the document d.sub.k belongs. In an example, for a document d.sub.k of length len and belonging to a category class C.sub.i and a time class T.sub.j, an anchor term A_C.sub.iT.sub.j is infused at P random and non-consecutive positions within the document, where P=┌(log.sub.2(len))/2┐.
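The anchor-insertion step above can be sketched as follows; this is an illustrative reading, with assumed function and parameter names:

```python
def infuse(sentence, category, time_class, positions):
    # Insert the anchor token "A_<category><time>" before the word at
    # each chosen position; positions are processed right-to-left so
    # that earlier indices remain valid after each insertion.
    anchor = f"A_{category}{time_class}"
    words = sentence.split()
    for p in sorted(positions, reverse=True):
        words.insert(p, anchor)
    return " ".join(words)
```

With category "Service-Brakes" and positions {1, 6}, the clean sentence "right front wheel locked vehicle spin response anti-lock brakes" becomes "right A_Service-Brakes front wheel locked vehicle spin A_Service-Brakes response anti-lock brakes", matching the worked example at block 208.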
[0026] Further, the word vector module 106 is configured to generate a word vector for each word of each infused sentence of each of the plurality of text-based documents D (101). Each word is replaced with a vector of multiple dimensions, and the vector size of each word is the same. Therefore, the infused sentence, after replacement of each word with a respective vector, is a matrix representing each word as a row and each dimension of the vector as a column. The vectors are chosen such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by the vectors. In an example, the word vector may be generated using an unsupervised algorithm. In an example, the unsupervised algorithm may be based on a Word2Vec technique.
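The cosine similarity referred to above can be written out explicitly; this is a minimal sketch of the similarity measure only, not of Word2Vec training itself:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors; values near 1
    # indicate semantically similar words in the embedding space.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal directions score 0.0, regardless of vector magnitude.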
[0027] Upon generating the word vector for each word of each infused sentence of each of the plurality of text-based documents D (101), the pattern generation module 108 generates a semantic trend. The pattern generation module 108 is configured to generate semantic trends by extracting the trending items from the word vector representation created by the word vector module 106. The pattern generation module 108, for each of the plurality of text-based documents D, extracts the trending items from the word vector representation in a two-step process in which, first, for each pair of c.sub.it.sub.j such that c.sub.i∈C and t.sub.j∈T, the corresponding anchor A_c.sub.it.sub.j is identified. In the second step, the top-k words which are closest to the A_c.sub.it.sub.j in the vector space are extracted from the word vector representation. These words represent the top-k semantic trends for the category class C.sub.i and the time class T.sub.j.
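The two-step extraction above can be sketched as follows, assuming the vocabulary is available as a mapping from word to vector (a hypothetical data layout, not specified in the text):

```python
def top_k_trends(anchor_vec, vocab_vecs, k):
    # Rank vocabulary words by cosine similarity to the anchor vector
    # and return the k closest; these are the top-k semantic trends
    # for the anchor's category/time pair.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    ranked = sorted(vocab_vecs.items(),
                    key=lambda kv: cos(anchor_vec, kv[1]),
                    reverse=True)
    return [word for word, _ in ranked[:k]]
```

For example, with a toy two-dimensional vocabulary, the two words whose vectors point most nearly along the anchor direction are returned first.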
[0028] In one embodiment, the system operates on a plurality of domain-specific documents D={d.sub.k}.sup.N.sub.k=1 in which each document d.sub.k belongs to a category class c.sub.i∈C such that C={c.sub.i}.sup.M.sub.i=1 and a time class t.sub.j∈T such that T={t.sub.j}.sup.L.sub.j=1. The pattern generation module 108 is further configured to detect the top-k trending items for each pair of c.sub.it.sub.j.
[0029] In one embodiment, the system 100 is regularly updated for the new documents in the time class t.sub.j+1 by re-sampling the topic assignments for all documents in a fixed-size sliding window L. In the re-sampling process, θ and ϕ of the model in time class t.sub.j are used as α and β, respectively, for the model in time class t.sub.j+1. A contribution factor c, such that c∈[0, 1], determines the degree of contribution of the learned parameters to the priors of the new model. After all iterations, each time class is assigned a set of topics using θ, and each topic is characterized with a set of words (trending items) using ϕ, which makes the text-based documents contextually understandable.
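The text states that c weights the contribution of the learned parameters to the new priors but does not give the exact mixing rule; one plausible sketch, assuming a simple linear blend, is:

```python
def blend_priors(theta, phi, alpha0, beta0, c):
    # Seed the priors of the model for time class t_{j+1} with the
    # learned parameters (theta, phi) of the model for t_j, weighted
    # by the contribution factor c in [0, 1]. The linear blend here
    # is an assumption; the specification only names the factor c.
    alpha = [c * t + (1 - c) * a for t, a in zip(theta, alpha0)]
    beta = [c * p + (1 - c) * b for p, b in zip(phi, beta0)]
    return alpha, beta
```

With c=0 the new model keeps its base priors unchanged; with c=1 it inherits the previous time class's learned parameters outright.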
[0031] In some examples, processes involved in the method 200 can be executed based on instructions stored in a non-transitory computer-readable medium. The processing module 102 may be communicatively coupled to the non-transitory computer-readable medium so as to fetch and execute computer-readable instructions from the non-transitory computer-readable medium. The non-transitory computer-readable medium may include, for example, digital memories, magnetic storage media, such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media.
[0032] Referring to
[0033] At block 204, the method 200 may include removing at least one of symbols, special characters, and predefined stop words from each sentence of the set of sentences j.sub.dk of each document d.sub.k of the plurality of text-based documents for basic cleansing.
[0034] At block 206, the method 200 may include identifying a category class (C.sub.i) and a time class (T.sub.j), to which each cleaned text-based document d.sub.k of the plurality of text-based documents D belongs, by the processing module 102. For example, if a clean sentence is “right front wheel locked vehicle spin response anti-lock brakes”, the class c.sub.i may be identified as “Service-Brakes”, since the clean sentence relates to the brakes.
[0035] At block 208 of the method 200, a number of words (L) for each sentence in each text-based document d.sub.k, which is the cleaned length of each sentence, is determined by the semantic infusion module 104. Further, based on the determined number of words (L), an infusion frequency value (I.sub.freq) is computed. In an example, the infusion frequency value (I.sub.freq) is computed as one of ceil{(log.sub.2 L)/2}, ceil{√(L)}, and ceil{√(L/2)}. In a specific example, the processing module 102 may assume the infusion frequency value (I.sub.freq) as 1 for each sentence of each text-based document d.sub.k.
[0036] Further, the I.sub.freq number of non-consecutive random numbers are determined from the set [0, (L−1)], and an infused sentence is generated by inserting a word “A_ class (C.sub.i) _time (T.sub.j)” prior to a word at a position in the respective sentence indicated by each of the determined non-consecutive random numbers, where in “A_C.sub.iT.sub.j”, C.sub.i is the category class and T.sub.j is the time class to which the document d.sub.k belongs. For example, a clean sentence “right front wheel locked vehicle spin response anti-lock brakes” of a document of class c.sub.i=Service-Brakes is processed as “right A_Service-Brakes front wheel locked vehicle spin A_Service-Brakes response anti-lock brakes”.
[0037] At block 210, the method 200 may include generating a word vector for each word of each infused sentence of each of the plurality of text-based documents D, by a word vector module 106. Thus, each word is replaced with a vector. Each vector is chosen such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by the vectors and thus the vectors capture the co-occurrence statistics of the words, such that, words that typically co-occur or words that share similar context are closer to each other in a vector space. In an example, the word vector may be generated based on a Word2Vec technique.
[0038] At block 212, the method 200 may include generating semantic trends, by a pattern generation module 108, by extracting the trending items from the word vector representation created for each word of each infused sentence of each of the plurality of text-based documents at block 210. The pattern generation module 108, for each of the plurality of text-based documents D, extracts the trending items from the word vector representation in a two-step process in which, first, for each pair of c.sub.it.sub.j such that c.sub.i∈C and t.sub.j∈T, the corresponding anchor A_c.sub.it.sub.j is identified. In the second step, the top-k words which are closest to the A_c.sub.it.sub.j in the vector space are extracted from the word vector representation. These words represent the top-k semantic trends for the category class C.sub.i and the time class T.sub.j.
[0039] The present subject matter is employed to aid text analytics activities to operate seamlessly by identifying trends for categorical data, using a semantic infusion technique, from the input text-based documents. With this technique, the system 100 overcomes the frequency-based bias associated with traditional trend analysis techniques and detects semantically meaningful trends for a given time-sliced categorical corpus of text-based documents.
[0040] Although aspects for the present disclosure have been described in a language specific to structural features and/or methods, it is to be understood that the appended claims are not limited to the specific features or methods described herein. Rather, the specific features and methods are disclosed as examples of the present disclosure.