Text Analysis System, and Characteristic Evaluation System for Message Exchange Using the Same

20230237258 · 2023-07-27


    Abstract

    Aspects of this disclosure provide a device, system, and method for analyzing text. In an embodiment, a system is configured to convert characters of the text into a numerical time series signal. The numerical time series signal includes a time series conversion of the characters in numerical format. The system is further configured to generate a waveform with extracted information from the numerical time series signal. The extracted information has features based on politeness in language, a quantifiable use of punctuations, a quantifiable use of conjunctions, use of idioms, or a combination thereof. The system is additionally configured to determine whether the text is written by a specific user based on an analysis of the waveform against a threshold.

    Claims

    1. A system for analyzing text, the system comprising: a converter circuit configured to convert characters of the text into a numerical time series signal, wherein the numerical time series signal comprises a time series conversion of the characters in numerical format; a feature extractor circuit configured to generate a waveform comprising extracted information from the numerical time series signal, wherein the extracted information comprises features based on politeness in language, a quantifiable use of punctuations, a quantifiable use of conjunctions, use of idioms, or a combination thereof; and a determination circuit configured to determine whether the text is written by a specific user based on an analysis of the waveform against a threshold.

    2. The system of claim 1, wherein the feature extractor circuit is configured to generate the waveform based on a trained neural network, wherein the trained neural network is trained using text previously written by the specific user.

    3. The system of claim 1, wherein the system further comprises: an acquisition circuit configured to acquire the characters of the text; and a detector circuit configured to detect anomalous text different from the extracted information based on a determination result from the determination circuit.

    4. The system of claim 1, wherein the converter circuit is configured to convert the characters of the text into the numerical time series signal based on a predetermined conversion table or using Unicode.

    5. The system of claim 1, wherein the converter circuit is configured to normalize the numerical time series signal to a range between “0” and “1,” inclusive.

    6. The system of claim 1, wherein the converter circuit is configured to attenuate a value of the numerical time series signal that is more than a set threshold to normalize the numerical time series signal.

    7. The system of claim 1, wherein the feature extractor circuit is configured to encode the extracted information using an auto-encoder, and wherein the feature extractor circuit learns to extract the extracted information using a neural network.

    8. The system of claim 1, wherein the system is part of a message exchange system, wherein the text is part of a message communicated using the message exchange system, and wherein the message exchange system is configured to determine whether to communicate the message based on the determining as to whether the message is written by the specific user.

    9. The system of claim 8, wherein the message exchange system is configured to halt transmission of the message when an anomaly is detected based on an analysis of the waveform against the threshold.

    10. The system of claim 8, wherein the message exchange system is configured to generate a notification message in response to halting the transmission.

    11. A method for analyzing text, the method comprising: converting characters of the text into a numerical time series signal, wherein the numerical time series signal comprises a time series conversion of the characters in numerical format; generating a waveform comprising extracted information from the numerical time series signal, wherein the extracted information comprises features based on politeness in language, a quantifiable use of punctuations, a quantifiable use of conjunctions, use of idioms, or a combination thereof; and determining whether the text is written by a specific user based on an analysis of the waveform against a threshold.

    12. The method of claim 11, further comprising converting the characters of the text into the numerical time series signal based on a predetermined conversion table or using Unicode.

    13. A device for analyzing text, the device comprising: a non-transitory memory storage comprising instructions; and a processor in communication with the non-transitory memory storage, the instructions, when executed by the processor, cause the processor to: convert characters of the text into a numerical time series signal, wherein the numerical time series signal comprises a time series conversion of the characters in numerical format, generate a waveform comprising extracted information from the numerical time series signal, wherein the extracted information comprises features based on politeness in language, a quantifiable use of punctuations, a quantifiable use of conjunctions, use of idioms, or a combination thereof, and determine whether the text is written by a specific user based on an analysis of the waveform against a threshold.

    14. The device of claim 13, wherein the instructions, when executed by the processor, cause the processor to convert the characters of the text into the numerical time series signal based on a predetermined conversion table or using Unicode.

    15. The device of claim 13, wherein the extracting of the information is performed by a neural network trained on previous texts generated by the specific user.

    16. The device of claim 13, wherein the instructions, when executed by the processor, cause the processor to normalize the numerical time series signal to a range between “0” and “1,” inclusive.

    17. The device of claim 13, wherein the instructions, when executed by the processor, cause the processor to: detect an anomaly in the text based on a threshold difference in the features of a typical text written by the specific user and the text under analysis, wherein the detection is based on a learning model of a neural network trained using a plurality of previously written text by the specific user; and halt the transmission of the text under analysis based on the detection of an anomaly.

    18. The device of claim 13, wherein the waveform includes a classification of the extracted information, and wherein the analysis of the waveform corresponds to an analysis of the waveform to a threshold set using a plurality of previously written text by the specific user.

    19. The device of claim 13, wherein the threshold is determined based on a second waveform generated using a plurality of previously written text by the specific user.

    20. The device of claim 13, wherein the device is part of a message exchange system, wherein the text is part of a message communicated using the message exchange system, and wherein the message exchange system is configured to determine whether to communicate the message based on the determining as to whether the message is written by the specific user.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0027] For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

    [0028] FIG. 1 is a block diagram illustrating a structure of a text analysis system according to a first embodiment of the present disclosure;

    [0029] FIG. 2 is a block diagram illustrating an internal structure of a feature extraction circuit shown in FIG. 1;

    [0030] FIG. 3 is an example of a part of Unicode;

    [0031] FIG. 4 is an example illustrating that electronic mail is acquired as text data and the time series signals of the electronic mail are normalized;

    [0032] FIG. 5 is a flow chart illustrating an example of an operation of a signal normalization according to an embodiment of the present disclosure;

    [0033] FIG. 6 is a diagram illustrating a feature extraction from an input by a signal classification circuit according to an embodiment of the present disclosure;

    [0034] FIG. 7 is a diagram illustrating an auto-encoder according to an embodiment of the present disclosure;

    [0035] FIG. 8 is a diagram illustrating an example of a classification by a threshold of the signal classification circuit;

    [0036] FIG. 9 is a block diagram illustrating a structure of an outgoing e-mail monitoring system according to a second embodiment of the present disclosure;

    [0037] FIG. 10 is a flow chart illustrating an operation of the outgoing email monitoring system according to a second embodiment of the present disclosure;

    [0038] FIG. 11 is a graph showing one experimental result according to an embodiment of the present disclosure; and

    [0039] FIG. 12 is a graph showing another experimental result according to an embodiment of the present disclosure.

    [0040] Embodiments can be implemented in hardware, software, or in any combination thereof.

    DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

    [0041] The following reference numerals can be used in conjunction with the drawings:

    [0042] 100: text analysis system

    [0043] 110: text acquisition circuit

    [0044] 120: feature extraction circuit

    [0045] 130: feature storage circuit

    [0046] 140: anomalous text detection circuit

    [0047] 200: outgoing email monitoring system

    [0048] 210: outgoing email acquisition circuit

    [0049] 220: feature extraction circuit

    [0050] 230: feature storage circuit

    [0051] 240: anomalous email detection circuit

    [0052] 250: output control circuit

    [0053] A text analysis system according to the present disclosure may be applied to any electronic devices having functions to electronically process text, such as computer devices, mail servers, client terminals, smartphones, and the like.

    [0054] FIG. 1 is a diagram illustrating an example of a structure of a text analysis system according to an embodiment herein. According to the embodiment, a text analysis system 100 includes a text acquisition circuit 110 for acquiring text data, a feature extraction circuit 120 for extracting feature(s) of the text data acquired in the text acquisition circuit 110, a feature storage circuit 130 for storing the feature extracted by the feature extraction circuit 120, and an anomalous text detection circuit 140 for detecting anomalous text based on the feature in the feature extraction circuit 120 or the feature storage circuit 130.

    [0055] In embodiments, the text analysis system 100 is implemented in software on, for example, a mail server or a client terminal, in hardware, or in a combination of software and hardware. The text acquisition circuit 110 acquires text data (for example, electronic mail) written by a user. In the case where the text data is an electronic mail, the circuit acquires, for example, an electronic mail in HTML form written using mailing software loaded in a client terminal, an electronic mail sent from a client terminal to a mail server through the internet, or an electronic mail in a message exchange system.

    [0056] The text acquisition circuit 110 may acquire text data written by multiple users. To provide a learning function to the text analysis system 100 in advance, the text data acquired by the text acquisition circuit 110 is normal text data that is written in accordance with the user's normal behavior, i.e., written with a normal expressive or structural feature.

    [0057] The feature extraction circuit 120 extracts a feature included in normal text data written with the normal expressive or structural feature of the user and learns the feature of the user's text. After learning by the text analysis system 100, the text acquisition circuit 110 acquires arbitrary text data, and the text analysis system 100 identifies whether a feature of the arbitrary text data corresponds to the feature of text written with the normal expressive or structural feature.

    [0058] For example, for a text written by one user, it is identified whether the text is written with the normal expressive feature or structural feature or whether the text is written by another user.

    [0059] FIG. 2 shows an internal structure of the feature extraction circuit 120. The feature extraction circuit 120 includes a character signalizing circuit 122 for receiving text data acquired in the text acquisition circuit 110 and converting the characters described in the text to time series signals, a signal normalization circuit 124 for normalizing the time series signals converted by the character signalizing circuit 122, and a signal classification circuit 126 for classifying the normalized signals.

    [0060] The character signalizing circuit 122 converts a series of characters described in a text into one-dimensional time series signals. In one preferred example, the character signalizing circuit 122 converts each of the characters in the text into numerical data based on Unicode. Unicode is one of the international standards for character code, where codes are assigned to characters, numbers, or symbols of various languages in the world.

    [0061] FIG. 3 shows an example of a part of Unicode. In Unicode, ASCII, Chinese characters, Arabic, and Greek symbols, etc., are coded to binary data of 16 bits or more. The amount of data handled by the character signalizing circuit 122 is therefore the number of bits of the numerical value converted from one character multiplied by the number of characters. The character signalizing circuit 122 may also convert the fixed-length data into one continuous sequence of data or into variable-length data.

    [0062] In another example, a conversion table may be previously prepared in which the relationship between character, idiom, phrase, etc., and numerical data is uniquely defined. The character signalizing circuit 122 may convert each character or idiom etc. in a text to numerical data by using the conversion table.

    [0063] The character signalizing circuit 122 converts characters from the first to the last in a text to numerical data. For example, if the text has the size of P row(s)×Q column(s) (P and Q are any integer number), time series signals including binary value data corresponding to the number of characters in P×Q may be generated.

    [0064] In this case, the character is a concept including characters in natural language, numbers, symbols, figures, and blank (space) without any characters. For example, for a text in horizontal writing, characters may be sequentially scanned from the first line to the last line, from left to right, or from right to left. Alternatively, for text in vertical writing, characters may be sequentially scanned from the first line to the last line, from the top to the bottom or from the bottom to the top. Thus, characters from the first to the last may be converted to numerical data. The scanning direction may be optionally determined. If page information configuring text data (the number of lines, the number of characters in one line) is required, the page information may be acquired at the same time. Thus, characters from the first to the last may be identified in reference to the page information.
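    The character-to-signal conversion described above can be sketched as follows. This is a minimal illustration using Python's built-in `ord`, which returns a character's Unicode code point; the function name is ours, not the patent's.

```python
def text_to_time_series(text: str) -> list[int]:
    """Convert each character (letters, digits, punctuation, and spaces
    alike) to its Unicode code point, scanned in reading order."""
    return [ord(ch) for ch in text]

print(text_to_time_series("Hi!"))  # [72, 105, 33]
```

    Because every character, including blanks, maps to a number, the whole text becomes one one-dimensional time series whose shape reflects the writer's wording.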

    [0065] The time series signals generated by the character signalizing circuit 122 may be regarded as a non-periodic waveform composed of the characters in the text. Words or idioms included in the text are expressed as waveform patterns. For example, when a user uses a word or idiom “XX” frequently, a waveform pattern corresponding to “XX” may be included in the time series signals. Alternatively, when the user writes a text in polite language, or frequently uses punctuation or certain conjunctions as a normal expressive or structural feature, a waveform pattern expressing these may be included. Such a waveform pattern is one feature for identifying the user.

    [0066] The character signalizing circuit 122, according to an embodiment herein, converts characters into signals based on Unicode or the conversion table. Thus, it may be applied to multiple languages without depending on any particular language. Language differences may be expressed as differences in the waveforms of the time series signals. Further, the character signalizing circuit 122 does not perform morphological analysis or syntax analysis, so dictionaries or corpora are not required, which reduces cost.

    [0067] The signal normalization circuit 124 normalizes the time series signal generated by the character signalizing circuit 122. When characters are converted into numerical form by Unicode, each numerical value of the time series signal is expressed as a discrete value whose range may be extremely large. Thus, the signal normalization circuit 124 performs a process for suppressing outliers of the time series signal and a process for normalizing its range.

    [0068] In the process of suppressing outliers, a numerical value that is more than a preset threshold value is attenuated. For example, the process is performed by the following equation, where “avg” is the average, “std” is the standard deviation, “x” is a target value (here, a numerical value of the time series signal), “rate” is an attenuation rate, and “d” is a coefficient that is multiplied by a numerical value to be added for raising the overall value.


    threshold = |std − avg| × (1 − d)

    x′ = avg + ((x − avg) × rate + (|x − avg| × d)), if |x − avg| > threshold

    x′ = x, if |x − avg| ≤ threshold   Equation 1

    [0069] The threshold value is thus set slightly inside, by the infinitesimal d, the point that is σ away from the average, as described above (|standard deviation − average value| × (1 − d)). That is, since the distance from the average value is referenced, the target value is also divided into cases by its distance from the average value, |x − avg|.

    [0070] Then, the process of normalizing the range is performed on the signal whose outliers have been suppressed. In this process, the signal is first standardized so that the standard deviation (std) is 1 and the average (avg) is 0; the minimum value is then normalized to 0 and the maximum value to 1, so that the time series signal converges into the range of 0 to 1.

    [0071] FIG. 4 shows an example of normalization, where the characters of the body of an electronic mail are converted to time series signals when the electronic mail is received as text data, and the time series signals are normalized so as to converge into the range of 0 to 1.

    [0072] FIG. 5 shows a flow chart of an example of an operation of the signal normalization circuit 124 according to an embodiment herein. First, at step S100, each character in an acquired text is converted into numerical form by the character signalizing circuit 122, for example, based on Unicode. Then, at step S102, the numerical values of the time series signal are multiplied by an integer by the signal normalization circuit 124 to stretch the waveform; depending on the language, numerical values may lie adjacent to one another, and this process corrects for that.

    [0073] Then, at step S104, the process for suppressing outliers described above is performed by the signal normalization circuit 124; numerical values more than the threshold value are attenuated. The attenuation may be performed multiple times, at step S106, and the number of attenuations may be adjusted according to the data. Then, at step S108, the variance and the average are normalized by the signal normalization circuit 124, after which the minimum value is normalized to “0” and the maximum value to “1.” Unless the variance falls below a certain threshold value, the processes of steps S104-S108 are repeated. An upper limit may be set on the number of repetitions.
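    A minimal Python sketch of Equation 1 and the S104-S108 loop follows. The values of `rate`, `d`, `var_threshold`, and `max_iters` are illustrative assumptions, not parameters given in the patent.

```python
import statistics

def suppress_outliers(xs, rate=0.5, d=0.01):
    """One pass of Equation 1: attenuate values far from the average."""
    avg = statistics.mean(xs)
    std = statistics.pstdev(xs)
    threshold = abs(std - avg) * (1 - d)
    return [avg + (x - avg) * rate + abs(x - avg) * d
            if abs(x - avg) > threshold else x
            for x in xs]

def normalize_signal(xs, var_threshold=1.0, max_iters=10):
    """S104-S108: repeat outlier attenuation while the variance stays
    large (with an upper limit on repetitions), then standardize and
    min-max scale into [0, 1]."""
    for _ in range(max_iters):
        xs = suppress_outliers(xs)
        if statistics.pvariance(xs) < var_threshold:
            break
    avg = statistics.mean(xs)
    std = statistics.pstdev(xs) or 1.0
    xs = [(x - avg) / std for x in xs]      # avg -> 0, std -> 1
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in xs]    # min -> 0, max -> 1
```

    Applied to a signal of Unicode code points, this yields a waveform confined to the 0-1 range, as in FIG. 4.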

    [0074] Now, the signal classification circuit 126 is explained. The signal classification circuit 126 receives the normalized time series signal from the signal normalization circuit 124 and extracts a feature included in the time series signal. The extracted feature is one from which the input can be reproduced, and the signal classification circuit 126 learns this feature. The signal classification circuit 126 learns only text data that includes a normal expressive or structural feature. For example, a feature is extracted from the normalized input as shown in FIG. 6, and the feature is learned so that output waveforms that almost reproduce the input waveforms can be acquired by using the extracted feature.

    [0075] In one preferred aspect, the signal classification circuit 126 reduces the dimensionality of the feature by an auto-encoder using a neural network, suppressing the amount of information. FIG. 7 shows the concept of the auto-encoder using a neural network. In a preferred aspect, the auto-encoder is configured with fully connected layers only and includes four encoder layers and four decoder layers. The width of each layer of the neural network varies according to the length of the signal converted from the character string. The encoder removes unneeded dimensions of the input to compress the feature, and the decoder reproduces the input from the compressed feature. The neural network adjusts the respective weights of the encoder and the decoder through learning. In this example, the neural network reproduces the input with a symmetrical configuration. In embodiments, the input has a fixed length.
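    The symmetrical encoder/decoder structure can be sketched as below. This assumes NumPy; the layer widths and the `tanh` activation are illustrative choices, and the untrained random weights only demonstrate the shape of the network, not its learned behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_autoencoder(n_in, widths=(64, 32, 16, 8)):
    """Symmetric fully connected auto-encoder: four encoder layers
    narrowing to the compressed feature, mirrored by four decoder
    layers widening back to the input size."""
    sizes = [n_in, *widths]              # encoder: n_in -> ... -> 8
    sizes += list(reversed(sizes[:-1]))  # decoder mirrors back to n_in
    return [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]

def forward(weights, x):
    h = x
    for w in weights[:-1]:
        h = np.tanh(h @ w)               # hidden activations
    return h @ weights[-1]               # linear reconstruction layer

x = rng.normal(size=(1, 128))            # a fixed-length input signal
x_hat = forward(make_autoencoder(128), x)
print(x_hat.shape)  # (1, 128) -- reconstruction has the input's shape
```

    Training would adjust the weights so that `x_hat` closely reproduces `x` for texts with the user's normal features.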

    [0076] The signal classification circuit 126 also includes a function to inspect the reproducibility of the output waveform. Specifically, the distances between each pair of points in the two time series of the input waveform and the output waveform, as shown in FIG. 6, are compared in a round-robin manner to find the path with the shortest total distance between the two time series. The length of this path is regarded as the DTW (Dynamic Time Warping) distance. While the reproduced waveform has some deviations, this inspection is robust against phase shifts and the like. After the learning model is defined, the DTW distance is used to measure the reproducibility of new data. Here, new data is new text data whose uniqueness is to be determined by the text analysis system 100. The new text data is processed by the auto-encoder. When the DTW distance between the input and output waveforms is more than a threshold value (described later), the reproducibility is low and the text data is determined to be unique data (having no normal expressive or structural feature). The determination result is provided to the anomalous text detection circuit 140.
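    The round-robin shortest-path comparison corresponds to the standard dynamic-programming formulation of DTW, sketched here for one-dimensional sequences:

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW: D[i][j] holds the cheapest
    alignment cost of the first i points of `a` with the first j
    points of `b`; each step may advance either series or both."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a point of a
                                 D[i][j - 1],      # skip a point of b
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

print(dtw_distance([0, 1, 2], [0, 1, 2]))     # 0.0
print(dtw_distance([0, 0, 1, 2], [0, 1, 2]))  # 0.0 -- tolerates a time shift
```

    The second call shows why the measure is robust to phase shifts: the repeated leading point aligns at no extra cost, unlike a point-by-point Euclidean comparison.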

    [0077] The signal classification circuit 126 calculates a threshold value for classifying waveforms. Specifically, evaluation data, i.e., a feature that is extracted from a text (sentence) written with a normal expressive or structural feature and compressed by the auto-encoder (expressed as the weights of the auto-encoder, for example, as the coefficients of the equation held by each neuron), is evaluated to calculate identity. The median value and the standard deviation of the identity are then obtained, and a threshold value is calculated by the following equation. The threshold value reflects that almost 95% of waveforms fall within two standard deviations of the median if the waveforms approximately follow a normal distribution.


    threshold value=median value−standard deviation×2   Equation 2

    [0078] The threshold value is not limited to the above equation. If the waveforms are close to a normal distribution, threshold value = mean value − standard deviation × 2 (2σ) may be employed. When the similarity of waveforms is calculated by another equation, the threshold value may be based on that equation.
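    Equation 2 amounts to a one-line computation over the identity scores of known-normal texts. The example scores below are hypothetical, chosen only to show the arithmetic:

```python
import statistics

def classification_threshold(identity_scores):
    """Equation 2: threshold = median - standard deviation x 2,
    computed over identity scores of texts written with the user's
    normal expressive or structural features."""
    return (statistics.median(identity_scores)
            - 2 * statistics.pstdev(identity_scores))

# Identity scores clustered near 1.0 give a threshold just below the
# cluster; a new text scoring under it would be classified as not
# having the learned identity.
scores = [0.95, 0.97, 0.96, 0.94, 0.98]
print(round(classification_threshold(scores), 3))  # 0.932
```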

    [0079] FIG. 8 shows an example of a classification according to a threshold value. In the graph of FIG. 8, dashed lines are one user's texts that have already been learned, and solid lines are another user's texts. In this example, the threshold value of the feature is 5.8. A text whose feature exceeds this value is detected as another user's text.

    [0080] The feature storage circuit 130 stores a feature extracted by the feature extraction circuit 120 and its threshold value. Each time text data is learned, the feature and the threshold value are updated.

    [0081] After pre-learning by the feature extraction circuit 120 is completed, the anomalous text detection circuit 140 detects anomalous text by using the result of the pre-learning. That is, an arbitrary text A is obtained by the text acquisition circuit 110, and the feature of text A is extracted by the feature extraction circuit 120. The signal classification circuit 126 compares the feature extracted from text A with the threshold value stored in the feature storage circuit 130. When the feature is more than the threshold value, text A is determined to be anomalous text. The result of the determination is provided to the anomalous text detection circuit 140, which detects that text A, determined to be anomalous, is not written with a normal expressive or structural feature. For example, text A is estimated to be a text written by a user other than the one user, or a text written by the one user himself with a specific expressive or structural feature.

    [0082] FIG. 9 shows an application example of the text analysis system according to an embodiment herein: an outgoing email monitoring system. An outgoing email monitoring system 200 may be implemented, for example, in a mail server or in a client terminal (computer device, mobile device, etc.) with a mail sending/receiving function.

    [0083] The outgoing email monitoring system 200 includes an outgoing email acquisition circuit 210 for acquiring outgoing mail written by a user; a feature extraction circuit 220 for extracting a feature of the outgoing mail acquired by the outgoing email acquisition circuit 210; a feature storage circuit 230 for storing the extracted feature; an anomalous email detection circuit 240 for detecting whether or not the acquired outgoing mail is anomalous; and a transmission control circuit 250 for controlling the transmission of the outgoing mail based on the detection result of the anomalous email detection circuit 240. These functions may be performed by software in a mail server or client terminal, by hardware, or by a combination of software and hardware.

    [0084] The outgoing email acquisition circuit 210 acquires an electronic mail in HTML form written using mailing software installed on a client terminal, or acquires an outgoing electronic mail uploaded from a client terminal to the mail server.

    [0085] The feature extraction circuit 220 operates similarly to the feature extraction circuit 120 of the above-described text analysis system. For simplicity's sake, assume the feature extraction circuit 220 has previously learned a feature of electronic mail written by user X with a normal expressive or structural feature. Accordingly, if an outgoing email acquired by the outgoing email acquisition circuit 210 is written by user X, the outgoing mail has the same feature as the learned feature, and it is identified as mail written by user X with a normal expressive or structural feature. If an outgoing mail is written by user X with specific expressive or structural features, or written by another user, the outgoing mail does not have the same feature as the learned feature, and it is identified as mail written by user X with specific expressive or structural features or written by another user. As shown in FIG. 8, whether or not the electronic mail has identity is determined based on the threshold value.

    [0086] When it is determined that an outgoing mail has no identity, the anomalous email detection circuit 240 detects the outgoing mail as anomalous mail and provides the detection result to the transmission control circuit 250. When anomalous mail is detected, the transmission control circuit 250 instructs, for example, a client terminal or mail server to halt or hold the transmission of the outgoing mail and alerts the user to the non-delivery. For example, the non-delivery may be shown on the display of the client terminal, or voice guidance may be used. When anomalous mail is not detected, the outgoing mail is sent by the client terminal or mail server.

    [0087] FIG. 10 is a flow chart for explaining an example of an operation of the outgoing email monitoring system. First, outgoing mail is acquired by the outgoing email acquisition circuit 210 (S200). Then, each character of the body of the outgoing mail is converted into signals by the feature extraction circuit 220 to generate a one-dimensional time series signal (S202). The time series signal is normalized (S206). Then, a feature is extracted from the time series signals. Then, whether or not there is any identity between the extracted feature and the learned feature is determined (S208). When there is an identity, the outgoing mail is determined as the one that is written with the user's usual expressive or structural features (S210). Then, the outgoing mail is sent to a sending address (S212). When it is determined that there is no identity, the outgoing mail is determined as the one that is written by the user with specific expressive or structural features or written by another user (S220). Sending the outgoing mail is halted (S222).
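    The S200-S222 flow can be condensed into a hypothetical end-to-end sketch. All names here are illustrative, not from the patent: `reconstruct` stands in for the trained auto-encoder, `dtw_distance` for the reproducibility inspection, and `normalize` is a simplified stand-in for the signal normalization circuit.

```python
def normalize(xs):
    """Simplified min-max stand-in for the normalization circuit."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1
    return [(x - lo) / span for x in xs]

def monitor(body, reconstruct, dtw_distance, threshold):
    # S202/S206: signalize the mail body and normalize the signal
    signal = normalize([ord(ch) for ch in body])
    # S208: check identity via reconstruction error
    deviation = dtw_distance(signal, reconstruct(signal))
    if deviation <= threshold:
        return "send"   # S210/S212: usual features -> transmit
    return "halt"       # S220/S222: anomalous -> halt transmission

# With a perfect reconstructor and a zero-distance metric, the mail
# is judged to have the user's usual features and is sent:
print(monitor("Hello", lambda s: s, lambda a, b: 0.0, threshold=1.0))  # send
```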

    [0088] Thus, according to embodiments herein, it is determined whether outgoing mail is written with the user's usual expressive or structural features. When the mail is written by the user with specific expressive or structural features, or written by another user, sending of the outgoing mail is halted. Thus, information leaks through unauthorized outgoing mail may be prevented.

    [0089] Now, an example of verification of the text analysis system according to an embodiment herein is described. In an experiment, four types of email magazines were used for evaluation. Only one of the four, email magazine A, was learned. It was evaluated whether the other three email magazines, which were not learned, are identified as other than email magazine A (that is, as shown in FIG. 8, whether email magazines whose feature exceeds the threshold value can be detected as written with specific expressive or structural features). These four email magazines were also translated into eight languages to verify the accuracy when the language differs.

    [0090] In the experiment, 1000 issues of email magazine A were learned, and 100 issues of each of the other three were evaluated as to whether they are identified as other than email magazine A. FIG. 11 shows the probability that the email magazines in each language are identified as other than email magazine A. As seen in FIG. 11, email magazines B and C were identified with good probability, while email magazine D shows some scatter between languages. This is caused by the differences in the features of each language. For example, Japanese has 50 + 50 characters plus lowercase characters and Chinese characters, English has 26 characters plus their lowercase characters, Chinese and Taiwanese have 87,000 characters (Unicode 11), French has 26 characters plus lowercase characters plus 7 characters, Hindi has 156 characters plus lowercase characters, Korean has 11,172 characters, and Finnish has 29 characters plus lowercase characters. Depending on the language, the length of one sentence differs, or the amplitude when converting into signals differs. Thus, the accuracy may ultimately be improved by optimal normalization.

    [0091] In another experiment, emails from three employees were evaluated. Users A and B were with a sales department and user C was with a quality management engineering department. In the experiment, emails by user A were learned. FIG. 12 is a graph showing the rates at which users B and C are identified as other than user A. The rate at which user A was identified as another user (mail written with specific expressive or structural features) is 5.95%. The rates at which users B and C were identified as other than user A (mail written with specific expressive or structural features) are 62.00% and 51.00%, respectively.

    [0092] For emails, if the text is short, the differences are not sufficiently expressed, which causes low accuracy. Also, if the types of occupation partially overlap, the expressions are similar, so it is expected that the differences are not sufficiently expressed.

    [0093] While the preferred embodiments are described above in detail, the present disclosure is not limited thereto. Modifications or variations are possible within the scope of the claims.