Text Analysis System, and Characteristic Evaluation System for Message Exchange Using the Same
20220343067 · 2022-10-27
Inventors
Cpc classification
G06F21/50
PHYSICS
International classification
Abstract
[Problem(s)] To provide a text analysis system that is low cost and able to detect text with a normal expressive or structural feature.
[Solution] A text analysis system 100 according to the present invention includes a text acquisition portion 110 for acquiring text data; a feature extraction portion 120 for converting the text data acquired by the text acquisition portion 110 into a time series signal to extract a feature from the converted time series signal; a feature storage portion 130 for storing the feature extracted by the feature extraction portion 120; and an anomalous text detection portion 140 for detecting anomalous text based on the feature in the feature storage portion 130.
Claims
1-15. (canceled)
16. A text analysis system for analyzing text, the system comprising: acquisition means for acquiring text data; a converter configured to convert characters of the acquired text data into a numerical form to convert the text data into a time series signal; a feature extractor configured to extract feature information from the time series signal to store the extracted feature information, the feature extractor being further configured to extract a feature from a normalized time series signal of text data described by a normal expressive feature, structural feature, or both, and learn the feature to acquire an output waveform that reproduces an input waveform of the time series signal by using the feature; and determination means for determining an identity of text data newly acquired by using the feature information.
17. The text analysis system of claim 16, the system further comprising: a detector configured to detect anomalous text different from the feature information, based on a determination result by the determination means.
18. The text analysis system of claim 16, wherein the converter is configured to convert characters into numerical data based on a predetermined conversion table.
19. The text analysis system of claim 16, wherein the converter is configured to normalize the time series signal to converge them into a range from a minimal value 0 to a maximum value 1.
20. The text analysis system of claim 16, wherein the converter is configured to attenuate a value of the time series signal that is more than a set threshold to normalize the time series signal.
21. The text analysis system of claim 16, wherein the feature extractor is configured to encode the feature information by an auto-encoder.
22. The text analysis system of claim 21, wherein the feature extractor learns the feature information by a neural network.
23. A feature evaluation system for message exchange, the feature evaluation system comprising the text analysis system of claim 17, wherein the detector is configured to detect an anomaly in a transmitting email based on the determination result by the determination means.
24. The feature evaluation system of claim 23, the feature evaluation system further comprising a transmission controller configured to halt transmission of the transmitting email when the anomaly is detected in the transmitting email.
25. The feature evaluation system of claim 24, the feature evaluation system further comprising a notification means for notifying the halt of transmission of the transmitting email when the transmission of the transmitting email is halted by the transmission controller.
26-27. (canceled)
28. A text analysis method, the method comprising the steps of: acquiring text data; converting characters of the acquired text data into a numerical form to convert the text data into a time series signal; extracting feature information from the converted time series signal to store the extracted feature information, wherein extracting the feature information comprises extracting a feature from a normalized time series signal of text data described by a normal expressive feature, a structural feature, or both, and learning the feature to acquire an output waveform that reproduces an input waveform of the time series signal by using the feature; and determining an identity of newly-acquired text data by using the extracted feature information.
29. The text analysis method of claim 28, wherein the step of determining an identity includes identifying a transmitting email described with an anomalous expressive feature and/or structural feature different from the feature information.
30-35. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0027] The following reference numerals can be used in conjunction with the drawings:
[0028] 100: text analysis system
[0029] 110: text acquisition portion
[0030] 120: feature extraction portion
[0031] 130: feature storage portion
[0032] 140: anomalous text detection portion
[0033] 200: outgoing email monitoring system
[0034] 210: outgoing email acquisition portion
[0035] 220: feature extraction portion
[0036] 230: feature storage portion
[0037] 240: anomalous email detection portion
[0038] 250: output control portion
[0039] A text analysis system according to the present invention may be applied to any electronic device having a function to electronically process text (such as a computer device, mail server, client terminal, or smartphone).
[0041] The text analysis system 100 is implemented by software (for example, on a mail server or client terminal), by hardware, or by a combination of software and hardware. The text acquisition portion 110 acquires text data (for example, an electronic mail) composed by a user. Where the text data is an electronic mail, the text acquisition portion 110 acquires, for example, an electronic mail in HTML form composed by mail software installed on a client terminal, an electronic mail sent from a client terminal to a mail server over the Internet, or an electronic mail in a message exchange system.
[0042] The text acquisition portion 110 may acquire text data composed by multiple users. To train the text analysis system 100 in advance, the text data acquired by the text acquisition portion 110 is normal text data, i.e., text composed in the course of a user's normal behavior with a normal expressive feature or structural feature. The feature extraction portion 120 extracts a feature included in such normal text data and learns the feature of the user's text. After the text analysis system 100 has learned, the text acquisition portion 110 acquires arbitrary text data, and the text analysis system 100 identifies whether a feature of that text data corresponds to the feature of text composed with the normal expressive feature or structural feature. For example, for a text attributed to one user, the system identifies whether the text is composed with that user's normal expressive feature or structural feature, or whether the text is composed by another user.
[0044] The character signalizing portion 122 converts a series of characters described in a text into a one-dimensional time series signal. In one preferred example, the character signalizing portion 122 converts each character in the text into numerical data based on Unicode. Unicode is one of the international standards for character codes, in which codes are assigned to the characters, numbers, and symbols of various languages of the world.
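By way of illustration only, the Unicode-based conversion may be sketched as follows. The function name `text_to_signal` is hypothetical and does not appear in the embodiment; the sketch assumes that each character is mapped to its Unicode code point by Python's built-in `ord`.

```python
from typing import List


def text_to_signal(text: str) -> List[int]:
    """Map every character (including spaces) to its Unicode code point.

    Hypothetical helper: one numerical value per character, in reading order.
    """
    return [ord(ch) for ch in text]


signal = text_to_signal("Hi, 世界")
print(signal)  # [72, 105, 44, 32, 19990, 30028]
```

The jump from values near 100 for Latin characters to values near 20000 or more for CJK characters illustrates the wide value range of such a signal.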
[0045] In another example, a conversion table may be prepared in advance in which the relationship between each character, idiom, phrase, or the like and numerical data is uniquely defined. The character signalizing portion 122 may convert each character, idiom, or the like in a text to numerical data by using such a conversion table.
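A conversion-table lookup of this kind may be sketched as follows. The table contents, the numeric codes, the greedy longest-match strategy, and the fallback value for unlisted characters are all assumptions made for illustration; the embodiment does not specify them.

```python
from typing import Dict, List

# Hypothetical table: entries and codes are invented for illustration only.
CONVERSION_TABLE: Dict[str, int] = {
    "best regards": 1001,
    "thank you": 1002,
    " ": 1,
    ",": 2,
}


def convert_with_table(text: str, table: Dict[str, int], unknown: int = 0) -> List[int]:
    """Convert text to numbers by greedy longest-match lookup in the table.

    Characters that match no entry are mapped to `unknown` (an assumption).
    """
    keys = sorted(table, key=len, reverse=True)  # try longer entries first
    signal: List[int] = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                signal.append(table[key])
                i += len(key)
                break
        else:
            # No entry starts at this position: emit the fallback value.
            signal.append(unknown)
            i += 1
    return signal


print(convert_with_table("thank you, best regards", CONVERSION_TABLE))
# [1002, 2, 1, 1001]
```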
[0046] The character signalizing portion 122 converts the characters in a text, from the first to the last, to numerical data. For example, if the text has a size of P row(s)×Q column(s) (where P and Q are any integers), a time series signal including binary value data for the P×Q characters may be generated. Here, a character is a concept that includes characters in natural language, numbers, symbols, figures, and blanks (spaces) containing no character. For example, for a text in horizontal writing, characters may be sequentially scanned from the first line to the last line, from left to right or from right to left. Alternatively, for a text in vertical writing, characters may be sequentially scanned from the first line to the last line, from the top to the bottom or from the bottom to the top. Thus, characters from the first to the last may be converted to numerical data. The scanning direction may be determined arbitrarily. If page information describing the layout of the text data (the number of lines and the number of characters in one line) is required, the page information may be acquired at the same time. Thus, characters from the first to the last may be identified by reference to the page information.
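The scanning order for horizontal writing may be sketched as follows. The function name `scan_text` and the representation of a page as a list of line strings are assumptions made for illustration.

```python
from typing import List


def scan_text(lines: List[str], right_to_left: bool = False) -> List[int]:
    """Flatten a page, first line to last line, into one code-point sequence.

    Each line is scanned left to right by default, or right to left on request.
    """
    signal: List[int] = []
    for line in lines:
        ordered = reversed(line) if right_to_left else line
        signal.extend(ord(ch) for ch in ordered)
    return signal


page = ["ab", "c "]  # P = 2 rows, Q = 2 columns, including one blank
print(scan_text(page))                      # [97, 98, 99, 32]
print(scan_text(page, right_to_left=True))  # [98, 97, 32, 99]
```

In this sketch a 2×2 page yields four values, and the blank is converted like any other character.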
[0047] The time series signal generated by the character signalizing portion 122 may be regarded as a non-periodic waveform composed of the characters in the text. Words or idioms included in the text are expressed as waveform patterns. For example, when a user frequently uses a word or idiom “XX”, a waveform pattern corresponding to “XX” may be included in the time series signal. Alternatively, when the user, as a normal expressive feature or structural feature, composes a text in polite language, uses a lot of punctuation, and/or uses a certain conjunction frequently, a waveform pattern expressing these habits may be included. Such a waveform pattern is one feature for identifying the user.
[0048] The character signalizing portion 122 according to an embodiment herein converts characters into signals based on Unicode or the conversion table. Thus, it may be applied to multiple languages without depending on a particular language. Language differences may be expressed as differences between the waveforms of the time series signals. Further, the character signalizing portion 122 does not perform morphological analysis and/or syntax analysis, so that dictionaries such as a corpus are not required, which reduces cost.
[0049] The signal normalization portion 124 normalizes a time series signal generated by the character signalizing portion 122. When characters are converted into numerical form by Unicode, each numerical value for generating a time series signal is expressed in a discrete value whose range may be extremely large. Thus, the signal normalization portion 124 performs a process for suppressing outliers of the time series signals and a process for normalizing the range.
[0050] By the process for suppressing outliers, a numerical value that is more than a preset threshold value is attenuated. For example, the process is performed by the following equation, where “avg” is an average, “std” is a standard deviation, “x” is a target value (in this case, a numerical value of a time series signal), “rate” is an attenuation rate, and “d” is a coefficient that is multiplied by a numerical value to be added for raising the overall value.
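\[
\text{threshold} = |\text{std} - \text{avg}| \times (1 - d)
\]
\[
x' =
\begin{cases}
\text{avg} + \bigl((x - \text{avg}) \times \text{rate} + |x - \text{avg}| \times d\bigr), & |x - \text{avg}| > \text{threshold} \\
x, & |x - \text{avg}| \le \text{threshold}
\end{cases}
\qquad \text{(Equation 1)}
\]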
[0051] As described above, the threshold value (threshold) is set slightly inside, by an infinitesimal d, a point away from the average by σ (|standard deviation−average value|×(1−d)). That is, because the threshold references the degree of separation from the average value, the target value is likewise divided into cases by its degree of separation from the average value, |x−avg|.
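The outlier suppression of Equation 1 may be sketched as follows. The function name `suppress_outliers`, the default values of `rate` and `d`, and the use of the population standard deviation are assumptions; the embodiment does not give concrete values.

```python
import statistics
from typing import List


def suppress_outliers(signal: List[float], rate: float = 0.1, d: float = 0.01) -> List[float]:
    """Attenuate values whose distance from the average exceeds the threshold.

    Follows Equation 1; `rate` and `d` defaults are illustrative assumptions.
    """
    avg = statistics.mean(signal)
    std = statistics.pstdev(signal)  # population standard deviation (assumption)
    threshold = abs(std - avg) * (1 - d)
    result: List[float] = []
    for x in signal:
        dev = x - avg
        if abs(dev) > threshold:
            # Shrink the deviation by `rate`, then add a small raise of |dev| * d.
            result.append(avg + (dev * rate + abs(dev) * d))
        else:
            result.append(x)
    return result


raw = [100.0, 101.0, 99.0, 100.0, 100.0, 300.0]
print(suppress_outliers(raw))
```

In this example only the last value lies farther from the average than the threshold, so it is pulled toward the average while the other five values pass through unchanged.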
[0052] Then, the process of normalizing the range is performed on the signal for which the process for suppressing outliers has been performed. In the process of normalizing the range, the signal is first normalized so that the standard deviation (std) is 1 and the average (avg) is 0; after that, the signal is normalized again so that the minimum value is 0 and the maximum value is 1, so that the time series signal is converged into the range of 0 to 1.
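The two-stage range normalization may be sketched as follows. The function name `normalize_range`, the use of the population standard deviation, and the handling of a constant signal are assumptions made for illustration.

```python
import statistics
from typing import List


def normalize_range(signal: List[float]) -> List[float]:
    """Standardize to average 0 and standard deviation 1, then rescale to [0, 1]."""
    avg = statistics.mean(signal)
    std = statistics.pstdev(signal)  # population standard deviation (assumption)
    if std == 0:
        # A constant signal has no spread; returning zeros is an assumption.
        return [0.0 for _ in signal]
    standardized = [(x - avg) / std for x in signal]
    lo, hi = min(standardized), max(standardized)
    return [(v - lo) / (hi - lo) for v in standardized]


print(normalize_range([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Because both stages are affine rescalings, the final values are the same as those obtained by min-max scaling the input directly; the intermediate standardization is kept here only to mirror the described order of operations.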
[0054] Now, the signal classification portion 126 is explained. The signal classification portion 126 receives a normalized time series signal from the signal normalization portion 124 and extracts a feature included in the time series signal. The extracted feature is one from which the input can be reproduced. The signal classification portion 126 learns the feature, and it learns only text data that is composed with a normal expressive feature or structural feature. For example, a feature is extracted from the normalized input form as shown in
[0055] In one preferred aspect, the signal classification portion 126 reduces the dimensionality of the feature by an auto-encoder using a neural network, thereby suppressing the amount of information.
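As a minimal sketch only, a linear auto-encoder that compresses fixed-length windows of a normalized signal and learns to reproduce them may look as follows. The class name `TinyAutoEncoder`, the windowing of the signal, the linear activations, the layer sizes, and the learning rate are all assumptions; the embodiment does not disclose the network architecture or the training procedure.

```python
import random
from typing import List


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyAutoEncoder:
    """Linear auto-encoder: n_in -> n_hidden -> n_in, trained by gradient descent."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]

    def reconstruct(self, x: List[float]) -> List[float]:
        """Encode the window to the hidden layer, then decode it back."""
        return matvec(self.dec, matvec(self.enc, x))

    def error(self, x: List[float]) -> float:
        """Mean squared distance between the input and output waveforms."""
        y = self.reconstruct(x)
        return sum((a - b) ** 2 for a, b in zip(y, x)) / len(x)

    def train(self, windows: List[List[float]], epochs: int = 500, lr: float = 0.05) -> None:
        for _ in range(epochs):
            for x in windows:
                h = matvec(self.enc, x)
                y = matvec(self.dec, h)
                err = [a - b for a, b in zip(y, x)]
                # Gradient with respect to the hidden layer, using the old decoder.
                grad_h = [
                    sum(self.dec[i][j] * err[i] for i in range(len(x)))
                    for j in range(len(h))
                ]
                for i in range(len(x)):
                    for j in range(len(h)):
                        self.dec[i][j] -= lr * err[i] * h[j]
                for j in range(len(h)):
                    for i in range(len(x)):
                        self.enc[j][i] -= lr * grad_h[j] * x[i]


# Two hypothetical "normal" windows of a normalized signal.
normal_windows = [[0.1, 0.9, 0.1, 0.9], [0.9, 0.1, 0.9, 0.1]]
model = TinyAutoEncoder(n_in=4, n_hidden=2)
before = sum(model.error(w) for w in normal_windows)
model.train(normal_windows)
after = sum(model.error(w) for w in normal_windows)
print(before, after)
```

The hidden layer is narrower than the input, which is what forces the dimensionality reduction; a practical system would more likely use an established neural network library and nonlinear activations.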
[0056] The signal classification portion 126 also includes a function to inspect the reproducibility of the output waveform. Specifically, the distances between each point in two time series of the input waveform and the output waveform as shown in
[0057] The signal classification portion 126 calculates a threshold value for classifying waveforms. Specifically, evaluation data is evaluated to calculate an identity. The evaluation data is a feature that is extracted from a text (sentence) written with a normal expressive feature and/or structural feature and compressed by the auto-encoder (the feature is expressed as the weights of the auto-encoder, for example, as the coefficients of the equation that each neuron has). Then, the median value and the standard deviation of the identity are obtained, and a threshold value is calculated by the following equation. If the waveforms generally follow a normal distribution, about 95% of the waveforms fall within a range of two standard deviations from the median value.
\[
\text{threshold value} = \text{median value} - \text{standard deviation} \times 2 \qquad \text{(Equation 2)}
\]
[0058] The threshold value is not limited to the above equation. If the waveforms are closer to a normal distribution, threshold value=mean value−standard deviation×2 (2σ) may be employed. When the similarity of waveforms is calculated by another equation, a threshold value may be based on that equation.
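The threshold calculation of Equation 2, together with the mean-based variant, may be sketched as follows. The function name `identity_threshold`, the sample identity values, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics
from typing import List


def identity_threshold(identities: List[float], use_mean: bool = False) -> float:
    """Return center - 2 * standard deviation of the identity values.

    The center is the median (Equation 2) or, optionally, the mean.
    """
    center = statistics.mean(identities) if use_mean else statistics.median(identities)
    return center - 2 * statistics.pstdev(identities)  # population std (assumption)


# Hypothetical identity values obtained from normal text.
identities = [0.90, 0.92, 0.94, 0.96, 0.98]
print(identity_threshold(identities))
print(identity_threshold(identities, use_mean=True))
```

For a symmetric sample such as this one the median and the mean coincide, so both variants give the same threshold; they differ when the identity values are skewed.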
[0060] The feature storage portion 130 stores a feature extracted by the feature extraction portion 120 and its threshold value. Each time text data is learned, the feature and the threshold value are updated.
[0061] After pre-learning by the feature extraction portion 120 is completed, the anomalous text detection portion 140 detects anomalous text by using the result of the pre-learning. That is, an arbitrary text A is obtained by the text acquisition portion 110, and the feature of the text A is then extracted by the feature extraction portion 120. The signal classification portion 126 compares the feature extracted from the text A with a threshold value stored in the feature storage portion 130. When the feature is more than the threshold value, the text A is determined to be anomalous text. The result of the determination is provided to the anomalous text detection portion 140. The anomalous text detection portion 140 detects that the text A determined to be anomalous text is not composed with a normal expressive feature and/or structural feature. For example, the text A is estimated to be a text composed by a user other than the one user, or a text composed by the one user himself or herself with a specific expressive feature and/or structural feature.
[0063] The outgoing email monitoring system 200 includes an outgoing email acquisition portion 210 for acquiring an outgoing mail composed by a user; a feature extraction portion 220 for extracting a feature of the outgoing mail that is acquired by the outgoing email acquisition portion 210; a feature storage portion 230 for storing the extracted feature; an anomalous email detection portion 240 for detecting whether or not the acquired outgoing mail is anomalous; and a transmission control portion 250 for controlling the transmission of the outgoing mail based on the detection result of the anomalous email detection portion 240. These functions may be performed by software in a mail server or client terminal, by hardware, or by a combination of software and hardware.
[0064] The outgoing email acquisition portion 210 acquires an electronic mail in HTML form composed by mail software installed on a client terminal, or acquires an electronic mail for sending that has been uploaded from a client terminal to a mail server.
[0065] The feature extraction portion 220 operates similarly to the feature extraction portion 120 of the above-described text analysis system. For simplicity, it is assumed that the feature extraction portion 220 has learned in advance a feature of electronic mail that is composed by user X with a normal expressive feature and/or structural feature. Accordingly, if an outgoing mail acquired from the outgoing email acquisition portion 210 is composed by user X in the normal manner, the outgoing mail has the same feature as the learned feature. Thus, the outgoing mail is identified as a mail that is composed by user X with a normal expressive feature and/or structural feature. If an outgoing mail is composed by user X with specific expressive and/or structural features or is composed by another user, the outgoing mail does not have the same feature as the learned feature. Thus, the outgoing mail is identified as a mail that is composed by user X with specific expressive and/or structural features or composed by another user. As shown in
[0066] When it is determined that an outgoing mail has no identity, the anomalous email detection portion 240 detects the outgoing mail as anomalous mail and provides the detection result to the transmission control portion 250. When anomalous mail is detected, the transmission control portion 250 instructs, for example, a client terminal or mail server to halt or hold the transmission of the outgoing mail and alerts the user to the non-delivery. For example, the non-delivery is displayed on the display of the client terminal, or voice guidance may be used. When anomalous mail is not detected, the outgoing mail is sent to the client terminal or mail server.
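The control flow of the transmission control portion 250 may be sketched as follows. The names `Decision` and `control_transmission` and the wording of the notice are hypothetical; the actual interface to a client terminal or mail server is not described in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    send: bool    # True when the outgoing mail may be transmitted
    notice: str   # message shown to the user when transmission is halted


def control_transmission(is_anomalous: bool) -> Decision:
    """Halt transmission and prepare a notice when the mail is anomalous."""
    if is_anomalous:
        return Decision(
            send=False,
            notice="Transmission halted: the mail differs from the learned features.",
        )
    return Decision(send=True, notice="")


print(control_transmission(True))
print(control_transmission(False))
```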
[0068] Thus, according to embodiments herein, it is determined whether an outgoing mail is composed with the usual expressive and/or structural features. When the mail is composed by the user with specific expressive and/or structural features or is composed by another user, sending of the outgoing mail is halted. Thus, information leaks caused by unsolicited outgoing mail may be prevented.
[0069] Now, an example of verification of the text analysis system according to an embodiment herein is described. In an experiment, four types of email magazines were used for evaluation. Only one of the four, email magazine A, was learned. It was evaluated whether or not the other three email magazines, which were not learned, are identified as being other than email magazine A (That is, as shown in
[0070] In the experiment, 1000 instances of email magazine A were learned, and 100 instances of each of the other three email magazines were evaluated as to whether or not they are identified as being other than email magazine A.
[0071] In another experiment, emails by three employees were evaluated. Users A and B belonged to a sales department, and user C belonged to a quality management engineering department. In the experiment, emails by user A were learned.
[0072] For emails, if the text is short, the difference is not sufficiently expressed, which lowers accuracy. Also, if the users' types of occupation partially overlap, their expressions are similar. Thus, it is expected that the difference is not sufficiently expressed in such cases.
[0073] While the preferred embodiments are described above in detail, the present invention is not limited thereto. Modifications and/or variations are possible within the scope of the claims.