Hybrid comparison for unicode text strings consisting primarily of ASCII characters
10789416 ยท 2020-09-29
Assignee
Inventors
Cpc classification
H03M7/02
ELECTRICITY
International classification
H03M7/02
ELECTRICITY
Abstract
A method compares text strings having Unicode encoding. The method receives a first string S=s.sub.1s.sub.2 . . . s.sub.n and a second string T=t.sub.1t.sub.2 . . . t.sub.m, where s.sub.1, s.sub.2, . . . , s.sub.n and t.sub.1, t.sub.2, . . . , t.sub.m are Unicode characters. The method computes a first string weight for the first string S according to a weight function . When S consists of ASCII characters, (S)=S. When S consists of ASCII characters and some accented ASCII characters that are replaceable by ASCII characters, (S)=g(s.sub.1)g(s.sub.2) . . . g(s.sub.n), where g(s.sub.i)=s.sub.i when s.sub.i is an ASCII character and g(s.sub.i)=s.sub.i when s.sub.i is an accented ASCII character that is replaceable by the corresponding ASCII character s.sub.i. The method also computes a second string weight for the second text string T. Equality of the strings is tested using the string weights.
Claims
1. A method of comparing text strings having Unicode encoding, comprising: at a computer having one or more processors, and memory storing one or more programs configured for execution by the one or more processors: receiving a first text string S=s.sub.1s.sub.2 . . . s.sub.n having Unicode encoding and a second text string T=t.sub.1t.sub.2 . . . t.sub.m having Unicode encoding, wherein n and m are positive integers, and s.sub.1, s.sub.2, . . . , s.sub.n and t.sub.1, t.sub.2, . . . , t.sub.m are Unicode characters; computing, for the first text string S, a first string weight (S) according to a weight function , computed according to: when it is determined that S consists entirely of ASCII characters, (S)=S; and when it is determined that S consists of ASCII characters and one or more accented ASCII characters that are replaceable by corresponding ASCII characters, (S)=g(s.sub.1)g(s.sub.2) . . . g(s.sub.n), wherein g(s.sub.i)=s.sub.i when s.sub.i is an ASCII character and g(s.sub.i)=s.sub.i when s.sub.i is an accented ASCII character that is replaceable by the corresponding ASCII character s.sub.i; computing, a second string weight (T), for the second text string T, according to the weight function ; and determining whether the first text string and the second text string are equal by comparing the first string weight to the second string weight.
2. The method of claim 1, wherein comparing the first text string S to the second text string T uses a hash function h, and the first text string S is determined to be not equal to the second text string T when h((S))h((T)).
3. The method of claim 1, wherein computing the first string weight comprises: subdividing the first text string S into one or more r-byte contiguous blocks, wherein r is a predefined integer greater than or equal to 4; and processing the r-byte blocks sequentially beginning on the left, including applying a first bitwise operator to each r-byte block to determine whether the respective block consists of ASCII characters.
4. The method of claim 3, wherein comparison is designated as case-insensitive, and computing the first string weight further comprises applying a sequence of bitwise operators to convert upper-case ASCII characters to corresponding lower-case ASCII characters.
5. The method of claim 3, wherein r=8.
6. The method of claim 3, further comprising padding the first text string on the right so that a total length of the first text string is an integer multiple of r.
7. The method of claim 6, wherein padding the first text string comprises adding ASCII null characters.
8. A computing device, comprising: one or more processors; memory; and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for: receiving a first text string S=s.sub.1s.sub.2 . . . s.sub.n having Unicode encoding and a second text string T=t.sub.1t.sub.2 . . . t.sub.m having Unicode encoding, wherein n and m are positive integers, and s.sub.1, s.sub.2, . . . , s.sub.n and t.sub.1, t.sub.2, . . . , t.sub.m are Unicode characters; computing, for the first text string S, a first string weight (S) according to a weight function , computed according to: when it is determined that S consists entirely of ASCII characters, (S)=S; and when it is determined that S consists of ASCII characters and one or more accented ASCII characters that are replaceable by corresponding ASCII characters, (S)=g(s.sub.1)g(s.sub.2) . . . g(s.sub.n), wherein g(s.sub.i)=s.sub.i when s.sub.i is an ASCII character and g(s.sub.i)=s.sub.i when s.sub.i is an accented ASCII character that is replaceable by the corresponding ASCII character s.sub.i; computing, a second string weight (T), for the second text string T, according to the weight function ; and determining whether the first text string and the second text string are equal by comparing the first string weight to the second string weight.
9. The computing device of claim 8, wherein comparing the first text string S to the second text string T uses a hash function h, and the first text string S is determined to be not equal to the second text string T when h((S))h((T)).
10. The computing device of claim 8, wherein computing the first string weight comprises: subdividing the first text string S into one or more r-byte contiguous blocks, wherein r is a predefined integer greater than or equal to 4; and processing the r-byte blocks sequentially beginning on the left, including applying a first bitwise operator to each r-byte block to determine whether the respective block consists of ASCII characters.
11. The computing device of claim 10, wherein comparison is designated as case-insensitive, and computing the first string weight further comprises applying a sequence of bitwise operators to convert upper-case ASCII characters to corresponding lower-case ASCII characters.
12. The computing device of claim 10, wherein r=8.
13. The computing device of claim 10, further comprising padding the first text string on the right so that a total length of the first text string is an integer multiple of r.
14. The computing device of claim 13, wherein padding the first text string comprises adding ASCII null characters.
15. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, the one or more programs comprising instructions for: receiving a first text string S=s.sub.1s.sub.2 . . . s.sub.n having Unicode encoding and a second text string T=t.sub.1t.sub.2 . . . t.sub.m having Unicode encoding, wherein n and m are positive integers, and s.sub.1, s.sub.2, . . . , s.sub.n and t.sub.1, t.sub.2, . . . , t.sub.m are Unicode characters; computing, for the first text string S, a first string weight (S) according to a weight function , computed according to: when it is determined that S consists entirely of ASCII characters, (S)=S; and when it is determined that S consists of ASCII characters and one or more accented ASCII characters that are replaceable by corresponding ASCII characters, (S)=g(s.sub.1)g(s.sub.2) . . . g(s.sub.n), wherein g(s.sub.i)=s.sub.i when s.sub.i is an ASCII character and g(s.sub.i)=s.sup.i when s.sub.i is an accented ASCII character that is replaceable by the corresponding ASCII character s.sub.i; computing, a second string weight (T), for the second text string T, according to the weight function ; and determining whether the first text string and the second text string are equal by comparing the first string weight to the second string weight.
16. The computer readable storage medium of claim 15, wherein comparing the first text string S to the second text string T uses a hash function h, and the first text string S is determined to be not equal to the second text string T when h((S))h((T)).
17. The computer readable storage medium of claim 15, wherein computing the first string weight comprises: subdividing the first text string S into one or more r-byte contiguous blocks, wherein r is a predefined integer greater than or equal to 4; and processing the r-byte blocks sequentially beginning on the left, including applying a first bitwise operator to each r-byte block to determine whether the respective block consists of ASCII characters.
18. The computer readable storage medium of claim 17, wherein comparison is designated as case-insensitive, and computing the first string weight further comprises applying a sequence of bitwise operators to convert upper-case ASCII characters to corresponding lower-case ASCII characters.
19. The computer readable storage medium of claim 17, wherein r=8.
20. The computer readable storage medium of claim 17, further comprising padding the first text string on the right so that a total length of the first text string is an integer multiple of r.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) For a better understanding of the aforementioned systems and methods, as well as additional systems and methods for comparing text strings, reference should be made to the Description of Implementations below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
DESCRIPTION OF IMPLEMENTATIONS
(13) Reference will now be made to implementations, examples of which are illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide an understanding of the various described implementations. However, it will be apparent to one of ordinary skill in the art that the various described implementations may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the implementations.
(14)
(15) In some cases, the personal device 102 connects over one or more communications networks 108 to one or more external database servers 106 and/or a data visualization server 104. The communication networks 108 may include local area networks and/or wide area networks, such as the Internet. In some implementations, the data visualization server 104 provides a data visualization web application that runs within a web browser 220 on the personal device 102. In some implementations, data visualization functionality is provided by a local application 222 with certain functions provided by the data visualization server 104. For example, the data visualization server 104 may be used for resource intensive operations. In some implementations, the one or more database servers 106 include a database engine 120, which provides access to one or more databases 122 that are stored at the database server 106. As illustrated in
(16)
(17) The computing device 200 may also include data and executable modules to compare and collate text strings. The text comparison module 272 is used for determining whether two text strings are considered equal. Note that text comparison can occur in multiple phases. For example, to implement a hash join between two database tables, data values for one side of the join are processed and hashed, and these values are later compared to hash values computed for the second side of the join. Because text comparisons can be performed according to various text processing parameters (e.g., accent sensitive/insensitive and case sensitive/insensitive), two text strings can be classified as equal without being identical. In some implementations, the text comparison module 272 uses an ASCII substitution table 260, as illustrated below in
(18) In some implementations, the computing device 200 includes a text collation module 274. Given a pair of text strings S.sub.1 and S.sub.2, the text collation module determines whether S.sub.1<S.sub.2, S.sub.1=S.sub.2, or S.sub.1>S.sub.2, as illustrated in
(19) Each of the above identified executable modules, applications, or sets of procedures may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise rearranged in various implementations. In some implementations, the memory 214 stores a subset of the modules and data structures identified above. Furthermore, in some implementations, the memory 214 stores additional modules or data structures not described above.
(20) Although
(21) Typically, comparing Unicode strings involves determining weights of each letter of each string. These weights consist of three parts that are merged and concatenated to form the basis for comparison as well as for hashing. The overhead of converting each letter to its corresponding weight by a table lookup creates substantial inefficiency in contrast to pure ASCII-based implementations that can just rely on the ASCII weights of the letters. This is prohibitively expensive for database operations that deal predominantly with ASCII letters with only a few Unicode symbols interspersed. In particular, database systems typically implement hash-based algorithms for joins, grouping, and aggregation, and a single database query many have many of these types of operations. When a user is expecting query results almost instantaneously, the slow process of looking up Unicode weights for all characters can lead to unhappy users.
(22) Some implementations are based on UTF-8 Unicode encoding. UTF-8 was designed to contain the 7-bit encoded ASCII characters one-to-one. When a string with a long initial ASCII portion is encountered, implementations can take a short-cut from the costly weight-based comparison procedure, operating directly on the Unicode encoding. Some implementations process 8-byte blocks (typically 8 ASCII characters) at a time using arithmetic and bit-wise operations on the 64-bit integer representation of the block. This provides a huge performance gain compared to the indirect comparison method using multiple weight components.
(23) The following example illustrates the slow Unicode process using weight components. Consider the three letter Unicode word L.sub.1L.sub.2L.sub.3. For collation (sorting) operations the word is segmented into the three symbols. The symbols are individually assigned weights, which consist of several (most often three) fragments:
W(L.sub.1)=wp.sub.1ws.sub.1wt.sub.1
W(L.sub.2)=wp.sub.2ws.sub.2wt.sub.2
W(L.sub.3)=wp.sub.3ws.sub.3wt.sub.3
(24) The first (primary) weight component (identified by wp)) constitutes the weight of the base symbol, not differentiating between small and capital letters or accents. The secondary component (identified by ws) determines the weight of the accent, if there is any. The third (tertiary) component (identified by wt) determines whether the letter is capitalized or lower case. The weight for sorting or comparing entire words, such as the three-letter word L.sub.1L.sub.2L.sub.3 is determined by first concatenating the first components, then the second components, and finally the third components of the individual weights. The final weight is therefore W(L.sub.1L.sub.2L.sub.3)=wp.sub.1wp.sub.2wp.sub.3ws.sub.1ws.sub.2ws.sub.3wt.sub.1wt.sub.2wt.sub.3. If the comparison to be performed is case-insensitive, the tertiary components wt.sub.1wt.sub.2wt.sub.3 are omitted from the final weight. If the comparison to be performed is accent-insensitive, the secondary components ws.sub.1ws.sub.2ws.sub.3 are omitted from the final weight. If a comparison is both accent-insensitive and case-insensitive, the final weight consists of just the primary components W(L.sub.1L.sub.2L.sub.3)=wp.sub.1wp.sub.2wp.sub.3.
(25) Consider a concrete example. The first weight components of the four letters a, A, , are all identical (e.g., 0x29). The primary weight components of the letters b and B are also identical (e.g., 0x2A). The second component handles the accents/umlauts. In some implementations, a character with no accent is encoded as 0x05 and an umlaut accent has a weight of 0x4596. Capital and lower-case letters are differentiated in the third weight component. In some implementations, upper case Latin letters have a case weight of 0xDC and lower case Latin letters have a case weight of 0x05.
(26) With this example weight encoding, the sort keys for some symbols are (depending on the collation strengthcase sensitive, accent-sensitivethat is specified):
W(a)=29 05 05
W(b)=2A 05 05
W(B)=2A 05 DC
W()=29 4596 05
W()=29 4596 DC
(27) Therefore, the weight of the word ab is 29 2A 29 05 05 4596 05 05 DC.
(28)
(29) When comparing text strings, a language is specified or known, so the comparison has to perform the comparison according to the rules for that language. For each language, the Unicode Weight table 264 includes language specific weights for certain symbols. Each language typically has a small number of symbols that are different from the default weight table 302, so the differences are stored as overrides. Typically, the overrides are stored as ranges rather than individual overrides. Some implementations store overrides in ranges of 32 characters. Other implementations store larger or smaller ranges.
(30) In some implementations, the override table entries (e.g., an entry in the Language A range 304A) do not always contain the three component weights of the corresponding symbol because some Unicode symbols are assigned multiple different weights. In this case, the entry is a pointer to a separate list of weights or a pointer to a small trie data structure that can be used to determine the proper weight.
(31) Performing comparison operations using the Unicode Weight table 264 is a resource intensive process, especially when millions of text strings are involved. This process is particularly excessive when the text strings contain mostly ASCII characters (e.g., for the text operations performed by a database engine 120). The disclosed processes employ a hybrid approach, which is very fast for the most common cases of pure ASCII text, and falls back to the slower process only when necessary.
(32)
(33) Each block is evaluated to determine if it consists of just simple ASCII characters. In some implementations, this is a two-part test. First, a test is performed to determine if the characters are just ASCII characters (i.e., in the range 0x00 to 0x7F). If the characters are in this range, a second test is performed to determine if there are any ASCII control characters (i.e., 0x01 to 0x1F and 0x7F). Some implementations use the routine isSimpleASCIIBlock( ) illustrated in
(34) Some implementations separately use the functions blockIsASCII( ) in
(35) In accordance with Unicode comparison standards, ASCII control characters are typically ignored when comparing character strings. For example, a text string that has an embedded Bel character (0x07) is considered equal to the text string with the Bel character removed. Therefore, some implementations use a function to strip out the ASCII control characters from each text string. In some implementations, the function asciiBlockIsSimple( ) or the function isSimpleASCIIBlock( ) is enhanced to remove these ASCII control characters.
(36) In the example shown in
(37) In some implementations, the routine asciiBlockToLower( ) 1000 is called to convert upper case letters to lower case (or vice versa). This routine is illustrated in
(38) If at least one block contains at least one non-ASCII character, additional steps determine if the non-ASCII character can be replaced with an equivalent ASCII character. This is illustrated in
(39) Note that the overall weight 420 is used for testing equality of strings, not for sorting strings. In particular, the ASCII codes used to form the overall weight are not assigned in a lexicographic collation order. For example, the ASCII code for the letter a is 97, which is greater than the ASCII code for the letter Z, which is 90. Techniques for faster sorting of strings with Unicode encoding are illustrated in
(40) If a block contains a non-ASCII character, it might still be possible to remedy the situation. This is illustrated in
(41) As shown in
(42) An ASCII Substitution table 260 is used to see if the non-ASCII character 508 can be replaced by an equivalent ASCII character. The ASCII Substitution table 260 depends on both the designated language for the comparison as well as the strength parameters 270 (e.g., whether the comparison will be case-insensitive and/or accent-insensitive). In this illustration of
(43) Replacement of non-ASCII characters with equivalent ASCII characters introduces at least two complexities when processing the characters in blocks. First, a non-ASCII character consists of two or more bytes, but the replacement ASCII character consists of only a single byte. In some implementations, the shortened block is used directly in the formation of the overall string weight 520. Some implementations fill in using one or more additional bytes from the end portion 504 of the text string 502 beyond the current block (e.g., shifting left). When shifting in additional bytes, the block has to be rechecked (510) to confirm that it is an ASCII block. Because of substitution and the need for shifting, some implementations identify the blocks one at a time dynamically rather than building all of the blocks at the outset and rebuilding the blocks later after a shift.
(44) A second complexity is that an ASCII character may span more than a single block. For example, the first byte of the non-ASCII character is 0xC3, which may be the 8th byte in a block. It is recognized as the beginning of a non-ASCII character, and the first byte indicates that there is one more byte to complete the character. The second byte 0xA7 is the next byte beyond the current block, so this byte is read in order to form the complete character 0xC3A7 (i.e., the letter ). As illustrated above, this character is looked up (512) in the ASCII substitution table 260, and the ASCII character c replaces (514) it. In this case, the replacement fills out one full block (e.g., an 8-byte block here), so it can be copied (516) directly to the overall weight 520. In this scenario, the shifting occurs when processing the subsequent blocks from the end portion 504.
(45) Because of shifting, some implementations wait to insert padding until the final block is being processed.
(46)
(47) When the non-ASCII character 608 is identified (610) in the third block 606-3, a first test is performed to determine whether the non-ASCII character can be replaced by an equivalent ASCII character, as illustrated in
(48) As soon as a non-replaceable non-ASCII character 608 is encountered in a text string 602, the process reverts to the slower Unicode process. In some implementations, if the non-ASCII character 608 is not the first character in a block, the portion of the block before the non-ASCII character is included in the ASCII prefix 622 of the overall weight 620. In other implementations, once a block with a non-replaceable non-ASCII character is encountered, the slow Unicode process begins with the first character of the block. Once the slow Unicode process is initiated for a text string 602, it is performed for the remainder of the string. That is, the process discontinues (611) the ASCII-based block weight calculation, and performs the slower Unicode weight calculation holistically for the third block 606-3 together with any additional characters 606-4 in the text string 602 beyond the third block 606-3
(49) Here, assume that the remainder, starting with the third block 606-3 (and including any additional characters 606-4), has k characters, which is generally a mixture of ASCII characters (each character consisting of a single byte) and one or more Unicode characters (each consisting of two or more bytes). The remainder is used (612) by the Unicode process. If the characters are labeled as L.sub.1, L.sub.2, . . . , L.sub.k, their weights are retrieved (614) by looking them up (613) in the Unicode weight table 264. As discussed above with respect to
(50)
where W is the weight for a given character, wp is the primary weight (the base symbol), wa is the accent weight (also known as the secondary weight), and wc is the case weight (also known as tertiary weight, which has differing values depending on whether the character is upper case or lower case. The weights are concatenated (616) using the primary weights first, the accent weights second, and finally the case weights, to form the Unicode weight suffix 624. Here, the Unicode weight suffix 624 is:
W(L.sub.1L.sub.2 . . . L.sub.k)=wp.sub.1wp.sub.2 . . . wp.sub.kwa.sub.1wa.sub.2 . . . wa.sub.kwc.sub.1wc.sub.2wc.sub.k
(51) In most cases, either the entire string 602 consists of ASCII characters (as illustrated in
(52)
(53) To illustrate, a general sorting operation includes sorting a first text string 702 and a second text string 704. To perform the sort using Unicode-based weights, the computer compares the primary weight (e.g., wp) of the first character in the first text string 702 to the primary weight of the first character in the second text string 704. If the two primary weights are the same, the computer compares the primary weight of the second character in the first text string 702 to the primary weight of the second character in the second text string 704, and so on (a first pass). If the primary weights for all of the characters of both text strings are the same, the computer compares the secondary weight (e.g., wa) of the first character in the first text string 702 to the secondary weight of the first character in the second text string 704. If these two secondary weights are the same, the computer compares the secondary weight of the second character in the first text string 702 to the secondary weight of the second character in the second text string 704, and so on (a second pass). Finally, if the secondary weights for all the characters of both text strings are the same, the computer compares tertiary weights (a third pass). When millions of text strings are involved, this process is cumbersome and time consuming.
(54) In come implementations, lexicographic comparisons use a fast sort weight table 262, which contains weight assignments for ASCII characters (and perhaps some other characters, such as common Unicode characters found in database processing). The weight assignments are derived from the primary weights (wp) of Unicode characters. In doing so, the lengthy Unicode weights described above with reference to
(55) This approach proceeds one character position at a time, comparing each character in the first text string 702 to a corresponding character in the second text string 704 (e.g., the first character in the first text string 702 is compared to the first character in the second text string 704, the second character in the first text string 702 is compared to the second character in the second text string 704, and so on). If this process does not find any differing characters, the two text string are designated as equal. Otherwise, the process finds a first character in the first text string 702 that differs from the corresponding character in the second text string 704.
(56) For example, at position 710, the character x 712 in the first text string 702 differs from the character y 714 in the second text string 704 (not literally the characters x and y). These characters are searched for in the fast sort weight table 262 to determine the weights W.sub.x and W.sub.y. When W.sub.x is less than W.sub.y, the first text string 702 is ordered first in the collation order. Alternatively, when W.sub.y is less than W.sub.x, the second text string 704 is ordered first in the collation order. This approach in particular useful when the difference between the characters is either an accent-based difference or a case-based difference because the second pass and the third pass are typically avoided.
(57) In some instances, the fast sort weight table 262 is missing one or both of the characters. In this case, the computer resorts to the general technique described above that relies on lengthy Unicode weights. However, because database operations typically involve ASCII characters, the process avoids the slow general technique most of the time.
(58) Some implementations use block level comparisons initially (e.g., 8-byte blocks). In some cases, block level comparisons can quickly identify the first differing byte within the two strings. For example, comparing the first block 706-1 of the first string 702 to the corresponding block of the second string 704 reveals that they are the same. However, comparing the second block 706-2 of the first string 702 to the corresponding block of the second string 704 reveals that the second blocks have differing characters at the position 710. When using block level comparison initially, implementations first identify differing bytes, then determine what characters contain those bytes. In most cases, the differing bytes are ASCII characters. However, in some instances the first differing byte may be a portion of a Unicode character. In some of these instances, the first differing byte is the first byte of a multi-byte character. In some instances, the first differing byes are the second or subsequent bytes of a multi-byte character. Note also that a Unicode character that is not ASCII may span a pair of blocks. In some instances, the first differing bytes are the initial bytes of Unicode characters, and the remaining bytes are in the next block. A less common case occurs when the first differing byte is at the beginning of a block, the differing byes are part of multi-byte characters, and the differing bytes are not the first bytes of multi-byte characters. In this case, the process has to reconstruct entire characters, even if it means reusing the ending of the block that was just processed.
(59) In some instances, the first differing characters are determined to be equal according to the collation parameters 270, even if the characters are not identical (e.g., two accented characters involved in an accent-insensitive collation). In this case, the process proceeds to find the next differing character (if any).
(60) Like the other lookup tables, the fast sort weight table 262 depends on both the language and the strength parameters 270.
(61)
(62) The process then determines (812) whether both s.sub.p and t.sub.p are in the Fast Sort Weight table 262. If at least one of characters s.sub.p or t.sub.p is missing from the fast sort weight table 262, the process applies (822) the standard Unicode method to compute weights for the suffixes of the string S and T. Because all of the string positions less than p have the same bytes in both S and T, the Unicode process begins at the first differing character.
(63) If both s.sub.p and t.sub.p are in the Fast Sort Weight table 262, the process reads (814) the weight v.sub.p for the character s.sub.p and reads the weight w.sub.p for the character t.sub.p. The process then compares (816) these two weights. In many cases, the weights v.sub.p and w.sub.p are different, in which case the ordering of these two weights determines (818) the collation order of the two strings. If the two weights are equal, the process iterates (820) the same process shown in
(64)
(65) The process 1100 compares (1102) text strings that have Unicode encoding. Typically, the encoding uses UTF-8. The process is performed (1104) at a computing device having one or more processors and memory. The memory stores (1104) one or more programs configured for execution by the one or more processors.
(66) The process receives (1106) a first text string S.sub.1 and a second text string S.sub.2, both with Unicode encoding. As noted above, these two text strings may be received at substantially the same time or at different times. In some instances, one or both of the text strings include non-ASCII characters, but in many cases the text strings consist entirely of ASCII characters (0x00-0x7F). Each comparison has a designated set of comparison parameters 270. In some implementations, the comparison parameters specify a comparison language, whether the comparison is case-sensitive, and whether the comparison is accent-sensitive. In some instances, the comparison is designated (1108) as case-insensitive. In some instances, the comparison is designated as accent-insensitive. The processing is performed on blocks of bytes at a time according to a predetermined block size n. In some implementations, the block size is 8 (e.g., for a 64-bit processor). When the number of bytes in a received text string is not an integer multiple of n, some implementations pad (1110) the text string to make the total length an integer multiple of n. Typically, the character used for padding is (1112) ASCII Null (i.e., 0x00, which is sometimes written as \0).
(67) The process 1100 computes (1114) a first string weight according to a weight function , which computes an ASCII prefix .sub.A(S.sub.1), computes a Unicode weight suffix .sub.U(S.sub.1), and concatenates the ASCII prefix to the Unicode weight suffix to form the first string weight (S.sub.1)=.sub.A(S.sub.1)+.sub.U(S.sub.1). This is illustrated above in
(68) When the comparison parameters 270 specify accent-insensitivity, the ASCII prefix 622 for the first text string is computed (1126) by replacing accented Unicode characters in the first text string with equivalent unaccented ASCII characters. This is illustrated in
(69) As illustrated in
(70) When there is a first block containing a non-replaceable non-ASCII character, the process 1100 performs (1136) a character-by-character Unicode weight lookup beginning with the first block containing the non-replaceable non-ASCII character. Some implementations use a Unicode Weight table 264 for this process, as illustrated above with respect to
(71) The process 1100 also computes (1150) a weight for the second text string S.sub.2, which applies the same weight function used to compute the weight of the first string S.sub.1.
(72) The process 1100 determines (1152) whether the first string S.sub.1 and the second string S.sub.2 are equal by comparing the first string weight to the second string weight. Some implementations use (1154) a hash function h as part of the determination. In practice, the vast majority of input text strings are not equal, so a process that quickly identifies the strings that are not equal can improve performance. In this regard, a hash function h is useful. In a small number of cases, two different strings have the same hash value (a hash collision) because the hash function is not injective. Therefore, when hash values are equal, a follow up process must be used to determine whether the two input weights (S.sub.1) and (S.sub.2) are actually the same. On the other hand, when h((S.sub.1))h((S.sub.2)), (S.sub.1) is definitely not equal to (S.sub.2).
(73) The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms a, an, and the are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term and/or as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms includes, including, comprises, and/or comprising, when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
(74) The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various implementations with various modifications as are suited to the particular use contemplated.