Systems and methods to estimate rate of improvement for all technologies

Abstract

Systems and methods for predicting yearly performance improvement rates for nearly all definable technologies for the first time are provided. In one embodiment, a correspondence of all patents within the U.S. patent system to a set of technology domains is created. From the identified patent sets, the invention may calculate average centrality of the patents in each domain to predict improvement rates, following a patent network-based methodology. Also disclosed is a system to intake a user technology search query and match user intent with the technology domain as well as the corresponding improvement rate.

Claims

1. A method for calculating a rate of improvement of any technology, comprising: selecting a United States Patent Classification class and an International Patent Classification class to form a class pair; identifying patents having both the selected United States Patent Classification class and the selected International Patent Classification class; comparing the number of said identified patents with a threshold overlap standard; if the number of said identified patents is below the threshold overlap standard, discarding the class pair; calculating an average centrality for the class pairs; and obtaining an estimated improvement rate based on the calculated average centrality for the class pairs.

2. The method of claim 1, wherein patents common to multiple class pairs are assigned to a largest overlap.

3. The method of claim 1, wherein patents common to multiple class pairs are reflected in each of said multiple class pairs.

4. The method of claim 2, wherein the threshold overlap standard is a randomly expected overlap.

5. The method of claim 4, wherein the threshold overlap standard is 100.

6. The method of claim 5, wherein the threshold overlap standard is both the randomly expected overlap and 100 such that comparing the number of identified patents with a threshold overlap standard comprises both comparing the number of identified patents with the randomly expected overlap and 100, and if the number of identified patents is below either the randomly expected overlap or 100, discarding the class pair.

7. The method of claim 6, further comprising repeating the selecting, identifying, comparing, assigning, calculating, and obtaining actions for all possible class pairs.

8. A method for providing a calculated rate of improvement of technology for patented inventions, comprising: receiving a search query that comprises one or more technological terms that can each be assigned to one or more pre-determined technological domains of a plurality of pre-determined technological domains for patented inventions; identifying one or more pre-determined technological domains of the plurality of pre-determined technological domains for patented inventions based on each technological term of the one or more technological terms of the search query; determining which of the identified one or more pre-determined technological domains is most relevant to the search query; determining a rate of technological improvement for at least one pre-determined technological domain of the identified one or more pre-determined technological domains that is determined to be most relevant to the search query; and providing the determined rate of technological improvement for the at least one pre-determined technological domain that is determined to be most relevant to the search query.

9. The method of claim 8, wherein each pre-determined technological domain of the plurality of pre-determined technological domains comprises patents having a similar technological function using at least one of similar knowledge or similar scientific principles.

10. The method of claim 9, wherein the plurality of pre-determined technological domains comprises a substantial majority of patents in a database of patents that includes all United States patents, each patent of the substantial majority of patents being assigned to one or more pre-determined technological domains of the plurality of pre-determined technological domains.

11. The method of claim 10, wherein a substantial majority of patents comprises at least one of: at least 75 percent of all United States patents, at least 90 percent of all United States patents, at least 95 percent of all United States patents, or at least 97 percent of all United States patents.

12. The method of claim 11, wherein a pre-determined technological domain of the plurality of pre-determined technological domains is based on a designated amount of intersections between International Patent Classifications and United States Patent Classifications.

13. The method of claim 12, wherein the plurality of pre-determined technological domains comprises at least one of: at least 100 pre-determined technological domains, at least 250 pre-determined technological domains, at least 500 pre-determined technological domains, at least 750 pre-determined technological domains, at least 1000 pre-determined technological domains, at least 1250 pre-determined technological domains, at least 1500 pre-determined technological domains, at least 1700 pre-determined technological domains, at least 1750 pre-determined technological domains, or at least 1757 pre-determined technological domains.

14. The method of claim 13, wherein determining which of the identified one or more pre-determined technological domains is most relevant to the search query further comprises using mean-precision recall to perform said determining action.

15. The method of claim 14, wherein using mean-precision recall to perform said determining action further comprises using both arithmetic mean and geometric mean to perform said determining action.

16. The method of claim 15, wherein providing the determined rate of technological improvement for the at least one pre-determined technological domain that is determined to be most relevant to the search query, further comprises: providing the determined rate of technological improvement for two or more pre-determined technological domains of the at least one pre-determined technological domain that is determined to be most relevant to the search query; and providing one or more additional types of patent-related information for each of the two or more pre-determined technological domains of the at least one pre-determined technological domain that is determined to be most relevant to the search query.

17. The method of claim 16, wherein the one or more additional types of patent-related information comprises at least one of: a patent, an assignee, an inventor, or other information that allows one patent to be differentiated from another patent.

Description

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

[0047] FIG. 1 depicts the distribution of difference in predicted K in % per annum.

[0048] FIG. 2 is a flow diagram depicting the process steps for inverted COM.

[0049] FIG. 3 depicts a distribution of difference in predicted K in % per annum for both predictors.

[0050] FIG. 4 depicts fitting the distribution of mean centrality across the 1,757 domains.

[0051] FIG. 5 depicts a welcome page in one embodiment of the invention.

[0052] FIG. 6 depicts a search page in one embodiment of the invention.

[0053] FIG. 7 depicts a results overview page in one embodiment of the invention.

[0054] FIG. 8 depicts a detailed results page in one embodiment of the invention.

[0055] FIG. 9 depicts a feedback section of the detailed results page in one embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0056] The following estimated equation, trained by running a regression for the 30 technologies for which observed improvement rates are available, can be used for out-of-sample predictions of the improvement rate of any given technology domain i for which an accurate patent set can be identified.

Estimated $K_{i} = \exp(6.16 * X_{i} - 5.02) * \exp(\sigma_{i}^{2}/2)$ (Eq. 2)

[0057] In Equation 2, the numbers inside the first exponential are the estimated coefficients of an OLS regression that has the log of the improvement rate as the dependent variable, an intercept, and one predictor $X_{i}$ for each technology domain i. In Triulzi et al. (2020), this predictor is the mean, over all patents j in domain i, of the average centrality of the patents cited by patent j. The second term on the right-hand side is a correction factor to move back from a log scale to a linear scale.
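
The form of the correction factor follows from a standard property of the log-normal distribution. The short derivation below is a sketch under the assumption, made here for illustration only, that the regression error for domain i is normally distributed with zero mean and variance σ_i²; the symbols β_0, β_1 and ε_i are introduced here and do not appear in the disclosure, which does not state how σ_i is estimated.

```latex
\begin{aligned}
\ln K_{i} &= \beta_{0} + \beta_{1} X_{i} + \varepsilon_{i},
  \qquad \varepsilon_{i} \sim \mathcal{N}(0, \sigma_{i}^{2}) \\
\mathbb{E}[K_{i} \mid X_{i}]
  &= \exp(\beta_{1} X_{i} + \beta_{0}) \, \mathbb{E}[\exp(\varepsilon_{i})] \\
  &= \exp(\beta_{1} X_{i} + \beta_{0}) \, \exp(\sigma_{i}^{2}/2)
\end{aligned}
```

With β_1 = 6.16 and β_0 = −5.02 the last line has the same form as Equation 2.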

[0058] A dataset for one embodiment of the invention contains all patents issued by USPTO from 1976-2015 for which valid U.S. Patent Classification system (UPC) and International Patent Classification (IPC) current classification data exist. In some embodiments, the invention may be practiced with the current classification data files (i.e., reclassified data and not the data at time of grant) and the list of 3-digit current UPC classes and 4-character IPC classes for the extension of COM.

[0059] A suitable dataset, with all patents granted since 1976, may be obtained from the PatentsView platform. PatentsView gets access to the data through an arrangement with the Office of the Chief Economist in the US Patent and Trademark Office, and the data is current through Oct. 8, 2019. The dataset contains patent number, date of grant and other metadata. The dataset may be limited to U.S. patents because the performance datasets available to us are overwhelmingly from the U.S. and because the UPC system is desirable for application of COM. U.S. patent data is likely representative of patenting activity worldwide because of the reputation of the U.S. as a technology leader and because the vast size of its consumer market entices most global firms to patent in the U.S.

[0060] A suitable dataset may include only patents with grant dates between Jan. 1, 1976 and Jun. 1, 2015, totaling 5.7 million. In one embodiment, non-utility (special) classes of patents such as those with the designation “D”, “PP”, “H”, “RE” and “T”, summarized in Table 1, may be removed. A description of these designations can be found at the USPTO website. Removing them yields a total of 5,083,263 valid unique utility patents. The UPC class “G9B” may be excluded because of its very high similarity to the corresponding IPC class from which it originated, thus rendering it unsuitable for COM.

TABLE 1
Patent Type    Number of Patents
D-type         573,505
PP-type        25,221
H-type         2,258
RE-type        17,955
T-type         509
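
The removal of non-utility patents can be sketched as a filter on the patent identifier. This is a minimal illustration, assuming (hypothetically) that identifiers consist of an optional letter prefix followed by digits, with utility patents carrying no listed prefix; the function name and the identifier format are assumptions and are not taken from the disclosure or from any particular data source.

```python
import re

# Prefixes of the non-utility designations listed in Table 1.
NON_UTILITY_PREFIXES = ("D", "PP", "H", "RE", "T")

# Assumed identifier format: optional letter prefix followed by digits.
_PATENT_ID = re.compile(r"^([A-Z]*)(\d+)$")


def is_utility(patent_id: str) -> bool:
    """Return True when the identifier does not carry a non-utility prefix."""
    match = _PATENT_ID.match(patent_id.strip().upper())
    if match is None:
        return False  # unrecognized format: exclude rather than guess
    return match.group(1) not in NON_UTILITY_PREFIXES


sample = ["4123456", "D573505", "PP25221", "RE17955", "H2258", "T509"]
utility_only = [p for p in sample if is_utility(p)]
print(utility_only)  # only the identifier without a prefix remains
```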

[0061] A suitable dataset may use the complete list of 3-digit current UPC classes (439 in number, obtained from the USPTO website) and 4-character IPC classes (648 in number, obtained from the WIPO website) for utility patents. The USPTO updates the taxonomy at regular intervals to maintain ‘consistency’ in classification (in addition to ease of searching) as the meaning of ‘consistent’ changes over time (Lafond, F., Kim, D., 2017. Long-Run Dynamics of the U.S. Patent Classification System (SSRN Scholarly Paper No. ID 2924387). Social Science Research Network, Rochester, N.Y.). The UPC classes list has not been updated since May 2015. The IPC classes list continues to be updated every year and version 2019.1 is suitable for this invention. The USPTO also reclassifies patents so that the patents adhere to the latest taxonomy. The reclassification data may be advantageous as the current structure of technology is likely best reflected in the patent classification we have now, instead of the one at the time of grant. This is referred to as current classification in this disclosure. Using classification at time of grant may arrive at a slightly different set of domains (both in number and in composition). For a historical analysis one could use both the current classification and the classification at time of grant to understand the evolution of the structure of technological domains.

[0062] UPC current classification data may be obtained from the PatentsView platform. The classification data is based on USPTO bulk data files which were last updated on May 18, 2018. There are 22,880,877 patent records with the current UPC classification data. These contain 5,134,285 unique patents, suggesting that each patent belongs to 4.46 UPC classes on average.

[0063] IPC current classification data may be obtained from the Google BigQuery platform, which uses data from IFI CLAIMS Patent Services. The UPC to IPC concordance was last published on Aug. 20, 2015. As such, no reclassification data for IPC is available after 2015. There are 21,857,265 patent records with International Patent Classification system (IPC) classification data (from 1976 to 2019). These contain 5,920,113 unique patents, suggesting that each patent belongs to 3.69 IPC classes on average. As noted above, in one embodiment, all classes (both UPC and IPC) in which a patent is listed, and not just the main class, may be used.

Representative Methodology

[0064] Decomposition of the Entire Patent System into a Set of Technology Domains

[0065] This invention builds on the concept of domains and the discovery of patents belonging to a particular domain using the classification overlap method (COM) described above, and describes the extension, inversion and automation of COM to give a technology domain description for the entire patent system. The invention does not start with a pre-search for a technology of interest as in the usual COM application. Rather, it may start with the set of patents described above as well as the lists of UPC and IPC classes. In operation, one class from the UPC list and one from the IPC list are selected, and all patents which belong to each of those classes are found using the classification data. The invention may then find the “overlap” between these two sets, that is, the patents which lie in both the given IPC class and the given UPC class. This may be done for all possible class pairs, i.e. unique combinations of one class from IPC and one from UPC, and thus defines the full set of overlaps. All overlaps are potentially domains, but only if a large enough set of patents occupies the overlap.

[0066] The disclosed method systematically calculates the overlap for all possible class pairs. Since each class pair is composed of an IPC class and a UPC class, there are 284,472 possible class pairs in total. The overlap is obtained as the size of the intersection of the set of patents in the selected UPC class and the set of patents in the selected IPC class (i.e. the set of patents listed in both the UPC and IPC class being combined). Empirically, most of these overlaps are empty: 55% of the overlaps are zero.
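
The overlap calculation can be sketched with plain set intersections. This is an illustrative sketch, not the disclosed implementation: it assumes the classification data is available as (patent, class) rows, and the function names and toy identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import product


def class_members(records):
    """Map each class code to the set of patents listed in that class.

    `records` is an iterable of (patent_id, class_code) rows.
    """
    members = defaultdict(set)
    for patent_id, class_code in records:
        members[class_code].add(patent_id)
    return members


def all_overlaps(upc_records, ipc_records):
    """Return {(upc_class, ipc_class): set of shared patents}, skipping empty overlaps."""
    upc = class_members(upc_records)
    ipc = class_members(ipc_records)
    overlaps = {}
    for upc_class, ipc_class in product(upc, ipc):
        shared = upc[upc_class] & ipc[ipc_class]
        if shared:
            overlaps[(upc_class, ipc_class)] = shared
    return overlaps


# Toy classification rows (hypothetical patent identifiers).
upc_rows = [("p1", "850"), ("p2", "850"), ("p2", "123"), ("p3", "123")]
ipc_rows = [("p1", "G01Q"), ("p2", "G01Q"), ("p2", "F02B"), ("p3", "F02B")]
for pair, patents in sorted(all_overlaps(upc_rows, ipc_rows).items()):
    print(pair, len(patents))
```

Because patent p2 is listed in two classes of each system, it appears in every overlap of the toy example, which is the situation that deduplication addresses.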

[0067] The disclosed method only considers as domains those class pairs whose overlap is larger than would be expected at random. This is done to avoid misclassification noise. In some embodiments, to deduplicate the patent sets, the disclosed method then assigns patents which lie in more than one overlap to the biggest overlap that they occupy. A final list of domains with each patent matched to only a single domain is thus obtained. In other embodiments, patents in multiple domains are counted in each domain in which they appear.
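
The largest-overlap assignment can be sketched as follows. This is a minimal illustration under stated assumptions: overlap sizes are measured before deduplication, and ties between equally sized overlaps are broken by class-pair label, which is an arbitrary choice made here because the disclosure does not specify a tie rule. The function name is hypothetical.

```python
def deduplicate(overlaps):
    """Assign each patent to the largest overlap it occupies.

    `overlaps` maps (upc_class, ipc_class) -> set of patent ids.
    Sizes are those of the original overlaps; ties go to the class pair
    that sorts last (an arbitrary rule assumed for this sketch).
    """
    sizes = {pair: len(patents) for pair, patents in overlaps.items()}
    best = {}
    for pair, patents in overlaps.items():
        for patent in patents:
            current = best.get(patent)
            if current is None or (sizes[pair], pair) > (sizes[current], current):
                best[patent] = pair
    domains = {pair: set() for pair in overlaps}
    for patent, pair in best.items():
        domains[pair].add(patent)
    return domains


toy = {
    ("850", "G01Q"): {"a", "b", "c"},
    ("850", "G01N"): {"c", "d"},
}
deduped = deduplicate(toy)
print({pair: sorted(members) for pair, members in deduped.items()})
```

In the toy input, patent "c" lies in both overlaps and ends up only in the larger one, so the smaller overlap shrinks from two patents to one.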

[0068] To illustrate this methodology, we start with the overlaps between UPC classes 850, 353 and 123 and IPC classes G01Q, H02B and F02B, listed in Table 2.

TABLE 2
Class No.    Class Name                                                                                       Class Size
850          Scanning-probe techniques or apparatus                                                           3,045
353          Optics: image projectors                                                                         9,282
123          Internal-combustion engines                                                                      62,113
G01Q         Scanning-probe techniques or apparatus; applications of scanning-probe techniques                4,748
H02B         Boards, substations, or switching arrangements for the supply or distribution of electric power  5,147
F02B         Internal-combustion piston engines; combustion engines in general                                35,318

[0069] This yields a total of nine potential domains as shown in Table 3 where the values in each intersection are the number of patents found in that overlap. For instance, for the class pair 123F02B there are 20,575 patents that are listed both in the 62,113 patent UPC 123 class and the 35,318 patent IPC F02B class.

TABLE 3
                 IPC Class No.
UPC Class No.    G01Q     H02B     F02B
850              2,946    0        0
353              0        2        0
123              0        0        20,575

[0070] We only consider those class pairs as domains which have above random probability of being in an overlap to avoid misclassification noise as described above. This eliminates the domain label from class pair 353H02B containing only two patents.

[0071] Given the value of P(IPC_x∩UPC_y) and the size of the sample space (total number of patents in our set), we can obtain the expected size of the overlap. For instance, in Table 3, for the class pair 353H02B we calculate:

P(UPC_353)=number of patents in UPC class 353/Total number of US patents (Eq. 3)

[0072] Thus, P(UPC_353)=9,282/5,083,263=0.0018. Similarly, P(IPC_H02B)=0.001.

[0073] The joint probability,

P(UPC_353∩IPC_H02B)=P(UPC_353)×P(IPC_H02B)=1.85×10^{−6} (Eq. 4)

[0074] Finally, the expected overlap can be calculated as:

Expected overlap=P(UPC_353∩IPC_H02B)×Total number of US patents (Eq. 5)

[0075] The randomly expected overlap comes out to be 9.4. The actual overlap is only 2. Thus, we regard this class pair as not being a domain and do not analyze it further. For efficiency, we also discard all class pairs which contain fewer than 100 patents, as we believe that is a reasonable threshold for a set of patents to constitute a technology domain with a coherent function and knowledge base. Deduplication empties a number of small overlaps and reduces the number of patents in others. In our illustrative example, this results in 850G01Q and 123F02B as valid domains with sizes 568 and 20,437 (smaller than the original overlaps due to deduplication), as shown in Table 4.

TABLE 4
                 IPC Class No.
UPC Class No.    G01Q     H02B       F02B
850              568      0          0
353              0        discard    0
123              0        0          20,437
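
The arithmetic of Equations 3 to 5 for class pair 353H02B can be reproduced in a few lines. The class sizes and the patent total are taken from Table 2 and paragraph [0060]; the function and constant names are introduced here for illustration only.

```python
TOTAL_PATENTS = 5_083_263  # valid utility patents in the dataset
UPC_353_SIZE = 9_282       # Table 2
IPC_H02B_SIZE = 5_147      # Table 2


def expected_overlap(upc_size, ipc_size, total):
    """Expected overlap size if the two class memberships were independent."""
    p_upc = upc_size / total          # Eq. 3
    p_ipc = ipc_size / total
    joint = p_upc * p_ipc             # Eq. 4
    return joint * total              # Eq. 5


expected = expected_overlap(UPC_353_SIZE, IPC_H02B_SIZE, TOTAL_PATENTS)
print(round(expected, 1))  # 9.4
```

Since the actual overlap of 2 patents is below this value, the class pair is discarded.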

[0076] To assess potential misclassification and/or typos as a source of noise, we calculate the expected probability of a patent lying in an overlap as the product of the probability that a patent lies in a given IPC class x, P(IPC_x), and the probability that the patent lies in a given UPC class y, P(UPC_y), as if they were independent events. If the UPC and IPC patent classes are unrelated, i.e. the probability of being classified in the given IPC class is independent of being classified in a given UPC class, then the joint probability P(IPC_x∩UPC_y) is in principle the probability of randomly misclassifying due to a typing mistake (typo) or a thinking mistake (thinko). In general, in a domain, the IPC class and the UPC class should be more than randomly related. Therefore, if the overlap is less than what would result from patents being randomly classified in both the given UPC class and the IPC class, that overlap may be discarded as it is not an actual domain.

[0077] The probability of any joint event A and B i.e. P(A∩B) equals the product of probability of event A i.e. P(A) and probability of event B i.e. P(B) if the two events are completely independent:

P(A∩B)=P(A)·P(B) (Eq. 6)

[0078] Given the value of P(IPC_x∩UPC_y) and the size of the sample space (total number of patents in our set), we can obtain the expected size of the overlap.

Expected overlap=P(IPC_x∩UPC_y)×Total number of US patents (Eq. 7)

[0079] If the actual overlap comes out to be less than the randomly expected overlap, that class pair may be regarded as not being a domain and not analyzed further. All class pairs with an actual overlap less than randomly expected may be discarded, because such an overlap could occur through noise due to miswritten class numbers or other semi-random noise.

[0080] Empirically, for the dataset described above, this step loses 23,711 patents, accounting for 0.47% of total patents, and yields a set of valid “domains”. For efficiency, we may also discard all class pairs which contain fewer than 100 patents, as we believe that is a reasonable threshold for a set of patents to constitute a technology domain with a coherent function and knowledge base.
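
The two discard rules (randomly expected overlap and a minimum of 100 patents) can be combined in a single predicate. This is a sketch under the assumption that a class pair is kept only when it meets both thresholds; the function name and the `min_size` parameter are hypothetical. The sizes used in the example are the pre-deduplication values from Tables 2 and 3.

```python
def is_domain(actual_overlap, upc_size, ipc_size, total, min_size=100):
    """Keep a class pair only if its overlap reaches both thresholds."""
    expected = upc_size * ipc_size / total  # randomly expected overlap
    return actual_overlap >= expected and actual_overlap >= min_size


TOTAL_PATENTS = 5_083_263

# (class pair, actual overlap, UPC class size, IPC class size)
examples = [
    ("850G01Q", 2_946, 3_045, 4_748),
    ("353H02B", 2, 9_282, 5_147),
    ("123F02B", 20_575, 62_113, 35_318),
]
for name, overlap, upc_size, ipc_size in examples:
    print(name, is_domain(overlap, upc_size, ipc_size, TOTAL_PATENTS))
```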

[0081] Since some patents lie in multiple UPC and multiple IPC classes (as discussed above), some patents naturally lie in more than one overlap. For simplicity, they may be assigned to the largest overlap so that the final decomposition lists each patent in only one domain. For the purpose of estimating rates of technology improvement, which is the focus of this work, this assignment does not make a material difference.

[0082] To verify that proposition, the distribution of the difference and percentage difference between predicted rates of improvement (predicted K) from original dataset and deduplicated dataset may be examined. The predicted rates of improvement (predicted K) are reported as percentage change per annum. Table 5 contains summary statistics for predicted rates of improvement (predicted K in % per annum) from an original dataset and a deduplicated dataset.

TABLE 5
        Original Dataset    Deduplicated Dataset    Difference in
        Predicted K (O)     Predicted K (D)         K-Value (O − D)
MEAN    20                  19                      0
STD     28                  26                      7
MIN     3                   2                       −63
MAX     226                 230                     77

[0083] As seen in Table 5, the means of both the percentage difference and the difference are quite small. The mean of the difference is 0 percentage points with a standard deviation of 7, suggesting that almost all difference values lie close to 0. This can also be seen clearly in the plot of the distribution of the difference in FIG. 1.

[0084] In research on technological structure, the duplicated lists would be used as well. Deduplication empties a number of small overlaps and reduces the number of patents in others. Going forward, by size this disclosure refers to the number of unique patents after deduplication. FIG. 2 shows a graphical summary of the methodology and the key steps described above.

[0085] Recall that for the dataset described there are 283,824 (438×648) potential domains having a unique UPC and IPC classification and 5,083,263 utility patents in our set that have appropriate UPC and IPC designations. 66.3%, i.e. two-thirds, of these patents are contained in the largest 175 domains, which are only 0.06% of the possible domains, showing that technologies are concentrated in a relatively narrow set of domains. Indeed, 13,142 overlaps (i.e. 4.62% of class pairs) contain 99.52% of all patents. Recall that before deduplication, only 55% of the possible domains contained zero patents. Indeed, the overall concentration is stronger after deduplication. The details in Table 6 illustrate two inter-related key results for the exemplary dataset. First, most of the domains are small (86.6% of domains contain fewer than 100 patents each). Second, despite the large number of small domains, most patents are in larger domains (almost 90% of the patents are in domains that have at least 1,000 patents).

TABLE 6
Size of        Number of domains    Number of domains of that size      Total number of unique patents    Fraction of unique patents
Domain         of that size         as a fraction of total possible     in all domains of that size       in all domains of that size
1-9            8400                 0.0296                              23864                             0.0047
10-99          2985                 0.0105                              94163                             0.0185
100-999        1153                 0.0041                              390087                            0.0767
1000-9999      490                  0.0017                              1685323                           0.3315
Above 10000    114                  0.0004                              2865248                           0.5637

[0086] The domains considered may be confined to those containing greater than or equal to 100 patents. As can be found by adding the last three entries in Table 6, this yields 1,757 domains and 97.2% of the patent system.
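
The two totals quoted above can be checked directly from the rows of Table 6. The snippet below only re-adds the published figures; the variable names are introduced here for illustration.

```python
# (size band, number of domains, unique patents) as listed in Table 6
TABLE_6 = [
    ("1-9", 8400, 23864),
    ("10-99", 2985, 94163),
    ("100-999", 1153, 390087),
    ("1000-9999", 490, 1685323),
    ("Above 10000", 114, 2865248),
]
TOTAL_PATENTS = 5_083_263

kept = TABLE_6[2:]  # bands with at least 100 patents per domain
n_domains = sum(row[1] for row in kept)
share = sum(row[2] for row in kept) / TOTAL_PATENTS
print(n_domains, f"{share:.1%}")  # 1757 97.2%
```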

Calculation of Rates of Improvement for the Set of Domains

[0087] As explained above, the disclosed method uses the method defined in Triulzi et al. (2020) to predict the improvement rate for each identified technology domain. However, the instant disclosure departs from that paper, among other ways, in the choice of which centrality measure to use. In Triulzi et al. (2020), the authors propose using the average normalized centrality of the patents cited by a domain's patents as a predictor of the domain's improvement rate. This is done because using data on the focal patents' centrality would require waiting an arbitrary number of years after the patent is granted to allow the patent time to accumulate citations, which is necessary to measure its centrality reliably. Since a focal patent's centrality and the centrality of the patents it cites are strongly correlated, and given that in Triulzi et al. (2020) some of the domains studied were very recent, in that paper the authors preferred to use the centrality of the cited patents to avoid losing data for the young domains. However, in the instant method, a very large sample of domains may be analyzed, most of which are fairly old. Therefore, the normalized centrality of the focal patents in a domain, computed three years after each patent is granted, is preferred, given its stronger appeal in terms of ease of computation and presentation of the measure.

[0088] The proposition that the change in predictor has no significant effects on the result of the prediction may be verified, which is expected since the correlation between the normalized centrality of patents after three years and the normalized cited patents' centrality is 0.97 at the 1,757 domains' level (it is 0.77 at the patent level).

[0089] Table 7 below compares a few indicators of the goodness of the prediction between model 1 in Table 3 of Triulzi et al. (2020), which reports the regression results for the normalized centrality of cited patents, and the same model using the normalized patent centrality after three years.

TABLE 7
                                      Model using the normalized          Model using the normalized patent
                                      centrality of the cited patents     centrality after three years
R²                                    0.63                                0.62
Sum of squared residuals              10.23                               10.68
Standard error of the regression      0.16                                0.16

[0090] One may then examine the distribution of the difference and percentage difference between predicted rates of improvement (predicted K) from original predictor (average centrality of patents cited by a domain's patents) and the modified predictor (centrality of the focal patents in a domain after three years from the moment the patent is granted). The predicted rates of improvement (predicted K) are reported as percentage change per annum.

TABLE 8
        Triulzi et al. (2020)    This disclosure     Difference in
        predicted K (T)          predicted K (S)     K-Value (T − S)
MEAN    18                       19                  −1
STD     27                       26                  6
MIN     2                        2                   −63
MAX     262                      230                 122

[0091] As seen in Table 8, the difference in K-values has a mean of −1% per annum (p.a.) and a standard deviation of 6% p.a., showing that for most domains the predicted K values are almost identical. This can also be seen clearly in the plot of the distribution of the difference in FIG. 3.

[0092] For prediction, the disclosed method may use the following equation, adapted from Triulzi et al. (2020).

Estimated $K_{i} = \exp(6.22 * X_{i} - 4.97) * \exp\left(\frac{\sigma_{i}^{2}}{2}\right)$ (Eq. 8)

[0093] The coefficients have been obtained by training an OLS regression of the log of the observed improvement rate for 30 technologies (for which empirical time series of performance over time were available) against the average normalized centrality of their patents measured three years after being granted ($X_{i}$ in the equation). The improvement rates for these 30 technologies and their patent sets with centrality values may be the same as those used in Triulzi et al. (2020).

[0094] For each of the 5,083,263 utility patents granted by the USPTO between 1976 and 2015, the normalized centrality index may be computed using the same citation network randomization procedure presented in Triulzi et al. (2020) and explained briefly above. The average centrality of patents in each of the 1,757 identified technology domains may then be computed and plugged into the equation to obtain the predicted yearly performance improvement rate.
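
Equation 8 can be evaluated with a short helper. This is a sketch rather than the disclosed implementation: the function name is hypothetical, `sigma` defaults to zero only so that the point estimate can be shown without the correction factor, and reading the result as a fraction per year (0.16 meaning roughly 16% per annum) is an assumption made here.

```python
import math

SLOPE = 6.22       # coefficient on X_i in Eq. 8
INTERCEPT = -4.97  # intercept in Eq. 8


def estimated_k(mean_centrality, sigma=0.0):
    """Evaluate Eq. 8 for a domain's mean normalized centrality X_i.

    `sigma` is the standard deviation entering the log-to-linear correction;
    how it is estimated is not specified in this sketch.
    """
    point_estimate = math.exp(SLOPE * mean_centrality + INTERCEPT)
    correction = math.exp(sigma ** 2 / 2)
    return point_estimate * correction


for x in (0.3, 0.5, 0.7):
    print(x, round(estimated_k(x), 3))
```

Because the predictor enters through an exponential, equal steps in mean centrality multiply the predicted rate by a constant factor rather than adding a constant amount.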

[0095] It is important to note that the normalization of the centrality measure for each patent granted by the USPTO produces an indicator that is uniformly distributed between 0 and 1. Therefore, if patent sets for the technology domains were sampled randomly from the overall set and the average normalized patent centrality were then calculated for each technology domain, its distribution would follow a normal distribution with mean equal to 0.5. If that were true, the distribution of the predicted improvement rate could not be interpreted, as it would just be an artifact of randomly sampling patents and of the centrality normalization method. This is not the case, as all normality tests for the distribution of mean centrality across domains reject the normality hypothesis. In fact, the best fit for this distribution is an exponentially modified Gaussian (the sum of an exponential random variable and a Gaussian one).
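
The random-sampling null case can be illustrated with a small simulation. It is a sketch under simplifying assumptions that are not in the disclosure: every simulated domain has the same size (100 patents), centralities are drawn independently from a uniform distribution on [0, 1), and the seed and function name are arbitrary.

```python
import random
import statistics


def random_domain_means(n_domains, domain_size, seed=0):
    """Mean centrality of domains built by random sampling of uniform values."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_domains):
        sample = [rng.random() for _ in range(domain_size)]
        means.append(statistics.fmean(sample))
    return means


means = random_domain_means(n_domains=1757, domain_size=100)
print(round(statistics.fmean(means), 2))   # close to 0.5
print(round(statistics.stdev(means), 3))   # narrow spread around 0.5
```

Under this null case the domain means cluster tightly and symmetrically around 0.5, which is the pattern the normality tests are designed to detect and which the observed distribution does not show.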

[0096] A series of normality tests may be performed to examine the possibility that the distribution of centrality across domains could reflect a random sampling of patents from the overall population. FIG. 4 shows the distribution of the mean centrality (calculated three years after the patent is granted) for all 1,757 technology domains. The distribution is overlaid by the best fitted probability density function (an exponentially modified Gaussian distribution, with rate, location and scale parameters of 1/2.057, 0.313 and 0.0586, respectively) and two comparison distributions, the Gaussian and the log-normal. The parameters have been obtained by fitting the distribution with the package Fitter in Python. The sum of the squared errors for these three distributions is 7.67 (with an AIC of 120.6 and BIC of −9524.8) for the exponentially modified Gaussian, 11.88 (with an AIC of 147.6 and BIC of −8755.17) for the log-normal and 41.36 (with an AIC of 106.7 and BIC of −6571.9) for the Gaussian. This clearly shows that the distribution is neither normal nor log-normal.

[0097] A series of normality tests may also be performed, as reported in Table 9, which unequivocally reject normality.

TABLE 9
Test                        Statistic    P-value
Shapiro-Wilk                0.937        0.000
D'Agostino's Chi-squared    240.553      0.000
Anderson-Darling            32.758       0.000
Kolmogorov Smirnov          240.553      0.000

[0098] This methodological result, along with the test for randomly expected overlap, further strengthens the proposition that the method disclosed herein reveals an underlying property of the technology system concerning the distribution of improvement rates and patent centrality across domains.

Online Technology Search and Identification of Domain Technology

[0099] The process presented provides a broad and systematic account of technological change. However, improvement rates for specific technologies (or domains) or groups of related technologies are also of potential interest to many engineers, product designers, researchers, technology project managers, R&D managers and policy makers. Thus, we have developed an online technology search system which enables a user to find predicted improvement rates for a technology of interest. For each given search term, we return the top 5 most representative domains along with a prediction of the improvement rate for each domain as well as the title and abstract of the most central 20 patents from the most representative domain. The user is then able to judge whether to try different key words, if the example patents indicate something different from what they intended to examine, or whether to pursue interesting leads from reading the patents they discover in the first round. The search tool can be accessed through a user-friendly interface and is hosted on a cloud server.

[0100] For a user trying to find domains, the online technology search system operates as follows. For each given search term, we search patent titles, abstracts and (optionally) descriptions across the entire dataset of valid US utility patents (with grant dates between Jan. 1, 1976 and Jun. 1, 2015) and return the list of patent numbers containing the term. This is accomplished by using full text search functionality in a relational database (such as MySQL, PostgreSQL etc.). The standard text search function incorporates tokenization, stemming and vectorization to enhance the search. We then match this list of patents to our corresponding domains by using the correspondence established before. We find the most representative domain for those patents by using a relevance ranking. The relevance ranking for the patent classes is accomplished by using the mean-precision-recall (MPR) value proposed by Benson and Magee (2013). This value was inspired by the ‘F1’ score that is common in information retrieval but uses the arithmetic mean (instead of the geometric mean) of the precision and recall of a returned data set (Magdy, W., Jones, G. J. F., 2010. PRES: a score metric for evaluating recall-oriented information retrieval applications, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10. Association for Computing Machinery, Geneva, Switzerland, pp. 611-618). We return the top 5 most representative domains along with a prediction of the improvement rate for each domain as well as the title and abstract of the top 20 and a random set of 20 patents from the most representative domain.
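
The relevance ranking step can be sketched as follows. This is an illustration only: the definitions of precision (share of the search hits that fall in the domain) and recall (share of the domain that appears among the hits) are assumptions made here, as are the function name and the toy data; the text search itself is taken as already done.

```python
def rank_domains(hits, domains, top_n=5):
    """Rank domains by the arithmetic mean of precision and recall (MPR).

    `hits` is the set of patent ids returned by the text search;
    `domains` maps a domain label to its set of patent ids.
    """
    if not hits:
        return []
    scored = []
    for label, members in domains.items():
        shared = len(hits & members)
        if shared == 0:
            continue  # domain contains none of the search hits
        precision = shared / len(hits)     # assumed definition
        recall = shared / len(members)     # assumed definition
        scored.append((label, (precision + recall) / 2))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


search_hits = {"a", "b", "c", "d"}
toy_domains = {
    "X": {"a", "b", "c", "e"},
    "Y": {"d", "f", "g", "h", "i", "j"},
    "Z": {"k"},
}
print(rank_domains(search_hits, toy_domains))
```

Because the arithmetic mean is symmetric, swapping the two assumed definitions would leave the MPR value, and hence the ranking, unchanged.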

[0101] The user is then able to judge whether to try different key words if the example patents indicate something different than what they intended to examine or want to pursue interesting leads from reading the patents they discover in the first round. The search tool can be accessed by the readers from the project website through a user-friendly interface and is hosted on a cloud server.

Other Considerations

[0102] This disclosure and invention represent the first attempt at a complete yet granular survey of technological performance improvement rates across the entire spectrum of technology. The disclosed methods to survey predicted improvement rates, the online technology search system and the analysis of the distribution of improvement rates across all technology domains have important managerial and policy implications, especially for allocation of resources among competing priorities.

[0103] There are also other aspects and embodiments within the scope of this invention. First, we have not determined every improvement rate of possible interest: the domains with fewer than 100 patents (˜10,000 domains) may contain some important emerging technologies; we have not even attempted to name all 1,757 domains for which we separately predicted rates of improvement; and, perhaps most importantly, prior work using COM (Benson, C. L., Magee, C. L., 2016. Using Enhanced Patent Data for Future-Oriented Technology Analysis, in: Daim, T. U., Chiavetta, D., Porter, A. L., Saritas, O. (Eds.), Anticipating Future Innovation Pathways Through Large Data Analysis, Innovation, Technology, and Knowledge Management. Springer International Publishing, Cham, pp. 119-131; Guo, X., Park, H., Magee, C. L., 2016. Decomposition and Analysis of Technological domains for better understanding of Technological Structure. arXiv:1604.06053 [cs]; Benson, C. L., Triulzi, G., Magee, C. L., 2018. Is There a Moore's Law for 3D Printing? 3D Printing and Additive Manufacturing 5, 53-62; You, D., Park, H., 2018. Developmental Trajectories in Electrical Steel Technology Using Patent Information. Sustainability 10, 2728) has often found specific technologies to be more closely identified at deeper sub-groups within the UPC and IPC classes than the high-level classes we applied systematically in this work. As shown in the identification work we have done, the higher level we used leads to coherent domains and recognizable technologies, but the inventive method could pursue deeper level sub-domains. A second alternative implementation uses a different method to eliminate doubly (or triply etc.) listed patents, in which all duplicate patents are assigned to only the largest domain in which they are found. Another approach is to start by combining domains with high overlaps and use the remaining overlaps to give a measure of interactive structure.

Technology Search System and Hardware Implementation

[0104] Since each class pair in the representative dataset is composed of an IPC class and a UPC class, there are 284,472 possible class pairs in total. There are also 5,083,263 patents in the set. As such, powerful computing machines with multiple CPUs and large random-access memories are required to calculate overlaps of all possible pairs, to test whether each overlap is above the randomly expected overlap and, optionally, to assign the overlap to the larger domain, because these processes require large amounts of processing power and working memory. In actual practice, this is often accomplished by using high performance computing clusters rather than personal computers.

[0105] Another implementation is to parallelize the computation using GPU machines. This requires vectorization of the calculation to treat the data sources as large arrays of classification data for each patent. In the vectorized version, the system needs to have a larger amount of working memory because these large arrays of classification data need to be merged within the working memory. Once the arrays are merged, one would merely need to group the data by merged IPC and UPC designations to get the list of patents belonging to each pair. One would then test each non-zero pair for whether it is above the randomly expected overlap or not.
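By way of a hedged example, the grouping and threshold test described above may be sketched on a single machine as follows. The sketch assumes that the randomly expected overlap of a class pair is the product of the two class sizes divided by the total number of patents (an independence assumption that this section does not spell out), and it applies the fixed minimum of 100 alongside it. The function name `find_class_pairs` and the input layout are hypothetical.

```python
from collections import Counter
from itertools import product


def find_class_pairs(patent_classes, min_overlap=100):
    """Return {(upc, ipc): overlap} for class pairs that pass both thresholds.

    patent_classes maps a patent number to (set of UPC classes, set of IPC classes).
    Assumption: expected overlap = |UPC class| * |IPC class| / total patents.
    """
    n_patents = len(patent_classes)
    upc_counts, ipc_counts, pair_counts = Counter(), Counter(), Counter()

    for upcs, ipcs in patent_classes.values():
        upc_counts.update(upcs)                    # size of each UPC class
        ipc_counts.update(ipcs)                    # size of each IPC class
        pair_counts.update(product(upcs, ipcs))    # patents shared by each pair

    kept = {}
    for (upc, ipc), overlap in pair_counts.items():
        expected = upc_counts[upc] * ipc_counts[ipc] / n_patents
        # Discard the pair if it falls below either threshold.
        if overlap >= expected and overlap >= min_overlap:
            kept[(upc, ipc)] = overlap
    return kept
```

Only pairs with at least one shared patent ever enter `pair_counts`, so the threshold test runs over the non-zero pairs rather than over all possible combinations.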

[0106] The list of domains and the patents belonging to them may be used to run OLS regression. Again, this requires computing equipment with enough computation power and working memory to find the mean of the centrality values of the patents belonging to each domain for the prediction.
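A minimal sketch of these two steps is given below. It assumes a single-predictor ordinary least squares fit in which the predictor is the mean centrality of a domain and the response is an observed improvement rate for domains where one is known; the section does not state the exact regression specification or any transformation of the variables, so the function names and the form of the model are illustrative only.

```python
from collections import defaultdict


def mean_centrality_by_domain(patent_domain, patent_centrality):
    """Average the centrality values of the patents assigned to each domain."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for patent, domain in patent_domain.items():
        totals[domain] += patent_centrality[patent]
        counts[domain] += 1
    return {domain: totals[domain] / counts[domain] for domain in totals}


def fit_ols(xs, ys):
    """Single-predictor ordinary least squares; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(intercept, slope, x):
    """Apply the fitted line to a new mean-centrality value."""
    return intercept + slope * x
```

In use, one would fit the line on domains with known rates and then call `predict` with the mean centrality of every remaining domain.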

[0107] Once the list of all domains and their rates of improvements are found, one must implement the search system for the user. Since the motivation for the invention is to help policymakers make better decisions, it is important to make information available in a user-friendly manner. Due to the wide variety of technologies likely to be of interest to the various stakeholders as well as the success of the search paradigm for information retrieval, a technology search system was developed as the most suitable framework for presenting information to users to integrate into their decision-making workflow.

[0108] The inventive system may show an introduction and instructions and then may accept a query from a user, which may include multiple strings. Preferably, the system is capable of searching a very large database (4 GB) for matching entries in under one second. The system may perform calculations using results from the search and a smaller database (less than 1 MB). In one embodiment, the system outputs a table with about five entries based on results from a calculation step. In a preferred embodiment, each entry in the table is clickable and leads to detailed results with another table containing about 20 entries from a different database (less than 10 MB). The tables are preferably scrollable.

[0109] In one embodiment, the system comprises a custom dynamic web application capable of search and calculations. A static website is generally much easier to build but is limited in functionality and in the range of capabilities it makes available to the user. Python has become one of the most commonly used software programming languages. As such, to ensure future maintainability, adequate support and availability of technical expertise, a Python-based web framework is one suitable possibility for the web search system.

[0110] There are two major Python-based web frameworks: Flask and Django. Due to the requirement of dealing with a large database and Django's suitability for dealing with search and cloud deployment, Django may be selected in a preferred embodiment. In Django, there is a clear separation between the client-side and the server-side. The client-side primarily consists of the user interface. This is implemented using Django templates, which are written in HTML and/or CSS and are not much different from conventional website design. The server-side consists of data management as well as the operations on the data while meeting user queries. The key functional elements on the server side are the “model” and the “view”, in addition to the database, which is independent of the framework. Taken together, this represents a client-server architecture with model-view-template. These are explained in more detail in the following sections.

[0111] A user interface comprises a graphical user interface on the client side of the web framework. Since the invention is a new kind of tool, some instructions and hand-holding may be necessary for new users. As a user grows familiar with the interface and the capabilities of the tool, the instructions may go away. FIG. 5 shows one embodiment of the instruction interface. In one embodiment, entering the portal takes the user to the search page. In one embodiment, the search page is minimal and distraction-free and focuses on the search box as well as the search button. In order to help a new user, the search box may incorporate a technology term prompt as shown in FIG. 6.

[0112] FIG. 7 shows one embodiment of a results overview section. A results overview section may be primarily composed of a table summarizing the improvement rate of the top five matching technological domains. In one embodiment, a table lists the following key pieces of information for each technology domain:

[0113] Domain ID is a unique identifier for each technology domain and also the portmanteau of the parent UPC class and the parent IPC subclass of the domain. Clicking on the Domain ID takes the reader to the Top 20 patents for each domain, allowing them to qualitatively assess the quality of match.

[0114] Estimated Improvement Rate (p.a) is the annual improvement rate of the technological domain in percent. To help the user interpret the results, it is noted that a rate above 42% means the technology is improving faster than integrated chips, made famous by Moore's law (1965).

[0115] Domain size is the number of patents in the technology domain, and Patents Matched is the number of patents in that domain which contain the keyword the user searched for.

[0116] MPR is a quantitative measure of relevance as described above.
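To help interpret the Estimated Improvement Rate column, the following sketch converts an annual rate into an approximate doubling time. It assumes the reported rate is a compound annual percentage, which this section does not state explicitly; under a continuous (exponential) rate convention the conversion would differ. The function name `doubling_time_years` is hypothetical.

```python
import math


def doubling_time_years(annual_rate_percent):
    """Years for performance to double, assuming compound annual improvement."""
    return math.log(2) / math.log(1 + annual_rate_percent / 100)
```

Under this assumption, a rate of 42% per year corresponds to performance doubling roughly every two years.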

[0117] FIG. 8 shows one embodiment of a detailed results page with the top 20 patents from domain 60F02B. Clicking on a Domain ID may take the user to the top 20 patents from each domain. In one possible embodiment, the detailed results could be displayed within the results overview section, but that approach might overload the results page with information and create too much cognitive overhead for a user. In a preferred embodiment, results are hidden away and linked to the main table using the Domain ID. In one embodiment, patents are ranked by their average centrality using the centrality measure described above. For each patent, the patent number, title, abstract and the name of the assignee may be provided.

[0118] FIG. 9 shows a feedback section. A feedback section may follow the overview section in the results page to be accessed by scrolling down. A key goal of the feedback section is to leverage the advantages of an online system by improving the system with user feedback. By accepting feedback, a deeper connection between popular technology terminology and the 1,757 domains can be developed. In one embodiment, the feedback section is minimal with two questions assessing accuracy and preciseness, two blanks with prompt for user input and legend for calibrating the score.

[0119] The system may also comprise a backend implementation. Backend functions may be implemented with a web-framework (such as Flask or JavaScript plus NodeJS) using a similar architecture (client-server with model-view-template) or with a completely different architecture (such as a serverless architecture).

[0120] In one embodiment of database and full text search implementation, the patent title and abstract are combined into a single text field and stored along with the patent number and the corresponding domain ID. The choice of database is independent of the web framework used and Django is flexible enough to deal with different kinds of databases. However, the size of the database (˜4 GB) and requirement of fast search required considerable optimization for full text search implementation.

[0121] For each given search term, the backend solution may search patent titles and abstracts across the entire dataset of valid US utility patents (with grant dates between Jan. 1, 1976 and Jun. 1, 2015) and return the list of patent numbers containing the term. Searching a database of that size without an index would require up to 300 seconds on a PC of normal configuration (dual core processor with enough RAM). A waiting period this long could be a serious impediment to user experience and would impair the interactive nature of the tool.

[0122] In a preferred embodiment, an inverted index on a relational database, such as a Generalized Inverted Index (GIN) on PostgreSQL (a relational database like MySQL), may be used. The idea behind using an index in this embodiment is the same as that of an index for a book. Indexing also incorporates tokenization, stemming and vectorization to cut down on the total number of words the system will have to look up. This approach necessitates an increase in the size of the database, as it has to store the index along with the data. However, since storage tends to be cheaper than processing power at the current time, this may be a tradeoff worth making. In this embodiment, the worst-case search time may be cut down to less than 2-3 seconds for large queries on the same configuration. For most queries in this embodiment, the search is instantaneous. By using a slightly larger configuration, the online system loads results faster than a Google search (although Google looks up a much larger index and provides a much greater diversity of results). This embodiment is typically able to load results in under 250 milliseconds, and in under 300 milliseconds for the largest queries, which was consistently faster than most Google queries in testing (measured using Chrome DevTools).
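The book-index analogy can be illustrated with a toy in-memory inverted index. This is a simplified stand-in, not the PostgreSQL implementation: the tokenizer and the crude plural-stripping "stemmer" are assumptions made for brevity, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and crudely stem trailing 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_index(documents):
    """Map each stemmed token to the set of patent numbers containing it."""
    index = defaultdict(set)
    for patent_number, text in documents.items():
        for token in tokenize(text):
            index[token].add(patent_number)
    return index


def search(index, query):
    """Return patent numbers whose text contains every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

A query then costs one dictionary lookup per query token plus a set intersection, instead of a scan over every stored title and abstract, which is the source of the speed-up at the price of storing the index.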

[0123] In one embodiment, the system uses the full text search described above and makes calculations on the results returned by the database. These calculations may be made within a Django “view”, an abstraction consisting of Python code which interacts with the database through the abstraction of a “model” and returns objects which can be rendered by the “template”. A model is a data object with certain attributes. For instance, each separate patent may be described by a model which contains the patent number, the combined text of the title and abstract, and the corresponding domain. In one embodiment, the “view” accepts a user query, formats it and sends it to the database. The results from the database are then processed using simple algorithms written in Python code. The retrieved patents are grouped by their corresponding technology domains by using the correspondence established before. The most representative domain for those patents is found by using a relevance ranking. The relevance ranking for the patent classes is accomplished by using the mean-precision-recall (MPR) value proposed by Benson and Magee (2013). This value was inspired by the ‘F1’ score that is common in information retrieval but uses the arithmetic mean (instead of the harmonic mean) of the precision and recall of a returned data set (Magdy and Jones, 2010). Finally, the “view” returns the top 5 most representative domains along with an estimate of the improvement rate for each domain as an HTTP object to the user side, and the table is displayed using the “template”. The view also returns the title and abstract of the top 20 patents from the most representative domains.

[0124] To ensure reliable performance, uninterrupted service and cost-effectiveness, the system may be deployed to an on-demand cloud computing platform such as Amazon Web Services (AWS). The advantages of AWS include ease of deployment and availability of detailed documentation as well as support in popular forums. AWS also works well with Django and automatically sets up a computing instance, a storage instance and security protocols for the web application. However, in this embodiment, the database needs to be set up separately and needs to communicate with the computing instance.

[0125] The search system may be implemented as a standalone program on a user's computer. This approach will require large amounts of memory to hold the entire patent corpus and computation power to perform the calculations for match quality. The user may be informed that the results are ready with a graphic prompt (including an animation) or an audio prompt using a speaker system.

[0126] Another method is to implement the search system on a remote machine or a remote server such as those provided by a cloud computing provider. The remote server may further hold the entire corpus in its working memory or implement a relational database (such as PostgreSQL) with inverted index on the remote machine.

[0127] In such a case the user may connect to the server through a dedicated application or a web browser on the user computer or any interactive device with input and output functions to provide the search query over the internet. The calculations will be performed on the remote server and only the results of the calculation will be returned to the user.

[0128] The user may be informed that the results are ready with a graphic prompt (including an animation) or an audio prompt using a speaker system. The user may also choose to provide their query using a microphone, and the voice query will be converted to a search term by using any commercially available natural language processing system. The results may also be provided to the user through an audio medium by a commercially available text-to-speech system.

[0129] A dedicated application or the browser interface may also be implemented on other specialized hardware such as collaboration screens or large interactive displays. This would allow teams of engineers and planners to quickly and iteratively search for alternative technological options for their products or project plans and ultimately build consensus.

[0130] Another potential embodiment uses mobile devices or smartphones to send user queries and receive results of the technology search system. Another embodiment is that of augmented reality goggles. These will particularly rely on audio-visual methods, as the user might be working with an existing prototype and may wish to search for a technological alternative when the current option is not likely to meet future specifications and requirements. For instance, an automotive designer inspecting a prototype of a powertrain system may look at the engine subsystem and might become concerned about the likelihood of meeting emission requirements. They could quickly search for and obtain the rate of improvement of the engine subsystem. Based on that, they can determine the likelihood of meeting future requirements given current performance. They may decide to swap it with a faster improving subsystem (such as fuel cells) because the current subsystem is likely to fail emission requirements. As such, they could quickly provide an audio search query and receive audible results while looking at the physical system.

[0131] One skilled in the art will appreciate further features and advantages of the disclosures based on the provided descriptions and embodiments. For example, the inventive methods and systems disclosed herein may be used with other datasets having similar attributes and relationships to the exemplary dataset. Accordingly, the inventions are not to be limited by what has been particularly shown and described. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
