Cash flow forecasting using a bottoms-up machine learning approach
11526859 · 2022-12-13
CPC classification: G06Q20/389, G06Q20/02, G06Q10/04 (all Physics)
International classification: G06Q20/10, G06Q40/00, G06Q10/04, G06Q20/02 (all Physics)
Abstract
A method and apparatus for improving the management of cash and liquidity of an organization utilizing a plurality of ledger accounts and a plurality of currency accounts is described. One improvement in the accuracy of the forecasts comes from the use of individual ledger accounts. The improvements optimize the interest earnings for the cash balances in each currency account and minimize the expenses related to funding the currency accounts. Machine learning techniques are incorporated to forecast payments, receipts, interest rates, and currency exchange rates; cash is then transferred, borrowed, or loaned to fund the payments and utilize available cash.
Claims
1. An improved cash management apparatus comprising: one or more payment rails connected to one or more banks; a special purpose server connected to the one or more payment rails; and a plurality of data storage facilities connected to the special purpose server, wherein the special purpose server is configured to retrieve a set of payment and receipt transactions from the plurality of data storage facilities for a given past date range, configured to separate the set of payment and receipt transactions by ledger accounts, configured to sort the payment and receipt transactions with proximate time frames, configured to perform an ARIMA analysis on the payment and receipt transactions in each ledger account of each currency account to create a model of expected inflows and outflows for each period for each ledger account for each currency account, configured to forecast a receipts forecast and a payments forecast for a future period, where the special purpose server subtracts the payments forecast from the receipts forecast and adds in a previous period cash balance to create a forecast cash balance time series for each currency account, configured to retrieve historical banking rate information and perform the ARIMA analysis on the historical banking rate information to create a forecast banking rate information time series, configured to form a distributed machine learning model using a machine learning algorithm that calculates an F-score and rule for each feature set in each silo and then combines the F-scores and the rules to create the distributed machine learning model, configured to execute the distributed machine learning model on multiple data silos on the forecast cash balance time series for each currency account and on the forecast banking rate information time series to determine a set of optimal cash transfers between each currency account and one or more sweep accounts, and configured to execute instructions to make payments and cash transfers.
2. The improved cash management apparatus of claim 1 wherein the set of payment and receipt transactions further includes transactions from multiple tenants from a bank.
3. The improved cash management apparatus of claim 1 wherein the future period is user-configurable.
4. The improved cash management apparatus of claim 1 wherein the payments forecast is modified to incorporate actual planned payments.
5. The improved cash management apparatus of claim 1 wherein the receipts forecast is modified to incorporate actual incoming receipts.
6. The improved cash management apparatus of claim 5 wherein the machine learning algorithm is K-means.
7. The improved cash management apparatus of claim 5 wherein the historical banking rate information is retrieved from the one or more banks over the one or more payment rails.
8. The improved cash management apparatus of claim 5 wherein the historical banking rate information includes interest rates, foreign exchange rates, and money transfer costs.
9. A method for managing cash in an organization, the method comprising: retrieving a set of payment and receipt transactions from a plurality of data storage facilities for a given past date range for a plurality of currency accounts with software on a special-purpose server that is connected to the plurality of data storage facilities; separating, with the software, the set of the payment and receipt transactions by ledger accounts; sorting, with the software, the set of the payment and receipt transactions with proximate time frames; performing, with the software, an ARIMA analysis on the set of payment and receipt transactions in each ledger account of each currency account; creating, with the software, a model of expected inflows and outflows for each period for each ledger account for each currency account; forecasting, with the software, a receipts forecast and a payments forecast for a future period; subtracting, with the software, the payments forecast from the receipts forecast and adding in a previous day cash balance, creating a forecast cash balance time series for each currency account; retrieving, with the software, historical banking rate information; performing, with the software, the ARIMA analysis on the historical banking rate information to create a forecast banking rate information time series; forming a distributed machine learning model using a machine learning algorithm that calculates an F-score and rule for each feature set in each of a plurality of silos and then combines the F-scores and the rules to create the distributed machine learning model; executing, using the software, the distributed machine learning model utilizing multiple data silos on the forecast cash balance time series for each currency account and on the forecast banking rate information time series to determine a set of optimal cash transfers between each currency account and one or more sweep accounts; and executing, by the software, instructions to make payments and cash transfers.
10. The method of claim 9 wherein the set of payment and receipt transactions further includes transactions from multiple tenants from a bank.
11. The method of claim 9 wherein the given past date range is user-configurable.
12. The method of claim 9 further comprising modifying the payments forecast by incorporating actual planned payments.
13. The method of claim 9 further comprising modifying the receipts forecast by incorporating actual planned receipts.
14. The method of claim 9 wherein the machine learning algorithm is Random Forest.
15. The method of claim 9 wherein the historical banking rate information is retrieved from the one or more banks.
16. The method of claim 9 wherein the historical banking rate information includes interest rates, foreign exchange rates, and money transfer costs.
Description
DETAILED DESCRIPTION
(23) The present inventions are now described in detail with reference to the drawings. In the drawings, each element with a reference number is similar to other elements with the same reference number independent of any letter designation following the reference number. In the text, a reference number with a specific letter designation following the reference number refers to the specific element with the number and letter designation and a reference number without a specific letter designation refers to all elements with the same reference number independent of any letter designation following the reference number in the drawings.
(24) Beginning with the overall architecture, the corporate back-office systems, such as an ERP 101a, a TMS 101b, and a DBMS 101c, connect to the PCM 110 through a secure channel 107.
(25) The secure channel 107 interfaces with the integration layer 113 of the PCM 110. This channel 107 provides access between the corporate facilities, through the integration layer 113, to the payments manager 111, the statements manager 112, the audit and security 114 module, and the cash manager 115.
(26) On the other side of the PCM 110, the integration layer 113 interfaces with the universal aggregator network gateway 120. The universal aggregator 120 provides access to a number of banks 121-126 using a number of different protocols (SWIFT, ACH, RTP, BACS, etc.) and networks (payment rails 504).
(27) The messages between the universal aggregator 120 and the integration layer 113 are made up of bank statements 118, account balances, interest rates, costs for wire transfers, and foreign exchange rates coming into the PCM 110. The integration layer 113 sends payment messages 119, as well as messages to transfer funds, loan funds, and borrow funds. In addition, message acknowledgements and other housekeeping information are exchanged.
(28) The universal aggregator 120 sends and receives information from the banks 121-126 through the various payment rails 504. A payment rail 504 is a secure network that moves transactions between banks, where the transactions are related to the electronic movement of money from payer to payee.
(29) The payments manager module 111 provides full visibility into the payments lifecycle from input (import or manual creation) to confirmation by the bank. PCM 110 allows importing transaction files from ERP 101a, TMS 101b, DBMS 101c, or any other backend application through a secure channel 107. The payments manager 111 then validates the payment, gathers approvals through an approval workflow, and prepares the payment for sending to the bank rails. However, before the payment is released, the payments go to the cash manager 115 to assure that funding is available.
(30) The audit and security module 114 enables granular control of entitlements or access rights, allowing access to be controlled based on functionality and data. The user will only see the functionality and data to which he has been granted access.
(31) The statements manager module 112 gives clients the ability to control and monitor proper receipt of statements and link these to accounts and business units in static data. The PCM module 110 will import all data contained in the bank statement (End of Day and Intraday), which allows clients to control balances and transactions. Statements from partner banks are displayed in exactly the same way, regardless of bank, country, currency, or format.
(32) PCM 110 can interface with a Sanction Screening module 116. This module 116 will check transactions against sanction lists maintained by various government entities that prohibit transactions with sanctioned entities. Transaction sanction screening 116 is performed after authorization and before sending the transaction over a network. Similarly, the risk and fraud management module 117 checks transactions for fraud by performing various analytics to prevent fraudulent payments.
(33) The PCM 110 cash manager 115 incorporates functions for managing liquidity 301, cash forecasting 302, and cash pooling 303.
(34) The top of the screen shows the balances of each currency account 201. Each tile shows one currency, the account balance in that currency, and a chart showing the amount available for payment and the pending payments. A warning triangle is shown when the payments exceed the balance in that currency. The user is given the option to convert funds into the currency or to directly fund the account.
(35) The PCM 110 supports these cash management functions through the account structures and processes described below.
(36) Users can configure hierarchical groups of accounts 311 for cash management purposes. A typical hierarchy consists of a lead account, with subsidiary accounts in the same currency. Multiple account hierarchies may be defined, and these hierarchies can then be used by the Liquidity View 301 and Cash Forecasts 302. New account hierarchies can be defined for Cash Pooling 303 purposes.
(37) Transactions, either passing through the payments process or captured directly (in the case of expected receipts) from back-office systems, are brought into the consolidated cash ledger 312. Here they are matched against confirmation messages, interim statements, and bank statements in order to identify the current “open” transactions yet to be reported on a bank statement.
(38) For accounts that are being actively managed through the PCM User Interface, daily Bank Statements 321 are loaded automatically through the file interface as a batch. These bank statements 321 are used to provide a closing position for the previous day. Additional information is derived from the debit and credit confirmations 322, the outbound payments 323, and the adjusting transactions 324. The closing balance acts as the opening balance for the current position. The bank statement is also reconciled against the current “open” transactions for the day, and matching transactions are closed.
(39) By combining and matching your organization's transactions and your partner bank transactions, PCM can create an accurate representation of the short-term cash forecast. The integration, statement processing, and transaction matching processes are automated background tasks and require no user intervention. The short-term cash forecast module displays the current balance +/− pending payments, transfers, and expected receipts for the coming days.
(40) PCM Forecasting 302 features include: the creation and administration of the user's hierarchical cash forecasting structures that are bank independent; standard cash forecasts based on a 5-day liquidity forecast; drill-down capability to display forecasted transactions against individual accounts; the addition of single or repetitive manual adjustments to the cash forecast (for repetitive adjustments, a background process scheduled on the daily run list will automatically add the recurring adjustment to the cash forecast); conversion of account balances into the account hierarchy currency (using the exchange rates stored as static reference data to estimate the conversion between currencies); and visual indicators against each account to show whether the expected statements or confirmations have been received.
(41) Ledger 312 adjustments can be made directly in PCM 110 to reflect business scenarios outside of the system. Standard adjustment details can be added (account, amount, date, repeat option, and start day of the month). Users can also search on ledger adjustments, view details of individual ledger adjustments, and export the results in the same manner as for payments.
(42) In conjunction with the short-term liquidity forecast 302, the liquidity management functionality may be enabled to provide visibility of cash positions, as it displays the current and available funds balance across all accounts in the selected account hierarchy. The liquidity management features include the creation and administration of the user's own hierarchical cash reporting structures that are bank independent; drill-down capability to display statement lines against each account along with intraday transactions; closed balance and available funds displays; and conversion of account balances into the account hierarchy currency (using the exchange rates stored as static reference data to estimate the conversion between currencies).
(43) The cash manager 115 can be used to implement cash pooling 303, either across a group or locally as required. Cash pool views can be created from a cash account hierarchy 311. To control the sweeping and funding process, the following parameters can be defined for each account in the cash pool: transfer action (sweep and fund, fund only, report only), minimum transfer, rounding, minimum balance, and target balance. Cash pool controls 313 define the operating rules of the cash pooling 303.
(44) In addition, the cash manager 115 provides the functionality of automatically funding the cash positions in various currencies and in various accounts (or it could provide suggestions to a human cash manager). This feature interprets the current cash forecast and the parameters set up for the accounts in the cash pool, and implements transactions to sweep surplus cash into the lead account for the group or to fund subsidiary accounts from the lead account in the group. This operation is repeated for each level in the account hierarchy. The transfers will be executed automatically (or created for authorization). Cross-currency cash pools can be configured, and these will use the exchange rates, stored as static reference data, to estimate conversions between currencies.
(46) Next, an autoregressive integrated moving average (ARIMA) 403 is run against the data. An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the “integrated” part of the model) can be applied one or more times to eliminate the non-stationarity. The ARIMA model is considered a machine learning technique within the field of artificial intelligence technology.
(47) In some embodiments, the ARIMA model is run, all outliers beyond one or two standard deviations are removed from the dataset, and the analysis is run again. Particularly with receipts and payments, occasionally a very large payment arrives for a large job, or a past-due client suddenly pays up. Such outliers can throw off the predictive values and are best removed from the data unless there is reason to believe the event is periodic.
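As a minimal sketch of this trimming step (Python and pandas are used here for illustration; the patent does not specify an implementation):

```python
import pandas as pd

def trim_outliers(amounts: pd.Series, n_std: float = 2.0) -> pd.Series:
    """Drop observations more than n_std standard deviations from the mean
    before refitting the ARIMA model."""
    mu, sigma = amounts.mean(), amounts.std()
    return amounts[(amounts - mu).abs() <= n_std * sigma]
```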
(48) Given a time series of data $X_t$, where $t$ is an integer index and the $X_t$ are real numbers, an ARIMA(p, d, q) model is given by

(49) $\left(1-\sum_{i=1}^{p}\alpha_i L^i\right)(1-L)^d X_t = \left(1+\sum_{i=1}^{q}\theta_i L^i\right)\varepsilon_t$

(50) where $L$ is the lag operator, the $\alpha_i$ are the parameters of the autoregressive part of the model, the $\theta_i$ are the parameters of the moving average part, and the $\varepsilon_t$ are error terms. The error terms $\varepsilon_t$ are generally assumed to be independent, identically distributed variables sampled from a normal distribution with zero mean. The parameter p is the order (number of time lags) of the autoregressive model, d is the degree of differencing (the number of times the data have had past values subtracted), and q is the order of the moving-average model.
(51) The ARIMA model 403 is run on the receipts and payments separately for each currency and each account, with each predicting the receipts and payments for each currency-account group. The ARIMA model 403 provides the time series X.sub.t for receipts and payments for each currency-account for the next 15 or 30 days. Treasury function in-flows or out-flows (funding loans or transfers or investments) are not included in this calculation.
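A minimal sketch of this per-account forecasting step, with statsmodels' ARIMA implementation standing in for the ARIMA model 403 (the order and variable names are illustrative assumptions, not from the patent):

```python
from statsmodels.tsa.arima.model import ARIMA

def forecast_ledger(series, order=(1, 1, 1), horizon=15):
    """Fit an ARIMA(p, d, q) model to one currency-account group's daily
    totals and forecast the next `horizon` days."""
    fit = ARIMA(series, order=order).fit()
    return fit.forecast(steps=horizon)

# Run separately for receipts and payments in each currency-account group:
# receipts_fc = forecast_ledger(receipts["USD/operating"])
# payments_fc = forecast_ledger(payments["USD/operating"])
```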
(52) Payments are typically paid 30 or so days after receipt, and the approval process takes 7-14 days, so the PCM 110 knows perhaps 15 days' worth of actual payments (other periods could be used for other embodiments). These actual payments 404 are factored into the future payments time series. In some cases, several days' worth of receipts may also be known in advance, depending on the characteristics of the bank account and payment rails. If known in advance, the receipts are also switched to actual values.
(53) Once the time series X.sub.i for receipts and the time series X.sub.i for payments are determined, the running balance is calculated by taking the previous balance in the currency account and subtracting the payments X.sub.i for the day and adding the receipts X.sub.i for the day to obtain the account balance for the end of the day (cash forecast 405). A fifteen-day forecast should be fine, but other durations could be used without deviating from these inventions. In some embodiments, this number is configurable. The results of the calculations are assembled into a cash flow time series for the future X.sub.i.
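The running-balance calculation reduces to the recurrence balance_t = balance_{t-1} + receipts_t − payments_t; a sketch (names are illustrative):

```python
def cash_balance_series(opening_balance, receipts_fc, payments_fc):
    """End-of-day balances for one currency account: the previous balance
    plus the day's forecast receipts minus the day's forecast payments."""
    balances, balance = [], opening_balance
    for r, p in zip(receipts_fc, payments_fc):
        balance = balance + r - p
        balances.append(balance)
    return balances
```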
(54) Next, data associated with interest rates for loans and investments, both current and historical, for each currency and account location, are collected 406, along with the costs for wire transfers and other forms of money transfers. The spread between sell and buy of the various currencies is also collected, both present and historical values, to determine the foreign exchange rate costs.
(55) The time series for exchange rates and interest rates are run through an ARIMA model to predict the interest rates and exchange rates for the coming fifteen days (or whatever window is being used for payments and receipts).
(56) Each currency account is analyzed 407 to maximize the value of the entire universe of currency accounts. First, critical cash needs are identified by reviewing each currency account for predicted negative balances for the current day. These accounts need funding to cover the payments due in the current day. For example, in one embodiment, based on previously recorded user behavior (foreign exchange, swapping, and funding), a machine learning algorithm proposes a number of operations for final validation: converting currencies to maintain an appropriate balance in the main currencies, funding accounts that are in overdraft, and moving cash to interest-earning accounts. Following the previous behavior, the system will also allocate cash per counterparty to mitigate risk exposure. Where negative interest applies to accounts in a specific currency, the system will propose emptying those accounts and swapping the funds into an account in a currency that generates positive interest.
(57) Regarding payment rejection, the machine learning/artificial intelligence algorithm will propose predictive and prescriptive analytics. For example, a payment that has been rejected before will be flagged and a correction to the data will be proposed; a payment's channel and route will also be analyzed against the cut-off time and the cost per channel and bank, and an alternative channel and route will be proposed to reduce the cost of payments.
(58) Regarding cash exposure and foreign exchange exposure, the system will be able to predict variation and encourage risk mitigation actions; for example (but not exhaustively), it may predict that the US dollar is going up against sterling and propose swap and forward FX operations to mitigate loss exposure.
(59) The entire universe of currency accounts is optimized using multivariate forecasting techniques. In one embodiment, the algorithm to optimize the sum of the balances of all of the currency accounts is a mathematical formula. For each day of the future time series, the currency accounts are searched for negative balances. For each account with a negative balance, the accounts with positive balances are searched for the currency with the lowest interest rate and the lowest cost to transfer to the currency with the shortfall. The transfer instructions are then recorded and if necessary, additional funds are sought. Then the funding for the next account with a shortfall is sought until all of the accounts have funding. Then the accounts with a positive balance are defunded to a sweep account in that currency.
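A minimal sketch of this greedy funding pass for a single forecast day (the data structures and the transfer_cost helper are illustrative assumptions, not from the patent):

```python
def plan_transfers(balances, transfer_cost, day):
    """Greedy one-day funding plan.

    balances: dict mapping currency account -> forecast balance
    transfer_cost(src, dst): all-in cost estimate (interest foregone,
    wire fees, FX spread) of moving funds from src to dst
    """
    transfers = []
    for short in [c for c, b in balances.items() if b < 0]:
        while balances[short] < 0:
            donors = [c for c, b in balances.items() if b > 0]
            if not donors:
                break  # no internal cash left; seek external funding instead
            donor = min(donors, key=lambda c: transfer_cost(c, short))
            amount = min(balances[donor], -balances[short])
            balances[donor] -= amount
            balances[short] += amount
            transfers.append((day, donor, short, amount))
    # sweep any remaining surpluses into per-currency sweep accounts
    for c in [c for c, b in balances.items() if b > 0]:
        transfers.append((day, c, f"{c}-sweep", balances[c]))
        balances[c] = 0.0
    return transfers
```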
(60) In another embodiment, a machine learning algorithm such as K-means, Random Forest, or DensiCube (see U.S. Pat. No. 9,489,627, issued to Jerzy Bala on Nov. 8, 2016, said patent incorporated herein in its entirety, and U.S. patent application Ser. No. 16/355,985, filed by Jerzy Bala and Paul Green on Mar. 18, 2019, said patent application incorporated herein in its entirety) is used to try various scenarios for cash transfers, borrowing, and funding to optimize the balance at the end of the future forecast period. The fields could be the currency, the cash, the interest in, the interest out, the transfer cost, and the foreign exchange rate to various currencies. A second dimension of the fields could be each day of the forecast period. And the attributes could be the currency accounts. Each combination of cash movements is calculated to determine a sum for the scenario (across the entire forecast period and all currency accounts), and the cash movements related to the best scenario are stored until a better scenario is found. Rather than using an F-score as a quality metric, the cash sum could be used as the quality metric. The best scenario at the end of the analysis is then saved as the cash movement plan. The output of the machine learning algorithm is a rule set that dictates the cash movements. In some embodiments, this rules engine is used as the cash movement plan for a single day; in other cases, the rules could be used as the cash management plan for a short period, perhaps a week. See the discussion below for more detailed information on the distributed DensiCube algorithm.
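As a hedged illustration of this scenario search, with the end-of-period cash sum standing in for the F-score as the quality metric (the candidate-move representation and the simulate function are assumptions, not the patent's algorithm):

```python
import random

def best_scenario(candidate_moves, simulate, n_trials=1000):
    """Sample cash-movement scenarios and keep the one whose simulated
    end-of-period cash sum (the quality metric) is highest.
    simulate(scenario) returns the sum of all currency-account balances
    across the forecast period under that set of moves."""
    best, best_sum = None, float("-inf")
    for _ in range(n_trials):
        scenario = random.sample(candidate_moves,
                                 k=random.randint(1, len(candidate_moves)))
        total = simulate(scenario)
        if total > best_sum:
            best, best_sum = scenario, total
    return best
```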
(61) In the operations of the machine learning module over a window of a longer period (perhaps a week, 15 days, or a month), the goal is to find opportunities to invest the money in longer-term, higher interest rate notes, such as a 15-day certificate of deposit, rather than solely investing in overnight sweep accounts. The same is true for borrowing money or for transferring cash between accounts: by looking at the longer term, better rates and lower transaction costs can be achieved.
(62) Once the funding plan is calculated, the cash manager 115 moves the money 408 according to the plan to the appropriate account, borrowing or investing money as specified in the plan. Once funding is in place, the payments for the day are paid 409 from the appropriate currency account.
(63) Because of the complexities of machine learning algorithms, special purpose computing may be needed to build and execute the machine learning model described herein.
(64) Looking at the creation of the ledger account models, a set of bank payment and receipt transactions is first retrieved and filtered 602 so that it reflects the industry of the company.
(65) The transactions are then sorted 604 into specific ledger accounts such as utilities, rent, payroll, and materials (in some embodiments, this sorting uses individual vendor names and in other embodiments, the ledger accounts are grouped). For each ledger account, the sorted transactions are loaded into a time series that is plotted in memory 605. In some embodiments, the transactions are grouped in a time frame 605, and the amounts within the time frame are summed. For instance, the time frame could be daily, weekly, or monthly. In the case of weekly, all transactions for the ledger account within each week are summed, and the time series uses the summed amount. The plotted time series is then processed through the ARIMA algorithm 606 as the model for that ledger account. The model predicts the expected value of transactions for a point in time.
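A minimal sketch of this sort-and-sum step with pandas (the DataFrame layout and column names are illustrative assumptions):

```python
import pandas as pd

# txns: DataFrame with columns ["date", "ledger_account", "amount"]
txns["date"] = pd.to_datetime(txns["date"])
weekly = (
    txns.set_index("date")
        .groupby("ledger_account")["amount"]
        .resample("W")   # weekly time frame; "D" or "M" also work
        .sum()           # sum all transactions within each week
)
# `weekly` holds one time series per ledger account, ready for ARIMA fitting.
```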
(66) In some embodiments, the data is segregated across a plurality of bank computers, perhaps from a number of different banks. In order to preserve the privacy of the data, the filtering 602, sorting 604, combining 605, and ARIMA algorithm 606 are performed on the bank (or a third party) computers and only the model is transferred to the company for use. See U.S. patent application Ser. No. 16/355,985, filed by Jerzy Bala and Paul Green on Mar. 18, 2019, for a further discussion of distributing machine learning algorithms.
(67) In addition, the company's transactions 101a, 101b, 101c are similarly sorted 611 into the same ledger account categories as used in the sorting 604 of the bank transactions. The sorted company's transactions are then combined into time frames 612 and processed using the ARIMA 613 algorithm to form a company-specific model for each ledger account.
(68) The company-specific ledger account models are then compared 621, on an account-by-account basis, to the ledger account models based on the bank transactions. If the two models are comparable, then the bank transaction model is used going forward. If the ARIMA formulas are the same but offset, then the constants in the model are adjusted 622 to those in the company-specific model. If the models are not comparable, then the company-specific model is used, or the user is requested to adjust the filtering so that the set of bank transactions more accurately reflects the industry of the company. The final set of ledger account models is then returned 623.
(69) In some embodiments, the bank transactions are not used, and the company-specific transactions are the only data used to create the model.
(70) The methodology of the preceding paragraphs produces a model for each ledger account; running those models over the forecast window and summing the results yields a cash flow chart for the company.
(71) This cash flow chart can then be inputted to the machine learning 407 processes described above.
(72) The following description outlines several possible embodiments to create models using distributed data. The Distributed DensiCube modeler and scorer described below extend the predictive analytics algorithms that are described in U.S. Pat. No. 9,489,627 to execute in distributed data environments and to provide quality analytics. The rule learning algorithm for DensiCube is briefly described below. But the DensiCube machine learning algorithm is only one embodiment of the inventions herein. Other machine learning algorithms could also be used.
(73) Rule Learning Algorithm
(74) The rule learning algorithm induces a set of rules. A rule itself is a conjunction of conditions, each for one attribute. A condition is a relational expression in the form:
A=V,
(75) where A is an attribute and V is a nominal value for a symbolic attribute or an interval for a numeric attribute. The rule induction algorithm allows for two important learning parameters 802: minimum recall and minimum precision. More specifically, rules generated by the algorithm must satisfy the minimum recall and minimum precision requirements 805 as set by these parameters 802. The algorithm repeats the process of learning a rule 803 for the target class and removing all target class examples covered by the rule 804 until no rule can be generated to satisfy the minimum recall and minimum precision requirements 805, as sketched below.
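A minimal sketch of that covering loop (learn_one_rule, is_target, covers, recall, and precision are hypothetical names standing in for the patent's rule learner):

```python
def learn_rules(examples, min_recall, min_precision):
    """Sequential covering: learn one rule at a time, remove the target-class
    examples it covers, and stop when no rule meets both quality floors."""
    rules, remaining = [], list(examples)
    while True:
        rule = learn_one_rule(remaining)  # hypothetical beam-search learner
        if (rule is None or rule.recall < min_recall
                or rule.precision < min_precision):
            break
        rules.append(rule)
        # remove the target-class examples covered by the new rule
        remaining = [e for e in remaining
                     if not (e.is_target and rule.covers(e))]
    return rules
```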
(76) In learning a rule, the algorithm starts with the most general rule covering the entire feature space and repeatedly specializes it by splitting along each attribute, scoring every candidate split, as illustrated in the following paragraphs.
(77) Looking at 911, 912, the rule 912 covers all of the positive and negative values, and rule 911 is empty. This rule set is then scored and compared to the base rule 901. The best rule is stored.
(78) Next, the algorithm increments the x-axis split between the rules, creating rule 931 and 932. The rules are scored and compared to the previous best rule.
(79) The process is repeated until all but one increment on the x-axis is left. These rules 941, 942 are then scored, compared, and stored if the score is better.
(80) Once the x-axis has been searched, the best rules are then split on the y-axis (for example, 951, 952) to find the best overall rule. This process may be repeated for as many axes as found in the data.
(81) In the Distributed DensiCube algorithm, these rule-search functions are performed separately on each data silo.
(83) In the Distributed DensiCube algorithm, the entire rule-learning process described above is performed on each silo, and only the resulting rules and their quality measures are shared for combination.
(84) Rule evaluation is considered next.
(85) Every rule induction algorithm uses a metric to evaluate or rank the rules that it generates. Most rule induction algorithms use accuracy as the metric. However, accuracy is not a good metric for imbalanced data sets. The algorithm uses an F-measure as the evaluation metric. It selects the rule with the largest F-measure score. F-measure is widely used in information retrieval and in some machine learning algorithms. The two components of F-measure are recall and precision. The recall of a target class rule is the ratio of the number of target class examples covered by the rule to the total number of target class examples. The precision of a target class (i.e., misstatement class) rule is the ratio of the number of target class examples covered by the rule to the total number of examples (from both the target and non-target classes) covered by that rule. F-measure of a rule r is defined as:
(86) $F_\beta(r) = \dfrac{(1+\beta^2)\cdot\text{precision}(r)\cdot\text{recall}(r)}{\beta^2\cdot\text{precision}(r)+\text{recall}(r)}$
where β is the weight. When β is set to 1, recall and precision are weighted equally. F-measure favors recall with β>1 and favors precision with β<1. F-measure can be used to compare the performances of two different models/rules. A model/rule with a larger F-measure is better than a model/rule with a smaller F-measure.
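As a quick sketch, the measure above can be computed directly (a direct transcription of the formula; the function and argument names are ours):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.
    beta > 1 favors recall; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```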
Prototype Generation Algorithm for Ranking with Rules
(87) The algorithms incorporate a method, called prototype generation, to facilitate ranking with rules. For each rule generated by the rule learning algorithm, two prototypes are created. In generating prototypes, the software ignores symbolic conditions, because examples covered by a rule share the same symbolic values. Given a rule R with m numeric conditions A.sub.R1=V.sub.R1 ∧ A.sub.R2=V.sub.R2 ∧ . . . ∧ A.sub.Rm=V.sub.Rm, where A.sub.Ri is a numeric attribute and V.sub.Ri is a range of numeric values, the positive prototype of R is P(R)=(p.sub.R1, p.sub.R2, . . . , p.sub.Rm) and the negative prototype of R is N(R)=(n.sub.R1, n.sub.R2, . . . , n.sub.Rm), where both p.sub.Ri∈V.sub.Ri and n.sub.Ri∈V.sub.Ri. p.sub.Ri and n.sub.Ri are computed using the following formulas:
(88) $p_{Ri} = \dfrac{1}{|R(\text{POS})|}\sum_{e\in R(\text{POS})} e_{Ri}, \qquad n_{Ri} = \dfrac{1}{|R(\text{NEG})|}\sum_{e\in R(\text{NEG})} e_{Ri}$
(89) where R(POS) and R(NEG) are the sets of positive and negative examples covered by R respectively, e=(e.sub.R1, e.sub.R2, . . . , e.sub.Rm) is an example, and e.sub.Ri∈V.sub.Ri for i=1, m, because e is covered by R.
(90) Given a positive prototype P(R)=(p.sub.R1, p.sub.R2, . . . , p.sub.Rm) and a negative prototype N(R)=(n.sub.R1, n.sub.R2, . . . , n.sub.Rm) of rule R, the score of an example e=(e.sub.R1, e.sub.R2, . . . , e.sub.Rm) is 0 if e is not covered by R. Otherwise, e receives a score between 0 and 1 computed using the following formula:
(91) $\text{score}(e,R) = \dfrac{1}{2}\left(1 + \dfrac{\sum_{i=1}^{m} w_{Ri}\,d_{Ri}(e)}{\sum_{i=1}^{m} w_{Ri}}\right), \qquad d_{Ri}(e) = \dfrac{|e_{Ri}-n_{Ri}| - |e_{Ri}-p_{Ri}|}{|p_{Ri}-n_{Ri}|}$
(92) where w.sub.Ri is the weight of the Ri.sup.th attribute of R. The value of
(93) $d_{Ri}(e) = \dfrac{|e_{Ri}-n_{Ri}| - |e_{Ri}-p_{Ri}|}{|p_{Ri}-n_{Ri}|}$
is between −1 and 1. When e.sub.Ri>n.sub.Ri>p.sub.Ri or p.sub.Ri>n.sub.Ri>e.sub.Ri, it is −1. When e.sub.Ri>p.sub.Ri>n.sub.Ri or n.sub.Ri>p.sub.Ri>e.sub.Ri, it is 1. When e.sub.Ri is closer to n.sub.Ri than p.sub.Ri, it takes a value between −1 and 0. When e.sub.Ri is closer to p.sub.Ri than n.sub.Ri, it takes a value between 0 and 1. The value of score(e, R) is normalized to the range of 0 and 1. If p.sub.Ri=n.sub.Ri, then
(94) $d_{Ri}(e)$
is set to 0.
(95) w.sub.Ri is computed using the following formula.
(96) $w_{Ri} = \dfrac{|p_{Ri}-n_{Ri}|}{\max_{Ri}-\min_{Ri}}$
(97) where max.sub.Ri and min.sub.Ri are the maximum and minimum values of the Ri.sup.th attribute of R, respectively. A large difference between p.sub.Ri and n.sub.Ri implies that the values of positive examples are very different from the values of negative examples on the Ri.sup.th attribute, so the attribute should distinguish positive examples from negative ones well.
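A minimal sketch of prototype generation and scoring, following the reconstructed formulas above (the exact normalization in the original is not fully recoverable, so this is an assumption; array names are illustrative):

```python
import numpy as np

def prototypes(pos, neg):
    """Positive/negative prototypes: per-attribute means over the positive
    and negative examples covered by the rule (rows = examples)."""
    return pos.mean(axis=0), neg.mean(axis=0)

def score(e, p, n, attr_max, attr_min):
    """Score one covered example e against prototypes p and n."""
    w = np.abs(p - n) / (attr_max - attr_min)      # attribute weights
    with np.errstate(divide="ignore", invalid="ignore"):
        d = (np.abs(e - n) - np.abs(e - p)) / np.abs(p - n)
    d = np.where(p == n, 0.0, d)                   # convention when p == n
    return 0.5 * (1.0 + np.dot(w, d) / w.sum())    # normalized to [0, 1]
```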
(98) Scoring Using Rules
(99) A rule induction algorithm usually generates a set of overlapping rules. Two methods, Max and Probabilistic Sum, for combining example scores of multiple rules are used by the software. Both methods have been used in rule-based expert systems. The Max approach simply takes the largest score of all rules. Given an example e and a set of n rules R={R.sub.1, . . . , R.sub.n}, the combined score of e using Max is computed as follows:
score(e,R)=max.sub.i=1.sup.n{precision(R.sub.i)×score(e,R.sub.i)}
where precision(R.sub.i) is the precision of R.sub.i. There are two ways to determine score(e, R.sub.i) for a hybrid rule. The first way returns the score of e received from rule R.sub.i for all e's. The second way returns the score of e received from R.sub.i only if the score is larger than or equal to the threshold of R.sub.i; otherwise the score is 0. For a normal rule,
(100)
(101) For the probabilistic sum method, the formula can be defined recursively as follows.
score(e,{R.sub.1})=score(e,R.sub.1)
score(e,{R.sub.1,R.sub.2})=score(e,R.sub.1)+score(e,R.sub.2)−score(e,R.sub.1)×score(e,R.sub.2)
score(e,{R.sub.1, . . . ,R.sub.n})=score(e,{R.sub.1, . . . ,R.sub.n-1})+score(e,R.sub.n)−score(e,{R.sub.1, . . . ,R.sub.n-1})×score(e,R.sub.n)
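A minimal sketch of both combination methods (plain Python; the names are illustrative):

```python
def combine_max(scores, precisions):
    """Max combination: the largest precision-weighted score over all rules."""
    return max(p * s for p, s in zip(precisions, scores))

def combine_prob_sum(scores):
    """Probabilistic sum, folded pairwise: s1 + s2 - s1*s2, and so on."""
    total = 0.0
    for s in scores:
        total = total + s - total * s
    return total
```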
Hardware Architecture
(102) Turning to the hardware, the special purpose computing described herein runs on a set of servers 503, 505, 507, each analyzing local data held in its own database 509, 506, 508.
(103) Distributed DensiCube
(104) By allowing for distributed execution, the Distributed DensiCube algorithm provides a number of important benefits. First of all, privacy of the data assets in the model generation and prediction modes of operation is preserved by keeping the data in its original location and limiting access to the specific data. Second, the cost of implementing complex ETL processes and data warehousing in general is reduced by eliminating the costs of transmission to and storage in a central location. Third, these inventions increase performance by allowing parallel execution of the DensiCube algorithm (i.e., executing the predictive analytics algorithms on distributed computing platforms). In addition, this distributed algorithm provides the capability for the Distributed DensiCube algorithm to provide unsupervised learning (e.g., fraud detection from distributed data sources). Finally, it allows predictive analytics solutions to operate and react in real time on a low-level transactional streaming data representation without requiring data aggregation.
(105) The Distributed DensiCube approach represents a paradigm shift from the currently predominant data-centric approaches to predictive analytics, i.e., approaches that transform, integrate, and push data from distributed silos to predictive analytics agents, to future decision-centric (predictive analytics bot agent based) approaches, i.e., approaches that push predictive analytics agents to the data locations and, by collaborating, support decision-making in distributed data environments.
(106) Essentially, the distributed DensiCube algorithm runs the DensiCube algorithm on each server 503, 505, 507, analyzing the local data in its database 509, 506, 508. The best rule or best set of rules 1105 from each server 503, 505, 507 is then combined into the best overall rule. In some embodiments, several servers could work together to derive a best rule, which is then combined with that of another server.
(107) Collaborating predictive analytics bot agents can facilitate numerous opportunities for enterprise data warehousing to provide faster, more predictive, more prescriptive, and time and cost saving decision-making solutions for their customers.
(108) Distributed DensiCube Concept of Operation
(109) The following sections describe the concept behind the Distributed DensiCube approach. As mentioned in the previous section, the Distributed DensiCube solution continues to use the same modeling algorithms as the current non-distributed predictive analytics solution (with modifications to the scoring algorithms to support privacy by preserving the data assets in silos).
(110) 1.1 Distributed Modeling
(111) The Distributed DensiCube operates on distributed entities at different logical and/or physical locations.
(112) The distributed entity represents a unified virtual feature vector describing an event (e.g., financial transaction, customer campaign information). Feature subsets 2704, 2705 of this representation are registered/linked by a common identifier (e.g., transaction ID, Enrolment Code, Invoice ID, etc.) 2707. Thus, distributed data 2701 represents a virtual table 2706 of feature subsets 2704, 2705 joined by their common identifier 2707.
(114) As an example of the distributed DensiCube algorithm, consider three data silos: a bank database 802, a credit agency database 803, and a registry of deeds database 804.
(115) The credit agency database 803 contains three fields: the ID (SSN), the Credit Score, and the Total Debt fields. The registry of deeds database 804 also has three fields in this example: the ID (SSN), a home ownership field, and a home value field. In our example, there are a number of reasons that the data in the credit agency 803 needs to be kept separate from the registry data 804, and both of those datasets need to be kept separate from the bank data 802. As a result, the DensiCube algorithm is run three times, once on each of the databases 802, 803, 804. In another embodiment, two of the servers could be combined, with the algorithm running on one of the servers.
(116) As seen in the architecture, a modeler 1003 coordinates with feature managers 1004 that reside with the data silos 1002.
(117) All the above components collaborate to generate models and use them for scoring, and at the same time, preserve the privacy of the data silos 1002. There are three levels of privacy that are possible in this set of inventions. The first level could preserve the data in the silos, providing privacy only for the individual data records. A second embodiment preserves the attributes of the data in the silos, preventing the model from knowing the attributes. The second embodiment may also hide the features (names of attributes) by instead returning a pseudonym for the features. In the third embodiment, the features themselves are kept hidden in the silos. For example, in the first level, the fact that the range of the credit scores is between 575 and 829 is reported back to the modeler 1003, but the individual records are kept hidden. In the second embodiment, the modeler 1003 is told that credit scores are used, but the range is kept hidden on the data silo 1002. In the third embodiment, the credit score feature itself is kept hidden from the modeler 1003. In this third embodiment, the model itself is distributed on each data silo, and the core modeler 1003 has no knowledge of the rules used on each data silo 1002.
(118) The collaboration between distributed components results in a set of rules generated through a rule-based induction algorithm. The DensiCube induction algorithm, in an iterative fashion, determines the data partitions based on feature rules in a syntactic representation (e.g., if feature F>20 and F<25). It dichotomizes (splits) the data into partitions. Each partition is evaluated by computing statistical quality measures. Specifically, DensiCube uses an F-score measure to compute the predictive quality of a specific partition. In binary classification, the F-score measure is a measure of a test's accuracy and is defined as the weighted harmonic mean of the test's precision and recall. Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of instances.
(119) Specifically, the following steps are executed by Distributed DensiCube:
(120) 1) The modelers 1003 invoke feature managers 1004 that subsequently start data partitioning based on the local set of features at the data silo 1002. This process is called specialization.
(121) 2) Feature managers 1004 push their computed partitions (i.e., using the data identifier as the partition identifier) and their corresponding evaluation measures (e.g., F-score) to modelers 1003.
(122) 3) Each feature model manager 1008 compares evaluation measures of the sent partitions and selects the top N best partitions (i.e., it establishes the global beam search for the top-performing partitions and their combinations).
(123) 4) Subsequently, the modeler 1003 proceeds to the process of generating partition combinations. The first iteration of such combinations syntactically represents two-conditional rules (i.e., a partition is represented by a joint of lower and upper bounds of two features). Once this process is completed, the identifiers of the two-conditional rules are sent to the feature managers 1004. Once received, feature managers 1004 evaluate the new partitions identified by the identifiers by executing the next iteration of specialization.
(124) A data manager 1012 is a logical construct which is comprised of a data orchestrator 1005 and one or more feature data managers 1006, which cooperate to manage data sets. Data sets can be used to create models and/or to make predictions using models. A data orchestrator 1005 is a component which provides services to maintain Data Sets, is identified by its host domain and port, and has a name which is not necessarily unique. A feature data manager 1006 is a component which provides services to maintain Feature Data Sets 1203, is identified by its host domain and port, and has a name which is not necessarily unique. A data set lives in a data orchestrator 1005, has a unique ID within the data orchestrator 1005, consists of a junction of Feature Data Sets 1203, joins Feature Data Sets 1203 on specified unique features, and is virtual tabular data.
(125) A model manager 1013 is a logical construct which is comprised of a model orchestrator 1007 and one or more feature model managers 1008, which cooperate to generate models.
(126) A prediction manager 1014 is a logical construct which is comprised of a prediction orchestrator 1010 and one or more feature prediction managers 1011, which cooperate to create scores and statistics (a.k.a. predictions).
(127) 1.2 Distributed Scoring
(128) The distributed scoring process is accomplished in two steps. First, partial scores are calculated on each feature manager 1004 on each server. Then, complete scores are calculated from the partial scores.
(129) The combined score is the sum of the partial scores from each server, divided by the sum of the weights from each server and multiplied by two:

(130) $\text{score}(e) = \dfrac{2\left(\text{score}_A(e) + \text{score}_B(e)\right)}{w_A + w_B}$

(131) In this formula, the scores for servers A and B are computed as in the DensiCube scoring described above:

(132) $\text{score}_A(e) = \sum_{Ri \in A} w_{Ri}\, d_{Ri}(e)$

(133) The weights are also determined for each location, as above:

(134) $w_A = \sum_{Ri \in A} w_{Ri}$
(135) With the combined score, we have a metric to show the validity of the selected model.
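A minimal sketch of the combination step, following the reconstruction above (the exact per-silo quantities are an assumption, reconstructed from the prose):

```python
def combined_score(partial_scores, partial_weights):
    """Combine per-silo partial scores: twice the sum of the partial scores
    divided by the sum of the per-silo weights."""
    return 2.0 * sum(partial_scores) / sum(partial_weights)
```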
(136) 2.0 Initial Architectural Concept of Operation and Requirements
(137) 2.1 Feature Manager 1004
(138) At the initialization of the machine learning model generation process, each feature manager 1004 is set up on the local servers 1002. Each feature manager 1004 must be uniquely named (e.g., within the subnet where it lives). The port number where the feature manager 1004 can be reached needs to be defined. Access control needs to be configured, with a certificate for the feature manager 1004 installed and the public key for each modeler 1003 and feature prediction manager 1011 installed to allow access to this feature manager 1004. Each local feature manager 1004 needs to broadcast the name, host, port, and public key of the feature manager 1004. In some embodiments, the feature manager 1004 needs to listen to other broadcasts to verify uniqueness.
(139) Next, the data sources are defined.
(140) Each Data Source shall be described by a name for the data source and a plurality of columns, where each column has a name, a data type, and a uniqueness field. Data Sources can be used by feature model managers 1008 or feature prediction managers 1011 or both. Data Sources are probably defined by calls from a modeler 1003.
(141) The next step involves defining the Data Set Templates. A Data Set Template is a specification of how to join Data Sources defined within a feature data manager 1006. Each Data Set Template must be uniquely identified by name within a feature data manager 1006. A Data Set Template is a definition of Columns without regard to the Rows in each Data Source. For example, a Data Set Template could be represented by a SQL select statement with columns and join conditions, but without a where clause to limit rows. Data Set Templates can be used by feature model managers 1008 or feature prediction managers 1011 or both. Data Set Templates are probably defined by calls from a feature model manager 1008.
(142) Once the Data Set Templates are setup, the next step is to define the Data Sets. A Data Set is tabular data which is a subset of a data from the Data Sources defined within a feature data manager 1006. Each Data Set must be uniquely identified by name within a feature data manager 1006. A Data Set is defined by a Data Set Template to define the columns and a set of filters to define the rows. For example, the filter could be the where clause in a SQL statement. Data Sets can be used by modelers 1003 or feature prediction managers 1011 or both. Data Sets are probably defined by calls from a modeler 1003.
(143) 2.2 Modeler 1003
(144) In the modeler 1003, a model orchestrator 1007 works with one or more feature model managers 1008 to generate models.
(145) In the setup of the model orchestrator 1007, each modeler 1003 should be uniquely named, at least within the subnet where it lives. However, in some embodiments, the uniqueness may not be enforceable. Next the access control is configured by installing a certificate for the modeler 1003 and installing the public key for each feature manager 1004 containing pertinent data. The public key for each feature prediction manager 1011 is also installed, to which this modeler 1003 can publish.
(146) Once set up, the model orchestrator 1007 establishes a connection to each feature model manager 1008.
(147) Then the Model Data Set templates are defined. A Model Data Set Template is a conjunction of Data Set Templates from feature data managers 1006. Each Data Set Template must be uniquely named within the feature manager 1004. The Data Set Templates on feature data managers 1006 are defined, as are the join conditions. A join condition is an equality expression between unique columns on two Data Sets. For example <Feature Manager A>.<Data Set Template 1>.<Column a>==<Feature Manager B>.<Data Set Template 2>.<Column b>. Each data set participating in the model data set must be joined such that a singular virtual tabular data set is defined.
(148) After the templates are defined, the model data sets themselves are defined. A Model Data Set is a conjunction of Data Sets from feature data managers 1006. The Model Data Set is a row filter applied to a Model Data Set Template. Each Data Set must be uniquely named within a Model Data Set Template. Then the data sets on the feature data managers 1006 are defined. This filters the rows.
(149) Next, the Modeling Parameters are defined. Modeling Parameters define how a Model is created on any Model Data Set which is derived from a Model Data Set Template. Each Modeling Parameters definition must be unique within a Model Data Set Template.
(150) Then, a model is created and published. A model is created by applying Modeling Parameters to a Model Data Set. Each Model must be uniquely identified by name within a Model Data Set. A Model can be published to a feature prediction manager 1011. Publishing will persist the Model artifacts in the feature model managers 1008 and feature prediction managers 1011. Following are some of the artifacts which will be persisted to the feature model manager 1008 and/or feature prediction manager 1011: data set templates, model data set templates, and the model.
(151) 2.3 Prediction Orchestrator 1010
(152) The prediction orchestrator 1010 setup begins with the configuration of the access control. This is done by installing a certificate for the feature prediction manager 1011 and installing the public key for each modeler 1003 allowed to access this prediction orchestrator 1010. The public key for each feature manager 1004 containing pertinent data is also installed. Each prediction orchestrator 1010 should be uniquely named, but in some embodiments this may not be enforced.
(153) Next, a connection to each feature prediction manager 1011 is established and to a model orchestrator 1007. The model orchestrator 1007 will publish the Model Data Set Template and Model to the prediction orchestrator 1010.
(154) The scoring data sets are then defined. A Scoring Data Set is a conjunction of Data Sets from the feature data managers 1006. It is a row filter applied to a Model Data Set Template. Each Data Set must be uniquely named within a Model Data Set Template. The data sets on the feature data managers 1006 are defined (this filters the rows).
(155) Then the Scoring Parameters are defined. Scoring Parameters define how Scores are calculated on any Score Data Set which is derived from a Model Data Set Template. Each Scoring Parameters definition must be unique within a Model Data Set Template.
(156) Finally, a Scoring Data Set is defined. Partial Scores are calculated on each feature manager 1004 in the feature prediction manager 1011.
(157) Looking to the scoring flow, the prediction orchestrator 1010 first sends the model, the list of record IDs, and the features to each of the data silos 1002.
(158) The feature managers 1004 on each of the data silos 1002 then initialize the site 1311, 1321, 1331. The data on the silo 1002 is then sliced, using the list of IDs and the features 1312, 1322, 1332, into a data set of interest by the feature data manager 1006. The DensiCube algorithm 1313, 1323, 1333 is then run by the feature model manager 1008 on the data of interest.
(159) The rules, in some embodiments, are then returned to the prediction orchestrator 1010, where they are combined into an overall rule 1304.
(160) The scoring algorithms are modified to support privacy preservation in the data silos.
(161) The above description of the embodiments, alternative embodiments, and specific examples are given by way of illustration and should not be viewed as limiting. Further, many changes and modifications within the scope of the present embodiments may be made without departing from the spirit thereof, and the present inventions include such changes and modifications.