Distributed Data Storage System and Method
20200150885 · 2020-05-14
Inventors
CPC classification
H04L63/0428
ELECTRICITY
H04L9/3239
ELECTRICITY
G06F3/0644
PHYSICS
G06F3/0635
PHYSICS
G06F21/6218
PHYSICS
G06F21/64
PHYSICS
H04L9/3218
ELECTRICITY
H04L63/0414
ELECTRICITY
G06F3/067
PHYSICS
H04L2209/56
ELECTRICITY
H04L63/18
ELECTRICITY
H04L67/1097
ELECTRICITY
International classification
Abstract
A distributed data storage system and method are disclosed. The system comprises a data router and a rules engine. The rules engine comprises a data repository encoding a plurality of data storage rules, each rule specifying an applicable attribute and a data storage outcome, the data storage outcome being selected from a set including a data processing action to be applied to data prior to storage and a designation of storage location. The data router includes an input interface, an output interface and a processor, the data router being configured to receive a data storage request, including data to be stored, via the input interface, determine from the rules engine applicable attributes corresponding to attributes of the data storage request and retrieve any associated data storage outcomes, the processor of the data router being configured, in dependence on any retrieved data storage outcomes, to divide the data into a plurality of fragments and to cause, via the output interface, storage of the data fragments whereby at least selected ones of the fragments are stored in different data stores.
Claims
1. A distributed data storage system comprising a data router and a rules engine, the rules engine comprising a data repository encoding a plurality of data storage rules, each rule specifying an applicable attribute and a data storage outcome, the data storage outcome being selected from a set including a data processing action to be applied to data prior to storage and a designation of storage location; the data router including an input interface, an output interface and a processor, the data router being configured to receive a data storage request, including data to be stored, via the input interface, determine from the rules engine applicable attributes corresponding to attributes of the data storage request and retrieve any associated data storage outcomes, the processor of the data router being configured, in dependence on any retrieved data storage outcomes, to divide the data into a plurality of fragments and to cause, via the output interface, storage of the data fragments whereby at least selected ones of the fragments are stored in different data stores.
2. The distributed data storage system of claim 1, wherein the designation of storage location includes a designation of physical location of the data store.
3. The distributed data storage system of claim 1, further comprising a data repository storing a network address and a physical location of each data store, the processor being configured to select a data store for a fragment in dependence on the storage outcome and on the data in the data repository and cause storage of the data fragment using the network address for the respective data store.
4. The distributed data storage system of claim 1, wherein the data processing actions include one or more actions on how to divide the data to be stored into the plurality of fragments.
5. The distributed data storage system of claim 4, wherein one of the actions comprises fragmenting a salt password or other secret in or associated with the data to be stored whereby it is stored separately to the remaining data.
6. The distributed data storage system of claim 1, further comprising a data sieve configured to process the fragmented data and to remove characters of a predetermined frequency, sequence or type prior to storage in the data store.
7. The distributed data storage system of claim 1, wherein the attributes include: party requesting storage of data, party owning data, party that is the subject of the data and a party requesting data.
8. The distributed data storage system of claim 7, wherein the system is configured to record data on the attributes and on the stored data in a data store.
9. The distributed data storage system of claim 8, wherein the data router is configured to receive a data retrieval request for stored data at the input interface, the processor being configured, responsive to the data retrieval request to determine from the rules engine, applicable attributes corresponding to attributes of the data retrieval request and to applicable attributes on the stored data and retrieve any associated data storage outcomes, upon the requester satisfying requirements of the data storage outcomes, the processor being configured to retrieve the data fragments from the data stores and reconstruct the data.
10. A method for distributed storage of data comprising: storing, in a data repository a plurality of data storage rules, each rule specifying an applicable attribute and a data storage outcome, the data storage outcome being selected from a set including a data processing action to be applied to data prior to storage and a designation of storage location; receiving a data storage request, including data to be stored; determining from data storage rules applicable attributes corresponding to attributes of the data storage request; retrieving any data storage outcomes associated with applicable data storage rules; and, in dependence on the data storage outcomes, dividing the data into a plurality of fragments and causing storage of the data fragments whereby at least selected ones of the fragments are stored in different data stores.
11. The method of claim 10, wherein the designation of storage location includes a designation of physical location of the data store, the method further comprising storing a network address and a physical location of each data store, selecting a data store for a fragment in dependence on the storage outcome and on the stored physical locations and causing storage of the data fragment using the network address for the respective data store.
12. The method of claim 10, wherein the data processing actions include one or more actions to fragment a salt password, or other secret in or associated with the data to be stored, whereby it is stored separately to the remaining data.
13. The method of claim 10, further comprising sieving the fragmented data to remove characters of a predetermined frequency, sequence or type prior to storage in the data store.
14. The method of claim 10, further comprising recording data on the attributes and on the stored data in a data store.
15. The method of claim 10 further comprising: receiving a data retrieval request for stored data; determining from the data repository applicable attributes corresponding to attributes of the data retrieval request and to applicable attributes on the stored data; retrieving any associated data storage outcomes; determining if the requester satisfies requirements of the data storage outcomes; if the requester satisfies the requirements, retrieving the data fragments from the data stores and reconstructing the data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0049] Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:
[0050]
[0051]
[0052]
DETAILED DESCRIPTION
[0053]
[0054] The distributed data storage system includes a data router 10, a rules engine 20 and a plurality of data storage systems 30, 40.
[0055] The data router 10 communicates with the rules engine 20 to determine the rules applicable to a data storage or retrieval request. It then either applies the applicable rules to the data to be stored, or validates the retrieval request in dependence on the applicable rules and the requester.
[0056] In the case of data storage, the data router receives the data to be stored, uses the rules to determine how to fragment the data based on attributes such as those discussed above (data subject, content of data, data storage location, data owner, policies designed for company, country, profession etc.) and then fragments the data before storing the fragments in the designated data storage systems 30, 40.
[0057] In the case of retrieval, the rules are referenced to determine where the fragments are held and then the above process is reversed. Assuming the accessing party meets any predetermined conditions defined by the rules, the fragments are retrieved, and the data is recompiled from the retrieved fragments, decrypted and provided to the accessing party.
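By way of illustration only, the storage and retrieval flow of paragraphs [0056] and [0057] might be sketched in Python as follows. The `Rule` and `DataRouter` names, the rule fields, the store names and the contiguous fragmentation scheme are all assumptions made for this sketch and are not prescribed by the specification; encryption is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One data storage rule (hypothetical shape): an attribute and its outcome."""
    attribute: tuple[str, str]      # e.g. ("subject_domicile", "DE")
    stores: list[str]               # designated storage locations, one per fragment
    allowed_requesters: set[str]    # parties permitted to retrieve the data


@dataclass
class DataRouter:
    rules: list[Rule]
    stores: dict[str, dict[str, str]] = field(default_factory=dict)  # store -> {data id: fragment}
    applied: dict[str, Rule] = field(default_factory=dict)           # data id -> rule used

    def _match(self, attributes: dict[str, str]) -> Rule:
        # Return the first rule whose applicable attribute appears in the request.
        for rule in self.rules:
            key, value = rule.attribute
            if attributes.get(key) == value:
                return rule
        raise LookupError("no applicable rule")

    def store(self, data_id: str, data: str, attributes: dict[str, str]) -> None:
        rule = self._match(attributes)
        n = len(rule.stores)
        size = -(-len(data) // n)  # ceiling division: length of each contiguous fragment
        for k, name in enumerate(rule.stores):
            self.stores.setdefault(name, {})[data_id] = data[k * size:(k + 1) * size]
        self.applied[data_id] = rule

    def retrieve(self, data_id: str, requester: str) -> str:
        rule = self.applied[data_id]
        if requester not in rule.allowed_requesters:
            raise PermissionError("requester does not satisfy the rules")
        # Reverse the storage process: fetch each fragment in order and rejoin them.
        return "".join(self.stores[name][data_id] for name in rule.stores)
```

In this sketch each fragment goes to a different named store, and a retrieval request from a party not listed in the rule is refused before any fragment is fetched.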
[0058] Optionally, embodiments of the present invention provide the rules engine for tagging data, can generate a data dictionary, and can integrate with an existing service (such as a data protection system, content classification system etc.).
[0059] To assign data ownership to a data subject, including individuals, embodiments of the present invention may use a harmonisation table which links each application and Id to a master record, allowing all rules to be assigned and revoked at the data subject master level.
TABLE-US-00001
HarmonisationID  Common name  Domicile  Nationality
1                s@exat.com   GB        GB
2                p@exat.com   DE        US

Id  HarmonisationId  Application  ApplicationId
1   1                Hubspot      251
2   1                Facebook     Sonalr
3   1                LinkedIn     SonalRattan
4   2                Hubspot      255
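As a non-limiting illustration, the harmonisation table could be held as two lookups, one for master records and one linking application identities to them. The dictionary layout and the function names `master_record` and `linked_identities` are hypothetical; only the row values are taken from the table above.

```python
# Master records keyed by HarmonisationID (values copied from the table above).
MASTER = {
    1: {"common_name": "s@exat.com", "domicile": "GB", "nationality": "GB"},
    2: {"common_name": "p@exat.com", "domicile": "DE", "nationality": "US"},
}

# Link rows: (application, application id) -> HarmonisationID.
LINKS = {
    ("Hubspot", "251"): 1,
    ("Facebook", "Sonalr"): 1,
    ("LinkedIn", "SonalRattan"): 1,
    ("Hubspot", "255"): 2,
}


def master_record(application: str, application_id: str) -> dict[str, str]:
    """Resolve an application-level identity to its data subject master record."""
    return MASTER[LINKS[(application, application_id)]]


def linked_identities(harmonisation_id: int) -> list[tuple[str, str]]:
    """List every application identity covered by a rule set at master level."""
    return [key for key, hid in LINKS.items() if hid == harmonisation_id]
```

A rule assigned or revoked against a HarmonisationID would then apply to every identity returned by `linked_identities` for that ID.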
[0060] Preferred embodiments pseudonymize ciphertext and store part of the encrypted text alongside the keys (salt password) in the country of restriction or an acceptable country of storage. This means that the restricted data is kept within the designated country in which it is domiciled, and the rules engine determines in which other countries the data can be viewed.
[0061] Preferred embodiments use two methods to secure data and associated secrets.
[0062] If the data to be encrypted is determined by the data router to have a data subject assigned to the content (in a header, in a record associated with the data, or assigned in some other way), a rule check is performed to see where the secrets should be stored, i.e. by the domicile of the data subject, by data type (e.g. name, address), by the context of the data creation (e.g. account opening in a given jurisdiction), or by another defined mechanism.
[0063] To search on pseudonymised content, the pseudonymised text can be recreated by encrypting the data attribute you wish to search for with the same key, and then fragmenting the data attribute with the same algorithm. The resulting pseudonymised attribute can then be utilized as the search fragment.
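The search approach of paragraph [0063] relies on the pseudonymisation being repeatable for a given key. A minimal Python sketch follows, with two stated assumptions: a keyed HMAC-SHA256 digest is used here as a stand-in for deterministic encryption (the specification does not name a cipher for this step), and the fragmentation rule (every third character) is invented for illustration.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    # Stand-in for deterministic encryption: a keyed HMAC-SHA256 digest in hex.
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def search_fragment(value: str, key: bytes, parts: int = 3, index: int = 0) -> str:
    # Fragment the pseudonymised text the same way as at storage time
    # (assumed here: every `parts`-th character starting at `index`).
    return pseudonymise(value, key)[index::parts]


def find(index_store: dict[str, str], value: str, key: bytes) -> list[str]:
    # Return the ids of records whose stored fragment equals the recreated one.
    wanted = search_fragment(value, key)
    return [rid for rid, frag in index_store.items() if hmac.compare_digest(frag, wanted)]
```

Because the same key and the same fragmentation rule yield the same fragment, a search never needs the plaintext to be stored; a different key yields a fragment that matches nothing.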
[0064] If no data subject is assigned to the data, a similar rule check may be performed in which the entity domicile is taken into consideration, and the rules may be defined by data type (e.g. trade data, tender information etc.) or by another defined mechanism. This data could be assigned another unique value to be able to search for the same result, similar to two-factor authentication.
[0065] Preferably using symmetric encryption, the password is generated from multiple parts to ensure randomisation, and those parts are distributed and then reconstructed when all conditions are met.
[0066] Types of parameters used for generating the secret:
[0067] 1) Guid that defines the firm the data belongs to (which can be shared for multi-firm sharing and validation)
[0068] 2) Randomly generated text or guid as a secret
[0069] 3) Randomly generated text or guid as an authentication code
[0070] 4) Any other parameters that are required to reconstruct the secret (hash to check)
[0071] 5) Algorithm to reconstruct (i.e. use alternative characters)
[0072] The data router's proprietary algorithm then preferably constructs this into a given secret to create the key to encrypt data.
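The data router's actual algorithm is not disclosed. Purely as a sketch of how parameters 1) to 3) might be combined into a key, the following Python uses SHA-256 over the joined parts; the function names, the part names and the choice of hash are assumptions, not the proprietary method.

```python
import hashlib
import uuid


def new_secret_parts(firm_guid: str) -> dict[str, str]:
    # Parameters 1) to 3): firm guid, random secret, random authentication code.
    return {
        "firm": firm_guid,
        "secret": str(uuid.uuid4()),
        "auth_code": str(uuid.uuid4()),
    }


def build_key(parts: dict[str, str]) -> bytes:
    # Join the parts in a fixed order and hash them into a 32-byte key.
    # A missing part raises KeyError, so the key cannot be rebuilt from a subset.
    ordered = "|".join(parts[name] for name in ("firm", "secret", "auth_code"))
    return hashlib.sha256(ordered.encode("utf-8")).digest()
```

Each part can be distributed to a different location; the key exists only when all parts are brought together again.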
[0073]
[0074] Note that in this example, the datastores designated D1, D2 and D3 are the same on both sides of the diagram; they are illustrated twice to show access by both parties.
[0075] In an example, encrypted data could be represented as AAAAAAABBBBBBB (which, as will be appreciated, is a greatly simplified representation). The data router may transform this into three fragments: AB1AB1AB1AB1, AB2AB2AB2 and AB3AB3AB3AB3. The AB2 fragments may be stored in the country accountable for the data (country store AB2), and the AB3 fragments may be stored in a metadata repository (AB3) and in the source system. In a preferred embodiment, a data fragment may be held in a public ledger such as the Blockchain. This fragment (AB1) also preferably includes a data reconstruction ID which tells the system how to reconstruct the encrypted data. Preferably it is one-way hashed before being stored in the Blockchain. In order to check the validity of the data, the data router (or some other system) re-encrypts the data being tested, deconstructs it to create AB1, and then hashes AB1 to add onto the Blockchain.
[0076] The combination of AB1+AB2+AB3 will constitute the complete data attribute when reconstructed in the correct order. As an example, the attribute FIRSTNAME may be encrypted into a string called AAAAAAABBBBBBB. This string may be dynamically fragmented into three self-contained sub-fragments that are no longer usable in their current state. Instead, they must be re-assembled using the data router and its algorithms to reconstruct the attribute called AAAAAAABBBBBBB.
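One way to picture the AB1/AB2/AB3 example is a round-robin split of the encrypted string, with a one-way hash of the first fragment standing in for the ledger entry. This is a sketch under assumptions: the interleaving rule, the use of SHA-256 and the comparison-based `is_valid` check are illustrative choices, not the router's disclosed algorithm.

```python
import hashlib


def fragment(data: str, parts: int = 3) -> list[str]:
    # Fragment i receives characters i, i + parts, i + 2 * parts, ...
    return [data[i::parts] for i in range(parts)]


def reconstruct(fragments: list[str]) -> str:
    # Reassemble in the correct order by writing each fragment back to its slots.
    parts = len(fragments)
    out = [""] * sum(len(f) for f in fragments)
    for i, frag in enumerate(fragments):
        out[i::parts] = frag
    return "".join(out)


def ledger_entry(fragment_one: str) -> str:
    # One-way hash of the first fragment, as would be placed on the ledger.
    return hashlib.sha256(fragment_one.encode("utf-8")).hexdigest()


def is_valid(candidate: str, entry: str) -> bool:
    # Re-fragment the candidate ciphertext and compare the hash of its first fragment.
    return ledger_entry(fragment(candidate)[0]) == entry
```

No single fragment is a contiguous piece of the ciphertext, and the order of the fragments matters: supplying them in a different order does not reproduce the original string.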
[0077] AAAAAAABBBBBBB is an encrypted string which can be mathematically broken with sufficient computational power. The fragments produced under guidance of the data router, however, are random and cannot be broken through computational power; AB1, AB2 and AB3 are all turned into pseudonymised random data fragments.
[0078] Blockchain is preferably used as a store for the immutability of one of the data fragments alongside the reconstruction data Ids. Continuing the example above:
Data (Multi Factor Feature Recognition)
[0079] Embodiments also allow for pseudonymised features to be included with data.
[0080] AAAAAAABBBBBBB (encryption) CCCCCDDDD (features)
[0081] Splitting the features in exactly the same way as the encryption enables a probabilistic view on whether data is the same (e.g. the name Sonal, encrypted and fragmented as a multi-factor feature created by firm 1, and the encrypted and fragmented Sonal created by firm 2 are 99% the same). The feature is reconstructed with data held off chain in order to be seen as a feature. Reconstruction will only be granted if the rule check has passed and CD2 and CD3 are available for the reconstruction. This needs to be done per transaction in which the data or features are being processed.
Keys
[0082] Key storage may be performed in a similar way to the data above. For symmetric encryption (e.g. AES), a passcode (which is multiple random guids hashed) is used. One part is always held in memory for the firm, or held in a repository for sharing with a 3rd party. A second part is held with the metadata (in one central repository) and a third part is stored in the country store.
[0083] YYYYYYZZZZ (passcode)->YZ1YZ1YZ1 YZ2YZ2YZ2YZ2 YZ3YZ3YZ3
[0084] Components in the design:
[0085] Data Splitter: This component is designed to split a given string into 2 parts with a given proportion size.
[0086] Ex: If the string is "This is the string that will be split", the split algorithm calculates all white space, then accepts an optional parameter and splits the string, by default 30-70.
[0087] SplitString(OriginalString, proportion1, proportion2)
[0088] {
[0089] Returns 2 values
[0090] 1. String 1 with proportion 1 or default 30%
[0091] 2. String 2 with proportion 2 or default 70%
[0092] }
[0093] Data Merge
[0094] MergeString(String1, String2)
[0095] {
[0096] Unites String1 and String2, calculates the split percentage, returns the original string
[0097] }
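The Data Splitter and Data Merge pseudocode above can be read as the following Python sketch. It assumes the simplest interpretation, namely that the first part is the leading portion of the string and that white space counts toward the length; the snake_case names and the integer percentage parameters are choices made for this sketch.

```python
def split_string(original: str, proportion1: int = 30, proportion2: int = 70) -> tuple[str, str]:
    """Split `original` into two parts sized by the given percentages."""
    if proportion1 + proportion2 != 100:
        raise ValueError("proportions must sum to 100")
    # Characters, including white space, that go into the first part (rounded down).
    cut = len(original) * proportion1 // 100
    return original[:cut], original[cut:]


def merge_string(string1: str, string2: str) -> str:
    """Reunite the two parts produced by split_string."""
    return string1 + string2
```

Merging the two returned parts in order always restores the original string, whatever proportions were used for the split.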
[0098] Data Sieve or Data Filter: This is an optional but preferred component that selectively filters the string for random characters or acts like a Sieve.
[0099] The settings for which characters, how many, and at what frequency and sequence are configurable and could potentially vary for each user and firm.
[0100] The Data Sieve helps to reduce the final size of the data/string that needs to be stored and adds another layer of complexity against data breaches.
[0101] Data Un Sieve or Data Filler: This component does exactly the opposite of the Data Sieve. It puts the characters which were extracted by the Sieve back in their exact positions and brings the string/data back to its full form.
[0102] The sequence, format and frequency of the data filler are configurable and can vary for each user and firm.
[0103] The Un Sieved string could then be processed further.
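A minimal Python sketch of a sieve and its inverse is given below. It assumes one particular configuration, removal of every n-th character, and it assumes that the removed characters are kept together with their positions so that they can be restored; the function names and the `every` parameter are hypothetical.

```python
def sieve(data: str, every: int = 4) -> tuple[str, list[tuple[int, str]]]:
    """Remove every `every`-th character; return the reduced string and what was removed."""
    if every < 2:
        raise ValueError("every must be at least 2")
    kept: list[str] = []
    removed: list[tuple[int, str]] = []
    for position, char in enumerate(data):
        if (position + 1) % every == 0:
            removed.append((position, char))  # remember the original position
        else:
            kept.append(char)
    return "".join(kept), removed


def unsieve(sieved: str, removed: list[tuple[int, str]]) -> str:
    """Put the removed characters back at their original positions."""
    chars = list(sieved)
    # Positions are ascending, so each insert lands at its original index.
    for position, char in removed:
        chars.insert(position, char)
    return "".join(chars)
```

The sieved string is shorter than the input, and the removed characters could be stored separately from it, so that neither part alone restores the data.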
[0104] Blockchain as DLT and storage mechanism
[0105] Uses an industry standard Blockchain or other ledger such as Ethereum or Hyperledger
[0106] Holds part of the data for each attribute (the smaller proportion of each attribute, e.g. 30%)
[0107] The DLT would be a permissioned Blockchain with selected participating nodes
[0108] Each node can participate in the data reconciliation and verification process
[0109] The solution has an off chain component which will hold the major part of the data attribute (70%)
[0110] The off chain component can be distributed in any particular country
[0111] The DLT serves as distributed data amongst permissioned participants, and the blockchain structure serves as an immutable record of each transaction
[0112] Rules Engine: The rules engine is preferably built on a 3 layered approach as shown in
[0113] Examples of the same attribute with different rules
[0114] Customer A exercises his/her right to be forgotten under GDPR, yet record retention rules require the firm to retain certain information. The keys and part of the pseudonymized strands of Customer A's data are removed from the key management solution and stored in a locked down vault. Customer A's data can no longer be processed.
[0115] If there is a legal or regulatory need to see Customer A's data, the keys and the pseudonymized data strands can be returned and reconstructed, reinstating Customer A's data, in a fully audited and controlled manner. Alternatively, should the record retention period expire, the data in the vault will be automatically purged, in accordance with local regulations.
[0116] Customer A and Customer B are both in the same country. Customer A has consented to marketing, so access to his information will be granted and he will receive notifications. Customer B has not consented, so access to his data will not be granted.
[0117] Customer A is in a country that has cross border data restrictions. If he has consented, a person within this country may access his data. However, if a person outside the country attempts to access the same data, it will be blocked.
[0118] Internally within a firm, Customer A may have his data accessed by a relevant person (i.e. Relationship Manager/Sales), as it is part of their normal business activity. If an employee with a role that should not access this data (Finance) attempts to access it, it will be blocked.
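The consent, cross border and role examples above can be combined into a single rule check. The following Python sketch is illustrative only: the `Subject` fields, the `ALLOWED_ROLES` set and the order of the checks are assumptions, and a real rules engine would load such rules from its data repository rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    name: str
    country: str
    consented: bool
    cross_border_restricted: bool


# Hypothetical internal policy: roles for which access is normal business activity.
ALLOWED_ROLES = {"Relationship Manager", "Sales"}


def may_access(subject: Subject, requester_country: str, requester_role: str) -> bool:
    """Apply the consent, cross border and role checks in turn."""
    if not subject.consented:
        return False
    if subject.cross_border_restricted and requester_country != subject.country:
        return False
    return requester_role in ALLOWED_ROLES
```

A request is granted only if every check passes, so the same attribute can be visible to one requester and blocked for another.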
[0119] Country Level Rules: The system has been designed to allow all country level rules to be input or modified once (by data governance teams or potentially by 3rd party legal counsel).
[0120] Data is routed from one location to another if three conditions are met: the rules permit the data to be processed in the country requesting that type of data; the internal policy on which countries are allowed to view certain data types permits it; and finally the data subject (if defined) has consented to their data being accessed across different jurisdictions.
[0121] Data Subject Controls: The system automatically takes into account the rights of a data subject, be it at individual, fund or portfolio level, to apply consent type rules against different types of data usage, including the sharing of data over multiple jurisdictions, sharing data with a 3rd party and/or AI type profiling. The rules engine is designed to route keys and data fragments to the different locations depending on how the rules have been defined against the data subject, data origination point or entity location.
[0122] Embodiments of the present invention may use:
[0123] Multi-Vector Feature Processing and Extraction: reading the features for pseudonymised data, describing the underlying data to use for Homomorphic feature recognition.
[0124] Probability based Homomorphic feature recognition: comparison of Multi-Vector features on pseudonymized data (non-reversible) to be able to compare values and, on a probability basis, be able to determine if the underlying value is the same.
[0125] Zero knowledge distributed fragment processing: ability to reconstruct data with a 3rd party holding partial data and secrets.
[0126] AI based entity recognition, reconstruction, and processing: a reader bot looks through text and works out if something is pseudonymised and reconstructs it (if it passes the rule check).
[0127] Rule-based dynamic data protection.
[0128] Self-identifiable rule-based searching: reconstructing fragmented data to pseudonymised data with multi-factors (such as knowing the attribute type, the data subject common name and reuse keys, secrets and pseudonymisation technique to recreate value).
[0129] To pseudonymise the protected data, we remove the ability to recreate and reconstruct the ciphertext by deleting the salts and the remaining fragments.
[0130] It will be appreciated that the data storage systems themselves may take various forms including central or distributed file stores and databases (such as SQL or other relational or non-relational database types). They may be implemented using storage devices such as hard disks, random access memories, solid state disks or any other forms of storage media. They could also be cloud-based services, public ledger based systems or the like. It will also be appreciated that the processor discussed herein may represent a single processor or a collection of processors acting in a synchronised, semi-synchronised or asynchronous manner.
[0131] It is to be appreciated that certain embodiments of the invention as discussed below may be incorporated as code (e.g., a software algorithm or program) residing in firmware and/or on computer useable medium having control logic for enabling execution on a computer system having a computer processor. Such a computer system typically includes memory storage configured to provide output from execution of the code which configures a processor in accordance with the execution. The code can be arranged as firmware or software, and can be organized as a set of modules such as discrete code modules, function calls, procedure calls or objects in an object-oriented programming environment. If implemented using modules, the code can comprise a single module or a plurality of modules that operate in cooperation with one another.
[0132] Optional embodiments of the invention can be understood as including the parts, elements and features referred to or indicated herein, individually or collectively, in any or all combinations of two or more of the parts, elements or features, and wherein specific integers are mentioned herein which have known equivalents in the art to which the invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth.
[0133] Although illustrated embodiments of the present invention have been described, it should be understood that various changes, substitutions, and alterations can be made by one of ordinary skill in the art without departing from the present invention which is defined by the recitations in the claims below and equivalents thereof.