System and methods for trajectory pattern recognition
11023820 · 2021-06-01
Assignee
Inventors
Cpc classification
G06F18/40
PHYSICS
International classification
G06G7/00
PHYSICS
G06E3/00
PHYSICS
Abstract
A multiple imputation (MI) based fuzzy clustering with visualization-aided MI validation that improves the accuracy and the stability of identified patterns, generally the structure of HD data with missing values.
Claims
1. A computer-implemented method for visualizing pattern recognition of longitudinal data, the computer-implemented method comprising: receiving one or more datasets associated with a population, or sample thereof, of subjects, wherein the dataset includes a plurality of components and one or more attributes associated with each of the plurality of components; producing multiple imputed datasets of longitudinal data by imputing missing attribute values into the one or more datasets based on a multiple imputation technique; identifying one or more trajectory patterns or one or more data clusters from the multiple imputed datasets; performing a multiple imputation based cluster validation on the one or more trajectory patterns or one or more data clusters, the multiple imputation based cluster validation accounting for imputation uncertainty; comparing data associated with one or more individual subjects to the one or more trajectory patterns or one or more data clusters; displaying a stable cluster structure for varying sample sizes and dimensions or attributes by automating searching and finding of an optimal number of iterations across the multiple imputed datasets of longitudinal data; and illustrating, on a display device, the one or more trajectory patterns or one or more data clusters.
2. The computer-implemented method for visualizing pattern recognition of longitudinal data method of claim 1, wherein comparing data associated with one or more individual subjects to the one or more identified clusters or trajectory patterns includes examining cluster membership uncertainty across the one or more identified clusters or trajectory patterns.
3. The computer-implemented method for visualizing pattern recognition of longitudinal data method of claim 2, wherein examining cluster membership uncertainty across identified clusters or trajectory pattern includes generating a diagnosis table.
4. The computer-implemented method for visualizing pattern recognition of longitudinal data method of claim 1, wherein examining cluster membership uncertainty includes determining average fuzzy degrees of subjects for their most likely latent cluster or trajectory pattern membership.
5. The computer-implemented method for visualizing pattern recognition of longitudinal data method of claim 1, wherein comparing data associated with one or more individual subjects to the one or more identified clusters or trajectory patterns includes generating fuzzy degrees of each subject and using the fuzzy degrees to predict an outcome.
6. The computer-implemented method for visualizing pattern recognition of longitudinal data method of claim 1, wherein comparing data associated with one or more individual subjects to the one or more identified clusters or trajectory-pattern includes comparing mean trajectories to individual subject trajectories.
7. A computer-implemented method for visualizing pattern recognition of longitudinal data without changing the classification of the longitudinal data in high-dimensional space, the computer-implemented method comprising: receiving, by a processor, two or more data parameters; generating, with the processor, longitudinal data as defined by the two or more parameters; grouping the longitudinal data according to predefined attributes to obtain a plurality of data clusters, wherein each data cluster is of a defined cluster size and includes two or more data values and replacing any missing data value with a plausible data value, the grouping and replacing accounting for imputation uncertainty; storing each data cluster in memory; determining, with the processor, a plurality of variations of the data clusters, between data clusters, and within the data clusters; optimizing, with the processor, weights for the plurality of variations of the data clusters, while preserving original structure membership of the longitudinal data in the high-dimensional space, the variations occurring between two or more data clusters and within each data cluster; projecting, with the processor, onto a two-dimensional surface the data clusters according to the optimizing; displaying a stable cluster structure for varying sample sizes and dimensions or attributes by automating searching and finding of an optimal number of iterations across the plurality of data clusters of longitudinal data; and displaying, on a display device, the stable cluster structure.
8. The computer-implemented method for visualizing pattern recognition of longitudinal data according to claim 7, wherein said determining includes examining cluster membership uncertainty across the data clusters.
9. The computer-implemented method for visualizing pattern recognition of longitudinal data according to claim 8, wherein said examining includes generating a diagnosis table.
10. The computer-implemented method for visualizing pattern recognition of longitudinal data according to claim 7, wherein the optimizing of the weights is based on gradual optimization and non-linear mapping methods.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
(9) The enhanced projection pursuit method according to the invention improves the accuracy and stability of identified patterns, generally the structure of high-dimensional (HD) data. The invention automates the search for the optimal number of iterations needed to display a stable structure, based on factors such as sample size and the number of variables.
(10)
(11) System 100 represents an example of a system that may be configured to allow software applications to be developed, distributed, and executed on a plurality of computing devices, such as computing devices 102A-102N. In the example illustrated in
(12) Communications network 104 may comprise any combination of wireless and/or wired communication media. Communications network 104 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication between various devices and sites. Communications network 104 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. Communications network 104 may operate according to one or more communication protocols, such as, for example, a Global System for Mobile Communications (GSM) standard, a code division multiple access (CDMA) standard, a 3rd Generation Partnership Project (3GPP) standard, an Internet Protocol (IP) standard, a Wireless Application Protocol (WAP) standard, and/or an IEEE standard, such as one or more of the 802.11 standards, as well as various combinations thereof.
(13) As illustrated in
(14) As illustrated in
(15) As illustrated in
(16) Application interface 112 may be configured to provide an interface between application hosting site 110 and one or more of computing devices 102A-102N for a hosted aspect of an application. Commerce engine 114 may be configured to support transactions that may occur when a user uses a software application. Commerce engine 114 may include a number of components required for online commerce. For example, commerce engine 114 may include modules with instructions stored on a computer readable medium that when executed by a processor cause application hosting site 110 to perform functions related to customer accounts, orders, subscriptions, tax, payments, fraud, and credit processing.
(17) Support engine 116 may be configured to provide support services associated with an application. In one example, support engine 116 may be configured to provide updates to an application installed on one of user devices 102A-102N. As illustrated in
(18)
(19) Processor(s) 202 may be configured to implement functionality and/or process instructions for execution in computing device 200. Processor(s) 202 may be capable of retrieving and processing instructions, code, and/or data structures for implementing one or more of the techniques described herein. Instructions may be stored on a computer readable medium, such as memory 204 or storage device 206. Processor(s) 202 may be digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
(20) Memory 204 may be configured to store information that may be used by computing device 200 during operation. As described above, memory 204 may be used to store program instructions for execution by processor(s) 202 and may be used by software or applications running on computing device 200 to temporarily store information during program execution. For example, memory 204 may store instructions associated with operating system 216, applications 218, and pattern recognition application 220 or components thereof, and/or memory 204 may store information associated with the execution of operating system 216, applications 218, and pattern recognition application 220. Memory 204 may be described as a non-transitory or tangible computer-readable storage medium. In some examples, memory 204 may provide temporary memory and/or long-term storage. In some examples, memory 204 or portion thereof may be described as volatile memory, i.e., in some cases memory 204 may not maintain stored contents when computing device 200 is powered down. Examples of volatile memories include random access memories (RAM), dynamic random access memories (DRAM), and static random access memories (SRAM).
(21) Storage device 206 represents memory of computing device 200 that may be configured to store relatively larger amounts of information for relatively longer periods of time than memory 204. Similar to memory 204, storage device 206 may also include one or more non-transitory or tangible computer-readable storage media. Storage device 206 may be internal or external memory and in some examples may include non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable memories (EEPROM).
(22) Input device(s) 208 may be configured to receive input from a user operating computing device 200. Input from a user may be generated as part of a user running one or more software applications, such as applications 218 and/or pattern recognition application 220. Input device(s) 208 may include a touch-sensitive screen, track pad, track point, mouse, a keyboard, a microphone, video camera, or any other type of device configured to receive input from a user. In one example, input device(s) 208 may generate one or more signals corresponding to the coordinates of a position touched on a touchscreen of computing device 200. These signals may be provided as information to components of computing device 200 (e.g., processor 202, or operating system 216) in conjunction with the execution of applications 218 and/or pattern recognition application 220.
(23) Output device(s) 210 may be configured to provide output to a user operating computing device 200. Output may be tactile, audio, or visual output generated as part of a user running one or more software applications, such as applications 218 and/or pattern recognition application 220. Output device(s) 210 may include a touch-sensitive screen, sound card, a video graphics adapter card, or any other type of device for converting a signal into an appropriate form understandable to humans or machines. Additional examples of output device(s) 210 may include a speaker, a cathode ray tube (CRT) monitor, a liquid crystal display (LCD), or any other type of device that can provide output to a user. In some examples, output device(s) 210 may be external to computing device 200 and may be operatively coupled to computing device 200 using a standardized communication protocol, such as, for example, Universal Serial Bus protocol (USB) or High-Definition Multimedia Interface (HDMI).
(24) In the example illustrated in
(25) Network interface 214 may be configured to enable computing device 200 to communicate with external devices via one or more networks, such as communications network 104. Network interface 214 may be a network interface card, such as an Ethernet card, an optical transceiver, a radio frequency transceiver, or any other type of device that can send and receive information. Network interface 214 may be configured to operate according to one or more of the communication protocols described above with respect to communications network 104.
(26) Operating system 216 may be configured to facilitate the interaction of applications, such as applications 218 and pattern recognition application 220, with processor(s) 202, memory 204, storage device 206, input device(s) 208, output device(s) 210, display 212, network interface 214 and other hardware components of computing device 200. Operating system 216 may be an operating system designed to be installed on laptops and desktops. For example, operating system 216 may be a Windows operating system, Linux, or Mac OS. In another example, if computing device 200 is a mobile device, such as a smartphone or a tablet, operating system 216 may be one of Android, iOS or a Windows mobile operating system.
(27) Applications 218 may be any applications implemented within or executed by computing device 200 and may be implemented or contained within, operable by, executed by, and/or be operatively/communicatively coupled to components of computing device 200, e.g., processor(s) 202, memory 204, and network interface 214. In one example, an application may be developed by developer site 106 as described above with respect to
(28) Pattern recognition application 220 may be an application configured to analyze a heterogeneity effect. In one example, the pattern recognition application 220 may be configured to input/output files compatible with Excel, SAS, Stata, R, and SPSS file formats. In one example, the pattern recognition application 220 may be scalable for use with online large-scale electronic databases.
(29) Studies such as observational studies and randomized controlled trial (RCT) studies are used to examine the effect of one or more treatments across a population of subjects over a period of time. Typically, a study will include one or more components and one or more attributes associated with a component. For example, a smoking cessation intervention study may include one or more intervention components (e.g., messages to/from a tobacco treatment specialist, encouraging email messages from experts, and an online community). Attributes may include a measurement associated with a component at a defined time period (e.g., number of messages received during a month). Thus, a dataset associated with a sample/population of subjects may include a plurality of components and one or more attributes associated with each of the plurality of components.
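By way of illustration only, the component/attribute organization described above might be represented as follows. This is a minimal sketch: the class name `Subject`, the component names, and the use of `None` to mark a missing measurement are assumptions made for the example and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Subject:
    """One study subject (hypothetical structure, for illustration)."""
    subject_id: str
    # component name -> one attribute value per time period (None = missing)
    attributes: Dict[str, List[Optional[float]]] = field(default_factory=dict)

    def flatten(self, components: List[str]) -> List[Optional[float]]:
        """Concatenate the per-component trajectories into one attribute vector."""
        row: List[Optional[float]] = []
        for name in components:
            row.extend(self.attributes[name])
        return row


# Three hypothetical intervention components, each measured at three time periods.
components = ["specialist_messages", "expert_emails", "community_posts"]
subject = Subject("S001", {
    "specialist_messages": [3.0, 1.0, None],
    "expert_emails": [5.0, 4.0, 2.0],
    "community_posts": [0.0, None, 1.0],
})
row = subject.flatten(components)  # 3 components x 3 periods = 9 attributes
```

Flattening each subject into a single vector of components × time periods gives the high-dimensional longitudinal row that the imputation and clustering steps operate on.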
(30) Clustering techniques such as, for example, Probabilistic clustering, Neural Networks, Agglomerative Hierarchical clustering, and Partition-based clustering may be used to categorize subjects of a study into groups (“clusters”) based on data values of the attributes. Probabilistic clustering, Neural Networks, Agglomerative Hierarchical clustering, and Partition-based clustering may use simple “exposed” and “non-exposed” groups or subgroups based on arbitrarily determined cut-scores (e.g., quintiles or percentages). Simple grouping may overlook important exposure information, or may generate spurious false-positive findings. Further, Probabilistic clustering, Neural Networks, Agglomerative Hierarchical clustering, and Partition-based clustering may have deficiencies in analyzing longitudinal multiple-component behavioral RCT interventions or observational studies. For example, Gaussian, Bayesian, Hierarchical and Self Organizing Map (SOM) clustering may not be able to simultaneously, automatically, and effectively handle trajectory pattern recognition.
(31) Further, these clustering techniques may not have embedded and/or transparent validation and diagnosis procedures for situations where subjects can have multiple memberships and data are longitudinal, missing, non-normal, high-dimensional and/or correlated. Further, these clustering techniques often need manual testing of a number of patterns, one-by-one. In addition, these clustering techniques may be implemented in the form of source code or hypertext code and may need people with advanced computational skills to carry out the pattern recognition process. These are among the disadvantages of existing clustering techniques.
(32) Fang H, Johnson C, Stopp C, Espy K A. “A new look at quantifying tobacco exposure during pregnancy using fuzzy clustering.” Neurotoxicol Teratol, 2011 January-February; 33(1): 155-65, which is incorporated by reference in its entirety, describes using a multiple-imputation-based fuzzy clustering (MI-Fuzzy) model to quantify tobacco exposure during pregnancy. MI-Fuzzy techniques may generate latent patterns by simultaneously considering all available and relevant resources and retaining all enrolled subjects and their data (i.e., collected for the purpose of the intervention/treatment/exposure and outcome analyses). The techniques described herein may be based on MI-Fuzzy techniques, such as, for example, those described in Fang et al., and may use actual evidence of subjects' engagement with or responses to interventions/treatments/exposure to capture behavioral changes over time and to identify latent clusters.
(33) Techniques based on multiple-imputation-based fuzzy clustering models may provide a robust pattern recognition tool and may be designed to characterize behavioral patterns in multi-component behavioral intervention/treatment/exposure and examine differential patient responsivity to intervention/treatment/exposure in (longitudinal) RCTs or observational studies. Techniques based on multiple-imputation-based fuzzy clustering models may help evaluate how the intervention/treatment/exposure components work for different patients/subjects, clarify their efficacy/effectiveness, and provide a detailed understanding of how patients/subjects' engagement and response patterns relate to different health outcomes. Techniques based on multiple-imputation-based fuzzy clustering models may be used to help improve these interventions/treatments/exposure for targeted populations, and uncover important relationships. Further, techniques based on multiple-imputation-based fuzzy clustering models may help provide new evidence on high-risk behavioral patterns that may be clinically important for early targeted intervention/treatment.
(34) The techniques described herein provide a pattern-recognition model that employs a full theoretical integration of (a) multiple imputation, (b) fuzzy clustering, and (c) comprehensive validation. The techniques described herein may be based on multiple-imputation-based fuzzy clustering models and be configured to cope with real-world situations where patients/subjects have membership in multiple clusters, handle high-dimensional longitudinal data with missing values, and validate behavioral patterns.
(35) In one example, the techniques described herein may include Multiple Imputation (MI) techniques. Missing attribute data values are very common in clinical trials or observational studies, especially in longitudinal studies. In one example, the techniques described herein integrate MI techniques into clustering to account for imputation uncertainty, making the techniques robust to data that are missing completely at random (MCAR) or missing at random (MAR).
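By way of illustration only, producing multiple imputed datasets might be sketched as below. The source does not specify an imputation model; this sketch assumes a deliberately simple one (each missing value is drawn from a normal distribution fitted to the observed values of the same attribute), and the function name `multiply_impute` and its parameters are hypothetical. A practical implementation would more likely use a model that conditions on the other attributes.

```python
import random
import statistics
from typing import List, Optional

Row = List[Optional[float]]


def multiply_impute(data: List[Row], m: int = 5, seed: int = 0) -> List[List[List[float]]]:
    """Return m completed copies of `data`, each with independently drawn fill-ins.

    Assumes every attribute (column) has at least one observed value.
    """
    rng = random.Random(seed)
    n_cols = len(data[0])

    # Per-attribute mean and standard deviation from observed values only.
    col_stats = []
    for j in range(n_cols):
        observed = [r[j] for r in data if r[j] is not None]
        mu = statistics.mean(observed)
        sd = statistics.stdev(observed) if len(observed) > 1 else 0.0
        col_stats.append((mu, sd))

    imputed_sets = []
    for _ in range(m):
        completed = [
            [r[j] if r[j] is not None else rng.gauss(col_stats[j][0], col_stats[j][1])
             for j in range(n_cols)]
            for r in data
        ]
        imputed_sets.append(completed)
    return imputed_sets


data = [
    [1.0, None, 3.0],
    [2.0, 5.0, None],
    [3.0, 7.0, 9.0],
    [None, 6.0, 6.0],
]
imputed = multiply_impute(data, m=3)
```

Observed values are identical in every copy; only the filled-in values vary, and that variation across copies is what later steps can use to reflect imputation uncertainty.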
(36) In one example, the techniques described herein may include fuzzy clustering techniques. For example, in a longitudinal smoking cessation study a single individual can have multiple memberships. That is, a subject can belong to different clusters with a different degree of membership in each cluster. The fuzzy clustering techniques described herein may use “fuzzy degrees” to handle multiple membership situations. Further, the fuzzy clustering based techniques described herein may be analytically tractable and computationally efficient and may be especially useful when the data are complex. The fuzzy clustering techniques described herein may handle non-normal and high-dimensional data with missing values and a mix of categorical and continuous variables, without prior assumptions of statistical distributions.
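By way of illustration only, the notion of "fuzzy degrees" can be shown with a minimal fuzzy c-means sketch. Fuzzy c-means is used here as a standard representative of fuzzy clustering; the source does not state which fuzzy clustering algorithm is employed. The fuzzifier value of 2, the fixed iteration count, and the farthest-point initialization are all assumptions of this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def farthest_point_init(points: List[Point], c: int) -> List[List[float]]:
    """Deterministic start: first point, then repeatedly the point farthest from chosen centers."""
    centers = [list(points[0])]
    while len(centers) < c:
        nxt = max(points, key=lambda p: min(math.dist(p, ctr) for ctr in centers))
        centers.append(list(nxt))
    return centers


def memberships(p: Point, centers: List[List[float]], fuzzifier: float) -> List[float]:
    """Fuzzy degrees of one point in every cluster; they sum to 1."""
    d = [math.dist(p, ctr) for ctr in centers]
    if 0.0 in d:  # point sits exactly on one or more centers
        hits = [1.0 if x == 0.0 else 0.0 for x in d]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (fuzzifier - 1.0)
    return [1.0 / sum((d[i] / d[j]) ** power for j in range(len(d)))
            for i in range(len(d))]


def fuzzy_c_means(points: List[Point], c: int, fuzzifier: float = 2.0,
                  iters: int = 100) -> Tuple[List[List[float]], List[List[float]]]:
    centers = farthest_point_init(points, c)
    dim = len(points[0])
    for _ in range(iters):
        u = [memberships(p, centers, fuzzifier) for p in points]
        new_centers = []
        for i in range(c):
            w = [row[i] ** fuzzifier for row in u]  # membership weights
            total = sum(w)
            new_centers.append([sum(wk * p[a] for wk, p in zip(w, points)) / total
                                for a in range(dim)])
        centers = new_centers
    u = [memberships(p, centers, fuzzifier) for p in points]
    return centers, u


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, c=2)
labels = [row.index(max(row)) for row in u]  # most likely cluster per subject
```

Each row of `u` holds one subject's degree of membership in every cluster, so a subject is never forced into a single group; the most likely cluster is simply the one with the largest degree.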
(37) In one example, the techniques described herein may include pattern validation techniques. The embedded pattern validation process described herein may be replicable and tractable (“transparent”). Further, the validation processes described herein may incorporate graphs to visualize patterns generated from high-dimensional data, MI-based clustering indices to validate engagement/response (trajectory) patterns, and statistical tests to examine clusters.
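By way of illustration only, an MI-based clustering index might be computed per imputed dataset and then pooled. The source does not name a particular index or pooling rule; this sketch assumes the Xie-Beni index (lower values indicate more compact, better separated clusters) and pools by simple averaging, reporting the between-imputation standard deviation as a rough indication of imputation uncertainty. The function names and the toy inputs are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def xie_beni(points: List[Sequence[float]], centers: List[Sequence[float]],
             u: List[List[float]], fuzzifier: float = 2.0) -> float:
    """Xie-Beni index: membership-weighted compactness divided by n times minimum center separation."""
    n = len(points)
    compactness = sum((u[k][i] ** fuzzifier) * math.dist(points[k], centers[i]) ** 2
                      for k in range(n) for i in range(len(centers)))
    separation = min(math.dist(a, b) ** 2
                     for idx, a in enumerate(centers) for b in centers[idx + 1:])
    return compactness / (n * separation)


def pool_index(values: List[float]) -> Tuple[float, float]:
    """Average an index over imputed datasets and report its spread across imputations."""
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.mean(values), spread


# Two toy "imputed datasets" with the same crisp memberships (illustrative inputs only).
u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
points_a = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers_a = [(0.0, 1.0), (10.0, 1.0)]
points_b = [(0.0, 0.0), (0.0, 4.0), (10.0, 0.0), (10.0, 4.0)]
centers_b = [(0.0, 2.0), (10.0, 2.0)]

xb_a = xie_beni(points_a, centers_a, u)
xb_b = xie_beni(points_b, centers_b, u)
pooled_mean, pooled_spread = pool_index([xb_a, xb_b])
```

Repeating this for several candidate numbers of clusters and comparing the pooled values is one way the one-by-one manual testing of cluster counts could be automated.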
(38) In one example, the techniques described herein may be scalable to accommodate features such as a diagnostic table for checking the certainty of cluster membership, a graphical presentation of trajectory patterns, and computation of individual fuzzy degrees in each trajectory pattern for outcome tests.
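By way of illustration only, a diagnostic table for checking the certainty of cluster membership might be built from the fuzzy degrees as follows. The table layout, the column names, and the function name `diagnosis_table` are assumptions of this sketch; the source does not define the table's contents beyond its purpose.

```python
from typing import Dict, List, Union


def diagnosis_table(u: List[List[float]]) -> List[Dict[str, Union[int, float]]]:
    """For each cluster, count the subjects assigned to it and average their top fuzzy degree."""
    n_clusters = len(u[0])
    table = []
    for i in range(n_clusters):
        # Fuzzy degrees of subjects whose most likely cluster is cluster i.
        degrees = [max(row) for row in u if row.index(max(row)) == i]
        avg = sum(degrees) / len(degrees) if degrees else float("nan")
        table.append({"cluster": i, "n_subjects": len(degrees), "avg_max_degree": avg})
    return table


# Fuzzy degrees for three hypothetical subjects across two clusters.
u = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
table = diagnosis_table(u)
for entry in table:
    print(entry)
```

An average top degree close to 1 suggests subjects sit firmly in their most likely cluster, whereas a value near 1 divided by the number of clusters suggests the assignment is uncertain.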
(39) The techniques described herein may be particularly useful for recognizing patterns of smokers' behavioral changes over the course of an intervention/treatment/exposure, particularly in Internet and culturally tailored interventions. That is, the techniques described herein may be used to untangle these complex but informative behavioral variations in the intervention/treatment/exposure. The techniques described herein may be implemented in a user-friendly software tool and may be applicable in many areas, including but not limited to medical areas. For example, the techniques described herein may be useful to clinicians, basic scientists, educators, and students in various types of research fields.
(40)
(41) Computing device 200 compares individual subject data and/or trajectory patterns to validated cluster data and/or trajectory patterns (312). In one example, computing device 200 may illustrate cluster data and/or trajectory patterns on a display device (see 410 of
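By way of illustration only, comparing an individual subject's trajectory to cluster mean trajectories might be sketched as below. The membership-weighted mean and the Euclidean distance are assumptions of this example, as are the function names; the source does not specify how the comparison is computed.

```python
import math
from typing import List, Tuple


def mean_trajectories(trajectories: List[List[float]], u: List[List[float]],
                      fuzzifier: float = 2.0) -> List[List[float]]:
    """Membership-weighted mean trajectory for each cluster."""
    n_clusters = len(u[0])
    n_times = len(trajectories[0])
    means = []
    for i in range(n_clusters):
        w = [row[i] ** fuzzifier for row in u]
        total = sum(w)
        means.append([sum(wk * traj[t] for wk, traj in zip(w, trajectories)) / total
                      for t in range(n_times)])
    return means


def closest_pattern(trajectory: List[float],
                    means: List[List[float]]) -> Tuple[int, List[float]]:
    """Index of the nearest mean trajectory and the distance to every mean."""
    dists = [math.dist(trajectory, m) for m in means]
    return dists.index(min(dists)), dists


# Three hypothetical subjects measured at three time points, with crisp memberships.
trajectories = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [9.0, 8.0, 7.0]]
u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
means = mean_trajectories(trajectories, u)
best, dists = closest_pattern([2.0, 2.0, 2.0], means)
```

Plotting each mean trajectory alongside the individual trajectories assigned to it gives the kind of graphical comparison between patterns and subjects that the display step calls for.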
(42)
(43) The enhanced projection pursuit method according to one embodiment of the invention aims to obtain an interesting two-dimensional projection of the original high-dimensional data that minimizes its stress function. As shown in
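By way of illustration only, a stress function for a two-dimensional projection can be sketched as below. The source does not give the form of its stress function; this example assumes Sammon's stress, a common choice in non-linear mapping, which penalizes differences between pairwise distances in the original space and in the projection. The function name and sample points are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def sammon_stress(high: List[Sequence[float]], low: List[Sequence[float]]) -> float:
    """Sammon's stress between high-dimensional points and their low-dimensional images.

    Assumes at least one pair of distinct points in `high`.
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(high)), 2):
        d_high = math.dist(high[i], high[j])
        d_low = math.dist(low[i], low[j])
        if d_high == 0.0:
            continue  # coincident originals are skipped (an assumption of this sketch)
        numerator += (d_high - d_low) ** 2 / d_high
        denominator += d_high
    return numerator / denominator


# Three points in 3-D that happen to lie in a plane.
high = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
faithful = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]    # preserves every pairwise distance
distorted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # shrinks the distances

stress_faithful = sammon_stress(high, faithful)
stress_distorted = sammon_stress(high, distorted)
```

A projection that preserves all pairwise distances has zero stress, so an iterative optimizer can stop once further iterations no longer reduce the stress appreciably.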
(44) To evaluate the performance of the invention, two public datasets and two longitudinal randomized controlled trial (RCT) datasets were used to compare the invention with Andrews Curves and Grand Tour as shown in
(45) TABLE-US-00001
Name             IRIS   Waveform   TDTA   CANDO
Cases (N)        150    5000       109    240
Dimensions (d)   4      21         5      8
Time points (t)  1      1          4      4
Clusters (c)     3      3          3      5
(46) As shown in
(47) Turning to
(48) TDTA data was collected from a longitudinal culturally-tailored smoking cessation intervention for 97 Asian American smokers. It contains three identified culturally-adaptive response patterns with the intervention using three components: (1) cognitive behavioral therapy, (2) cultural tailoring, and (3) nicotine replacement therapy. The first two were measured by scores on Perceived Risks and Benefits, Family and Peer Norms, and Self-efficacy scales. Each scale has four repeated measures for a total of 20 attributes. As shown in
(49) Turning to
(50) In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this invention. A computer program product may include a computer-readable medium.
(51) By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
(52) Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules. Also, the techniques could be fully implemented in one or more circuits or logic elements.
(53) The techniques of this invention may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this invention to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
(54) While the invention is susceptible to various modifications and alternative forms, specific exemplary embodiments of the present invention have been shown by way of example in the drawings and have been described in detail. It should be understood, however, that there is no intent to limit the invention to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.