WO2012122516A1 - System and method for converting large data sets and other information to observations for analysis to reveal complex relationship linkages - Google Patents

System and method for converting large data sets and other information to observations for analysis to reveal complex relationship linkages

Info

Publication number
WO2012122516A1
Authority
WO
WIPO (PCT)
Application number
PCT/US2012/028589
Other languages
French (fr)
Inventor
Dale Fedewa
Anthony Edwards
Original Assignee
Redoak Logic, Inc.
Application filed by Redoak Logic, Inc.
Publication of WO2012122516A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/06: Asset management; Financial planning or analysis

Definitions

An outcome is not simply a category such as "weight change"; an outcome includes instances of that category, such as increased, decreased, or unchanged. Every outcome instance is judged and labeled as desired, undesired, or neutral. This labeling is important for reporting: by knowing which outcomes are desired and which are not, the Query/Analysis software can group and judge outcomes.
The Space Manager 1400 handles AME ingestion by uploading the observations using an API offering suitable functionality; for example, it uses a PHP API which in turn uses a web services REST API. Because it stores all the observational data, the Space Manager can also be used for data mining, which is useful for verifying data integrity. The Space Manager 1400 interfaces directly to the Query/Analysis module.
QUERY/ANALYSIS MODULE
The Query/Analysis module 1500 uses information provided by the Space Manager 1400 (such as model orientation information) and interacts with the natural intelligence platform technology to search and analyze the clinical trial data therein. One of two algorithms is used: affinity or discovery. Affinity analysis enables learning patterns so that predictions can be made on new clinical trial data.
Affinity Query/Analysis
In an Affinity Query/Analysis, large amounts of potentially contradictory data are analyzed to reveal trends in which certain initial data shows an affinity to certain outcomes when a particular outside influencer is present. A noise reduction algorithm is typically used in connection with this Affinity Query/Analysis; the following provides an example of such an algorithm.
Data from database tables of clinical trial patient information is uploaded to the natural intelligence platform, a network database with strong associative analysis capabilities. Data characteristics are divided into three groups: initial properties, outside influencers, and outcomes. The outcomes group is further divided into two sub-groups: desirable outcomes and undesirable outcomes. All data is tied to key indexes (or pivot points) such as patient id and site id. External data is gathered insofar as it relates to patients or sites and is uploaded as more characteristics tied to the key indexes.
Population D: the set of all patients that have the key characteristics (Q) to examine.
Population A: the subset of Population D that additionally associates to an outside influence characteristic.
Population B: the subset of Population D that additionally associates to a control characteristic.
Each patient id is associated to all information that relates to that patient. Some information, such as initial weight, is common to multiple patients; this is expected and desirable for finding common groupings of patients.
Noise Reduction Algorithm
The Noise Reduction Algorithm compares two populations with respect to some relevant item of data, for example hair color: when comparing the two populations, is one hair color more prevalent than another, or is hair color evenly distributed?
First, one or more relevant items of data are chosen to examine. This may be a single item or a list of items (such as a list of desirable outcomes). Given this relevant data, a population of patients can be found; this is Population D. It may be something such as "all red-heads" or "all red-heads scoring high". Population D is then reduced to the subset of patients associated to a specific outside influence, called Population A; an example outside influence is a 30mg dose of study drug. Population D is likewise reduced to the subset of patients in the control group, called Population B.
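The population selection steps above lend themselves to a short sketch. The following Python is illustrative only: the data layout (a dict from patient id to a set of "category:value" characteristics) and every name in it are assumptions for the example, not the patent's actual structures.

    # Sketch only: assumes each patient maps to a set of "category:value" strings.
    def select(patients, required):
        # Keep only patients whose characteristics include everything in `required`.
        return {pid: chars for pid, chars in patients.items() if required <= chars}

    patients = {
        "p001": {"outcome:score high", "dose:30mg", "hair color:red head"},
        "p002": {"outcome:score high", "dose:placebo", "hair color:blond"},
        "p003": {"outcome:score high", "dose:placebo", "hair color:red head"},
    }

    Q = {"outcome:score high"}               # key characteristics to examine
    pop_d = select(patients, Q)              # Population D
    pop_a = select(pop_d, {"dose:30mg"})     # Population A: D plus the outside influence
    pop_b = select(pop_d, {"dose:placebo"})  # Population B: D plus the control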
Although it may be interesting to compare Population A to Population B directly, they are not immediately compared; the data is pivoted first. To pivot Population A is to find the strong characteristic affinities within the data set that is Population A. By definition, Population A will have strong affinity to the characteristics Q from above; the process of pivoting reveals other characteristics with strong affinity in Population A. The resulting Population Ap includes information about how often each characteristic (aside from Q) occurs: NAi. Likewise, Population B is pivoted to produce Population Bp, which includes information about how often each characteristic (aside from Q) occurs: NBi.
A characteristic is known through two labels: a category and a value. A category may be something such as "hair color" and a value is a variation within the category, such as "red head", "blond", or "brunette". We are interested in comparing characteristic values within the same category.
Population Ap: Population A, pivoted, now including affinity counts, NAi, per characteristic.
Population Bp: Population B, pivoted, now including affinity counts, NBi, per characteristic.
Each characteristic in one population is compared to the corresponding characteristic in the other population for magnitude, expressed as a percentage within the category being compared within a population. For example, if the NAi counts within the category "hair color" are 50 for "red head", 22 for "blond", 23 for "black", and 5 for "brunette", then the percentage values PAi are straightforward to calculate: respectively 50%, 22%, 23%, and 5%. Similarly, if the NBi counts are 333 for "red head", 333 for "blond", 333 for "black", and 1 for "brunette", then the percentage values PBi are 33.3%, 33.3%, 33.3%, and 0.1%.
A magnitude Mi is then calculated for each characteristic as the ratio PAi/PBi, and its value is read as follows:
1: PAi is comparable (roughly equal) to PBi.
2: PAi occurs twice as often as PBi.
N: PAi occurs N (many more) times as often as PBi.
1/2: PAi occurs half as often as PBi; that is, PBi occurs twice as often as PAi.
1/N: PAi occurs fractionally as often as PBi; that is, PBi occurs many more times as often as PAi.
Magnitude Refinement. Although the above magnitude calculation may return the correct numerical result, it may not be reasonable. For example, using the data from above, the Mi value for "brunette" would be 5% / 0.1%, or 50. It does not seem reasonable to say that "brunette" appears 50 times more often in Population A than in Population B, especially since Population A only has 5 brunettes (NAi is 5) in total. So the magnitude Mi is adjusted using these refinements:
1. If Mi is greater than NAi and PAi is greater than PBi, then limit Mi to NAi. For example, as above, if there are only 5 brunettes, then reduce Mi down from 50 to 5, meaning brunettes appear only 5 times as often in Population A as in Population B.
2. If 1/Mi is greater than NBi and PBi is greater than PAi, then limit Mi to 1/NBi. This is the complementary condition to (1) above, for the case where Population B has only a few patients (e.g. 5 brunettes) with the characteristic.
Next, each characteristic is compared to the other characteristics within its category for variance. In other words, how does one characteristic compare with its peers? Is it much larger or much smaller? The mean and standard deviation among the characteristic percentages PAi are calculated as Cc and Sc for the category. The variance is then calculated as the distance of PAi from Cc in standard deviations, Vi = (PAi - Cc) / Sc, rounded to the nearest integer. Observations can be made from the value of Vi:
0: PAi is comparable to the mean of the values (within one standard deviation).
1: PAi occurs more often than the mean of the values (one standard deviation or more above).
When all the above calculations have been performed, the present invention generates a report of characteristics associated with Population A that differentiate it from Population B. Not all characteristics are displayed; only the characteristics that stand out, as determined by the noise reduction thresholds on magnitude and variance.
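Put together, the pivot, percentage, magnitude, and variance calculations can be sketched as follows. This continues the illustrative data layout from the earlier sketch; in particular, the stand-out thresholds at the end are an assumption, since the patent's own listing of its criteria is not reproduced in this text.

    # Sketch of the noise-reduction comparison; the thresholds are assumptions.
    from collections import Counter, defaultdict
    from statistics import mean, pstdev

    def pivot(population, exclude):
        # Count how often each characteristic occurs (the N counts), skipping Q.
        counts = Counter()
        for chars in population.values():
            counts.update(c for c in chars if c not in exclude)
        return counts

    def percentages(counts):
        # Express each value's count as a percentage of its category's total.
        by_cat = defaultdict(list)
        for char in counts:
            by_cat[char.partition(":")[0]].append(char)
        pct = {}
        for chars in by_cat.values():
            total = sum(counts[c] for c in chars)
            for c in chars:
                pct[c] = 100.0 * counts[c] / total
        return pct

    def magnitude(char, pa, pb, na, nb):
        # Mi = PAi / PBi, capped by the raw counts per the refinement step.
        m = pa.get(char, 0.0) / max(pb.get(char, 0.0), 1e-9)
        if m > na.get(char, 0) and pa.get(char, 0) > pb.get(char, 0):
            m = na[char]        # e.g. cap 50x down to 5x when A has only 5 brunettes
        if char in nb and nb[char] > 0 and m > 0 and 1.0 / m > nb[char] \
                and pb.get(char, 0) > pa.get(char, 0):
            m = 1.0 / nb[char]  # the complementary cap on the Population B side
        return m

    def variance(char, pa):
        # Vi = round((PAi - Cc) / Sc) against the characteristic's category peers.
        peers = [p for c, p in pa.items()
                 if c.partition(":")[0] == char.partition(":")[0]]
        c_c, s_c = mean(peers), pstdev(peers) or 1.0
        return round((pa[char] - c_c) / s_c)

    na, nb = pivot(pop_a, Q), pivot(pop_b, Q)
    pa, pb = percentages(na), percentages(nb)
    standouts = [c for c in pa
                 if magnitude(c, pa, pb, na, nb) >= 2 or abs(variance(c, pa)) >= 1]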
Discovery Query/Analysis
In a Discovery Query/Analysis 1500, large amounts of non-contradictory facts or observations are analyzed to discover connections between those facts/observations. In this case, the existence of a connection is a desired search result. For example, a report for a drug that addresses schizophrenia may report a side effect of weight gain. These two observations, "diagnosis:schizophrenia" and "sideeffect:weight gain", are used to search the RedOak global repository for published papers and other related material that also include these observations. This provides a researcher with precise document search results that relate directly to the clinical trial.
Data stored from learning during an Affinity Query/Analysis can be applied to predict outcomes for new or theoretical clinical trial data. For example, a drug study may be successful for Caucasian subjects but not for Asian subjects; the affinity query will learn this association of initial property (race) to outcome (success). When faced with a theoretical subject, it can then predict success or failure for the outside influencer if the race initial property is known. Further, race and diagnosis can become search terms for Discovery Query/Analysis to determine whether other research has found a similar result.
In the example interface of Fig. 6, the top pull-down selection control, Study Data 6200, allows the user to select the clinical drug trial data set. The Outcome Stage control 6100 selects which outcome to analyze when multiple time periods are present (e.g. Stage 1, Stage 2, Stage 3, and Final). The Outside Influencer control 6300 allows the user to select the drug (or any other change agent) to analyze (e.g. 5mg dose, 10mg dose, 30mg dose). The Control control 6400 allows the user to select a control population for comparison (e.g. placebo).
Clicking the Overview button 6500 produces an Affinity Query/Analysis report that generally compares the outside influencer to the control, outputting the initial properties common to the Outside Influencer population that are not common in the control group. These are the stand-outs resulting from the Noise Reduction discussed above. The Overview report also readily points out desirable outcomes and undesirable outcomes within the specific Outside Influencer population; in other words, what kinds of outcomes were observed as different from the control group.
The General Result control offers two choices, desirable outcomes and undesirable outcomes, allowing the user to review initial properties that characterize the desired outcomes (e.g. Caucasian) or the undesired outcomes (e.g. Asian). The Result Category control allows the user to select just one outcome category for analysis without the distraction of other outcomes in the report. The Specific Result control allows one specific outcome category with one specific result value to be analyzed, to characterize the initial property population for that result.
AME INTERFACE
The AME Interface is an API for interacting with the AME service or any other suitable associative memory engine. A PHP API was created to facilitate interchangeability amongst possibly different AME technologies. This API provides a subroutine per feature function. Each subroutine communicates with the AME using the AME's specific interface, such as converting its parameters into a REST operation (a specially formulated URL page request); the URL response is then translated and provided as the return value of the respective subroutine. This simplifies writing applications that use the AME or any other suitable associative memory engine, providing a layer of abstraction so that a different tool could be substituted.
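The shape of such a wrapper can be sketched as follows; the patent's layer was PHP, and the Python class, method names, and REST operations below are illustrative assumptions only, not the actual AME API.

    import json
    import urllib.parse
    import urllib.request

    class AmeClient:
        # One subroutine (method) per feature function; each converts its
        # parameters into a REST operation - a specially formulated URL
        # request - and translates the response into its return value.
        def __init__(self, base_url):
            self.base_url = base_url.rstrip("/")

        def _rest(self, operation, **params):
            url = "%s/%s?%s" % (self.base_url, operation,
                                urllib.parse.urlencode(params))
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)

        def upload_observations(self, study, observations):
            return self._rest("ingest", study=study, data=json.dumps(observations))

        def query_affinity(self, study, category):
            return self._rest("affinity", study=study, category=category)

    # Substituting a different associative memory engine means reimplementing
    # only this class; application code written against it stays unchanged.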
ANALYSIS FOR LEARNING MODULE (ARISTOTLE)
The learning module, also called the Aristotle module, builds on the affinity and discovery analysis described above. Based upon the complex relationships that the analysis produces, general rules are created.
Rules may vary by inclusion of other initial properties or outcomes. Simplification rules may determine that certain initial properties imply other initial properties; similarly, rules may determine that certain outcomes imply other outcomes. Many rules are created, representing patterns that exist in the data. Unlike the original syllogisms, these rules are not absolute but probabilistic: the terms on the left side have a sample size for which the rule holds true (for example, 50 out of 80, or 75 out of 80), and the terms on the right side similarly have a sample size (for example, 50 out of 200, or 75 out of 75). These rules are cascaded together; for example, one initial property may imply I35, which implies I49, which can be combined with agent A65 (an outside influencer) to produce outcome X32, which in turn implies outcome X67.
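One way to represent such probabilistic rules, with a sample size attached to each side, is sketched below; the rule identifiers (I12, etc.) and the numbers are illustrative stand-ins, and cascade shows the chaining described above.

    from dataclasses import dataclass

    @dataclass
    class Rule:
        antecedents: frozenset  # initial properties and/or agents on the left side
        consequent: str         # the implied property or outcome on the right side
        lhs_support: tuple      # (held, seen) on the left side, e.g. (50, 80)
        rhs_support: tuple      # (held, seen) on the right side, e.g. (50, 200)

    rules = [
        Rule(frozenset({"I12"}), "I35", (50, 80), (50, 200)),
        Rule(frozenset({"I35"}), "I49", (75, 80), (75, 75)),
        Rule(frozenset({"I49", "A65"}), "X32", (60, 80), (60, 90)),
        Rule(frozenset({"X32"}), "X67", (40, 80), (40, 60)),
    ]

    def cascade(facts, rules):
        # Repeatedly apply rules whose antecedents are satisfied, accumulating
        # implied initial properties and outcomes.
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for r in rules:
                if r.antecedents <= derived and r.consequent not in derived:
                    derived.add(r.consequent)
                    changed = True
        return derived - set(facts)

    print(cascade({"I12", "A65"}, rules))  # -> {'I35', 'I49', 'X32', 'X67'}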
By way of example, the end-to-end flow for one clinical drug trial proceeds as follows:
1. Clinical trial data is received as a series of numerical data sets, perhaps stored in SAS or MS Excel; aside from the numerical data, there may be a number of text documents.
2. The numerical data is converted to CSV, and the CSV data is then loaded into an SQL database.
3. A PHP application is created from a template to read the database tables and generate observations; it also identifies which observation categories are initial properties, outcomes, or outside influencers.
4. The Space Manager runs the PHP application to generate the observations.
5. The Space Manager uploads the observations to the AME using the API, under a study label, e.g. study 9.
6. The user accesses the Affinity Query/Analysis interface to research which outside influencers cause which outcomes, which may be triggered by specific initial properties.
7. The user accesses the Discovery Query/Analysis interface to research connections of the initial properties or outcomes to textual documents, including not only the documents from step #1, but also the RedOak global repository of documents and learned results from other clinical trials via the Aristotle Module.

Abstract

A system and method for analyzing data sets to reveal complex relationship linkages within the data sets. The present invention can be used for data analysis and mining in connection with pharmaceutical drug development analysis, diagnostic tools for healthcare professionals, patient information and decision support, related services, and/or consumer downloadable software associated with such services.

Description

TITLE OF THE INVENTION
A System And Method For Converting Large Data Sets And Other Information To Observations For Analysis To Reveal Complex Relationship Linkages
CROSS-REFERENCES TO RELATED APPLICATIONS
The present application claims the benefit of United States Provisional Application No. 61/464,873, filed March 10, 2011, the disclosure of which is hereby incorporated by reference.
FIELD OF THE INVENTION
The present invention relates to analyzing data, and more specifically to a system and method for analyzing data sets to reveal complex relationship linkages within the data sets.
OVERVIEW OF THE INVENTION
The present invention is a system and method for analyzing data sets to reveal complex relationship linkages within the data sets. In a preferred embodiment the present invention can be used for data analysis and mining in connection with pharmaceutical drug development analysis, diagnostic tools for healthcare professionals, patient information and decision support, related services, and/or consumer downloadable software associated with such services. The present invention is typically a stand-alone platform that can be used in conjunction with an associative memory engine such as SaffronSierra, or any other platform offering similar functionality, such as a natural intelligence platform that includes an associative memory engine as well as other functionality. It should be understood that references to specific software tools and platforms herein such as FreeLing, Saffron, SaffronSierra, and any others are by way of example only and are not intended to limit operation of the present invention solely to such software tools and platforms; other software applications, platforms, and tools offering similar features and functionality can also be used in connection with the present invention.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
FIG. 1 is a block diagram providing a high level overview of the various components used in connection with the present invention.
FIG 2 is a block diagram showing another perspective on the high level functionality of the present invention.
FIG 3 is a block diagram showing another perspective of the present invention.
FIG 4 is a graphic depiction of one example of the hierarchy used in hierarchical grammar text.
FIG 5 shows a depiction of the initial properties known at the beginning of a clinical drug trial according to one aspect of the present invention.
FIG 6 shows an example of an interface according to one aspect of the present invention.
The drawings are shown for illustrative purposes only and are not intended to limit the scope of the claimed invention.
DETAILED DESCRIPTION OF THE INVENTION
In one aspect, the present invention is a method of processing large data sets and other information - both numerical and textual. The data sets are typically processed into many observations, which are classified into initial properties, outcomes, and outside influencers. Numerical data is converted to observations, and then analyzed for affinity. Linkages are found between and amongst the initial properties and the outcomes that correlate to outside influencers. Textual data is analyzed and facts extracted to discover connections between numerical data observations and other textual data facts. In one aspect, the present invention can be used in connection with data gathered or generated in connection with clinical drug trials. The present invention is not limited to use in connection with clinical drug trials and can be used in connection with any other application in which large data sets have been collected or prepared for analysis. Using the present invention, new facts are learned from analysis of the numerical data, which may be applied during further analysis. In so analyzing, complex linkages are revealed that may not have been uncovered using conventional data analysis techniques.
The block diagram shown in Fig. 1 provides a high-level overview of the various components, or modules, used in connection with the present invention. As would be understood by one skilled in the art, the modules are described from a functional perspective; they can be implemented across one or more servers as appropriate and in a variety of different ways without departing from the scope of the present invention. The Data Preparation module 1100 (identified as "AMEDataPrep") takes data in any of a number of file formats and converts it into standard file formats. The Data Acquisition module 1200 takes normalized database tables and converts them into observations suitable for input to an associative memory engine ("AME") 1300 or any other platform offering similar functionality. The Data Space Manager module 1400 orchestrates the process of data acquisition, AME ingestion, data mining, and query/analysis. The Query/Analysis module 1500 uses information provided by the Space Manager 1400 (such as model orientation information) and interacts with the associative memory engine 1300 to search and analyze the clinical trial data (or other data as desired) therein.
The block diagram shown in Fig. 2 offers another perspective on the high-level functionality of the present invention. From the left of the diagram, in one example, External Public Data 2100 is available as drug study clinical trial data as well as documents (PDF, web pages, etc.) related to drug trials. Pharma Internal Data 2200 is client-specific drug study clinical trial data, as well as documents, that is kept separate from other clients' drug study data. The Emphasize box 2300 is where all gathered data is converted into a standard format. The two target formats are CSV (text spreadsheet) and plain text. Numerical data is converted from SAS, Microsoft Excel, and other numerical formats into CSV format. Textual data is converted from PDF, HTML, and other text formats into plain text format. Once in standard format, the Preprocessing stage 2400 begins, where data is categorized and elaborated into observations for ingestion to an associative memory engine. Numerical data is categorized from a raw form into category and value observations; for example, a weight of 127.5 pounds may translate into "weight:120-130". Textual data is placed in context; text such as "increased weight" or other English variations may translate to "sideeffect:weight gain".
The Exploratory box 2500 represents research with the Query/Analysis tool in conjunction with the associative memory engine. Data affinity of outcomes to initial properties and outside influencers is determined. Amongst text, facts are connected and related observations discovered. The Refinement box 2600 is a de-correlation step that boils down all the affinity and discovery information so that key data relationships stand out. This step also ties the uncovered information back to the original data for integrity and validation purposes. The Results box 2700 represents the resulting reports for client viewing. The clockwise rotating arrows represent the iterative process: when results are found, it becomes clear that more detail is needed in certain places, so pre-processing is revisited to improve the quality of the observations fed to an associative memory engine.
The diagram shown in Fig. 3 offers another perspective of the present invention and is conceptually the same as the previous diagram (the RedOak Ingestion engine 3100 is the Emphasize 2300 and Preprocessing 2400 boxes; the AME Engine 3200 is the Exploratory box 2500; the De-correlation Engine 3300 is the Refinement box 2600; the Navigation platform 3400 is the Exploratory 2500, Refinement 2600, and Results 2700 boxes), except that the Learning concept is now introduced. By analyzing clinical data, the present invention is able to characterize positive and adverse outcomes based upon initial properties and outside influencers (drugs, biologics, and devices). It learns from this data in order to predict outcomes for clinical trial subjects that are not yet in the system.
The individual components, or modules, of the present invention will now be explained in further detail. The various components can be used individually or in any combination as appropriate.
DATA PREPARATION MODULE
The Data Preparation 1100 module takes data in any of a number of file formats and converts it into standard file formats. Data is typically numerical or textual. Once converted, it is normalized for comparison to other data.
Numerical data can be provided to the present invention in many different formats, such as MS Excel, SAS7, CSV, SQL, tabular, DBF, 123, HTML, proprietary formats, and any other suitable numerical data format. Data inside these file formats is converted to a standard CSV (Comma Separated Values) file format. Once in CSV format, normalized database tables are created and the CSV data is readily uploaded into these SQL database tables.
Textual data can be provided to the present invention in many forms as well, such as PDF, HTML, plain text, MSWord, proprietary formats, and any other suitable text data format. Data inside these file formats are converted to a standard plain text file format.
Once in plain text format, the data is typically converted into a format that enables one or more utilities to scan and annotate the text for Data Acquisition processing. One such format is "HGT" (Hierarchical Grammar Text), which can be created using the NLP (Natural Language Processing) FreeLing tool, or any other suitable natural language processing tool.
HGT Utility
According to a preferred aspect of the present invention, an HGT format is used for data processing. The HGT utilities are preferably English text processing utilities that interpret English based upon NLP grammar mark-up and context. The utilities are so named because they read, process, and annotate HGT (Hierarchical Grammar Text) files, which are output by an NLP utility, specifically FreeLing or any other suitable NLP tool. An HGT file contains one or more English sentences labeled with grammar parts of speech and hierarchically structured to show which words relate to or modify which other words. An HGT file further contains annotations per word. The HGT utilities according to the present invention add those annotations.
HGT annotations are added to designate data categories. When processing a medical file, the hgtElaborate utility will annotate the word "schizophrenia" with the attribute "diagnosis:schizophrenia". It will also find a phrase such as "weight gain" and annotate it with "sideeffect:weight gain". When processing an HGT file, the hgtNaming utility will identify pronouns and short name references, which are subsequently annotated with the individual's full name (as identified in the text).
The HGT utilities work together to create a meaningful grouping of annotations, which then produce sentence-specific observations or facts. These observations or facts are input into an associative memory engine or Discovery Query/Analysis, which further enhances clinical trial research.
For example, the sentence "Subject 523 gained weight during the study" is converted to this HGT text (using the command freeling --outf dep):
sv/top/(gained gain VBD -) [
    sn-chunk/ncsubj/(523 523 Z -) [
        attrib/ncmod/(Subject subject JJ -)
    ]
    n-chunk/dobj/(weight weight NN -)
    sp-chunk/ncmod/(during during IN -) [
        sn-chunk/dobj/(study study NN -) [
            DT/det/(the the DT -)
        ]
    ]
]

Note that indenting is important to reflect hierarchical relationships, and that square brackets are optional in HGT. The first line is the root of the hierarchy; the original text was "gained"; the root verb is "gain"; the part of speech, VBD, indicates the verb was used in past tense. The hierarchy can be viewed graphically as shown in Fig. 4.
The hgtElaborate utility recognizes the pattern of "weight" as NN dependent upon "gain" as VBD and tags "weight" with "sideeffect:weight gain" for the following HGT text:

sv/top/(gained gain VBD -)
    [attribute]=people:sally
    sn-chunk/ncsubj/(523 523 Z -) [
        attrib/ncmod/(Subject subject JJ -)
    ]
    n-chunk/dobj/(weight weight NN -)
    sp-chunk/ncmod/(during during IN -)
        sn-chunk/dobj/(study study NN -)
            DT/det/(the the DT -)

where the line "[attribute]=people:sally" is a tag for the previous text because of the syntax "[keyword]=value". Multiple tags may be specified as a group, one per line. Any text line in the HGT file may be so tagged.
The technology used for Data Preparation is typically individual tools such as special purpose command line utilities. For example, various tools are used to convert numerical data into CSV format:
1. OpenOffice.org Calc - save spreadsheet as CSV from various formats
2. dsread.exe - convert a SAS7 data set to CSV
For example, these are CSV utilities that act on CSV files to prepare them for upload to an SQL database:
1. csvselect - Extract specified columns from CSV file
2. csvreplace - do a search-and-replace within a CSV file
3. csvaddquotes - add double quotes (if missing) to values that are not numbers in the input file
4. csv2sqltable - convert a CSV file into a table in the SQL database
5. csv2sql - convert a CSV file into SQL INSERT statements
6. csv2sqlload - convert a CSV file into SQL statements (a row may be new or updated based on the key(s) given)
7. csv2schema - generate an SQL CREATE TABLE command to define a new database table
8. csvdatefix - convert the data under named columns into standard database format for dates, e.g. 2010-…

By further example, these utilities are used to declare/upload text documents to the Document Library, where the document is then converted to plain text, then HGT, then marked with attributes before facts are extracted and uploaded to an associative memory engine:
1. doclibAdd - declare a text document in the document library
2. doclibUpload - upload a textual document (any format) into the document library
3. doclib2text - convert a document to plain text
4. doclib2hgt - convert plain text to HGT
5. doclib2attr - process document HGT file, adding tags
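As an illustration of the kind of single-purpose transformation these utilities perform, a condensed Python sketch in the spirit of csv2sql follows; the real utilities are separate command-line tools, and the quoting here is deliberately naive.

    import csv
    import sys

    def csv2sql(path, table):
        # Turn each CSV row into an SQL INSERT statement (naive quoting).
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                cols = ", ".join(row.keys())
                vals = ", ".join("'" + v.replace("'", "''") + "'"
                                 for v in row.values())
                yield "INSERT INTO %s (%s) VALUES (%s);" % (table, cols, vals)

    if __name__ == "__main__":
        for stmt in csv2sql(sys.argv[1], sys.argv[2]):
            print(stmt)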
DATA ACQUISITION MODULE
The Data Acquisition module 1200 takes the normalized database tables generated in the Data Preparation 1100 module and converts them into observations suitable for input to an associative memory engine 1300. This is conventionally known as ETL (Extract, Transform, and Load). In this case, Data Acquisition involves elaboration, categorization, and normalization. This process varies per client engagement or per clinical trial.
Data elaboration according to one aspect of the present invention involves data processing to create new data. This may be as simple as taking a date and calculating the years from that date to today. Or this may be more involved, such as calculating the stock market activity between two particular dates.
Data Categorization
Data categorization is the process of converting data into an observation for comparison. Whereas a data set may contain many unique numbers - such as 132.54 - categorization places a number into a category, such as "130-135", which is readily compared to other similar numbers. Categorization may cross-reference other data; for example, a specific date can be categorized by the phase of the moon on that date, e.g. "first quarter". The primary purpose of data categorization is to place data into discrete, finite buckets (or categories) for reasonable comparison.
Most data categorization is formulaic, such as placing the number 132.54 into the category "130-135" or converting a date to a moon phase category. However, some data categorization is heuristic, for example time series characterization and date range categorization. Still other data categorization is textual (not numeric), such as context categorization.
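The formulaic case can be sketched in a line or two of Python; the bucket width is an assumption for the example.

    def bucket(value, width=5):
        # Place a raw number into a fixed-width category such as "130-135".
        lo = int(value // width) * width
        return "%d-%d" % (lo, lo + width)

    print(bucket(132.54))  # -> "130-135"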
Time Series Characterization
Given a time series of data (a set of (xi, ti) data points), the present invention will typically characterize the variability of xi into categories with values (observations). The dataprop (data properties) utility, which was written to perform this categorization, takes CSV data on input with one column designated for time (ti) and another for values (xi). Five category observations are typically produced: trend, timing, trending, fluctuation, and delta.
Trend. The trend category may have a value of increased, decreased, or unchanged, based upon comparing x0 to xn.
Timing. The timing category may have a value of early, late, even, or burst, based upon when xi crosses the midpoint of change, (xn-x0)/2: if xi crosses the midpoint of change (and stays on that side) in the first half of the series (before (tn-t0)/2), then it is early. If xi crosses the midpoint of change (and stays on that side) in the second half of the series (after (tn-t0)/2), then it is late. If xi crosses the midpoint of change (and stays on that side) around the middle of the series (around (tn-t0)/2), then it is even. If xi crosses the midpoint of change but bounces back, or if xi moves beyond xn (too high or too low), then it is burst.
Trending. The trending category may have a value of inconsistent, consistent, or sudden. When the sign of the difference of xi+l-xi remains constant (always upward (+) or always downward (-)), then it is consistent. When the sign of that difference is sometimes positive and sometimes negative, then it is inconsistent. When that difference is mostly zero (within tolerance) except for an occasional difference, then it is sudden.
Fluctuation. After calculating max(xi)-min(xi) as f, convert it to a percentage as f/x0. Next, place that percentage into a category on a logarithmic scale, closest to one of these values: 0%, 1%, 2%, 4%, 8%, 16%, 32%, 64%, 100%. For example, "8%" represents 6%-12% (below 6%, a value is within 50% of "4%"; above 12%, it is beyond 50% of "8%").
Delta. The delta value is expressed as a percentage like the fluctuation value, but the percentage is calculated simply using (xn-x0)/x0. Note that the delta value may be negative, e.g. -1%, -2%, -4%, etc.
Later processing can take the dataprop output to generate an observation, such as "weight trend:increased".
Date Range Categorization
Date range categorization according to another aspect of the present invention is the method of taking two related dates as a date range and extracting observations from a time series for that date range. The purpose of the date range and the nature of the observations do not have to be related. For example, the date range may be determined by a clinical trial start date for a study subject whereas the time series may be stock market data, which is known inside the date range. The date range defines which stock market data to use as a time series. The Data Categorization algorithm explained above is then applied to the time series to extract observations such as "stock market increased during subject study date range".
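Reusing the characterize sketch above, date range categorization might look like the following; the names are again illustrative.

    def categorize_range(series, start, end, label):
        # The date range only selects the slice of the (unrelated) time series.
        window = sorted((t, x) for t, x in series if start <= t <= end)
        ts = [t for t, _ in window]
        xs = [x for _, x in window]
        trend = characterize(xs, ts)["trend"]
        return "%s %s during subject study date range" % (label, trend)

    # e.g. categorize_range(djia, trial_start, trial_end, "stock market")
    # -> "stock market increased during subject study date range"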
Context Categorization
Context categorization according to another aspect of the present invention determines a word's use or meaning based upon its context. A pile of marbles gaining weight because more marbles are being added to a bucket is not a side effect; however, for a study subject that gains weight while taking a study drug, the weight gain would be considered a side effect. Context categorization is the two-step process of (1) choosing vocabulary lists of words (or phrases) that relate to the context of the text to analyze and (2) analyzing that text and tagging it with context tags, such as "side effect", "finding", or "diagnosis".
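Both steps can be sketched directly; the vocabulary lists below are assumed examples.

    VOCAB = {
        "side effect": ["weight gain", "nausea", "headache"],
        "diagnosis": ["schizophrenia", "diabetes"],
    }

    def context_tags(text):
        # Tag the text with a context tag for each vocabulary phrase it contains.
        text = text.lower()
        return [(tag, phrase)
                for tag, phrases in VOCAB.items()
                for phrase in phrases if phrase in text]

    print(context_tags("Subject reported weight gain after a diagnosis of schizophrenia."))
    # -> [('side effect', 'weight gain'), ('diagnosis', 'schizophrenia')]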
Data Normalization
Data normalization according to another aspect of the present invention is the process of adjusting data so it can be compared to other possibly related data. This may be a simple mathematical conversion, such as inches to centimeters, or this may be more involved, such as rescaling and adjusting data from one testing score to another similar but standard score. Data normalization is critical for "apples to apples" comparison rather than "apples to oranges" comparison.
Most numerical data normalization is formulaic, such as converting 2.54 cm into 1" or converting a test score of 19 out of 20 into 95%. However, textual data normalization is not formulaic. For example, "nausea", "queezy stomach", and "upset stomach" may all be normalized to "nausea". Similarly, "gain weight", "weight gain", "increased weight", and "increasing weight" can all be normalized to "weight gain". When text is normalized, it can be compared reliably for sameness; when text is not normalized, a comparison does not suggest sameness. The English language has many synonyms, as well as phrases that mean the same thing as certain words or other phrases. Textual normalization is the process of boiling down all those similarities that are phrased differently into a canon of words/phrases for comparison.
Textual normalization is typically achieved by using hgtElaborate with configuration files that recognize various English text patterns to produce a common observation. For example:
NN=gain(,NN=weight)            [attribute]=change:increased_weight
VB=gain(,NN=weight)            [attribute]=change:increased_weight
NN=increase(,NN=weight)        [attribute]=change:increased_weight
JJ=increased,NN=weight         [attribute]=change:increased_weight
NN=weight(JJ=increasing)       [attribute]=change:increased_weight
NN=increase,IN=*(NN=weight)    [attribute]=change:increased_weight

In the above table, the left side identifies an HGT pattern and the right side is the attribute to assert when the pattern matches. For example, "NN=gain(,NN=weight)" translates to: when "gain" is used as a noun and the noun "weight" modifies it (one level below it), assert the attribute "change:increased_weight". Later, these attributes, along with other attributes from the same sentence, are joined as related observations.
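A much-simplified sketch of the pattern idea follows: match a head word with a given part of speech whose direct child matches another word/POS pair, then assert the attribute. The real utility matches the richer HGT pattern syntax shown above; this node structure is an assumption for the example.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        lemma: str
        pos: str
        children: list = field(default_factory=list)
        attributes: list = field(default_factory=list)

    def elaborate(node, head, child, attribute):
        # Tag `node` when (lemma, pos) matches `head` and some direct child
        # matches `child`; recurse over the whole tree.
        if (node.lemma, node.pos) == head and any(
                (c.lemma, c.pos) == child for c in node.children):
            node.attributes.append(attribute)
        for c in node.children:
            elaborate(c, head, child, attribute)

    # "Subject 523 gained weight ...": gain/VBD with a dependent weight/NN
    sentence = Node("gain", "VBD", [Node("523", "Z"), Node("weight", "NN")])
    elaborate(sentence, ("gain", "VBD"), ("weight", "NN"), "sideeffect:weight gain")
    print(sentence.attributes)  # -> ['sideeffect:weight gain']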
Data Acquisition
Data acquisition according to another aspect of the present invention is specific to a set of data, such as a specific clinical drug trial. When two clinical trials are highly similar (perhaps because they were conducted by the same organization), data acquisition will be highly similar as well.
There is a special case of data acquisition: preparing a proprietary database for repeated use. This is the case for creating a global repository from publicly available data. The global repository contains a voluminous variety of clinical trial data and textual documents that are publicly available from the FDA and other sites. The clinical data is converted to CSV data as previously described. The textual data is converted to HGT and processed as previously described.
The technology used for Data Acquisition is web-based, whether as PHP (PHP: Hypertext Preprocessor) scripts or as web services, and is invoked from the Data Space Manager.
DATA SPACE MANAGER MODULE
The Data Space Manager module 1400 according to another aspect of the present invention typically orchestrates the process of data acquisition, associative memory engine ingestion, data mining, and query/analysis. By either invoking PHP or using web services (SOA), the Data Space Manager module 1400 invokes Data Acquisition code modules to generate a set of observations that are properly categorized and normalized for ingestion into an associative memory engine. The generated observations are stored for later use. The same interface is used to query schema and model orientation information. Schema information identifies the types of observations so as to leverage the associative memory engine technology.
Model Orientation Information
In another aspect of the present invention, model orientation information is used to enhance the query/analysis and make it meaningful. The model referenced here is a Solution Model where observation data is typically characterized as (1) an initial property, (2) an outcome (which is further identified as desirable or undesirable), and (3) an outside influencer (or agent of change).
A Solution Model typically provides a context in which to evaluate a large data set of observations where change is present. In this Solution Model, each item of data (or observation) is typically placed into one of four groups: initial property, outcome, agent (or outside influence), and reference. This model implies that time passes and changes may occur from some initial state until one or more identified later states. An initial property is a known state from the initial state. This may also include information that is not known until later but is fixed and therefore not subject to change. An outcome is an observation of change; it is not an initial property simply restated. An agent is an observation or presence of a fact that may contribute to change. An agent may or may not actually relate to change in an outcome. A reference is not used for observation comparison but is used to tie observations to individuals or components in the system.
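A minimal sketch of the four-group Solution Model as a data structure follows, assuming a simple flat observation record; the field names are illustrative, not the schema used by the present invention.

```python
from dataclasses import dataclass
from enum import Enum

class Group(Enum):
    INITIAL_PROPERTY = "initial property"   # known/fixed state
    OUTCOME = "outcome"                     # an observed change
    AGENT = "agent"                         # outside influence
    REFERENCE = "reference"                 # ties observations to an individual

@dataclass
class Observation:
    reference: str   # e.g., a patient id (the pivot point)
    category: str    # e.g., "weight change"
    value: str       # e.g., "increased"
    group: Group

observations = [
    Observation("patient_42", "race", "caucasian", Group.INITIAL_PROPERTY),
    Observation("patient_42", "dose", "30mg", Group.AGENT),
    Observation("patient_42", "weight change", "increased", Group.OUTCOME),
]
```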
The diagram of Fig. 5 shows Initial Properties 5100 known at the beginning of a clinical drug trial. These may be tested, measured, or observed facts, with more detailed facts preferred over less detailed ones. During the course of a clinical drug trial, the subject may be further tested, measured, or observed. Changes over time in any of those tests, measurements, or observations are identified as Outcomes 5200. Outside Influences 5300 are anything present during the study period that was not present before the study and that may - in any way - influence outcomes. The present invention seeks to show, given a set of initial properties and outside influences, which of those outside influences affect outcomes, perhaps as characterized by (tied to) certain initial properties.
An outcome is not simply a category such as "weight change"; an outcome includes instances of that category, such as increased, decreased, or unchanged. Every outcome instance is judged and labeled as desired, undesired, or neutral. This labeling is important for reporting. By knowing which outcomes are desired and which are not, the Query/Analysis software groups and judges outcomes.
The Space Manager 1400 handles AME ingestion by uploading the observations using an API offering suitable functionality. For example, it uses a PHP API which in turn uses a web services REST API. The Space Manager can be used for data mining since it stores all the observational data. Data mining is useful for verifying data integrity. The Space Manager 1400 interfaces directly with the Query/Analysis module.
QUERY / ANALYSIS MODULE
In another aspect of the present invention, the Query/Analysis module 1500 uses information provided by the Space Manager 1400 (such as model orientation information) and interacts with the natural intelligence platform technology to search and analyze the clinical trial data therein. One of two algorithms is used: affinity or discovery. Affinity analysis enables learning patterns so predictions can be made on new clinical trial data.
Affinity Query/Analysis
During Affinity Query/Analysis, large amounts of potentially contradictory data are analyzed to reveal trends where certain initial data shows an affinity to certain outcomes when a particular outside influencer is present. A noise reduction algorithm is typically used in connection with this Affinity Query/Analysis. The following description provides an example of such an algorithm.
Noise Reduction Summary
When the present invention is used to analyze clinical drug trial data, it finds the most outstanding correlations in the data without explicit instructions on what to compare. These findings are the result of study population comparisons that employ a noise reduction algorithm to reach reasonable conclusions rather than strained, fringe statistical correlations. Following is an explanation of the Noise Reduction Algorithm that produces those conclusions.
Background
Data from database tables of clinical trial patient information is uploaded to the natural intelligence platform, a network database with strong associative analysis capabilities. Data characteristics are divided into three groups: initial properties, outside influencers, and outcomes. The outcomes group is further divided into two sub-groups: desirable outcomes and undesirable outcomes. All data is tied to key indexes (or pivot points) such as patient id and site id. External data is gathered to the extent that it relates to patients or sites and is uploaded as additional characteristics tied to the key indexes.
Population D: The set of all patients that have key characteristics (Q) to examine.
Population A: The subset of Population D that additionally associates to an outside influence characteristic.
Population B: The subset of Population D that additionally associates to a control characteristic.
By uploading all this data, each patient id is associated to all information that relates to that patient. Some information - such as initial weight - is common between multiple patients. This is expected and desirable to find common groupings of patients.
Choosing Relevant Data (Pre-Pivot) Population
The Noise Reduction Algorithm compares two populations with respect to some relevant data. For example: hair color. When comparing the two populations, is one hair color more prevalent than another? Or is it all evenly distributed?
To determine the two populations, one or more relevant items of data are chosen to examine. This may be a single item or a list of items (such as a list of desirable outcomes). Given this relevant data, a population of patients can be found. This is Population D. It may be something such as "all red-heads" or "all red-heads scoring high".
We are interested in outcomes with respect to a specific outside influence. Population D is then reduced to a subset of patients associated with a specific outside influence, called Population A. An example outside influence is a 30mg dose of study drug.
We are interested in comparing Population A to a control population, which is usually the placebo group. Population D is likewise reduced to a different subset containing only patients in the control group, called Population B.
In summary:
Population D: all patients having the key characteristics Q under examination.
Population A: the subset of Population D associated with a specific outside influence (e.g., a 30mg dose of study drug).
Population B: the subset of Population D in the control group (e.g., placebo).
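Assuming observations are stored as simple (patient id, category, value) tuples, the population selection can be sketched with set operations; the category and value names are invented for illustration.

```python
# Observations as (patient_id, category, value) tuples; illustrative data only.
obs_store = [
    ("p1", "outcome:score", "high"), ("p1", "agent:dose", "30mg"), ("p1", "initial:hair", "red"),
    ("p2", "outcome:score", "high"), ("p2", "agent:dose", "placebo"), ("p2", "initial:hair", "blond"),
    ("p3", "outcome:score", "low"),  ("p3", "agent:dose", "30mg"), ("p3", "initial:hair", "red"),
]

def population(observations, category, value):
    """All patient ids whose observations include category=value."""
    return {pid for pid, cat, val in observations if cat == category and val == value}

D = population(obs_store, "outcome:score", "high")       # key characteristic Q
A = D & population(obs_store, "agent:dose", "30mg")      # outside influence subset
B = D & population(obs_store, "agent:dose", "placebo")   # control subset
print(D, A, B)   # {'p1', 'p2'} {'p1'} {'p2'}
```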
Comparison Populations (Pivoting)
Although it may be interesting to compare Population A to Population B, they are not immediately compared. The data is pivoted first. To pivot Population A is to find the strong characteristic affinities within the data set that is Population A. By definition, Population A will have strong affinity to the characteristics Q from above. The process of pivoting reveals other characteristics with strong affinity in Population A. The resulting Population Ap includes information about how often each characteristic (aside from Q) occurs: NAi.
Similarly, Population B is pivoted to produce Population Bp, which includes information about how often each characteristic (aside from Q) occurs: NBi.
Each characteristic is known through two labels - a category and a value. A category may be something such as "hair color" and a value is a variation within the category such as "red head", "blond" or "brunette". We are interested in comparing characteristic values within the same category.
In summary:
Population Ap: Population A, pivoted, now including affinity counts, NAi, per characteristic.
Population Bp: Population B, pivoted, now including affinity counts, NBi, per characteristic.
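Continuing the illustrative tuples from the previous sketch, pivoting reduces to counting characteristics among a population's members while excluding the defining characteristic Q.

```python
from collections import Counter

def pivot(observations, members, exclude_category):
    """Count how often each characteristic occurs among a population's
    members, skipping the defining characteristic Q itself."""
    counts = Counter()
    for pid, cat, val in observations:
        if pid in members and cat != exclude_category:
            counts[(cat, val)] += 1
    return counts

# Population Ap with affinity counts NAi, and Bp with counts NBi:
NA = pivot(obs_store, A, "outcome:score")
NB = pivot(obs_store, B, "outcome:score")
```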
Within Category Comparison
All characteristics within a category are compared (1) between Population Ap and Population Bp and (2) among the characteristics of Population Ap.
Magnitude Comparison. When comparing Population Ap to Population Bp, each characteristic's count is first expressed as a percentage within its category, and those percentages are then compared across the two populations. For example, if the NAi counts within the category "hair color" are 50 for "red head", 22 for "blond", 23 for "black", and 5 for "brunette", then the percentage values are straightforward to calculate: respectively 50%, 22%, 23%, and 5%. These percentages are PAi. Similarly, if the NBi counts are 333 for "red head", 333 for "blond", 333 for "black", and 1 for "brunette", then the percentage values, PBi, are similarly straightforward to calculate: 33.3%, 33.3%, 33.3%, and .1%.
The magnitude is compared by the ratio of PAi to PBi; the result of PAi / PBi is Mi. Observations can be made from the value of Mi:
Mi	Observation
1	PAi is comparable (roughly equal) to PBi.
2	PAi occurs twice as often as PBi.
3	PAi occurs three times as often as PBi.
n	PAi occurs n (many more) times as often as PBi.
.5	PAi occurs half as often as PBi; or: PBi occurs twice as often as PAi.
.333	PAi occurs a third as often as PBi; or: PBi occurs three times as often as PAi.
1/n	PAi occurs fractionally as often as PBi; or: PBi occurs many more times as often as PAi.
Magnitude Refinement. Although the above magnitude calculation may return the correct numerical result, it may not be reasonable. For example, using the data from above, the Mi value for "brunette" would be 5% / .1%, or 50. It does not seem reasonable to say that "brunette" appears 50 times more often in Population A than in Population B, especially since Population A has only 5 brunettes (NAi is 5) in total. So, the magnitude Mi is adjusted using these refinements:
1. If Mi is greater than PAi and PAi is greater than PBi, then limit Mi to PAi. For example, as above, if there are only 5 brunettes, then reduce Mi from 50 down to 5, meaning brunettes appear only 5 times as often in Population A as in Population B.
2. If 1/Mi is smaller than 1/PBi and PBi is greater than PAi, then limit Mi to 1/PBi. This is the complementary condition to (1) above, where Population B has only 5 brunettes.
3. If PBi is 0 (causing a divide-by-zero condition), then Mi is assigned the value 20, meaning PAi occurs 20+ times more often than PBi. (Values above 20 are not differentiated from 20 itself.)
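The magnitude calculation with its three refinements can be sketched as follows; note the assumption, taken from the example above, that percentages are handled as plain numbers (5 for 5%) when Mi is compared to PAi and PBi.

```python
def magnitude(pa, pb):
    """Mi = PAi / PBi with the three refinements described above.
    pa and pb are percentages expressed as numbers (e.g., 5 for 5%)."""
    if pb == 0:
        return 20.0                 # rule 3: cap the divide-by-zero case at 20
    m = pa / pb
    if m > pa and pa > pb:
        m = pa                      # rule 1: limit Mi to PAi
    if m != 0 and 1 / m < 1 / pb and pb > pa:
        m = 1 / pb                  # rule 2: the complementary condition
    return m

print(magnitude(5, 0.1))   # 5.0 rather than the unreasonable 50.0
```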
Variance Comparison. When comparing values inside a category for Population Ap, each characteristic is compared to the other characteristics for variance within the category. In other words, how does one characteristic compare with its peers? Is it much larger or much smaller? The mean and standard deviation among the characteristic percentages PAi are calculated as Cc and Sc for the category. The variance is then calculated as the distance of PAi from Cc in standard deviations, or Vi = (PAi - Cc) / Sc, rounded to the nearest integer. Observations can be made from the value of Vi:
Vi	Observation
0	PAi is comparable to the mean of the values (within one standard deviation).
1	PAi occurs more often than the mean of the values (one standard deviation or above).
2	PAi occurs much more often than the mean of the values (two standard deviations or above).
-1	PAi occurs less often than the mean of the values (one standard deviation or below).
-2	PAi occurs much less often than the mean of the values (two standard deviations or below).
In summary:
Population A characteristic counts: NAi
Population B characteristic counts: NBi
Population A characteristic percentages within a category: PAi
Population B characteristic percentages within a category: PBi
Characteristic magnitude: Mi = PAi / PBi
When Mi > PAi and PAi > PBi, then Mi = PAi.
When 1/Mi < 1/PBi and PBi > PAi, then Mi = 1/PBi.
Mean of values within a category: Cc
Standard deviation of values within a category: Sc
Characteristic variance: Vi = (PAi - Cc) / Sc
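A short sketch of the variance computation summarized above, assuming the sample standard deviation; the input list reproduces the hair-color percentages from the earlier example.

```python
from statistics import mean, stdev

def variance_scores(percentages):
    """Vi = (PAi - Cc) / Sc, rounded to the nearest integer, for each
    characteristic percentage within one category."""
    cc = mean(percentages)    # Cc for the category
    sc = stdev(percentages)   # Sc for the category
    return [round((p - cc) / sc) for p in percentages]

# Hair-color percentages from the example: one clear stand-out each way.
print(variance_scores([50, 22, 23, 5]))   # -> [1, 0, 0, -1]
```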
Noise Reduction
When all of the above calculations have been performed, the present invention generates a report of characteristics associated with Population A that differentiate it from Population B. Not all characteristics are displayed; only the characteristics that stand out. The stand-outs are determined using the following noise reduction algorithm:
1. Display a characteristic as present (affinity) if Mi >= 2.
2. Display a characteristic as strongly present if Mi >= 20.
3. Display a characteristic as absent (no affinity) if Mi <= .5.
4. Display a characteristic as strongly absent if Mi <= .05.
5. Display a characteristic as present if Vi >= 1.
6. Display a characteristic as absent if Vi <= -1.
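The six display rules can be sketched as a single filter; the precedence applied when several rules match (the strongest label wins) is an assumption, as the ordering is not specified above.

```python
def stand_out(m, v):
    """Apply the six display rules to one characteristic; return a label,
    or None if the characteristic is filtered out as noise."""
    if m >= 20:
        return "strongly present"
    if m >= 2 or v >= 1:
        return "present"
    if m <= 0.05:
        return "strongly absent"
    if m <= 0.5 or v <= -1:
        return "absent"
    return None   # noise: not displayed in the report

print(stand_out(5.0, 1))   # 'present'
print(stand_out(1.0, 0))   # None
```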
Discovery Query/Analysis
During Discovery Query/Analysis 1500, large amounts of non-contradictory facts or observations are analyzed to discover connections between those facts/observations. In this case, the existence of a connection is a desired search result. For example, a report for a drug that addresses schizophrenia may report a side effect of weight gain. These two observations, "diagnosis:schizophrenia" and "sideeffect:weightgain", are used to search the RedOak global repository for published papers and other related material that also include these observations. This provides a researcher with precision document search results that relate directly to the clinical trial.
Data stored from learning during an Affinity Query/Analysis can be applied to predict outcomes for new or theoretical clinical trial data. For example, a drug study may be successful for Caucasian subjects but not for Asian subjects; the affinity query will learn this association of initial property (race) to outcome (success). When faced with a theoretical subject, it can then predict success or failure for the outside influencer if the race initial property is known. Further, race and diagnosis can become search terms for Discovery Query/Analysis to determine whether other research has found a similar result.
Easy Affinity Query/Analysis Interface
An example of an interface according to the present invention is shown in Fig. 6. The top pull-down selection control, Study Data 6200, allows the user to select the clinical drug trial data set. The Outcome Stage control 6100 selects which outcome to analyze (when multiple time periods are present, e.g., Stage 1, Stage 2, Stage 3, and Final). The Outside Influencer control 6300 allows the user to select the drug (or any other change agent) to analyze (e.g., 5mg dose, 10mg dose, 30mg dose). The Control control 6400 allows the user to select a control population for comparison (e.g., placebo). Clicking the Overview button 6500 produces an Affinity Query/Analysis report that generally compares the outside influencer to the control, outputting the initial properties common to the Outside Influencer population that are not common in the control group. These are the stand-outs resulting from the Noise Reduction discussed above. The Overview report also readily points out desirable outcomes and undesirable outcomes within the specific Outside Influencer population; in other words, what kinds of outcomes were observed as different from the control group. The General Result control offers two choices - desirable outcomes and undesirable outcomes - allowing the user to review the initial properties that characterize the desired outcomes (e.g., Caucasian) or the undesired outcomes (e.g., Asian). The Result Category control allows the user to select just one outcome category for analysis without the distraction of other outcomes in the report. The Specific Result control allows one specific outcome category with one specific result value to be analyzed to characterize the initial property population for that result.
ASSOCIATIVE MEMORY ENGINE MODULE INTERFACE
The AME Interface is an API for interacting with the AME service or any other suitable associative memory engine. A PHP API was created to facilitate interchangeability among possibly different AME technologies. This API provides one subroutine per feature function. Each subroutine communicates with the AME using the AME's specific interface, such as converting its parameters into a REST operation (a specially formulated URL page request); the URL response is then translated and provided as the return value of the respective subroutine. By defining such an API, writing applications that use the AME or any other suitable associative memory engine is simplified, providing a layer of abstraction so that a different tool could be substituted.
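Such an abstraction layer can be sketched in Python rather than PHP; the endpoint path, operation names, and JSON response format below are hypothetical, since the actual AME REST interface is not specified herein.

```python
import json
import urllib.parse
import urllib.request

class AssociativeMemoryEngine:
    """One subroutine per feature function, each translated into a REST
    operation (a specially formulated URL page request)."""

    def __init__(self, base_url):
        self.base_url = base_url

    def _rest(self, operation, **params):
        # Convert the subroutine's parameters into a REST URL request.
        url = f"{self.base_url}/{operation}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url) as response:
            # Translate the URL response into the subroutine's return value.
            return json.load(response)

    def ingest(self, observations):
        return self._rest("ingest", data=json.dumps(observations))

    def query(self, **criteria):
        return self._rest("query", **criteria)

# Callers depend only on this class, so a different AME tool can be
# substituted by reimplementing it without touching the applications.
```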
ANALYSIS FOR LEARNING MODULE (ARISTOTLE)
Taking its cue from the syllogisms of the Greek philosopher Aristotle, the learning module, also called the Aristotle module, builds on the affinity and discovery analysis described above. Based upon the complex relationships that the analysis produces, general rules are created. For example, these rules may be created:
I1 and A1 [50/80] → X1 [50/200]	Initial property I1 co-present with agent A1 implies outcome X1
I1 and A1 [75/80] → !X1 [75/75]	Initial property I1 co-present with agent A1 implies no outcome X1
Rules may vary by the inclusion of other initial properties or outcomes. Simplification rules may determine that certain initial properties imply other initial properties. Similarly, rules may determine that certain outcomes imply other outcomes. Many rules are created, representing patterns that exist in the data. Unlike the original syllogisms, these rules are not absolute but probabilistic: the terms on the left side have a sample size for which the rule holds true. In the example above, sample sizes of 50 out of 80 and 75 out of 80 were expressed. The terms on the right side similarly have a sample size (shown above as 50 out of 200 and 75 out of 75). These rules are cascaded together. For example, I1 may imply I35, which implies I49, which can be combined with agent A65 (outside influencer) to produce outcome X32, which also implies outcome X67. These rules might be:
I1 [100/100] → I35 [100/101]
I35 [101/101] → I49 [101/101]
I49 and A65 [30/30] → X32 [30/50]
X32 [50/50] → X67 [50/51]
Note that these rules also include the sample sizes for observed occurrences. These sample sizes are used to compute the likelihood that the presence of I1 and A65 causes X67.
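The cascade can be sketched as forward chaining over rules that carry their sample sizes; the way per-rule ratios are combined into a likelihood below (multiplying right-side ratios along the chain) is an illustrative assumption, as the present invention does not fix a particular combination formula here.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    lhs: tuple    # terms that must all be present, e.g., ("I49", "A65")
    rhs: str      # implied term, e.g., "X32"
    lhs_n: tuple  # left-side sample size where the rule held, e.g., (30, 30)
    rhs_n: tuple  # right-side sample size, e.g., (30, 50)

RULES = [
    Rule(("I1",), "I35", (100, 100), (100, 101)),
    Rule(("I35",), "I49", (101, 101), (101, 101)),
    Rule(("I49", "A65"), "X32", (30, 30), (30, 50)),
    Rule(("X32",), "X67", (50, 50), (50, 51)),
]

def cascade(facts):
    """Forward-chain the rules, multiplying per-rule ratios to estimate the
    likelihood of each derived term (a simplifying assumption)."""
    derived = {f: 1.0 for f in facts}
    changed = True
    while changed:
        changed = False
        for r in RULES:
            if all(t in derived for t in r.lhs) and r.rhs not in derived:
                p = min(derived[t] for t in r.lhs) * (r.rhs_n[0] / r.rhs_n[1])
                derived[r.rhs] = p
                changed = True
    return derived

print(cascade({"I1", "A65"}))   # X67 is reached through the cascade
```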
By building up a library of rules based on all the analyzed data sets (where terms in rules are normalized to hold meaning across those data sets), the system learns relationships that can then be applied to new data sets. By applying this learning, more and richer information can be computed from new data sets.
EXAMPLE OF OPERATION
Following is an example of operation of the present invention. This example is for illustration only and should not be interpreted to limit the operation and functionality of the present invention.
1. At the start, we are provided with data. Clinical trial data is a series of numerical data sets, perhaps stored in SAS or MS Excel. Aside from numerical data, there may be a number of text documents.
2. All numerical data is converted into CSV format using various command line utilities.
   1. The CSV data is then loaded into an SQL database.
   2. Further command line utilities operate to elaborate the data in the database.
   3. A PHP application is created from a template to read the database tables and generate observations.
   4. The PHP application also identifies which observation categories are initial properties, outcomes, or outside influencers.
   5. The Space Manager runs the PHP application to generate the observations.
   6. The Space Manager uploads the observations to AME using the API.
3. All textual data is loaded into the Document Library.
   1. Each document is subsequently converted to plain text.
   2. Each document is converted to HGT.
   3. Each document is tagged with attributes using HGT command line utilities and configuration files. As a result, text is elaborated and normalized into attributes.
   4. The resulting attributes are then processed to generate observations, which are subsequently processed by the Space Manager.
   5. The Space Manager uploads the observations to AME using the API under a study label, e.g., study 9.
4. The user accesses the Affinity Query/Analysis interface to research which outside influencers cause which outcomes, which may be triggered by specific initial properties.
5. For valuable research observations, the user tells the interface to learn so it remembers what characterizes desirable and undesirable outcomes.
6. The user accesses the Discovery Query/Analysis interface to research connections of the initial properties or outcomes to textual documents, including not only the documents from step #1, but also the RedOak global repository of documents and learned (other) clinical trial results from the Aristotle Module.
7. Note that the global repository is created by following steps #1-#6 for publicly available clinical trial data and publicly available documents.
The foregoing examples of the present invention are for illustrative purposes only and are not intended to limit the architecture or application of the present invention. Unless expressly indicated herein, the invention is not limited to the specific embodiments described, nor is the functionality and application of the invention limited to the described features. Those skilled in the art will appreciate that numerous modifications and variations may be made to the above disclosed embodiments without departing from the spirit and scope of the present invention.

Claims

What is claimed is:
1. A system for processing data, comprising:
a data preparation module adapted to receive a data set;
a data acquisition module adapted to receive one or more normalized database tables prepared at least partly from said data set and to convert said tables into observations;
an associative memory engine that receives said observations; and
a data space manager module adapted to coordinate the interaction between said data preparation module, said data acquisition module, and said associative memory engine.