WO2014064584A1

WO2014064584A1 - Comparative analysis and interpretation of genomic variation in individual or collections of sequencing data

Info

Publication number: WO2014064584A1
Application number: PCT/IB2013/059421
Authority: WO
Inventors: Angel Janevski; Sitharthan Kamalakaran; Nilanjana Banerjee; Vinay Varadan; Nevenka Dimitrova
Original assignee: Koninklijke Philips N.V.
Priority date: 2012-10-23
Filing date: 2013-10-17
Publication date: 2014-05-01

Abstract

An in silico test is developed for assessing likelihood that a patient is in a clinical situation under test. The development includes generating feature vectors representing subjects of a set of subjects wherein the feature vectors include feature values derived from molecular marker data, performing feature reduction on the feature vectors to generate reduced-dimensionality feature vectors, and identifying a set of probative features and feature values for the probative features that are indicative of the clinical situation under test based on comparison of the reduced-dimensionality feature vectors with a reference data set including feature values representing subjects identified as being in the clinical situation under test. An input feature vector is generated from data including molecular marker data acquired from a person to be tested. The in silico test is performed by comparing the input feature vector with the feature values for the probative features that are indicative of the clinical situation under test.

Description

COMPARATIVE ANALYSIS AND INTERPRETATION OF GENOMIC VARIATION IN INDIVIDUAL OR COLLECTIONS OF SEQUENCING DATA

The following relates to the genetic analysis arts, medical arts, and to applications of same such as the medical arts including oncology arts, veterinary arts, and so forth.

Genetic testing typically employs standard tests that have been developed and validated for diagnosing particular medical conditions, for assessing whether a particular therapy is indicated, or for other clinical purposes. For example, the Oncotype DX® test (available from Genomic Health, Inc., Redwood City, CA, USA) measures the levels of 21 molecular markers that have been clinically validated as being probative of breast cancer. Another advanced breast cancer test, the MammaPrint® test, combines 70 molecular measurements into a prognostic marker. Various treatments, for example a regimen combining chemotherapy and tamoxifen, may be ordered based on the results of such tests.

The focus of such tests on specific molecular markers can result in loss of context, that is, other available information, for example other sequencing data in the case of a whole genome sequence (WGS), is not utilized. Further, a molecular marker test typically includes a precise specification regarding acquisition of the genetic markers, and equivalent molecular data acquired by another approach (e.g., a different sequencing technology, or an entirely different technology such as gene expression measurement rather than sequencing data) is usually not useable in the molecular marker test. Still further, existing molecular marker tests typically answer a specific clinical question, and may therefore miss other relevant clinical implications of the analyzed molecular markers (possibly in combination with other available markers that were not analyzed by the molecular marker test).

Existing molecular marker tests are also difficult to update as new clinical studies are published, and accordingly may not accurately reflect the most current clinical knowledge. To the contrary, the molecular marker test may actually have been developed some years ago on the basis of a relatively small clinical sampling of patients. The following contemplates improved apparatuses and methods that overcome the aforementioned limitations and others.

According to one aspect, an apparatus comprises an electronic data processing device configured to perform a method including: generating feature values for a set of features from data including molecular marker data acquired from a set of subjects to generate feature vectors representing the subjects; deriving a sub-set of discriminative features from the feature vectors and representing the subjects using reduced-dimensionality feature vectors with the sub-set of discriminative features; and identifying a set of probative features and feature values for the probative features that are indicative of a clinical situation under analysis. The identifying is based on comparison of the reduced-dimensionality feature vectors with feature values representing subjects in one or more subject populations that include subjects identified as being in the clinical situation under analysis. The method performed by the electronic data processing device may further comprise: generating input feature values for the set of features from data including molecular marker data acquired from a person of interest to generate an input feature vector; performing a comparative analysis that computes a likelihood that the person of interest is in the clinical situation under analysis by comparing the input feature values with the feature values for the probative features that are indicative of the clinical situation under analysis; and displaying a result of the comparative analysis including at least an indication of the computed likelihood.

According to another aspect, a method comprises: developing an in silico test for assessing likelihood that a patient is in a clinical situation under test by performing operations including (1) generating feature vectors representing subjects of a set of subjects wherein the feature vectors include feature values derived from molecular marker data and (2) performing feature reduction to generate reduced-dimensionality feature vectors representing the subjects and (3) identifying a set of probative features and feature values for the probative features that are indicative of the clinical situation under test based on comparison of the reduced-dimensionality feature vectors with a reference data set including feature values representing subjects identified as being in the clinical situation under test; generating an input feature vector from data including molecular marker data acquired from a person to be tested; performing the in silico test by comparing the input feature vector with the feature values for the probative features that are indicative of the clinical situation under test; and displaying a result of the performed in silico test. The developing and generating operations are performed by an electronic data processing device.

According to another aspect, a non-transitory storage medium stores instructions executable by an electronic data processing device to perform the method set forth in the immediately preceding paragraph.

One advantage resides in more holistic use of available genetic data for patient assessment.

Another advantage resides in leveraging unlabeled subject data to enhance the usefulness of clinical study results.

Numerous additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description.

The invention may take form in various components and arrangements of components, and in various process operations and arrangements of process operations. The drawings are only for the purpose of illustrating preferred embodiments and are not to be construed as limiting the invention.

FIG. 1 diagrammatically shows a system for developing an in silico test for assessing likelihood that a patient is in a clinical situation under test.

FIG. 2 diagrammatically shows a system for applying the in silico test developed by the system of FIGURE 1.

FIG. 3 diagrammatically shows a suitable embodiment of the probative features and feature values selection module and enrichment module of the system of FIGURE 1.

FIG. 4 diagrammatically shows an example of the feature extraction/annotation.

FIGS. 5-8 diagrammatically show some illustrative examples of the feature extraction/annotation approach of FIGURE 4.

FIG. 9 diagrammatically shows a consensus alignment of variations.

FIGS. 10-13 diagrammatically show an example of identifying probative variations based on feature vector subsets generated by clustering of feature vectors.

FIG. 14 diagrammatically shows an approach for identifying probative variations based on feature vector subsets generated by clustering of feature vectors.

FIG. 15 diagrammatically shows a processing sequence including identification of a subset of probative features followed by enrichment including determining an associated clinical implication and corresponding clinical decision.

FIG. 16 diagrammatically shows performing a plurality of comparative analyses each providing one or more clinical implications.

FIGS. 17 and 18 diagrammatically show another illustrative example.

With reference to FIGURE 1, a system is described for developing an in silico test for assessing likelihood that a patient is in a clinical situation under test or analysis. The clinical situation under test can be substantially any type of clinical situation that is expected to manifest in molecular marker data. Some examples include: a cancer of a specified organ or tissue (e.g., breast cancer, lung cancer, leukemia or other blood cancers, or so forth), various genetic disorders, and so forth. In some cases the clinical situation under test is considered to be hierarchical, that is, a more particular clinical situation under test may be encompassed by the (broader) clinical situation under test. For example, if the situation under test is cancer of a specified organ or tissue, then the more particular situation under test may be a particular type of cancer of the specified organ or tissue. In such a hierarchy, it will be understood that there may be more than one "more particular" situation under test, e.g. several different types of cancer of the specified organ or tissue may be under test or analysis. The in silico test development is based on a set of subjects from whom data including at least molecular marker data are derived. In illustrative FIGURE 1, one illustrative subject 4 undergoes a procedure in a sample extraction laboratory 6 to extract an oral swab, biopsy sample, or other tissue sample 10 (diagrammatically indicated in FIGURE 1 by a vial, but suitably may be carried by a slide or other suitable tissue sample container or support) that is processed by a sequencer apparatus 14 to generate sequencing reads. The sequencer apparatus 14 may be a next generation sequencing (NGS) apparatus or a more conventional sequencing apparatus such as a Sanger sequencing facility. The sequencer apparatus 14 may in some embodiments be a commercial sequencing apparatus such as are available from Illumina, San Diego, CA, USA; Knome, Cambridge, MA, USA; Ion Torrent Inc., Guilford, CT, USA; or other NGS system vendors; however, a noncommercial or custom-built sequencer is also contemplated. The sequencing reads are suitably filtered to remove duplicate reads or reads of unacceptable base quality score, and the remaining reads are processed by a sequence alignment and annotation module 16 to generate aligned (and optionally annotated) sequencing data. The alignment can be de novo alignment of overlapping portions of sequencing reads, and/or can include mapping of the sequencing reads to a reference sequence (e.g., a human reference sequence) while allowing for a certain fraction (e.g., 5-10%) of base mismatches.
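The read filtering step mentioned above can be illustrated with a minimal sketch. The text does not specify how duplicates or unacceptable quality are defined, so the following assumes that a duplicate is a read with an identical base sequence and that quality is judged by the mean of per-base Phred-style scores against a hypothetical threshold of 20. The function name `filter_reads` and the input layout are illustrative only.

```python
from statistics import mean

def filter_reads(reads, min_mean_quality=20):
    """Drop duplicate sequences and reads whose mean base quality is too low.

    `reads` is an iterable of (sequence, qualities) pairs, where `qualities`
    is a list of per-base quality scores. Both the duplicate rule (identical
    sequence) and the threshold are assumptions made for illustration.
    """
    seen = set()
    kept = []
    for sequence, qualities in reads:
        if sequence in seen:
            continue  # an identical sequence was already kept
        if not qualities or mean(qualities) < min_mean_quality:
            continue  # unacceptable base quality
        seen.add(sequence)
        kept.append((sequence, qualities))
    return kept

reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACGT", [35, 35, 35, 35]),  # duplicate of the first read
    ("TTGA", [5, 10, 5, 10]),    # low quality
    ("GGCC", [25, 25, 20, 20]),
]
filtered = filter_reads(reads)
```

In this sketch a low-quality read does not block a later high-quality copy of the same sequence, because a sequence is recorded as seen only once it has been kept.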

The resulting aligned sequence provides substantial information about the subject 4, especially if a WGS was obtained. Additionally, other information may be obtained about the subject 4. For example, other molecular marker data may be acquired by proteomic analysis using a microarray or other process. Data other than molecular marker data may optionally also be obtained, such as test data from non-molecular marker tests (e.g., imaging studies, histopathology tests, or so forth). Such data may, for example, be stored in and retrieved from an electronic patient record 18. (As indicated by a dashed arrow in FIGURE 1, in some embodiments the WGS or other aligned sequence may also be stored in and retrieved from the electronic patient record 18.)

Collectively, the (processed) genetic sequencing data output by the alignment/annotation module 16 and other patient data for the subject 4 constitutes a large knowledge base for the subject 4. However, the data are generally in different formats. A features extraction module 20 receives the data and constructs a feature vector for the subject 4. The elements of the feature vector store feature values for a set of features, where the feature values for the feature vector representing the subject 4 are generated from data, including at least molecular marker data, acquired from the subject 4. In some embodiments the features are all binary features. Such binary features can store substantial data - for example, a binary feature value for a single nucleotide variant (SNV) may store a "1" if the variant is present in the WGS of the subject 4 and a "0" if the variant is not present. An imaging study can be represented by one or more binary features indicating whether the study identified potentially malignant lesions. A histopathology study can be represented by binary values indicating whether a positive ("1") or negative ("0") result was obtained. While binary features are computationally convenient, the feature vector may additionally or alternatively include other types of features, such as integer values (e.g., different values may be used to represent different possible SNV for a given gene location), or text or symbol values (e.g., a feature representing an image test may include a letter value where different letters correspond to different test outcomes), or so forth.

With continuing reference to FIGURE 1, the in silico test development is based on a set of subjects, of which the illustrative subject 4 is a single illustrative example. Other subjects may be processed by the same laboratory 6 and components 14, 16, 20 to generate a feature vector for each subject. Additional data may optionally be obtained from one or more external databases 22. The feature vectors representing all subjects of the set of subjects are suitably in the same "vector space", in the sense that each feature vector has the same elements in the same order representing the same features of the subject. (For example, if the fifth element of the feature vector for subject 4 represents a particular SNV, then the fifth element of each feature vector representing a subject of the set of subjects should represent that particular SNV using the same representation format.) However, it is contemplated that not all features of the feature vector may be available for every subject. For example, some subjects may have undergone imaging studies while others may not. In such a case, the vector element includes a designated value (e.g., a NULL value) that indicates the feature is not available for that subject. The output of the feature extraction module 20 applied to all subjects of the set of subjects is a set of feature vectors 24 representing the subjects.
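As a rough sketch of the common vector space described above, the following maps per-subject observations onto a fixed feature order, using binary values where data exist and `None` as the designated NULL value where they do not. The feature names, the dictionary layout, and the function `to_feature_vector` are hypothetical and chosen only for illustration.

```python
# Hypothetical feature order shared by every subject (the "vector space").
FEATURE_ORDER = ["snv_1", "snv_2", "imaging_lesion", "histopathology_positive"]

def to_feature_vector(record, feature_order=FEATURE_ORDER):
    """Map a per-subject dict of observations to a fixed-order vector.

    Available observations become 1 or 0; features with no data become
    None, standing in for the designated NULL value.
    """
    vector = []
    for name in feature_order:
        value = record.get(name)
        vector.append(None if value is None else int(bool(value)))
    return vector

subjects = {
    "subject_a": {"snv_1": True, "snv_2": False,
                  "imaging_lesion": True, "histopathology_positive": False},
    "subject_b": {"snv_1": False, "snv_2": True},  # no imaging or histopathology data
}
vectors = {sid: to_feature_vector(rec) for sid, rec in subjects.items()}
```

Because every vector is built from the same ordered list, element k means the same thing for every subject, which is the property the comparison steps rely on.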

The feature vector includes elements representing numerous features, some of which may be probative for the clinical situation under test or analysis, and some of which may be irrelevant for the clinical situation under test or analysis. Further, the probative features have certain feature values that are indicative of the clinical situation under test or analysis, while other feature values are not indicative of the clinical situation under test or analysis. That is, a subject whose feature vector has numerous feature values indicative of the clinical situation under test for the probative features has a higher likelihood of being in the clinical situation under test than does a subject whose feature vector has fewer feature values indicative of the clinical situation under test. However, the probative features and the feature values for those probative features that are indicative of the clinical situation under test are not known.

Accordingly, a probative features and feature values selection module 30 receives the clinical situation under test 32 and selects the probative features and the feature values for those probative features that are indicative of the clinical situation under test. In general, this is done by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known to be in the clinical situation under test, and by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known not to be in the clinical situation under test, and thereby identifying the probative features and the indicative feature values.

Since the clinical situation under test is expected to manifest in molecular marker data, it is likely that at least some probative features that are identified by the selection module 30 represent molecular marker data. It is therefore possible that medical literature 34 may have identified biological pathways that are correlated with the feature values that are indicative of the clinical situation under test. Additionally or alternatively, the medical literature 34 may identify clinical implications that are associated with the feature values that are indicative of the clinical situation under test, or with biological pathways correlated with those feature values. In such cases, an enrichment module 36 associates enrichment data from the medical literature 34 (e.g., information on the correlated biological pathways and/or associated clinical implications) with the feature values that are indicative of the clinical situation under test. The enrichment module 36 operates on an electronic medical literature database and suitably performs keyword searching or other data mining to identify relevant medical literature. The enrichment data may be relatively general, for example providing citations to published clinical studies that include terms associated with the feature values indicative of the clinical situation under test (e.g., if the feature value indicates a particular SNV, the associated terms may include the name of the gene and the name of the SNV). Additionally or alternatively, the enrichment data may be more specific, e.g. expressly identifying a biological pathway correlating with the SNV. In some embodiments the enrichment module 36 is semi-automatic rather than fully automatic.
For example, the enrichment module 36 may present a human operator with the feature values indicative of the clinical situation under test and links to published clinical studies that include terms associated with those feature values, and provide a viewer window via which the human operator can review the linked clinical studies and a dialog box via which the human operator can manually input enrichment data based on the human operator's review of the linked clinical studies and, optionally, further based on the human operator's medical expertise. The output of the selection module 30 and the enrichment module 36 is an in silico test dataset 40 comprising the selected set of probative features for the clinical situation under test, the selected feature values indicative of the clinical situation under test for these probative features, and any added enrichment data.

It will be appreciated that the set of feature vectors 24 can be used in developing different in silico tests for a plurality of different clinical situations under test or analysis. For example, tests may be developed for cancers of different organs or tissues, and/or for different types of those cancers. The processing for each different clinical situation under test entails inputting that clinical situation as the input 32 and invoking the selection module 30 and the enrichment module 36 for that clinical situation.

With reference to FIGURE 2, an in silico test system applying an in silico test developed using the test development system of FIGURE 1 is described. An input feature vector 46 is generated for a person of interest 44, e.g. a single medical patient or a cohort of patients with similar clinical symptoms. The person of interest 44 is typically not one of the subjects contributing to the set of feature vectors 24 used in developing the in silico test. Rather, the person of interest 44 is typically a current medical patient undergoing clinical diagnosis or treatment. The input feature vector 46 may include molecular marker data generated by genetic sequencing using the same sample extraction laboratory 6, sequencer apparatus 14, and alignment/annotation module 16 as was used for generating the set of feature vectors 24 used in developing the in silico test. Other data for generating feature values may come from the entry for the patient 44 in the electronic patient record 18 (and, again, the genetic sequencing data may be stored in and retrieved from the electronic patient record 18 as indicated by a dashed arrow in FIGURE 2). The feature extraction module 20 is applied to the data for the patient of interest 44 to generate the input feature vector 46. Again, if the available data for the patient 44 is insufficient to compute the feature value for any element of the feature vector, that vector element is suitably filled with the designated value (e.g., NULL value) indicating the feature is not available.

A comparative analysis module 50 compares the input feature vector 46 with the in silico test dataset 40, and more particularly compares the feature values of the input feature vector 46 for the probative features identified in the in silico test dataset 40 with the feature values indicative of the clinical situation under test (also from the in silico test dataset 40). The comparative analysis module 50 computes a likelihood that the person of interest 44 is in the clinical situation under test or analysis. The likelihood is computed by comparing the input feature values with the feature values for the probative features that are indicative of the clinical situation under analysis. It should be noted that the likelihood is typically not a medical diagnosis; rather, it is an intermediate result typically provided as an item of information for consideration by a medical doctor in making a medical diagnosis based on the likelihood and possibly other information. To assist the doctor (or other medical person) in making a diagnosis or other clinical assessment, a comparative analysis results visualization module 52 displays the results, including at least an indication of the computed likelihood. If the computed likelihood is high, then the visualization module 52 may optionally also display any enrichment data (e.g., correlated biological pathways and/or associated clinical implications) for the clinical situation under analysis.
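The text does not give a formula for the likelihood computed by the comparative analysis module 50. One simple possibility, offered only as an assumption, is the fraction of evaluable probative features whose input value is among the indicative values. The function `match_score` and the mapping layout below are hypothetical, and the result is an uncalibrated score rather than a probability or a diagnosis.

```python
def match_score(input_vector, indicative_values):
    """Fraction of evaluable probative features whose input value is indicative.

    `indicative_values` maps a feature index to the set of values regarded as
    indicative of the clinical situation under test. Features that are None
    (NULL) in the input vector are skipped. Returns None when no probative
    feature can be evaluated.
    """
    evaluated = 0
    matched = 0
    for index, values in indicative_values.items():
        value = input_vector[index]
        if value is None:
            continue  # feature not available for this person
        evaluated += 1
        if value in values:
            matched += 1
    return matched / evaluated if evaluated else None

# Hypothetical in silico test dataset: probative features 0, 2, 3, 4.
indicative = {0: {1}, 2: {1}, 3: {1}, 4: {1}}
input_vector = [1, 0, None, 1, 0]
score = match_score(input_vector, indicative)
```

Skipping NULL features rather than counting them as mismatches is a design choice; counting them as mismatches would penalize persons with sparse records.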

The various processing components, such as the alignment/annotation module 16, the features extraction module 20, the selection module 30, the enrichment module 36, the comparative analysis module 50, and the visualization module 52, are suitably implemented by one or more computers or other electronic data processing devices 55. By way of illustration, the electronic data processing device or devices 55 may include: a notebook computer; a desktop computer; a network server computer accessible via the Internet and/or a local wired/wireless data network; various combinations thereof; or so forth. The electronic data processing device 55 includes or has operative access to a display device or screen 56 for displaying the visualization generated by the visualization module 52. In the illustrative embodiment, the same computer 55 implements both the in silico test development system of FIGURE 1 and the in silico testing system of FIGURE 2. However, it is contemplated to have different computers implement the in silico test development system of FIGURE 1 and the in silico testing system of FIGURE 2. Moreover, in some embodiments the alignment/annotation module 16 may be implemented by a computer associated with the sequencer apparatus 14 that is different from the computer that implements the features extraction module 20, the selection module 30, and the enrichment module 36. Other arrangements of electronic data processing devices are also contemplated. The disclosed in silico test development and implementation techniques are also suitably embodied as a non-transitory storage medium storing instructions executable by the computer or other electronic data processing device 55 to perform the disclosed techniques. The non-transitory storage medium storing the executable instructions may, for example, include: a hard disk or other magnetic storage medium; an optical disk or other optical storage medium; a flash memory, random access memory (RAM), read-only memory (ROM), or other electronic storage medium; or so forth.

With returning reference to FIGURE 1, the probative features and feature values selection module 30 operates by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known to be in the clinical situation under test, and by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known not to be in the clinical situation under test, and thereby identifying the probative features and the indicative feature values. In one approach, each subject of the set of subjects from which the set of feature vectors 24 is derived is annotated to indicate whether the subject is in the clinical situation under test. In this case, the probative features are identified as features having certain values for those subjects annotated as being in the clinical situation under test and having certain other (different) values for those subjects annotated as not being in the clinical situation under test.
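For the fully annotated case just described, a minimal sketch follows. It assumes a strict criterion that the text does not mandate: a feature is treated as probative only if the values observed among annotated-positive subjects never occur among annotated-negative subjects. The function name `probative_features` is hypothetical, and a practical implementation would likely use a statistical test rather than strict separation.

```python
def probative_features(vectors, labels):
    """Return {feature_index: indicative_values} under a strict-separation rule.

    `vectors` are equal-length feature vectors (None marks a NULL value) and
    `labels` are booleans: True if the subject is annotated as being in the
    clinical situation under test. A feature qualifies when both groups have
    observed values and the two value sets do not overlap.
    """
    result = {}
    for j in range(len(vectors[0])):
        pos = {v[j] for v, lab in zip(vectors, labels) if lab and v[j] is not None}
        neg = {v[j] for v, lab in zip(vectors, labels) if not lab and v[j] is not None}
        if pos and neg and pos.isdisjoint(neg):
            result[j] = pos  # values seen only in positive subjects
    return result

vectors = [
    [1, 0, 1],
    [1, 1, None],
    [0, 0, 1],
    [0, 1, 1],
]
labels = [True, True, False, False]
selected = probative_features(vectors, labels)
```

Here only the first feature separates the two groups, so it alone is returned, together with the value that is indicative of the clinical situation.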

In practice, this approach can be difficult or impossible to implement. There may not be enough suitably annotated subjects with sufficient features of the feature vector to be effectively processed to determine the probative features. For example, the available pertinent clinical studies may investigate populations that are too small for the selection module 30 to generate statistically significant results, and/or the clinical studies may not identify a sufficient number of features for the subjects. In some published studies, only those features that the researchers determined to be relevant are identified in the study, and this may exclude numerous other features that would be identified by the selection module 30 if the full feature sets were available. Some studies may publish only summaries, rather than providing full WGS or other individualized patient data. Even when the studies provide sufficient individualized patient data, the format may be incompatible with the sequences output by the sequencing apparatus 14 that is available for characterizing the patient of interest 44, or may be obtained by an entirely different technology (e.g., proteomic analysis rather than sequencing). Still further, there may be known or unknown population biases present in the clinical study populations. For example, a given clinical study may have been restricted to women, while the test under development may be intended to be applicable to both women and men.

With reference to FIGURE 3, the illustrative selection module 30 operates on the set of feature vectors representing subjects 24 in which the subjects are not (in general) annotated as to whether the subjects are in the clinical situation under test or analysis. The illustrative selection module 30 also has available to it a database of feature values for subjects of one or more populations 60 that include subjects annotated as to whether they are in the clinical situation under test. The database may be relatively incomplete as compared with the set of features represented by the feature vector. For example, subjects of the population(s) 60 may be labeled with only a few discrete molecular marker values, for example obtained in a standard test employing a fixed set of markers, rather than with a WGS or other substantial set of molecular marker data. The database 60 may be limited in other ways, for example being biased toward a particular gender, age group, or other demographic due to constraints on study pools imposed by study parameters, and/or having an undesirably small population size, or so forth. The population 60 typically includes both positive and negative samples (i.e. some subjects in the clinical situation under test, and some subjects not in the clinical situation under test), although a population with only positive samples may be employed.

The selection module 30 utilizes the substantially larger quantity of data contained in the set of feature vectors representing subjects 24 to effectively generalize data of the more limited study population(s) 60. Toward this end, a discriminative features subset extraction operation 62 analyzes the feature vectors of the set of feature vectors 24 to identify a subset of discriminative features and to discard non-discriminative or minimally discriminative features so as to generate reduced-dimensionality feature vectors 64 that are effective for discriminating amongst the subjects represented by the feature vectors 24. Differential (i.e., discriminative) feature subset extraction (also referred to as feature reduction, feature extraction, or similar phraseology) is a process by which a set of features describing a set of entities is reduced to a subset of features based on their ability to maintain the differential information between the entities (that is, the ability to discriminate between entities). More formally, each entity E_1, ..., E_M is described with a vector of N values. In other words, E_i will be represented by the feature vector <f_{i,1}, f_{i,2}, ..., f_{i,N}>, where f_{i,j} is the j-th feature for the i-th entity. These feature vectors need not contain values for each feature (i.e. some vector elements can have a NULL/missing value). To extract a differential feature subset, one can apply any of many methods, chosen to suit the purpose of the selection. In one approach, all features found in all entities with values of a comparable range can be eliminated and the remaining features are retained. For example, given three entities with the values in the table below, features 1, 3, and 4 would be in the differential feature subset, and the reduced-dimensionality feature vector 64 for E_i is <f_{i,1}, f_{i,3}, f_{i,4}>.

      f_1    f_2   f_3   f_4     f_5
E1    NULL   3     10    NULL    TRUE
E2    A      3     20    NULL    TRUE
E3    B      3     200   'yes'   TRUE
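The simple elimination rule applied to the three-entity table can be sketched as follows. Two assumptions are made that the text leaves open: values are treated as being in a "comparable range" only when they are exactly equal, and a NULL (`None`) counts as differing from any non-NULL value. The function name `differential_subset` is hypothetical.

```python
def differential_subset(vectors):
    """Return zero-based indices of features that differ across entities.

    A feature is eliminated when every entity holds the same value for it;
    None (NULL) is treated as different from any actual value.
    """
    keep = []
    for j in range(len(vectors[0])):
        column = [v[j] for v in vectors]
        if any(x != column[0] for x in column):
            keep.append(j)
    return keep

# Entities E1, E2, E3 from the table, features f_1 .. f_5 in order.
table = [
    [None, 3, 10, None, True],
    ["A", 3, 20, None, True],
    ["B", 3, 200, "yes", True],
]
kept = differential_subset(table)  # zero-based, so add 1 for feature numbers
reduced = [[row[j] for j in kept] for row in table]
```

With zero-based indexing, the retained indices 0, 2, and 3 correspond to features 1, 3, and 4 of the table, while features 2 and 5 are dropped because all three entities share the same value.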

In a more flexible approach, all features whose values are within the comparable range for a certain (high) percentage of the entities are eliminated. The differential feature extraction 62 can be tuned by selection of the "comparable range" and/or the percentage of entities within that range required for elimination. In general, increasing the size of the comparable range increases the number of features that are eliminated, and lowering the percentage of entities in the comparable range required for elimination increases the number of features that are eliminated. In more elaborate differential feature subset extraction methodologies, information about the entities and/or the features can be employed to determine how informative (i.e., discriminative) individual features or feature subsets are for distinguishing amongst individuals of the population 24. For example, features may be ranked on the properties/distribution of the values and a top set of features (e.g. top 25%) is selected, or subsets may be evaluated as a group by their ability to stratify entities into sub-categories - then the top performing subsets are selected. The foregoing examples are merely illustrative, and other feature reduction techniques can also be employed. After the feature subset extraction operation 62, the variations in the discriminative subset of features (i.e. the reduced-dimensionality feature vectors) 64 are characterized in the context of populations 60 in the identification operation 66.
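The two tuning parameters of the more flexible approach can be illustrated for numeric features. The text does not define the "comparable range", so this sketch assumes it is a window of half-width `tolerance` around the median of the available values, and that NULL values count as falling outside the range. The names `is_eliminated`, `flexible_subset`, `tolerance`, and `min_fraction` are hypothetical.

```python
from statistics import median

def is_eliminated(column, tolerance, min_fraction):
    """True if at least `min_fraction` of all entities lie within `tolerance`
    of the median of the available (non-None) values."""
    present = [x for x in column if x is not None]
    if not present:
        return True  # no values at all, so nothing to discriminate on
    center = median(present)
    in_range = sum(1 for x in present if abs(x - center) <= tolerance)
    return in_range / len(column) >= min_fraction

def flexible_subset(vectors, tolerance, min_fraction):
    """Return zero-based indices of the features that are retained."""
    return [j for j in range(len(vectors[0]))
            if not is_eliminated([v[j] for v in vectors], tolerance, min_fraction)]

data = [
    [3, 10, 5.0],
    [3, 20, 5.1],
    [3, 200, 5.2],
    [4, 15, 9.0],
]
strict = flexible_subset(data, tolerance=1, min_fraction=1.0)
relaxed = flexible_subset(data, tolerance=1, min_fraction=0.75)
wide = flexible_subset(data, tolerance=200, min_fraction=1.0)
```

Lowering the required fraction from 1.0 to 0.75 eliminates the third feature, and widening the range eliminates every feature, which matches the tuning behavior described above.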

Populations are samples from previous studies or "virtual" samples that contain collections of identified variations shown to be relevant and characteristic of a clinical phenotype, clinical action, or a clinical outcome. The output of the identification operation 66 is probative features and the feature values for those probative features that are indicative of the clinical situation under test. This output serves as the input to the enrichment module 36, which performs an enrichment operation 68 that enriches the test with enrichment data, e.g. pathway data, clinical implication data, decision recommendation data, or so forth.

It is noteworthy that the feature subset extraction operation 62 is not dependent on the annotated reference population(s) 60. Accordingly, if there are several clinical situations under test for which in silico tests are to be developed, the same reduced features set 64 can be used for each test development, and so the feature subset extraction operation 62 can be run only once.

Some illustrative examples are next set forth.

Input to the comparative variation assessment tool is sequencing profiles obtained from one or more tissue samples. These samples can come from a group of patients, can originate in normal and/or cancer tissue, and can be obtained with various degrees of invasiveness: from a saliva swab, through a blood sample, to biopsy and surgery.

Additionally, a group of samples may be obtained from a single patient, e.g. one normal sample and one or more diseased tissue samples (for instance, several biopsied points in suspicious nodules); optionally, secondary sites such as lymph nodes may also be considered.

The sequencer apparatus 14 acquires single-base level high coverage reads of the DNA or RNA molecules from a specimen. The end result after several standard low-level processing steps (e.g., duplicates filtering, removal of low base quality reads, et cetera) is a collection of reads of given lengths which are then aligned to a reference (e.g. human genome for human DNA or RNA sequencing) by the alignment/annotation module 16. The alignment is typically imperfect (by design) and this is captured in the output by the confidence of the match for a single base or a region, coverage of a location on the reference genome, and other quality metrics. Given an alignment, it is possible to characterize various types of variations that exist in all individuals. These variations can also be called with some certainty, which allows for filtering out noise in the biological signal or in the measurement.

With reference to FIGURE 4, a simplified diagrammatic view of a processing pipeline is presented. One or more of these pipelines can be used to analyze each sample to provide higher-level information on discovered variations. Starting with basic analysis, usually with respect to a reference, candidate variations are obtained (e.g. triangle and square locations on the genome, which is indicated with the full line in the top diagram of FIGURE 4). Not all variations are of interest to the clinician (for example, they may be known to be irrelevant to the clinical situation under test). The variations are annotated based on some repository of variations and these are indicated in FIGURE 4 (middle diagram) with the symbols A, B, C and i, j, k, and l. Furthermore, such variations can be grouped based on higher-level grouping (e.g. biological pathways, disease-associated genes, population-specific variants). In FIGURE 4, this higher level annotation is shown in the bottom diagram using symbols α, β, and γ.

Typical variations comprise single nucleotide variations, copy number variations, or so forth. Furthermore, these can be interpreted with respect to their homo- or heterozygosity, equivalence to a reference population, et cetera.

In this process, based on the annotation at various levels, the samples can be grouped in various subsets (in case when there are more sequencing outputs/samples to consider) or variations on the annotation of the same measurement relative to the question explored. This is achieved in the approach of FIGURE 3 by the differential features subset operation 62 which produces the reduced-dimensionality feature vectors 64. The annotated reduced profiles (represented by the reduced feature vectors 64) are then analyzed in the context of a reference population or populations 60 with equivalent annotation to identify features that are probative of the clinical situation under test (identification operation 66).

More formally, each sample can be represented by a feature vector of N values (that is, an N-dimensional feature vector) which can for example correspond to the union of all variations found v_1, v_2, ..., v_N. In some embodiments, the feature vector elements have binary values, i.e. each sample is represented by a feature vector of N "0" and "1" values indicating presence or absence of variation v_i at the i-th position. After the feature reduction 62, the reduced-dimensionality feature vectors 64 can then be used to compute pairwise distances between samples and with this perform hierarchical clustering as part of the identification operation 66 to find sample clusters or subsets. In a simplified example below, N=8 and the samples fall perfectly into two clusters. In a realistic scenario, some tolerance would be added to the subset selection to allow for minor discrepancies. Hierarchical clustering can, for example, identify a larger cluster (e.g. corresponding to breast cancer generally) and smaller clusters contained in the larger cluster that correspond to specific types of breast cancer. More generally, a larger cluster "higher" in the hierarchy corresponds to a more general clinical situation under analysis while the contained smaller clusters "lower" in the hierarchy correspond to more particular clinical situations under analysis (which are subsumed by or encompassed by the more general clinical situation). Although hierarchical clustering advantageously enables such stratification of more general-to-more particular clinical situations, it is alternatively contemplated to employ non-hierarchical (i.e. flat) clustering.
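The pairwise-distance and hierarchical clustering steps can be sketched as follows. This is a minimal illustration only: the text does not specify a distance or a linkage rule, so Hamming distance and single linkage are assumptions, and the six binary vectors with N=8 are made-up values rather than data from the disclosure.

```python
from itertools import combinations

# Hypothetical binary feature vectors (N = 8); 1 = variation present, 0 = absent.
SAMPLES = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
]


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(vectors, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters clusters.

    Returns the clusters as sorted lists of sample indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for p, q in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(hamming(vectors[i], vectors[j])
                    for i in clusters[p] for j in clusters[q])
            if best is None or d < best[0]:
                best = (d, p, q)
        _, p, q = best
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]  # q > p, so index p is unaffected
    return sorted(clusters)


print(agglomerate(SAMPLES, 4))  # [[0, 1], [2], [3, 4], [5]]
print(agglomerate(SAMPLES, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Cutting the same merge sequence at four and at two clusters shows the nesting described above: the smaller clusters are contained in the larger ones, and samples 2 and 5 join their clusters despite a one-position discrepancy.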

With reference to FIGURES 5 and 6, output is shown of one analysis that produces data analogous to that discussed with reference to FIGURE 4. The data shown in FIGURES 5 and 6 were extracted from copy number variation (CNV) analysis of DNA sequencing data from eight individuals where seven genomes were analyzed using the eighth genome as a normalization (control) with the CNV-seq tool (Xie et al., "CNV-seq, a new method to detect copy number variation using high-throughput sequencing", BMC Bioinformatics 2009, 10:80). FIGURES 5 and 6 show output of this tool visualized using the UCSC Genome Browser (Kent et al., "The human genome browser at UCSC", Genome Res. 2002 Jun;12(6):996-1006). In these results, boldface is used to signify amplification and italics to signify deletion.

With reference to FIGURE 7, in another set of examples, we show a close-up of a genome region containing three annotated (sub-)regions A, B, C. One region (A) contains the same annotation which is present in all examples, and hence is not discriminative and would be discarded as a candidate differential feature. Two regions (B), (C) have different annotations amongst the examples, and hence may be retained as probative features. Similarly, in FIGURE 8, the results of another CNV calling tool (FREEC, see Boeva et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization", Bioinformatics 2011, 27(2): 268-269) are visualized with the UCSC Genome Browser showing a similar pattern of features.

With reference to FIGURE 9, an enhancement of the features reduction operation 62 in which features are consolidated or merged may be employed to improve the quality of the subsequent steps by a consensus alignment of the variations. Many variations are typically given for a genomic range and may be prone to some variability. For example, a copy number variation (CNV) on chromosome 1 may be discovered in the range 1,000,000-1,000,100 in one sample and 1,000,100-1,000,900 in another. Such discrepancies can be consolidated across all samples to establish a more robust call of the existence and the quantitative characterization of these (shared) variations. In FIGURE 9, the CNV calls from four genomes are broken into eight merged segments MS_1 - MS_8, each of which is characterized with a vector of four values. Additionally, a step is performed that combines neighboring and overlapping segments into consolidated segments, here from the beginning of MS_1 to MS_8. All eight segments are combined into one consolidated segment. These aggregate segments are used as units of shared variation that can be used to compare occurrence in individual samples in the subsequent analysis. For example, the agreement of each sample with a consolidated segment can be quantified by the degree of overlap between the individual sample reading and the consolidated segment. In one approach only the contribution in binary terms (0 or 1) is counted for each MS_i. In illustrative FIGURE 5, the following degree of presence of this consolidated variation in the four samples is determined:

CNV₂=2, CNV₃=4, and CNV₄=2.
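The segment consolidation just described can be sketched as follows. This is a minimal illustration under stated assumptions: the interval coordinates in CALLS are made up and do not reproduce the figure, "neighboring" is taken to mean segments that touch end to start, and the function names are hypothetical.

```python
# Hypothetical CNV calls (start, end) on one chromosome for four samples.
CALLS = {
    "sample1": [(100, 300)],
    "sample2": [(200, 400)],
    "sample3": [(100, 500)],
    "sample4": [(350, 500)],
}


def merged_segments(calls):
    """Split the union of all calls at every breakpoint.

    Returns (start, end, presence) tuples, where presence holds one 0/1 value
    per sample (in sorted sample-name order). Stretches covered by no sample
    are dropped.
    """
    names = sorted(calls)
    points = sorted({p for ivs in calls.values() for iv in ivs for p in iv})
    segments = []
    for start, end in zip(points, points[1:]):
        presence = [
            int(any(s <= start and end <= e for s, e in calls[name]))
            for name in names
        ]
        if any(presence):
            segments.append((start, end, presence))
    return segments


def consolidate(segments):
    """Group merged segments that touch end to start into consolidated segments."""
    groups = []
    for seg in segments:
        if groups and groups[-1][-1][1] == seg[0]:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups


def degree_of_presence(group, n_samples):
    """Per sample, count the merged segments (0 or 1 each) in which it is present."""
    return [sum(seg[2][k] for seg in group) for k in range(n_samples)]


SEGMENTS = merged_segments(CALLS)
GROUPS = consolidate(SEGMENTS)
print(len(SEGMENTS), len(GROUPS))                  # 5 1
print(GROUPS[0][0][0], GROUPS[0][-1][1])           # 100 500
print(degree_of_presence(GROUPS[0], len(CALLS)))   # [2, 3, 5, 2]
```

Each merged segment carries a four-value presence vector, and the per-sample counts play the role of the degree of presence values listed above.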

With reference to FIGURES 10-13, an illustrative example is shown for a subset A, a subset B, and sequences for one or more populations (see FIGURE 10).

FIGURE 11 highlights a first set of probative annotations (i.e., probative features in the context of the feature vector). These features are present in subset A but are absent in subset B. FIGURE 12 highlights a second set of (one) probative annotation (i.e., another probative feature in the context of the feature vector). This annotation is absent in subset A but is present in subset B. FIGURE 13 shows the aggregate set of probative annotations (i.e., probative features in the context of the feature vector). In general, the output of this reference-population-based annotation is a set of variations that characterize the sample(s) relative to a patient population. In the exposed variations of illustrative FIGURE 13, there are two sample subsets A and B that each differ from some reference population in a different fashion. In the illustrative example, all samples in subset A have two single nucleotide variations found in a reference population but not in subset B (see also FIGURE 11). Similarly, subset B contains a single-nucleotide variation found in the reference but not in subset A (see also FIGURE 12).

With reference to FIGURE 14, a diagrammatic example of a suitable probative feature selection is described. In this example, there is one reference population and two sample subsets (A and B). In this case, all possible difference and intersection sets are examined for differential presence and/or absence of variations and such sets are returned. In another approach, this step can be implemented by measuring the distance between profiles, where each profile is represented as a vector of values and distance metrics such as correlation and Euclidean distance provide information on which samples are "closest", comprising one or more subsets. The example of FIGURE 14 can be expanded to multiple populations and to more than two subsets.
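The difference-and-intersection approach can be sketched with ordinary set operations. In the following illustrative Python sketch the variation identifiers (v1, v2, ...) and the function name differential_sets are hypothetical, and "present in a subset" is assumed to mean present in every sample of that subset, while "absent" means present in none.

```python
# Hypothetical variation sets: one set per sample, plus one reference population.
REFERENCE = {"v1", "v2", "v3", "v9"}
SUBSET_A = [{"v1", "v2", "v5"}, {"v1", "v2", "v6"}]
SUBSET_B = [{"v3", "v5"}, {"v3", "v6", "v7"}]


def differential_sets(subset_a, subset_b, reference):
    """Return reference variations shared by one subset and absent from the other."""
    common_a = set.intersection(*subset_a)   # in every sample of A
    common_b = set.intersection(*subset_b)   # in every sample of B
    any_a = set.union(*subset_a)             # in at least one sample of A
    any_b = set.union(*subset_b)             # in at least one sample of B
    return {
        "reference_and_A_not_B": (common_a & reference) - any_b,
        "reference_and_B_not_A": (common_b & reference) - any_a,
    }


RESULT = differential_sets(SUBSET_A, SUBSET_B, REFERENCE)
print(sorted(RESULT["reference_and_A_not_B"]))  # ['v1', 'v2']
print(sorted(RESULT["reference_and_B_not_A"]))  # ['v3']
```

Variations v5 and v6 occur in both subsets and v7 and v9 fail the reference or subset condition, so none of them is returned as probative.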

With reference to FIGURE 15, after common properties of one or more subsets have been identified, the enrichment operation 68 is suitably performed by the enrichment processor 36. The enrichment may include: based on variant genes, identifying biological pathways implicated with these genes; determining clinical implication based on the enrichment data; and obtaining possible clinical decisions to be presented to the clinician. FIGURE 15 diagrammatically shows this process.
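The three enrichment steps (variant genes to pathways, pathways to clinical implication, clinical implication to candidate decisions) can be sketched as chained lookups. The sketch below is illustrative only: the lookup tables stand in for pathway and literature repositories, and every gene, pathway, implication, and decision name in them is a placeholder rather than real annotation.

```python
# Placeholder lookup tables standing in for pathway and literature databases.
GENE_TO_PATHWAYS = {
    "GENE_A": ["PATHWAY_X"],
    "GENE_B": ["PATHWAY_X", "PATHWAY_Y"],
    "GENE_C": ["PATHWAY_Z"],
}
PATHWAY_TO_IMPLICATION = {
    "PATHWAY_X": "IMPLICATION_1",
    "PATHWAY_Y": "IMPLICATION_2",
}
IMPLICATION_TO_DECISIONS = {
    "IMPLICATION_1": ["DECISION_1A", "DECISION_1B"],
    "IMPLICATION_2": ["DECISION_2A"],
}


def enrich(variant_genes):
    """Attach pathway, clinical implication, and candidate decisions to variant genes."""
    genes_by_pathway = {}
    for gene in variant_genes:
        for pathway in GENE_TO_PATHWAYS.get(gene, []):
            genes_by_pathway.setdefault(pathway, set()).add(gene)
    report = []
    for pathway in sorted(genes_by_pathway):
        implication = PATHWAY_TO_IMPLICATION.get(pathway)  # None if unknown
        report.append({
            "pathway": pathway,
            "genes": sorted(genes_by_pathway[pathway]),
            "implication": implication,
            "decisions": IMPLICATION_TO_DECISIONS.get(implication, []),
        })
    return report


REPORT = enrich(["GENE_A", "GENE_B", "GENE_C"])
for entry in REPORT:
    print(entry)
```

A pathway with no known clinical implication (PATHWAY_Z here) is still reported, with an empty decision list, so that the clinician can see which variant genes remain uninterpreted.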

With returning reference to FIGURES 1 and 3 and with further reference to FIGURE 16, the process of feature extraction 20, features reduction 62, identifying probative features and indicative feature values 66, and enrichment 68 can be grouped under one comparative analysis and each such analysis has one or more clinical implications (e.g. therapy regimen, response assessment, recurrence risk, stratification, et cetera). A dataset (i.e., reduced features set 64, see FIGURE 3) can be analyzed multiple times in the context of different clinical implications or different patient populations.

Similarly, the same clinical implication can be assessed using one or more datasets. In the former case, the feature extraction 20 and feature reduction 62 can be performed only once, and the identification and enrichment operations 66, 68 repeated for each different clinical situation under test. Every comparative analysis combination of a dataset and a clinical question results in a clinical decision characterized by a fitness score (e.g. on a scale of 0 to 1, with 0 indicating that the dataset is not informative for characterizing the clinical question and 1 indicating that the dataset is directly relevant to characterizing the clinical question).

The disclosed processing is captured by the following pipeline: Sample; Measurements; Data Analysis; Post-processing; Annotation (respective to a reference sequence with biological and clinical annotations); Interpretation (referenced to the clinical study population(s) 60 and medical literature 34 specifying clinical implications and clinical outcomes). Multiple such pipelines can be implemented and executed depending on the type of measurement, choice of tools to perform analysis and the data repositories used to annotate the data. Measurements produces raw data, Data Analysis performs the initial processing like alignment and QA (e.g., performed by sequence alignment/annotation module 16), Post-processing involves determination of the subsets and their properties, and Annotation determines the relationships between the subsets and the populations (e.g., the identification operation 66). Finally, Interpretation connects the molecular characterization with clinical implications (e.g., the enrichment operation 68).

With reference back to FIGURE 2, given such a pipeline and the input measurements in one or more modalities, the comparative analysis module 50 provides characterization of all sets - the clinical implication/decision, and also the underlying molecular profile(s) and annotation that contributed towards that conclusion. For example, centroids of the feature vector subsets generated by clustering or other probative features identification processing 66 (see FIGURE 3) can be used to characterize or classify new (individual) patient samples from the patient of interest 44 when they are analyzed in the clinic.
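Classification of a new sample against subset centroids can be sketched as a nearest-centroid rule. The following Python sketch is illustrative only: the subset labels and vectors are made up, and the use of Euclidean distance is an assumption (it is one of the distance metrics named in the text, but no particular metric is prescribed for this step).

```python
import math

# Hypothetical reduced-dimensionality feature vectors grouped into two subsets.
SUBSETS = {
    "subset A": [[1, 1, 0, 0], [1, 1, 1, 0]],
    "subset B": [[0, 0, 1, 1], [0, 1, 1, 1]],
}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(sample, centroids):
    """Return the label of the centroid closest to the sample."""
    return min(centroids, key=lambda label: euclidean(sample, centroids[label]))


CENTROIDS = {label: centroid(vectors) for label, vectors in SUBSETS.items()}
print(CENTROIDS["subset A"])              # [1.0, 1.0, 0.5, 0.0]
print(classify([1, 1, 0, 1], CENTROIDS))  # subset A
```

The distance to the winning centroid could also be retained alongside the label, for instance as one input to a fitness score, though how such a score is derived is not specified here.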

With reference to FIGURE 17, an illustrative comparative analysis is described. In this example, there is one subset S and one population P for simplicity. For example, the population P may be ovarian cancer patients that responded to therapy. The set S ∩ P will be one set for which the Sample; Measurements; Data Analysis; Post-processing; Annotation; Interpretation pipeline is executed and the set S - P another. Consequently, two centroids will be computed, one for each set, for each clinical implication (say therapies T1, T2, and T3). These centroids will be used to characterize a new sample when presented and profiled. Table 1 captures the pipelines executed.

Table 1

With reference to FIGURE 18, based on Table 1 a method and presentation is provided to present an overview of all relevant patient subsets and clinical implications in the context of the current patient. FIGURE 18 shows one suitable presentation. For each clinical implication (four listed in the example of FIGURE 18), patient, type of measurements, and measured tissue, a number of comparative analysis instances can be instantiated and presented, for example in a matrix format based on Table 1. Here, each column is a different analysis pipeline dependent on the data type and/or the annotation databases (corresponding to a different probative feature subset in the context of feature vectors), and each row is an application of a comparative analysis instance applied to a particular analysis pipeline corresponding to a particular clinical implication. Each comparative analysis is scored with respect to fitness to the analyzed sample from a new patient. The fitness is an indication of the computed likelihood. The fitness scores can be accumulated for each row, providing a total score for each comparative analysis with respect to a clinical implication. With this, the visualization module 52 highlights each comparative analysis that results in a match for the current patient (e.g., using thick cell borders in the matrix of FIGURE 18), and also provides a ranking of the clinical implications where the molecular profiles provide insight into the patient sample, thus providing the clinician with tools that enable prioritization and easy overview of the relevant clinical categories and possible clinical actions. In the example of FIGURE 18, "Clinical Implication #3" is most likely as all three probative feature subsets #1, #2, #3 provide a match with the patient of interest.
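The row-wise accumulation and ranking of fitness scores can be sketched as follows. The numeric scores and the match threshold of 0.5 are assumptions chosen purely for illustration; they are not values taken from FIGURE 18, and the text does not state how a "match" is decided.

```python
# Hypothetical fitness scores (0 to 1): one row per clinical implication,
# one column per probative feature subset / analysis pipeline.
FITNESS = {
    "Clinical Implication #1": [0.2, 0.7, 0.1],
    "Clinical Implication #2": [0.1, 0.3, 0.2],
    "Clinical Implication #3": [0.9, 0.8, 0.7],
    "Clinical Implication #4": [0.6, 0.2, 0.4],
}
MATCH_THRESHOLD = 0.5  # assumed cut-off for highlighting a cell as a match


def summarize(fitness, threshold):
    """Return (implication, total score, matching columns) sorted by total score."""
    rows = []
    for implication, scores in fitness.items():
        matches = [j + 1 for j, score in enumerate(scores) if score >= threshold]
        rows.append((implication, round(sum(scores), 3), matches))
    return sorted(rows, key=lambda row: row[1], reverse=True)


RANKING = summarize(FITNESS, MATCH_THRESHOLD)
for implication, total, matches in RANKING:
    print(f"{implication}: total={total}, matching subsets={matches}")
```

With these made-up scores the third implication ranks first with matches in all three columns, mirroring the situation described for the figure.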

The clinical implications are then compared, for example by identifying the cells where the clinical implication is in agreement across the analyses, and the clinical actions are ranked (using a star system at the right in FIGURE 18), for example based on the strength of evidence obtained by consensus in the analysis output. For example, the annotation of the variations may be aimed at selecting the best therapy for a breast cancer patient. Copy number, transcription levels, and single nucleotide variations are all measured and the annotation is compared with known variations relevant to breast cancer implicated genes. All comparative analyses then focus on selecting the pathways implicated with each measurement and the targets of therapies are identified in each row of the matrix. Based on the ranking of the therapies, the clinician can decide how to proceed, order another line of analysis, explore the underlying evidence, re-evaluate the data with respect to another reference population, et cetera.

Table 2

The analysis up to this point facilitates focusing on a subset of features based on which implications can be derived at a higher level. For example, given a set of genes which are interesting due to the differential CNV features discussed earlier, a subsequent analysis may be applied to derive pathway regulation profiles to indicate which biological pathways are enriched with gene amplifications or gene deletions in the context of the eight genomes CNV analysis. Table 2 shows two sets of biological pathways derived for the genome of a person of Central European origin, listing pathways that indicate relative differences in CNV profiles which may, based on the clinical question asked, indicate susceptibility to a disease, suitability for a therapy, or a candidate for inclusion in a population for broader analysis of samples.

A further example from the literature is considered. In this example, based on specific measurements and phenotypes, molecular profiling data is used to assess carboplatinum-based chemotherapy resistance in ovarian cancer patients. Here, the key genes identified in each patient subgroup are used to further determine which biological pathways are primarily affected in the cancer tissue of two sample subsets. By using, for example, DNA methylation information as well as gene expression data in a sample obtained from tumor biopsies (see Banerjee et al., "Pathway and network analysis probing epigenetic influences on chemosensitivity in ovarian cancer", IEEE GENSIPS 2010), the central genes are identified in two subsets with matching expression and methylation profiles in platinum resistant patients. The annotation (gene names, in this case) is obtained to then identify the biological pathways enriched in each subset. The population is determined based on clinical studies that implicate various biological pathways involved in, for example, therapy resistance and cancer proliferation. Three populations are identified based on the three degrees of resistance to therapy: platinum-sensitive (PFI >6 months), platinum-resistant (PFI <6 months), and platinum-refractory (no PFI), where PFI stands for progression-free interval, a surrogate marker for intrinsic chemosensitivity. The subsets A and B are two groups of pathways determined to have distinct profiles in the given patient cohort.
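The three-way population assignment by progression-free interval can be sketched as a simple rule. The function name and the use of None to encode "no PFI" are hypothetical. The text gives only ">6 months" and "<6 months", so the handling of a PFI of exactly 6 months (assigned here to the resistant group) is an assumption.

```python
from typing import Optional


def platinum_response(pfi_months: Optional[float]) -> str:
    """Assign a reference population label from the progression-free interval.

    pfi_months is None when there was no progression-free interval at all.
    """
    if pfi_months is None:
        return "platinum-refractory"
    if pfi_months > 6:
        return "platinum-sensitive"
    # Assumption: the boundary case of exactly 6 months is grouped with "< 6".
    return "platinum-resistant"


for pfi in (14.0, 3.5, None):
    print(pfi, "->", platinum_response(pfi))
```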

In an example from Banerjee et al., supra, pathways showing significant overlap with genes (entities) in the gene list (entity list) selected for analysis are displayed in Table 3. The table also highlights the genes among the pathways important in chemosensitivity to platinum. Contributions from the AR pathway, the Wnt pathway, and the PI3K-Akt pathway have been well-characterized in ovarian cancer. Methylated PITX2 has been shown to predict outcome in lymph node-negative breast cancer patients. Matching the profile of the patient to one or more of the subtypes with such indication will provide the clinician with a tool to make a decision on therapy or other treatment options.
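The text does not state how the significance of the overlap between a gene list and a pathway is assessed. One common choice, shown here purely as an assumed stand-in and not as the method of the cited work, is a one-sided hypergeometric test; the function name and the example counts are hypothetical.

```python
from math import comb


def overlap_p_value(n_background, n_pathway, n_list, n_overlap):
    """Probability of at least n_overlap pathway genes in a random gene list.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_list:       genes in the list under analysis
    n_overlap:    list genes that are annotated to the pathway
    """
    upper = min(n_pathway, n_list)
    favourable = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_list)


# Hypothetical counts: a 20-gene list against a 10-gene pathway, 100 background genes.
for overlap in (2, 6):
    print(overlap, overlap_p_value(100, 10, 20, overlap))
```

A smaller value indicates an overlap less likely to arise by chance; any threshold for calling a pathway "enriched" would be a further assumption.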

Table 3 - List of Enriched Pathways and Genes for Ovarian Cancer/Platinum Therapy Example

The invention has been described with reference to the preferred embodiments. Obviously, modifications and alterations will occur to others upon reading and understanding the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. An apparatus comprising:

an electronic data processing device (55) configured to perform a method including:

generating (20) feature values for a set of features from data including molecular marker data acquired from a set of subjects to generate feature vectors (24) representing the subjects;

deriving a sub-set of discriminative features from the feature vectors and representing the subjects using reduced-dimensionality feature vectors with the sub-set of discriminative features; and

identifying (66) a set (40) of probative features and feature values for the probative features that are indicative of a clinical situation under analysis;

wherein the identifying is based on comparison of the reduced-dimensionality feature vectors with feature values representing subjects in one or more subject populations (60) that include subjects identified as being in the clinical situation under analysis.

2. The apparatus of claim 1, wherein the method performed by the electronic data processing device (55) further comprises:

generating input feature values for the set of features from data including molecular marker data acquired from a person of interest (44) to generate an input feature vector (46);

performing a comparative analysis that computes a likelihood that the person of interest is in the clinical situation under analysis by comparing the input feature values with the feature values for the probative features that are indicative of the clinical situation under analysis; and

displaying a result of the comparative analysis including at least an indication of the computed likelihood.

3. The apparatus of claim 2 wherein:

the identifying is repeated for a plurality of different clinical situations under analysis; and

the performing is repeated for each clinical situation of the plurality of different clinical situations by comparing the input feature vector with the feature values for the probative features identified for each different clinical situation under analysis.

4. The apparatus of claim 3 wherein each different clinical situation under analysis has at least one associated clinical implication and the displaying comprises:

ranking the clinical implications respective to the person of interest based on the likelihoods computed by the comparative analyses for the different clinical situations under analysis and their associated clinical implications; and

the displaying of an indication of the computed likelihood includes displaying the ranking of the clinical implications.

5. The apparatus of claim 2 further comprising:

correlating the identified feature values for the probative features that are indicative of the clinical situation under analysis with biological pathway information from medical literature (34) corresponding to the identified feature values for the probative features; wherein the displaying includes displaying the biological pathway information from medical literature.

6. The apparatus of claim 5 wherein the medical literature (34) further associates the correlated biological pathway information with at least one clinical implication, and the displaying includes displaying the associated clinical implication.

7. The apparatus of claim 1 wherein the displaying includes displaying a fitness metric indicative of the computed likelihood that the person of interest (44) is in the clinical situation under analysis.

8. The apparatus of claim 1 wherein:

the identifying (66) includes performing hierarchical clustering to generate a hierarchy of clusters of the reduced-dimensionality feature vectors (64) and

the identifying (66) further includes identifying a subset of the set of probative features and feature values for the subset of the set of probative features that are indicative of a more particular clinical situation under analysis that is encompassed by the clinical situation under analysis;

wherein the identifying of the subset of the set of probative features and the feature values for the subset of the set of probative feature values is based on comparison of the hierarchy of clusters of reduced-dimensionality feature vectors with feature values representing subjects in the one or more subject populations that include subjects identified as being in the more particular clinical situation under analysis.

9. The apparatus of claim 8 wherein:

the clinical situation under analysis is cancer of a specified organ or tissue; and the more particular clinical situation under analysis is a particular type of cancer of the specified organ or tissue.

10. The apparatus of claim 1 wherein the method performed by the electronic data processing device (55) further comprises:

repeating the identifying (66), but not the deriving (62), for an updated one or more subject populations;

wherein the previously identified set of probative features and feature values for the probative features that are indicative of a clinical situation under analysis are used as initial values for the repeating of the identifying.

11. The apparatus of claim 1 wherein the set of features further includes features generated from medical tests other than molecular marker data.

12. A method comprising:

developing an in silico test for assessing likelihood that a patient is in a clinical situation under test by performing operations including (1) generating feature vectors representing subjects of a set of subjects wherein the feature vectors include feature values derived from molecular marker data and (2) performing feature reduction to generate reduced-dimensionality feature vectors representing the subjects and (3) identifying a set of probative features and feature values for the probative features that are indicative of the clinical situation under test based on comparison of the reduced-dimensionality feature vectors with a reference data set including feature values representing subjects identified as being in the clinical situation under test;

generating an input feature vector from data including molecular marker data acquired from a person to be tested;

performing the in silico test by comparing the input feature vector with the feature values for the probative features that are indicative of the clinical situation under test; and displaying a result of the performed in silico test;

wherein the developing and generating operations are performed by an electronic data processing device (55).

13. The method of claim 12 wherein:

the developing is repeated to develop in silico tests for assessing likelihood that a patient is in a plurality of different clinical situations under test;

the generating and performing operations are repeated to assess the likelihoods that the patient is in two or more different clinical situations under test; and

the displaying includes a comparative display of the assessed likelihoods that the patient is in two or more different clinical situations under test.

14. The method of claim 12 wherein:

the developing further includes correlating the feature values identified as being indicative of the clinical situation under test with biological pathway information from medical literature corresponding to the identified feature values; and the displaying includes displaying the biological pathway information from medical literature.

15. The method of claim 12 wherein the displaying includes displaying a fitness metric indicative of the computed likelihood that the person of interest is in the clinical situation under test.

16. The method of claim 12 wherein:

the identifying operation (3) performs hierarchical clustering to generate a hierarchy of clusters of reduced-dimensionality feature vectors; and

the identifying further includes subdividing the set of probative features and feature values to identify more specific probative features and feature values that are indicative of a more particular clinical situation under test that is encompassed by the clinical situation under test;

wherein the subdividing is based on comparison of the hierarchy of clusters of reduced-dimensionality feature vectors with the reference data set.

17. The method of claim 16 wherein:

the clinical situation under test is cancer of a specified organ or tissue; and the more particular clinical situation under test is a particular type of cancer of the specified organ or tissue.

18. The method of claim 12 further comprising:

updating the in silico test by repeating the identifying operation (3) for an updated reference data set;

wherein the updating does not include repeating the feature reduction operation (2).

19. The method of claim 18 wherein the updating uses the set of probative features and feature values identified in the identifying operation (3) of the developing as initial values for performing the identifying operation (3) of the updating.

20. A non-transitory storage medium storing instructions executable by an electronic data processing device (55) to perform the method set forth in claim 12.