WO2004068136A1 - Improved anova method for data analysis - Google Patents

Improved anova method for data analysis Download PDF

Info

Publication number
WO2004068136A1
WO2004068136A1 PCT/US2003/002584 US0302584W WO2004068136A1 WO 2004068136 A1 WO2004068136 A1 WO 2004068136A1 US 0302584 W US0302584 W US 0302584W WO 2004068136 A1 WO2004068136 A1 WO 2004068136A1
Authority
WO
WIPO (PCT)
Prior art keywords
variance
group
measurement
measurements
equation
Prior art date
Application number
PCT/US2003/002584
Other languages
French (fr)
Inventor
Lee Weng
Original Assignee
Rosetta Inpharmatics Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rosetta Inpharmatics Llc filed Critical Rosetta Inpharmatics Llc
Priority to PCT/US2003/002584 priority Critical patent/WO2004068136A1/en
Publication of WO2004068136A1 publication Critical patent/WO2004068136A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to improved Analysis of Variance (ANOVA) methods for analyzing measurement data, e.g., gene/protein expression data.
  • ANOVA Analysis of Variance
  • spotted cDNA arrays are prepared by depositing PCR products of cDNA fragments with sizes ranging from about 0.6 to 2.4kb, from full length cDNAs, ESTs, etc., onto a suitable surface (see, e.g., DeRisi et al, 1996, Nature Genetics 14:457-460; Shalon et al, 1996, Genome Res. 5:689-645; Schena et al, 1995, Proc. Natl Acad. Sci. U.S.A. 53:10539-11286; and Duggan et al., Nature Genetics Supplement 27:10-14).
  • oligonucleotide arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface are synthesized in situ on the surface by, for example, photolithographic techniques (see, e.g., Fodor et al, 1991, Science 251:767-773; Pease et al, 1994, Proc. Natl. Acad. Sci. U.S.A. 97:5022-5026; Lockhart et al , 1996, Nature Biotechnology 14: 1675; McGall et al, 1996, Proc. Natl. Acad. Sci. U.S.A. 93:13555-13560; U.S. Patent Nos. 5,578,832; 5,556,752; 5,510,270; and 6,040,138).
  • Methods for generating arrays using inkjet technology for in situ oligonucleotide synthesis are also known in the art
  • DNA array technologies have allowed, inter alia, genome-wide analysis of mRNA expression in a cell or a cell type or any biological sample. Aided by sophisticated data management and analysis methodologies, the transcriptional state of a cell or cell type as well as changes ofthe transcriptional state in response to external perturbations, including but not limited to drug perturbations, can be characterized on the mRNA level (see, e.g., Stoughton et al., International Publication No. WO 00/39336, published July 6, 2000; Friend et al.,
  • Protein microarrays are used to monitor the genome-wide protein expression in cells (i.e., the "proteome,” Goffeau et al, 1996, Science 274:546-567; Aebersold et al., 1999, Nature Biotechnology 10:994-999). Protein microarrays contain binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome (see, e.g., Zhu et al., 2001, Science 293:2101-2105; MacBeath et al, 2000, Science 289:1760-63; de Wildt et al, 2000, Nature Biotechnology 18:989-994).
  • Protein expression in a cell can also be separated and measured by two-dimensional gel electrophoresis techniques.
  • Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al, 1996, Proc. Natl. Acad. Sci.
  • the resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, Western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N- terminal micro-sequencing. Using these techniques, it is possible to identify a substantial fraction of all the proteins produced under given physiological conditions, including in cells (e-g- , in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.
  • ANOVA Analysis of Variance
  • ANOVA is often used for determining whether there are statistical differences among the means of measurements in different measurement groups.
  • the different measurement groups may contain measurements of expression levels of a gene or protein under different drug treatments. In each group, there may be several replicate measurements under the same treatment.
  • the within-group variance is the measurement variance of measurements within a treatment group.
  • the between-group variance is the measurement variance ofthe means of different treatment groups.
  • the within-group variance reflects the measurement error ofthe measurement technology, and the between-group variance includes both the measurement error ofthe measurement technology and changes caused by different treatments. Then the between-group variance is compared to the within-group variance. If the between-group variance is significantly larger than the within-group variance, it may be concluded that the different treatments have produced statistically significant changes in gene expression levels.
  • ANOVA analysis the underlying null- hypothesis is that all treatment groups have the same mean. With the estimated mean squares and degrees of freedom, a p-value of F-statistics can be calculated. The p-value is the probability that the null-hypothesis may be accepted.
  • the null-hypothesis can be rejected and the alternative hypothesis, which means that some ofthe expression levels have different means, can be accepted. In other words, some treatments have produced changes in the expression level ofthe gene.
  • the between-group variance can be much larger than the underestimated within-group variance, leading to small p-value, which in turn incorrectly indicates statistically significant difference in the expression levels ofthe gene or protein.
  • Such incorrectly identification of a gene or protein is called a "false positive.”
  • High false positive rate is a severe problem in gene expression analysis when the traditional ANOVA method is used. Large number of false positives often requires follow-up validation using other expression profiling technologies. Low degrees-of-freedom also reduces the detection sensitivity. As a result, small changes in differential expression may not be detected. Such missed or detected differentially expressed genes or proteins are called “false negatives.”
  • Measurement errors in microarray experiments are often described by error models (see, e.g., Supplementary material to Roberts et al, 2000, Science, 287:873-880; and Rocke et al., 2001, J. Computational Biology 8:557-569). Measurement errors can also be described as a sum of a common error and a scatter error (see, e.g., Stoughton et al., U.S. Patent No. 6,351,712). An error-weighted average is used in combining ratio profiles (see, e.g., Stoughton et al, U.S. Patent No. 6,351,712).
  • the present invention provides methods for analyzing measurement data.
  • a second type of inputs e.g., estimated variances of the measurements, are also used.
  • Each ofthe k different measurement groups consists of one or more measurements ofthe variable under a condition common to the measurement group, and each ofthe measurement ⁇ has a predetermined measurement variance, e.g., ⁇ ,. .
  • k and n t can be any appropriate integer.
  • the variable is the transcript level of a gene or the abundance of a protein.
  • the transcript level is measured using a DNA microarray.
  • the abundance is measured using a protein microarray or two-dimensional gel electrophoresis.
  • the invention provides a method for analyzing the plurality of measurements ofthe variable y tl ⁇ comprising (a) determining a within-group variance for the k different measurement groups, wherein the within-group variance consists of a propagated variance and a scattered variance, the propagated variance being determined o based on predetermined measurement variances of the plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with respect to a mean of measurements in a respective measurement group; (b) determining a between-group variance for the k different measurement groups, wherein the between-group variance is a variance of a plurality of group means, one for each ofthe k 5 different measurement groups, with respect to a mean of the plurality of measurements, wherein each the group mean is the mean ofthe one or more measurements in a measurement group; and (c) comparing the within-group variance with the between-group variance, thereby analyzing the plurality of measurements.
  • the invention provides a method for analyzing the 0 plurality of measurements ofthe variable ⁇ y tl ⁇ comprising (a) weighting each measurement y tl with a weighting factor to obtain an error- weighted measurement, wherein the weighting factor is determined based on the predetermined measurement variance ofthe measurement; (b) determining a within-group variance for the k different measurement groups, wherein the within-group variance consists of a propagated variance and a scattered variance, the 5 propagated variance being determined based on predetermined measurement variances of the plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with respect to a mean of error- weighted measurements in a respective measurement group; (c) determining a between-group variance for the k different measurement groups, wherein the between-group variance is a o variance of a plurality of group means, one for each of the k different measurement groups, with respect to a mean ofthe plurality of error- weighted measurements, wherein each
  • the comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein the statistical metric is determined by a method comprising (i) determining a within-group degree of freedom; (ii) determining a between-group degree of freedom; (iii) determining a within-group mean square; (iv) determining a between-group mean square; and (v) calculating the statistical metric.
  • the invention provides an improved ANOVA method for analyzing the plurality of measurements ofthe variable ⁇ y ti ⁇ in which the o improvement comprises determining a within-group variance consisting of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances ofthe one or more measurements, and the scattered variance being determined based on deviation of each ofthe one or more measurements with respect to a mean of measurements in a respective measurement group.
  • the within-group variance can be determined based on group variances ofthe k different measurement groups, each group variance consisting of a measurement group propagated variance and a measurement group scattered variance.
  • Each measurement group propagated variance can be detennined based on the predetermined measurement variances ofthe one or more measurements in the 0 measurement group.
  • the measurement group scattered variance can be a variance ofthe one or more measurements with respect to the mean ofthe one or more measurements in the measurement group.
  • the propagated variance for each measurement group is determined according to the equation 5
  • ⁇ ⁇ is the propagated variance.
  • the group propagated variance and the group scattered variance for each group are combined according to the equation
  • the method ofthe invention further comprises a step of determining the predetermined measurement variance for each measurement.
  • the predetermined measurement variance of each measurement y ti is determined according to an error model, e.g., a three-term error model according to equation
  • ⁇ ⁇ t is the predetermined measurement variance of measurement vchel, a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
  • each error-weighted measurement is generated using a weighting factor which is determined based on the predetermined measurement variance ofthe measurement.
  • the weighting factor for measurement y ti is determined according to the equation
  • the propagated variance for each measurement . is determined according to the equation
  • the methods further comprise using an effective number of samples instead ofthe actual number of samples for each measurement group.
  • the effective numbers of samples are calculated according to equations
  • the within-group degree of freedom is calculated according to the equation
  • the within-group degree of freedom is calculated according to the equation
  • N is the total number of measurements.
  • statistical test is an F-test and the significance level is a p-value determined according to the equation
  • Each ofthe k * n different measurement groups consists of measurements ofthe variable under common conditions t and i of condition groups N having n different conditions and T having k different conditions, respectively, and each measurement y t has a predetermined measurement variance, e.g., ⁇ ? ((7 .
  • k, n and m ti can be any appropriate integer.
  • the invention provides a method comprising (a) determining a within-group variance for the k * n different measurement groups, wherein the within-group variance consists of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances of the plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with respect to a mean of measurements in a respective measurement group; (b) determining (bl) a between-group variance for condition group K, wherein the between-group variance is a variance of condition group means for the condition group K, one for each condition in the condition group K, with respect to a mean ofthe plurality of measurements, wherein each condition group mean for the condition group Kis the mean of measurements in all measurement groups under a respective condition in the condition group K, or (b2) an interaction variance, wherein the interaction variance is a variance of group interaction means, one for each ofthe k * n different measurement groups, with respect to a mean ofthe plurality of measurements and the
  • the method further comprises prior to step (a) a step of weighting each measurement y tlJ with a weighting factor to obtain an error- weighted measurement, where the weighting factor is determined based on the predetermined measurement variance ofthe measurement.
  • the comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein the statistical metric is determined by a method comprising (i) determining a within-group degree of freedom; (ii) determining a between-group degree of freedom, if the between-group variance is determined in step (b), or an interaction degree of freedom, if the interaction variance is
  • step (b) 20 determined in step (b); (iii) determining a within-group mean square; (iv) determining a between-group mean square, if the between-group variance is determined in step (b), or an interaction mean square, if the interaction variance is determined in step (b); and (v) calculating the statistical metric.
  • the invention provides an improved two-way
  • the improvement comprises determining a within-group variance consisting of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances ofthe plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with
  • each of the one or more measurements is an error- weighted measurement generated using a weighting factor determined based on the predetermined measurement variance ofthe measurement.
  • each ofthe one or more measurements is an error- weighted measurement generated using a weighting factor which is determined based on the predetermined measurement variance ofthe measurement.
  • the weighting factor for measurement y tlJ is determined according to the equation
  • the within-group variance can be determined based on group variances ofthe n * k different measurement groups, each group variance consisting of a measurement group propagated variance and a measurement group scattered variance.
  • Each measurement group propagated variance can be determined based on the predetermined measurement variances ofthe one or more measurements in the measurement group.
  • the measurement group scattered variance can be a variance ofthe one or more measurements with respect to the mean ofthe one or more measurements in the measurement group.
  • the propagated variance for each measurement group is determined according to the equation
  • the group propagated variance and the group scattered variance for each group are combined according to the equation
  • the method ofthe invention further comprises a step of determining the predetermined measurement variance for each measurement.
  • the predetermined measurement variance of each measurement ⁇ is determined according to an error model, e.g., a three-term error model according to equation
  • o ⁇ tij is the predetermined measurement variance ofthe measurement ⁇
  • a is a fractional error coefficient
  • b is a Poisson error coefficient
  • c is a standard deviation of background noise.
  • a between-group variance is determined in step (b).
  • the within-group degree of freedom is calculated according to the equation
  • V ⁇ k-l
  • V R is calculated according to the equation
  • the statistical test can be an F-test and the significance level can be a p-value determined according to the equation
  • an interaction variance is determined in step (b).
  • the within-group degree of freedom is calculated according to the equation
  • V* ⁇ m ti - 1 + — m u
  • the statistical test can be an F-test and the significance level o can be a p-value determined according to the equation
  • the within-group variance for each variable can be determined based on group variances ofthe k different measurement groups for the variable, each of such group variance consisting of a measurement group propagated variance and a measurement group scattered variance.
  • Each measurement group propagated variance can ⁇ r be determined based on the predetermined measurement variances ofthe one or more measurements in the measurement group.
  • the measurement group scattered variance can be a variance ofthe one or more measurements with respect to the mean ofthe one or more measurements in the measurement group.
  • the propagated variance for each measurement group is determined
  • the mean of the one or more measurements is determined according to the equation
  • the group propagated variance and the group scattered variance for each group are combined according to the equation
  • the method ofthe invention further comprises a step of determining the predetermined measurement variance for each measurement.
  • the predetermined measurement variance of each measurement y ti (j) is determined according to an error model, e.g., a three-term error model according to equation
  • O ⁇ (J) is the predetermined measurement variance ofthe measurement yetteQ)
  • a is a fractional error coefficient
  • 6 is a Poisson error coefficient
  • c is a standard deviation of background noise.
  • step of comparing the within-group variance ofthe first variable with the between-group variance ofthe first variable is carried out according to the equation
  • N is the total number of measurements of said first variable
  • V avg is chosen to be at least about 20 to about 100.
  • the plurality of measurements are measured under a common condition, and each ofthe measurement has a predetermined measurement variance.
  • the method comprises (a) determining a propagated
  • n t can be any appropriate integer.
  • the propagated variance for each measurement group is determined according to the equation
  • the method further comprises prior to step (a) a step of weighting each measurement y t with a weighting factor to obtain an error- weighted measurement, where the weighting factor is determined based on the predetermined measurement variance ofthe measurement.
  • the weighting factor is determined according to the equation
  • the comparing is carried out using an F-test according to the equation
  • the invention also provides a computer system comprising a processor and a memory coupled to the processor and encoding one or more programs which programs cause the processor to carry out any one ofthe methods described above.
  • the invention also provides a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, which computer program product comprises a computer readable storage medium having a computer program mechanism encoded thereon.
  • the computer program mechanism may be loaded into the memory ofthe computer and cause the computer to carry out any one ofthe methods described above.
  • FIG.l is a schematic illustration of an embodiment ofthe improved ANOVA data processing method. Briefly, the processing blocks in the diagram are listed as following.
  • this pre-processing module handles data normalization and data transformation of, e.g., intensity data.
  • Intensity error of a micro-array measurement has a multiplicative component that is proportional to the intensity itself.
  • the intensity data can be transformed by an error model based transformation which converts the intensity x and intensity error ⁇ x to a new
  • this improved ANOVA module takes two variables y and ⁇ y as inputs. For one-way ANOVA, the module computes one ANOVA p-value for each measurement y. For two-way ANOVA, it computes two p-values ofthe two factors and one p-value for the interaction between the two factors. The module also estimates the
  • FIG. 2 illustrates an exemplary embodiment of a computer system useful for 30 implementing the methods of this invention.
  • a measurement group refers to a set of one or more replicate measurements of a variable of particular type of sample, e.g., a particular type of cells, a cells sample under a particular condition or perturbation, such as exposure to a drug, a genetic mutation and so on.
  • the variable can be any measurable variable ofthe sample including but not limited to expression or transcript level of a gene or abundance of a protein, level or concentration of a small cellular molecule, e.g., a metabolite, level of a clinical indicator, e.g., level or concentration of a chemical in blood.
  • a measurement group may contain a set of gene expression levels of a biological sample
  • measurement group is used interchangeably with the terms "treatment group” or "sample group.”
  • the state of a cell or other biological sample is represented by cellular constituents
  • the measured j r data can be measurements of such cellular constituents or measurements of responses of cellular constituents.
  • biological sample is broadly defined to include any cell, ⁇ tissue, organ or multicellular organism.
  • a biological sample can be derived, for example, from cell or tissue cultures in vitro.
  • a biological sample can be derived from a living organism or from a population of single cell organisms.
  • the biological sample comprises a living cell or organism.
  • the state of a biological sample can be measured by the content, activities or ⁇ r structures of its cellular constituents.
  • the state of a biological sample is taken from the state of a collection of cellular constituents, which are sufficient to characterize the cell or organism for an intended purpose including, but not limited to characterizing the effects of a drug or other perturbation.
  • the term "cellular constituent" is r also broadly defined in this disclosure to encompass any kind of measurable biological variable.
  • the measurements and/or observations made on the state of these constituents can be of their abundances (i.e., amounts or concentrations in a biological sample) e.g., of mRNA or proteins, or their activities, or their states of modification (e.g., phosphorylation), or other measurements relevant to the biology of a biological sample.
  • abundances i.e., amounts or concentrations in a biological sample
  • states of modification e.g., phosphorylation
  • this invention includes making such measurements and/or observations on different collections of cellular constituents. These different collections of cellular constituents are also called herein aspects ofthe biological state of a biological sample.
  • One aspect ofthe biological state of a biological sample (e.g., a cell or cell culture) usefully measured in the present invention is its transcriptional state.
  • a biological sample e.g., a cell or cell culture
  • transcriptional state is the currently preferred aspect ofthe biological state measured in this invention.
  • the transcriptional state of a biological sample includes the identities and abundances ofthe constituent RNA species, especially mRNAs, in the cell under a given set of conditions. Preferably, a substantial fraction of all constituent RNA species in the biological sample are measured, but at least a sufficient fraction is measured to characterize ⁇ . the action of a drug or other perturbation of interest.
  • the transcriptional state of a biological sample can be conveniently determined by, e.g., measuring cDNA abundances by any of several existing gene expression technologies.
  • One particularly preferred embodiment of the invention employs DNA arrays for measuring mRNA or transcript level of a large number of genes.
  • the other preferred embodiment ofthe invention employs DNA arrays r for measuring expression levels of a large number of genes or exons in the genome of an organism.
  • Another aspect ofthe biological state of a biological sample usefully measured in the present invention is its translational state.
  • the translational state of a biological sample includes the identities and abundances ofthe constituent protein species in the biological
  • ⁇ n sample under a given set of conditions.
  • a substantial fraction of all constituent protein species in the biological sample is measured, but at least a sufficient fraction is measured to characterize the action of a drug of interest.
  • the transcriptional state is often representative ofthe translational state.
  • Still another aspect ofthe biological state of a biological sample is its small
  • ⁇ c molecule state e.g., metabolic state.
  • the small molecule state of a biological sample comprises identities and abundances of small molecules present in a cell.
  • Small molecules refer to molecules of molecular weights of less than about 5000, including but are not limited to sugars, fatty acids, amino acids, nucleotides, intermediates of cellular processes,
  • the activity state of a biological sample includes the activities ofthe constituent protein species (and also optionally catalytically active nucleic acid species) in the biological sample under a given set of conditions.
  • the translational state is often representative ofthe activity state.
  • This invention is also adaptable, where relevant, to "mixed" aspects ofthe biological state of a biological sample in which measurements of different aspects ofthe biological state of a biological sample are combined.
  • the biological state of a biological sample in which measurements of different aspects ofthe biological state of a biological sample are combined.
  • the mixed aspect the
  • the biological state of a biological sample (e.g., a cell or cell culture) is represented
  • S- is the level ofthe fth cellular constituent, for example, the transcript level of gene i, or alternatively, the abundance or activity level of protein i.
  • k is more than 2, preferably more than 10, more preferably more than 100 , still more preferably more than 1000, still more preferably more than 10,000, still more n preferably more than 25,000, still more preferably more than 50,000, and most preferably more than 100,000.
  • cellular constituents are measured as continuous variables.
  • transcriptional rates are typically measured as number of molecules synthesized per unit of time.
  • Transcriptional rate may also be measured as percentage of a or control rate.
  • cellular constituents may be measured as categorical variables.
  • transcriptional rates may be measured as either "on” or “off, where the value "on” indicates a transcriptional rate above a predetermined threshold and value "off indicates a transcriptional rate below that threshold.
  • the responses of a biological sample to a perturbation i.e., under a condition, such as the application of a drug, can be measured by observing the changes in the biological state ofthe biological sample.
  • a condition such as the application of a drug
  • a response profile is a collection of changes of cellular constituents.
  • the response profile of a biological sample (e.g., a cell or cell culture) to the perturbation m is defined as the vector v m) :
  • V-" is the amplitude of response of cellular constituent i under the
  • the biological response to the application of a drug, a drug candidate or any other perturbation is measured by the induced change in the transcript level of at least 2 genes and/or proteins, preferably more than 10 genes and/or proteins, more preferably more than 100 genes and/or proteins, still more preferably more than 1000 genes and/or proteins, still more preferably
  • the biological response to the application of a drug, a drug candidate or any other perturbation is measured by the induced change in the expression levels of a plurality
  • genes and/or proteins preferably more than 10 genes and/or proteins, more preferably more than 100 genes and/or proteins, still more preferably more than 1000 genes and/or proteins, still more preferably more than 10,000 genes and/or proteins, still more preferably more than 25,000 genes and/or proteins, still more preferably more than 50,000 genes and/or proteins, and most preferably more than 100,000 genes and/or proteins.
  • the response is simply the difference between biological variables before and after perturbation.
  • the response is defined as the ratio of cellular constituents before and after a perturbation is applied.
  • v (m) (t) may be the difference or ratio of cellular constituents before the perturbation and at time t after the perturbation.
  • vf is set to zero if the response of gene i is below some threshold amplitude or confidence level determined from knowledge ofthe measurement error behavior. In such embodiments, those cellular constituents whose o measured responses are lower than the threshold are given the response value of zero, whereas those cellular constituents whose measured responses are greater than the threshold retain their measured response values.
  • This truncation ofthe response vector is a good strategy when most ofthe smaller responses are expected to be greatly dominated by measurement error. After the truncation, the response vector v (m) also approximates a 5 'matched detector' (see, e.g., Van Trees, 1968, Detection, Estimation, and Modulation Theory Vol.
  • genes whose transcript level changes are lower than two fold or more preferably four fold are given the 0 value of zero.
  • perturbations are applied at several levels of strength. For example, different amounts of a drug may be applied to a biological sample to observe its response.
  • the perturbation responses may be interpolated by approximating each by a single parameterized "model" function ofthe perturbation 5 strength u.
  • An exemplary model function appropriate for approximating transcriptional state data is the Hill function, which has adjustable parameters a, u 0 , and n.
  • the adjustable parameters are selected independently for each cellular constituent ofthe perturbation response.
  • the adjustable parameters are selected for each cellular constituent so that the sum ofthe squares ofthe differences between the model function 5 (e.g., the Hill function, Equation 3) and the corresponding experimental data at each perturbation strength is minimized.
  • This preferable parameter adjustment method is well known in the art as a least squares fit.
  • Other possible model functions are based on polynomial fitting, for example by various known classes of polynomials. More detailed description of model fitting and biological response has been disclosed in Friend and Stoughton, Methods of Determining Protein Activity Levels Using Gene Expression Profiles, PCT publication WO 99/59037, which is incorporated herein by reference in its entirety for all purposes.
  • Measured data obtained in a microarray experiment often contain errors due both to the inherent stochastic nature of gene expression and to measurement errors from various external sources.
  • the many sources of measurement error that may occur in a measured signal include those that fall into three categories - additive error, multiplicative error, and
  • the signal magnitude-independent or intensity-independent additive error includes errors resulted from, e.g., background fluctuation, or spot-to-spot variations in signal intensity among negative control spots, etc.
  • the signal magnitude-dependent or intensity-dependent multiplicative error which is assumed to be directly proportional to the signal intensity, includes errors resulted from, e.g., the scatter observed for ratios that should be unity.
  • the multiplicative error is also termed fractional error.
  • the third type of error is a result of variation in number of available binding sites in a spot. This type of error depends on the square-root ofthe signal magnitude, e.g., measured intensity. It is also called the
  • Poisson error because it is believed that the number of binding sites on a microarray spot follows a Poisson distribution, and has a variance which is proportional to the average number of binding sites.
  • measured data are first transformed by an error model based transformation before analyzed by the improved ANOVA method ofthe invention.
  • the results from the ANOVA analysis can be transformed back by an appropriate inverse transformation.
  • An error model based data transformation method is described in Weng, U.S. Provisional Patent Application No. 60/353,845, filed on January 31, 2002, which is incorporated by reference herewith in its entirety.
  • ERROR MODELS Errors in measured data can be described by error models (see, e.g., Supplementary material to Roberts et al, 2000, Science, 287:873-880; and Rocke et al, 2001, J. Computational Biology 8:557-569).
  • an error model see, e.g.,
  • the constant error is independent from the hybridization levels of individual spots on a microarray. It may come from scanner electronics noise and/or fluorescence due to nonspecific binding of fluorescence molecules to the surface ofthe microarray. In one embodiment, this constant additive error is taken to have a normal distribution with a mean bkg and a standard deviation ⁇ bkg . After background level subtraction, which is typically
  • the additive mean bkg becomes zero.
  • the background intensity offset has been corrected.
  • the methods ofthe invention can be used with an additional step of making such a correction.
  • the second error source is the multiplicative error that is the combined result ofthe speckle noise inherent in the coherent laser scanner and the fluorescence dye related noise.
  • the multiplicative error is also called fractional error because its level is directly proportional to the magnitude ofthe measured signal, e.g., the measured intensity level. It is the dominant error source at high intensity levels.
  • x(k) is the measured intensity in the /c'th spot.
  • the constant a in Equation 4 is termed fractional error coefficient, and describes the proportion ofthe fractional error to the intensity ofthe measured signal.
  • the constant has a value in the range of 0.1 to 0.2. This constant may vary depending on the particular microarray technology used for obtaining the measured signal and/or the particular hybridization protocol used in
  • parameter a is determined during the error building phase by measuring the variance ofthe log ratio near the high intensity side in a same-vs.- same ratio experiment where the intensities in the ratio numerator and denominator come from the same sample and treatment. At high intensities, the variance of log ratio x, over x 2 relates to parameter a
  • x 1 and x 2 are at least 4, 10, 50, 100, or 200 times ⁇ bkg .
  • the measurement error in a measured signal e.g., measured intensity, x(k)
  • x(k) the measurement error in a measured signal
  • ⁇ * ⁇ (X) ⁇ bkg ( k f + ⁇ f a « ⁇ bks (k) 2 + a 2 - x(kf (6)
  • the background noise variances in Equation 6 are taken as slightly different in different microarray spots or regions of a microarray chip.
  • the difference is less than 20%, 10%, 5%, or 1%.
  • an extra square-root term is included to describe measurement errors originated from variation in the number of available binding sites in a microarray spot. This term is also called the Poisson term.
  • the measured intensity is used to provide an estimate ofthe average number of binding sites.
  • the Poisson error can be approximated as
  • parameter b in Equation 8 is determined by measuring the intensity variance in the middle intensity ranges ofthe same-vs.-same experiments. In one embodiment, the intensity variance is measured in the 25 to 75 percentile range, 35 to 65 percentile range, or 45 to 50 percentile range for determination of b.
  • parameters a and b are fixed for an error model under a given microarray technology and experiment protocol.
  • the background noise ⁇ bkg can be estimated for each particular microarray experiment.
  • the background noise ⁇ bkg for the set can be obtained by averaging the background noise estimated for each ofthe replicate experiments.
  • Equation 6 The two-term error model as described by Equation 6 can been seen as a simplified version ofthe three-term error model described by Equation 8 by setting the Poisson parameter b to zero.
  • Equation 8 is used as the general mathematical description of error models. It will be apparent to an ordinarily skilled artisan that any results obtained based on Equation 8 are also applicable to a two-term error model by setting the Poisson parameter b to zero.
  • the transformation works for both positive and negative (e.g, negative signals obtained after background subtraction) x. More preferably the transformation meets the following additional constraints:
  • an inverse transformation function g exists so that the transformed data in the transformed domain can be transformed back to the original domain.
  • the inverse transformation does the following operation:
  • the inverse transformation function g meets above four constraints as well.
  • the error in the inversely transformed intensity can be determined when the first derivative/' Q ofthe forward transformation function/is available:
  • the forward transformation function/ its first derivative/, and the inverse transformation function g are all in analytical closed-forms.
  • a transformation based on an error model is provided and used to transform measured data obtained in an experiment to a transformed domain such that the measurement errors in transformed data are equal to the measurement errors in the measured data normalized by errors determined based on an error model.
  • such an measurement error i.e., a measurement error which equals the measurement error in the measured signal normalized by an error determined based on an error model
  • a normalized error is also referred to as a normalized error.
  • Any suitable error model can be used in the invention.
  • the error model is a two-term or a three-term error model described in Section 5.1.1.1.
  • the error model is a two-term or a three-term error model described in Section 5.1.1.1.
  • the transformation meets all requirements discussed in Section 5.2.1.2.
  • the basic concept ofthe new transformation method is to apply an error model to normalize errors in real measurements, e.g., standard deviations in measured data, such that the normalized errors are close to a constant. Then a transformation function/) is found by the
  • the methods are applicable to any set of measured data whose errors can be described by a particular error model.
  • the real measurement standard deviation ⁇ x is for the positive intensity x>0.
  • the real standard deviation ⁇ x is usually known before the transformation.
  • An error model in Equation 8 provides ⁇ x that is an estimate ofthe real
  • ⁇ x is an error determined by the experiment.
  • ⁇ x is calculated using an error model ofthe experiment.
  • ⁇ x is chosen to be the larger of an experimentally determined error or an error model-calculated error. Assuming the transformed standard deviation is Ay, the following approximation relates the two errors
  • Equation 8 is an approximation of ⁇ x, if a normalization function ' is defined as follows:
  • Equation 15 provides an analytical form ofthe first derivative function ofthe desired transformation. To obtain the transformation function itself, both sides of Equation 15 are integrated:
  • Equation 16 does have an analytical solution.
  • a 0 symbolic-solution software such as Mathematica, or using an integral table, the solution is obtained as
  • Equation 11 in Section 5.2.1.2 preferably one finds the inverse transformation function g(y) so that the transformed intensity y can be converted back to the original x scale whenever necessary.
  • linear algebra or a symbolic-solution software such as Maple
  • the standard deviation ofthe inversely transformed intensity can be estimated by using Equation 12.
  • the transformation function can be further defined to be symmetric to zero for all x.
  • is used to replace x in the forward transformation in Equation 17 and to give a negative sign to the result y.
  • is used to replace;; and to give a negative sign to the result x.
  • the transformation also meets all other requirements and constraints stated in Section 5.2.1.2.
  • the transformation has several other interesting properties:
  • the transformation described in this section is applicable to any measured data in which the errors can be described by a three-term error model.
  • the measured data are measured in a microarray gene expression experiment.
  • the measured data are measured in a protein array experiment or a 2D gel protein experiment.
  • the measured data are signal data obtained in an microarray experiment in which two spots or probes on a microarray are used for obtaining
  • each measured signal one comprising the targeted nucleotide sequence, i.e., the target probe (TP), e.g., a perfect-match probe, and the other comprising a reference sequence, i.e., a reference probe (RP), e.g., a mutated mismatch probe.
  • the RP probe is used as a negative control, e.g., to remove undesired effects from non-specific hybridization.
  • the measured signal obtained in such a manner is defined as the difference
  • the TP probe is a perfect-match probe (PM)
  • the RP probe is a mismatch probe (MM) (see, e.g., Lockhart et al, 1996, Nature Biotechnology
  • the RP probe is a reverse probe ofthe TP probe, i.e., the RP probe has a sequence that is the reverse complement ofthe TP probe (see, Shoemaker et al., U.S. Patent Application Serial No. 09/781,814, filed on February 12, 2001; and Shoemaker et al., U.S. Patent Application Serial No. 09/724,538, filed on November 28, 2000).
  • Equation 8 when intensity x is very high, the fractional error is the dominant error source. In this case, the standard deviation of is approximately a constant:
  • Still another transformation that can be used to transform the data is a piecewise hybrid transformation (see, e.g., D. Holder, et al, "Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach", presented in Genelogic Workshop on Low Level Analysis of Affymetrix Genechip® data, Nov 19, 2001, Bethesda, MD, http://oz.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/Holder.ppt).
  • This hybrid transformation uses a linear function at the low intensity side and a logarithm function for high intensities.
  • parameter c ' in Equation 28 is chosen to be 20.
  • Errors ofthe hybrid- transformed intensities can be estimated as
  • ⁇ y (k) » ⁇ x (k) - f' ⁇ x(k)) ⁇ x ⁇ k), fo ⁇ 0 ⁇ x ⁇ k) ⁇ c ⁇
  • the present invention provides improved ANOVA methods for processing and analyzing measured data and transformed data, e.g., data transformed by a method described in Section 5.2.
  • transformed data are analyzed by the improved ANOVA method of the invention, it is optional to transform the results back by an appropriate inverse transformation.
  • ANOVA METHOD Analysis of Variance is a widely used method for analyzing gene or protein expression data (see, e.g., Statistics For Experimenters, Box, Hunter and Hunter, John Wiley and Sons, 1978; Siegel et al., Nonparametric statistics for the behavioural sciences, McGraw Hill, 2 nd edition, 1998; Conover, Practical Nonparametric Statistics, John Wiley and Sons, 3 rd edition, 1998; Airman, Practical Statistics for Medical Research, CRC Press, 1991; Berry, Statistical Methods in Medical Research, Blackwell Science, Inc.,
  • ANOVA is a method for determining whether there are statistical differences among the means of different measurement groups.
  • Each different measurement group contains one or more replicate measurements of a particular type of sample, e.g., a particular type of cells, a cells sample under a particular condition or perturbation, such as exposure to a drug, a genetic mutation and so on.
  • gene expression levels or protein abundances or activity levels from cells under different conditions or perturbations including but are not limited to different drug treatments, e.g., different doses the a particular drug, different drugs, different time after the treatment by a drug, etc., can be measured using DNA or protein micro-arrays and 2D gel electrophoresis or mass spectrometry.
  • each measurement group contains replicate measurements of expression level of a gene or abundance/activity level of a protein in cells under a condition or perturbation common to the measurements, whereas different measurement groups contain measurements of expression level ofthe gene or abundance/activity levels ofthe protein in cells under
  • the within-group variance i.e., variance of individual measurements
  • the between-group variance i.e., variance ofthe means of different treatment groups.
  • the within-group variance reflects the measurement error ofthe measurement technology
  • the between- group variance includes both the measurement error ofthe measurement technology and the
  • treatment group or “sample group” are also used.
  • the term "measurement group” are intended to be equivalent and are used interchangeably.
  • the term “average” and “mean” are also used interchangeably.
  • null-hypothesis H 0 is that all treatment groups have the same mean, i.e., the mean of each treatment group does not change among different treatment groups. For example, in cases of expression levels of a gene or abundances of a
  • H 0 is that the mean expression level ofthe gene or abundances ofthe protein is the same under the different conditions.
  • a statistical distribution of residues can be selected. For example, in ANOVA F-statistics is often used with an underlying assumption of normal distribution. A statistical metric can then be chosen. The significance level of he statistical metric is then determined. If the
  • ⁇ c significance level is greater than a given threshold, the null-hypothesis is accepted, whereas if the significance level is lower than the threshold the null-hypothesis is rejected and the alternative hypothesis H t that one or more measurement groups have different means is accepted.
  • a randomization method e.g., permutation, can be employed to obtain a distribution free significance level (see, e..g., Wu et al., "MAANOVA: A software r. package for the Analysis of Spotted cDNA Microarray Experiments, published at www.jax.org/churchill/pubs/wu_maanova.pdf, accessed on October 25, 2002, which is incorporated by reference herein in its entirety).
  • a skilled person in the art will be able to select an appropriate statistical test once the data to be analyzed are given.
  • a statistical metric e.g., a metric for F-statistics
  • k different treatment groups e.g., k different compounds or k different dose levels of one compound
  • n t replicate measurements in treatment group t n t replicate measurements in treatment group t
  • the group average is
  • the traditional one-way A ⁇ OVA table is presented in Table 1 (see, e.g., Box et al., Statistics For Experimenters, John Wiley & Sons, 1978).
  • the A ⁇ OVA table summarizes the quantities used in A ⁇ OVA calculation.
  • the statistical metric for F-statistics can be the ratio ofthe estimated mean squares in Table 1, i.e., s T 2 /s R 2 .
  • the probability that the null-hypothesis may be accepted using F-test is then calculated as a p-value according to Equation 33.
  • the p-value is lower than a given threshold, for example p-value ⁇ 0.01, the null-hypothesis is rejected and the alternative hypothesis H j is accepted.
  • a second input is also used.
  • the second input is a predetermined measurement variance or error.
  • a predetermined measurement variance or error can be any variance or error which is determined prior to the ANOVA analysis.
  • the predetermined variance or error of a measurement can be any variance or error which is determined prior to the ANOVA analysis.
  • the predetermined variance or error of a measurement can be any variance or error which is determined prior to the ANOVA analysis.
  • the predetermined variance or error of a measurement of a variable is determined based on prior measurements ofthe same variable using the same measurement technological platform, e.g., prior measurements of the expression level ofthe same gene or protein using the same type of microarrays.
  • the predetermined variance or error of a measurement of a variable is determined based on concurrent measurements of a plurality of different variables using the same measurement technological platform, e.g., measurements ofthe expression levels of a plurality of different genes using one or more microarrays.
  • the plurality of different variables may include the variable whose predetermined variance or error is to be r. determined.
  • the plurality of different variables may include only variables other than the variable whose predetermined variance or error is to be determined.
  • the predetermined error or variance can be determined from such prior or concurrent measurements using a suitable error model. Determination ofthe predetermined variance or error based on a combination ofthe above discussed prior and concurrent measurements is jC also envisioned.
  • a predetermined measurement variance or error is also referred to as a "prior,” which is intended to encompass a predetermined variance or error determined by any ofthe methods disclosed above.
  • the additional input is an estimated standard error ofthe measurement ;; sauce, ⁇ increment.
  • the invention provides a method of analyzing a plurality
  • each measurement y is first weighted with a weighting factor to obtain an error-weighted measurement.
  • the weighting factor can be determined based on the measurement variance ofthe measurement. In one embodiment the weighting factor is calculated according to Equation (34), infra. Then a
  • the within- group variance consists of a propagated variance and a scattered variance.
  • the propagated variance can be determined based on the predetermined measurement variances ofthe plurality of measurements, and the scattered variance can be determined based on deviation of each ofthe plurality of measurements with respect to a respective group mean, i.e., a
  • the propagated variance for each measurement group is determined according to Equation (37), infra.
  • the scattered variance is determined according to
  • a between-group variance for said k different measurement groups is then determined.
  • the between-group variance is a variance of a plurality of group means, one for each ofthe k different measurement groups, with respect to a mean ofthe plurality of error- weighted measurements.
  • the within-group variance is then compared with the between-group variance to determine, e.g., if the condition or perturbation has an effect on
  • the method ofthe invention can also be carried out without error- weighting the measurements.
  • a within-group variance for the k different measurement groups is determined.
  • the within-group variance consists of a propagated variance and a scattered variance.
  • the propagated variance can be determined based on the measurement
  • the scattered variance can be determined based on deviation of each ofthe plurality of measurements with respect to a respective group mean, i.e., a mean of measurements in a respective measurement group. It will be apparent to one skilled in the art that if a measurement group contains only one measurement, the mean is the same as the measurement.
  • the on propagated variance for each measurement group is determined according to Equation (37a), infra.
  • the scattered variance is determined according to Equation (40a), infra.
  • a between-group variance for said k different measurement groups is then determined.
  • the between-group variance is a variance of a plurality of group means, one for each ofthe k different measurement groups, with respect
  • the within-group variance is then compared with the between-group variance to determine, e.g., if the condition or perturbation has an effect on the variable.
  • the within-group variance is a sum of a plurality of
  • the measurement group propagated variance can be determined based on the predetermined measurement variances ofthe one or more measurements in the measurement group.
  • the measurement group scattered variance is a variance ofthe one or more measurements with respect to the mean ofthe one or more
  • r. measurements or error- weighted measurements in the measurement group It will be apparent to one skilled in the art that if a measurement group contains only one measurement, the mean is the same as the measurement. In a preferred embodiment, the mean of said error- weighted measurement is determined according to Equation (36), infra, and the scattered variance is determined according to Equation (40), infra.
  • the propagated variance and the scattered variance are preferably combined in such a manner that when the number of measurements in a measurement group is small, the propagated error is dominant, whereas when the number of measurements in a measurement group increases, the scattered error becomes more and more dominant, and when the number of measurements in a measurement group become very large, the group variance . approaches the scattered error.
  • the propagated variance and the scattered variance are combined according to Equation (43), infra.
  • the comparison ofthe within-group variance and the between- group variance is carried out using an appropriate statistical test, e.g., an F-test.
  • an appropriate statistical test e.g., an F-test.
  • within-group degree of freedom, between-group degree of freedom, within- y C group mean square, and between-group mean square are first determined using the appropriate variances.
  • these parameters are determined according to the respective equations in Table 2.
  • a significance level, e.g., a p-value can then be determined, e.g., according to Equation (33).
  • the analysis can be carried out by applying traditional n ANOVA on error-weighted measurements.
  • a technology platform specific error model is used to determine the predetermined measurement error ⁇ tl .
  • Such estimated measurement error is used as a predetermined error or a "prior" in the within-group variance estimation.
  • the errors are estimated using an error model described in
  • the additional input ⁇ counter comes from an application-specific error model, e.g., a three term error model as described by Equation 8.
  • the error model Based on the knowledge of data noise sources and training data, the error model provides a conservative estimation ofthe measurement error in quantity v. Measurement error contains errors from
  • the measurement error determines the lower bound ofthe total error that includes biological variations. When the number of replicates is limited, the measurement error can be used as a "prior" in the variance estimation.
  • the measurement error provides
  • n additional information to obtain better variance estimations.
  • the additional input also brings in additional degrees-of-freedom in the analysis. Both the improved variance estimation and the additional degree-of-freedom can help increase the statistical power of the analysis.
  • the measurement error may also be used for determining the quality ofthe measurement. Weights that are inversely proportional to measurement errors in the mean
  • ⁇ - estimation may be used to penalize measurements with large errors.
  • a common error in an experiment estimated across a plurality of different biological variables v's measured in the same experiment e.g., expression levels of different genes measured by using one or more microarrys in an experiment
  • a common error is estimated according the method disclosed in U.S. Patent No. 6,351,712, which is incorporated herein by reference in its entirety. Common errors determined by other methods can also be used (See, Section 5.3.2.5, infra).
  • the invention is based, at least in part, on the discovery by the inventor that when c the number of replicate measurements in one or more measurement groups is small, the second input provides for more reliable variance estimation. Therefore, the additional input may help setting a "floor" in the within-group variance estimations for more accurate determination ofthe within-group variance. The additional input may also provide additional degrees-of-freedom to the analysis. As a result, both false-positive and false o r. negative rates may be significantly reduced.
  • the additional input e.g., the measurement error ⁇ district
  • the additional input can be used in error- weighted averaging for the group mean estimation.
  • the weight is inversely proportional to the square of ⁇ a so that measurements with larger errors will have less contribution to the mean estimation.
  • the improved ANOVA methods ofthe invention can be used to carry out analysis of any measurements, including but not limited to intensity data, such as gene expression intensity data obtained using Affymetrix micro-arrays, and ratio data, such as gene expression ratio data obtained using Agilent micro-arrays.
  • intensity data such as gene expression intensity data obtained using Affymetrix micro-arrays
  • ratio data such as gene expression ratio data obtained using Agilent micro-arrays.
  • the weighting factor can be defined as:
  • the grand average and the treatment group average are calculated as:
  • the scattered errors ofthe weighted treatment group average and the grand average are calculated as
  • the propagated error provides prior information for the error estimation.
  • the error model as described in Section 5.2, supra is used to estimate the propagated error for each y.
  • the propagated error and the scattered error are combined to improve error estimation for the group average and the grand average. The combination idea is that when the number of samples is small the prior should contribute more to the combined error estimation, whereas when the number of samples becomes large, the combined error should be dominated by the scattered error, which is the actual error estimation from data. In the case of large number of samples, the estimated error becomes less dependent on the error model.
  • the weight is defined as inversely proportional to the variance ofthe error ofthe scattered error. In one embodiment, because the error ofthe scattered error is
  • the within-group mean-square is calculated based on the estimated error.
  • the group mean-square is the error square multiplied by the number of samples.
  • the mean-square is calculated as
  • the total sum of squares is the sum ofthe group sum of squares:
  • the overall mean square is the total sum of squares divided by the total within-group DOF:
  • the total DOF is given in Section 5.3.2.3, infra.
  • the same estimation method as shown in Table 1 is used to estimate the between-treatment variance.
  • the between-treatment variance is obtained by a combination of a propagated error and a scattered error in a manner similar to the within-treatment variance as described in Section 5.2.2.1, supra.
  • the first error term in Equations 43 and 44 reflects the propagated measurement errors ofthe derived averages.
  • this error term is defined in Equation 37. It is used as prior in the mean square estimation to improve the estimation accuracy especially when the number of replicates is small. When the number of replicates is small, by mere random chances, the sum-squares without the priors can be much larger or smaller than the inherent measurement error in the measurement technology, e.g., the microarray technology, resulting in high false-negative rate and false-positive rate, respectively. The prior sets a floor in the sum-square estimation so that the estimation can not become unreasonably small.
  • the new measurement-error input provides additional prior information to help improve the power of ANOVA analysis.
  • the additional measurement-error input also brings in more degrees-of-freedom to the ANOVA analysis.
  • the additional within-treatment DOF gained from the additional ANOVA input can be determined based on the level of confidence of the predetermined variance or error.
  • the additional DOF can range from zero, if the level of confidence ofthe predetermined variance or error is low, to a large number, if the level of confidence ofthe predetermined variance or error is high.
  • the within-treatment DOF can be c assigned to a very large number (e.g., at least 20, 50, orlOO) regardless ofthe number of measurements in the measurement groups, if the predetermined variance or error is already a conservative and realistic estimation ofthe measurement error.
  • the level of confidence of the predetermined variance or error can be determined based on the manner such variance or error is obtained (see, e.g., e.g., Stoughton et al., U.S. Patent No. 6,351,712, which is
  • the within-treatment DOF ofthe propagated error is calculated as
  • V y lP n t (49)
  • the two errors are weighted when estimating the individual mean-square of treatment group t in Equation 45, .
  • Each error has reduced contribution to the combined result, so does its DOF.
  • the DOF ofthe group mean-square is the weighted
  • the total within-treatment DOF is the sum of DOF's of all individual group:
  • Equation 52 the within-treatment DOF as described by Equation 52 is increased as compared to the value in Table 1.
  • the higher degrees-of-freedom provide more statistical power to the ANOVA analysis and therefore higher sensitivity.
  • the effective numbers of samples in Table 2 are calculated as
  • Table 2 lists an exemplary embodiment of one-way improved ANOVA with error- weighting in which changes from the traditional ANOVA (Table 1) can be easily seen. In the embodiment that does not involve error- weighting the measurements, the equations in Table 2 can be modified by substituting the relevant quantities, e.g., the averages, variances, and effective number of samples, with the quantities calculated using Eqs. 35a-38a, 40a- 41a, and 53a-54a.
  • F-statistics is used to test the null-hypothesis.
  • the null-hypothesis is a null-hypothesis.
  • IMPROVED TWO-WAY ANOVA The invention also provides improved two-way and N-way ANOVA for analyzing measured and/or transformed data under two or more different conditions, e.g., perturbations.
  • 2-way improved ANOVA is described in this section. It will be apparent to a person skilled in the art that N-way improved ANOVA can be practiced similarly. The notation used is similar to those used in Box et al., Statistics For Experimenters, John Wiley & Sons, 1978.
  • the invention therefore provides an improved two-way ANOVA method for t - analyzing a plurality of measurements of a variable in a plurality of different measurement groups, each of which consisting of one or more measurements ofthe variable under a condition among a first plurality of conditions and a condition among a second plurality of conditions, and each measurement has a predetermined measurement variance.
  • the improvement comprises determining a within-group variance consisting of a propagated 1 . variance and a scattered variance.
  • the propagated variance can be determined based on the predetermined measurement variances ofthe plurality of measurements, and the scattered variance is determined based on deviation of each ofthe plurality of measurements with respect to a mean of measurements in a respective measurement group.
  • the averages or error-weighted averages, propagated and , - scattered errors, DOF's, and effective number of samples for improved two-way ANOVA are similarly derived as those for the improve one-way ANONA (Sections 5.3.2.2-5.3.2.4).
  • the within-treatment DOF of group ti is
  • the within-treatment effective number of samples is
  • Each ofthe k * n different measurement groups consists of measurements ofthe variable under conditions t and i of
  • each measurement y tlJ has a predetermined measurement variance.
  • each measurement y t is first weighted with a weighting factor to obtain an error- weighted measurement.
  • the weighting factor can be determined based on
  • the weighting factor is calculated according to Equation (34).
  • a within-group variance for the k * n different measurement groups is then determined.
  • the within-group variance consists of a propagated variance and a scattered variance.
  • the propagated variance can be determined based on the predetermined measurement variances ofthe plurality of
  • r. measurements, and the scattered variance can be determined based on deviation of each of the plurality of measurements with respect to a mean of measurements in a respective measurement group.
  • the propagated variance for each measurement group is determined according to Equation (37).
  • the scattered variance is determined according to Equation (40). Then a
  • condition group K is determined using the between-group variance of condition group TT which is a variance ofthe group means for the j . respective conditions in condition group K with respect to a mean ofthe plurality of error- weighted measurements.
  • Each group mean for a condition in condition group K is the mean of measurements in all measurement groups under a respective condition in condition group K.
  • condition group N can be determined in a similar way by using the between-group variance of condition group N which is a variance ofthe group means for the j r respective conditions in condition group N with respect to a mean of the plurality of error- weighted measurements, where each group mean for a condition in condition group N is the mean of measurements in all measurement groups under a respective condition in condition group N.
  • the effect ofthe interaction between condition group N and condition group K can also be determined using the interaction variance which is a variance of group or.
  • each group interaction mean is the mean of measurements in a measurement group over the number of measurements in the measurement group, e.g., as calculated by the equations in Table 4.
  • Table 4 summarizes the quantities in an exemplary embodiment of improved 2-way ANOVA. As can be seen from Table 4, in addition to error- weighted averaging and effective number of samples, the within-treatment sum of squares and DOFs are also r changed as compared to the traditional ANOVA.
  • Measurement variance ⁇ _ can also be determined by various other methods known in the art and used as the additional ANOVA input.
  • the methods ofthe invention are equally applicable to such measurement variance.
  • the methods as described in Sections 5.3.3.1. through 5.3.3.5 can be used in conjunction with such measurement variance.
  • the measurement error ⁇ tl 35 estimated across a plurality of different biological variables measured in the experiment is used as the measurement error ⁇ tl .
  • the number ofthe plurality of different biological variables used for determining the common errors can be chosen by one skilled person in r the art. In preferred embodiments, the number is at least 2, 5, 10, 100, 1000, or 10,000.
  • common error variance across a plurality of different genes measured on a microarray can be used as the additional input in the improved ANOVA method ofthe invention for one or more genes in the plurality of different genes.
  • Such common error can be estimated over more than one replicates ofthe experiment, e.g., o more than one replicates of a microarray experiment.
  • such a common error is estimated from one or more replicates according a method disclosed in U.S. Patent No. 6,351,712, which is incorporated herein by reference in its entirety.
  • the error estimation may come from a regression analysis.
  • the invention also provides method of using the improved ANOVA for determining if measured data are unchanged under different conditions or perturbations and for determining if replicate measurements in a group are consistent with each other.
  • the invention therefore provides, in one embodiment, a method of determining if measured data of a biological variable under different conditions or perturbations have a large or small within-group variance.
  • the method ofthe invention makes use ofthe average within-group mean-square determined across a plurality of different biological variables, and compares the individual mean-square of measurements ofthe biological variable of
  • the hypothesis is that the individual mean-square is smaller than the average. If the hypothesis can be accepted, e.g., by an appropriate statistical test, the within-group variance is deemed small. For example, in a microarray experiment, a plurality of different genes are measured under a set of different conditions or perturbations, each on one or more microarrays. In one embodiment, if Mis the number of different
  • the different biological variables used for calculating the average can also be selected if desired. For example, data
  • the individual within-group mean-square is then compared with the average in an F-test as described by Eq. 65, where the DOF ofthe average v avg is typically large.
  • the DOF of v avg is set to a large arbitrary number in the computation because the Chi-Square distribution ofthe average is approaching a Gaussian distribution when the DOF is large.
  • the individual within-group mean-square ofthe 7th biological variable is identified as smaller than the average mean-square.
  • the SmallError '_pvalue threshold is set to be greater than 0.5, thus, those c biological variables/measurements that have within-group mean-squares less than the average are accepted as having small within-group variance.
  • the within-group mean squares in Eqs. 64 and 65 are calculated using the improved ANOVA method ofthe invention, e.g., according to Table 2 or Table 4.
  • 0 value thresholds one for the ANOVA p-value and one for the SmallError p-value, are set.
  • the ANOVA p-value is determined using a method ofthe invention as described in Section 5.3. Biological variables/measurements which have the ANOVA p-value and the SmallError p-value greater than the respective threshold are determined as truely unchanged.
  • the thresholds are selected as
  • the invention also provides a method of using the improved ANOVA method for detecting variations among replicates in a treatment group, thereby determining the consistency ofthe replicates.
  • an F-test is used to compare the scattered error as determined according to Equation 40 (Equation 40a if the measurements are not error weighted) and the propagated error as determined according to Equation 37 (Equation 37a if the measurements are not error weighted). If the scattered error is significantly larger than the propagated error, it indicates there are variations beyond
  • a p-value of this F-test termed “replicate consistency p-value” is calculated as:
  • a threshold is selected for the replicate consistency p-value. If the percentage ofthe number of biological variables/measurements having the p-value below the threshold to the total number of biological variables/measurements measured is significantly above the threshold level, it indicates the inconsistency among replicates in this treatment group.
  • the threshold ofthe p-value is chosen to be less than 1% (0.01).
  • variations in replicates beyond a level due to measurement errors are used to indicate quality issues in the data generating process.
  • variations in replicates beyond a level due to measurement errors are used to indicate/identified biological variations from the j r.
  • different sources e.g., a type of cell from different animals.
  • the analytical methods ofthe present invention can preferably be implemented using a computer system, such as the computer system described in this section, according j r to the following programs and methods.
  • a computer system can also preferably store and manipulate measured data obtained in various experiments that can be used by a computer system implemented with the analytical methods of this invention. Accordingly, such computer systems are also considered part ofthe present invention.
  • Computer system 201 is illustrated here as comprising internal components and as being linked to external components.
  • the internal components of this computer system include one or more processor elements 202 interconnected with a main memory 203.
  • computer system 201 can be an Intel Pentium®-based processor of 200 MHZ or greater clock rate and with 32 MB or more main memory.
  • computer system 201 is a cluster of a plurality of computers comprising a head "node” and eight sibling "nodes,” with each node having a central processing unit (“CPU").
  • the cluster also comprises at least 128 MB of random access memory (“RAM”) on the head node and at least 256 MB of RAM on each ofthe
  • RAM random access memory
  • the computer systems ofthe present invention are not limited to those consisting of a single memory unit or a single processor unit.
  • the external components can include a mass storage 204.
  • This mass storage can be one or more hard disks that are typically packaged together with the processor and memory.
  • Such hard disk are typically of 1 GB or greater storage capacity and more preferably have at
  • each node can have its own hard drive.
  • the head node preferably has a hard drive with at least 6 GB of storage capacity whereas each sibling node preferably has a hard drive with at least 9 GB of storage capacity.
  • a computer system ofthe invention can further comprise other mass storage units
  • r including, for example, one or more floppy drives, one more CD-ROM drives, one or more DVD drives or one or more DAT drives.
  • a user interface device 205 which is most typically a monitor and a keyboard together with a graphical input device 206 such as a "mouse.”
  • the computer system is also typically linked to a network link 207 which can j r. be, e.g. , part of a local area network ("LAN”) to other, local computer systems and/or part of a wide area network (“WAN”), such as the Internet, that is connected to other, remote computer systems.
  • LAN local area network
  • WAN wide area network
  • each node is preferably connected to a network, preferably an NFS network, so that the nodes ofthe computer system j r communicate with each other and, optionally, with other computer systems by means ofthe network and can thereby share data and processing tasks with one another.
  • a network preferably an NFS network
  • the software components comprise both software components that are standard in the art and components o r. that are special to the present invention. These software components are typically stored on mass storage such as the hard drive 204, but can be stored on other computer readable media as well including, for example, one or more floppy disks, one or more CD-ROMs, one or more DVDs or one or more DATs.
  • Software component 210 represents an operating system which is responsible for managing the computer system and its network
  • the operating system can be, for example, ofthe Microsoft WindowsTM family such as Windows 95, Window 98, Windows NT or Windows 2000.
  • the operating software can be a Macintosh operating system, a UNIX operating system or the LINUX operating system.
  • Software components 211 comprises common languages and
  • ⁇ - functions that are preferably present in the system to assist programs implementing methods specific to the present invention.
  • Languages that can be used to program the analytic methods ofthe invention include, for example, C and C++, FORTRAN, PERL, HTML, JAVA, and any ofthe UNIX or LINUX shell command languages such as C shell script language.
  • the methods ofthe invention can also be programmed or modeled in
  • Software component 212 comprises any analytic methods ofthe present invention described supra, preferably programmed in a procedural language or symbolic package.
  • software component 212 preferably includes programs that cause the processor to implement steps of accepting a plurality of measured data and storing the measured data in the memory.
  • the computer system can accept measured data that are manually j r. entered by a user (e.g. , by means ofthe user interface). More preferably, however, the programs cause the computer system to retrieve measured data from a database.
  • a database can be stored on a mass storage (e.g., a hard drive) or other computer readable medium and loaded into the memory ofthe computer, or the compendium can be accessed by the computer system by means ofthe network 207.
  • METHODS FOR DETERMINING BIOLOGICAL STATE AND BIOLOGICAL RESPONSE This invention provides methods for analysis of measurement errors in measured signal data, e.g., measured expression profiles, and methods for analyzing and processing of such measured signal data.
  • the measured data can be measurements of cellular constituents
  • the data can be measured from cell samples subject to different conditions, e.g., under different perturbations.
  • the cell sample can be of any organism, e.g., eukaryote, mammal, primate, human, non-human animal such as a dog, cat, horse, cow, mouse, rat, Drosophila, C. r elegans, etc., plant such as rice, wheat, bean, tobacco, etc., and fungi.
  • the cell sample can be from a diseased or healthy organism, or an organism predisposed to disease.
  • the cell sample can be of a particular tissue type or development stage or subjected to a particular perturbation (stimulus). This section and its subsections provides some exemplary methods for obtaining the measured data of cell samples.
  • One of skill in the art would appreciate
  • this invention is not limited to the following specific methods for measuring the expression profiles and responses of a biological system.
  • TRANSCRIPT ASSAYS USING MICROARRAYS This invention is particularly useful for the determination ofthe expression state or c the transcriptional state of a cell or cell type or any other cell sample by monitoring expression profiles.
  • One aspect ofthe invention provides polynucleotide probe arrays for simultaneous determination ofthe expression levels of a plurality of genes and methods for designing and making such polynucleotide probe arrays.
  • the expression level of a nucleotide sequence in a gene can be measured by any j r. high throughput techniques. However measured, the result is either the absolute or relative amounts of transcripts or response data, including but not limited to values representing abundance ratios.
  • measurement ofthe expression profile is made by hybridization to transcript arrays, which are described in this subsection j r
  • transcript arrays which are described in this subsection j r
  • the present invention makes use of "transcript arrays" or
  • Transcript arrays can be employed for analyzing the expression profile in a cell sample and especially for measuring the expression profile of a cell sample of a particular tissue type or developmental state or exposed to a drug of interest or to perturbations to a biological pathway of interest.
  • the cell sample o r. can be from a patient, e.g., a diseased cell sample, and preferably can be compared to a healthy cell sample.
  • an expression profile is obtained by hybridizing detectably labeled polynucleotides representing the nucleotide sequences in mRNA transcripts present in a cell (e.g., fluorescently labeled cDNA synthesized from total cell mRNA) to a oc microarray.
  • a microarray is an array of positionally-addressable binding (e.g., hybridization) sites on a support for representing many ofthe nucleotide sequences in the genome of a cell or organism, preferably most or almost all ofthe genes. Each of such binding sites consists of polynucleotide probes bound to the predetermined region on the
  • Microarrays can be made in a number of ways, of which several are described herein below. However produced, microarrays share certain characteristics. The arrays are reproducible, allowing multiple copies of a given array to be produced and easily compared with each other. Preferably, the microarrays are made from materials that are stable under binding (e.g., nucleic acid hybridization) conditions. The microarrays are preferably small,
  • 1 r. e.g., between about 1 cm 2 and 25 cm 2 , preferably about 1 to 3 cm 2 .
  • both larger and smaller arrays are also contemplated and may be preferable, e.g., for simultaneously evaluating a very large number of different probes.
  • a given binding site or unique set of binding sites in the microarray will specifically bind (e.g., hybridize) to a nucleotide sequence in a single gene from a cell or
  • c organism e.g. , to exon of a specific mRNA or a specific cDNA derived therefrom.
  • the microarrays used in the methods and compositions ofthe present invention include one or more test probes, each of which has a polynucleotide sequence that is complementary to a subsequence of RNA or DNA to be detected.
  • Each probe preferably has a different nucleic acid sequence, and the position of each probe on the solid surface of j r. the array is preferably known.
  • the microarrays are preferably addressable arrays, more preferably positionally addressable arrays. More specifically, each probe ofthe array is preferably located at a known, predetermined position on the solid support such that the identity (i.e., the sequence) of each probe can be determined from its position on the array (i.e., on the support or surface).
  • the arrays are j e ordered arrays.
  • the density of probes on a microarray or a set of microarrays is about 100 different (i.e., non-identical) probes per 1 cm 2 or higher. More preferably, a microarray used in the methods ofthe invention will have at least 550 probes per 1 cm 2 , at least 1,000 probes per 1 cm 2 , at least 1,500 probes per 1 cm 2 or at least 2,000 probes per 1 cm 2 . In a o r. particularly preferred embodiment, the microarray is a high density array, preferably having a density of at least about 2,500 different probes per 1 cm 2 .
  • microarrays used in the invention therefore preferably contain at least 2,500, at least 5,000, at least 10,000, at least 15,000, at least 20,000, at least 25,000, at least 50,000 or at least 55,000 different (i.e., non-identical) probes.
  • the microarray is an array (i.e., a matrix) in which each position represents a discrete binding site for a nucleotide sequence of a transcript encoded by a gene (e.g., for an exon of an mRNA or a cDNA derived therefrom).
  • the collection of binding c sites on a microarray contains sets of binding sites for a plurality of genes.
  • the microarrays ofthe invention can comprise binding sites for products encoded by fewer than 50% ofthe genes in the genome of an organism.
  • the microarrays ofthe invention can have binding sites for the products encoded by at least 50%, at least 75%, at least 85%, at least 90%, at least 95%, at least 99%
  • the microarrays ofthe invention can having binding sites for products encoded by fewer than 50%, by at least 50%, by at least 75%, by at least 85%, by at least 90%, by at least 95%, by at least 99% or by 100% ofthe genes expressed by a cell of an organism.
  • the binding site can be a DNA or DNA analog to which a particular RNA can specifically hybridize.
  • analog can be, e.g. , a synthetic oligomer or a gene fragment, e.g. corresponding to an exon.
  • a gene or an exon in a gene is represented in the profiling arrays by a set of binding sites comprising probes with different polynucleotides that are complementary to different coding sequence segments ofthe gene or an exon ofthe gene.
  • Such polynucleotides are preferably ofthe length of 15 to 200 j r. bases, more preferably ofthe length of 20 to 100 bases, most preferably 40-60 bases. It will be understood that each probe sequence may also comprise linker sequences in addition to the sequence that is complementary to its target sequence.
  • a linker sequence refers to a sequence between the sequence that is complementary to its target sequence and the surface of support.
  • the profiling arrays ofthe jC invention comprise one probe specific to each target gene or exon.
  • the profiling arrays may contain at least 2, 5, 10, 100, 1000 probes specific to some target genes or exons.
  • the array may contain probes tiled across the sequence ofthe longest mRNA isoform of a gene at single base steps.
  • a set of polynucleotide probes of successive overlapping sequences, i.e., tiled sequences, across the genomic region containing the longest variant of an exon can be included in the exon profiling arrays.
  • the set of polynucleotide probes can comprise successive overlapping sequences at steps of a predetermined base intervals, e.g. at steps of 1, 5, or 10 base intervals, span, or are tiled across, the mRNA containing the longest variant.
  • a set of probes therefore can be used to scan the genomic region containing all variants of an exon to determine the expressed variant or variants ofthe exon to determine the expressed variant or variants ofthe exon.
  • a set of polynucleotide probes comprising exon specific probes and/or variant junction probes can j- be included in the exon profiling array.
  • a variant junction probe refers to a probe specific to the junction region ofthe particular exon variant and the neighboring exon.
  • the probe set contains variant junction probes specifically hybridizable to each of all different splice junction sequences ofthe exon.
  • the probe set contains exon specific probes specifically hybridizable
  • an exon is represented in the exon profiling arrays by a probe comprising a polynucleotide that is complementary to the full length exon.
  • an exon is represented by a single binding site on the
  • an exon is represented by one or more binding sites on the profiling arrays, each ofthe binding sites comprising a probe w ⁇ th a polynucleotide sequence that is complementary to an RNA fragment that is a substantial portion ofthe target exon.
  • the lengths of such probes are normally between about 15-600 bases, preferably between about 20-200 bases, more preferably between about j r. 30-100 bases, and most preferably between about 40-80 bases.
  • the average length of an exon is about 200 bases (see, e.g., Lewin, Genes V, Oxford University Press, Oxford, 1994).
  • a probe of length of about 40-80 allows more specific binding ofthe exon than a probe of shorter length, thereby increasing the specificity ofthe probe to the target exon.
  • one or more targeted exons may have sequence lengths less than about 40-80 bases. j r
  • flanking sequences i.e., intron sequences
  • the flanking sequence used are from the adjacent constitutively spliced exon or exons that are not involved in any alternative pathways. More preferably the flanking sequences used do not comprise a significant portion ofthe sequence ofthe adjacent exon or exons so that cross-hybridization o r can be minimized.
  • probes comprising flanking sequences in different alternatively spliced mRNAs are designed so that expression level of the exon expressed in different alternatively spliced mRNAs can be measured.
  • the DNA array or set of arrays can also comprise probes that are complementary to sequences spanning the junction regions of two adjacent exons.
  • probes comprise sequences from the two exons which are not substantially overlapped with probes for each individual exons so that
  • Probes that comprise sequences from more than one exons are useful in distinguishing alternative splicing pathways and/or expression of duplicated exons in separate genes if the exons occurs in one or more alternative spliced mRNAs and/or one or more separated genes that contain the duplicated exons but not in other alternatively spliced mRNAs and/or other genes that contain the duplicated exons.
  • any ofthe probe schemes, supra can be combined on the same profiling array and/or on different arrays within the same set of j r. profiling arrays so that a more accurate determination ofthe expression profile for a plurality of genes can be accomplished.
  • the different probe schemes can also be used for different levels of accuracies in profiling. For example, a profiling array or array set comprising a small set of probes for each exon may be used to determine the relevant genes and/or RNA splicing pathways under certain j specific conditions. An array or array set comprising larger sets of probes for the exons that are of interest is then used to more accurately determine the exon expression profile under such specific conditions.
  • Other DNA array strategies that allow more advantageous use of different probe schemes are also encompassed.
  • the microarrays used in the invention have binding sites (i.e., probes) for or. sets of exons for one or more genes relevant to the action of a drug of interest or in a biological pathway of interest.
  • a "gene” is identified as a portion of DNA that is transcribed by RNA polymerase, which may include a 5' untranslated region ("UTR"), introns, exons and a 3' UTR.
  • UTR 5' untranslated region
  • the number of genes in a genome can be estimated from the number of mRNAs expressed by the cell or organism, or by extrapolation of a well o c characterized portion ofthe genome.
  • the number of ORFs can be determined and mRNA coding regions identified by analysis ofthe DNA sequence.
  • the genome of Saccharomyces cerevisiae has been completely sequenced and is reported to have approximately 6275 ORFs encoding r sequences longer the 99 amino acid residues in length. Analysis of these ORFs indicates that there are 5,885 ORFs that are likely to encode protein products (Goffeau et al, 1996, Science 274:546-567).
  • the human genome is estimated to contain approximately 30,000 to 130,000 genes (see Crollius et al., 2000, Nature Genetics 25:235- 238; Ewing et al., 2000, Nature Genetics 25:232-234). Genome sequences for other
  • an array set comprising in total probes for all known or predicted exons in the genome of an organism.
  • the present invention provides an array set comprising one or two
  • the level of hybridization to the site in the array corresponding to an exon of any particular gene will reflect the prevalence in the cell of mRNA or mRNAs containing the exon transcribed from j r. that gene.
  • cDNA complementary to the total cellular mRNA when detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is hybridized to a microarray, the site on the array corresponding to an exon of a gene (i.e., capable of specifically binding the product or products ofthe gene expressing) that is not transcribed or is removed during RNA splicing in the cell will have little or no signal (e.g., fluorescent signal), and an exon of a gene for j c which the encoded mRNA expressing the exon is prevalent will have a relatively strong signal.
  • the relative abundance of different mRNAs produced from the same gene by alternative splicing is then determined by the signal strength pattern across the whole set of exons monitored for the gene.
  • cDNAs from cell samples from two different conditions are preferred embodiments.
  • ⁇ design are hybridized to the binding sites ofthe microarray using a two-color protocol.
  • drug responses one cell sample is exposed to a drug and another cell sample ofthe same type is not exposed to the drug.
  • pathway responses one cell is exposed to a pathway perturbation and another cell ofthe same type is not exposed to the pathway perturbation.
  • the cDNA derived from each ofthe two cell types are differently labeled o r (e.g., with Cy3 and Cy5) so that they can be distinguished.
  • cDNA from a cell treated with a drug is synthesized using a fluorescein-labeled dNTP
  • cDNA from a second cell, not drug-exposed is synthesized using a rhodamine-labeled dNTP.
  • the cDNA from the drug-treated (or pathway perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from
  • the untreated cell will fluoresce red.
  • the drug treatment has no effect, either directly or indirectly, on the transcription and/or post-transcriptional splicing of a particular gene in a cell, the exon expression patterns will be indistinguishable in both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be equally prevalent.
  • the binding site(s) for that species of RNA When hybridized to the microarray, the binding site(s) for that species of RNA
  • c will emit wavelengths characteristic of both fluorophores.
  • the drug-exposed cell is treated with a drug that, directly or indirectly, change the transcription and/or post-transcriptional splicing of a particular gene in the cell, the exon expression pattern as represented by ratio of green to red fluorescence for each exon binding site will change.
  • the ratios for each exon j r. expressed in the mRNA will increase, whereas when the drug decreases the prevalence of an mRNA, the ratio for each exons expressed in the mRNA will decrease.
  • cDNA from a single cell, and compare, for example, the absolute amount of a particular exon in, e.g., a drug-treated or pathway-perturbed cell and an untreated cell.
  • labeling with more than two colors is also contemplated in the present invention. In some embodiments oc ofthe invention, at least 5, 10, 20, or 100 dyes of different colors can be used for labeling. Such labeling permits simultaneous hybridizing ofthe distinguishably labeled cDNA populations to the same array, and thus measuring, and optionally comparing the expression levels of, mRNA molecules derived from more than two samples.
  • Dyes that can be used c include, but are not limited to, fluorescein and its derivatives, rhodamine and its derivatives, texas red, 5'carboxy-fluorescein (“FMA”), 2',7'-dimethoxy-4',5'-dichloro-6-carboxy- fluorescein (“JOE”), N,N,N',N'-tetramethyl-6-carboxy-rhodamine (“TAMRA”), 6'carboxy- X-rhodamine (“ROX”), HEX, TET, IRD40, and IRD41, cyamine dyes, including but are not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes including but are not limited to BODIPY-
  • ALEXA dyes including but are not limited to ALEXA-488, ALEXA-532, ALEXA-546, ALEXA- 568, and ALEXA-594; as well as other fluorescent dyes which will be known to those who are skilled in the art.
  • hybridization data are measured at a plurality
  • hybridization levels are most preferably measured at hybridization times spanning the range from 0 to in excess of what is required for sampling ofthe bound polynucleotides (i.e., the probe or probes) by the labeled polynucleotides so that the mixture is close to or substantially reached equilibrium, and j r. duplexes are at concentrations dependent on affinity and abundance rather than diffusion.
  • the hybridization times are preferably short enough that irreversible binding interactions between the labeled polynucleotide and the probes and/or the surface do not occur, or are at least limited.
  • typical hybridization j c times may be approximately 0-72 hours.
  • Appropriate hybridization times for other embodiments will depend on the particular polynucleotide sequences and probes used, and may be determined by those skilled in the art (see, e.g., Sambrook et al, Eds., 1989, Molecular Cloning: A Laboratory Manual, 2nd ed., Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York). or. In one embodiment, hybridization levels at different hybridization times are measured separately on different, identical microarrays.
  • the microarray is washed briefly, preferably in room temperature in an aqueous solution of high to moderate salt concentration (e.g., 0.5 to 3 M salt concentration) under conditions which retain all bound oc or hybridized polynucleotides while removing all unbound polynucleotides.
  • high to moderate salt concentration e.g., 0.5 to 3 M salt concentration
  • the detectable label on the remaining, hybridized polynucleotide molecules on each probe is then measured by a method which is appropriate to the particular labeling method used.
  • the resulted hybridization levels are then combined to form a hybridization curve.
  • hybridization levels are measured in real time using a single microarray.
  • the microarray is allowed to hybridize to the sample without interruption and the microarray is interrogated at each hybridization time in a non-invasive manner.
  • At least two hybridization levels at two different hybridization times are measured, a first one at a hybridization time that is close to the time scale of cross- hybridization equilibrium and a second one measured at a hybridization time that is longer than the first one.
  • the time scale of cross-hybridization equilibrium depends, inter alia, on
  • the first hybridization level is measured at between 1 to 10 hours, whereas the second hybridization time is measured at about 2, 4, 6, 10, 12, 16, 18, 48 or 72 times as long as the first hybridization time.
  • the "probe" to which a particular polynucleotide molecule, such an exon, specifically hybridizes according to the invention is a complementary polynucleotide sequence.
  • the probes j c normally comprise nucleotide sequences greater than about 40 bases in length.
  • the probes when a large set of redundant probes is to be used for an exon, the probes normally comprise nucleotide sequences of about 40-60 bases.
  • the probes can also comprise sequences complementary to full length exons.
  • the lengths of exons can range from less than 50 bases to more than 200 bases. Therefore, when a probe length longer o r, than exon is to be used, it is preferable to augment the exon sequence with adjacent constitutively spliced exon sequences such that the probe sequence is complementary to the continuous mRNA fragment that contains the target exon. This will allow comparable hybridization stringency among the probes of an exon profiling array.
  • each probe sequence may also comprise linker sequences in addition to the sequence o c that is complementary to its target sequence.
  • the probes may comprise DNA or DNA "mimics" (e.g., derivatives and analogues) corresponding to a portion of each exon of each gene in an organism's genome.
  • the probes ofthe microarray are complementary RNA or RNA mimics.
  • DNA c mimics are polymers composed of subunits capable of specific, Watson-Crick-like hybridization with DNA, or of specific hybridization with RNA.
  • the nucleic acids can be modified at the base moiety, at the sugar moiety, or at the phosphate backbone.
  • Exemplary DNA mimics include, e.g., phosphorothioates.
  • DNA can be obtained, e.g., by polymerase chain reaction (PCR) amplification of exon segments from genomic DNA, cDNA (e.g., by PCR) amplification of exon segments from genomic DNA, cDNA (e.g
  • PCR primers are preferably chosen based on known sequence ofthe exons or cDNA that result in amplification of unique fragments (7.e., fragments that do not share more than 10 bases of contiguous identical sequence with any other fragment on the microarray).
  • Computer programs that are well known in the art are useful in the design of primers with the required specificity and optimal amplification
  • each probe on the microarray will be between 20 bases and 600 bases, and usually between 30 and 200 bases in length.
  • PCR methods are well known in the art, and are described, for example, in Innis et al, eds., 1990, PCR Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, CA. It will be apparent to one skilled in the art that controlled robotic j r. systems are useful for isolating and amplifying nucleic acids.
  • An alternative, preferred means for generating the polynucleotide probes ofthe microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using N- phosphonate or phosphoramidite chemistries (Froehler et al, 1986, Nucleic Acid Res. 74:5399-5407; McBride et al, 1983, Tetrahedron Lett. 24:246-248).
  • Synthetic sequences j c are typically between about 15 and about 600 bases in length, more typically between about 20 and about 100 bases, most preferably between about 40 and about 70 bases in length.
  • synthetic nucleic acids include non-natural bases, such as, but by no means limited to, inosine.
  • nucleic acid analogues may be used as binding sites for hybridization.
  • An example of a suitable nucleic acid analogue is peptide nucleic o r acid (see, e.g., Egholm et al, 1993, Nature 363:566-568; U.S. Patent No. 5,539,083).
  • the hybridization sites are made from plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts therefrom (Nguyen et al, 1995, Genomics 2°:207-209).
  • polynucleotide probes can be deposited on a support to form the array.
  • polynucleotide probes can be synthesized directly on the support to form the array.
  • the probes are attached to a solid support or surface, which may be made, e.g., from c glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or other porous or nonporous material.
  • a preferred method for attaching the nucleic acids to a surface is by printing on glass plates, as is described generally by Schena et al, 1995, Science 270:467-470. This method is especially useful for preparing microarrays of cDNA (See also, DeRisi et al, 1996, Nature
  • a second preferred method for making microarrays is by making high-density polynucleotide arrays. Techniques are known for producing arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface
  • microarrays e.g., by masking (Maskos and Southern, 1992, Nucl Acids. Res. 20:1679-1684), may also be used.
  • masking Moskos and Southern, 1992, Nucl Acids. Res. 20:1679-1684
  • j c any type of array for example, dot blots on a nylon hybridization membrane (see Sambrook et al. , supra) could be used.
  • very small arrays will frequently be preferred because hybridization volumes will be smaller.
  • microarrays ofthe invention are manufactured by means of an In jet printing device for oligonucleotide synthesis, e.g.,
  • the polynucleotide probes in such o c microarrays are preferably synthesized in arrays, e.g.
  • microdroplets of a high surface tension solvent such as propylene carbonate.
  • the microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and are separated from each other on the microarray (e.g., by c hydrophobic domains) to form circular surface tension wells which define the locations of the array elements (i.e., the different probes).
  • Polynucleotide probes are normally attached to the surface covalently at the 3' end ofthe polynucleotide. Alternatively, polynucleotide probes can be attached to the surface covalently at the 5' end ofthe polynucleotide (see for example, Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J.K.
  • Target polynucleotides which may be analyzed by the methods and compositions of the invention include RNA molecules such as, but by no means limited to messenger RNA
  • Target polynucleotides which may also be analyzed by the methods and compositions ofthe present invention include, but are not limited to DNA molecules such as genomic DNA molecules, cDNA molecules, and fragments thereof including j r. oligonucleotides, ESTs, STSs, etc.
  • the target polynucleotides may be from any source.
  • the target polynucleotide molecules may be naturally occurring nucleic acid molecules such as genomic or extragenomic DNA molecules isolated from an organism, or RNA molecules, such as mRNA molecules, isolated from an organism.
  • the polynucleotide j c molecules may be synthesized, including, e.g., nucleic acid molecules synthesized enzymatically in vivo or in vitro, such as cDNA molecules, or polynucleotide molecules synthesized by PCR, RNA molecules synthesized by in vitro transcription, etc.
  • the sample of target polynucleotides can comprise, e.g., molecules of DNA, RNA, or copolymers of DNA and RNA.
  • the target polynucleotides ofthe invention will o r. correspond to particular genes or to particular gene transcripts (e.g. , to particular mRNA sequences expressed in cells or to particular cDNA sequences derived from such mRNA sequences).
  • the target polynucleotides may correspond to particular fragments of a gene transcript.
  • the target polynucleotides of the invention may correspond to particular fragments of a gene transcript.
  • the target polynucleotides may correspond to particular fragments of a gene transcript.
  • the target polynucleotides may correspond to particular fragments of a gene transcript.
  • the target polynucleotides may correspond to particular fragments of a gene transcript.
  • the target polynucleotides may correspond to particular fragments of a gene transcript.
  • 35 polynucleotides may correspond to different exons ofthe same gene, e.g., so that different splice variants of that gene may be detected and/or analyzed.
  • the target polynucleotides to be analyzed are prepared in c vitro from nucleic acids extracted from cells.
  • RNA is extracted from cells (e.g., total cellular RNA, poly(A) + messenger RNA, fraction thereof) and messenger RNA is purified from the total extracted RNA.
  • Methods for preparing total and poly(A) + RNA are well known in the art, and are described generally, e.g., in Sambrook et al, supra.
  • RNA is extracted from cells ofthe various types of
  • RNA is extracted from cells using guanidinium thiocyanate lysis followed by purification on RNeasy columns (Qiagen). cDNA is then synthesized from the purified mRNA using, e.g., oligo-dT or random primers.
  • RNA is extracted from cells using guanidinium thiocyanate lysis followed by purification on RNeasy columns (Qiagen). cDNA is then synthesized from the purified mRNA using, e.g., oligo-dT or random primers.
  • the target polynucleotides are cRNA prepared from purified messenger RNA extracted from cells.
  • cRNA is defined here as RNA complementary to the source RNA.
  • the extracted RNAs are amplified using a process in which doubled-stranded cDNAs are synthesized from the RNAs using a primer linked to an RNA polymerase promoter in a direction capable of directing transcription of anti-sense RNA.
  • Anti-sense j r. RNAs or cRNAs are then transcribed from the second strand ofthe double-stranded cDNAs using an RNA polymerase (see, e.g., U.S. Patent Nos.
  • the target polynucleotides are short and/or fragmented polynucleotide molecules which are representative ofthe original nucleic acid population ofthe cell. o .
  • the target polynucleotides to be analyzed by the methods and compositions ofthe invention are preferably detectably labeled.
  • cDNA can be labeled directly, e.g., with nucleotide analogs, or indirectly, e.g., by making a second, labeled cDNA strand using the first strand as a template.
  • the double-stranded cDNA can be transcribed into cRNA and labeled.
  • the detectable label is a fluorescent label, e.g., by incorporation of nucleotide analogs.
  • Other labels suitable for use in the present invention include, but are not limited to, biotin, imminobiotin, antigens, cofactors, dinitrophenol, lipoic acid, olefmic
  • radioactive isotopes include 32 P, 35 S, 14 C, 15 N and 125 I.
  • Fluorescent molecules suitable for the present invention include, but are not limited to, fluorescein and its derivatives, rhodamine and its derivatives, texas red, 5'carboxy-fluorescein ("FMA"), 2',7'-
  • Fluroescent molecules that are suitable for the invention further include: cyamine dyes, including by not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes including but not limited to BODIPY-FL, BODIPY-TR, BODIPY-TMR, BODIPY-630/650, and BODIPY-650/670;
  • Electron rich indicator molecules suitable for the present invention include, but are not limited to, ferritin, hemocyanin, and colloidal gold.
  • the target polynucleotides may be labeled by j r. specifically complexing a first group to the polynucleotide.
  • a second group, covalently linked to an indicator molecules and which has an affinity for the first group, can be used to indirectly detect the target polynucleotide.
  • compounds suitable for use as a first group include, but are not limited to, biotin and iminobiotin.
  • Compounds suitable for use as a second group include, but are not limited to, avidin and streptavidin.
  • nucleic acid hybridization and wash conditions are chosen so that the polynucleotide molecules to be analyzed by the invention (referred to herein as the
  • target polynucleotide molecules specifically bind or specifically hybridize to the
  • Arrays containing double-stranded probe DNA situated thereon are preferably subjected to denaturing conditions to render the DNA single-stranded prior to contacting with the target polynucleotide molecules.
  • Arrays containing single-stranded probe DNA oc may need to be denatured prior to contacting with the target polynucleotide molecules, e.g., to remove hairpins or dimers which form due to self complementary sequences.
  • Optimal hybridization conditions will depend on the length (e.g., oligomer versus c polynucleotide greater than 200 bases) and type (e.g. , RNA, or DNA) of probe and target nucleic acids.
  • length e.g., oligomer versus c polynucleotide greater than 200 bases
  • type e.g. , RNA, or DNA
  • Specific hybridization conditions for nucleic acids are described in Sambrook et al, (supra), and in Ausubel et al, 1987, Current Protocols in Molecular Biology, Greene Publishing and Wiley-Interscience, New York.
  • typical hybridization conditions are described in Sambrook et al, (supra), and in Ausubel et al, 1987, Current Protocols in Molecular Biology, Greene Publishing and Wiley-Interscience, New York.
  • hybridization conditions for use with the screening and/or signaling chips ofthe present invention include hybridization at a temperature at or near the mean melting temperature ofthe probes (e.g., within 5 °C, more preferably within 2 °C) in j r I M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium Sarcosine and 30%> formamide.
  • RNA splicing specifically binding the product or products ofthe gene expressing) that is not transcribed or is removed during RNA splicing in the cell will have little or no signal (e.g., fluorescent signal), and an exon of a gene for which the encoded mRNA expressing the exon is prevalent will have a relatively strong signal.
  • the relative abundance of different mRNAs produced by from the same gene by alternative splicing is then determined by the signal
  • target sequences e.g., cDNAs or cRNAs
  • target sequences e.g., cDNAs or cRNAs
  • drug responses one cell sample is exposed to a drug and another cell sample ofthe same type is c not exposed to the drug.
  • pathway responses one cell is exposed to a pathway perturbation and another cell ofthe same type is not exposed to the pathway perturbation.
  • the cDNA or cRNA derived from each ofthe two cell types are differently labeled so that they can be distinguished.
  • cDNA from a cell treated with a drug (or exposed to a pathway perturbation) is synthesized using a fluorescein-labeled
  • cDNA from a second cell, not drug-exposed is synthesized using a rhodamine-labeled dNTP.
  • the relative intensity of signal from each cDNA set is determined for each site on the array, and any relative difference in abundance of a particular exon detected.
  • the exon expression pattern as represented by ratio of green to red fluorescence for each exon binding site will j c change.
  • the ratios for each exon expressed in the mRNA will increase, whereas when the drug decreases the prevalence of an mRNA, the ratio for each exons expressed in the mRNA will decrease.
  • target sequences e.g., cDNAs or cRNAs
  • cDNAs or cRNAs labeled with two different fluorophores
  • a direct and internally controlled comparison ofthe mRNA or o c exon expression levels corresponding to each arrayed gene in two cell states can be made, and variations due to minor differences in experimental conditions (e.g., hybridization conditions) will not affect subsequent analyses.
  • cDNA from a single cell, and compare, for example, the absolute c amount of a particular exon in, e.g. , a drug-treated or pathway-perturbed cell and an untreated cell.
  • single-channel detection methods e.g., using one- color fluorescence labeling
  • arrays comprising reverse-
  • RC probes are designed and produced. Because a reverse complement of a DNA sequence has sequence complexity that is equivalent to the corresponding forward- strand (FS) probe that is complementary to a target sequence with respect to a variety of measures (e.g., measures such as GC content and GC trend are invariant under the reverse complement), a RC probe is used to as a control probe for determination of level of non-
  • the significance ofthe FS probe intensity of a target sequence is determined by comparing the raw intensity measurement for the FS probe and the corresponding raw intensity measurement for the RC probe in conjunction with the respective measurement errors.
  • an exon is called present if the intensity difference between the FS probe and the
  • corresponding RC probe «« corresponding RC probe is significant. More preferably, an exon is called present if the FS probe intensity is also significantly above background level.
  • Single-channel detection methods can be used in conjunction with multi-color labeling. In one embodiment, a plurality of different samples, each labeled with a different color, is hybridized to an array. Differences between FS and RC probes for each color are used to determine the level of jc hybridization ofthe corresponding sample.
  • the fluorescence emissions at each site of a transcript array can be, preferably, detected by scanning confocal laser microscopy.
  • a separate scan, using the appropriate excitation line, is carried out for each ofthe two fluorophores used.
  • a laser can be used that allows o p, simultaneous specimen illumination at wavelengths specific to the two fluorophores and emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al. , 1996, Genome Res. 6: 639-645).
  • the arrays are scanned with a laser fluorescence scanner with a computer controlled X-Y stage and a microscope objective.
  • Sequential excitation ofthe two fluorophores is achieved with a multi-line, o c mixed gas laser, and the emitted light is split by wavelength and detected with two photomultiplier tubes.
  • fluorescence laser scanning devices are described, e.g., in Schena et al, 1996, Genome Res. 6:639-645.
  • the fiber-optic bundle described by Ferguson et al, 1996, Nature Biotech. 74:1681-1684 may be used to monitor mRNA c abundance levels at a large number of sites simultaneously.
  • Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., using a 12 bit analog to digital board.
  • the scanned image is despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image gridding program that creates a spreadsheet ofthe average hybridization at each wavelength
  • a ratio ofthe emission ofthe two fluorophores can be calculated. The ratio is independent ofthe absolute expression level ofthe cognate gene, but is useful for genes whose expression is significantly modulated by drug administration,
  • the relative abundance of an mRNA and/or an exon expressed in an mRNA in two cells or cell lines is scored as perturbed (i.e., the abundance is different in the two sources of mRNA tested) or as not perturbed (i.e., the relative abundance is the same).
  • a difference between the two sources of j RNA of at least a factor of about 25% i. e. , RNA is 25% more abundant in one source than in the other source
  • more usually about 50%, even more often by a factor of about 2 (i.e., twice as abundant), 3 (three times as abundant), or 5 (five times as abundant) is scored as a perturbation.
  • the transcriptional state of a cell may be measured by other gene expression technologies known in the art.
  • Several such technologies produce pools of restriction fragments of limited complexity for electrophoretic analysis, such as methods combining o c double restriction enzyme digestion with phasing primers (see, e.g. , European Patent O 534858 Al, filed September 24, 1992, by Zabeau et al), or methods selecting restriction fragments with sites closest to a defined mRNA end (see, e.g., Prashar et al, 1996, Proc. Natl. Acad. Sci. USA 93:659-663).
  • cDNA pools statistically sample cDNA pools, such as by sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to identify each cDNA, or by sequencing short tags (e.g., 9-10 bases) that are generated at known positions relative to a defined mRNA end (see, e.g., Velculescu, 1995, Science 270:484- 487).
  • sequencing sufficient bases e.g., 20-50 bases
  • sequencing short tags e.g., 9-10 bases
  • aspects ofthe biological state other than the transcriptional state can be measured to produce the measured data to be analyzed according to the invention.
  • gene expression data may include translational state , c measurements or even protein expression measurements.
  • protein expression interaction maps based on protein expression maps are used. Details of embodiments in which aspects ofthe biological state other than the transcriptional state are described in this section.
  • Measurement ofthe translational state may be performed according to several methods. For example, whole genome monitoring of protein (z ' .e., the "proteome,” Goffeau et al, 1996, Science 274:546-567; Aebersold et al, 1999, Nature Biotechnology 10:994- j c 999) can be carried out by constructing a microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome (see, e.g., Zhu et al, 2001, Science 293:2101-2105; MacBeath et al, 2000, Science 289:1760-63; de Wildt et al., 2000, Nature Biotechnology 18:989-994).
  • binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome (see, e.g., Zhu et al, 2001, Science 293:2101-2105;
  • monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequence ofthe cell.
  • proteins from the cell are contacted to the array and their binding is assayed with assays known in the art.
  • proteins can be separated and measured by two-dimensional gel c electrophoresis systems.
  • Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al., 1996, Proc. Natl. Acad. Sci. USA 93:1440-1445; Sagliocco et al, 1996, Yeast 12:1519-
  • the resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, Western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N- terminal micro-sequencing. Using these techniques, it is possible to identify a substantial
  • ⁇ fraction of all the proteins produced under given physiological conditions including in cells (e.g., in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.
  • association involves association in multimeric units, for example association of an activated DNA binding complex with DNA
  • amount of associated protein or secondary consequences ofthe association such as amounts of mRNA transcribed
  • performance ofthe function can be observed.
  • the changes in protein activities form the response data analyzed by the foregoing methods of this invention.
  • response data may be formed of mixed o c aspects ofthe biological state of a cell.
  • Response data can be constructed from, e.g., changes in certain mRNA abundances, changes in certain protein abundances, and changes in certain protein activities.
  • Drug responses are obtained for use in the instant invention by measuring the gene expression state changed by drug exposure.
  • the biological response described on the exon level can also be measured by exon profiling methods.
  • the measured response data include values representing gene expression level values or gene expression level ratios for a
  • cell can be exposed to graded levels ofthe drug or drug candidate of interest.
  • the compound When the cells are grown in vitro, the compound is usually added to their nutrient medium.
  • the drag is added in a graded amount that depends on the particular characteristics ofthe drug, but usually will be between about 1 ng/ml and 100
  • a drug will be solubilized in a solvent such as DMSO.
  • exon expression profiles of cells exposed to the drug and of cells not exposed to the drug are measured according to the methods described in the previous section.
  • gene transcript arrays are used to find the genes with altered gene expression profiles due to exposure to the drug.
  • j p It is preferable for measurements of drug responses, in the case of two-colored differential hybridization described above, to measure with reversed labeling. Also, it is preferable that the levels of drug exposure used provide sufficient resolution of rapidly changing regions ofthe drug response, e.g., by using approximately ten levels of drug exposure.
  • One aspect ofthe invention provides methods for the analysis of biological state.
  • the methods of this invention are also useful for the analysis of responses of a cell sample to perturbations designed to probe cellular state.
  • Preferred perturbations are those that cause o p . a change in the amount of alternative splicing that occurs in one or more RNA transcripts.
  • Methods for targeted perturbation of cells are widely known and applied in the art.
  • such methods include use of titratable expression systems, use of transfection or viral transduction systems, direct modifications to RNA abundances or activities, direct modifications of protein abundances, direct modification of protein activities including use of drugs (or chemical moieties in general), and post-transcriptional gene silencing (PTGS)
  • RNAi RNA interference
  • Tet In mammalian cells, several means of titrating expression of genes are available (Spencer, 1996, Trends Genet. 12:181-187).
  • Tet system is widely used, both in its original form, the "forward" system, in which addition of doxycycline represses transcription, and in the newer “reverse” system, in which doxycycline addition stimulates
  • the other hybrid protein contains a transcriptional activation domain also fused to FKBP12.
  • the CID inducing molecule is FK1012, a homodimeric version of FK506 that is able to bind simultaneously both the DNA binding and transcriptional j , activating hybrid proteins .
  • graded transcription of the controlled gene is activated.
  • Transfection or viral transduction of target genes can introduce controllable perturbations in biological gene expression states in mammalian cells.
  • transfection or transduction of a target gene can be used with cells that do not naturally o p , express the target gene of interest.
  • Such non-expressing cells can be derived from a tissue not normally expressing the target gene or the target gene can be specifically mutated in the cell.
  • the target gene of interest can be cloned into one of many mammalian expression plasmids, for example, the pcDNA3.1 +/- system (Invitrogen, Inc.) or retroviral vectors, and introduced into the non-expressing host cells.
  • Transfected or transduced cells expressing o c the target gene may be isolated by selection for a drug resistance marker encoded by the expression vector.
  • the level of gene transcription is monotonically related to the transfection dosage. In this way, the effects of varying levels ofthe target gene may be investigated.
  • Other methods of modifying RNA abundances and activities and thus gene c abundances include ribozymes, antisense species, and RNA aptamers (Good et al., 1997, Gene Therapy 4: 45-54). Controllable application or exposure of a cell to these entities permits controllable perturbation of RNA abundances.
  • Ribozymes are RNAs which are capable of catalyzing RNA cleavage reactions. (Cech, 1987, Science 236:1532-1539; PCT International Publication WO 90/11364,
  • RNA ribozymes can be designed to specifically cleave a particular target mRNA. Rules have been established for the design of short RNA molecules with ribozyme activity, which are capable of cleaving other RNA molecules in a highly sequence specific way and can be targeted to virtually all kinds of RNA. (Haseloff et al., 1988, Nature
  • Ribozyme methods involve exposing a cell to, inducing expression in a cell, etc. of such small RNA ribozyme molecules. (Grassi and Marini, 1996, Annals of Medicine 28: 499-510; Gibson, 1996, Cancer and Metastasis Reviews 15: 287-299).
  • activity of a target RNA (preferable mRNA) species in another embodiment, activity of a target RNA (preferable mRNA) species,
  • an "antisense” nucleic acid as used herein refers to a nucleic acid capable of hybridizing to a sequence-specific (e.g., non-poly A) portion ofthe target RNA, for example its translation initiation region, by virtue of some sequence complementarity to a coding and/or non-coding region.
  • the antisense nucleic acids ofthe j c invention can be oligonucleotides that are double-stranded or single-stranded, RNA or DNA or a modification or derivative thereof, which can be directly administered in a controllable manner to a cell or which can be produced intracellularly by transcription of exogenous, introduced sequences in controllable quantities sufficient to perturb translation ofthe target RNA. o p.
  • RNA aptamers can be introduced into or expressed in a cell.
  • RNA aptamers are specific RNA ligands for proteins, such as for Tat and Rev RNA (Good et al, 1997, Gene Therapy 4: 45-54) that can specifically inhibit their translation.
  • RNA interference can also be used to modify RNA abundances (Guo et al., 1995, Cell 81:611-620; Fire et al, 1998, o c Nature 391 :806-811).
  • dsRNAs are injected into cells to specifically block expression of its homologous gene.
  • both the sense strand and the anti-sense strand can inactivate the corresponding gene. It is suggested that the dsRNAs are cut by nuclease into 21-23 nucleotide fragments.
  • one or more dsRNAs having sequences homologous to the sequences of one or more mRNAs whose abundances are to be modified are transfected into a cell or tissue sample. Any standard method for introducing nucleic acids into cells can be used.
  • Methods of modifying protein abundances include, inter alia, those altering protein
  • one such method employs a heat-inducible or drug-inducible N-terminal degron, which is an N-terminal protein fragment that exposes a degradation signal promoting rapid protein degradation at a higher temperature (e.g., 37° C) and which is hidden to prevent rapid degradation at a lower temperature (e.g., 23 ° C) (Dohmen et. al, 1994, Science 263:1273-1276).
  • a degron is Arg-DHFR ts , j c a variant of murine dihydrofolate reductase in which the N-terminal Val is replaced by Arg and the Pro at position 66 is replaced with Leu.
  • a gene for a target protein, P is replaced by standard gene targeting methods known in the art (Lodish et al, 1995, Molecular Biology ofthe Cell, W.H. Freeman and Co., New York, especially chap 8) with a gene coding for the fusion protein Ub-Arg-DHFR ts -P ("Ub” stands o p . for ubiquitin).
  • Ub stands o p . for ubiquitin
  • the N-terminal ubiquitin is rapidly cleaved after translation exposing the N- terminal degron. At lower temperatures, lysines internal to Arg-DHFR ts are not exposed, ubiquitination ofthe fusion protein does not occur, degradation is slow, and active target protein levels are high.
  • the false positive rate is the number of false positives divided by the total number of gene or probes in the microarray. The lower the false positive rate, j c the better the method.
  • sample A contains measurements obtained using cell-A549, human lung cancer cell (Sample B).
  • Sample A o p . and Sample B are very different, and therefore, differential expressions between the two different samples are expected.
  • the detection sensitivity is evaluated based on the rate of detection, which is the number of genes or probes detected as differentially expressed divided by the total number of genes or probes in the microarray. The higher the detection rate, the more sensitive the method.
  • o c Table 5 summarizes the test results. For the given threshold p-value ⁇ 0.01, the improved ANOVA method provides both much lower false positive rate and much higher detection sensitivity. This indicates that the improved ANOVA method ofthe invention with two inputs has higher statistical power than the traditional ANOVA.

Abstract

The present invention provides improved ANOVA methods for analyzing measured data and transformed data. The improved ANOVA method takes two data types as its input, one is the measurements, the other is a predetermined error associated with the measurements. The latter can come from a technology/platform-specific error model. Because of the additional input information, the statistical power is increased. The methods of the invention is particularly useful for analyzing gene or protein expression data.

Description

IMPROVED ANOVA METHOD FOR DATA ANALYSIS
1. FIELD OF THE INVENTION
The present invention relates to improved Analysis of Variance (ANOVA) methods for analyzing measurement data, e.g., gene/protein expression data.
2, BACKGROUND OF THE INVENTION DNA array technologies have made it possible to monitor the expression level of a large number of genetic transcripts at any one time (see, e.g., Schena et al, 1995, Science 270:467-470; Loc hart et al, 1996, Nature Biotechnology 74:1675-1680; Blanchard et al, 1996, Nature Biotechnology 14:1649; Ashby et al., U.S. Patent No. 5,569,588, issued October 29, 1996). Ofthe two main formats of DNA arrays, spotted cDNA arrays are prepared by depositing PCR products of cDNA fragments with sizes ranging from about 0.6 to 2.4kb, from full length cDNAs, ESTs, etc., onto a suitable surface (see, e.g., DeRisi et al, 1996, Nature Genetics 14:457-460; Shalon et al, 1996, Genome Res. 5:689-645; Schena et al, 1995, Proc. Natl Acad. Sci. U.S.A. 53:10539-11286; and Duggan et al., Nature Genetics Supplement 27:10-14). Alternatively, high-density oligonucleotide arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface are synthesized in situ on the surface by, for example, photolithographic techniques (see, e.g., Fodor et al, 1991, Science 251:767-773; Pease et al, 1994, Proc. Natl. Acad. Sci. U.S.A. 97:5022-5026; Lockhart et al , 1996, Nature Biotechnology 14: 1675; McGall et al, 1996, Proc. Natl. Acad. Sci. U.S.A. 93:13555-13560; U.S. Patent Nos. 5,578,832; 5,556,752; 5,510,270; and 6,040,138). Methods for generating arrays using inkjet technology for in situ oligonucleotide synthesis are also known in the art
(see, e.g., Blanchard, International Patent Publication WO 98/41531, published September
24, 1998; Blanchard et l, 1996, Biosensors and Bioelectronics 11:687-690; Blanchard,
1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 111-123). Efforts to further increase the information capacity of DNA arrays range from further reducing feature size on DNA arrays so as to further increase the number of probes in a given surface area to sensitivity- and specificity-based probe design and selection aimed at reducing the number of redundant probes needed for the detection of each target nucleic acid thereby increasing the number of target nucleic acids monitored without increasing probe density (see, e.g., Friend et al., International Publication No. WO 01/05935, published January 25, 2001).
By simultaneously monitoring tens of thousands of genes, DNA array technologies have allowed, inter alia, genome-wide analysis of mRNA expression in a cell or a cell type or any biological sample. Aided by sophisticated data management and analysis methodologies, the transcriptional state of a cell or cell type as well as changes ofthe transcriptional state in response to external perturbations, including but not limited to drug perturbations, can be characterized on the mRNA level (see, e.g., Stoughton et al., International Publication No. WO 00/39336, published July 6, 2000; Friend et al.,
International Publication No. WO 00/24936, published May 4, 2000). Applications of such technologies include, for example, identification of genes which are up regulated or down regulated in various physiological states, particularly diseased states. Additional exemplary uses for DNA arrays include the analyses of members of signaling pathways, and the identification of targets for various drugs. See, e.g., Friend and Hartwell, International Publication No. WO 98/38329 (published September 3, 1998); Stoughton, International Publication No. WO 99/66067 (published December 23, 1999); Stoughton and Friend, International Publication No. WO 99/58708 (published November 18, 1999); Friend and Stoughton, International Publication No. WO 99/59037 (published November 18, 1999); Friend et al., U.S. Patent No. 6,218,122.
Protein microarrays are used to monitor the genome-wide protein expression in cells (i.e., the "proteome," Goffeau et al, 1996, Science 274:546-567; Aebersold et al., 1999, Nature Biotechnology 10:994-999). Protein microarrays contain binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome (see, e.g., Zhu et al., 2001, Science 293:2101-2105; MacBeath et al, 2000, Science 289:1760-63; de Wildt et al, 2000, Nature Biotechnology 18:989-994). Protein expression in a cell can also be separated and measured by two-dimensional gel electrophoresis techniques. Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al, 1996, Proc. Natl. Acad. Sci. USA 93:1440-1445; Sagliocco et al, 1996, Yeast 12:1519- 1533; Lander, 1996, Science 274:536-539; and Beaumont et al., Life Science News 7, 2001, Amersham Pharmacia Biotech. The resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, Western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N- terminal micro-sequencing. Using these techniques, it is possible to identify a substantial fraction of all the proteins produced under given physiological conditions, including in cells (e-g- , in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.
Analysis of Variance (ANOVA) method (see, e.g., Statistics For Experimenters, by Box, Hunter and Hunter, John Wiley & Sons, 1978) are used in gene or protein expression data analysis to determine differential expressions under different treatment conditions. In a one-way ANOVA, there is one experimental factor under investigation. For example, the factor may be the effects of several different compounds vs. the vehicle, which is the baseline reference. For example, when effects of a set of drugs is under investigation, the number of compounds is the number of levels ofthe factor. The goal is to find out from measured data whether a gene or protein is affected by the compounds. If the expression level ofthe gene or protein is increased or decreased after the treatments, the gene or protein is said differentially expressed. In a two-way ANOVA, there are two factors under investigation, for example, the drug effect and the dosage effect. Each factor may have multiple levels. Interaction between the two factors is also included in the ANOVA analysis. ANOVA is often used for determining whether there are statistical differences among the means of measurements in different measurement groups. As an example, the different measurement groups may contain measurements of expression levels of a gene or protein under different drug treatments. In each group, there may be several replicate measurements under the same treatment. First, one finds the within-group variance and the between-group variance. The within-group variance is the measurement variance of measurements within a treatment group. The between-group variance is the measurement variance ofthe means of different treatment groups. The within-group variance reflects the measurement error ofthe measurement technology, and the between-group variance includes both the measurement error ofthe measurement technology and changes caused by different treatments. Then the between-group variance is compared to the within-group variance. If the between-group variance is significantly larger than the within-group variance, it may be concluded that the different treatments have produced statistically significant changes in gene expression levels. In ANOVA analysis, the underlying null- hypothesis is that all treatment groups have the same mean. With the estimated mean squares and degrees of freedom, a p-value of F-statistics can be calculated. The p-value is the probability that the null-hypothesis may be accepted. When the p-value is lower than a given threshold, for example p-value<0.01, the null-hypothesis can be rejected and the alternative hypothesis, which means that some ofthe expression levels have different means, can be accepted. In other words, some treatments have produced changes in the expression level ofthe gene.
In a traditional ANOVA method, only measurement quantities, for example, the gene expression intensity or ratio, are used to determine the mean squares and degrees of freedom. The traditional ANOVA relies on a large number of measurement replicates to get a reliable estimation ofthe within-group variance. However, in gene or protein expression studies, limited by the small quantity of samples and the high cost of carrying out the measurements (such as DNA microarray measurements), the number of replicates is often small. The degrees of freedom is often small. By random chances, due to the small number of replicates, the estimated within-group variance of expression levels of some of genes or proteins can be very small, often much smaller than the actual measurement error inherent in the measurement technology. As a result, the between-group variance can be much larger than the underestimated within-group variance, leading to small p-value, which in turn incorrectly indicates statistically significant difference in the expression levels ofthe gene or protein. Such incorrectly identification of a gene or protein is called a "false positive." High false positive rate is a severe problem in gene expression analysis when the traditional ANOVA method is used. Large number of false positives often requires follow-up validation using other expression profiling technologies. Low degrees-of-freedom also reduces the detection sensitivity. As a result, small changes in differential expression may not be detected. Such missed or detected differentially expressed genes or proteins are called "false negatives."
Measurement errors in microarray experiments are often described by error models (see, e.g., Supplementary material to Roberts et al, 2000, Science, 287:873-880; and Rocke et al., 2001, J. Computational Biology 8:557-569). Measurement errors can also be described as a sum of a common error and a scatter error (see, e.g., Stoughton et al., U.S. Patent No. 6,351,712). An error-weighted average is used in combining ratio profiles (see, e.g., Stoughton et al, U.S. Patent No. 6,351,712).
Various ANOVA models have been described for analyzing microarray data (see, e.g., Kerr et al., 2000, J. Computational Biol. 7:819; Kerr et al., 2001, Genetical Research 77:123; Wolfmger et al, 2001, J. Computational Biol. 8:625; Pritchard et al., 2001, Proc. Natl. Acad. Sci. USA 98:13266; Lonnstedt et al. 2002, Statistica Sinica 12:31; and Wu et al., "MAANOVA: A software package for the Analysis of Spotted cDNA Microarray Experiments, published at www.jax.org/churchill/pubs/wu_maanova.pdf, accessed on October 25, 2002). These methods are normally applied to transformed microarray data, e.g., logarithmic transformed data, to decompose microarray data into different terms according to sources of variations, e.g., variations due to arrays, dyes, genes, and interactions thereof. Measured expression level changes due to one or more of such sources, e.g., expression level changes as a result of real changes in gene expression in the cells, can then be determined.
It is therefore desirable to have methods that are more accurate in determining differences in measured data among different perturbation groups. It is desirable to have methods for analyzing gene or protein expression with improved false-positive and/or false- negative rate. Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.
3. SUMMARY OF THE INVENTION The present invention provides methods for analyzing measurement data. In the metiiods ofthe invention, in addition to the measurement quantities themselves, e.g., measured expression levels of a gene, a second type of inputs, e.g., estimated variances of the measurements, are also used.
In a first aspect ofthe invention, the invention provides methods of analyzing a plurality of measurements of a variable {y„} in k different measurement groups by ANOVA analysis, where yH is the ?th measurement in the tth measurement group, t - 1, 2, ..., k and /* = 1, 2, ..., n„ nt being the number of measurements in measurement group t. Each ofthe k different measurement groups consists of one or more measurements ofthe variable under a condition common to the measurement group, and each ofthe measurement^ has a predetermined measurement variance, e.g., σ,. . In the methods ofthe invention, k and nt can be any appropriate integer. In preferred embodiments ofthe invention, the variable is the transcript level of a gene or the abundance of a protein. Preferably, the transcript level is measured using a DNA microarray. Preferably, the abundance is measured using a protein microarray or two-dimensional gel electrophoresis.
In a preferred embodiment, the invention provides a method for analyzing the plurality of measurements ofthe variable ytl} comprising (a) determining a within-group variance for the k different measurement groups, wherein the within-group variance consists of a propagated variance and a scattered variance, the propagated variance being determined o based on predetermined measurement variances of the plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with respect to a mean of measurements in a respective measurement group; (b) determining a between-group variance for the k different measurement groups, wherein the between-group variance is a variance of a plurality of group means, one for each ofthe k 5 different measurement groups, with respect to a mean of the plurality of measurements, wherein each the group mean is the mean ofthe one or more measurements in a measurement group; and (c) comparing the within-group variance with the between-group variance, thereby analyzing the plurality of measurements.
In another preferred embodiment, the invention provides a method for analyzing the 0 plurality of measurements ofthe variable {ytl} comprising (a) weighting each measurement ytl with a weighting factor to obtain an error- weighted measurement, wherein the weighting factor is determined based on the predetermined measurement variance ofthe measurement; (b) determining a within-group variance for the k different measurement groups, wherein the within-group variance consists of a propagated variance and a scattered variance, the 5 propagated variance being determined based on predetermined measurement variances of the plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with respect to a mean of error- weighted measurements in a respective measurement group; (c) determining a between-group variance for the k different measurement groups, wherein the between-group variance is a o variance of a plurality of group means, one for each of the k different measurement groups, with respect to a mean ofthe plurality of error- weighted measurements, wherein each group mean is the mean ofthe one or more measurements in a measurement group; and (d) comparing the within-group variance with the between-group variance, thereby analyzing the plurality of measurements. 5 In one embodiment ofthe methods ofthe invention, the comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein the statistical metric is determined by a method comprising (i) determining a within-group degree of freedom; (ii) determining a between-group degree of freedom; (iii) determining a within-group mean square; (iv) determining a between-group mean square; and (v) calculating the statistical metric.
In still another preferred embodiment, the invention provides an improved ANOVA method for analyzing the plurality of measurements ofthe variable {yti} in which the o improvement comprises determining a within-group variance consisting of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances ofthe one or more measurements, and the scattered variance being determined based on deviation of each ofthe one or more measurements with respect to a mean of measurements in a respective measurement group. 5 In the methods ofthe invention, the within-group variance can be determined based on group variances ofthe k different measurement groups, each group variance consisting of a measurement group propagated variance and a measurement group scattered variance. Each measurement group propagated variance can be detennined based on the predetermined measurement variances ofthe one or more measurements in the 0 measurement group. The measurement group scattered variance can be a variance ofthe one or more measurements with respect to the mean ofthe one or more measurements in the measurement group.
Preferably, the propagated variance for each measurement group is determined according to the equation 5
Σ σv y,p
0 where σ ^ is the propagated variance. Preferably, the mean ofthe one or more
measurements is determined according to the equation
5
Figure imgf000009_0001
where j/, is the mean ofthe one or more measurements, and the scattered variance is determined according to the equation
Figure imgf000009_0002
where σ ff „ is the scattered variance.
In a preferred embodiment, the group propagated variance and the group scattered variance for each group are combined according to the equation
Figure imgf000009_0003
In one embodiment, the method ofthe invention further comprises a step of determining the predetermined measurement variance for each measurement. In a preferred embodiment, the predetermined measurement variance of each measurement yti is determined according to an error model, e.g., a three-term error model according to equation
σ« = 2 + b2 • yti + a2 • yti 2
2 wherein σ ~ t is the predetermined measurement variance of measurement v„, a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
In embodiments ofthe invention which involve error- weighting the measurements or involve using error-weighted measurements, it is preferable that each error-weighted measurement is generated using a weighting factor which is determined based on the predetermined measurement variance ofthe measurement. In a preferred embodiment, the weighting factor for measurement yti is determined according to the equation
"UA,. = σ
In such embodiments, it is preferable that the propagated variance for each measurement . „ group is determined according to the equation
Figure imgf000010_0001
15 where σ ^ is the propagated variance. In another preferred embodiment, the mean ofthe
one or more error- weighted measurements is determined according to the equation
20 Σ w„- yti i=l yt l wti ι=l
25
where yt is the mean ofthe one or more error- weighted measurements, and wherein the scattered variance is determined according to the equation
Figure imgf000010_0002
2
- c where σ π is the scattered variance. In such embodiment which involve en-or-weighting the measurements or involve using error-weighted measurements, it is also preferable that the methods further comprise using an effective number of samples instead ofthe actual number of samples for each measurement group. In one embodiment, the effective numbers of samples are calculated according to equations
Figure imgf000011_0001
and
eN= ent t=l
In a specific embodiment ofthe invention, the within-group degree of freedom is calculated according to the equation
Figure imgf000011_0002
where vR is the within-group degree of freedom the between-group degree of freedom is calculated according to the equation
vτ =k—l
where vr is the between-group degree of freedom, the within-group mean square is calculated according to the equation
SR " ^R ' VR where sR is the within-group mean square and where SR is calculated according to the equation
Figure imgf000012_0001
where vRt is calculated according to the equation
Rt = n . - 1 + n
and the between-group mean square is calculated according to the equation
: γ I V ,
where sτ is the between-group mean square and where Sr is calculated according to the equation
Figure imgf000012_0002
where y is calculated according to the equation
Figure imgf000012_0003
where N is the total number of measurements. In another specific embodiment ofthe invention, the within-group degree of freedom is calculated according to the equation
Figure imgf000013_0001
where vR is the within-group degree of freedom, ent is calculated according to the equation
Figure imgf000013_0002
and eN is calculated according to the equation
Figure imgf000013_0003
the between-group degree of freedom is calculated according to the equation
vτ =k—l
where vτ is the between-group degree of freedom, the within-group mean square is calculated according to the equation
Figure imgf000013_0004
where sR is the within-group mean square and where SR is calculated according to the equation
Figure imgf000014_0001
where vRt is calculated according to the equation
Rt n 4 - 1 + n
and the between-group mean square is calculated according to the equation
Figure imgf000014_0002
where sτ is the between-group mean square and where Sr is calculated according to the equation
Figure imgf000014_0003
where y is calculated according to the equation
Figure imgf000014_0004
where N is the total number of measurements. Preferably, in the methods ofthe invention statistical test is an F-test and the significance level is a p-value determined according to the equation
pvalue= l-fcd sτ 2 /$ ,vT,vR In another aspect ofthe invention, the invention provides methods of analyzing a plurality of measurements of a variable {ytlJ} in k * n different measurement groups, where ytlJ is thejth measurement in the tith measurement group, t = 1, 2, ..., n, i = 1, 2, ..., k, j = 1, ..., mti, mtl being the number of measurements in measurement group ti. Each ofthe k * n different measurement groups consists of measurements ofthe variable under common conditions t and i of condition groups N having n different conditions and T having k different conditions, respectively, and each measurement yt has a predetermined measurement variance, e.g., <?((7 . In the methods ofthe invention, k, n and mti can be any appropriate integer.
In one preferred embodiment, the invention provides a method comprising (a) determining a within-group variance for the k * n different measurement groups, wherein the within-group variance consists of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances of the plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with respect to a mean of measurements in a respective measurement group; (b) determining (bl) a between-group variance for condition group K, wherein the between-group variance is a variance of condition group means for the condition group K, one for each condition in the condition group K, with respect to a mean ofthe plurality of measurements, wherein each condition group mean for the condition group Kis the mean of measurements in all measurement groups under a respective condition in the condition group K, or (b2) an interaction variance, wherein the interaction variance is a variance of group interaction means, one for each ofthe k * n different measurement groups, with respect to a mean ofthe plurality of measurements and the condition group mean for the condition group K and the condition group mean for the condition group N, wherein each group interaction mean is the mean of measurements in the measurement groups over the number of measurements in the measurement group, and wherein each condition group mean for the condition group K is the mean of measurements in all measurement groups under a respective condition in the condition group A' and each condition group mean for the condition group N is the mean of measurements in all measurement groups under a respective condition in the condition group N; and (c) comparing the within-group variance with the between-group variance, thereby analyzing the plurality of measurements. In a preferred embodiment, the method further comprises prior to step (a) a step of weighting each measurement ytlJ with a weighting factor to obtain an error- weighted measurement, where the weighting factor is determined based on the predetermined measurement variance ofthe measurement.
5 In a preferred embodiment, the comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein the statistical metric is determined by a method comprising (i) determining a within-group degree of freedom; (ii) determining a between-group degree of freedom, if the between-group variance is determined in step (b), or an interaction degree of freedom, if the interaction variance is
20 determined in step (b); (iii) determining a within-group mean square; (iv) determining a between-group mean square, if the between-group variance is determined in step (b), or an interaction mean square, if the interaction variance is determined in step (b); and (v) calculating the statistical metric.
In another preferred embodiment, the invention provides an improved two-way
25 ANOVA method for analyzing the plurality of measurements {yt } in which the improvement comprises determining a within-group variance consisting of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances ofthe plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of measurements with
20 respect to a mean of measurements in a respective measurement group. Preferably, each of the one or more measurements is an error- weighted measurement generated using a weighting factor determined based on the predetermined measurement variance ofthe measurement.
In embodiments ofthe invention which involve error- weighting the measurements
25 or involve analyzing error- weighted measurements, it is preferable that each ofthe one or more measurements is an error- weighted measurement generated using a weighting factor which is determined based on the predetermined measurement variance ofthe measurement. In a preferred embodiment, the weighting factor for measurement ytlJ is determined according to the equation
30
Wtij = σ tij
2
-<- where o~ ήϊ is the predetermined measurement variance of measurement yhJ. In the methods ofthe invention, the within-group variance can be determined based on group variances ofthe n * k different measurement groups, each group variance consisting of a measurement group propagated variance and a measurement group scattered variance. Each measurement group propagated variance can be determined based on the predetermined measurement variances ofthe one or more measurements in the measurement group. The measurement group scattered variance can be a variance ofthe one or more measurements with respect to the mean ofthe one or more measurements in the measurement group. Preferably, the propagated variance for each measurement group is determined according to the equation
tij
7=1 yaP m.
9 where σ - is the propagated variance. Preferably, the mean ofthe one or more
measurements in a group is determined according to the equation
Figure imgf000017_0001
where y(i is the mean ofthe one or more measurements, and the scattered variance is determined according to the equation
Figure imgf000017_0002
where σ~tιs . is the scattered variance.
In a preferred embodiment, the group propagated variance and the group scattered variance for each group are combined according to the equation
σ p + (mti -l)- σ yγ y ti mti
In one embodiment, the method ofthe invention further comprises a step of determining the predetermined measurement variance for each measurement. In a preferred embodiment, the predetermined measurement variance of each measurement^ is determined according to an error model, e.g., a three-term error model according to equation
a^ = c2 + b2 - yw + a2 - yt 2
2 wherein o~ tij is the predetermined measurement variance ofthe measurement^, a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
In a specific embodiment ofthe invention, a between-group variance is determined in step (b). In this embodiment, the within-group degree of freedom is calculated according to the equation
k n I v τ> - N - n k + ^ τ~' t = li = l m ti
where vR is the within-group degree of freedom, the between-group degree of freedom is calculated according to the equation
vτ =k-l where Vτ is the between-group degree of freedom, the within-group mean square is calculated according to the equation
sR =SR lvR
where sR is the within-group mean square and where SR is calculated according to the equation
Figure imgf000019_0001
where V R is calculated according to the equation
vRa = ma -l+—
and the between-group mean square is calculated according to the equation
Figure imgf000019_0002
where Sγ is the between-group mean square and where Sr is calculated according to the equation
T = ∑ N t (Yt - y f t = l where N, is calculated according to the equation
Figure imgf000020_0001
where y is calculated according to the equation
1 k n m,i -A τT / lI /? iJ /? i*- > tij t=\ i=\ j=\
where N is the total number of measurements, and yt is calculated according to the equation
Figure imgf000020_0002
In such embodiment, the statistical test can be an F-test and the significance level can be a p-value determined according to the equation
Figure imgf000020_0003
In another specific embodiment ofthe invention, an interaction variance is determined in step (b). In this embodiment, the within-group degree of freedom is calculated according to the equation
7? = N - n - k + ∑ ∑ t = 1 z = 1 n
where vR is the within-group degree of freedom, the interaction degree of freedom is calculated according to the equation
Figure imgf000021_0001
where V7 is the interaction degree of freedom, the v/ithin-group mean square is calculated according to the equation
SR /vR
where s ,R2 i •s the within-group mean square and where SR is calculated according to the equation
Figure imgf000021_0002
where V R . is calculated according to the equation
V*ι = mti -1 +mu
and the interaction mean square is calculated according to the equation
Figure imgf000021_0003
9 • where st is the interaction mean square and where S7 is calculated according to the equation
Figure imgf000022_0001
where y is calculated according to the equation
y ttij
Figure imgf000022_0002
where N is the total number of measurements, y is calculated according to the equation
n mti
, tij ι 1Ξ==Ll1 A . i==\! yt
N,
where Nt is calculated according to the equation
n
Nt =∑j ti i=\
yi is calculated according to the equation
Figure imgf000022_0003
where Nt is calculated according to the equation
Figure imgf000022_0004
and yti is calculated according to the equation
∑JV, tij
7=1 yti = ,s
In such embodiment, the statistical test can be an F-test and the significance level o can be a p-value determined according to the equation
pvalue = 1 - fcdf
Figure imgf000023_0001
/ SR , V J , V R J
5 In still another aspect ofthe invention, the invention provides a method of determining if a plurality of measurements in k different measurement groups of a first variable {y(/(l)} of a plurality of M different variables are unchanged among the plurality of different measurement groups, wherein a plurality of measurements of each ofthe plurality of M different variables {yti(j)} in k different measurement groups are available, wherein 0 yt j) is t e *th measurement of they'th variable in the tth measurement group, t = 1, 2, ..., k; i - 1, 2, ..., n„ n, being the number of measurements in measurement group t; and / = 1, 2, ..., M; wherein each measurement group consists of one or more measurements of a respective variable under a condition common to the measurement group, and wherein each measurement yti(j) has a predetermined measurement variance, the method comprising (a) 5 determining for each of the plurality of M different variables a within-group variance for the k different measurement groups ofthe variable, wherein the within-group variance consists of a propagated variance and a scattered variance, the propagated variance being determined based on predetermined measurement variances ofthe plurality of measurements, and the scattered variance being determined based on deviation of each ofthe plurality of 0 measurements with respect to a mean of measurements in a respective measurement group; (b) determining a between-group variance for the k different measurement groups ofthe first variable, wherein the between-group variance is a variance of a plurality of group means, one for each ofthe k different measurement groups, with respect to a mean ofthe plurality of measurements, wherein each group mean is the mean ofthe one or more measurements in 5 a measurement group; (c) determining an average within-group variance of measurements of the M different variables, wherein the average within-group variance is an average of within-group variances ofthe measurements ofthe plurality of different M variables; and (d) comparing the within-group variance ofthe first variable with the between-group variance c ofthe first variable and with the average within-group variance, thereby determining if the plurality of measurements ofthe first variable are unchanged among the plurality of different measurement groups. In the method ofthe invention, k, nt, and Mean be any appropriate integer. The method is equally applicable to any ofthe other variable among the M variables by repeating steps (a)-(d) such other variables in the plurality of M different Q variables.
In the method ofthe invention, the within-group variance for each variable can be determined based on group variances ofthe k different measurement groups for the variable, each of such group variance consisting of a measurement group propagated variance and a measurement group scattered variance. Each measurement group propagated variance can ι r be determined based on the predetermined measurement variances ofthe one or more measurements in the measurement group. The measurement group scattered variance can be a variance ofthe one or more measurements with respect to the mean ofthe one or more measurements in the measurement group.
Preferably, the propagated variance for each measurement group is determined
2 according to the propagated variance for each measurement group is determined according to the equation
σ ti 2 (J) σ
25 y,p U) = ( = 1 n .
where σ^ (j) is the propagated variance and <7; (J) is the predetermined measurement
30 variance of measurement^,^. Preferably, the mean ofthe one or more measurements is determined according to the equation
35 ∑ y ti ( j) ι= y t ( J) n
where yt (J) is the mean ofthe one or more measurements, and the scattered variance is determined according to the equation
Figure imgf000025_0001
where G y -*s ( ^^ /) / is the scattered variance.
In a preferred embodiment, the group propagated variance and the group scattered variance for each group are combined according to the equation
Figure imgf000025_0002
In one embodiment, the method ofthe invention further comprises a step of determining the predetermined measurement variance for each measurement. In a preferred embodiment, the predetermined measurement variance of each measurement yti(j) is determined according to an error model, e.g., a three-term error model according to equation
σlU) = c2 + b2 -ytiU) + a2 -yti{j)2
wherein O ^ (J) is the predetermined measurement variance ofthe measurement y„Q), a is a fractional error coefficient, 6 is a Poisson error coefficient, and c is a standard deviation of background noise.
Preferably, step of comparing the within-group variance ofthe first variable with the between-group variance ofthe first variable is carried out according to the equation
Figure imgf000026_0001
wherein
Figure imgf000026_0002
and
vr(l) = k-1
and
sR 2(l) = SR(l)/vR(l)
where
Figure imgf000026_0003
where
Rt (!) = », -1 + n
and Jr 2(l) = 5r(l)/vr(l)
where
Figure imgf000027_0001
where
Figure imgf000027_0002
where N is the total number of measurements of said first variable; and the comparing the within-group variance ofthe first variable with the average within-group variance comparing is carried out according to the equation
SmallError_ pvalue (1) =
Figure imgf000027_0003
where
ι M
wherein
sR ~U) = SR(j)/vR(j)
where
Figure imgf000028_0001
where
10 V Rt U) = nt - l + n
and where Vavg is chosen to be at least about 20 to about 100.
15
In still another aspect ofthe invention, the invention provides a method for analyzing variation among a plurality of measurements of a variable {yλ, where v, is the tth measurement, t = 1, 2, ..., n„ nt being the number of measurements. The plurality of measurements are measured under a common condition, and each ofthe measurement has a predetermined measurement variance. The method comprises (a) determining a propagated
20 variance and a scattered variance, wherein the propagated variance is determined based on predetermined measurement variances ofthe plurality of measurements, and wherein the scattered variance is a variance ofthe plurality of measurements with respect to the mean of the plurality of measurements; and (b) comparing the propagated variance and the scattered rjr variance, thereby analyzing variation in the plurality of measurements ofthe variable. In the method ofthe invention, nt can be any appropriate integer.
In a preferred embodiment, the propagated variance for each measurement group is determined according to the equation
30 σ σ t = l y , p n
35 9 2 where σv is the propagated variance and σt is the predetermined measurement variance
of measurement^, and the scattered variance is determined according to the equation
Figure imgf000029_0001
2 where σ y _t s „ is the scattered variance and where
Σ y t t = \ y n t
In a preferred embodiment, the method further comprises prior to step (a) a step of weighting each measurement yt with a weighting factor to obtain an error- weighted measurement, where the weighting factor is determined based on the predetermined measurement variance ofthe measurement. In one embodiment, the weighting factor is determined according to the equation
1 w t =
Figure imgf000029_0002
Preferably, the comparing is carried out using an F-test according to the equation
Consistenc y _ pvalue = 1 - - 1, nt
Figure imgf000029_0003
The invention also provides a computer system comprising a processor and a memory coupled to the processor and encoding one or more programs which programs cause the processor to carry out any one ofthe methods described above.
The invention also provides a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, which computer program product comprises a computer readable storage medium having a computer program mechanism encoded thereon. The computer program mechanism may be loaded into the memory ofthe computer and cause the computer to carry out any one ofthe methods described above.
10
4. BRIEF DESCRIPTION OF FIGURES FIG.l is a schematic illustration of an embodiment ofthe improved ANOVA data processing method. Briefly, the processing blocks in the diagram are listed as following.
H _ Pre-Processing (optional) — this pre-processing module handles data normalization and data transformation of, e.g., intensity data. Intensity error of a micro-array measurement has a multiplicative component that is proportional to the intensity itself. To meet the variance homogeneity requirement in ANOVA, the intensity data can be transformed by an error model based transformation which converts the intensity x and intensity error σx to a new
2o domain y and σy where the transformed intensity error is constant over the entire range of the transformed intensity. ANOVA — this improved ANOVA module takes two variables y and σy as inputs. For one-way ANOVA, the module computes one ANOVA p-value for each measurement y. For two-way ANOVA, it computes two p-values ofthe two factors and one p-value for the interaction between the two factors. The module also estimates the
25 error- weighted mean of replicates in each treatment group and the error ofthe mean. Post- Processing (optional) - This module inversely transforms the result mean and error from the transformed domain to the original domain.
FIG. 2 illustrates an exemplary embodiment of a computer system useful for 30 implementing the methods of this invention.
5. DETAILED DESCRIPTION OF THE INVENTION The present invention provides improved ANOVA methods for processing and analyzing measured data and transformed data. The methods ofthe invention is particularly 35 useful for analyzing gene or protein expression data. In this disclosure, a measurement group refers to a set of one or more replicate measurements of a variable of particular type of sample, e.g., a particular type of cells, a cells sample under a particular condition or perturbation, such as exposure to a drug, a genetic mutation and so on. The variable can be any measurable variable ofthe sample including but not limited to expression or transcript level of a gene or abundance of a protein, level or concentration of a small cellular molecule, e.g., a metabolite, level of a clinical indicator, e.g., level or concentration of a chemical in blood. As an example, a measurement group may contain a set of gene expression levels of a biological sample
1 „ which has undergone a particular drug treatment, whereas different measurement groups contain different sets of one or more gene expression levels ofthe biological sample under different drug treatments, e.g., different doses the same drug, different drugs, different time after the treatment by a drug, etc. In the disclosure, the term "measurement group" is used interchangeably with the terms "treatment group" or "sample group."
, <- In this disclosure, for simplicity reasons, the invention is often described in terms of measured data obtained from a DNA microarray experiment. It will be apparent to a person of ordinary skill in the art that the methods ofthe present invention are equally applicable to data measured in many other kinds of experiments, e.g., data measured in a protein array experiment or data measured in a two-dimensional (2D) protein gel experiment.
20
5.1. BIOLOGICAL STATE AND EXPRESSION PROFILE
The state of a cell or other biological sample is represented by cellular constituents
(any measurable biological variables) as defined in Section 5.1.1, infra. Those cellular constituents vary in response to perturbations, or under different conditions. The measured jr data can be measurements of such cellular constituents or measurements of responses of cellular constituents.
5.1.1 BIOLOGICAL STATE As used herein, the term "biological sample" is broadly defined to include any cell, ~ tissue, organ or multicellular organism. A biological sample can be derived, for example, from cell or tissue cultures in vitro. Alternatively, a biological sample can be derived from a living organism or from a population of single cell organisms. In preferred embodiments, the biological sample comprises a living cell or organism.
The state of a biological sample can be measured by the content, activities or ~r structures of its cellular constituents. The state of a biological sample, as used herein, is taken from the state of a collection of cellular constituents, which are sufficient to characterize the cell or organism for an intended purpose including, but not limited to characterizing the effects of a drug or other perturbation. The term "cellular constituent" is r also broadly defined in this disclosure to encompass any kind of measurable biological variable. The measurements and/or observations made on the state of these constituents can be of their abundances (i.e., amounts or concentrations in a biological sample) e.g., of mRNA or proteins, or their activities, or their states of modification (e.g., phosphorylation), or other measurements relevant to the biology of a biological sample. In various
, n embodiments, this invention includes making such measurements and/or observations on different collections of cellular constituents. These different collections of cellular constituents are also called herein aspects ofthe biological state of a biological sample.
One aspect ofthe biological state of a biological sample (e.g., a cell or cell culture) usefully measured in the present invention is its transcriptional state. In fact, the
, <. transcriptional state is the currently preferred aspect ofthe biological state measured in this invention. The transcriptional state of a biological sample includes the identities and abundances ofthe constituent RNA species, especially mRNAs, in the cell under a given set of conditions. Preferably, a substantial fraction of all constituent RNA species in the biological sample are measured, but at least a sufficient fraction is measured to characterize ς. the action of a drug or other perturbation of interest. The transcriptional state of a biological sample can be conveniently determined by, e.g., measuring cDNA abundances by any of several existing gene expression technologies. One particularly preferred embodiment of the invention employs DNA arrays for measuring mRNA or transcript level of a large number of genes. The other preferred embodiment ofthe invention employs DNA arrays r for measuring expression levels of a large number of genes or exons in the genome of an organism.
Another aspect ofthe biological state of a biological sample usefully measured in the present invention is its translational state. The translational state of a biological sample includes the identities and abundances ofthe constituent protein species in the biological
~n sample under a given set of conditions. Preferably, a substantial fraction of all constituent protein species in the biological sample is measured, but at least a sufficient fraction is measured to characterize the action of a drug of interest. As is known to those of skill in the art, the transcriptional state is often representative ofthe translational state.
Still another aspect ofthe biological state of a biological sample is its small
~c molecule state, e.g., metabolic state. The small molecule state of a biological sample comprises identities and abundances of small molecules present in a cell. Small molecules refer to molecules of molecular weights of less than about 5000, including but are not limited to sugars, fatty acids, amino acids, nucleotides, intermediates of cellular processes,
,- e.g., intermediates of metabolic and signaling pathways.
Other aspects ofthe biological state of a biological sample are also of use in this invention. For example, the activity state of a biological sample, as that term is used herein, includes the activities ofthe constituent protein species (and also optionally catalytically active nucleic acid species) in the biological sample under a given set of conditions. As is
. known to those of skill in the art, the translational state is often representative ofthe activity state.
This invention is also adaptable, where relevant, to "mixed" aspects ofthe biological state of a biological sample in which measurements of different aspects ofthe biological state of a biological sample are combined. For example, in one mixed aspect, the
, - abundances of certain RNA species and of certain protein species, are combined with measurements ofthe activities of certain other protein species. Further, it will be appreciated from the following that this invention is also adaptable to other aspects ofthe biological state ofthe biological sample that are measurable.
The biological state of a biological sample (e.g., a cell or cell culture) is represented
^ . by a profile of some number of cellular constituents. Such a profile of cellular constituents can be represented by the vector S,
s = [sl t 3, i ' Sλ (1)
25 where S- is the level ofthe fth cellular constituent, for example, the transcript level of gene i, or alternatively, the abundance or activity level of protein i. In preferred embodiments, k is more than 2, preferably more than 10, more preferably more than 100 , still more preferably more than 1000, still more preferably more than 10,000, still more n preferably more than 25,000, still more preferably more than 50,000, and most preferably more than 100,000.
In some embodiments, cellular constituents are measured as continuous variables. For example, transcriptional rates are typically measured as number of molecules synthesized per unit of time. Transcriptional rate may also be measured as percentage of a or control rate. However, in some other embodiments, cellular constituents may be measured as categorical variables. For example, transcriptional rates may be measured as either "on" or "off, where the value "on" indicates a transcriptional rate above a predetermined threshold and value "off indicates a transcriptional rate below that threshold.
5
5.1.2 BIOLOGICAL RESPONSES AND EXPRESSION PROFILES
The responses of a biological sample to a perturbation, i.e., under a condition, such as the application of a drug, can be measured by observing the changes in the biological state ofthe biological sample. For example, the responses of a biological sample can be
, „ responses of a living cell or organism to a perturbation, e.g., application of a drug, a genetic mutation, an environmental change, and so on, to the living cell or organism. A response profile is a collection of changes of cellular constituents. In the present invention, the response profile of a biological sample (e.g., a cell or cell culture) to the perturbation m is defined as the vector v m):
15
,("0 (m) (m) (m)
= V< v V (2)
Where V-" is the amplitude of response of cellular constituent i under the
20 perturbation m. In some particularly preferred embodiments of this invention, the biological response to the application of a drug, a drug candidate or any other perturbation, is measured by the induced change in the transcript level of at least 2 genes and/or proteins, preferably more than 10 genes and/or proteins, more preferably more than 100 genes and/or proteins, still more preferably more than 1000 genes and/or proteins, still more preferably
25 more than 10,000 genes and/or proteins, still more preferably more than 25,000 genes and/or proteins, still more preferably more than 50,000 genes and/or proteins, and most preferably more than 100,000 genes and/or proteins. In another preferred embodiment of the invention, the biological response to the application of a drug, a drug candidate or any other perturbation, is measured by the induced change in the expression levels of a plurality
30 of exons in at least 2 genes and/or proteins, preferably more than 10 genes and/or proteins, more preferably more than 100 genes and/or proteins, still more preferably more than 1000 genes and/or proteins, still more preferably more than 10,000 genes and/or proteins, still more preferably more than 25,000 genes and/or proteins, still more preferably more than 50,000 genes and/or proteins, and most preferably more than 100,000 genes and/or proteins.
35 In some embodiments ofthe invention, the response is simply the difference between biological variables before and after perturbation. In some preferred embodiments, the response is defined as the ratio of cellular constituents before and after a perturbation is applied. In other embodiments, the response may be a function of time after the perturbation, i.e., v(m) = v(m)(f). For example v(m)(t) may be the difference or ratio of cellular constituents before the perturbation and at time t after the perturbation.
In some preferred embodiments, vf is set to zero if the response of gene i is below some threshold amplitude or confidence level determined from knowledge ofthe measurement error behavior. In such embodiments, those cellular constituents whose o measured responses are lower than the threshold are given the response value of zero, whereas those cellular constituents whose measured responses are greater than the threshold retain their measured response values. This truncation ofthe response vector is a good strategy when most ofthe smaller responses are expected to be greatly dominated by measurement error. After the truncation, the response vector v(m) also approximates a 5 'matched detector' (see, e.g., Van Trees, 1968, Detection, Estimation, and Modulation Theory Vol. I, Wiley & Sons) for the existence of similar perturbations. It is apparent to those skilled in the art that the truncation levels can be set based upon the purpose of detection and the measurement errors. For example, in some embodiments, genes whose transcript level changes are lower than two fold or more preferably four fold are given the 0 value of zero.
In some preferred embodiments, perturbations are applied at several levels of strength. For example, different amounts of a drug may be applied to a biological sample to observe its response. In such embodiments, the perturbation responses may be interpolated by approximating each by a single parameterized "model" function ofthe perturbation 5 strength u. An exemplary model function appropriate for approximating transcriptional state data is the Hill function, which has adjustable parameters a, u0, and n.
a(ulu0)n
H(u) - (3) 1 + (ulu " 0
The adjustable parameters are selected independently for each cellular constituent ofthe perturbation response. Preferably, the adjustable parameters are selected for each cellular constituent so that the sum ofthe squares ofthe differences between the model function 5 (e.g., the Hill function, Equation 3) and the corresponding experimental data at each perturbation strength is minimized. This preferable parameter adjustment method is well known in the art as a least squares fit. Other possible model functions are based on polynomial fitting, for example by various known classes of polynomials. More detailed description of model fitting and biological response has been disclosed in Friend and Stoughton, Methods of Determining Protein Activity Levels Using Gene Expression Profiles, PCT publication WO 99/59037, which is incorporated herein by reference in its entirety for all purposes.
5.2. DATA TRANSFORMATIONS Measured data obtained in a microarray experiment often contain errors due both to the inherent stochastic nature of gene expression and to measurement errors from various external sources. The many sources of measurement error that may occur in a measured signal include those that fall into three categories - additive error, multiplicative error, and
Poisson error. The signal magnitude-independent or intensity-independent additive error includes errors resulted from, e.g., background fluctuation, or spot-to-spot variations in signal intensity among negative control spots, etc. The signal magnitude-dependent or intensity-dependent multiplicative error, which is assumed to be directly proportional to the signal intensity, includes errors resulted from, e.g., the scatter observed for ratios that should be unity. The multiplicative error is also termed fractional error. The third type of error is a result of variation in number of available binding sites in a spot. This type of error depends on the square-root ofthe signal magnitude, e.g., measured intensity. It is also called the
Poisson error, because it is believed that the number of binding sites on a microarray spot follows a Poisson distribution, and has a variance which is proportional to the average number of binding sites.
5.2.1. ERROR MODEL BASED TRANSFORMATIONS In a preferred embodiment, measured data are first transformed by an error model based transformation before analyzed by the improved ANOVA method ofthe invention. The results from the ANOVA analysis can be transformed back by an appropriate inverse transformation. An error model based data transformation method is described in Weng, U.S. Provisional Patent Application No. 60/353,845, filed on January 31, 2002, which is incorporated by reference herewith in its entirety.
5.2.1.1. ERROR MODELS Errors in measured data can be described by error models (see, e.g., Supplementary material to Roberts et al, 2000, Science, 287:873-880; and Rocke et al, 2001, J. Computational Biology 8:557-569). In preferred embodiments, an error model (see, e.g.,
<- Supplementary material to Roberts et al, 2000, Science, 287:873-880; and Rocke et al., 2001, J. Computational Biology 8:557-569) contains two or three error terms to describe the dominant error sources. In a two-term error model, a first error term is used to describe the low-level additive error which comes from, e.g., the background ofthe array chip. Since this additive error has a constant variance, in this disclosure, it is also called the constant
, error. The constant error is independent from the hybridization levels of individual spots on a microarray. It may come from scanner electronics noise and/or fluorescence due to nonspecific binding of fluorescence molecules to the surface ofthe microarray. In one embodiment, this constant additive error is taken to have a normal distribution with a mean bkg and a standard deviation σbkg. After background level subtraction, which is typically
2 applied in microarray data processing, the additive mean bkg becomes zero. In this disclosure, it is often assumed that the background intensity offset has been corrected. An ordinary skilled artisan in the art will appreciate that in cases where the background mean is not corrected, the methods ofthe invention can be used with an additional step of making such a correction.
2o The second error source is the multiplicative error that is the combined result ofthe speckle noise inherent in the coherent laser scanner and the fluorescence dye related noise. The multiplicative error is also called fractional error because its level is directly proportional to the magnitude ofthe measured signal, e.g., the measured intensity level. It is the dominant error source at high intensity levels. In one embodiment in which the
25 measured signal is obtained from a microarray experiment, the standard deviation ofthe fractional error in the k'th. spot can be approximated as
σfmc(k) χ a - x(k) (4)
30 where x(k) is the measured intensity in the /c'th spot. The constant a in Equation 4 is termed fractional error coefficient, and describes the proportion ofthe fractional error to the intensity ofthe measured signal. In one embodiment, the constant has a value in the range of 0.1 to 0.2. This constant may vary depending on the particular microarray technology used for obtaining the measured signal and/or the particular hybridization protocol used in
35 the measurement. In one embodiment, parameter a is determined during the error building phase by measuring the variance ofthe log ratio near the high intensity side in a same-vs.- same ratio experiment where the intensities in the ratio numerator and denominator come from the same sample and treatment. At high intensities, the variance of log ratio x, over x2 relates to parameter a
{a - xx)2 (a - x2)
Var{\n(xl / x2)} « 2 + .. 2 = 2 - al (5) j x2
when x, and x2 » σbkg . In one embodiment, x1 and x2 are at least 4, 10, 50, 100, or 200 times σbkg.
In a two-term error model, the measurement error in a measured signal, e.g., measured intensity, x(k) can be defined as
<*χ(X) = ^ bkg(kf + σ fa « ^σbks(k)2 + a2 - x(kf (6)
In a preferred embodiment ofthe invention, the background noise variances in Equation 6 are taken as slightly different in different microarray spots or regions of a microarray chip.
In one embodiment, the difference is less than 20%, 10%, 5%, or 1%.
In a three-term error model, an extra square-root term is included to describe measurement errors originated from variation in the number of available binding sites in a microarray spot. This term is also called the Poisson term. In one embodiment, without knowledge of actual number of binding sites in a microarray spot, the measured intensity is used to provide an estimate ofthe average number of binding sites. In such an embodiment, the Poisson error can be approximated as
σPoisson(k)
Figure imgf000038_0001
(7)
where parameter b is an overall proportional factor, termed Poisson error coefficient. In a three-term error model, the measurement error in a measured signal, e.g, a measured fluorescence intensity, x(k) can be defined as *,(*) = Vσ* ( )2 + σPoisson{kf + σfrac(k)2
(8)
^bkg(kY + b2 ■ x(k) + a2 • x(kf
In a preferred embodiment, during error model development, when σblcg and parameter a have been determined, parameter b in Equation 8 is determined by measuring the intensity variance in the middle intensity ranges ofthe same-vs.-same experiments. In one embodiment, the intensity variance is measured in the 25 to 75 percentile range, 35 to 65 percentile range, or 45 to 50 percentile range for determination of b.
In a preferred embodiment, after the error model development phase, parameters a and b are fixed for an error model under a given microarray technology and experiment protocol. The background noise σbkg can be estimated for each particular microarray experiment. In another preferred embodiment, when a set of replicate experiments are carried out, the background noise σbkg for the set can be obtained by averaging the background noise estimated for each ofthe replicate experiments.
The two-term error model as described by Equation 6 can been seen as a simplified version ofthe three-term error model described by Equation 8 by setting the Poisson parameter b to zero. In this disclosure, Equation 8 is used as the general mathematical description of error models. It will be apparent to an ordinarily skilled artisan that any results obtained based on Equation 8 are also applicable to a two-term error model by setting the Poisson parameter b to zero.
It will be apparent to an ordinarily skilled artisan that other methods may also be used to determine an error model (see, e.g., Rocke et al., 2001, J. Computational Biology
8:557-569).
5.2.1.2. INTENSITY TRANSFORMATIONS
It is clear from Equation 8 that microarray intensity measurements do not meet the constant- variance requirement. There are different measurement errors (or variances) in different intensities. The intensity error is a function of intensity itself. To overcome this problem, a function/0 is needed to transform measured data, e.g. the intensity data, x to a new domain in which the variance becomes a constant. All analysis and data processing can then be carried out in the transformed domain. In a preferred embodiment, such a transformation is described as
y k) = f(x(kj), for all x and (9)
σ (k) « C, for all x where C is a constant. (10)
Preferably the transformation works for both positive and negative (e.g, negative signals obtained after background subtraction) x. More preferably the transformation meets the following additional constraints:
(i) Monotonic: If x(kj)>x(kj, t enyfk^yfl^ for all x;
Figure imgf000040_0001
and
(iii) Smooth: The first and the second derivatives ofthe function f should be continuous functions.
Still more preferably, an inverse transformation function g exists so that the transformed data in the transformed domain can be transformed back to the original domain. The inverse transformation does the following operation:
x(k) = g(y(k)), for all y (11)
Preferably, the inverse transformation function g meets above four constraints as well. In one embodiment, the error in the inversely transformed intensity can be determined when the first derivative/' Q ofthe forward transformation function/is available:
(12) σ χ { ) df(x{k))/dx(k) ~ f'(x(k))
It is most preferable that the forward transformation function/ its first derivative/, and the inverse transformation function g are all in analytical closed-forms.
A transformation based on an error model is provided and used to transform measured data obtained in an experiment to a transformed domain such that the measurement errors in transformed data are equal to the measurement errors in the measured data normalized by errors determined based on an error model. As used in this
<- disclosure, such an measurement error, i.e., a measurement error which equals the measurement error in the measured signal normalized by an error determined based on an error model, is also referred to as a normalized error. Any suitable error model can be used in the invention. In a preferred embodiment, the error model is a two-term or a three-term error model described in Section 5.1.1.1. In a particularly preferred embodiment, the
. π variance ofthe transformed data in the transformed domain is close to a constant. More preferably, the transformation meets all requirements discussed in Section 5.2.1.2. The basic concept ofthe new transformation method is to apply an error model to normalize errors in real measurements, e.g., standard deviations in measured data, such that the normalized errors are close to a constant. Then a transformation function/) is found by the
, - integration ofthe normalization function. The methods are applicable to any set of measured data whose errors can be described by a particular error model.
In a specific embodiment, the real measurement standard deviation Δx is for the positive intensity x>0. The real standard deviation Δx is usually known before the transformation. An error model in Equation 8 provides σx that is an estimate ofthe real
20 standard deviation Δx for different intensities. In one embodiment, Δx is an error determined by the experiment. In another embodiment, Δx is calculated using an error model ofthe experiment. In a preferred embodiment, Δx is chosen to be the larger of an experimentally determined error or an error model-calculated error. Assuming the transformed standard deviation is Ay, the following approximation relates the two errors
25 with the first derivative function ofthe transformation:
dy Δv f,(x)-ix x 13)
30
If the equation is rearranged, one obtains
Ay * Ax - f' (x) (14)
35 Because Equation 8 is an approximation of Δx, if a normalization function ' is defined as follows:
= /'(*)= for x>0, (15) c2 + b - x + a2 - x
where a, b, and c are defined as in Section 5.2.1.1 , one can expect that the variance of y is o close to a constant.
Equation 15 provides an analytical form ofthe first derivative function ofthe desired transformation. To obtain the transformation function itself, both sides of Equation 15 are integrated:
5 dx y = f(χ) = jf'(χ)- dx = j-7 2 , r 2 , 2 2 , for x>0 (16) c + b - x + a x
Fortunately, the integral in Equation 16 does have an analytical solution. By using a 0 symbolic-solution software, such as Mathematica, or using an integral table, the solution is obtained as
5
Figure imgf000042_0001
(17)
Applying the zero intercept constraint (ii) in Section 5.2.1.2, i.e., y = 0 when x = 0, the 0 constant d in Equation 17 is found to be
Figure imgf000042_0002
As indicated in Equation 11 in Section 5.2.1.2, preferably one finds the inverse transformation function g(y) so that the transformed intensity y can be converted back to the original x scale whenever necessary. By using linear algebra or a symbolic-solution software, such as Maple, one finds
-{4 - a2 - c2 - a2 - e2aiy-d) + 2 - a - b2 • ea {y~d - b ) x = g(y) = Λ _3 d) , for y>0 - a3 - ea<
(19)
To complete the forward and the inverse transformation pair for both intensity and its error, the standard deviation ofthe inversely transformed intensity can be estimated by using Equation 12. In a specific embodiment, the transformation function can be further defined to be symmetric to zero for all x. When x<0, the absolute value |x| is used to replace x in the forward transformation in Equation 17 and to give a negative sign to the result y. In the inverse transformation in Equation 19, when y<0, the absolute value |y| is used to replace;; and to give a negative sign to the result x. Under the forward transformation, the estimated transformed error σy is one over all intensity ranges of x or y, so that constant C=l in Equation 10. The transformation also meets all other requirements and constraints stated in Section 5.2.1.2. In addition, the transformation has several other interesting properties:
_ . \n(Λ - a - x) y - f(x) « , when x is very large (20) a
1 y' - f'(x) ~ ~ , when |x| is veiy small (21) c
The transformation described in this section is applicable to any measured data in which the errors can be described by a three-term error model. In preferred embodiments, the measured data are measured in a microarray gene expression experiment. In other preferred embodiments, the measured data are measured in a protein array experiment or a 2D gel protein experiment.
In one preferred embodiment, the measured data are signal data obtained in an microarray experiment in which two spots or probes on a microarray are used for obtaining
- each measured signal, one comprising the targeted nucleotide sequence, i.e., the target probe (TP), e.g., a perfect-match probe, and the other comprising a reference sequence, i.e., a reference probe (RP), e.g., a mutated mismatch probe. The RP probe is used as a negative control, e.g., to remove undesired effects from non-specific hybridization. In one embodiment, the measured signal obtained in such a manner is defined as the difference
, ^ between the intensities ofthe TP and RP, x^- ^. In such an embodiment, the fractional error, the Poisson error, and the background constant error σbkg are described respectively according to equations
vfi k) * a • x(k) = a • xτp(k? + Rpikf (22)
15
σpoisso k) « b
Figure imgf000044_0001
(23)
Figure imgf000044_0002
20
The transformation described in this section remains applicable if Equations 22-24 are used to calculate the fractional error, the Poisson error and the background constant error, respectively. In one embodiment, the TP probe is a perfect-match probe (PM), and the RP probe is a mismatch probe (MM) (see, e.g., Lockhart et al, 1996, Nature Biotechnology
25 14:1675). In another embodiment, the RP probe is a reverse probe ofthe TP probe, i.e., the RP probe has a sequence that is the reverse complement ofthe TP probe (see, Shoemaker et al., U.S. Patent Application Serial No. 09/781,814, filed on February 12, 2001; and Shoemaker et al., U.S. Patent Application Serial No. 09/724,538, filed on November 28, 2000).
30
It will be apparent to one skilled in the art that although the transformations as described by equations 17 and 19 are preferably carried out using parameters a, b, and c chosen based on a three-term error model, the transformations as described by equations 17 and 19 can also be used by replacing parameters a, b, and c with other parameters. Embodiments using such parameters are also encompassed by the present invention.
35 5.2.2. OTHER TRANSFORMATIONS Another transformation that can be used to transform the data before ANOVA analysis is a logarithm transformation:
y(k) = f(x(k)) = ln(x(k)), for x> 0 (25)
In Equation 8, when intensity x is very high, the fractional error is the dominant error source. In this case, the standard deviation of is approximately a constant:
\ a - x(k) σ v(k) σx(k) - f'[x(k)) « — — — = a, when is very large (26) y x(k)
When intensity x is low, the standard deviation of y is inversely proportional to x, and is approaching infinity:
σy(k) « σx(k) f' x(k)) « — AA when x is very small (27)
X(fZ
Still another transformation that can be used to transform the data is a piecewise hybrid transformation (see, e.g., D. Holder, et al, "Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach", presented in Genelogic Workshop on Low Level Analysis of Affymetrix Genechip® data, Nov 19, 2001, Bethesda, MD, http://oz.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/Holder.ppt). This hybrid transformation uses a linear function at the low intensity side and a logarithm function for high intensities. An arbitrary parameter c ' defines the boundary between the linear and the logarithmic functions. Equation 28 is the mathematical definition ofthe hybrid transformation function.
y(k) = f(x(k)) = x(k), for 0 < x k) < c'
y(k) = f{x{k)) = c' \n(x(k) / c') + c' , for x(k) ≥ c' (28) y(k) = f(x(k)) = 0, for χ(k) < 0
In one embodiment, parameter c ' in Equation 28 is chosen to be 20. Errors ofthe hybrid- transformed intensities can be estimated as
σy (k) » σx(k) - f'{x(k)) = σx{k), foχ 0 < x{k) < c<
σy{k) « σx{k) f'(x(kj) = c' σx(k) I x(k), for x(/c) > d (29)
5.3. IMPROVED ANOVA METHODS FOR ANALYZING MEASURED DATA The present invention provides improved ANOVA methods for processing and analyzing measured data and transformed data, e.g., data transformed by a method described in Section 5.2. When transformed data are analyzed by the improved ANOVA method of the invention, it is optional to transform the results back by an appropriate inverse transformation.
5.3.1. ANOVA METHOD Analysis of Variance (ANOVA) is a widely used method for analyzing gene or protein expression data (see, e.g., Statistics For Experimenters, Box, Hunter and Hunter, John Wiley and Sons, 1978; Siegel et al., Nonparametric statistics for the behavioural sciences, McGraw Hill, 2nd edition, 1998; Conover, Practical Nonparametric Statistics, John Wiley and Sons, 3rd edition, 1998; Airman, Practical Statistics for Medical Research, CRC Press, 1991; Berry, Statistical Methods in Medical Research, Blackwell Science, Inc.,
2001). ANOVA is a method for determining whether there are statistical differences among the means of different measurement groups. Each different measurement group contains one or more replicate measurements of a particular type of sample, e.g., a particular type of cells, a cells sample under a particular condition or perturbation, such as exposure to a drug, a genetic mutation and so on. For example, gene expression levels or protein abundances or activity levels from cells under different conditions or perturbations, including but are not limited to different drug treatments, e.g., different doses the a particular drug, different drugs, different time after the treatment by a drug, etc., can be measured using DNA or protein micro-arrays and 2D gel electrophoresis or mass spectrometry. In such cases, each measurement group contains replicate measurements of expression level of a gene or abundance/activity level of a protein in cells under a condition or perturbation common to the measurements, whereas different measurement groups contain measurements of expression level ofthe gene or abundance/activity levels ofthe protein in cells under
<- conditions or perturbations different among different groups. In ANOVA, one first finds the within-group variance, i.e., variance of individual measurements, and the between-group variance, i.e., variance ofthe means of different treatment groups. The within-group variance reflects the measurement error ofthe measurement technology, while the between- group variance includes both the measurement error ofthe measurement technology and the
, π changes caused by different treatments. The between-group variance is then compared to the within-group variance. If the between-group variance is significantly larger than the within-group variance, it may be concluded that the different treatments have produced statistically significant changes in measurements, e.g., gene expression levels. In this disclosure, the term "treatment group" or "sample group" are also used. These terms and
. <- the term "measurement group" are intended to be equivalent and are used interchangeably. In the disclosure, the term "average" and "mean" are also used interchangeably.
In ANOVA, the underlying null-hypothesis H0 is that all treatment groups have the same mean, i.e., the mean of each treatment group does not change among different treatment groups. For example, in cases of expression levels of a gene or abundances of a
~n protein under different conditions, H0 is that the mean expression level ofthe gene or abundances ofthe protein is the same under the different conditions. To test H0, a statistical distribution of residues can be selected. For example, in ANOVA F-statistics is often used with an underlying assumption of normal distribution. A statistical metric can then be chosen. The significance level of he statistical metric is then determined. If the
~c significance level is greater than a given threshold, the null-hypothesis is accepted, whereas if the significance level is lower than the threshold the null-hypothesis is rejected and the alternative hypothesis Ht that one or more measurement groups have different means is accepted. Alternatively, a randomization method, e.g., permutation, can be employed to obtain a distribution free significance level (see, e..g., Wu et al., "MAANOVA: A software r. package for the Analysis of Spotted cDNA Microarray Experiments, published at www.jax.org/churchill/pubs/wu_maanova.pdf, accessed on October 25, 2002, which is incorporated by reference herein in its entirety). A skilled person in the art will be able to select an appropriate statistical test once the data to be analyzed are given.
To calculate a statistical metric, e.g., a metric for F-statistics, assuming there are a r total of k different treatment groups, e.g., k different compounds or k different dose levels of one compound, and a total of nt replicate measurements in treatment group t, then the tth replicate measurement in the tth treatment group is written as v,„ where the treatment index t is from 1 to k, and the replicate index i is from 1 to nt. The total number of measurements is
N = ∑nt (30) t=\
o The grand average of all N measurements is
Figure imgf000048_0001
5
For a given treatment group t, the group average is
Figure imgf000048_0002
0
The traditional one-way AΝOVA table is presented in Table 1 (see, e.g., Box et al., Statistics For Experimenters, John Wiley & Sons, 1978). The AΝOVA table summarizes the quantities used in AΝOVA calculation.
5
0
5 Table 1. One-way traditional ANOVA
Figure imgf000049_0002
In the traditional ANOVA, the statistical metric for F-statistics can be the ratio ofthe estimated mean squares in Table 1, i.e., sT 2/sR 2. The probability that the null-hypothesis may be accepted using F-test is then calculated as a p-value according to Equation 33. When the p-value is lower than a given threshold, for example p-value<0.01, the null-hypothesis is rejected and the alternative hypothesis Hj is accepted.
pvaluΘ
Figure imgf000049_0001
(33)
Other statistical test can also be used in ANOVA. A skilled person in the art will be able to select a suitable statistical test based on, e.g., the type of data to be analyzed and the goal of the analysis (see, e..g., Wu et al., "MAANOVA: A software package for the Analysis of
Spotted cDNA Microarray Experiments, published at wrww.jax.org/churchill/pubs/wu_maanova.pdf, accessed on October 25, 2002, which is incorporated by reference herein in its entirety). 5.3.2. IMPROVED ANOVA METHODS The invention provides an improved ANOVA method for analyzing measurement data. In the improved ANOVA method ofthe invention, for each measurement yh of a
<- biological variable y, in addition to the measurement quantity ytl itself, e.g., measured expression level of a gene, a second input is also used. The second input is a predetermined measurement variance or error. In the present invention, a predetermined measurement variance or error can be any variance or error which is determined prior to the ANOVA analysis. For example, the predetermined variance or error of a measurement can be
, determined based on measured data which are obtained either prior to or concurrently with the measurements to be analyzed. In one embodiment, the predetermined variance or error of a measurement of a variable is determined based on prior measurements ofthe same variable using the same measurement technological platform, e.g., prior measurements of the expression level ofthe same gene or protein using the same type of microarrays. In
, - another embodiment, the predetermined variance or error of a measurement of a variable is determined based on concurrent measurements of a plurality of different variables using the same measurement technological platform, e.g., measurements ofthe expression levels of a plurality of different genes using one or more microarrays. The plurality of different variables may include the variable whose predetermined variance or error is to be r. determined. Alternatively, the plurality of different variables may include only variables other than the variable whose predetermined variance or error is to be determined. The predetermined error or variance can be determined from such prior or concurrent measurements using a suitable error model. Determination ofthe predetermined variance or error based on a combination ofthe above discussed prior and concurrent measurements is jC also envisioned. In this disclosure, a predetermined measurement variance or error is also referred to as a "prior," which is intended to encompass a predetermined variance or error determined by any ofthe methods disclosed above. In one embodiment, the additional input is an estimated standard error ofthe measurement ;;„, σ„.
In preferred embodiments, the invention provides a method of analyzing a plurality
3Q of measurements of a variable ( '„} in k different measurement groups, where y„ is the tth measurement in the tth measurement group, t — 1, 2, ..., k and = 1, 2, ..., n„ n, being the number of measurements in measurement group t. Each ofthe k different measurement groups consists of measurements ofthe variable under a particular condition or perturbation common to the measurement group. Each measurement v„ has a predetermined
32 measurement variance. In the method ofthe invention, each measurement y is first weighted with a weighting factor to obtain an error-weighted measurement. The weighting factor can be determined based on the measurement variance ofthe measurement. In one embodiment the weighting factor is calculated according to Equation (34), infra. Then a
- within-group variance for the k different measurement groups is determined. The within- group variance consists of a propagated variance and a scattered variance. The propagated variance can be determined based on the predetermined measurement variances ofthe plurality of measurements, and the scattered variance can be determined based on deviation of each ofthe plurality of measurements with respect to a respective group mean, i.e., a
, . mean of error- weighted measurements in a respective measurement group. It will be apparent to one skilled in the art that if a measurement group contains only one measurement, the mean is the same as the measurement. In a preferred embodiment, the propagated variance for each measurement group is determined according to Equation (37), infra. In another preferred embodiment, the scattered variance is determined according to
1 <- Equation (40), infra. A between-group variance for said k different measurement groups is then determined. The between-group variance is a variance of a plurality of group means, one for each ofthe k different measurement groups, with respect to a mean ofthe plurality of error- weighted measurements. The within-group variance is then compared with the between-group variance to determine, e.g., if the condition or perturbation has an effect on
~π the variable.
The method ofthe invention can also be carried out without error- weighting the measurements. In such a method, a within-group variance for the k different measurement groups is determined. The within-group variance consists of a propagated variance and a scattered variance. The propagated variance can be determined based on the measurement
«_- variances ofthe plurality of measurements, and the scattered variance can be determined based on deviation of each ofthe plurality of measurements with respect to a respective group mean, i.e., a mean of measurements in a respective measurement group. It will be apparent to one skilled in the art that if a measurement group contains only one measurement, the mean is the same as the measurement. In a preferred embodiment, the on propagated variance for each measurement group is determined according to Equation (37a), infra. In another preferred embodiment, the scattered variance is determined according to Equation (40a), infra. A between-group variance for said k different measurement groups is then determined. The between-group variance is a variance of a plurality of group means, one for each ofthe k different measurement groups, with respect
~ - to a mean ofthe plurality of error- weighted measurements. The within-group variance is then compared with the between-group variance to determine, e.g., if the condition or perturbation has an effect on the variable.
In a preferred embodiment, the within-group variance is a sum of a plurality of
_- group variances, each of which consisting of a measurement group propagated variance and a measurement group scattered variance. The measurement group propagated variance can be determined based on the predetermined measurement variances ofthe one or more measurements in the measurement group. The measurement group scattered variance is a variance ofthe one or more measurements with respect to the mean ofthe one or more
, r. measurements or error- weighted measurements in the measurement group. It will be apparent to one skilled in the art that if a measurement group contains only one measurement, the mean is the same as the measurement. In a preferred embodiment, the mean of said error- weighted measurement is determined according to Equation (36), infra, and the scattered variance is determined according to Equation (40), infra.
1 _ The propagated variance and the scattered variance are preferably combined in such a manner that when the number of measurements in a measurement group is small, the propagated error is dominant, whereas when the number of measurements in a measurement group increases, the scattered error becomes more and more dominant, and when the number of measurements in a measurement group become very large, the group variance . approaches the scattered error. In another preferred embodiment, the propagated variance and the scattered variance are combined according to Equation (43), infra.
In one embodiment, the comparison ofthe within-group variance and the between- group variance is carried out using an appropriate statistical test, e.g., an F-test. In such an embodiment, within-group degree of freedom, between-group degree of freedom, within- yC group mean square, and between-group mean square are first determined using the appropriate variances. In a preferred embodiment, these parameters are determined according to the respective equations in Table 2. A significance level, e.g., a p-value, can then be determined, e.g., according to Equation (33).
In another embodiment, the analysis can be carried out by applying traditional n ANOVA on error-weighted measurements.
In a preferred embodiment, a technology platform specific error model is used to determine the predetermined measurement error σtl. Such estimated measurement error is used as a predetermined error or a "prior" in the within-group variance estimation. In a more preferred embodiment, the errors are estimated using an error model described in
3 Section 5.2., supra. Preferably, the additional input σ„ comes from an application-specific error model, e.g., a three term error model as described by Equation 8. Based on the knowledge of data noise sources and training data, the error model provides a conservative estimation ofthe measurement error in quantity v. Measurement error contains errors from
- various sources, including sample preparation (inverse-transcription, amplification, et al), labeling and dye related bias, hybridization, chip quality variation, chip scanning and image feature extraction. The measurement error determines the lower bound ofthe total error that includes biological variations. When the number of replicates is limited, the measurement error can be used as a "prior" in the variance estimation. The measurement error provides
, n additional information to obtain better variance estimations. The additional input also brings in additional degrees-of-freedom in the analysis. Both the improved variance estimation and the additional degree-of-freedom can help increase the statistical power of the analysis. The measurement error may also be used for determining the quality ofthe measurement. Weights that are inversely proportional to measurement errors in the mean
, <- estimation may be used to penalize measurements with large errors.
In another embodiment, a common error in an experiment estimated across a plurality of different biological variables v's measured in the same experiment, e.g., expression levels of different genes measured by using one or more microarrys in an experiment, is used to determine the measurement error for one or more ofthe r. measurements of v's. In a more preferred embodiment, such a common error is estimated according the method disclosed in U.S. Patent No. 6,351,712, which is incorporated herein by reference in its entirety. Common errors determined by other methods can also be used (See, Section 5.3.2.5, infra).
The invention is based, at least in part, on the discovery by the inventor that when c the number of replicate measurements in one or more measurement groups is small, the second input provides for more reliable variance estimation. Therefore, the additional input may help setting a "floor" in the within-group variance estimations for more accurate determination ofthe within-group variance. The additional input may also provide additional degrees-of-freedom to the analysis. As a result, both false-positive and false or. negative rates may be significantly reduced.
In addition, the additional input, e.g., the measurement error σ„, can be used in error- weighted averaging for the group mean estimation. In one embodiment, the weight is inversely proportional to the square of σa so that measurements with larger errors will have less contribution to the mean estimation.
35 The improved ANOVA methods ofthe invention can be used to carry out analysis of any measurements, including but not limited to intensity data, such as gene expression intensity data obtained using Affymetrix micro-arrays, and ratio data, such as gene expression ratio data obtained using Agilent micro-arrays.
5.3.2.1. ERROR- WEIGHTED AVERAGE In traditional ANOVA, the grand average and the treatment-group average are computed using Eqs. 31 and 32. In such calculations, all measurements are treated equally. , r. This is equivalent to assign one as the weight to all data points. If it is known a priori that some data points have larger measurement errors than others, it may be desirable to give them smaller weights in the calculation ofthe average to minimize their impacts to the accuracy ofthe mean estimation.
In a preferred embodiment, the weighting factor can be defined as:
15
-HA, = (34) σ
20 where σtl is standard error ofthe measurement. It will be apparent to an ordinary skilled person in the art that other error may also be used to define the weighting factor
In this embodiment, the grand average and the treatment group average are calculated as:
25
ΣΣW
Figure imgf000054_0001
and
35
Figure imgf000055_0001
respectively.
The propagated errors ofthe weighted treatment group average and the grand average are calculated as
Figure imgf000055_0002
and
2 1 1 1 1 σ- =- k n, k n, -. k \ k (38)
ΣΣWti ΣΣ^ Σ 2 Σ wt t=\ i=\ t=\ ι=ι σti t=\ σ-ιP (=ι
respectively, where a treatment group weight is defined as
W t ? (39)
The scattered errors ofthe weighted treatment group average and the grand average are calculated as
Figure imgf000056_0001
and
Figure imgf000056_0002
respectively.
The propagated error provides prior information for the error estimation. In a preferred embodiment, the error model as described in Section 5.2, supra, is used to estimate the propagated error for each y. In another embodiment, the propagated error and the scattered error are combined to improve error estimation for the group average and the grand average. The combination idea is that when the number of samples is small the prior should contribute more to the combined error estimation, whereas when the number of samples becomes large, the combined error should be dominated by the scattered error, which is the actual error estimation from data. In the case of large number of samples, the estimated error becomes less dependent on the error model. The weight is defined as inversely proportional to the variance ofthe error ofthe scattered error. In one embodiment, because the error ofthe scattered error is
Figure imgf000056_0003
a weight that is proportional to 2(n-l) is used to weight the scattered error. It is also known that the sum of all weights should be equal to one. The total error ofthe treatment-group average and the total error ofthe grand average are calculated as
Figure imgf000057_0001
and
Figure imgf000057_0002
respectively.
In another embodiment, in the method that does not involve error-weighting the measurements, the Eqs. 35-38 and 40-41 are replaced respectively with the following equations,
Figure imgf000057_0003
Figure imgf000057_0004
Figure imgf000057_0005
Figure imgf000057_0006
Figure imgf000058_0001
Figure imgf000058_0002
5.3.2.2. IMPROVED MEAN-SQUARE ESTIMATION In Table 1, when the within-group degrees-of-freedom N-k is small, the error of within-group mean-square is large. The error ofthe mean-square estimation is inversely proportional to the square of N-k, which is the sum of degrees-of-freedom (nk-\) within the treatment group. When the number of replicates nk is small, there is less confidence in the within-group variance estimation.
In one embodiment, instead of calculating within-group mean-square using the within-group sum of squares, the within-group mean-square is calculated based on the estimated error. By definition the group mean-square is the error square multiplied by the number of samples. For treatment group t, the mean-square is calculated as
Figure imgf000058_0003
By definition, the sum of squares of group t is the group mean-square multiplied by the DOF ofthe group mean-square estimation:
SR =VRt -SRt (46)
where the group degree-of-freedom (DOF) is given in Section 5.3.2.3, infra.
The total sum of squares is the sum ofthe group sum of squares:
Figure imgf000059_0001
By definition, the overall mean square is the total sum of squares divided by the total within-group DOF:
SD
S D (48)
Vn
where the total DOF is given in Section 5.3.2.3, infra. In one embodiment, if a prior for the between-treatment variance that depends on the treatments is not available, the same estimation method as shown in Table 1 is used to estimate the between-treatment variance. In another embodiment, if a prior for the between- treatment variance is available, the between-treatment variance is obtained by a combination of a propagated error and a scattered error in a manner similar to the within-treatment variance as described in Section 5.2.2.1, supra.
The first error term in Equations 43 and 44 reflects the propagated measurement errors ofthe derived averages. In one embodiment, this error term is defined in Equation 37. It is used as prior in the mean square estimation to improve the estimation accuracy especially when the number of replicates is small. When the number of replicates is small, by mere random chances, the sum-squares without the priors can be much larger or smaller than the inherent measurement error in the measurement technology, e.g., the microarray technology, resulting in high false-negative rate and false-positive rate, respectively. The prior sets a floor in the sum-square estimation so that the estimation can not become unreasonably small. The new measurement-error input provides additional prior information to help improve the power of ANOVA analysis.
5.3.2.3. ADDITIONAL DEGREES-OF-FREEDOM The additional measurement-error input also brings in more degrees-of-freedom to the ANOVA analysis. The additional within-treatment DOF gained from the additional ANOVA input can be determined based on the level of confidence of the predetermined variance or error. The additional DOF can range from zero, if the level of confidence ofthe predetermined variance or error is low, to a large number, if the level of confidence ofthe predetermined variance or error is high. For example, the within-treatment DOF can be c assigned to a very large number (e.g., at least 20, 50, orlOO) regardless ofthe number of measurements in the measurement groups, if the predetermined variance or error is already a conservative and realistic estimation ofthe measurement error. The level of confidence of the predetermined variance or error can be determined based on the manner such variance or error is obtained (see, e.g., e.g., Stoughton et al., U.S. Patent No. 6,351,712, which is
, π incorporated by reference herein in its entirety).
In a preferred embodiment, from the propagated error in Equation 37 and the scattered error in Equation 40, the within-treatment DOF ofthe propagated error is calculated as
15
VylP = nt (49)
and the within-treatment DOF ofthe scattered error is
yy-fS =nt -\ (50)
20
In a preferred embodiment, the two errors are weighted when estimating the individual mean-square of treatment group t in Equation 45, . Each error has reduced contribution to the combined result, so does its DOF. The DOF ofthe group mean-square is the weighted
25 combination ofthe DOF's ofthe two individual error terms:
Figure imgf000060_0001
30
Based on Equation 47, the total within-treatment DOF is the sum of DOF's of all individual group:
35
Figure imgf000061_0001
It can be seen that the within-treatment DOF as described by Equation 52 is increased as compared to the value in Table 1. The higher degrees-of-freedom provide more statistical power to the ANOVA analysis and therefore higher sensitivity.
When the error-weighting method is used to compute the averages and their errors, for example using Equations 36, 37 and 40, those data points that have large measurement errors contribute less to the mean and variance estimations. The total effective number of samples used in the calculation is therefore equal or less than the total number of actual input samples. In one embodiment, the effective numbers of samples in Table 2 are calculated as
minf en, = (σ = (min(σ )2 - ∑wti σ y-,p ι=l
(53)
eN= ∑ent (54)
In the embodiment that does not involve error-weighting the measurements, Eqs.53 and 54 are replaced respectively with the following equations,
eι =« ',/ 5 (53a)
eτV=τV. (54a) Table 2 lists an exemplary embodiment of one-way improved ANOVA with error- weighting in which changes from the traditional ANOVA (Table 1) can be easily seen. In the embodiment that does not involve error- weighting the measurements, the equations in Table 2 can be modified by substituting the relevant quantities, e.g., the averages, variances, and effective number of samples, with the quantities calculated using Eqs. 35a-38a, 40a- 41a, and 53a-54a.
In one embodiment, F-statistics is used to test the null-hypothesis. For example, the
10 statistical metric can be calculated using the mean squares and DOFs of Table 2, and the ANOVA p-value can be calculated using Equation 33. It will be apparent to one skilled in the art that any other statistical tests, e.g., t-test, in which the estimation of variance within treatment replicates is used for determining the statistical metric may also be used in the , <- present invention.
Table 2. An Exemplary Embodiment of Improved One-way ANOVA
3
Figure imgf000062_0001
35 5.3.2.4. IMPROVED TWO-WAY ANOVA The invention also provides improved two-way and N-way ANOVA for analyzing measured and/or transformed data under two or more different conditions, e.g., perturbations. For simplicity, 2-way improved ANOVA is described in this section. It will be apparent to a person skilled in the art that N-way improved ANOVA can be practiced similarly. The notation used is similar to those used in Box et al., Statistics For Experimenters, John Wiley & Sons, 1978.
If there are two factors P having n levels, denoted as i, where i can be from 1 to n , ■4 r. and T having k levels, denoted as t, where t can be from from 1 to k; and if the number of replicates in group ti is mtl, each replicate is denoted by asy, i.e., the index for the replicates is/, wherein / can be from 1 to m the relevant quantities for the traditional 2-way ANOVA are summarized in Table 3. The total number of samples and the grand average are
15
N = ∑∑™a , (55) t= =1
and
20
Figure imgf000063_0001
respectively. The number of samples for Factor P and Factor T are
Figure imgf000063_0002
and
Figure imgf000063_0003
respectively. Group averages for Factor P and Factor T are
35
Figure imgf000064_0001
and
Figure imgf000064_0002
respectively. The interaction average is
Figure imgf000064_0003
ya (61) m,;
Table 3. Traditional Two-way ANOVA Table
Figure imgf000064_0004
The invention therefore provides an improved two-way ANOVA method for t- analyzing a plurality of measurements of a variable in a plurality of different measurement groups, each of which consisting of one or more measurements ofthe variable under a condition among a first plurality of conditions and a condition among a second plurality of conditions, and each measurement has a predetermined measurement variance. The improvement comprises determining a within-group variance consisting of a propagated 1 . variance and a scattered variance. The propagated variance can be determined based on the predetermined measurement variances ofthe plurality of measurements, and the scattered variance is determined based on deviation of each ofthe plurality of measurements with respect to a mean of measurements in a respective measurement group.
In one embodiment, the averages or error-weighted averages, propagated and , - scattered errors, DOF's, and effective number of samples for improved two-way ANOVA are similarly derived as those for the improve one-way ANONA (Sections 5.3.2.2-5.3.2.4). For example, the within-treatment DOF of group ti is
20 vsa = ma -l+- (62)
™a
As another example, the within-treatment effective number of samples is
25
emti = wtij , (63)
Figure imgf000065_0001
30 In one embodiment, the invention provides a method of analyzing a plurality of measurements of a variable {ytlJ} in k * n different measurement groups, where yt is they'th measurement in the t/th measurement group, t = 1, 2, ..., n, i = 1, 2, ..., and k, j = 1, ..., mn, mh being the number of measurements in measurement group ti. Each ofthe k * n different measurement groups consists of measurements ofthe variable under conditions t and i of
35 condition groups N having n different conditions and K having k different conditions, respectively. Each measurement ytlJ has a predetermined measurement variance. In the method ofthe invention, each measurement yt is first weighted with a weighting factor to obtain an error- weighted measurement. The weighting factor can be determined based on
- the predetermined measurement variance ofthe measurement. In one embodiment the weighting factor is calculated according to Equation (34). A within-group variance for the k * n different measurement groups is then determined. The within-group variance consists of a propagated variance and a scattered variance. The propagated variance can be determined based on the predetermined measurement variances ofthe plurality of
, r. measurements, and the scattered variance can be determined based on deviation of each of the plurality of measurements with respect to a mean of measurements in a respective measurement group. In a preferred embodiment, the propagated variance for each measurement group is determined according to Equation (37). In another preferred embodiment, the scattered variance is determined according to Equation (40). Then a
, - between-group variance for condition group K or N, or for interaction between condition groups K and N is determined. The within-group variance is then compared with the between-group variance to determine, e.g., if either of two groups of conditions have effects on the variable. In one embodiment, the effect of condition group K is determined using the between-group variance of condition group TT which is a variance ofthe group means for the j . respective conditions in condition group K with respect to a mean ofthe plurality of error- weighted measurements. Each group mean for a condition in condition group K is the mean of measurements in all measurement groups under a respective condition in condition group K. The effect of condition group N can be determined in a similar way by using the between-group variance of condition group N which is a variance ofthe group means for the j r respective conditions in condition group N with respect to a mean of the plurality of error- weighted measurements, where each group mean for a condition in condition group N is the mean of measurements in all measurement groups under a respective condition in condition group N. The effect ofthe interaction between condition group N and condition group K can also be determined using the interaction variance which is a variance of group or. interaction means with respect to the mean ofthe plurality of error- weighted measurements and the condition group mean for condition group A' and the condition group mean for condition group N, where each group interaction mean is the mean of measurements in a measurement group over the number of measurements in the measurement group, e.g., as calculated by the equations in Table 4. The method is also applicable to error-weighted <r measurements . Table 4 summarizes the quantities in an exemplary embodiment of improved 2-way ANOVA. As can be seen from Table 4, in addition to error- weighted averaging and effective number of samples, the within-treatment sum of squares and DOFs are also r changed as compared to the traditional ANOVA.
Figure imgf000067_0001
25
It will be apparent to a skilled person in the art that improved N-way ANOVA can be obtained by, e.g., properly introducing more indices and computing the appropriate ANOVA table.
5.3.2.5. OTHER MEASUREMENT ERRORS AS ANOVA INPUTS
30 Measurement variance σ„ can also be determined by various other methods known in the art and used as the additional ANOVA input. The methods ofthe invention are equally applicable to such measurement variance. For example, the methods as described in Sections 5.3.3.1. through 5.3.3.5 can be used in conjunction with such measurement variance. In one embodiment, common error variance associated with an experiment
35 estimated across a plurality of different biological variables measured in the experiment is used as the measurement error σtl. The number ofthe plurality of different biological variables used for determining the common errors can be chosen by one skilled person in r the art. In preferred embodiments, the number is at least 2, 5, 10, 100, 1000, or 10,000. For example, in a microarray experiment, common error variance across a plurality of different genes measured on a microarray can be used as the additional input in the improved ANOVA method ofthe invention for one or more genes in the plurality of different genes. Such common error can be estimated over more than one replicates ofthe experiment, e.g., o more than one replicates of a microarray experiment. In a preferred embodiment, such a common error is estimated from one or more replicates according a method disclosed in U.S. Patent No. 6,351,712, which is incorporated herein by reference in its entirety. In another embodiment, the error estimation may come from a regression analysis. In one embodiment using regression analysis, the relationship between intensity and intensity error
2 can be find by fitting a function to the data set of one or several replicates. Error variance across a plurality of control or reference probes on a microarray can also be used as the additional input in the improved ANOVA method ofthe invention. It will be apparent to one skilled in the art that measurement variance determined by any other suitable method can also be used as the additional ANOVA input. ,
20
5.4. OTHER APPLICATIONS OF IMPROVED ANOVA The invention also provides method of using the improved ANOVA for determining if measured data are unchanged under different conditions or perturbations and for determining if replicate measurements in a group are consistent with each other.
25
5.4.1. METHOD OF DETERMINING UNCHANGED MEASUREMENTS Measurements for which the null-hypothesis cannot be rejected, e.g., determined by ANOVA as not different under different conditions or perturbations, are not necessarily unchanged. For example, although an ANOVA analysis of measured expression levels of a
3Q gene or protein under a set of different drug treatments indicates that the expression of the gene or protein is not changed under the set of treatments, the expression levels may nonetheless have changed under some circumstances. This is due to the fact that there are several factors that may affect an ANOVA test. One possibility that a null-hypothesis can not be rejected is that the null is indeed true where treatment groups do have the same mean.
35 Another possibility that a null-hypothesis cannot be rejected is that the within-group variance is too large. For genes or proteins whose expression levels have high ANOVA p- value (e.g. p-value>0.95), if they also have small within-group variances, it may be expected that they are unchanged. r The invention therefore provides, in one embodiment, a method of determining if measured data of a biological variable under different conditions or perturbations have a large or small within-group variance. The method ofthe invention makes use ofthe average within-group mean-square determined across a plurality of different biological variables, and compares the individual mean-square of measurements ofthe biological variable of
. interests with the average. The hypothesis is that the individual mean-square is smaller than the average. If the hypothesis can be accepted, e.g., by an appropriate statistical test, the within-group variance is deemed small. For example, in a microarray experiment, a plurality of different genes are measured under a set of different conditions or perturbations, each on one or more microarrays. In one embodiment, if Mis the number of different
1 <- biological variables, e.g., the total number of measurements in each gene or protein expression profile, and sR (j) is the within-group mean-square ofthe j 'th biological variables, the average within-group mean-square is defined as:
Figure imgf000069_0001
Preferably, Mis at least 2, 5, 10, 50, 100, 1000, 10,000 or 100,000. The different biological variables used for calculating the average can also be selected if desired. For example, data
25 generated from bad array spots can be excluded. The individual within-group mean-square is then compared with the average in an F-test as described by Eq. 65, where the DOF ofthe average vavg is typically large. In one embodiment, the DOF of vavg is set to a large arbitrary number in the computation because the Chi-Square distribution ofthe average is approaching a Gaussian distribution when the DOF is large. The within-group variance of
30 the ith biological variable can be compared with the average within-group variance using Eq. (65)
SmallError_pvalue(ϊ) = l-fcdf{sR 2(i)/sR 2,vR,vavg). (65)
35 When SmallError _pvalue(i) is large, the individual within-group mean-square ofthe 7th biological variable is identified as smaller than the average mean-square. In a preferred embodiment, the SmallError '_pvalue threshold is set to be greater than 0.5, thus, those c biological variables/measurements that have within-group mean-squares less than the average are accepted as having small within-group variance. In a preferred embodiment, the within-group mean squares in Eqs. 64 and 65 are calculated using the improved ANOVA method ofthe invention, e.g., according to Table 2 or Table 4.
In one embodiment, to detect unchanged biological variables/measurements, two p-
, 0 value thresholds, one for the ANOVA p-value and one for the SmallError p-value, are set. In a preferred embodiment, the ANOVA p-value is determined using a method ofthe invention as described in Section 5.3. Biological variables/measurements which have the ANOVA p-value and the SmallError p-value greater than the respective threshold are determined as truely unchanged. In a preferred embodiment, the thresholds are selected as
, r the ANOVA p-value > 0.95 and the SmallError p-value > 0.5. It will be apparent to one skilled in the art that other thresholds for the ANOVA p-value and the SmallError p-value can also be selected according to particular experimental conditions. In another preferred embodiment, the thresholds are selected as the ANOVA p-value > 0.95 and the SmallError p-value > 0.9.
20
5.4.2. METHOD OF ANALYZING VARIATIONS AMONG REPLICATE
MEASUREMENTS The invention also provides a method of using the improved ANOVA method for detecting variations among replicates in a treatment group, thereby determining the consistency ofthe replicates.
25 In one embodiment, for a given variable y in treatment group t, an F-test is used to compare the scattered error as determined according to Equation 40 (Equation 40a if the measurements are not error weighted) and the propagated error as determined according to Equation 37 (Equation 37a if the measurements are not error weighted). If the scattered error is significantly larger than the propagated error, it indicates there are variations beyond
30 measurement error. In one embodiment, a p-value of this F-test, termed "replicate consistency p-value" is calculated as:
35
Figure imgf000071_0001
(66)
. ft A threshold is selected for the replicate consistency p-value. If the percentage ofthe number of biological variables/measurements having the p-value below the threshold to the total number of biological variables/measurements measured is significantly above the threshold level, it indicates the inconsistency among replicates in this treatment group. In a preferred embodiment, the threshold ofthe p-value is chosen to be less than 1% (0.01).
1 - In one embodiment, variations in replicates beyond a level due to measurement errors are used to indicate quality issues in the data generating process. In another embodiment, when replicates are obtained from measurements of samples from different sources, e.g., a type of cell from different animals, variations in replicates beyond a level due to measurement errors are used to indicate/identified biological variations from the jr. different sources, e.g., a type of cell from different animals.
5.5. IMPLEMENTATION SYSTEMS AND METHODS The analytical methods ofthe present invention can preferably be implemented using a computer system, such as the computer system described in this section, according jr to the following programs and methods. Such a computer system can also preferably store and manipulate measured data obtained in various experiments that can be used by a computer system implemented with the analytical methods of this invention. Accordingly, such computer systems are also considered part ofthe present invention.
An exemplary computer system suitable from implementing the analytic methods of o . this invention is illustrated in FIG. 2. Computer system 201 is illustrated here as comprising internal components and as being linked to external components. The internal components of this computer system include one or more processor elements 202 interconnected with a main memory 203. For example, computer system 201 can be an Intel Pentium®-based processor of 200 MHZ or greater clock rate and with 32 MB or more main memory. In a or preferred embodiment, computer system 201 is a cluster of a plurality of computers comprising a head "node" and eight sibling "nodes," with each node having a central processing unit ("CPU"). In addition, the cluster also comprises at least 128 MB of random access memory ("RAM") on the head node and at least 256 MB of RAM on each ofthe
<- eight sibling nodes. Therefore, the computer systems ofthe present invention are not limited to those consisting of a single memory unit or a single processor unit.
The external components can include a mass storage 204. This mass storage can be one or more hard disks that are typically packaged together with the processor and memory. Such hard disk are typically of 1 GB or greater storage capacity and more preferably have at
, r. least 6 GB of storage capacity. For example, in a preferred embodiment, described above, wherein a computer system ofthe invention comprises several nodes, each node can have its own hard drive. The head node preferably has a hard drive with at least 6 GB of storage capacity whereas each sibling node preferably has a hard drive with at least 9 GB of storage capacity. A computer system ofthe invention can further comprise other mass storage units
, r including, for example, one or more floppy drives, one more CD-ROM drives, one or more DVD drives or one or more DAT drives.
Other external components typically include a user interface device 205, which is most typically a monitor and a keyboard together with a graphical input device 206 such as a "mouse." The computer system is also typically linked to a network link 207 which can jr. be, e.g. , part of a local area network ("LAN") to other, local computer systems and/or part of a wide area network ("WAN"), such as the Internet, that is connected to other, remote computer systems. For example, in the preferred embodiment, discussed above, wherein the computer system comprises a plurality of nodes, each node is preferably connected to a network, preferably an NFS network, so that the nodes ofthe computer system jr communicate with each other and, optionally, with other computer systems by means ofthe network and can thereby share data and processing tasks with one another.
Loaded into memory during operation of such a computer system are several software components that are also shown schematically in FIG. 2. The software components comprise both software components that are standard in the art and components or. that are special to the present invention. These software components are typically stored on mass storage such as the hard drive 204, but can be stored on other computer readable media as well including, for example, one or more floppy disks, one or more CD-ROMs, one or more DVDs or one or more DATs. Software component 210 represents an operating system which is responsible for managing the computer system and its network
- - interconnections. The operating system can be, for example, ofthe Microsoft Windows™ family such as Windows 95, Window 98, Windows NT or Windows 2000. Alternatively, the operating software can be a Macintosh operating system, a UNIX operating system or the LINUX operating system. Software components 211 comprises common languages and
<- functions that are preferably present in the system to assist programs implementing methods specific to the present invention. Languages that can be used to program the analytic methods ofthe invention include, for example, C and C++, FORTRAN, PERL, HTML, JAVA, and any ofthe UNIX or LINUX shell command languages such as C shell script language. The methods ofthe invention can also be programmed or modeled in
. π mathematical software packages that allow symbolic entry of equations and high-level specification of processing, including specific algorithms to be used, thereby freeing a user ofthe need to procedurally program individual equations and algorithms. Such packages include, e.g., Matlab from Mathworks (Natick, MA), Mathematica from Wolfram Research (Champaign, IL) or S-Plus from MathSoft (Seattle, WA).
. - Software component 212 comprises any analytic methods ofthe present invention described supra, preferably programmed in a procedural language or symbolic package. For example, software component 212 preferably includes programs that cause the processor to implement steps of accepting a plurality of measured data and storing the measured data in the memory. For example, the computer system can accept measured data that are manually jr. entered by a user (e.g. , by means ofthe user interface). More preferably, however, the programs cause the computer system to retrieve measured data from a database. Such a database can be stored on a mass storage (e.g., a hard drive) or other computer readable medium and loaded into the memory ofthe computer, or the compendium can be accessed by the computer system by means ofthe network 207. jr In addition to the exemplary program structures and computer systems described herein, other, alternative program structures and computer systems will be readily apparent to the skilled artisan. Such alternative systems, which do not depart from the above described computer system and programs structures either in spirit or in scope, are therefore intended to be comprehended within the accompanying claims.
30
5.6. METHODS FOR DETERMINING BIOLOGICAL STATE AND BIOLOGICAL RESPONSE This invention provides methods for analysis of measurement errors in measured signal data, e.g., measured expression profiles, and methods for analyzing and processing of such measured signal data. The measured data can be measurements of cellular constituents
35 in a cell or organism or responses of a cell or organism to a perturbation. The data can be measured from cell samples subject to different conditions, e.g., under different perturbations. The cell sample can be of any organism, e.g., eukaryote, mammal, primate, human, non-human animal such as a dog, cat, horse, cow, mouse, rat, Drosophila, C. r elegans, etc., plant such as rice, wheat, bean, tobacco, etc., and fungi. The cell sample can be from a diseased or healthy organism, or an organism predisposed to disease. The cell sample can be of a particular tissue type or development stage or subjected to a particular perturbation (stimulus). This section and its subsections provides some exemplary methods for obtaining the measured data of cell samples. One of skill in the art would appreciate
, r. that this invention is not limited to the following specific methods for measuring the expression profiles and responses of a biological system.
5.6.1. TRANSCRIPT ASSAYS USING MICROARRAYS This invention is particularly useful for the determination ofthe expression state or c the transcriptional state of a cell or cell type or any other cell sample by monitoring expression profiles. One aspect ofthe invention provides polynucleotide probe arrays for simultaneous determination ofthe expression levels of a plurality of genes and methods for designing and making such polynucleotide probe arrays.
The expression level of a nucleotide sequence in a gene can be measured by any jr. high throughput techniques. However measured, the result is either the absolute or relative amounts of transcripts or response data, including but not limited to values representing abundance ratios.
Preferably, measurement ofthe expression profile is made by hybridization to transcript arrays, which are described in this subsection jr In a preferred embodiment, the present invention makes use of "transcript arrays" or
"profiling arrays". Transcript arrays can be employed for analyzing the expression profile in a cell sample and especially for measuring the expression profile of a cell sample of a particular tissue type or developmental state or exposed to a drug of interest or to perturbations to a biological pathway of interest. In another embodiment, the cell sample or. can be from a patient, e.g., a diseased cell sample, and preferably can be compared to a healthy cell sample.
In one embodiment, an expression profile is obtained by hybridizing detectably labeled polynucleotides representing the nucleotide sequences in mRNA transcripts present in a cell (e.g., fluorescently labeled cDNA synthesized from total cell mRNA) to a oc microarray. A microarray is an array of positionally-addressable binding (e.g., hybridization) sites on a support for representing many ofthe nucleotide sequences in the genome of a cell or organism, preferably most or almost all ofthe genes. Each of such binding sites consists of polynucleotide probes bound to the predetermined region on the
<- support. Microarrays can be made in a number of ways, of which several are described herein below. However produced, microarrays share certain characteristics. The arrays are reproducible, allowing multiple copies of a given array to be produced and easily compared with each other. Preferably, the microarrays are made from materials that are stable under binding (e.g., nucleic acid hybridization) conditions. The microarrays are preferably small,
1 r. e.g., between about 1 cm2 and 25 cm2, preferably about 1 to 3 cm2. However, both larger and smaller arrays are also contemplated and may be preferable, e.g., for simultaneously evaluating a very large number of different probes.
Preferably, a given binding site or unique set of binding sites in the microarray will specifically bind (e.g., hybridize) to a nucleotide sequence in a single gene from a cell or
, c organism (e.g. , to exon of a specific mRNA or a specific cDNA derived therefrom).
The microarrays used in the methods and compositions ofthe present invention include one or more test probes, each of which has a polynucleotide sequence that is complementary to a subsequence of RNA or DNA to be detected. Each probe preferably has a different nucleic acid sequence, and the position of each probe on the solid surface of jr. the array is preferably known. Indeed, the microarrays are preferably addressable arrays, more preferably positionally addressable arrays. More specifically, each probe ofthe array is preferably located at a known, predetermined position on the solid support such that the identity (i.e., the sequence) of each probe can be determined from its position on the array (i.e., on the support or surface). In some embodiments ofthe invention, the arrays are je ordered arrays.
Preferably, the density of probes on a microarray or a set of microarrays is about 100 different (i.e., non-identical) probes per 1 cm2 or higher. More preferably, a microarray used in the methods ofthe invention will have at least 550 probes per 1 cm2, at least 1,000 probes per 1 cm2, at least 1,500 probes per 1 cm2 or at least 2,000 probes per 1 cm2. In a or. particularly preferred embodiment, the microarray is a high density array, preferably having a density of at least about 2,500 different probes per 1 cm2. The microarrays used in the invention therefore preferably contain at least 2,500, at least 5,000, at least 10,000, at least 15,000, at least 20,000, at least 25,000, at least 50,000 or at least 55,000 different (i.e., non-identical) probes.
35 In one embodiment, the microarray is an array (i.e., a matrix) in which each position represents a discrete binding site for a nucleotide sequence of a transcript encoded by a gene (e.g., for an exon of an mRNA or a cDNA derived therefrom). The collection of binding c sites on a microarray contains sets of binding sites for a plurality of genes. For example, in various embodiments, the microarrays ofthe invention can comprise binding sites for products encoded by fewer than 50% ofthe genes in the genome of an organism. Alternatively, the microarrays ofthe invention can have binding sites for the products encoded by at least 50%, at least 75%, at least 85%, at least 90%, at least 95%, at least 99%
1 r. or 100%) ofthe genes in the genome of an organism. In other embodiments, the microarrays ofthe invention can having binding sites for products encoded by fewer than 50%, by at least 50%, by at least 75%, by at least 85%, by at least 90%, by at least 95%, by at least 99% or by 100% ofthe genes expressed by a cell of an organism. The binding site can be a DNA or DNA analog to which a particular RNA can specifically hybridize. The DNA or DNA
, analog can be, e.g. , a synthetic oligomer or a gene fragment, e.g. corresponding to an exon. In some embodiments ofthe present invention, a gene or an exon in a gene is represented in the profiling arrays by a set of binding sites comprising probes with different polynucleotides that are complementary to different coding sequence segments ofthe gene or an exon ofthe gene. Such polynucleotides are preferably ofthe length of 15 to 200 jr. bases, more preferably ofthe length of 20 to 100 bases, most preferably 40-60 bases. It will be understood that each probe sequence may also comprise linker sequences in addition to the sequence that is complementary to its target sequence. As used herein, a linker sequence refers to a sequence between the sequence that is complementary to its target sequence and the surface of support. For example, in preferred embodiments the profiling arrays ofthe jC invention comprise one probe specific to each target gene or exon. However, if desired, the profiling arrays may contain at least 2, 5, 10, 100, 1000 probes specific to some target genes or exons. For example, the array may contain probes tiled across the sequence ofthe longest mRNA isoform of a gene at single base steps.
In specific embodiments ofthe invention, when an exon has alternative spliced or. variants, a set of polynucleotide probes of successive overlapping sequences, i.e., tiled sequences, across the genomic region containing the longest variant of an exon can be included in the exon profiling arrays. The set of polynucleotide probes can comprise successive overlapping sequences at steps of a predetermined base intervals, e.g. at steps of 1, 5, or 10 base intervals, span, or are tiled across, the mRNA containing the longest variant. o - Such set of probes therefore can be used to scan the genomic region containing all variants of an exon to determine the expressed variant or variants ofthe exon to determine the expressed variant or variants ofthe exon. Alternatively or additionally, a set of polynucleotide probes comprising exon specific probes and/or variant junction probes can j- be included in the exon profiling array. As used herein, a variant junction probe refers to a probe specific to the junction region ofthe particular exon variant and the neighboring exon. In a preferred embodiment, the probe set contains variant junction probes specifically hybridizable to each of all different splice junction sequences ofthe exon. In another preferred embodiment, the probe set contains exon specific probes specifically hybridizable
, r. to the common sequences in all different variants ofthe exon, and/or variant junction probes specifically hybridizable to the different splice junction sequences ofthe exon.
In some other embodiments ofthe invention, an exon is represented in the exon profiling arrays by a probe comprising a polynucleotide that is complementary to the full length exon. In such embodiments, an exon is represented by a single binding site on the
, c profiling arrays. In some preferred embodiments ofthe invention, an exon is represented by one or more binding sites on the profiling arrays, each ofthe binding sites comprising a probe wάth a polynucleotide sequence that is complementary to an RNA fragment that is a substantial portion ofthe target exon. The lengths of such probes are normally between about 15-600 bases, preferably between about 20-200 bases, more preferably between about jr. 30-100 bases, and most preferably between about 40-80 bases. The average length of an exon is about 200 bases (see, e.g., Lewin, Genes V, Oxford University Press, Oxford, 1994). A probe of length of about 40-80 allows more specific binding ofthe exon than a probe of shorter length, thereby increasing the specificity ofthe probe to the target exon. For certain genes, one or more targeted exons may have sequence lengths less than about 40-80 bases. jr In such cases, if probes with sequences longer than the target exons are to be used, it may be desirable to design probes comprising sequences that include the entire target exon flanked by sequences from the adjacent constitutively splice exon or exons such that the probe sequences are complementary to the corresponding sequence segments in the mRNAs. Using flanking sequence from adjacent constitutively spliced exon or exons rather than the or. genomic flanking sequences, i.e., intron sequences, permits comparable hybridization stringency with other probes ofthe same length. Preferably the flanking sequence used are from the adjacent constitutively spliced exon or exons that are not involved in any alternative pathways. More preferably the flanking sequences used do not comprise a significant portion ofthe sequence ofthe adjacent exon or exons so that cross-hybridization o r can be minimized. In some embodiments, when a target exon that is shorter than the desired probe length is involved in alternative splicing, probes comprising flanking sequences in different alternatively spliced mRNAs are designed so that expression level of the exon expressed in different alternatively spliced mRNAs can be measured. c In some other embodiments ofthe invention, when alternative splicing pathways and/or exon duplication in separate genes are to be distinguished, the DNA array or set of arrays can also comprise probes that are complementary to sequences spanning the junction regions of two adjacent exons. Preferably, such probes comprise sequences from the two exons which are not substantially overlapped with probes for each individual exons so that
. r. cross hybridization can be minimized. Probes that comprise sequences from more than one exons are useful in distinguishing alternative splicing pathways and/or expression of duplicated exons in separate genes if the exons occurs in one or more alternative spliced mRNAs and/or one or more separated genes that contain the duplicated exons but not in other alternatively spliced mRNAs and/or other genes that contain the duplicated exons.
, c Alternatively, for duplicate exons in separate genes, if the exons from different genes show substantial difference in sequence homology, it is preferable to include probes that are different so that the exons from different genes can be distinguished.
It will be apparent to one skilled in the art that any ofthe probe schemes, supra, can be combined on the same profiling array and/or on different arrays within the same set of jr. profiling arrays so that a more accurate determination ofthe expression profile for a plurality of genes can be accomplished. It will also be apparent to one skilled in the art that the different probe schemes can also be used for different levels of accuracies in profiling. For example, a profiling array or array set comprising a small set of probes for each exon may be used to determine the relevant genes and/or RNA splicing pathways under certain j specific conditions. An array or array set comprising larger sets of probes for the exons that are of interest is then used to more accurately determine the exon expression profile under such specific conditions. Other DNA array strategies that allow more advantageous use of different probe schemes are also encompassed.
Preferably, the microarrays used in the invention have binding sites (i.e., probes) for or. sets of exons for one or more genes relevant to the action of a drug of interest or in a biological pathway of interest. As discussed above, a "gene" is identified as a portion of DNA that is transcribed by RNA polymerase, which may include a 5' untranslated region ("UTR"), introns, exons and a 3' UTR. The number of genes in a genome can be estimated from the number of mRNAs expressed by the cell or organism, or by extrapolation of a well oc characterized portion ofthe genome. When the genome ofthe organism of interest has been sequenced, the number of ORFs can be determined and mRNA coding regions identified by analysis ofthe DNA sequence. For example, the genome of Saccharomyces cerevisiae has been completely sequenced and is reported to have approximately 6275 ORFs encoding r sequences longer the 99 amino acid residues in length. Analysis of these ORFs indicates that there are 5,885 ORFs that are likely to encode protein products (Goffeau et al, 1996, Science 274:546-567). In contrast, the human genome is estimated to contain approximately 30,000 to 130,000 genes (see Crollius et al., 2000, Nature Genetics 25:235- 238; Ewing et al., 2000, Nature Genetics 25:232-234). Genome sequences for other
1 r. organisms, including but not limited to Drosophila, C. elegans, plants, e.g., rice and Arabidopsis, and mammals, e.g., mouse and human, are also completed or nearly completed. Thus, in preferred embodiments ofthe invention, an array set comprising in total probes for all known or predicted exons in the genome of an organism is provided. As a non-limiting example, the present invention provides an array set comprising one or two
, r probes for each known or predicted exon in the human genome.
It will be appreciated that when cDNA complementary to the RNA of a cell is made and hybridized to a microarray under suitable hybridization conditions, the level of hybridization to the site in the array corresponding to an exon of any particular gene will reflect the prevalence in the cell of mRNA or mRNAs containing the exon transcribed from jr. that gene. For example, when detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is hybridized to a microarray, the site on the array corresponding to an exon of a gene (i.e., capable of specifically binding the product or products ofthe gene expressing) that is not transcribed or is removed during RNA splicing in the cell will have little or no signal (e.g., fluorescent signal), and an exon of a gene for jc which the encoded mRNA expressing the exon is prevalent will have a relatively strong signal. The relative abundance of different mRNAs produced from the same gene by alternative splicing is then determined by the signal strength pattern across the whole set of exons monitored for the gene.
In preferred embodiments, cDNAs from cell samples from two different conditions
~„ are hybridized to the binding sites ofthe microarray using a two-color protocol. In the case of drug responses one cell sample is exposed to a drug and another cell sample ofthe same type is not exposed to the drug. In the case of pathway responses one cell is exposed to a pathway perturbation and another cell ofthe same type is not exposed to the pathway perturbation. The cDNA derived from each ofthe two cell types are differently labeled or (e.g., with Cy3 and Cy5) so that they can be distinguished. In one embodiment, for example, cDNA from a cell treated with a drug (or exposed to a pathway perturbation) is synthesized using a fluorescein-labeled dNTP, and cDNA from a second cell, not drug-exposed, is synthesized using a rhodamine-labeled dNTP. When the two cDNAs are c mixed and hybridized to the microarray, the relative intensity of signal from each cDNA set is determined for each site on the array, and any relative difference in abundance of a particular exon detected.
In the example described above, the cDNA from the drug-treated (or pathway perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from
, r. the untreated cell will fluoresce red. As a result, when the drug treatment has no effect, either directly or indirectly, on the transcription and/or post-transcriptional splicing of a particular gene in a cell, the exon expression patterns will be indistinguishable in both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be equally prevalent. When hybridized to the microarray, the binding site(s) for that species of RNA
, c will emit wavelengths characteristic of both fluorophores. In contrast, when the drug-exposed cell is treated with a drug that, directly or indirectly, change the transcription and/or post-transcriptional splicing of a particular gene in the cell, the exon expression pattern as represented by ratio of green to red fluorescence for each exon binding site will change. When the drug increases the prevalence of an mRNA, the ratios for each exon jr. expressed in the mRNA will increase, whereas when the drug decreases the prevalence of an mRNA, the ratio for each exons expressed in the mRNA will decrease.
The use of a two-color fluorescence labeling and detection scheme to define alterations in gene expression has been described in comiection with detection of mRNAs, e.g., in Shena et al, 1995, Quantitative monitoring of gene expression patterns with a jr complementary DNA microarray, Science 270:467-470, which is incorporated by reference in its entirety for all purposes. The scheme is equally applicable to labeling and detection of exons. An advantage of using cDNA labeled with two different fluorophores is that a direct and internally controlled comparison ofthe mRNA or exon expression levels corresponding to each arrayed gene in two cell states can be made, and variations due to minor differences or. in experimental conditions (e.g., hybridization conditions) will not affect subsequent analyses. However, it will be recognized that it is also possible to use cDNA from a single cell, and compare, for example, the absolute amount of a particular exon in, e.g., a drug-treated or pathway-perturbed cell and an untreated cell. Furthermore, labeling with more than two colors is also contemplated in the present invention. In some embodiments oc ofthe invention, at least 5, 10, 20, or 100 dyes of different colors can be used for labeling. Such labeling permits simultaneous hybridizing ofthe distinguishably labeled cDNA populations to the same array, and thus measuring, and optionally comparing the expression levels of, mRNA molecules derived from more than two samples. Dyes that can be used c include, but are not limited to, fluorescein and its derivatives, rhodamine and its derivatives, texas red, 5'carboxy-fluorescein ("FMA"), 2',7'-dimethoxy-4',5'-dichloro-6-carboxy- fluorescein ("JOE"), N,N,N',N'-tetramethyl-6-carboxy-rhodamine ("TAMRA"), 6'carboxy- X-rhodamine ("ROX"), HEX, TET, IRD40, and IRD41, cyamine dyes, including but are not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes including but are not limited to BODIPY-
10 FL, BODIPY-TR, BODIPY-TMR, BODIPY-630/650, and BODIPY-650/670; and ALEXA dyes, including but are not limited to ALEXA-488, ALEXA-532, ALEXA-546, ALEXA- 568, and ALEXA-594; as well as other fluorescent dyes which will be known to those who are skilled in the art.
In some embodiments ofthe invention, hybridization data are measured at a plurality
, - of different hybridization times so that the evolution of hybridization levels to equilibrium can be determined. In such embodiments, hybridization levels are most preferably measured at hybridization times spanning the range from 0 to in excess of what is required for sampling ofthe bound polynucleotides (i.e., the probe or probes) by the labeled polynucleotides so that the mixture is close to or substantially reached equilibrium, and jr. duplexes are at concentrations dependent on affinity and abundance rather than diffusion. However, the hybridization times are preferably short enough that irreversible binding interactions between the labeled polynucleotide and the probes and/or the surface do not occur, or are at least limited. For example, in embodiments wherein polynucleotide arrays are used to probe a complex mixture of fragmented polynucleotides, typical hybridization jc times may be approximately 0-72 hours. Appropriate hybridization times for other embodiments will depend on the particular polynucleotide sequences and probes used, and may be determined by those skilled in the art (see, e.g., Sambrook et al, Eds., 1989, Molecular Cloning: A Laboratory Manual, 2nd ed., Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York). or. In one embodiment, hybridization levels at different hybridization times are measured separately on different, identical microarrays. For each such measurement, at hybridization time when hybridization level is measured, the microarray is washed briefly, preferably in room temperature in an aqueous solution of high to moderate salt concentration (e.g., 0.5 to 3 M salt concentration) under conditions which retain all bound oc or hybridized polynucleotides while removing all unbound polynucleotides. The detectable label on the remaining, hybridized polynucleotide molecules on each probe is then measured by a method which is appropriate to the particular labeling method used. The resulted hybridization levels are then combined to form a hybridization curve. In another - embodiment, hybridization levels are measured in real time using a single microarray. In this embodiment, the microarray is allowed to hybridize to the sample without interruption and the microarray is interrogated at each hybridization time in a non-invasive manner. In still another embodiment, one can use one array, hybridize for a short time, wash and measure the hybridization level, put back to the same sample, hybridize for another period
1 r. of time, wash and measure again to get the hybridization time curve.
Preferably, at least two hybridization levels at two different hybridization times are measured, a first one at a hybridization time that is close to the time scale of cross- hybridization equilibrium and a second one measured at a hybridization time that is longer than the first one. The time scale of cross-hybridization equilibrium depends, inter alia, on
, c sample composition and probe sequence and may be determined by one skilled in the art. In preferred embodiments, the first hybridization level is measured at between 1 to 10 hours, whereas the second hybridization time is measured at about 2, 4, 6, 10, 12, 16, 18, 48 or 72 times as long as the first hybridization time.
20 5.6.2. PREPARING PROBES FOR MICROARRAYS As noted above, the "probe" to which a particular polynucleotide molecule, such an exon, specifically hybridizes according to the invention is a complementary polynucleotide sequence. Preferably one or more probes are selected for each target exon. For example, when a minimum number of probes are to be used for the detection of an exon, the probes jc normally comprise nucleotide sequences greater than about 40 bases in length.
Alternatively, when a large set of redundant probes is to be used for an exon, the probes normally comprise nucleotide sequences of about 40-60 bases. The probes can also comprise sequences complementary to full length exons. The lengths of exons can range from less than 50 bases to more than 200 bases. Therefore, when a probe length longer or, than exon is to be used, it is preferable to augment the exon sequence with adjacent constitutively spliced exon sequences such that the probe sequence is complementary to the continuous mRNA fragment that contains the target exon. This will allow comparable hybridization stringency among the probes of an exon profiling array. It will be understood that each probe sequence may also comprise linker sequences in addition to the sequence o c that is complementary to its target sequence. The probes may comprise DNA or DNA "mimics" (e.g., derivatives and analogues) corresponding to a portion of each exon of each gene in an organism's genome. In one embodiment, the probes ofthe microarray are complementary RNA or RNA mimics. DNA c mimics are polymers composed of subunits capable of specific, Watson-Crick-like hybridization with DNA, or of specific hybridization with RNA. The nucleic acids can be modified at the base moiety, at the sugar moiety, or at the phosphate backbone. Exemplary DNA mimics include, e.g., phosphorothioates. DNA can be obtained, e.g., by polymerase chain reaction (PCR) amplification of exon segments from genomic DNA, cDNA (e.g., by
, r. RT-PCR), or cloned sequences. PCR primers are preferably chosen based on known sequence ofthe exons or cDNA that result in amplification of unique fragments (7.e., fragments that do not share more than 10 bases of contiguous identical sequence with any other fragment on the microarray). Computer programs that are well known in the art are useful in the design of primers with the required specificity and optimal amplification
, c properties, such as Oligo version 5.0 (National Biosciences). Typically each probe on the microarray will be between 20 bases and 600 bases, and usually between 30 and 200 bases in length. PCR methods are well known in the art, and are described, for example, in Innis et al, eds., 1990, PCR Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, CA. It will be apparent to one skilled in the art that controlled robotic jr. systems are useful for isolating and amplifying nucleic acids.
An alternative, preferred means for generating the polynucleotide probes ofthe microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using N- phosphonate or phosphoramidite chemistries (Froehler et al, 1986, Nucleic Acid Res. 74:5399-5407; McBride et al, 1983, Tetrahedron Lett. 24:246-248). Synthetic sequences jc are typically between about 15 and about 600 bases in length, more typically between about 20 and about 100 bases, most preferably between about 40 and about 70 bases in length. In some embodiments, synthetic nucleic acids include non-natural bases, such as, but by no means limited to, inosine. As noted above, nucleic acid analogues may be used as binding sites for hybridization. An example of a suitable nucleic acid analogue is peptide nucleic or acid (see, e.g., Egholm et al, 1993, Nature 363:566-568; U.S. Patent No. 5,539,083).
In alternative embodiments, the hybridization sites (i.e., the probes) are made from plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts therefrom (Nguyen et al, 1995, Genomics 2°:207-209).
35 5.6.3. ATTACHING PROBES TO THE SOLID SURFACE Preformed polynucleotide probes can be deposited on a support to form the array. Alternatively, polynucleotide probes can be synthesized directly on the support to form the array. The probes are attached to a solid support or surface, which may be made, e.g., from c glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or other porous or nonporous material.
A preferred method for attaching the nucleic acids to a surface is by printing on glass plates, as is described generally by Schena et al, 1995, Science 270:467-470. This method is especially useful for preparing microarrays of cDNA (See also, DeRisi et al, 1996, Nature
10 Genetics 7 :457-460; Shalon et al, 1996, Genome Res. 6:639-645; and Schena et al, 1995, Proc. Natl. Acad. Sci. U.S.A. 93:10539-11286).
A second preferred method for making microarrays is by making high-density polynucleotide arrays. Techniques are known for producing arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface
, c using photolithographic techniques for synthesis in situ (see, Fodor et al, 1991, Science 251:767-773; Pease et al, 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al, 1996, Nature Biotechnology 14:1675; U.S. Patent Nos. 5,578,832; 5,556,752; and 5,510,270) or other methods for rapid synthesis and deposition of defined oligonucleotides (Blanchard et al, Biosensors & Bioelectronics 77:687-690). When these methods are used, jr. oligonucleotides (e.g., 60-mers) of known sequence are synthesized directly on a surface such as a derivatized glass slide. The array produced can be redundant, with several polynucleotide molecules per exon.
Other methods for making microarrays, e.g., by masking (Maskos and Southern, 1992, Nucl Acids. Res. 20:1679-1684), may also be used. In principle, and as noted supra, jc any type of array, for example, dot blots on a nylon hybridization membrane (see Sambrook et al. , supra) could be used. However, as will be recognized by those skilled in the art, very small arrays will frequently be preferred because hybridization volumes will be smaller.
In a particularly preferred embodiment, microarrays ofthe invention are manufactured by means of an In jet printing device for oligonucleotide synthesis, e.g.,
O using the methods and systems described by Blanchard in International Patent Publication No. WO 98/41531, published September 24, 1998; Blanchard et al, 1996, Biosensors and Bioelectronics 77:687-690; Blanchard, 1998, m Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J.K. Setiow, Ed., Plenum Press, New York at pages 111-123; and U.S. Patent No. 6,028,189 to Blanchard. Specifically, the polynucleotide probes in such oc microarrays are preferably synthesized in arrays, e.g. , on a glass slide, by serially depositing individual nucleotide bases in "microdroplets" of a high surface tension solvent such as propylene carbonate. The microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and are separated from each other on the microarray (e.g., by c hydrophobic domains) to form circular surface tension wells which define the locations of the array elements (i.e., the different probes). Polynucleotide probes are normally attached to the surface covalently at the 3' end ofthe polynucleotide. Alternatively, polynucleotide probes can be attached to the surface covalently at the 5' end ofthe polynucleotide (see for example, Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J.K.
, r. Setlow, Ed., Plenum Press, New York at pages 111-123).
5.6.4. TARGET POLYNUCLEOTIDE MOLECULES Target polynucleotides which may be analyzed by the methods and compositions of the invention include RNA molecules such as, but by no means limited to messenger RNA
1 c (mRNA) molecules, ribosomal RNA (rRNA) molecules, cRNA molecules (i. e. , RNA molecules prepared from cDNA molecules that are transcribed in vivo) and fragments thereof. Target polynucleotides which may also be analyzed by the methods and compositions ofthe present invention include, but are not limited to DNA molecules such as genomic DNA molecules, cDNA molecules, and fragments thereof including jr. oligonucleotides, ESTs, STSs, etc.
The target polynucleotides may be from any source. For example, the target polynucleotide molecules may be naturally occurring nucleic acid molecules such as genomic or extragenomic DNA molecules isolated from an organism, or RNA molecules, such as mRNA molecules, isolated from an organism. Alternatively, the polynucleotide jc molecules may be synthesized, including, e.g., nucleic acid molecules synthesized enzymatically in vivo or in vitro, such as cDNA molecules, or polynucleotide molecules synthesized by PCR, RNA molecules synthesized by in vitro transcription, etc. The sample of target polynucleotides can comprise, e.g., molecules of DNA, RNA, or copolymers of DNA and RNA. In preferred embodiments, the target polynucleotides ofthe invention will or. correspond to particular genes or to particular gene transcripts (e.g. , to particular mRNA sequences expressed in cells or to particular cDNA sequences derived from such mRNA sequences). However, in many embodiments, particularly those embodiments wherein the polynucleotide molecules are derived from mammalian cells, the target polynucleotides may correspond to particular fragments of a gene transcript. For example, the target
35 polynucleotides may correspond to different exons ofthe same gene, e.g., so that different splice variants of that gene may be detected and/or analyzed.
In preferred embodiments, the target polynucleotides to be analyzed are prepared in c vitro from nucleic acids extracted from cells. For example, in one embodiment, RNA is extracted from cells (e.g., total cellular RNA, poly(A)+ messenger RNA, fraction thereof) and messenger RNA is purified from the total extracted RNA. Methods for preparing total and poly(A)+ RNA are well known in the art, and are described generally, e.g., in Sambrook et al, supra. In one embodiment, RNA is extracted from cells ofthe various types of
, r. interest in this invention using guanidinium thiocyanate lysis followed by CsCI centrifugation and an oligo dT purification (Chirgwin et al, 1979, Biochemistry 18:5294- 5299). In another embodiment, RNA is extracted from cells using guanidinium thiocyanate lysis followed by purification on RNeasy columns (Qiagen). cDNA is then synthesized from the purified mRNA using, e.g., oligo-dT or random primers. In preferred
, embodiments, the target polynucleotides are cRNA prepared from purified messenger RNA extracted from cells. As used herein, cRNA is defined here as RNA complementary to the source RNA. The extracted RNAs are amplified using a process in which doubled-stranded cDNAs are synthesized from the RNAs using a primer linked to an RNA polymerase promoter in a direction capable of directing transcription of anti-sense RNA. Anti-sense jr. RNAs or cRNAs are then transcribed from the second strand ofthe double-stranded cDNAs using an RNA polymerase (see, e.g., U.S. Patent Nos. 5,891,636, 5,716,785; 5,545,522 and 6,132,997; see also, U.S. Patent Application Serial No. 09/411,074, filed October 4, 1999 by Linsley and Schelter and U.S. Provisional Patent Application Serial No. 60/253,641, filed on November 28, 2000, by Ziman et al.). Both oligo-dT primers (U.S. Patent Nos. jc 5,545,522 and 6,132,997) or random primers (U.S. Provisional Patent Application Serial No. 60/253,641, filed on November 28, 2000, by Ziman et al.) that contain an RNA polymerase promoter or complement thereof can be used. Preferably, the target polynucleotides are short and/or fragmented polynucleotide molecules which are representative ofthe original nucleic acid population ofthe cell. o . The target polynucleotides to be analyzed by the methods and compositions ofthe invention are preferably detectably labeled. For example, cDNA can be labeled directly, e.g., with nucleotide analogs, or indirectly, e.g., by making a second, labeled cDNA strand using the first strand as a template. Alternatively, the double-stranded cDNA can be transcribed into cRNA and labeled.
35 Preferably, the detectable label is a fluorescent label, e.g., by incorporation of nucleotide analogs. Other labels suitable for use in the present invention include, but are not limited to, biotin, imminobiotin, antigens, cofactors, dinitrophenol, lipoic acid, olefmic
^ compounds, detectable polypeptides, electron rich molecules, enzymes capable of generating a detectable signal by action upon a substrate, and radioactive isotopes. Preferred radioactive isotopes include 32P, 35S, 14C, 15N and 125I. Fluorescent molecules suitable for the present invention include, but are not limited to, fluorescein and its derivatives, rhodamine and its derivatives, texas red, 5'carboxy-fluorescein ("FMA"), 2',7'-
, r. dimethoxy-4',5,-dichloro-6-carboxy-fluorescein ("JOE"), N,N,N',N'-tetramethyl-6-carboxy- rhodamine ("TAMRA"), 6'carboxy-X-rhodamine ("ROX"), HEX, TET, IRD40, and IRD41. Fluroescent molecules that are suitable for the invention further include: cyamine dyes, including by not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes including but not limited to BODIPY-FL, BODIPY-TR, BODIPY-TMR, BODIPY-630/650, and BODIPY-650/670;
.5 and ALEXA dyes, including but not limited to ALEXA-488, ALEXA-532, ALEXA-546, ALEXA-568, and ALEXA-594; as well as other fluorescent dyes which will be known to those who are skilled in the art. Electron rich indicator molecules suitable for the present invention include, but are not limited to, ferritin, hemocyanin, and colloidal gold. Alternatively, in less preferred embodiments the target polynucleotides may be labeled by jr. specifically complexing a first group to the polynucleotide. A second group, covalently linked to an indicator molecules and which has an affinity for the first group, can be used to indirectly detect the target polynucleotide. In such an embodiment, compounds suitable for use as a first group include, but are not limited to, biotin and iminobiotin. Compounds suitable for use as a second group include, but are not limited to, avidin and streptavidin.
25
5.6.5. HYBRIDIZATION TO MICROARRAYS
As described supra, nucleic acid hybridization and wash conditions are chosen so that the polynucleotide molecules to be analyzed by the invention (referred to herein as the
"target polynucleotide molecules) specifically bind or specifically hybridize to the
,„ complementary polynucleotide sequences ofthe array, preferably to a specific array site, wherein its complementary DNA is located.
Arrays containing double-stranded probe DNA situated thereon are preferably subjected to denaturing conditions to render the DNA single-stranded prior to contacting with the target polynucleotide molecules. Arrays containing single-stranded probe DNA oc (e.g. , synthetic oligodeoxyribonucleic acids) may need to be denatured prior to contacting with the target polynucleotide molecules, e.g., to remove hairpins or dimers which form due to self complementary sequences.
Optimal hybridization conditions will depend on the length (e.g., oligomer versus c polynucleotide greater than 200 bases) and type (e.g. , RNA, or DNA) of probe and target nucleic acids. General parameters for specific (i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al, (supra), and in Ausubel et al, 1987, Current Protocols in Molecular Biology, Greene Publishing and Wiley-Interscience, New York. When the cDNA microarrays of Schena et al. are used, typical hybridization conditions are
1 r. hybridization in 5 X SSC plus 0.2% SDS at 65 °C for four hours, followed by washes at 25 °C in low stringency wash buffer (1 X SSC plus 0.2% SDS), followed by 10 minutes at 25 °C in higher stringency wash buffer (0.1 X SSC plus 0.2% SDS) (Shena et al, 1996, Proc. Natl. Acad. Sci. U.S.A. 95:10614). Useful hybridization conditions are also provided in, e.g., Tijessen, 1993, Hybridization With Nucleic Acid Probes, Elsevier Science Publishers
, c B.V. and Kricka, 1992, Nonisotopic DNA Probe Techniques, Academic Press, San Diego, CA.
Particularly preferred hybridization conditions for use with the screening and/or signaling chips ofthe present invention include hybridization at a temperature at or near the mean melting temperature ofthe probes (e.g., within 5 °C, more preferably within 2 °C) in jr I M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium Sarcosine and 30%> formamide.
5.6.6. SIGNAL DETECTION AND DATA ANALYSIS It will be appreciated that when target sequences, e.g., cDNA or cRNA, complementary to the RNA of a cell is made and hybridized to a microarray under suitable jc hybridization conditions, the level of hybridization to the site in the array corresponding to an exon of any particular gene will reflect the prevalence in the cell of mRNA or mRNAs containing the exon transcribed from that gene. For example, when detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is hybridized to a microarray, the site on the array corresponding to an exon of a gene (i.e., capable of p. specifically binding the product or products ofthe gene expressing) that is not transcribed or is removed during RNA splicing in the cell will have little or no signal (e.g., fluorescent signal), and an exon of a gene for which the encoded mRNA expressing the exon is prevalent will have a relatively strong signal. The relative abundance of different mRNAs produced by from the same gene by alternative splicing is then determined by the signal
„ strength pattern across the whole set of exons monitored for the gene. In preferred embodiments, target sequences, e.g., cDNAs or cRNAs, from two different cells are hybridized to the binding sites ofthe microarray. In the case of drug responses one cell sample is exposed to a drug and another cell sample ofthe same type is c not exposed to the drug. In the case of pathway responses one cell is exposed to a pathway perturbation and another cell ofthe same type is not exposed to the pathway perturbation. The cDNA or cRNA derived from each ofthe two cell types are differently labeled so that they can be distinguished. In one embodiment, for example, cDNA from a cell treated with a drug (or exposed to a pathway perturbation) is synthesized using a fluorescein-labeled
1 dNTP, and cDNA from a second cell, not drug-exposed, is synthesized using a rhodamine-labeled dNTP. When the two cDNAs are mixed and hybridized to the microarray, the relative intensity of signal from each cDNA set is determined for each site on the array, and any relative difference in abundance of a particular exon detected. In the example described above, the cDNA from the drug-treated (or pathway
, , perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from the untreated cell will fluoresce red. As a result, when the drug treatment has no effect, either directly or indirectly, on the transcription and/or post-transcriptional splicing of a particular gene in a cell, the exon expression patterns will be indistinguishable in both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be equally jr. prevalent. When hybridized to the microarray, the binding site(s) for that species of RNA will emit wavelengths characteristic of both fluorophores. In contrast, when the drug-exposed cell is treated with a drug that, directly or indirectly, changes the transcription and/or post-transcriptional splicing of a particular gene in the cell, the exon expression pattern as represented by ratio of green to red fluorescence for each exon binding site will jc change. When the drug increases the prevalence of an mRNA, the ratios for each exon expressed in the mRNA will increase, whereas when the drug decreases the prevalence of an mRNA, the ratio for each exons expressed in the mRNA will decrease.
The use of a two-color fluorescence labeling and detection scheme to define alterations in gene expression has been described in connection with detection of mRNAs, n e.g., in Shena et al, 1995, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270:467-470, which is incorporated by reference in its entirety for all purposes. The scheme is equally applicable to labeling and detection of exons. An advantage of using target sequences, e.g., cDNAs or cRNAs, labeled with two different fluorophores is that a direct and internally controlled comparison ofthe mRNA or oc exon expression levels corresponding to each arrayed gene in two cell states can be made, and variations due to minor differences in experimental conditions (e.g., hybridization conditions) will not affect subsequent analyses. However, it will be recognized that it is also possible to use cDNA from a single cell, and compare, for example, the absolute c amount of a particular exon in, e.g. , a drug-treated or pathway-perturbed cell and an untreated cell.
In other preferred embodiments, single-channel detection methods, e.g., using one- color fluorescence labeling, are used (see U.S. provisional patent application Serial No. 60/227,966, filed on August 25, 2000). In this embodiment, arrays comprising reverse-
, p. complement (RC) probes are designed and produced. Because a reverse complement of a DNA sequence has sequence complexity that is equivalent to the corresponding forward- strand (FS) probe that is complementary to a target sequence with respect to a variety of measures (e.g., measures such as GC content and GC trend are invariant under the reverse complement), a RC probe is used to as a control probe for determination of level of non-
, c specific cross hybridization to the corresponding FS probe. The significance ofthe FS probe intensity of a target sequence is determined by comparing the raw intensity measurement for the FS probe and the corresponding raw intensity measurement for the RC probe in conjunction with the respective measurement errors. In a preferred embodiment, an exon is called present if the intensity difference between the FS probe and the
«« corresponding RC probe is significant. More preferably, an exon is called present if the FS probe intensity is also significantly above background level. Single-channel detection methods can be used in conjunction with multi-color labeling. In one embodiment, a plurality of different samples, each labeled with a different color, is hybridized to an array. Differences between FS and RC probes for each color are used to determine the level of jc hybridization ofthe corresponding sample.
When fluorescently labeled probes are used, the fluorescence emissions at each site of a transcript array can be, preferably, detected by scanning confocal laser microscopy. In one embodiment, a separate scan, using the appropriate excitation line, is carried out for each ofthe two fluorophores used. Alternatively, a laser can be used that allows op, simultaneous specimen illumination at wavelengths specific to the two fluorophores and emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al. , 1996, Genome Res. 6: 639-645). In a preferred embodiment, the arrays are scanned with a laser fluorescence scanner with a computer controlled X-Y stage and a microscope objective. Sequential excitation ofthe two fluorophores is achieved with a multi-line, o c mixed gas laser, and the emitted light is split by wavelength and detected with two photomultiplier tubes. Such fluorescence laser scanning devices are described, e.g., in Schena et al, 1996, Genome Res. 6:639-645. Alternatively, the fiber-optic bundle described by Ferguson et al, 1996, Nature Biotech. 74:1681-1684, may be used to monitor mRNA c abundance levels at a large number of sites simultaneously.
Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., using a 12 bit analog to digital board. In one embodiment, the scanned image is despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image gridding program that creates a spreadsheet ofthe average hybridization at each wavelength
, p. at each site. If necessary, an experimentally determined correction for "cross talk" (or overlap) between the channels for the two fluors may be made. For any particular hybridization site on the transcript array, a ratio ofthe emission ofthe two fluorophores can be calculated. The ratio is independent ofthe absolute expression level ofthe cognate gene, but is useful for genes whose expression is significantly modulated by drug administration,
. - gene deletion, or any other tested event.
According to the method ofthe invention, the relative abundance of an mRNA and/or an exon expressed in an mRNA in two cells or cell lines is scored as perturbed (i.e., the abundance is different in the two sources of mRNA tested) or as not perturbed (i.e., the relative abundance is the same). As used herein, a difference between the two sources of j RNA of at least a factor of about 25% (i. e. , RNA is 25% more abundant in one source than in the other source), more usually about 50%, even more often by a factor of about 2 (i.e., twice as abundant), 3 (three times as abundant), or 5 (five times as abundant) is scored as a perturbation. Present detection methods allow reliable detection of difference of an order of about 3 -fold to about 5 -fold, but more sensitive methods are expected to be developed. jc It is, however, also advantageous to determine the magnitude ofthe relative difference in abundances for an mRNA and/or an exon expressed in an mRNA in two cells or in two cell lines. This can be carried out, as noted above, by calculating the ratio ofthe emission ofthe two fluorophores used for differential labeling, or by analogous methods that will be readily apparent to those of skill in the art.
30
5.6.7. OTHERMETHODS OF TRANSCRIPTIONAL STATE MEASUREMENT
The transcriptional state of a cell may be measured by other gene expression technologies known in the art. Several such technologies produce pools of restriction fragments of limited complexity for electrophoretic analysis, such as methods combining oc double restriction enzyme digestion with phasing primers (see, e.g. , European Patent O 534858 Al, filed September 24, 1992, by Zabeau et al), or methods selecting restriction fragments with sites closest to a defined mRNA end (see, e.g., Prashar et al, 1996, Proc. Natl. Acad. Sci. USA 93:659-663). Other methods statistically sample cDNA pools, such as by sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to identify each cDNA, or by sequencing short tags (e.g., 9-10 bases) that are generated at known positions relative to a defined mRNA end (see, e.g., Velculescu, 1995, Science 270:484- 487).
5.7. MEASUREMENT OF OTHER ASPECTS OF THE BIOLOGICAL STATE
10 In various embodiments ofthe present invention, aspects ofthe biological state other than the transcriptional state, such as the translational state, the activity state, or mixed aspects can be measured to produce the measured data to be analyzed according to the invention. Thus, in such embodiments, gene expression data may include translational state , c measurements or even protein expression measurements. In fact, in some embodiments, rather than using gene expression interaction maps based on gene expression, protein expression interaction maps based on protein expression maps are used. Details of embodiments in which aspects ofthe biological state other than the transcriptional state are described in this section.
20
5.7.1. EMBODIMENTS BASED ON TRANSLATIONAL STATE MEASUREMENTS
Measurement ofthe translational state may be performed according to several methods. For example, whole genome monitoring of protein (z'.e., the "proteome," Goffeau et al, 1996, Science 274:546-567; Aebersold et al, 1999, Nature Biotechnology 10:994- jc 999) can be carried out by constructing a microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome (see, e.g., Zhu et al, 2001, Science 293:2101-2105; MacBeath et al, 2000, Science 289:1760-63; de Wildt et al., 2000, Nature Biotechnology 18:989-994). Preferably, antibodies are present for a substantial fraction ofthe encoded proteins, or at
~ n least for those proteins relevant to the action of a drug of interest. Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane, 1988, Antibodies: A Laboratory Manual, Cold Spring Harbor, New York, which is incorporated in its entirety for all purposes). In a preferred embodiment, monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequence ofthe cell. With such an
35 antibody array, proteins from the cell are contacted to the array and their binding is assayed with assays known in the art.
Alternatively, proteins can be separated and measured by two-dimensional gel c electrophoresis systems. Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al., 1996, Proc. Natl. Acad. Sci. USA 93:1440-1445; Sagliocco et al, 1996, Yeast 12:1519-
, p. 1533; Lander, 1996, Science 274:536-539; and Beaumont et al., Life Science News 7, 2001, Amersham Pharmacia Biotech. The resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, Western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N- terminal micro-sequencing. Using these techniques, it is possible to identify a substantial
, ^ fraction of all the proteins produced under given physiological conditions, including in cells (e.g., in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.
5.7.2. EMBODIMENTS BASED ON OTHER ASPECTS OF THE BIOLOGICAL STATE jp. Even though methods of this invention are illustrated by embodiments involving gene expression, the methods ofthe invention are applicable to any cellular constituent that can be monitored. In particular, where activities of proteins can be measured, embodiments of this invention can use such measurements. Activity measurements can be performed by any functional, biochemical, or physical means appropriate to the particular activity being jc characterized. Where the activity involves a chemical transformation, the cellular protein can be contacted with the natural substrate(s), and the rate of transformation measured. Where the activity involves association in multimeric units, for example association of an activated DNA binding complex with DNA, the amount of associated protein or secondary consequences ofthe association, such as amounts of mRNA transcribed, can be measured. op. Also, where only a functional activity is known, for example, as in cell cycle control, performance ofthe function can be observed. However known and measured, the changes in protein activities form the response data analyzed by the foregoing methods of this invention.
In alternative and non-limiting embodiments, response data may be formed of mixed oc aspects ofthe biological state of a cell. Response data can be constructed from, e.g., changes in certain mRNA abundances, changes in certain protein abundances, and changes in certain protein activities.
5 5.8. MEASUREMENT OF DRUG RESPONSE DATA
Drug responses are obtained for use in the instant invention by measuring the gene expression state changed by drug exposure. The biological response described on the exon level can also be measured by exon profiling methods. The measured response data include values representing gene expression level values or gene expression level ratios for a
1 p. plurality of genes.
To measure drug response data, cell can be exposed to graded levels ofthe drug or drug candidate of interest. When the cells are grown in vitro, the compound is usually added to their nutrient medium. The drag is added in a graded amount that depends on the particular characteristics ofthe drug, but usually will be between about 1 ng/ml and 100
, c mg/ml. In some cases a drug will be solubilized in a solvent such as DMSO.
The exon expression profiles of cells exposed to the drug and of cells not exposed to the drug are measured according to the methods described in the previous section. Preferably, gene transcript arrays are used to find the genes with altered gene expression profiles due to exposure to the drug. jp. It is preferable for measurements of drug responses, in the case of two-colored differential hybridization described above, to measure with reversed labeling. Also, it is preferable that the levels of drug exposure used provide sufficient resolution of rapidly changing regions ofthe drug response, e.g., by using approximately ten levels of drug exposure.
25
5.9. METHODS FOR PROBING BIOLOGICAL STATES
One aspect ofthe invention provides methods for the analysis of biological state.
The methods of this invention are also useful for the analysis of responses of a cell sample to perturbations designed to probe cellular state. Preferred perturbations are those that cause o p. a change in the amount of alternative splicing that occurs in one or more RNA transcripts.
This section provides some illustrative methods for probing gene expression states and protein abundances and acitivities. See PCT publication WO 00/24936 for more detailed descriptions of these method.
Methods for targeted perturbation of cells are widely known and applied in the art. oc For example, such methods include use of titratable expression systems, use of transfection or viral transduction systems, direct modifications to RNA abundances or activities, direct modifications of protein abundances, direct modification of protein activities including use of drugs (or chemical moieties in general), and post-transcriptional gene silencing (PTGS)
<- or RNA interference (RNAi).
In mammalian cells, several means of titrating expression of genes are available (Spencer, 1996, Trends Genet. 12:181-187). For example, the Tet system is widely used, both in its original form, the "forward" system, in which addition of doxycycline represses transcription, and in the newer "reverse" system, in which doxycycline addition stimulates
1 n transcription (Gossen et al, 1995, Proc. Natl. Acad. Sci. USA 89:5547-5551; Hoffmann et al, 1997, Nucl. Acids. Res. 25:1078-1079; Hofrnann et al., 1996, Proc. Natl. Acad. Sci. USA 83:5185-5190; Paulus et al., 1996, Journal of Virology 70:62-67). Another commonly used controllable promoter system in mammalian cells is the ecdysone-inducible system developed by Evans and colleagues (No et al., 1996, Proc. Nat. Acad. Sci. USA
1 <- 93:3346-3351), where expression is controlled by the level of muristerone added to the cultured cells. Finally, expression can be modulated using the "chemical-induced dimerization" (CID) system developed by Schreiber, Crabtree, and colleagues (Belshaw et al., 1996, Proc. Nat. Acad. Sci. USA 93:4604-4607; Spencer, 1996, Trends Genet. 12:181 -187) and similar systems in yeast. In this system, the gene of interest is put under jp. the control ofthe CID-responsive promoter, and transfected into cells expressing two different hybrid proteins, one comprised of a DNA-binding domain fused to FKBP12, which binds FK506. The other hybrid protein contains a transcriptional activation domain also fused to FKBP12. The CID inducing molecule is FK1012, a homodimeric version of FK506 that is able to bind simultaneously both the DNA binding and transcriptional j , activating hybrid proteins . In the graded presence of FK 1012, graded transcription of the controlled gene is activated.
Transfection or viral transduction of target genes can introduce controllable perturbations in biological gene expression states in mammalian cells. Preferably, transfection or transduction of a target gene can be used with cells that do not naturally o p, express the target gene of interest. Such non-expressing cells can be derived from a tissue not normally expressing the target gene or the target gene can be specifically mutated in the cell. The target gene of interest can be cloned into one of many mammalian expression plasmids, for example, the pcDNA3.1 +/- system (Invitrogen, Inc.) or retroviral vectors, and introduced into the non-expressing host cells. Transfected or transduced cells expressing o c the target gene may be isolated by selection for a drug resistance marker encoded by the expression vector. The level of gene transcription is monotonically related to the transfection dosage. In this way, the effects of varying levels ofthe target gene may be investigated. Other methods of modifying RNA abundances and activities and thus gene c abundances include ribozymes, antisense species, and RNA aptamers (Good et al., 1997, Gene Therapy 4: 45-54). Controllable application or exposure of a cell to these entities permits controllable perturbation of RNA abundances.
Ribozymes are RNAs which are capable of catalyzing RNA cleavage reactions. (Cech, 1987, Science 236:1532-1539; PCT International Publication WO 90/11364,
10 published October 4, 1990; Sarver et al., 1990, Science 247: 1222-1225). "Hairpin" and "hammerhead" RNA ribozymes can be designed to specifically cleave a particular target mRNA. Rules have been established for the design of short RNA molecules with ribozyme activity, which are capable of cleaving other RNA molecules in a highly sequence specific way and can be targeted to virtually all kinds of RNA. (Haseloff et al., 1988, Nature
15 334:585-591; Koizumi et al, 1988, FEBS Lett., 228:228-230; Koizumi et al, 1988, FEBS Lett., 239:285-288). Ribozyme methods involve exposing a cell to, inducing expression in a cell, etc. of such small RNA ribozyme molecules. (Grassi and Marini, 1996, Annals of Medicine 28: 499-510; Gibson, 1996, Cancer and Metastasis Reviews 15: 287-299). In another embodiment, activity of a target RNA (preferable mRNA) species,
90 specifically its rate of translation, can be controllably inhibited by the controllable application of antisense nucleic acids. An "antisense" nucleic acid as used herein refers to a nucleic acid capable of hybridizing to a sequence-specific (e.g., non-poly A) portion ofthe target RNA, for example its translation initiation region, by virtue of some sequence complementarity to a coding and/or non-coding region. The antisense nucleic acids ofthe jc invention can be oligonucleotides that are double-stranded or single-stranded, RNA or DNA or a modification or derivative thereof, which can be directly administered in a controllable manner to a cell or which can be produced intracellularly by transcription of exogenous, introduced sequences in controllable quantities sufficient to perturb translation ofthe target RNA. op. In still another embodiment, RNA aptamers can be introduced into or expressed in a cell. RNA aptamers are specific RNA ligands for proteins, such as for Tat and Rev RNA (Good et al, 1997, Gene Therapy 4: 45-54) that can specifically inhibit their translation.
Post-transcriptional gene silencing (PTGS) or RNA interference (RNAi) can also be used to modify RNA abundances (Guo et al., 1995, Cell 81:611-620; Fire et al, 1998, oc Nature 391 :806-811). In RNAi, dsRNAs are injected into cells to specifically block expression of its homologous gene. In particular, in RNAi, both the sense strand and the anti-sense strand can inactivate the corresponding gene. It is suggested that the dsRNAs are cut by nuclease into 21-23 nucleotide fragments. These fragments hybridize to the c homologous region of their corresponding mRNAs to form double-stranded segments which are degraded by nuclease (Grant, 1999, Cell 96:303-306; Tabara et al., 1999, Cell 99:123- 132; Zamore et al, 2000, Cell 101 :25-33; Bass, 2000, Cell 101 :235-238; Petcherski et al, 2000, Nature 405:364-368; Elbashir et al., Nature 411 :494-498; Paddison et al, Proc. Natl. Acad. Sci. USA 99:1443-1448; Technical Bulletins at the web site
H p. http://www.dharmacon.con/tech/tech03.html, accessed October 16, 2001; ). Therefore, in one embodiment, one or more dsRNAs having sequences homologous to the sequences of one or more mRNAs whose abundances are to be modified are transfected into a cell or tissue sample. Any standard method for introducing nucleic acids into cells can be used.
Methods of modifying protein abundances include, inter alia, those altering protein
. c degradation rates and those using antibodies (which bind to proteins affecting abundances of activities of native target protein species). Increasing (or decreasing) the degradation rates of a protein species decreases (or increases) the abundance of that species. Methods for controllably increasing the degradation rate of a target protein in response to elevated temperature and/or exposure to a particular drug, which are known in the art, can be jp. employed in this invention. For example, one such method employs a heat-inducible or drug-inducible N-terminal degron, which is an N-terminal protein fragment that exposes a degradation signal promoting rapid protein degradation at a higher temperature (e.g., 37° C) and which is hidden to prevent rapid degradation at a lower temperature (e.g., 23 ° C) (Dohmen et. al, 1994, Science 263:1273-1276). Such an exemplary degron is Arg-DHFRts, jc a variant of murine dihydrofolate reductase in which the N-terminal Val is replaced by Arg and the Pro at position 66 is replaced with Leu. According to this method, for example, a gene for a target protein, P, is replaced by standard gene targeting methods known in the art (Lodish et al, 1995, Molecular Biology ofthe Cell, W.H. Freeman and Co., New York, especially chap 8) with a gene coding for the fusion protein Ub-Arg-DHFRts-P ("Ub" stands o p. for ubiquitin). The N-terminal ubiquitin is rapidly cleaved after translation exposing the N- terminal degron. At lower temperatures, lysines internal to Arg-DHFRts are not exposed, ubiquitination ofthe fusion protein does not occur, degradation is slow, and active target protein levels are high. At higher temperatures (in the absence of methotrexate), lysines internal to Arg-DHFRts are exposed, ubiquitination ofthe fusion protein occurs, degradation oc is rapid, and active target protein levels are low. Heat activation of degradation is controllably blocked by exposure methotrexate. This method is adaptable to other N- terminal degrees which are responsive to other inducing factors, such as drugs and temperature changes. c Methods of directly modifying protein activities include, inter alia, dominant negative mutations, specific drugs or chemical moieties generally, and also the use of antibodies.
6. EXAMPLE: TEST RESULTS
, π The following example is presented by way of illustration ofthe present invention, and is not intended to limit the present invention in any way.
To illustrate the difference between the improved ANOVA method ofthe invention as described in Table 2 and the traditional ANOVA as described in Table 1 , two experiments were carried out to compare the false positive rate and the sensitivity in gene
. , differential expression detection between the improved and traditional ANOVA. Affymetrix HG-U95A microarrays are used for all measurements.
To compare false positive rate, five (N=5) replicate microarray measurements were chosen and grouped into three groups (k=3). All five microarrays were hybridized with cell- 293, human embiyonic kidney cell (Sample A). Two ofthe groups have two replicates each n and one group has only one replicate. Because all five replicates are measurements of sample A, none ofthe genes or probes measured should be detected as differentially expressed. Thus, in this experiment, any gene or probes determined as being differentially expressed are false positives. The false positive rate is the number of false positives divided by the total number of gene or probes in the microarray. The lower the false positive rate, jc the better the method.
To compare the detection sensitivity, another five microarrays were selected and grouped into three groups (k=3). Two ofthe groups, having two and one replicate, repectively, contains measurements obtained using Sample A. The third group has two replicates that are obtained using cell-A549, human lung cancer cell (Sample B). Sample A o p. and Sample B are very different, and therefore, differential expressions between the two different samples are expected. In this case, the detection sensitivity is evaluated based on the rate of detection, which is the number of genes or probes detected as differentially expressed divided by the total number of genes or probes in the microarray. The higher the detection rate, the more sensitive the method. o c Table 5 summarizes the test results. For the given threshold p-value<0.01, the improved ANOVA method provides both much lower false positive rate and much higher detection sensitivity. This indicates that the improved ANOVA method ofthe invention with two inputs has higher statistical power than the traditional ANOVA.
Figure imgf000099_0001
7. REFERENCES CITED All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
Many modifications and variations ofthe present invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only, and the invention is to be limited only by the terms ofthe appended claims along with the full scope of equivalents to which such claims are entitled.

Claims

WHAT IS CLAIMED IS:
1. A method of analyzing a plurality of measurements of a variable {;;„} in k different measurement groups by ANOVA analysis, wherein ;,, is the 7th measurement in c the tth measurement group, t = 1, 2, ..., k and 7 = 1, 2, ..., n„ nt being the number of measurements in measurement group t, wherein each said measurement group consists of one or more measurements of said variable under a condition common to said measurement group, and wherein each said measurement;;,, has a predetermined measurement variance, said method comprising
, p. (a) determining a within-group variance for said k different measurement groups, wherein said within-group variance consists of a propagated variance and a scattered variance, said propagated variance being determined based on predetermined measurement variances of said plurality of measurements, and said scattered variance being determined based on deviation of each of said plurality of measurements with respect to a mean of
, c measurements in a respective measurement group;
(b) determining a between-group variance for said k different measurement groups, wherein said between-group variance is a variance of a plurality of group means, one for each of said k different measurement groups, with respect to a mean of said plurality of measurements, wherein each said group mean is the mean of said one or more ^ measurements in a measurement group; and
(c) comparing said within-group variance with said between-group variance, thereby analyzing said plurality of measurements.
2. The method of claim 1, wherein said within-group variance is determined based jc on group variances of said k measurement groups, each said group variance consisting of a measurement group propagated variance and a measurement group scattered variance, wherein said measurement group propagated variance is determined based on predetermined measurement variances of said one or more measurements in said measurement group, and wherein said measurement group scattered variance is a variance of said one or more o p. measurements with respect to the mean of said one or more measurements in said measurement group.
3. The method of claim 2, wherein said propagated variance for each said measurement group is determined according to the equation
35
Figure imgf000101_0001
cr y-,p„ =
9 7 where σ ^ is said propagated variance and σti is said predetermined measurement
variance of measurement,);,,. 0
4. The method of claim 3, wherein said mean of said one or more measurements is determined according to the equation
yti -5> - Λ ι=—l yt = - n.
where yt is said mean of said one or more measurements, and wherein said scattered variance is determined according to the equation
Figure imgf000101_0002
c where σ - is said scattered variance.
5. The method of any one of claims 1-4, further comprising prior to said step (a) a step of determining said predetermined measurement variance for each said measurement.
6. The method of claim 5, wherein said predetermined measurement variance of each said measurement v„ is determined according to an error model.
7. The method of claim 6, wherein said error model is a three-term error model according to equation 5
Figure imgf000102_0001
2 wherein σti is said predetermined measurement variance of said measurement;;,,, a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
8. The method of claim 4, wherein said group propagated variance and said group scattered variance are combined according to the equation
Figure imgf000102_0002
9. The method of claim 8, wherein said comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein said statistical metric is determined by a method comprising (i) determining a within-group degree of freedom;
(ii) determining a between-group degree of freedom; (iii) determining a within-group mean square; (iv) determining a between-group mean square; and (v) calculating said statistical metric.
10. The method of claim 9, wherein said within-group degree of freedom is calculated according to the equation
- 1 v R = N - k + — R ! n t
where vR is said wάthin-group degree of freedom, said between-group degree of freedom is calculated according to the equation ■k-1
where vτ is said between-group degree of freedom, said within-group mean square is calculated according to the equation
Figure imgf000103_0001
where sR is said within-group mean square and where SR is calculated according to the equation
Figure imgf000103_0002
where vRt is calculated according to the equation
V Rt = n 1 + n
and said between-group mean square is calculated according to the equation
: Uγ I Vγ
where _?~ is said between-group mean square and where S7. is calculated according to the equation
Figure imgf000103_0003
where y is calculated according to the equation
Figure imgf000104_0001
where N is the total number of measurements.
11. The method of claim 10, wherein said statistical test is an F-test and said significance level is a p-value determined according to the equation
pvalue = 1 - fcdf\sτ I sR ,vT,vR J
12. The method of claim 11, wherein said variable is the transcript level of a gene.
13. The method of claim 12, wherein said transcript level is measured using a DΝA microarray.
14. The method of claim 13, wherein said variable is the abundance of a protein.
15. The method of claim 14, wherein said abundance is measured using a protein microarray.
16. The method of claim 14, wherein said abundance is measured using two- dimensional gel electrophoresis.
17. An improved AΝOVA method for analyzing a plurality of measurements of a variable { ,J in k different measurement groups, wherein y„ is the tth measurement in the tth measurement group, t = 1, 2, ..., k and i = 1, 2, ..., nt, nt being the number of measurements in measurement group t, wherem each said measurement group consisting of one or more measurements of said variable under a condition common to said measurement group, wherein the improvement comprises determining a within-group variance consisting of a propagated variance and a scattered variance, said propagated variance being determined based on predetermined measurement variances of said one or more measurements, and said scattered variance being determined based on deviation of each of said one or more measurements with respect to a mean of measurements in a respective measurement group.
18. The method of claim 17, wherein each of said one or more measurements is an error- weighted measurement generated using a weighting factor, wherein said weighting factor is determined based on said predetermined measurement variance of said measurement.
19. The method of claim 18, wherein said weighting factor is determined according to the equation
W .
where σt 2i is said predetermined measurement variance of measurement ya.
20. The method of claim 17, wherein said within-group variance is determined based on group variances of said k measurement groups, each said group variance consisting of a measurement group propagated variance and a measurement group scattered variance, wherein said measurement group propagated variance is determined based on predetermined measurement variances of said one or more measurements in said measurement group, and wherein said measurement group scattered variance is a variance of said one or more measurements with respect to the mean of said one or more measurements in said measurement group.
21. The method of claim 20, wherein said propagated variance for each said measurement group is determined according to the equation Σ< σ y,p2 = n.
2 where σ y ^tp „ is said x propag *-*ated variance.
22. The method of claim 21 , wherein said mean of said one or more measurements is determined according to the equation
Figure imgf000106_0001
where yt is said mean of said one or more measurements, and wherein said scattered variance is determined according to the equation
Figure imgf000106_0002
2 where σ y -ti p> is said scattered variance.
23. The method of any one of claims 17, 18 and 20, further comprising determining said predetermined measurement variance for each of said plurality of measurements.
24. The method of claim 23, wherein said predetermined measurement variance of each of said plurality of measurement is determined according to an en-or model.
25. The method of claim 24, wherein said error model is a three-term error model according to the equation σ„l == c2 + b2 - yll + a2 -y 2
wherein σti is said predetermined measurement variance of said measurement ,„ a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
26. The method of claim 18 or 19, wherein said improvement further comprises o using an effective number of samples for each measurement group.
27. The method of claim 25, wherein said variable is the transcript level of a gene.
28. The method of claim 27, wherein said transcript level is measured using a DNA 5 microarray.
29. The method of claim 28, wherein said variable is the abundance of a protein.
30. The method of claim 29, wherein said abundance is measured using a protein 0 microarray.
31. The method of claim 30, wherein said abundance is measured using two- dimensional gel electrophoresis.
5 32. A method of analyzing a plurality of measurements of a variable { ,,} in k different measurement groups by ANOVA analysis, wherein;;,, is the tth measurement in the tth measurement group, t — 1, 2, ..., k and i - 1, 2, ..., nt, nt being the number of measurements in measurement group t, wherein each of said k different measurement group consists of one or more measurements of said variable under a condition common to said 0 measurement group, and wherein each said measurement;',, has a predetermined measurement variance, said method comprising
(a) weighting each measurement;,, with a weighting factor to obtain an error- weighted measurement, wherein said weighting factor is determined based on said predetermined measurement variance of said measurement; 5 (b) determining a within-group variance for said k different measurement groups, wherein said within-group variance consists of a propagated variance and a scattered variance, said propagated variance being determined based on predetermined measurement
, variances of said plurality of measurements, and said scattered variance being determined based on deviation of each of said plurality of measurements with respect to a mean of error-weighted measurements in a respective measurement group;
(c) determining a between-group variance for said k different measurement groups, wherein said between-group variance is a variance of a plurality of group means, one for
, p. each of said k different measurement groups, with respect to a mean of said plurality of error- weighted measurements, wherein each said group mean is the mean of said one or more measurements in a measurement group; and
(d) comparing said within-group variance with said between-group variance, thereby analyzing said plurality of measurements.
15
33. The method of claim 32, wherein said within-group variance is determined based on group variances of said k measurement groups, each said group variance consisting of a measurement group propagated variance and a measurement group scattered variance, wherein said measurement group propagated variance is determined based on predetermined jp. measurement variances of said one or more measurements in said measurement group, and wherein said measurement group scattered variance is a variance of said one or more measurements with respect to the mean of said one or more error- weighted measurements in said measurement group.
j c
34. The method of claim 33, wherein said weighting factor is determined according to the equation
W ti ^
30
where σ t 2i is said predetermined measurement variance of measurement;;,,, and wherein said propagated variance for each said measurement group is determined according to the equation
35 σ y-,p „ =
wti Σ—
1=1 '=ι σti
where σ - is said propagated variance.
35. The method of claim 34, wherein said mean of said one or more error- weighted measurements is determined according to the equation
Wι« - - ι=l yt =
W,, ι=l
where yt is said mean of said one or more error- weighted measurements, and wherein said scattered variance is determined according to the equation
Figure imgf000109_0001
where σ v s is said scattered variance.
36. The method of any one of claims 32-35, further comprising prior to said step (a) a step of determining said predetermined measurement variance for each said measurement
JA
37. The method of claim 36, wherein said predetermined measurement variance of each said measurement;;,, is determined according to an error model.
38. The method of claim 36, wherein said error model is a three-term error model according to the equation
σ = c 2 + b2 - ytt + a2 - y 2
2 wherein σ ti is said predetermined measurement variance of said measurement v,„ a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise..
39. The method of claim 35, wherein said group propagated variance and said group scattered variance are combined according to the equation
σ yytfpP + ( \^'t - 1) ' Gr y>ts σyt = n.
40. The method of claim 39, wherein said comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein said statistical metric is determined by a method comprising
(i) determining a within-group degree of freedom; (ii) determining a between-group degree of freedom; (iii) determining a within-group mean square; (iv) determining a between-group mean square; and
(v) calculating said statistical metric.
41. The method of claim 40, wherein said within-group degree of freedom is calculated according to the equation
v R = eN - k +
Figure imgf000110_0001
where vR is said within-group degree of freedom, en, is calculated according to the equation
Figure imgf000111_0001
and eN is calculated according to the equation
ft. eN- ∑ent
(=1
said between-group degree of freedom is calculated according to the equation
Figure imgf000111_0002
where vτ is said between-group degree of freedom, said within-group mean square is calculated according to the equation
Figure imgf000111_0003
where s -R2 i •s said within-group mean square and where SΛ is calculated according to the equation
Figure imgf000111_0004
where vRt is calculated according to the equation
V Rt = n . - 1 + n and said between-group mean square is calculated according to the equation
&rπ y /
2 where sτ is said between-group mean square and where Sτ is calculated according to the equation
Figure imgf000112_0001
where y is calculated according to the equation
- =
Figure imgf000112_0002
N
where N is the total number of measurements.
42. The method of claim 41, wherein said statistical test is an F-test and said significance level is a p-value determined according to the equation
pvalue=l- fcdf\ lJ I TJ >VT > VR
43. The method of claim 42, wherein said variable is the transcript level of a gene.
44. The method of claim 43, wherein said transcript level is measured using a DΝA . microarray
45. The method of claim 44, wherein said variable is the abundance of a protein.
46. The method of claim 45, wherein said abundance is measured using a protein microarray.
c 47. The method of claim 46, wherein said abundance is measured using two- dimensional gel electrophoresis.
48. A method of analyzing a plurality of measurements of a variable {ytiJ} in k * n different measurement groups by ANOVA analysis, wherein ytiJ is they'th measurement in
, p. the ttth measurement group, t = 1, 2, ..., n, i = 1, 2, ..., k, j = 1, ..., mtl, mti being the number of measurements in measurement group ti, wherein each of said k * n different measurement groups consists of measurements of said variable under common conditions t and 7 of condition groups N having n different conditions and TT having k different conditions, respectively, and wherein each said measurement „ has a predetermined
, - measurement variance, said method comprising
(a) determining a within-group variance for said k * n different measurement groups, wherein said within-group variance consists of a propagated variance and a scattered variance, said propagated variance being determined based on predetermined measurement variances of said plurality of measurements, and said scattered variance being determined jr. based on deviation of each of said plurality of measurements with respect to a mean of measurements in a respective measurement group;
(b) determining (bl) a between-group variance for condition group K, wherein said between-group variance is a variance of condition group means for said condition group K, one for each condition in said condition group K, with respect to a mean of said plurality of jc measurements, wherein each said condition group mean for said condition group K is the mean of measurements in all measurement groups under a respective condition in said condition group K, or (b2) an interaction variance, wherein said interaction variance is a variance of group interaction means, one for each of said k * n different measurement groups, with respect to a mean of said plurality of measurements and the condition group op. mean for said condition group K and the condition group mean for said condition group N, wherein each said group interaction mean is the mean of measurements in said measurement groups over the number of measurements in said measurement group, and wherein each said condition group mean for said condition group K is the mean of measurements in all measurement groups under a respective condition in said condition group K and each said
35 condition group mean for said condition group N is the mean of measurements in all measurement groups under a respective condition in said condition group N; and
(c) comparing said within-group variance with said between-group variance, thereby
5 analyzing said plurality of measurements.
49. The method of claim 48, wherein said within-group variance is determined based on group variances of said k measurement groups, each said group variance consisting of a measurement group propagated variance and a measurement group scattered variance,
, p. wherein said measurement group propagated variance is determined based on predetermined measurement variances of said one or more measurements in said measurement group, and wherein said measurement group scattered variance is a variance of said one or more measurements with respect to the mean of said one or more measurements in said measurement group.
15
50. The method of claim 49, wherein said propagated variance for each said measurement group is determined according to the equation
0 Σ tij
7=1 σ ytiP m.
2 2 5 where σy is said propagated variance and σ tiJ is said predetermined measurement
variance of measurement yt .
51. The method of claim 50, wherein said mean of said one or more measurements is determined according to the equation 30
Σ y tij
7 = ι y a =
35 where yti is said mean of said one or more measurements, and wherein said scattered variance is determined according to the equation
Figure imgf000115_0001
ι n where σ r, „ is said scattered variance.
52. The method of claim 51, further comprising prior to said step (a) a step of determining said predetermined measurement variance for each said measurement.
53. The method of claim 52, wherein said predetermined measurement variance of each said measurement ytij is determined according to an error model.
54. The method of claim 53, wherein said error model is a three-term error model according to equation 20
σ^ = c2 + b2 - yty + a2 - y 1
2 wherein σtiJ is said predetermined measurement variance of said measurement ytjJ, a is a
25 fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
55. The method of claim 54, wherein said group propagated variance and said group scattered variance are combined according to the equation 30
σy.tp + (mti - ή- σy.ts σy = — : — mti
35
56. The method of claim 55, wherein said comparing step comprises determining the significance level of a statistical metric by a statistical test, wherein said statistical metric is c determined by a method comprising
(i) determining a within-group degree of freedom;
(ii) determining a between-group degree of freedom, if said between-group variance is determined in said step (b), or an interaction degree of freedom, if said interaction variance is determined in said step (b); , p. (iii) determining a within-group mean square;
(iv) determining a between-group mean square, if said between-group variance is determined in said step (b), or an interaction mean square, if said interaction variance is determined in said step (b); and
(v) calculating said statistical metric.
15
57. The method of claim 56, where said between-group variance is determined in step (b), and wherein said within-group degree of freedom is calculated according to the equation
20 k n I v R = N - n - k + t = li = l m ti
where vR is said within-group degree of freedom, said between-group degree of freedom is calculated according to the equation
25 vτ =k—l
where vτ is said between-group degree of freedom, said within-group mean square is
30 calculated according to the equation
SR -SR /vR
35 where sR is said within-group mean square and where SR is calculated according to the equation
Figure imgf000117_0001
where V Rtj is calculated according to the equation
Figure imgf000117_0002
and said between-group mean square is calculated according to the equation
r T T
where sτ is said between-group mean square and where Sτ is calculated according to the equation
S T = ∑ N t - ( fyrrt -
Figure imgf000117_0003
t = l
where N, is calculated according to the equation
Figure imgf000117_0004
where y is calculated according to the equation
Figure imgf000118_0001
/v π^'
where N is the total number of measurements, and yt is calculated according to the equation
n i ,,
/ , / m yt =jΞ ΞL
N,
58. The method of claim 56, where said interaction variance is determined in step (b), wherein said within-group degree of freedom is calculated according to the equation
V R = N - H - k + ∑ ∑ t = li = l mm
where vR is said within-group degree of freedom, said interaction degree of freedom is calculated according to the equation
v7 =(«-l).(*-l)
where v7 is said interaction degree of freedom, said within-group mean square is calculated according to the equation
SR ~ ^R ' VR
where sR is said within-group mean square and where SΛ is calculated according to the equation
Figure imgf000119_0001
where V Rti is calculated according to the equation
vRH = ma - +- mt!
and said interaction mean square is calculated according to the equation
Sj =SI /VI
where _?7 is said interaction mean square and where S7 is calculated according to the equation
Figure imgf000119_0002
where y is calculated according to the equation
Figure imgf000119_0003
where N is the total number of measurements, yt is calculated according to the equation
Figure imgf000120_0001
Nt
where N, is calculated according to the equation
Nt =∑ ti ιι==ll
J . is calculated according to the equation
J, tij
Figure imgf000120_0002
where N, is calculated according to the equation
Figure imgf000120_0003
and ya is calculated according to the equation
Figure imgf000120_0004
59. The method of claim 57, wherein said statistical test is an F-test and said significance level is a p-value determined according to the equation
pvalue= l-fcdf\sτ /sR ,vτ,vh
60. The method of claim 58, wherein said statistical test is an F-test and said significance level is a p-value determined according to the equation
pvalue = 1 - fcdf [sj' I s R , vI t vR ).
61. The method of claim 58 or 59, wherein said variable is the transcript level of a gene.
10
62. The method of claim 61, wherein said transcript level is measured using a DNA microarray.
63. The method of claim 62, wherein said variable is the abundance of a protein.
15
64. The method of claim 63, wherein said abundance is measured using a protein microarray.
65. The method of claim 63, wherein said abundance is measured using two- jp. dimensional gel electrophoresis.
66. An improved two-way ANOVA method for analyzing a plurality of measurements of a variable {y, } in k * n different measurement groups by ANOVA analysis, wherein ytlJ is they'th measurement in the t/th measurement group, t = 1, 2, ..., n, i = jc 1, 2, ..., k, j = 1, ..., mh, and mtl is the number of measurements in measurement group ti, wherein each of said k * n different measurement groups consists of measurements of said variable under common conditions t and i of condition groups N having n different conditions and K having k different condition, respectively, wherein the improvement comprises determining a within-group variance consisting of a propagated variance and a scattered variance, said propagated variance being determined based on predetermined measurement variances of said plurality of measurements, and said scattered variance being determined based on deviation of each of said plurality of measurements with respect to a mean of measurements in a respective measurement group.
35
67. The method of claim 66, wherein each of said one or more measurements is an error- weighted measurement generated using a weighting factor, wherein said weighting factor is determined based on said predetermined measurement variance of said measurement.
68. The method of claim 67, wherein said weighting factor is determined according to the equation
W tij σ tij
2 where σft7 is said predetermined measurement variance of measurement yti
69. The method of claim 66, wherein said within-group variance is. determined based on group variances of said k measurement groups, each said group variance consisting of a measurement group propagated variance and a measurement group scattered variance, wherein said measurement group propagated variance is determined based on predetermined measurement variances of said one or more measurements in said measurement group, and wherein said measurement group scattered variance is a variance of said one or more measurements with respect to the mean of said one or more error- weighted measurements in said measurement group.
70. The method of claim 69, wherein said propagated variance for each said measurement group is determined according to the equation
tij σ _ 2 = L
}'li P m ti j. where σ - is said propagated variance and σ tlJ is said predetermined measurement
variance of measurement yt .
71. The method of claim 70, wherein said mean of said one or more measurements is determined according to the equation
"* tl - Σ 7 = 1 y y n m
where y is said mean of said one or more measurements, and wherein said scattered variance is determined according to the equation
Figure imgf000123_0001
2 where σ y-,ns „ is said scattered variance.
72. The method of claim 71, further comprising prior to said step (a) a step of determining said predetermined measurement variance for each said measurement.
73. The method of claim 72, wherein said predetermined measurement variance of each said measurement;/,,, is determined according to an error model.
74 ιe ethod of claim 73 , wherein said error model is a three-term error model according to equation
σ 2 J = c2 + b2 - ytlJ + a2 - ytlJ 2 2 wherein <JtiJ is said predetermined measurement variance of said measurement^, a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
75. The method of claim 74, wherein said group propagated variance and said group scattered variance are combined according to the equation
(7ylip + (mti - - 0- vus σyti mti
76. The method of claim 75, wherein said variable is the transcript level of a gene.
77. The method of claim 76, wherein said transcript level is measured using a DNA microarray.
78. The method of claim 77, wherein said variable is the abundance of a protein.
79. The method of claim 78, wherein said abundance is measured using a protein microarray.
80. The method of claim 79, wherein said abundance is measured using two- dimensional gel electrophoresis.
81. A method of determining if a plurality of measurements in k different measurement groups of a first variable {v„(l)} of a plurality of M different variables are unchanged among the plurality of different measurement groups, wherein a plurality of measurements of each of said plurality of M different variables {;'„(/)} in k different measurement groups are available, wherein ytl(j) is the tth measurement of theyth variable in the tth measurement group, t - 1, 2, ..., k; i = 1, 2, ..., n„ nt being the number of measurements in measurement group t; andy = 1, 2, ..., M; wherein each said measurement group consists of one or more measurements of a respective variable under a condition common to said measurement group, and wherein each said measurement ytj(j) has a predetermined measurement variance, said method comprising
(a) determining for each of said plurality of M different variables a within-group variance for said k different measurement groups of said variable, wherein said within- group variance consists of a propagated variance and a scattered variance, said propagated variance being determined based on predetermined measurement variances of said plurality of measurements, and said scattered variance being determined based on deviation of each of said plurality of measurements with respect to a mean of measurements in a respective
. p. measurement group;
(b) determining a between-group variance for said k different measurement groups of said first variable, wherein said between-group variance is a variance of a plurality of group means, one for each of said k different measurement groups, with respect to a mean of said plurality of measurements, wherein each said group mean is the mean of said one or
, ^ more measurements in a measurement group;
(c) determining an average within-group variance of measurements of said M different variables, wherein said average within-group variance is an average of within- group variances of said measurements of said plurality of different M variables; and
(d) comparing said within-group variance of said first variable with said between- jp. group variance of said first variable and with said average within-group variance, thereby determining if said plurality of measurements of said first variable are unchanged among said plurality of different measurement groups.
82. The method of claim 81, wherein said steps (a)-(d) are repeated for one or more j c other variables in said plurality of M different variables.
83. The method of claim 81, wherein each said within-group variance is determined based on group variances of said k measurement groups, each said group variance consisting of a measurement group propagated variance and a measurement group scattered variance, « wherein said measurement group propagated variance is determined based on predetermined measurement variances of said one or more measurements in said measurement group, and wherein said measurement group scattered variance is a variance of said one or more measurements with respect to the mean of said one or more error- weighted measurements in said measurement group.
35
84. The method of claim 83, wherein said propagated variance for each said measurement group is determined according to the equation
Σ U) y,p ϋ) = 1 = 1 n.
where σ _(J) is said propagated variance and 0" ft. (_/) is said predetermined
measurement variance of measurement yti(j).
85. The method of claim 84, wherein said mean of said one or more measurements is determined according to the equation
∑ ya ( j)
7 = 1 y t ( J) = n .
where yt (y ) is said mean of said one or more measurements, and wherein said scattered variance is determined according to the equation
Figure imgf000126_0001
where d- ( ) is said scattered variance.
86. The method of any one of claims 81-85, further comprising prior to said step (a) a step of determining said predetermined measurement variance for each said measurement.
87. The method of claim 86, wherein said predetermined measurement variance of each said measurement ytl(j) is determined according to an error model.
88. The method of claim 87, wherein said error model is a three-term error model according to equation
Figure imgf000127_0001
wherein σ ti (J) is said predetermined measurement variance of said measurement ytlQ), a is a fractional error coefficient, b is a Poisson error coefficient, and c is a standard deviation of background noise.
89. The method of claim 85, wherein said group propagated variance and said group scattered variance are combined according to the equation
Figure imgf000127_0002
90. The method of claim 89, wherein said comparing said within-group variance of said first variable with said between-group variance of said first variable is carried out according to the equation
Figure imgf000127_0003
wherein
Figure imgf000127_0004
and vr(l) = k -1
and
Figure imgf000128_0001
where
Figure imgf000128_0002
where
vΛ(l) - ^ -1 +
/?
and
5r 2(l) = 5r(l)/v3.(l)
where
Figure imgf000128_0003
where
Figure imgf000129_0001
where N is the total number of measurements of said first variable; and said comparing said within-group variance of said first variable with said average within-group variance comparing is carried out according to the equation
SmallError_ pvalue (1) = 1 - fcdf\sR (1) /sR 2,vR, vavg
wherein
Figure imgf000129_0002
wherein
Figure imgf000129_0003
where
Figure imgf000129_0004
where
y Rt (J) = n t - l + n . and where Vavg is chosen to be at least about 20 to about 100.
91. The method of claim 90, wherein each of said plurality of M different variables is the transcript level of a gene.
92. The method of claim 91, wherein said transcript level is measured using a DNA microarray.
93. The method of claim 92, wherein each of said plurality of M different variables is the abundance of a protein.
94. The method of claim 93, wherein said abundance is measured using a protein microarray.
95. The method of claim 94, wherein said abundance is measured using two- dimensional gel electrophoresis.
96. A method for analyzing variation among a plurality of measurements of a variable {y,}, wherein;;, is the tth measurement, t = 1, 2, ..., n„ nt being the number of measurements, wherein said plurality of measurements are measured under a common condition, wherein each said measurement has a predetermined measurement variance, said method comprising
(a) determining a propagated variance and a scattered variance, wherein said propagated variance is determined based on predetermined measurement variances of said plurality of measurements, and wherein said scattered variance is a variance of said plurality of measurements with respect to the mean of said plurality of measurements; and
(b) comparing said propagated variance and said scattered variance, thereby analyzing variation in said plurality of measurements of said variable.
97. The method of claim 96, wherein said propagated variance for each said measurement group is determined according to the equation n
Σ σ t σ T t = l y , p n
where σ - is said propagated variance and σ t is said predetermined measurement
variance of measurement , and wherein said scattered variance is determined according to the equation
Figure imgf000131_0001
2 where σ -ts is said scattered variance and where
n
Σ y t t = \ y =
98. The method of claim 97, further comprising before said step (a) a step of weighting each of said plurality of measurements with a weighting factor to obtain an error- weighted measurement, wherein said weighting factor is determined based on said predetermined measurement variance of said measurement.
99. The method of claim 98, wherein said weighting factor is determined according to the equation W σ
where σ t is said predetermined measurement variance of measurement yt.
100. The method of claim 97, 98 or 99, wherein said comparing is carried out using an F-test according to the equation
Consistenc y _ pvalue = 1 - fcdf , rij - 1? rif
Figure imgf000132_0001
101. A computer system comprising a processor, and a memory coupled to said processor and encoding one or more programs, wherein said one or more programs cause the processor to carry out the method of any one of claims 1, 17, 34, 48, 66, 81 and 96.
102. A computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, said computer program product comprising a computer readable storage medium having a computer program mechanism encoded thereon, wherein said computer program mechanism may be loaded into the memory of said computer and cause said computer to carry out the method of any one of claims 1, 17, 34, 48, 66, 81 and 96.
PCT/US2003/002584 2003-01-22 2003-01-30 Improved anova method for data analysis WO2004068136A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2003/002584 WO2004068136A1 (en) 2003-01-22 2003-01-30 Improved anova method for data analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/000,000 2002-08-01
PCT/US2003/002584 WO2004068136A1 (en) 2003-01-22 2003-01-30 Improved anova method for data analysis

Publications (1)

Publication Number Publication Date
WO2004068136A1 true WO2004068136A1 (en) 2004-08-12

Family

ID=32823172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/002584 WO2004068136A1 (en) 2003-01-22 2003-01-30 Improved anova method for data analysis

Country Status (1)

Country Link
WO (1) WO2004068136A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1569155A1 (en) * 2004-02-21 2005-08-31 Samsung Electronics Co., Ltd. Method of detecting an error spot in a DNA chip and system using the method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5265617A (en) * 1991-02-20 1993-11-30 Georgetown University Methods and means for non-invasive, dynamic tracking of cardiac vulnerability by simultaneous analysis of heart rate variability and T-wave alternans
US5842997A (en) * 1991-02-20 1998-12-01 Georgetown University Non-invasive, dynamic tracking of cardiac vulnerability by simultaneous analysis of heart rate variability and T-wave alternans
US6243615B1 (en) * 1999-09-09 2001-06-05 Aegis Analytical Corporation System for analyzing and improving pharmaceutical and other capital-intensive manufacturing processes
US20010034023A1 (en) * 1999-04-26 2001-10-25 Stanton Vincent P. Gene sequence variations with utility in determining the treatment of disease, in genes relating to drug processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5265617A (en) * 1991-02-20 1993-11-30 Georgetown University Methods and means for non-invasive, dynamic tracking of cardiac vulnerability by simultaneous analysis of heart rate variability and T-wave alternans
US5842997A (en) * 1991-02-20 1998-12-01 Georgetown University Non-invasive, dynamic tracking of cardiac vulnerability by simultaneous analysis of heart rate variability and T-wave alternans
US20010034023A1 (en) * 1999-04-26 2001-10-25 Stanton Vincent P. Gene sequence variations with utility in determining the treatment of disease, in genes relating to drug processing
US6243615B1 (en) * 1999-09-09 2001-06-05 Aegis Analytical Corporation System for analyzing and improving pharmaceutical and other capital-intensive manufacturing processes

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1569155A1 (en) * 2004-02-21 2005-08-31 Samsung Electronics Co., Ltd. Method of detecting an error spot in a DNA chip and system using the method

Similar Documents

Publication Publication Date Title
US6691042B2 (en) Methods for generating differential profiles by combining data obtained in separate measurements
US6801859B1 (en) Methods of characterizing drug activities using consensus profiles
US6468476B1 (en) Methods for using-co-regulated genesets to enhance detection and classification of gene expression patterns
US6203987B1 (en) Methods for using co-regulated genesets to enhance detection and classification of gene expression patterns
US6950752B1 (en) Methods for removing artifact from biological profiles
US7130746B2 (en) Computer systems and computer programs for determining protein activity levels using gene expression profiles
CA2378933A1 (en) Iterative probe design and detailed expression profiling with flexible in-situ synthesis arrays
US7996155B2 (en) ANOVA method for data analysis
US7418351B2 (en) Methods for analysis of measurement errors in measured signals
JP2011004763A (en) Method and composition for rna interference
US7807447B1 (en) Compositions and methods for exon profiling
US20050143628A1 (en) Methods for characterizing tissue or organ condition or status
Farber et al. Integrating global gene expression analysis and genetics
Sykacek et al. The impact of quantitative optimization of hybridization conditions on gene expression analysis
AU773456B2 (en) Methods for using co-regulated genesets to enhance detection and classification of gene expression patterns
WO2004068136A1 (en) Improved anova method for data analysis
Pushparaj Introduction to functional bioinformatics
US20230154566A1 (en) Epigenetic age predictor
US20020146694A1 (en) Functionating genomes with cross-species coregulation
Papana Tools for Comprehensive Statistical Analysis of Microarray Data

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)