CA2451074A1 - Diagnosis and prognosis of breast cancer patients - Google Patents

Diagnosis and prognosis of breast cancer patients

Info

Publication number
CA2451074A1
Authority
CA
Canada
Prior art keywords
seq
genes
expression
sample
markers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CA002451074A
Other languages
French (fr)
Other versions
CA2451074C (en
Inventor
Hongyue Dai
Yudong He
Peter S. Linsley
Mao Mao
Christopher J. Roberts
Laura Johanna Van't Veer
Marc J. Van De Vijver
Rene Bernards
A. A. M. Hart
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netherlands Cancer Institute
Merck Sharp and Dohme LLC
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of CA2451074A1
Application granted
Publication of CA2451074C
Anticipated expiration
Expired - Lifetime

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407 Specifically defined cancers
    • G01N33/57415 Specifically defined cancers of breast
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00 Subject matter not provided for in other main groups of this subclass
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present invention relates to genetic markers whose expression is correlated with breast cancer. Specifically, the invention provides sets of markers whose expression patterns can be used to differentiate clinical conditions associated with breast cancer, such as the presence or absence of the estrogen receptor ESR1, and BRCA1 and sporadic tumors, and to provide information on the likelihood of tumor distant metastases within five years of initial diagnosis. The invention relates to methods of using these markers to distinguish these conditions. The invention also relates to kits containing ready-to-use microarrays and computer software for data analysis using the statistical methods disclosed herein.

Description

DIAGNOSIS AND PROGNOSIS OF BREAST CANCER PATIENTS
This application claims benefit of United States Provisional Application No. 60/298,918, filed June 18, 2001, and United States Provisional Application No. 60/380,710, filed on May 14, 2002, each of which is incorporated by reference herein in its entirety.
This application includes a Sequence Listing submitted on compact disc, recorded on two compact discs, including one duplicate, containing Filename 9301175228.txt, of size 6,755,971 bytes, created June 13, 2002. The sequence listing on the compact discs is incorporated by reference herein in its entirety.
1. FIELD OF THE INVENTION
The present invention relates to the identification of marker genes useful in the diagnosis and prognosis of breast cancer. More particularly, the invention relates to the identification of a set of marker genes associated with breast cancer, a set of marker genes differentially expressed in estrogen receptor (+) versus estrogen receptor (-) tumors, a set of marker genes differentially expressed in BRCA1 versus sporadic tumors, and a set of marker genes differentially expressed in sporadic tumors from patients with good clinical prognosis (i.e., metastasis- or disease-free >5 years) versus patients with poor clinical prognosis (i.e., metastasis- or disease-free <5 years). For each of the marker sets above, the invention further relates to methods of distinguishing the breast cancer-related conditions. The invention further provides methods for determining the course of treatment of a patient with breast cancer.
2. BACKGROUND OF THE INVENTION
The increased number of cancer cases reported in the United States, and, indeed, around the world, is a major concern. Currently there are only a handful of treatments available for specific types of cancer, and these provide no guarantee of success.
In order to be most effective, these treatments require not only an early detection of the malignancy, but a reliable assessment of the severity of the malignancy.
The incidence of breast cancer, a leading cause of death in women, has been gradually increasing in the United States over the last thirty years. Its cumulative risk is relatively high; 1 in 8 women are expected to develop some type of breast cancer by age 85 in the United States. In fact, breast cancer is the most common cancer in women and the second most common cause of cancer death in the United States. In 1997, it was estimated that 181,000 new cases were reported in the U.S., and that 44,000 people would die of breast cancer (Parker et al., CA Cancer J. Clin. 47:5-27 (1997); Chu et al., J. Nat. Cancer Inst. 88:1571-1579 (1996)). While the mechanism of tumorigenesis for most breast carcinomas is largely unknown, there are genetic factors that can predispose some women to developing breast cancer (Miki et al., Science 266:66-71 (1994)). The discovery and characterization of BRCA1 and BRCA2 has recently expanded our knowledge of genetic factors which can contribute to familial breast cancer. Germ-line mutations within these two loci are associated with a 50 to 85% lifetime risk of breast and/or ovarian cancer (Casey, Curr. Opin. Oncol. 9:88-93 (1997); Marcus et al., Cancer 77:697-709 (1996)). Only about 5% to 10% of breast cancers are associated with breast cancer susceptibility genes, BRCA1 and BRCA2. The cumulative lifetime risk of breast cancer for women who carry the mutant BRCA1 is predicted to be approximately 92%, while the cumulative lifetime risk for the non-carrier majority is estimated to be approximately 10%. BRCA1 is a tumor suppressor gene that is involved in DNA repair and cell cycle control, which are both important for the maintenance of genomic stability. More than 90% of all mutations reported so far result in a premature truncation of the protein product with abnormal or abolished function. The histology of breast cancer in BRCA1 mutation carriers differs from that in sporadic cases, but mutation analysis is the only way to find the carrier. Like BRCA1, BRCA2 is involved in the development of breast cancer, and like BRCA1 plays a role in DNA repair. However, unlike BRCA1, it is not involved in ovarian cancer.
Other genes have been linked to breast cancer, for example c-erb-2 (HER2) and p53 (Beenken et al., Ann. Surg. 233(5):630-638 (2001)). Overexpression of c-erb-2 (HER2) and p53 has been correlated with poor prognosis (Rudolph et al., Hum. Pathol. 32(3):311-319 (2001)), as have aberrant expression products of mdm2 (Lukas et al., Cancer Res. 61(7):3212-3219 (2001)) and cyclin1 and p27 (Porter & Roberts, International Publication WO98/33450, published August 6, 1998). However, no other clinically useful markers consistently associated with breast cancer have been identified.
Sporadic tumors, those not currently associated with a known germline mutation, constitute the majority of breast cancers. It is likely that other, non-genetic factors also have a significant effect on the etiology of the disease.
Regardless of the cancer's origin, breast cancer morbidity and mortality increase significantly if it is not detected early in its progression. Thus, considerable effort has focused on the early detection of cellular transformation and tumor formation in breast tissue.

A marker-based approach to tumor identification and characterization promises improved diagnostic and prognostic reliability. Typically, the diagnosis of breast cancer requires histopathological proof of the presence of the tumor. In addition to diagnosis, histopathological examinations also provide information about prognosis and selection of treatment regimens. Prognosis may also be established based upon clinical parameters such as tumor size, tumor grade, the age of the patient, and lymph node metastasis.
Diagnosis and/or prognosis may be determined to varying degrees of effectiveness by direct examination of the outside of the breast, or through mammography or other X-ray imaging methods (Jatoi, Am. J. Surg. 177:518-524 (1999)). The latter approach is not without considerable cost, however. Every time a mammogram is taken, the patient incurs a small risk of having a breast tumor induced by the ionizing properties of the radiation used during the test. In addition, the process is expensive and the subjective interpretations of a technician can lead to imprecision. For example, one study showed major clinical disagreements for about one-third of a set of mammograms that were interpreted individually by a surveyed group of radiologists. Moreover, many women find that undergoing a mammogram is a painful experience. Accordingly, the National Cancer Institute has not recommended mammograms for women under fifty years of age, since this group is not as likely to develop breast cancers as are older women. It is compelling to note, however, that while only about 22% of breast cancers occur in women under fifty, data suggests that breast cancer is more aggressive in pre-menopausal women.
In clinical practice, accurate diagnosis of various subtypes of breast cancer is important because treatment options, prognosis, and the likelihood of therapeutic response all vary broadly depending on the diagnosis. Accurate prognosis, or determination of distant metastasis-free survival could allow the oncologist to tailor the administration of adjuvant chemotherapy, with women having poorer prognoses being given the most aggressive treatment. Furthermore, accurate prediction of poor prognosis would greatly impact clinical trials for new breast cancer therapies, because potential study patients could then be stratified according to prognosis. Trials could then be limited to patients having poor prognosis, in turn making it easier to discern if an experimental therapy is efficacious.
To date, no set of satisfactory predictors for prognosis based on the clinical information alone has been identified. The detection of BRCA1 or BRCA2 mutations represents a step towards the design of therapies to better control and prevent the appearance of these tumors. However, there is no equivalent means for the diagnosis of patients with sporadic tumors, the most common type of breast cancer tumor, nor is there a means of differentiating subtypes of breast cancer.
3. SUMMARY OF THE INVENTION
The invention provides gene marker sets that distinguish various types and subtypes of breast cancer, and methods of use therefor. In one embodiment, the invention provides a method for classifying a cell sample as ER(+) or ER(-) comprising detecting a difference in the expression of a first plurality of genes relative to a control, said first plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 1. In specific embodiments, said plurality of genes consists of at least 50, 100, 200, 500, 1000, up to 2,460 of the gene markers listed in Table 1. In another specific embodiment, said plurality of genes consists of each of the genes corresponding to the 2,460 markers listed in Table 1. In another specific embodiment, said plurality consists of the 550 markers listed in Table 2. In another specific embodiment, said control comprises nucleic acids derived from a pool of tumors from individual sporadic patients. In another specific embodiment, said detecting comprises the steps of (a) generating an ER(+) template by hybridization of nucleic acids derived from a plurality of ER(+) patients within a plurality of sporadic patients against nucleic acids derived from a pool of tumors from individual sporadic patients; (b) generating an ER(-) template by hybridization of nucleic acids derived from a plurality of ER(-) patients within said plurality of sporadic patients against nucleic acids derived from said pool of tumors from individual sporadic patients within said plurality; (c) hybridizing nucleic acids derived from an individual sample against said pool; and (d) determining the similarity of marker gene expression in the individual sample to the ER(+) template and the ER(-) template, wherein if said expression is more similar to the ER(+) template, the sample is classified as ER(+), and if said expression is more similar to the ER(-) template, the sample is classified as ER(-).
The invention further provides the above methods, applied to the classification of samples as BRCA1 or sporadic, and classifying patients as having good prognosis or poor prognosis. For the BRCA1/sporadic gene markers, the invention provides that the method may be used wherein the plurality of genes is at least 5, 20, 50, 100, 200 or 300 of the BRCA1/sporadic markers listed in Table 3. In a specific embodiment, the optimum 100 markers listed in Table 4 are used. For the prognostic markers, the invention provides that at least 5, 20, 50, 100, or 200 gene markers listed in Table 5 may be used. In a specific embodiment, the optimum 70 markers listed in Table 6 are used.
The invention further provides that markers may be combined. Thus, in one embodiment, at least 5 markers from Table 1 are used in conjunction with at least 5 markers from Table 3. In another embodiment, at least 5 markers from Table 5 are used in conjunction with at least 5 markers from Table 3. In another embodiment, at least 5 markers from Table 1 are used in conjunction with at least 5 markers from Table 5. In another embodiment, at least 5 markers from each of Tables 1, 3, and 5 are used simultaneously.
The invention further provides a method for classifying a sample as ER(+) or ER(-) by calculating the similarity between the expression of at least 5 of the markers listed in Table 1 in the sample to the expression of the same markers in an ER(-) nucleic acid pool and an ER(+) nucleic acid pool, comprising the steps of: (a) labeling nucleic acids derived from a sample, with a first fluorophore to obtain a first pool of fluorophore-labeled nucleic acids; (b) labeling with a second fluorophore a first pool of nucleic acids derived from two or more ER(+) samples, and a second pool of nucleic acids derived from two or more ER(-) samples; (c) contacting said first fluorophore-labeled nucleic acid and said first pool of second fluorophore-labeled nucleic acid with said first microarray under conditions such that hybridization can occur, and contacting said first fluorophore-labeled nucleic acid and said second pool of second fluorophore-labeled nucleic acid with said second microarray under conditions such that hybridization can occur, detecting at each of a plurality of discrete loci on the first microarray a first fluorescent emission signal from said first fluorophore-labeled nucleic acid and a second fluorescent emission signal from said first pool of second fluorophore-labeled genetic matter that is bound to said first microarray under said conditions, and detecting at each of the marker loci on said second microarray said first fluorescent emission signal from said first fluorophore-labeled nucleic acid and a third fluorescent emission signal from said second pool of second fluorophore-labeled nucleic acid; (d) determining the similarity of the sample to the ER(-) and ER(+) pools by comparing said first fluorescence emission signals and said second fluorescence emission signals, and said first emission signals and said third fluorescence emission signals; and (e) classifying the sample as ER(+) where the first fluorescence emission signals are more similar to said second fluorescence emission signals than to said third fluorescent emission signals, and classifying the sample as ER(-) where the first fluorescence emission signals are more similar to said third fluorescence emission signals than to said second fluorescent emission signals, wherein said similarity is defined by a statistical method.
The invention further provides that the other disclosed marker sets may be used in the above method to distinguish BRCA1 from sporadic tumors, and patients with poor prognosis from patients with good prognosis.
In a specific embodiment, said similarity is calculated by determining a first sum of the differences of expression levels for each marker between said first fluorophore-labeled nucleic acid and said first pool of second fluorophore-labeled nucleic acid, and a second sum of the differences of expression levels for each marker between said first fluorophore-labeled nucleic acid and said second pool of second fluorophore-labeled nucleic acid, wherein if said first sum is greater than said second sum, the sample is classified as ER(-), and if said second sum is greater than said first sum, the sample is classified as ER(+). In another specific embodiment, said similarity is calculated by computing a first classifier parameter P1 between an ER(+) template and the expression of said markers in said sample, and a second classifier parameter P2 between an ER(-) template and the expression of said markers in said sample, wherein said P1 and P2 are calculated according to the formula:
$$P_i = \frac{\vec{z}_i \cdot \vec{y}}{\lVert \vec{z}_i \rVert \, \lVert \vec{y} \rVert} \qquad \text{Equation (1)}$$
wherein $\vec{z}_1$ and $\vec{z}_2$ are ER(-) and ER(+) templates, respectively, and are calculated by averaging said second fluorescence emission signal for each of said markers in said first pool of second fluorophore-labeled nucleic acid and said third fluorescence emission signal for each of said markers in said second pool of second fluorophore-labeled nucleic acid, respectively, and wherein $\vec{y}$ is said first fluorescence emission signal of each of said markers in the sample to be classified as ER(+) or ER(-), wherein the expression of the markers in the sample is similar to ER(+) if P1 < P2, and similar to ER(-) if P1 > P2.
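The classifier parameter of Equation (1) is a normalized dot product between a template and the sample's marker signals. The following is a minimal sketch of that calculation, assuming the templates are per-marker averages over each pool and following the template assignment stated with Equation (1) (z1 for ER(-), z2 for ER(+)). The function names and the four-marker values are hypothetical, and the handling of a tie (P1 equal to P2) is an assumption, since the text does not address it.

```python
import math

def classifier_parameter(z, y):
    """Equation (1): P = (z . y) / (||z|| * ||y||)."""
    dot = sum(a * b for a, b in zip(z, y))
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_z * norm_y)

def mean_template(profiles):
    """Average each marker's signal across the samples of one pool."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def classify_er(y, z1, z2):
    """z1 is the ER(-) template and z2 the ER(+) template.

    The sample is called ER(+) if P1 < P2; a tie falls through to ER(-),
    which is an assumption of this sketch.
    """
    p1 = classifier_parameter(z1, y)
    p2 = classifier_parameter(z2, y)
    return "ER(+)" if p1 < p2 else "ER(-)"

# Hypothetical log-ratio values for four markers (illustration only).
er_neg_pool = [[-1.0, -0.8, 0.9, 1.1], [-1.2, -0.6, 1.1, 0.9]]
er_pos_pool = [[1.0, 0.7, -0.9, -1.0], [0.8, 0.9, -1.1, -0.8]]
z1 = mean_template(er_neg_pool)
z2 = mean_template(er_pos_pool)
sample = [0.9, 0.5, -0.7, -1.2]
label = classify_er(sample, z1, z2)
```

Because the parameter is scale-invariant, a sample whose markers move in the same direction as a template scores close to 1 against it regardless of overall signal intensity.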
The invention further provides a method for identifying marker genes the expression of which is associated with a particular phenotype. In one embodiment, the invention provides a method for determining a set of marker genes whose expression is associated with a particular phenotype, comprising the steps of (a) selecting the phenotype having two or more phenotype categories; (b) identifying a plurality of genes wherein the expression of said genes is correlated or anticorrelated with one of the phenotype categories, and wherein the correlation coefficient for each gene is calculated according to the equation
$$\rho = \frac{\vec{c} \cdot \vec{r}}{\lVert \vec{c} \rVert \, \lVert \vec{r} \rVert} \qquad \text{Equation (2)}$$
wherein $\vec{c}$ is a number representing said phenotype category and $\vec{r}$ is the logarithmic expression ratio across all the samples for each individual gene, wherein if the correlation coefficient has an absolute value of a threshold value or greater, said expression of said gene is associated with the phenotype category, and wherein said plurality of genes is a set of marker genes whose expression is associated with a particular phenotype.
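The marker selection step can be sketched as a per-gene correlation between the phenotype category vector and the gene's log expression ratios, keeping genes whose absolute correlation meets the threshold. This sketch computes a Pearson correlation, which equals the normalized dot product once both vectors are mean-centered; whether the source centers the vectors is an assumption here. The 1/0 category coding, gene names, and values are hypothetical.

```python
import math

def pearson(c, r):
    """Pearson correlation between category vector c and log ratios r."""
    n = len(c)
    mean_c, mean_r = sum(c) / n, sum(r) / n
    dc = [x - mean_c for x in c]
    dr = [x - mean_r for x in r]
    numerator = sum(a * b for a, b in zip(dc, dr))
    denominator = math.sqrt(sum(a * a for a in dc)) * math.sqrt(sum(b * b for b in dr))
    return numerator / denominator

def select_markers(category, log_ratios_by_gene, threshold):
    """Keep genes whose |correlation| with the category meets the threshold."""
    selected = {}
    for gene, ratios in log_ratios_by_gene.items():
        rho = pearson(category, ratios)
        if abs(rho) >= threshold:
            selected[gene] = rho
    return selected

# Hypothetical coding: 1 for one phenotype category, 0 for the other.
category = [1, 1, 1, 0, 0, 0]
genes = {
    "gene_A": [0.9, 1.1, 0.8, -0.7, -1.0, -0.9],  # correlated
    "gene_B": [-0.8, -1.2, -0.9, 1.0, 0.7, 1.1],  # anticorrelated
    "gene_C": [0.1, -0.1, 0.0, 0.1, -0.1, 0.0],   # uninformative
}
markers = select_markers(category, genes, threshold=0.3)
```

Both correlated and anticorrelated genes are retained because the test is on the absolute value of the coefficient.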
The threshold depends upon the number of samples used; the threshold can be calculated as $3 \times 1/\sqrt{n-3}$, where $1/\sqrt{n-3}$ is the distribution width and n = the number of samples. In a specific embodiment where n = 98, said threshold value is 0.3. In a specific embodiment, said set of marker genes is validated by: (a) using a statistical method to randomize the association between said marker genes and said phenotype category, thereby creating a control correlation coefficient for each marker gene; (b) repeating step (a) one hundred or more times to develop a frequency distribution of said control correlation coefficients for each marker gene; (c) determining the number of marker genes having a control correlation coefficient of a threshold value or above, thereby creating a control marker gene set; and (d) comparing the number of control marker genes so identified to the number of marker genes, wherein if the p value of the difference between the number of marker genes and the number of control genes is less than 0.01, said set of marker genes is validated. In another specific embodiment, said set of marker genes is optimized by the method comprising: (a) rank-ordering the genes by amplitude of correlation or by significance of the correlation coefficients, and (b) selecting an arbitrary number of marker genes from the top of the rank-ordered list. The threshold value depends upon the number of samples tested.
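The threshold and the randomization control can be sketched together. With n = 98 used as an illustrative sample count, three times the distribution width gives about 0.31, in line with the 0.3 quoted. The validation below shuffles the category labels, recounts genes passing the threshold, and reports the fraction of shuffled runs that match or exceed the observed count; that empirical p value is an assumption of this sketch, since the text does not say how the p value is computed. The synthetic data, seeds, and function names are hypothetical.

```python
import math
import random

def correlation_threshold(n_samples):
    """Three times the distribution width 1/sqrt(n - 3)."""
    return 3.0 / math.sqrt(n_samples - 3)

def pearson(c, r):
    n = len(c)
    mean_c, mean_r = sum(c) / n, sum(r) / n
    numerator = sum((a - mean_c) * (b - mean_r) for a, b in zip(c, r))
    denominator = math.sqrt(sum((a - mean_c) ** 2 for a in c)
                            * sum((b - mean_r) ** 2 for b in r))
    return numerator / denominator

def count_markers(category, genes, threshold):
    """Number of genes with |correlation| at or above the threshold."""
    return sum(1 for ratios in genes.values()
               if abs(pearson(category, ratios)) >= threshold)

def control_counts(category, genes, threshold, n_runs=100, seed=0):
    """Marker counts after randomizing the gene/category association."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_runs):
        shuffled = list(category)
        rng.shuffle(shuffled)
        counts.append(count_markers(shuffled, genes, threshold))
    return counts

def empirical_p(observed, counts):
    """Fraction of control runs with at least the observed marker count."""
    return sum(1 for c in counts if c >= observed) / len(counts)

# Synthetic data (illustration only): 98 samples, 20 genes, 5 informative.
data_rng = random.Random(1)
category = [1] * 49 + [0] * 49
genes = {}
for g in range(20):
    signal = 2.0 if g < 5 else 0.0
    genes[f"gene_{g}"] = [signal * (c - 0.5) + data_rng.gauss(0, 1) for c in category]

threshold = correlation_threshold(len(category))
observed = count_markers(category, genes, threshold)
counts = control_counts(category, genes, threshold)
p_value = empirical_p(observed, counts)
```

With only 100 randomizations the smallest nonzero empirical p value is 0.01, so more runs would be needed to resolve values below the 0.01 cutoff.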
The invention further provides a method for assigning a person to one of a plurality of categories in a clinical trial, comprising determining for each said person the level of expression of at least five of the prognosis markers listed in Table 6, determining therefrom whether the person has an expression pattern that correlates with a good prognosis or a poor prognosis, and assigning said person to one category in a clinical trial if said person is determined to have a good prognosis, and a different category if that person is determined to have a poor prognosis. The invention further provides a method for assigning a person to one of a plurality of categories in a clinical trial, where each of said categories is associated with a different phenotype, comprising determining for each said person the level of expression of at least five markers from a set of markers, wherein said set of markers includes markers associated with each of said clinical categories, determining therefrom whether the person has an expression pattern that correlates with one of the clinical categories, and assigning said person to one of said categories if said person is determined to have a phenotype associated with that category.
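The assignment step amounts to grouping persons by the category their expression pattern is called into. A minimal sketch follows, assuming a classifier is supplied from elsewhere; the `placeholder_classifier` rule, the marker names, and the values are hypothetical stand-ins for illustration and are not the source's classification method.

```python
from collections import defaultdict

def assign_trial_categories(profiles, classify, min_markers=5):
    """Group persons into trial categories by their predicted phenotype."""
    categories = defaultdict(list)
    for person, profile in profiles.items():
        if len(profile) < min_markers:
            raise ValueError(f"{person}: need at least {min_markers} markers")
        categories[classify(profile)].append(person)
    return dict(categories)

def placeholder_classifier(profile):
    # Placeholder rule for illustration only; NOT the source's classifier.
    mean_level = sum(profile.values()) / len(profile)
    return "good prognosis" if mean_level < 0 else "poor prognosis"

# Hypothetical expression values for five markers per person.
profiles = {
    "person_1": {"m1": -0.5, "m2": -0.2, "m3": -0.7, "m4": 0.1, "m5": -0.3},
    "person_2": {"m1": 0.6, "m2": 0.4, "m3": 0.2, "m4": -0.1, "m5": 0.5},
}
arms = assign_trial_categories(profiles, placeholder_classifier)
```

Passing the classifier in as an argument keeps the grouping logic the same whether the categories are two prognoses or several phenotypes.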
The invention further provides a method of classifying a first cell or organism as having one of at least two different phenotypes, said at least two different phenotypes comprising a first phenotype and a second phenotype, said method comprising:
(a) comparing the level of expression of each of a plurality of genes in a first sample from the first cell or organism to the level of expression of each of said genes, respectively, in a pooled sample from a plurality of cells or organisms, said plurality of cells or organisms comprising different cells or organisms exhibiting said at least two different phenotypes, respectively, to produce a first compared value;
(b) comparing said first compared value to a second compared value, wherein said second compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said first phenotype to the level of expression of each of said genes, respectively, in said pooled sample;
(c) comparing said first compared value to a third compared value, wherein said third compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said second phenotype to the level of expression of each of said genes, respectively, in said pooled sample;
(d) optionally carrying out one or more times a step of comparing said first compared value to one or more additional compared values, respectively, each additional compared value being the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having a phenotype different from said first and second phenotypes but included among said at least two different phenotypes, to the level of expression of each of said genes, respectively, in said pooled sample; and
(e) determining to which of said second, third and, if present, one or more additional compared values, said first compared value is most similar, wherein said first cell or organism is determined to have the phenotype of the cell or organism used to produce said compared value most similar to said first compared value.
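Steps (a) through (e) can be sketched for any number of phenotypes. In this sketch each compared value is a vector of log ratios against the pooled sample, and "most similar" is taken to mean the smallest Euclidean distance; both the base-10 logarithm and the distance metric are assumptions, as this passage does not fix them. Phenotype names and expression levels are hypothetical.

```python
import math

def compared_value(sample_levels, pool_levels):
    """Per-gene log ratio of a sample against the pooled sample."""
    return [math.log10(s / p) for s, p in zip(sample_levels, pool_levels)]

def distance(a, b):
    """Euclidean distance between two compared values (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_phenotype(first_sample, pool, references):
    """Return the phenotype whose compared value is closest to the sample's."""
    first = compared_value(first_sample, pool)                       # step (a)
    others = {name: compared_value(levels, pool)                     # steps (b)-(d)
              for name, levels in references.items()}
    return min(others, key=lambda name: distance(first, others[name]))  # step (e)

# Hypothetical expression levels for three genes.
pool = [100.0, 100.0, 100.0]
references = {
    "phenotype_1": [200.0, 50.0, 100.0],
    "phenotype_2": [50.0, 200.0, 100.0],
    "phenotype_3": [100.0, 100.0, 400.0],
}
first_sample = [180.0, 60.0, 110.0]
call = classify_phenotype(first_sample, pool, references)
```

Expressing every profile relative to the same pooled sample puts the first sample and the reference phenotypes on a common scale before they are compared.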
In a specific embodiment of the above method, said compared values are each ratios of the levels of expression of each of said genes. In another specific embodiment, each of said levels of expression of each of said genes in said pooled sample are normalized prior to any of said comparing steps. In another specific embodiment, normalizing said levels of expression is carried out by dividing each of said levels of expression by the median or mean level of expression of each of said genes or dividing by the mean or median level of expression of one or more housekeeping genes in said pooled sample. In a more specific embodiment, said normalized levels of expression are subjected to a log transform and said comparing steps comprise subtracting said log transform from the log of said levels of expression of each of said genes in said sample from said cell or organism. In another specific embodiment, said at least two different phenotypes are different stages of a disease or disorder. In another specific embodiment, said at least two different phenotypes are different prognoses of a disease or disorder. In yet another specific embodiment, said levels of expression of each of said genes, respectively, in said pooled sample or said levels of expression of each of said genes in a sample from said cell or organism characterized as having said first phenotype, said second phenotype, or said phenotype different from said first and second phenotypes, respectively, are stored on a computer.
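The normalization and log-subtraction embodiments can be sketched as follows. The sketch divides each level by the median across genes, or by the mean of designated housekeeping genes when these are given, then subtracts the log of the normalized pooled level from the log of the sample level. Normalizing the sample the same way as the pool and using base-10 logarithms are assumptions of this sketch; gene names and values are hypothetical.

```python
import math
from statistics import median

def normalize(levels, housekeeping=None):
    """Divide by the median of all genes, or by the housekeeping-gene mean."""
    if housekeeping:
        reference = sum(levels[g] for g in housekeeping) / len(housekeeping)
    else:
        reference = median(levels.values())
    return {gene: value / reference for gene, value in levels.items()}

def log_differences(sample_levels, pool_levels, housekeeping=None):
    """log(sample) minus log(pool), per gene, after normalization."""
    pool_norm = normalize(pool_levels, housekeeping)
    sample_norm = normalize(sample_levels, housekeeping)
    return {gene: math.log10(sample_norm[gene]) - math.log10(pool_norm[gene])
            for gene in pool_norm}

# Hypothetical expression levels for three genes.
pool = {"g1": 50.0, "g2": 100.0, "g3": 200.0}
sample = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
diffs = log_differences(sample, pool)
diffs_hk = log_differences(sample, pool, housekeeping=["g2"])
```

Subtracting logs is equivalent to taking the log of the expression ratio, which is why this embodiment agrees with the ratio-based compared values described in the same paragraph.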
The invention further provides microarrays comprising the disclosed marker sets. In one embodiment, the invention provides a microarray comprising at least 5 markers derived from any one of Tables 1-6, wherein at least 50% of the probes on the microarray are present in any one of Tables 1-6. In more specific embodiments, at least 60%, 70%, 80%, 90%, 95% or 98% of the probes on said microarray are present in any one of Tables 1-6.
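The two composition conditions (a minimum number of markers and a minimum fraction of probes drawn from a marker table) can be checked mechanically. A minimal sketch, assuming probes and markers are identified by simple string IDs; the IDs, the function name, and the choice to count the fraction over all probes on the array are hypothetical.

```python
def meets_composition(array_probes, marker_ids, min_markers=5, min_fraction=0.5):
    """True if the array carries enough markers and enough of its probes are markers."""
    if not array_probes:
        return False
    markers = set(marker_ids)
    hits = [probe for probe in array_probes if probe in markers]
    enough_markers = len(set(hits)) >= min_markers
    enough_fraction = len(hits) / len(array_probes) >= min_fraction
    return enough_markers and enough_fraction

# Hypothetical marker table and two candidate array layouts.
table_markers = {f"marker_{i}" for i in range(1, 11)}
array_ok = [f"marker_{i}" for i in range(1, 7)] + ["control_1", "control_2"]
array_bad = [f"marker_{i}" for i in range(1, 6)] + [f"control_{i}" for i in range(1, 9)]

ok = meets_composition(array_ok, table_markers)    # 6 of 8 probes are markers
bad = meets_composition(array_bad, table_markers)  # 5 of 13 probes are markers
```

Raising `min_fraction` to 0.6, 0.7, and so on covers the more specific embodiments with the same check.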
In another embodiment, the invention provides a microarray for distinguishing ER(+) and ER(-) cell samples comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a plurality of genes, said plurality consisting of at least 5 of the genes corresponding to the markers listed in Table 1 or Table 2, wherein at least 50% of the probes on the microarray are present in any one of Table 1 or Table 2. In yet another embodiment, the invention provides a microarray for distinguishing BRCA1-type and sporadic tumor-type cell samples comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a plurality of genes, said plurality consisting of at least 5 of the genes corresponding to the markers listed in Table 3 or Table 4, wherein at least 50% of the probes on the microarray are present in any one of Table 3 or Table 4.
In still another embodiment, the invention provides a microarray for distinguishing cell samples from patients having a good prognosis and cell samples from patients having a poor prognosis comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a plurality of genes, said plurality consisting of at least 5 of the genes corresponding to the markers listed in Table 5 or Table 6, wherein at least 50% of the probes on the microarray are present in any one of Table 5 or Table 6.
The invention further provides for microarrays comprising at least 5, 20, 50, 100, 200, 500, 1,000, 1,250, 1,500, 1,750, or 2,000 of the ER-status marker genes listed in Table 1, at least 5, 20, 50, 100, 200, or 300 of the BRCA1/sporadic marker genes listed in Table 3, or at least 5, 20, 50, 100 or 200 of the prognostic marker genes listed in Table 5, in any combination, wherein at least 50%, 60%, 70%, 80%, 90%, 95% or 98% of the probes on said microarrays are present in Table 1, Table 3 and/or Table 5.
The invention further provides a kit for determining the ER-status of a sample, comprising at least two microarrays each comprising at least 5 of the markers listed in Table 1, and a computer system for determining the similarity of the level of nucleic acid derived from the markers listed in Table 1 in a sample to that in an ER(-) pool and an ER(+) pool, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising computing the aggregate differences in expression of each marker between the sample and ER(-) pool and the aggregate differences in expression of each marker between the sample and ER(+) pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the ER(-) and ER(+) pools, said correlation calculated according to Equation (4). The invention provides for kits able to distinguish BRCA1 and sporadic tumors, and samples from patients with good prognosis from samples from patients with poor prognosis, by inclusion of the appropriate marker gene sets.
The invention further provides a kit for determining whether a sample is derived from a patient having a good prognosis or a poor prognosis, comprising at least one microarray comprising probes to at least 5 of the genes corresponding to the markers listed in Table 5, and a computer readable medium having recorded thereon one or more programs for determining the similarity of the level of nucleic acid derived from the markers listed in Table 5 in a sample to that in a pool of samples derived from individuals having a good prognosis and a pool of samples derived from individuals having a poor prognosis, wherein the one or more programs cause a computer to perform a method comprising computing the aggregate differences in expression of each marker between the sample and the good prognosis pool and the aggregate differences in expression of each marker between the sample and the poor prognosis pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the good prognosis and poor prognosis pools, said correlation calculated according to Equation (3).
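The two similarity measures named above (aggregate differences and correlation) can be sketched as follows. Because the referenced Equations (3) and (4) are not reproduced in this section, the sketch assumes a Pearson correlation coefficient and a sum of absolute differences; these choices, the function names, and the example log-ratio values are illustrative assumptions rather than the method as claimed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def aggregate_difference(sample, template):
    """Sum of absolute per-marker differences (an assumed aggregate measure)."""
    return sum(abs(s - t) for s, t in zip(sample, template))


def classify(sample, good_template, poor_template):
    """Assign the sample to the pool whose profile it correlates with best."""
    r_good = pearson(sample, good_template)
    r_poor = pearson(sample, poor_template)
    label = "good" if r_good > r_poor else "poor"
    return label, r_good, r_poor


# Hypothetical log-ratio profiles over four markers.
good_pool = [0.5, -0.4, 0.3, -0.2]
poor_pool = [-0.5, 0.4, -0.3, 0.2]
patient = [0.4, -0.3, 0.35, -0.1]
label, r_good, r_poor = classify(patient, good_pool, poor_pool)
```

The same functions could be applied to ER(+)/ER(-) pools or BRCA1/sporadic pools by substituting the corresponding marker set and pool profiles.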

4. BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 is a Venn-type diagram showing the overlap between the marker sets disclosed herein, including the 2,460 ER markers, the 430 BRCA1/sporadic markers, and the 231 prognosis reporters.
FIG. 2 shows the experimental procedures for measuring differential changes in mRNA transcript abundance in breast cancer tumors used in this study. In each experiment, Cy5-labeled cRNA from one tumor X is hybridized on a 25k human microarray together with a Cy3-labeled cRNA pool made of cRNA samples from tumors 1, 2, ... N.
The digital expression data were obtained by scanning and image processing.
The error modeling allowed us to assign a p-value to each transcript ratio measurement.
FIG. 3 Two-dimensional clustering reveals two distinctive types of tumors.
The clustering was based on the gene expression data of 98 breast cancer tumors over 4986 significant genes. Dark gray (red) represents up-regulation, light gray (green) represents down-regulation, black indicates no change in expression, and gray indicates that data is not available. 4986 genes were selected that showed a more than two fold change in expression ratios in more than five experiments. Selected clinical data for test results of BRCA1 mutations, estrogen receptor (ER), and progesterone receptor (PR), tumor grade, lymphocytic infiltrate, and angioinvasion are shown at right. Black denotes negative and white denotes positive. The dominant pattern in the lower part consists of 36 patients, out of which 34 are ER-negative (total 39), and 16 are BRCA1-mutation carriers (total 18).
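The gene selection criterion stated for FIG. 3 (more than a two fold change in more than five experiments) can be sketched as a simple filter. The sketch assumes that the input holds linear expression ratios per experiment and that a change in either direction (up or down) counts; the function name `significant_genes` and the example data are hypothetical.

```python
import math


def significant_genes(ratios, min_fold=2.0, min_experiments=5):
    """Select genes whose ratio changes by more than min_fold, in either
    direction, in more than min_experiments experiments.

    ratios: dict mapping gene name -> list of linear expression ratios.
    """
    threshold = math.log10(min_fold)
    selected = []
    for gene, values in ratios.items():
        hits = sum(
            1 for r in values if r > 0 and abs(math.log10(r)) > threshold
        )
        if hits > min_experiments:
            selected.append(gene)
    return selected


# Hypothetical ratios over ten experiments.
example = {
    "g1": [2.5] * 6 + [1.0] * 4,  # up-regulated in six experiments
    "g2": [2.5] * 5 + [1.0] * 5,  # only five experiments, not selected
    "g3": [0.3] * 6 + [1.0] * 4,  # down-regulated in six experiments
}
chosen = significant_genes(example)
```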
FIG. 4 A portion of unsupervised clustered results as shown in FIG. 3.
ESR1 (the estrogen receptor gene) is co-regulated with a set of strongly co-regulated genes that form a dominant pattern.
FIG. 5A Histogram of correlation coefficients of significant genes between their expression ratios and estrogen-receptor (ER) status (i.e., ER level).
The histogram for experimental data is shown as a gray line. The results of one Monte-Carlo trial are shown in solid black. There are 2,460 genes whose expression data correlate with ER status at a level higher than 0.3 or anti-correlate with ER status at a level lower than -0.3.
FIG. 5B The distribution of the number of genes that satisfied the same selection criteria (amplitude of correlation above 0.3) from 10,000 Monte-Carlo runs. It is estimated that this set of 2,460 genes reports ER status at a confidence level of p > 99.99%.
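The Monte-Carlo control described for FIGS. 5A and 5B can be sketched as a label-permutation test. The sketch assumes that each Monte-Carlo run shuffles the ER labels across samples and recounts the genes whose correlation amplitude exceeds the cutoff; the use of a Pearson coefficient, the function names, and the toy data are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def count_correlated(expr, labels, cutoff=0.3):
    """Number of genes whose correlation amplitude with labels exceeds cutoff."""
    return sum(
        1 for values in expr.values() if abs(pearson(values, labels)) > cutoff
    )


def monte_carlo_p(expr, labels, cutoff=0.3, runs=1000, seed=0):
    """Fraction of label shuffles yielding at least the observed gene count."""
    rng = random.Random(seed)
    observed = count_correlated(expr, labels, cutoff)
    shuffled = list(labels)
    exceed = 0
    for _ in range(runs):
        rng.shuffle(shuffled)
        if count_correlated(expr, shuffled, cutoff) >= observed:
            exceed += 1
    return observed, exceed / runs


# Toy data: ten samples, three genes (correlated, anti-correlated, constant).
er_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
expr = {
    "a": [5, 6, 5, 6, 5, 1, 2, 1, 2, 1],
    "b": [1, 2, 1, 2, 1, 5, 6, 5, 6, 5],
    "c": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
}
observed, p_value = monte_carlo_p(expr, er_labels, runs=200, seed=1)
```

A small fraction returned by `monte_carlo_p` would suggest that the observed number of correlated genes is unlikely to arise by chance under this shuffling model.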
FIG. 6 Classification Type 1 and Type 2 error rates as a function of the number of marker genes (out of 2,460) used in the classifier. The combined error rate is lowest when approximately 550 marker genes are used.

FIG. 7 Classification of 98 tumor samples as ER(+) or ER(-) based on expression levels of the 550 optimal marker genes. ER(+) samples (above white line) exhibit a clearly different expression pattern than ER(-) samples (below white line).
FIG. 8 Correlation between expression levels in samples from each patient and the average profile of the ER(-) group vs. correlation with the ER(+) group. Squares represent samples from clinically ER(-) patients; dots represent samples from clinically ER(+) patients.
FIG. 9A Histogram of correlation coefficients of gene expression ratio of each significant gene with the BRCA1 mutation status is shown as a solid line.
The dashed line indicates a frequency distribution obtained from one Monte-Carlo run. 430 genes exhibited an amplitude of correlation or anti-correlation greater than 0.35.
FIG. 9B Frequency distribution of the number of genes that exhibit an amplitude of correlation or anti-correlation greater than 0.35 for the 10,000 Monte-Carlo run control. Mean = 115. p(n > 430) = 0.48% and p(n > 430/2) = 9.0%.
FIG. 10 Classification type 1 and type 2 error rates as a function of the number of discriminating genes used in the classifier (template). The combined error rate is lowest when approximately 100 discriminating marker genes are used.
FIG. 11A The classification of 38 tumors in the ER(-) group into two subgroups, BRCA1 and sporadic, by using the optimal set of 100 discriminating marker genes. Patients above the white line are characterized by BRCA1-related patterns.
FIG. 11B Correlation between expression levels in samples from each ER(-) patient and the average profile of the BRCA1 group vs. correlation with the sporadic group.
Squares represent samples from patients with sporadic-type tumors; dots represent samples from patients carrying the BRCA1 mutation.
FIG. 12A Histogram of correlation coefficients of gene expression ratio of each significant gene with the prognostic category (distant metastases group and no distant metastases group) is shown as a solid line. The distribution obtained from one Monte-Carlo run is shown as a dashed line. The amplitude of correlation or anti-correlation of 231 marker genes is greater than 0.3.
FIG. 12B Frequency distribution of the number of genes whose amplitude of correlation or anti-correlation was greater than 0.3 for 10,000 Monte-Carlo runs.
FIG. 13 The distant metastases group classification error rate for type 1 and type 2 as a function of the number of discriminating genes used in the classifier. The combined error rate is lowest when approximately 70 discriminating marker genes are used.

FIG. 14 Classification of 78 sporadic tumors into two prognostic groups, distant metastases (poor prognosis) and no distant metastases (good prognosis) using the optimal set of 70 discriminating marker genes. Patients above the white line are characterized by good prognosis. Patients below the white line are characterized by poor prognosis.
FIG. 15 Correlation between expression levels in samples from each patient and the average profile of the good prognosis group vs. correlation with the poor prognosis group. Squares represent samples from patients having a poor prognosis; dots represent samples from patients having a good prognosis. Red squares represent the 'reoccurred' patients and the blue dots represent the 'non-reoccurred'. A total of 13 out of 78 were mis-classified.
FIG. 16 The reoccurrence probability as a function of time since diagnosis.
Group A and group B were predicted by using a leave-one-out method based on the optimal set of 70 discriminating marker genes. The 43 patients in group A consist of 37 patients from the no distant metastases group and 6 patients from the distant metastases group. The 35 patients in group B consist of 28 patients from the distant metastases group and 7 patients from the no distant metastases group.
FIG. 17 The distant metastases probability as a function of time since diagnosis for ER(+) (yes) or ER(-) (no) individuals.
FIG. 18 The distant metastases probability as a function of time since diagnosis for progesterone receptor (PR)(+) (yes) or PR(-) (no) individuals.
FIG. 19A, B The distant metastases probability as a function of time since diagnosis. Groups were defined by the tumor grades.
FIG. 20A Classification of 19 independent sporadic tumors into two prognostic groups, distant metastases and no distant metastases, using the 70 optimal marker genes. Patients above the white line have a good prognosis. Patients below the white line have a poor prognosis.
FIG. 20B Correlation between expression ratios of each patient and the average expression ratio of the good prognosis group as defined by the training set versus the correlation between expression ratios of each patient and the average expression ratio of the poor prognosis training set. Of nine patients in the good prognosis group, three are from the "distant metastases group"; of ten patients in the poor prognosis group, one patient is from the "no distant metastases group". This error rate of 4 out of 19 is consistent with 13 out of 78 for the initial 78 patients.

FIG. 20C The reoccurrence probability as a function of time since diagnosis for two groups predicted based on expression of the optimal 70 marker genes.
FIG. 21A Sensitivity vs. 1-specificity for good prognosis classification.
FIG. 21B Sensitivity vs. 1-specificity for poor prognosis classification.
FIG. 21C Total error rate as a function of threshold on the modeled likelihood. Six clinical parameters (ER status, PR status, tumor grade, tumor size, patient age, and presence or absence of angioinvasion) were used to perform the clinical modeling.
FIG. 22 Comparison of the log(ratio) of individual samples using the "material sample pool" vs. mean subtracted log(intensity) using the "mathematical sample pool" for 70 reporter genes in the 78 sporadic tumor samples. The "material sample pool" was constructed from the 78 sporadic tumor samples.
FIG. 23A Results of the "leave one out" cross validation based on single channel data. Samples are grouped according to each sample's coefficient of correlation to the average "good prognosis" profile and "poor prognosis" profile for the 70 genes examined. The white line separates samples from patients classified as having poor prognoses (below) and good prognoses (above).
FIG. 23B Scatter plot of coefficients of correlation to the average expression in "good prognosis" samples and "poor prognosis" samples. The false positive rate (i.e., the rate of incorrectly classifying a sample from a patient having a good prognosis as being from a patient having a poor prognosis) was 10 out of 44, and the false negative rate was 6 out of 34.
FIG. 24A Single-channel hybridization data for samples ranked according to the coefficients of correlation with the good prognosis classifier. Samples classified as "good prognosis" lie above the white line, and those classified as "poor prognosis" lie below.
FIG. 24B Scatterplot of sample correlation coefficients, with three incorrectly classified samples lying to the right of the threshold correlation coefficient value.
The threshold correlation value was set at 0.2727 to limit the false negatives to approximately 10% of the samples.
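One way a threshold could be chosen to cap false negatives at roughly 10% is sketched below. The sketch assumes that a false negative is a poor-prognosis sample whose correlation with the good prognosis classifier lies strictly above the threshold, and that the threshold is picked from the observed correlation values; the function name and the example values are hypothetical, and this is not asserted to be the procedure that produced the 0.2727 value.

```python
import math


def threshold_for_false_negatives(poor_correlations, max_fraction=0.10):
    """Pick a cutoff so that at most max_fraction of poor-prognosis samples
    have a correlation strictly above it (and would be called 'good').

    poor_correlations: non-empty list of correlation coefficients between
    each poor-prognosis sample and the good prognosis classifier.
    """
    ordered = sorted(poor_correlations, reverse=True)
    allowed = math.floor(max_fraction * len(ordered))
    return ordered[allowed]


# Hypothetical correlations for ten poor-prognosis samples.
poor = [0.9, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.3]
cutoff = threshold_for_false_negatives(poor)
false_negatives = sum(1 for c in poor if c > cutoff)
```

Lowering `max_fraction` raises the cutoff, which reduces false negatives at the cost of classifying more good-prognosis samples as poor prognosis.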
5. DETAILED DESCRIPTION OF THE INVENTION
5.1 INTRODUCTION
The invention relates to sets of genetic markers whose expression patterns correlate with important characteristics of breast cancer tumors, i.e., estrogen receptor (ER) status, BRCA1 status, and the likelihood of relapse (i.e., distant metastasis or poor prognosis). More specifically, the invention provides for sets of genetic markers that can distinguish the following three clinical conditions. First, the invention relates to sets of markers whose expression correlates with the ER status of a patient, and which can be used to distinguish ER(+) from ER(-) patients. ER status is a useful prognostic indicator, and an indicator of the likelihood that a patient will respond to certain therapies, such as tamoxifen. Also, among women who are ER positive the response rate (over 50%) to hormonal therapy is much higher than the response rate (less than 10%) in patients whose ER status is negative. In patients with ER positive tumors the possibility of achieving a hormonal response is directly proportional to the level of ER (P. Calabresi and P.S. Schein, MEDICAL ONCOLOGY (2ND ED.), McGraw-Hill, Inc., New York (1993)). Second, the invention further relates to sets of markers whose expression correlates with the presence of BRCA1 mutations, and which can be used to distinguish BRCA1-type tumors from sporadic tumors. Third, the invention relates to genetic markers whose expression correlates with clinical prognosis, and which can be used to distinguish patients having good prognoses (i.e., no distant metastases of a tumor within five years) from patients having poor prognoses (i.e., distant metastases of a tumor within five years). Methods are provided for use of these markers to distinguish between these patient groups, and to determine general courses of treatment. Microarrays comprising these markers are also provided, as well as methods of constructing such microarrays. Each marker corresponds to a gene in the human genome, i.e., such marker is identifiable as all or a portion of a gene. Finally, because each of the above markers correlates with a certain breast cancer-related condition, the markers, or the proteins they encode, are likely to be targets for drugs against breast cancer.
5.2 DEFINITIONS
As used herein, "BRCA1 tumor" means a tumor having cells containing a mutation of the BRCA1 locus.
The "absolute amplitude" of correlation expressions means the distance, either positive or negative, from a zero value; i.e., both correlation coefficients -0.35 and 0.35 have an absolute amplitude of 0.35.
"Status" means a state of gene expression of a set of genetic markers whose expression is strongly correlated with a particular phenotype. For example, "ER status"
means a state of gene expression of a set of genetic markers whose expression is strongly correlated with that of ESR1 (estrogen receptor gene), wherein the pattern of these genes' expression differs detectably between tumors expressing the receptor and tumors not expressing the receptor.

"Good prognosis" means that a patient is expected to have no distant metastases of a breast tumor within five years of initial diagnosis of breast cancer.
"Poor prognosis" means that a patient is expected to have distant metastases of a breast tumor within five years of initial diagnosis of breast cancer.
"Marker" means an entire gene, or an EST derived from that gene, the expression or level of which changes between certain conditions. Where the expression of the gene correlates with a certain condition, the gene is a marker for that condition.
"Marker-derived polynucleotides" means the RNA transcribed from a marker gene, any cDNA or cRNA produced therefrom, and any nucleic acid derived therefrom, such as synthetic nucleic acid having a sequence derived from the gene corresponding to the marker gene.
5.3 MARKERS USEFUL IN DIAGNOSIS AND PROGNOSIS OF BREAST CANCER
5.3.1 MARKER SETS
The invention provides a set of 4,986 genetic markers whose expression is correlated with the existence of breast cancer by clustering analysis. A subset of these markers identified as useful for diagnosis or prognosis is listed as SEQ ID NOS: 1-2,699.
The invention also provides a method of using these markers to distinguish tumor types in diagnosis or prognosis.
In one embodiment, the invention provides a set of 2,460 genetic markers that can classify breast cancer patients by estrogen receptor (ER) status; i.e., distinguish between ER(+) and ER(-) patients or tumors derived from these patients. ER status is an important indicator of the likelihood of a patient's response to some chemotherapies (i.e., tamoxifen). These markers are listed in Table 1. The invention also provides subsets of at least 5, 10, 25, 50, 100, 200, 300, 400, 500, 750, 1,000, 1,250, 1,500, 1,750 or 2,000 genetic markers, drawn from the set of 2,460 markers, which also distinguish ER(+) and ER(-) patients or tumors. Preferably, the number of markers is 550. The invention further provides a set of 550 of the 2,460 markers that are optimal for distinguishing ER status (Table 2). The invention also provides a method of using these markers to distinguish between ER(+) and ER(-) patients or tumors derived therefrom.
In another embodiment, the invention provides a set of 430 genetic markers that can classify ER(-) breast cancer patients by BRCA1 status; i.e., distinguish between tumors containing a BRCA1 mutation and sporadic tumors. These markers are listed in Table 3. The invention further provides subsets of at least 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250, 300 or 350 markers, drawn from the set of 430 markers, which also distinguish between tumors containing a BRCA1 mutation and sporadic tumors. Preferably, the number of markers is 100. A preferred set of 100 markers is provided in Table 4. The invention also provides a method of using these markers to distinguish between BRCA1 and sporadic patients or tumors derived therefrom.
In another embodiment, the invention provides a set of 231 genetic markers that can distinguish between patients with a good breast cancer prognosis (no breast cancer tumor distant metastases within five years) and patients with a poor breast cancer prognosis (tumor distant metastases within five years). These markers are listed in Table 5. The invention also provides subsets of at least 5, 10, 20, 30, 40, 50, 75, 100, 150 or 200 markers, drawn from the set of 231, which also distinguish between patients with good and poor prognosis. A preferred set of 70 markers is provided in Table 6. In a specific embodiment, the set of markers consists of the twelve kinase-related markers and the seven cell division- or mitosis-related markers listed. The invention also provides a method of using the above markers to distinguish between patients with good or poor prognosis.

Table 1. 2,460 gene markers that distinguish ER(+) and ER(-) cell samples.
GenBank Accession Number  SEQ ID NO  GenBank Accession Number  SEQ ID NO
AB006628  SEQ ID NO 9  NM_007057  SEQ ID NO 1352
AB007916  SEQ ID NO 19  NM_007214  SEQ ID NO 1363
AB007950  SEQ ID NO 20  NM_007217  SEQ ID NO 1364
AB014568  SEQ ID NO 30  NM_009585  SEQ ID NO 1375
AB018260  SEQ ID NO 31  NM_009587  SEQ ID NO 1376
AB020689  SEQ ID NO 37  NM_012105  SEQ ID NO 1381
AB033058  SEQ ID NO 67  NM_012446  SEQ ID NO 1414
AB037791  SEQ ID NO 79  NM_013277  SEQ ID NO 1426
AJ272057  SEQ ID NO  NM_014935  SEQ ID NO 1550
NM_000095  SEQ ID NO 429  NM_018326  SEQ ID NO 1772
NM_001280  SEQ ID NO 588  Y14737  SEQ ID NO 1932
NM_001321  SEQ ID NO 594  Contig29_RC  SEQ ID NO 1939
NM_001327  SEQ ID NO 595  Contig237_RC  SEQ ID NO 1940
NM_001329  SEQ ID NO 596  Contig263_RC  SEQ ID NO 1941
NM_001333  SEQ ID NO 597  Contig292_RC  SEQ ID NO 1942
NM_001338  SEQ ID NO 598  Contig382_RC  SEQ ID NO 1944
NM_001360  SEQ ID NO 599  Contig399_RC  SEQ ID NO 1945
NM_001363  SEQ ID NO 600  Contig448_RC  SEQ ID NO 1946
NM_001381  SEQ ID NO 601  Contig569_RC  SEQ ID NO 1947
NM_001394  SEQ ID NO 602  Contig580_RC  SEQ ID NO 1948
NM_001395  SEQ ID NO 603  Contig678_RC  SEQ ID NO 1949
NM_001419  SEQ ID NO 604  Contig706_RC  SEQ ID NO 1950
NM_001424  SEQ ID NO 605  Contig718_RC  SEQ ID NO 1951
NM_001428  SEQ ID NO 606  Contig719_RC  SEQ ID NO 1952
NM_001436  SEQ ID NO 607  Contig742_RC  SEQ ID NO 1953
NM_001444  SEQ ID NO 608  Contig753_RC  SEQ ID NO 1954
NM_001446  SEQ ID NO 609  Contig758_RC  SEQ ID NO 1956
NM_001453  SEQ ID NO 611  Contig760_RC  SEQ ID NO 1957
NM_001456  SEQ ID NO 612  Contig842_RC  SEQ ID NO 1958
NM_001457  SEQ ID NO 613  Contig848_RC  SEQ ID NO 1959
NM_001463  SEQ ID NO 614  Contig924_RC  SEQ ID NO 1960
NM_001465  SEQ ID NO 615  Contig974_RC  SEQ ID NO 1961
NM_001481  SEQ ID NO 616  Contig1018_RC  SEQ ID NO 1962
NM_001493  SEQ ID NO 617  Contig1056_RC  SEQ ID NO 1963
NM_001494  SEQ ID NO 618  Contig1061_RC  SEQ ID NO 1964
NM_001500  SEQ ID NO 619  Contig1129_RC  SEQ ID NO 1965
NM_001504  SEQ ID NO 620  Contig1148  SEQ ID NO 1966
NM_001511  SEQ ID NO 621  Contig1239_RC  SEQ ID NO 1967
NM_001513  SEQ ID NO 622  Contig1277  SEQ ID NO 1968
NM_001527  SEQ ID NO 623  Contig1333_RC  SEQ ID NO 1969
NM_001529  SEQ ID NO 624  Contig1386_RC  SEQ ID NO 1970
NM_001530  SEQ ID NO 625  Contig1389_RC  SEQ ID NO 1971
NM_001540  SEQ ID NO 626  Contig1418_RC  SEQ ID NO 1972
NM_001550  SEQ ID NO 627  Contig1462_RC  SEQ ID NO 1973
NM_001551  SEQ ID NO 628  Contig1505_RC  SEQ ID NO 1974
NM_001552  SEQ ID NO 629  Contig1540_RC  SEQ ID NO 1975
NM_001554  SEQ ID NO 631  Contig1584_RC  SEQ ID NO 1976
NM_001558  SEQ ID NO 632  Contig1632_RC  SEQ ID NO 1977
NM_001560  SEQ ID NO 633  Contig1682_RC  SEQ ID NO 1978
NM_001565  SEQ ID NO 634  Contig1778_RC  SEQ ID NO 1979
NM_001569  SEQ ID NO 635  Contig1829  SEQ ID NO 1981
NM_001605  SEQ ID NO 636  Contig1838_RC  SEQ ID NO 1982
NM_001609  SEQ ID NO 637  Contig1938_RC  SEQ ID NO 1983
NM_001615  SEQ ID NO 638  Contig1970_RC  SEQ ID NO 1984
NM_001623  SEQ ID NO 639  Contig1998_RC  SEQ ID NO 1985
NM_001627  SEQ ID NO 640  Contig2099_RC  SEQ ID NO 1986
NM_001628  SEQ ID NO 641  Contig2143_RC  SEQ ID NO 1987
NM_001630  SEQ ID NO 642  Contig2237_RC  SEQ ID NO 1988
NM_001634  SEQ ID NO 643  Contig2429_RC  SEQ ID NO 1990
NM_001656  SEQ ID NO 644  Contig2504_RC  SEQ ID NO 1991
NM_001673  SEQ ID NO 645  Contig2512_RC  SEQ ID NO 1992
NM_001675  SEQ ID NO 647  Contig2575_RC  SEQ ID NO 1993
NM_001679  SEQ ID NO 648  Contig2578_RC  SEQ ID NO 1994
NM_001689  SEQ ID NO 649  Contig2639_RC  SEQ ID NO 1995
NM_001703  SEQ ID NO 650  Contig2647_RC  SEQ ID NO 1996
NM_001710  SEQ ID NO 651  Contig2657_RC  SEQ ID NO 1997
NM_001725  SEQ ID NO 652  Contig2728_RC  SEQ ID NO 1998
NM_001730  SEQ ID NO 653  Contig2745_RC  SEQ ID NO 1999
NM_001733  SEQ ID NO 654  Contig2811_RC  SEQ ID NO 2000
NM_001734  SEQ ID NO 655  Contig2873_RC  SEQ ID NO 2001
NM_001740  SEQ ID NO 656  Contig2883_RC  SEQ ID NO 2002
NM_001745  SEQ ID NO 657  Contig2915_RC  SEQ ID NO 2003
NM_001747  SEQ ID NO 658  Contig2928_RC  SEQ ID NO 2004
NM_001756  SEQ ID NO 659  Contig3024_RC  SEQ ID NO 2005
NM_001757  SEQ ID NO 660  Contig3094_RC  SEQ ID NO 2006
NM_001758  SEQ ID NO 661  Contig3164_RC  SEQ ID NO 2007
NM_001762  SEQ ID NO 662  Contig3495_RC  SEQ ID NO 2009
NM_001767  SEQ ID NO 663  Contig3607_RC  SEQ ID NO 2010
NM_001770  SEQ ID NO 664  Contig3659_RC  SEQ ID NO 2011
NM_001777  SEQ ID NO 665  Contig3677_RC  SEQ ID NO 2012
NM_001778  SEQ ID NO 666  Contig3682_RC  SEQ ID NO 2013
NM_001781  SEQ ID NO 667  Contig3734_RC  SEQ ID NO 2014
NM_001786  SEQ ID NO 668  Contig3834_RC  SEQ ID NO 2015
NM_001793  SEQ ID NO 669  Contig3876_RC  SEQ ID NO 2016
NM_001803  SEQ ID NO 671  Contig3902_RC  SEQ ID NO 2017
NM_001806  SEQ ID NO 672  Contig3940_RC  SEQ ID NO 2018
NM_001809  SEQ ID NO 673  Contig4380_RC  SEQ ID NO 2019
NM_001814  SEQ ID NO 674  Contig4388_RC  SEQ ID NO 2020
NM_001826  SEQ ID NO 675  Contig4467_RC  SEQ ID NO 2021
NM_001830  SEQ ID NO 677  Contig4949_RC  SEQ ID NO 2023
NM_001838  SEQ ID NO 678  Contig5348_RC  SEQ ID NO 2024
NM_001839  SEQ ID NO 679  Contig5403_RC  SEQ ID NO 2025
NM_001853  SEQ ID NO 681  Contig5716_RC  SEQ ID NO 2026
NM_001859  SEQ ID NO 682  Contig6118_RC  SEQ ID NO 2027
NM_001861  SEQ ID NO 683  Contig6164_RC  SEQ ID NO 2028
NM_001874  SEQ ID NO 685  Contig6181_RC  SEQ ID NO 2029
NM_001885  SEQ ID NO 686  Contig6514_RC  SEQ ID NO 2030
NM_001892  SEQ ID NO 688  Contig6612_RC  SEQ ID NO 2031
NM_001897  SEQ ID NO 689  Contig6881_RC  SEQ ID NO 2032
NM_001899  SEQ ID NO 690  Contig8165_RC  SEQ ID NO 2033
NM_001905  SEQ ID NO 691  Contig8221_RC  SEQ ID NO 2034
NM_001912  SEQ ID NO 692  Contig8347_RC  SEQ ID NO 2035
NM_001914  SEQ ID NO 693  Contig8364_RC  SEQ ID NO 2036
NM_001919  SEQ ID NO 694  Contig8888_RC  SEQ ID NO 2038
NM_001941  SEQ ID NO 695  Contig9259_RC  SEQ ID NO 2039
NM_001943  SEQ ID NO 696  Contig9541_RC  SEQ ID NO 2040
NM_001944  SEQ ID NO 697  Contig10268_RC  SEQ ID NO 2041
NM_001953  SEQ ID NO 699  Contig10363_RC  SEQ ID NO 2042
NM_001954  SEQ ID NO 700  Contig10437_RC  SEQ ID NO 2043
NM_001955  SEQ ID NO 701  Contig11086_RC  SEQ ID NO 2045
NM_001956  SEQ ID NO 702  Contig11275_RC  SEQ ID NO 2046
NM_001958  SEQ ID NO 703  Contig11648_RC  SEQ ID NO 2047
NM_001961  SEQ ID NO 705  Contig12216_RC  SEQ ID NO 2048
NM_001970  SEQ ID NO 706  Contig12369_RC  SEQ ID NO 2049
NM_001979  SEQ ID NO 707  Contig12814_RC  SEQ ID NO 2050
NM_001982  SEQ ID NO 708  Contig12951_RC  SEQ ID NO 2051
NM_002017  SEQ ID NO 710  Contig13480_RC  SEQ ID NO 2052
NM_002033  SEQ ID NO 713  Contig14284_RC  SEQ ID NO 2053
NM_002046  SEQ ID NO 714  Contig14390_RC  SEQ ID NO 2054
NM_002047  SEQ ID NO 715  Contig14780_RC  SEQ ID NO 2055
NM_002051  SEQ ID NO 716  Contig14954_RC  SEQ ID NO 2056
NM_002053  SEQ ID NO 717  Contig14981_RC  SEQ ID NO 2057
NM_002061  SEQ ID NO 718  Contig15692_RC  SEQ ID NO 2058
NM_002065  SEQ ID NO 719  Contig16192_RC  SEQ ID NO 2059
NM_002068  SEQ ID NO 720  Contig16759_RC  SEQ ID NO 2061
NM_002077  SEQ ID NO 722  Contig16786_RC  SEQ ID NO 2062
NM_002091  SEQ ID NO 723  Contig16905_RC  SEQ ID NO 2063
NM_002101  SEQ ID NO 724  Contig17103_RC  SEQ ID NO 2064
NM_002106  SEQ ID NO 725  Contig17105_RC  SEQ ID NO 2065
NM_002110  SEQ ID NO 726  Contig17248_RC  SEQ ID NO 2066
NM_002111  SEQ ID NO 727  Contig17345_RC  SEQ ID NO 2067
NM_002115  SEQ ID NO 728  Contig18502_RC  SEQ ID NO 2069
NM_002118  SEQ ID NO 729  Contig20156_RC  SEQ ID NO 2071
NM_002123  SEQ ID NO 730  Contig20302_RC  SEQ ID NO 2073
NM_002131  SEQ ID NO 731  Contig20600_RC  SEQ ID NO 2074
NM_002136  SEQ ID NO 732  Contig20617_RC  SEQ ID NO 2075
NM_002145  SEQ ID NO 733  Contig20629_RC  SEQ ID NO 2076
NM_002164  SEQ ID NO 734  Contig20651_RC  SEQ ID NO 2077
NM_002168  SEQ ID NO 735  Contig21130_RC  SEQ ID NO 2078
NM_002184  SEQ ID NO 736  Contig21185_RC  SEQ ID NO 2079
NM_002185  SEQ ID NO 737  Contig21421_RC  SEQ ID NO 2080
NM_002189  SEQ ID NO 738  Contig21787_RC  SEQ ID NO 2081
NM_002200  SEQ ID NO 739  Contig21812_RC  SEQ ID NO 2082
NM_002201  SEQ ID NO 740  Contig22418_RC  SEQ ID NO 2083
NM_002213  SEQ ID NO 741  Contig23085_RC  SEQ ID NO 2084
NM_002219  SEQ ID NO 742  Contig23454_RC  SEQ ID NO 2085
NM_002222  SEQ ID NO 743  Contig24138_RC  SEQ ID NO 2086
NM_002239  SEQ ID NO 744  Contig24252_RC  SEQ ID NO 2087
NM_002243  SEQ ID NO 745  Contig24655_RC  SEQ ID NO 2089
NM_002245  SEQ ID NO 746  Contig25055_RC  SEQ ID NO 2090
NM_002250  SEQ ID NO 747  Contig25290_RC  SEQ ID NO 2091
NM_002254  SEQ ID NO 748  Contig25343_RC  SEQ ID NO 2092
NM_002266  SEQ ID NO 749  Contig25362_RC  SEQ ID NO 2093
NM_002273  SEQ ID NO 750  Contig25617_RC  SEQ ID NO 2094
NM_002281  SEQ ID NO 751  Contig25659_RC  SEQ ID NO 2095
NM_002292  SEQ ID NO 752  Contig25722_RC  SEQ ID NO 2096
NM_002298  SEQ ID NO 753  Contig25809_RC  SEQ ID NO 2097
NM_002300  SEQ ID NO 754  Contig25991  SEQ ID NO 2098
NM_002308  SEQ ID NO 755  Contig26022_RC  SEQ ID NO 2099
NM_002314  SEQ ID NO 756  Contig26077_RC  SEQ ID NO 2100
NM_002337  SEQ ID NO 757  Contig26310_RC  SEQ ID NO 2101
NM_002341  SEQ ID NO 758  Contig26371_RC  SEQ ID NO 2102
NM_002342  SEQ ID NO 759  Contig26438_RC  SEQ ID NO 2103
NM_002346  SEQ ID NO 760  Contig26706_RC  SEQ ID NO 2104
NM_002349  SEQ ID NO 761  Contig27088_RC  SEQ ID NO 2105
NM_002350  SEQ ID NO 762  Contig27186_RC  SEQ ID NO 2106
NM_002356  SEQ ID NO 763  Contig27228_RC  SEQ ID NO 2107
NM_002358  SEQ ID NO 764  Contig27344_RC  SEQ ID NO 2109
NM_002370  SEQ ID NO 765  Contig27386_RC  SEQ ID NO 2110
NM_002395  SEQ ID NO 766  Contig27624_RC  SEQ ID NO 2111
NM_002416  SEQ ID NO 767  Contig27749_RC  SEQ ID NO 2112
NM_002421  SEQ ID NO 768  Contig27882_RC  SEQ ID NO 2113
NM_002426  SEQ ID NO 769  Contig27915_RC  SEQ ID NO 2114
NM_002435  SEQ ID NO 770  Contig28030_RC  SEQ ID NO 2115
NM_002438  SEQ ID NO 771  Contig28081_RC  SEQ ID NO 2116
NM_002444  SEQ ID NO 772  Contig28152_RC  SEQ ID NO 2117
NM_002449  SEQ ID NO 773  Contig28550_RC  SEQ ID NO 2119
NM_002450  SEQ ID NO 774  Contig28552_RC  SEQ ID NO 2120
NM_002456  SEQ ID NO 775  Contig28712_RC  SEQ ID NO 2121
NM_002466  SEQ ID NO 776  Contig28888_RC  SEQ ID NO 2122
NM_002482  SEQ ID NO 777  Contig28947_RC  SEQ ID NO 2123
NM_002497  SEQ ID NO 778  Contig29126_RC  SEQ ID NO 2124
NM_002510  SEQ ID NO 779  Contig29193_RC  SEQ ID NO 2125
NM_002515  SEQ ID NO 781  Contig29369_RC  SEQ ID NO 2126
NM_002524  SEQ ID NO 782  Contig29639_RC  SEQ ID NO 2127
NM_002539  SEQ ID NO 783  Contig30047_RC  SEQ ID NO 2129
NM_002555  SEQ ID NO 785  Contig30154_RC  SEQ ID NO 2131
NM_002570  SEQ ID NO 787  Contig30209_RC  SEQ ID NO 2132
NM_002579  SEQ ID NO 788  Contig30213_RC  SEQ ID NO 2133
NM_002587  SEQ ID NO 789  Contig30230_RC  SEQ ID NO 2134
NM 002590 SEQ ID NO 790 Contig30267RC SEQ ID NO 2135 NM 002600 SEQ ID NO 791 Contig30390RC SEQ ID NO 2136 NM 002614 SEQ ID NO 792 Contig30480RC SEQ ID NO 2137 NM 002618 SEQ ID NO 794 Contig30609RC SEQ ID NO 2138 NM 002626 SEQ ID NO 795 Contig30934RC SEQ ID NO 2139 NM 002633 SEQ ID NO 796 Contig31150RC SEQ ID NO 2140 NM 002639 SEQ ID NO 797 Contig31186RC SEQ ID NO 2141 NM 002648 SEQ ID NO 798 Contig31251RC SEQ ID NO 2142 NM 002659 SEQ ID NO 799 Contig31288RC SEQ ID NO 2143 NM 002661 SEQ ID NO 800 Contig31291RC SEQ ID NO 2144 NM 002662 SEQ ID NO 801 Contig31295RC SEQ ID NO 2145 NM 002664 SEQ ID NO 802 Contig31424RC SEQ ID NO 2146 NM 002689 SEQ ID NO 804 Contig31449RC SEQ ID NO 2147 NM 002690 SEQ ID NO 805 Contig31596RC SEQ ID NO 2148 NM 002709 SEQ ID NO 806 Contig31864RC SEQ ID NO 2149 NM 002727 SEQ ID NO 807 Contig31928RC SEQ ID NO 2150 NM 002729 SEQ ID NO 808 Contig31966RC SEQ ID NO 2151 NM 002734 SEQ ID NO 809 Contig31986RC SEQ ID NO 2152 NM 002736 SEQ ID NO 810 Contig32084RC SEQ ID NO 2153 NM 002740 SEQ ID NO 811 Contig32105RC SEQ ID NO 2154 NM 002748 SEQ ID NO 813 Contig32185RC SEQ ID NO 2156 NM 002774 SEQ ID NO 814 Contig32242RC SEQ ID NO 2157 NM 002775 SEQ ID NO 815 Contig32322RC SEQ ID NO 2158 NM 002776 SEQ ID NO 816 Contig32336RC SEQ ID NO 2159 NM 002789 SEQ ID NO 817 Contig32558RC SEQ ID NO 2160 NM 002794 SEQ ID NO 818 Contig32798RC SEQ ID NO 2161 NM 002796 SEQ ID NO 819 Contig33005RC SEQ ID NO 2162 NM 002800 SEQ ID NO 820 Contig33230RC SEQ ID NO 2163 NM 002801 SEQ ID NO 821 Contig33260RC SEQ ID NO 2164 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 002808 SEQ ID NO 822 Contig33654RC SEQ ID NO 2166 NM 002821 SEQ ID NO 824 Contig33741RC SEQ ID NO 2167 NM 002826 SEQ ID NO 825 Contig33771RC SEQ ID NO 2168 NM 002827 SEQ ID NO 826 Contig33814RC SEQ ID NO 2169 NM 002838 SEQ ID NO 827 Contig33815RC SEQ ID NO 2170 NM 002852 SEQ ID NO 828 Contig33833 SEQ ID NO 2171 NM 002854 SEQ ID NO 829 Contig33998RC SEQ ID NO 2172 NM 002856 SEQ ID NO 830 Contig34079 SEQ ID NO 2173 NM 002857 SEQ ID NO 831 Contig34080RC SEQ ID NO 2174 NM 002858 SEQ ID NO 832 Contig34222RC SEQ ID NO 2175 NM 002888 SEQ ID NO 833 Contig34233 SEQ ID NO 2176 RC

NM_002890 SEQ ID NO 834 Contig34303_RC SEQ ID NO 2177
NM_002901 SEQ ID NO 836 Contig34393_RC SEQ ID NO 2178
NM_002906 SEQ ID NO 837 Contig34477_RC SEQ ID NO 2179
NM_002916 SEQ ID NO 838 Contig34766_RC SEQ ID NO 2181
NM_002923 SEQ ID NO 839 Contig34952 SEQ ID NO 2182
NM_002933 SEQ ID NO 840 Contig34989_RC SEQ ID NO 2183
NM_002936 SEQ ID NO 841 Contig35030_RC SEQ ID NO 2184
NM_002937 SEQ ID NO 842 Contig35251_RC SEQ ID NO 2185
NM_002950 SEQ ID NO 843 Contig35629_RC SEQ ID NO 2186
NM_002961 SEQ ID NO 844 Contig35635_RC SEQ ID NO 2187
NM_002964 SEQ ID NO 845 Contig35763_RC SEQ ID NO 2188
NM_002965 SEQ ID NO 846 Contig35814_RC SEQ ID NO 2189
NM_002966 SEQ ID NO 847 Contig35896_RC SEQ ID NO 2190
NM_002982 SEQ ID NO 849 Contig35976_RC SEQ ID NO 2191

NM 002983 SEQ ID NO 850 Contig36042RC SEQ ID NO 2192 NM 002984 SEQ ID NO 851 Contig36081RC SEQ ID NO 2193 NM 002985 SEQ ID NO 852 Contig36152RC SEQ ID NO 2194 NM 002988 SEQ ID NO 853 Contig36193RC SEQ ID NO 2195 NM 002996 SEQ ID NO 854 Contig36312RC SEQ ID NO 2196 NM 002997 SEQ ID NO 855 Contig36323RC SEQ ID NO 2197 NM 002999 SEQ ID NO 856 Contig36339RC SEQ ID NO 2198 NM 003012 SEQ ID NO 857 Contig36647RC SEQ ID NO 2199 NM 003022 SEQ ID NO 858 Contig36744RC SEQ ID NO 2200 NM 003034 SEQ ID NO 859 Contig36761RC SEQ ID NO 2201 NM 003035 SEQ ID NO 860 Contig36879RC SEQ ID NO 2202 NM 003039 SEQ ID NO 861 Contig36900RC SEQ ID NO 2203 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 003051 SEQ ID NO 862 Contig37015RC SEQ ID NO 2204 NM 003064 SEQ ID NO 863 Contig37024RC SEQ ID NO 2205 NM 003066 SEQ ID NO 864 Contig37072RC SEQ ID NO 2207 NM 003088 SEQ ID NO 865 Contig37140RC SEQ ID NO 2208 NM 003090 SEQ ID NO 866 Contig37141RC SEQ ID NO 2209 NM 003096 SEQ ID NO 867 Contig37204RC SEQ ID NO 2210 NM 003099 SEQ ID NO 868 Contig37281RC SEQ ID NO 2211 NM 003102 SEQ ID NO 869 Contig37287RC SEQ ID NO 2212 NM 003104 SEQ ID NO 870 Contig37439RC SEQ ID NO 2213 NM 003108 SEQ ID NO 871 Contig37562RC SEQ ID NO 2214 NM 003121 SEQ ID NO 873 Contig37571RC SEQ ID NO 2215 NM 003134 SEQ ID NO 874 Contig37598 SEQ ID NO 2216 NM 003137 SEQ ID NO 875 Contig37758RC SEQ ID NO 2217 NM 003144 SEQ ID NO 876 Contig37778RC SEQ ID NO 2218 NM 003146 SEQ ID NO 877 Contig37884RC SEQ ID NO 2219 NM 003149 SEQ ID NO 878 Contig37946RC SEQ ID NO 2220 NM 003151 SEQ ID NO 879 Contig38170RC SEQ ID NO 2221 NM 003157 SEQ ID NO 880 Contig38288RC SEQ ID NO 2223 NM 003158 SEQ ID NO 881 Contig38398RC SEQ ID NO 2224 NM 003165 SEQ ID NO 882 Contig38580RC SEQ ID NO 2226 NM 003172 SEQ ID NO 883 Contig38630RC SEQ ID NO 2227 NM 003177 SEQ ID NO 884 Contig38652RC SEQ ID NO 2228 NM 003197 SEQ ID NO 885 Contig38683RC SEQ ID NO 2229 NM 003202 SEQ ID NO 886 Contig38726RC SEQ ID NO 2230 NM 003213 SEQ ID NO 887 Contig38791RC SEQ ID NO 2231 NM 003217 SEQ ID NO 888 Contig38901RC SEQ ID NO 2232 NM 003225 SEQ ID NO 889 Contig38983RC SEQ ID NO 2233 NM 003226 SEQ ID NO 890 Contig39090RC SEQ ID NO 2234 NM 003236 SEQ ID NO 892 Contig39132RC SEQ ID NO 2235 NM 003239 SEQ ID NO 893 Contig39157_RC SEQ ID NO 2236 30 NM 003248 SEQ ID NO 894 Contig39226RC SEQ ID NO 2237 NM 003255 SEQ ID NO 895 Contig39285RC SEQ ID NO 2238 NM 003258 SEQ ID NO 896 Contig39556RC SEQ ID NO 2239 NM 003264 SEQ ID NO 897 Contig39591RC SEQ ID NO 2240 NM 003283 SEQ ID NO 898 Contig39826RC SEQ ID NO 2241 35 NM 003318 SEQ ID NO 899 Contig39845RC SEQ ID NO 2242 NM 003329 SEQ ID NO 900 Contig39891RC SEQ ID NO 2243 
NM_003332 SEQ ID NO 901 Contig39922_RC SEQ ID NO 2244
NM_003358 SEQ ID NO 902 Contig39960_RC SEQ ID NO 2245
NM_003359 SEQ ID NO 903 Contig40026_RC SEQ ID NO 2246
NM_003360 SEQ ID NO 904 Contig40121_RC SEQ ID NO 2247
NM_003368 SEQ ID NO 905 Contig40128_RC SEQ ID NO 2248
NM_003376 SEQ ID NO 906 Contig40146 SEQ ID NO 2249
NM_003380 SEQ ID NO 907 Contig40208_RC SEQ ID NO 2250
NM_003392 SEQ ID NO 908 Contig40212_RC SEQ ID NO 2251
NM_003412 SEQ ID NO 909 Contig40238_RC SEQ ID NO 2252
NM_003430 SEQ ID NO 910 Contig40434_RC SEQ ID NO 2253
NM_003462 SEQ ID NO 911 Contig40446_RC SEQ ID NO 2254
NM_003467 SEQ ID NO 912 Contig40500_RC SEQ ID NO 2255
NM_003472 SEQ ID NO 913 Contig40573_RC SEQ ID NO 2256
NM_003479 SEQ ID NO 914 Contig40813_RC SEQ ID NO 2258
NM_003489 SEQ ID NO 915 Contig40816_RC SEQ ID NO 2259
NM_003494 SEQ ID NO 916 Contig40845_RC SEQ ID NO 2261
NM_003498 SEQ ID NO 917 Contig40889_RC SEQ ID NO 2262
NM_003504 SEQ ID NO 919 Contig41035 SEQ ID NO 2263
NM_003508 SEQ ID NO 920 Contig41234_RC SEQ ID NO 2264
NM_003510 SEQ ID NO 921 Contig41413_RC SEQ ID NO 2266
NM_003512 SEQ ID NO 922 Contig41521_RC SEQ ID NO 2267
NM_003528 SEQ ID NO 923 Contig41530_RC SEQ ID NO 2268
NM_003544 SEQ ID NO 924 Contig41590 SEQ ID NO 2269
NM_003561 SEQ ID NO 925 Contig41618_RC SEQ ID NO 2270
NM_003563 SEQ ID NO 926 Contig41624_RC SEQ ID NO 2271
NM_003568 SEQ ID NO 927 Contig41635_RC SEQ ID NO 2272
NM_003579 SEQ ID NO 928 Contig41676_RC SEQ ID NO 2273
NM_003600 SEQ ID NO 929 Contig41689_RC SEQ ID NO 2274

NM 003615 SEQ ID NO 931 Contig41804RC SEQ ID NO 2275 NM 003627 SEQ ID NO 932 Contig41887_RC SEQ ID NO 2276 NM 003645 SEQ ID NO 935 Contig41905RC SEQ ID NO 2277 NM 003651 SEQ ID NO 936 Contig41954RC SEQ ID NO 2278 NM 003657 SEQ ID NO 937 Contig41983RC SEQ ID NO 2279 NM 003662 SEQ ID NO 938 Contig42006RC SEQ ID NO 2280 NM 003670 SEQ ID NO 939 Contig42014RC SEQ ID NO 2281 NM 003675 SEQ ID NO 940 Contig42036RC SEQ ID NO 2282 NM 003676 SEQ ID NO 941 Contig42041RC SEQ ID NO 2283 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 003681 SEQ ID NO 942 Contig42139 SEQ ID NO 2284 NM 003683 SEQ ID NO 943 Contig42161RC SEQ ID NO 2285 NM 003686 SEQ ID NO 944 Contig42220RC SEQ ID NO 2286 NM 003689 SEQ ID NO 945 Contig42306RC SEQ ID NO 2287 NM 003714 SEQ ID NO 946 Contig42311RC SEQ ID NO 2288 NM 003720 SEQ ID NO 947 Contig42313RC SEQ ID NO 2289 NM 003726 SEQ ID NO 948 Contig42402RC SEQ ID NO 2290 NM 003729 SEQ ID NO 949 Contig42421RC SEQ ID NO 2291 NM 003740 SEQ ID NO 950 Contig42430RC SEQ ID NO 2292 NM 003772 SEQ ID NO 952 Contig42431RC SEQ ID NO 2293 NM 003791 SEQ ID NO 953 Cantig42542RC SEQ ID NO 2294 NM 003793 SEQ ID NO 954 Contig42582 SEQ ID NO 2295 NM 003795 SEQ ID NO 955 Contig42631RC SEQ ID NO 2296 NM 003806 SEQ ID NO 956 Contig42751RC SEQ ID NO 2297 NM 003821 SEQ ID NO 957 Contig42759RC SEQ ID NO 2298 NM 003829 SEQ ID NO 958 Contig43054 SEQ ID NO 2299 NM 003831 SEQ ID NO 959 Contig43079RC SEQ ID NO 2300 NM 003862 SEQ ID NO 960 Contig43195RC SEQ ID NO 2301 NM 003866 SEQ ID NO 961 Contig43368RC SEQ ID NO 2302 NM 003875 SEQ ID NO 962 Contig43410RC SEQ ID NO 2303 NM 003878 SEQ ID NO 963 Contig43476 SEQ ID NO 2304 RC

NM 003894 SEQ ID NO 965 Contig43549RC SEQ ID NO 2305 NM 003897 SEQ ID NO 966 Contig43645RC SEQ ID NO 2306 NM 003904 SEQ ID NO 967 Contig43648RC SEQ ID NO 2307 NM 003929 SEQ ID NO 968 Contig43673RC SEQ ID NO 2308 NM 003933 SEQ ID NO 969 Contig43679RC SEQ ID NO 2309 NM 003937 SEQ ID NO 970 Contig43694RC SEQ ID NO 2310 NM 003940 SEQ ID NO 971 Contig43747 SEQ ID NO 2311 RC

NM_003942 SEQ ID NO 972 Contig43918_RC SEQ ID NO 2312
NM_003944 SEQ ID NO 973 Contig43983_RC SEQ ID NO 2313
NM_003953 SEQ ID NO 974 Contig44040_RC SEQ ID NO 2314
NM_003954 SEQ ID NO 975 Contig44064_RC SEQ ID NO 2315
NM_003975 SEQ ID NO 976 Contig44195_RC SEQ ID NO 2316
NM_003981 SEQ ID NO 977 Contig44226_RC SEQ ID NO 2317
NM_003982 SEQ ID NO 978 Contig44289_RC SEQ ID NO 2320
NM_003986 SEQ ID NO 979 Contig44310_RC SEQ ID NO 2321
NM_004003 SEQ ID NO 980 Contig44409 SEQ ID NO 2322
NM_004010 SEQ ID NO 981 Contig44413_RC SEQ ID NO 2323
NM_004024 SEQ ID NO 982 Contig44451_RC SEQ ID NO 2324
NM_004038 SEQ ID NO 983 Contig44585_RC SEQ ID NO 2325
NM_004049 SEQ ID NO 984 Contig44656_RC SEQ ID NO 2326
NM_004052 SEQ ID NO 985 Contig44703_RC SEQ ID NO 2327
NM_004053 SEQ ID NO 986 Contig44708_RC SEQ ID NO 2328
NM_004079 SEQ ID NO 987 Contig44757_RC SEQ ID NO 2329
NM_004104 SEQ ID NO 988 Contig44829_RC SEQ ID NO 2331
NM_004109 SEQ ID NO 989 Contig44870 SEQ ID NO 2332
NM_004110 SEQ ID NO 990 Contig44893_RC SEQ ID NO 2333
NM_004120 SEQ ID NO 991 Contig44909_RC SEQ ID NO 2334
NM_004131 SEQ ID NO 992 Contig44939_RC SEQ ID NO 2335
NM_004143 SEQ ID NO 993 Contig45022_RC SEQ ID NO 2336
NM_004154 SEQ ID NO 994 Contig45032_RC SEQ ID NO 2337
NM_004170 SEQ ID NO 996 Contig45041_RC SEQ ID NO 2338
NM_004172 SEQ ID NO 997 Contig45049_RC SEQ ID NO 2339
NM_004176 SEQ ID NO 998 Contig45090_RC SEQ ID NO 2340
NM_004180 SEQ ID NO 999 Contig45156_RC SEQ ID NO 2341

NM_004181 SEQ ID NO 1000 Contig45316_RC SEQ ID NO 2342
NM_004184 SEQ ID NO 1001 Contig45321 SEQ ID NO 2343
NM_004203 SEQ ID NO 1002 Contig45375_RC SEQ ID NO 2345
NM_004207 SEQ ID NO 1003 Contig45443_RC SEQ ID NO 2346
NM_004217 SEQ ID NO 1004 Contig45454_RC SEQ ID NO 2347
NM_004219 SEQ ID NO 1005 Contig45537_RC SEQ ID NO 2348
NM_004221 SEQ ID NO 1006 Contig45588_RC SEQ ID NO 2349
NM_004233 SEQ ID NO 1007 Contig45708_RC SEQ ID NO 2350
NM_004244 SEQ ID NO 1008 Contig45816_RC SEQ ID NO 2351
NM_004252 SEQ ID NO 1009 Contig45847_RC SEQ ID NO 2352
NM_004265 SEQ ID NO 1010 Contig45891_RC SEQ ID NO 2353
NM_004267 SEQ ID NO 1011 Contig46056_RC SEQ ID NO 2354
NM_004281 SEQ ID NO 1012 Contig46062_RC SEQ ID NO 2355
NM_004289 SEQ ID NO 1013 Contig46075_RC SEQ ID NO 2356
NM_004298 SEQ ID NO 1015 Contig46164_RC SEQ ID NO 2357
NM_004301 SEQ ID NO 1016 Contig46218_RC SEQ ID NO 2358
NM_004305 SEQ ID NO 1017 Contig46223_RC SEQ ID NO 2359
NM_004311 SEQ ID NO 1018 Contig46244_RC SEQ ID NO 2360
NM_004315 SEQ ID NO 1019 Contig46262_RC SEQ ID NO 2361
NM_004323 SEQ ID NO 1020 Contig46362_RC SEQ ID NO 2364
NM_004330 SEQ ID NO 1021 Contig46443_RC SEQ ID NO 2365
NM_004336 SEQ ID NO 1022 Contig46553_RC SEQ ID NO 2367
NM_004338 SEQ ID NO 1023 Contig46597_RC SEQ ID NO 2368
NM_004350 SEQ ID NO 1024 Contig46653_RC SEQ ID NO 2369
NM_004354 SEQ ID NO 1025 Contig46709_RC SEQ ID NO 2370
NM_004358 SEQ ID NO 1026 Contig46777_RC SEQ ID NO 2371
NM_004360 SEQ ID NO 1027 Contig46802_RC SEQ ID NO 2372
NM_004362 SEQ ID NO 1028 Contig46890_RC SEQ ID NO 2374
NM_004374 SEQ ID NO 1029 Contig46922_RC SEQ ID NO 2375
NM_004378 SEQ ID NO 1030 Contig46934_RC SEQ ID NO 2376
NM_004392 SEQ ID NO 1031 Contig46937_RC SEQ ID NO 2377
NM_004395 SEQ ID NO 1032 Contig46991_RC SEQ ID NO 2378
NM_004414 SEQ ID NO 1033 Contig47016_RC SEQ ID NO 2379
NM_004418 SEQ ID NO 1034 Contig47045_RC SEQ ID NO 2380

NM 004425 SEQ ID NO 1035Contig47106RC SEQ ID NO 2381 NM 004431 SEQ ID NO 1036Contig47146RC SEQ ID NO 2382 NM 004436 SEQ ID NO 1037Contig47230RC SEQ ID NO 2383 NM 004438 SEQ ID NO 1038Contig47405RC SEQ ID NO 2384 NM 004443 SEQ ID NO 1039Contig47456RC SEQ ID NO 2385 NM 004446 SEQ ID NO 1040Contig47465RC SEQ ID NO 2386 NM 004451 SEQ ID NO 1041Contig47498RC SEQ ID NO 2387 NM 004454 SEQ ID NO 1042Contig47578RC SEQ ID NO 2388 NM 004456 SEQ ID NO 1043Contig47645RC SEQ ID NO 2389 NM 004458 SEQ ID NO 1044Contig47680RC SEQ ID NO 2390 NM 004472 SEQ ID NO 1045Contig47781RC SEQ ID NO 2391 NM 004480 SEQ ID NO 1046Contig47814RC SEQ ID NO 2392 NM 004482 SEQ ID NO 1047Contig48004RC SEQ ID NO 2393 NM 004494 SEQ ID NO 1048Contig48043RC SEQ ID NO 2394 NM 004496 SEQ ID NO 1049Contig48057_RC SEQ ID NO 2395 NM 004503 SEQ ID NO 1050Contig48076RC SEQ ID NO 2396 NM 004504 SEQ ID NO 1051Contig48249RC SEQ ID NO 2397 NM 004515 SEQ ID NO 1052Contig48263RC SEQ ID NO 2398 NM 004522 SEQ ID NO 1053Contig48270RC SEQ ID NO 2399 NM 004523 SEQ ID NO 1054Contig48328RC SEQ ID NO 2400 NM 004525 SEQ ID NO 1055Contig48518RC SEQ ID NO 2401 NM 004556 SEQ 1D NO 1056Contig48572RC SEQ ID NO 2402 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 004559 SEQ ID NO 1057Contig48659RC SEQ ID NO 2403 NM 004569 SEQ ID NO 1058Contig48722RC SEQ ID NO 2404 NM 004577 SEQ ID NO 1059Contig48774RC SEQ ID NO 2405 NM 004585 SEQ ID NO 1060Contig48776 SEQ 1D NO 2406 RC

NM 004587 SEQ ID NO 1061Contig48800RC SEQ ID NO 2407 NM 004594 SEQ ID NO 1062Contig48806RC SEQ ID NO 2408 NM 004599 SEQ ID NO 1063Contig48852RC SEQ ID NO 2409 NM 004633 SEQ ID NO 1066Contig48900RC SEQ ID NO 2410 NM 004642 SEQ ID NO 1067Contig48913RC SEQ ID NO 2411 NM 004648 SEQ ID NO 1068Contig48970RC SEQ ID NO 2413 NM 004663 SEQ ID NO 1069Contig49058RC SEQ ID NO 2414 NM 004664 SEQ ID NO 1070Contig49063RC SEQ ID NO 2415 NM 004684 SEQ ID NO 1071Contig49093 SEQ ID NO 2416 NM 004688 SEQ ID NO 1072Contig49098RC SEQ ID NO 2417 NM 004694 SEQ ID NO 1073Contig49169RC SEQ ID NO 2418 NM 004695 SEQ ID NO 1074Contig49233RC SEQ ID NO 2419 NM 004701 SEQ ID NO 1075Contig49270RC SEQ ID NO 2420 NM 004708 SEQ ID NO 1077Contig49282RC SEQ ID NO 2421 NM 004711 SEQ ID NO 1078Contig49289RC SEQ ID NO 2422 NM 004726 SEQ ID NO 1079Contig49342RC SEQ ID NO 2423 NM 004750 SEQ ID NO 1081Contig49344 SEQ ID NO 2424 NM 004761 SEQ ID NO 1082Contig49388RC SEQ ID NO 2425 NM 004762 SEQ ID NO 1083Contig49405 SEQ ID NO 2426 RC

NM 004780 SEQ ID NO 1085Contig49445RC SEQ ID NO 2427 NM 004791 SEQ ID NO 1086Contig49468RC SEQ ID NO 2428 NM 004798 SEQ ID NO 1087Contig49509RC SEQ ID NO 2429 NM 004808 SEQ ID NO 1088Contig49578RC SEQ ID NO 2431 NM 004811 SEQ ID NO 1089Contig49581RC SEQ ID NO 2432 NM 004833 SEQ ID NO 1090Contig49631RC SEQ ID NO 2433 NM 004835 SEQ ID NO 1091Contig49673_RC SEQ ID NO 2435 30 NM 004843 SEQ ID NO 1092Contig49743RC SEQ ID NO 2436 NM 004847 SEQ ID NO 1093Contig49790RC SEQ ID NO 2437 NM 004848 SEQ ID NO 1094Contig49818RC SEQ ID NO 2438 NM 004864 SEQ ID NO 1095Contig49849RC SEQ ID NO 2439 NM 004865 SEQ ID NO 1096Contig49855 SEQ ID NO 2440 35 NM 004866 SEQ ID NO 1097Contig49910RC SEQ ID NO 2441 NM 004877 SEQ ID NO 1098Contig49948RC SEQ ID NO 2442 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 004900 SEQ ID NO 1099Contig50004RC SEQ ID NO 2443 NM 004906 SEQ ID NO 1100Contig50094 SEQ ID NO 2444 NM 004910 SEQ ID NO 1101Contig50120RC SEQ ID NO 2446 NM 004918 SEQ ID NO 1103Contig50153RC SEQ ID NO 2447 NM 004923 SEQ ID NO 1104Contig50189RC SEQ ID NO 2448 NM 004938 SEQ ID NO 1105Contig50276RC SEQ ID NO 2449 NM 004951 SEQ ID NO 1106Contig50288RC SEQ ID NO 2450 NM 004968 SEQ ID NO 1107Contig50297RC SEQ ID NO 2451 NM 004994 SEQ ID NO 1108Contig50391RC SEQ ID NO 2452 NM 004999 SEQ ID NO 1109Contig50410 SEQ ID NO 2453 NM 005001 SEQ ID NO 1110Contig50523RC SEQ ID NO 2454 NM 005002 SEQ ID NO 1111Contig50529 SEQ ID NO 2455 NM 005012 SEQ ID NO 1112Contig50588RC SEQ ID NO 2456 NM 005032 SEQ ID NO 1113Contig50592 SEQ ID NO 2457 NM 005044 SEQ ID NO 1114Contig50669RC SEQ ID NO 2458 NM 005046 SEQ ID NO 1115Contig50719RC SEQ ID NO 2460 NM 005049 SEQ ID NO 1116Contig50728RC SEQ ID NO 2461 NM 005067 SEQ ID NO 1117Contig50731RC SEQ ID NO 2462 NM 005077 SEQ ID NO 1118Contig50802RC SEQ ID NO 2463 NM 005080 SEQ ID NO 1119Contig50822RC SEQ ID NO 2464 NM 005084 SEQ ID NO 1120Contig50850RC SEQ ID NO 2466 NM 005130 SEQ ID NO 1122Contig50860RC SEQ ID NO 2467 NM 005139 SEQ iD NO 1123Contig50913RC SEQ iD NO 2468 NM 005168 SEQ ID NO 1125Contig50950RC SEQ ID NO 2469 NM 005190 SEQ ID NO 1126Contig51066RC SEQ ID NO 2470 NM 005196 SEQ ID NO 1127Contig51105RC SEQ ID NO 2472 NM 005213 SEQ ID NO 1128Contig51117RC SEQ ID NO 2473 NM 005218 SEQ ID NO 1129Contig51196RC SEQ ID NO 2474 NM 005235 SEQ ID NO 1130Contig51235RC SEQ ID NO 2475 NM 005245 SEQ ID NO 1131Contig51254RC SEQ ID NO 2476 NM 005249 SEQ ID NO 1132Contig51352RC SEQ ID NO 2477 NM 005257 SEQ ID NO 1133Contig51369RC SEQ ID NO 2478 NM 005264 SEQ ID NO 1134Contig51392RC SEQ ID NO 2479 NM 005271 SEQ ID NO 1135Contig51403RC SEQ ID NO 2480 NM 005314 SEQ ID NO 1136Contig51685RC SEQ ID NO 2483 NM 005321 SEQ ID NO 1137Contig51726RC SEQ ID NO 2484 NM 005322 SEQ ID NO 1138Contig51742RC SEQ ID NO 2485 GenBank SEQ 
ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 005325 SEQ ID NO 1139Contig51749RC SEQ ID NO 2486 NM 005326 SEQ ID NO 1140Contig51775 SEQ ID NO 2487 RC

NM 005335 SEQ ID NO 1141Contig51800 SEQ ID NO 2488 NM 005337 SEQ ID NO 1142Contig51809RC SEQ ID NO 2489 NM 005342 SEQ ID NO 1143Contig51821RC SEQ ID NO 2490 NM 005345 SEQ ID NO 1144Contig51888RC SEQ ID NO 2491 NM 005357 SEQ ID NO 1145Contig51953RC SEQ ID NO 2493 NM 005375 SEQ ID NO 1146Contig51967RC SEQ ID NO 2495 NM 005391 SEQ ID NO 1147Contig51981RC SEQ ID NO 2496 NM 005408 SEQ ID NO 1148Contig51994RC SEQ ID NO 2497 NM 005409 SEQ ID NO 1149Contig52082RC SEQ ID NO 2498 NM 005410 SEQ ID NO 1150Contig52094RC SEQ ID NO 2499 NM 005426 SEQ ID NO 1151Contig52320 SEQ ID NO 2500 NM 005433 SEQ ID NO 1152Contig52398RC SEQ ID NO 2501 NM 005441 SEQ ID NO 1153Contig52425RC SEQ ID NO 2503 NM 005443 SEQ ID NO 1154Contig52482RC SEQ ID NO 2504 NM 005483 SEQ ID NO 1155Contig52543RC SEQ ID NO 2505 NM 005486 SEQ ID NO 1156Contig52553 SEQ ID NO 2506 RC

NM 005496 SEQ ID NO 1157Contig52579RC SEQ ID NO 2507 NM 005498 SEQ ID NO 1158Contig52603RC SEQ ID NO 2508 NM 005499 SEQ ID NO 1159Contig52639RC SEQ ID NO 2509 NM 005514 SEQ ID NO 1160Contig52641RC SEQ ID NO 2510 NM 005531 SEQ ID NO 1162Contig52684 SEQ ID NO 2511 NM 005538 SEQ ID NO 1163Contig52705RC SEQ ID NO 2512 NM 005541 SEQ ID NO 1164Contig52720RC SEQ ID NO 2513 NM 005544 SEQ ID NO 1165Contig52722RC SEQ ID NO 2514 NM 005548 SEQ ID NO 1166Contig52723RC SEQ ID NO 2515 NM 005554 SEQ ID NO 1167Contig52740RC SEQ ID NO 2516 NM 005555 SEQ ID NO 1168Contig52779RC SEQ ID NO 2517 NM 005556 SEQ ID NO 1169Contig52957RC SEQ ID NO 2518 NM 005557 SEQ ID NO 1170Contig52994RC SEQ ID NO 2519 NM 005558 SEQ ID NO 1171Contig53022RC SEQ ID NO 2520 NM 005562 SEQ ID NO 1172Contig53038RC SEQ ID NO 2521 NM 005563 SEQ ID NO 1173Contig53047RC SEQ ID NO 2522 NM 005565 SEQ ID NO 1174Contig53130 SEQ ID NO 2523 NM 005566 SEQ ID NO 1175Contig53183RC SEQ ID NO 2524 NM 005572 SEQ ID NO 1176Contig53242RC SEQ ID NO 2526 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 005582 SEQ ID NO 1177Contig53248RC SEQ ID NO 2527 NM 005608 SEQ ID NO 1178Contig53260RC SEQ ID NO 2528 NM 005614 SEQ 1D NO 1179Contig53296RC SEQ ID NO 2531 NM 005617 SEQ ID NO 1180Contig53307 SEQ ID NO 2532 RC

NM 005620 SEQ ID NO 1181Contig53314RC SEQ ID NO 2533 NM 005625 SEQ ID NO 1182Contig53401RC SEQ ID NO 2534 NM 005651 SEQ 1D NO 1183Contig53550RC SEQ ID NO 2535 NM 005658 SEQ ID NO 1184Contig53551RC SEQ ID NO 2536 NM 005659 SEQ ID NO 1185Contig53598RC SEQ ID NO 2537 NM 005667 SEQ ID NO 1186Contig53646RC SEQ ID NO 2538 NM 005686 SEQ iD NO 1187Contig53658RC SEQ ID NO 2539 NM 005690 SEQ ID NO 1188Contig53698RC SEQ ID NO 2540 NM 005720 SEQ ID NO 1190Contig53719RC SEQ ID NO 2541 NM 005727 SEQ ID NO 1191Contig53742RC SEQ ID NO 2542 NM 005733 SEQ ID NO 1192Contig53757RC SEQ ID NO 2543 NM 005737 SEQ ID NO 1193Contig53870RC SEQ ID NO 2544 NM 005742 SEQ ID NO 1194Contig53952RC SEQ ID NO 2546 NM 005746 SEQ ID NO 1195Contig53962RC SEQ ID NO 2547 NM 005749 SEQ ID NO 1196Contig53968RC SEQ ID NO 2548 NM 005760 SEQ ID NO 1197Contig54113RC SEQ ID NO 2549 NM 005764 SEQ ID NO 1198Contig54142RC SEQ ID NO 2550 NM 005794 SEQ ID NO 1199Contig54232RC SEQ ID NO 2551 NM 005796 SEQ ID NO 1200Contig54242RC SEQ ID NO 2552 NM 005804 SEQ 1D NO 1201Contig54260RC SEQ ID NO 2553 ~5 NM 005813 SEQ ID NO 1202Contig54263RC SEQ ID NO 2554 NM 005824 SEQ ID NO 1203Contig54295 SEQ ID NO 2555 RC

NM 005825 SEQ ID NO 1204Contig54318RC SEQ ID NO 2556 NM 005849 SEQ ID NO 1205Contig54325RC SEQ ID NO 2557 NM 005853 SEQ ID NO 1206Contig54389RC SEQ ID NO 2558 NM 005855 SEQ ID NO 1207Contig54394_RC SEQ ID NO 2559 NM 005864 SEQ ID NO 1208Contig54414RC SEQ ID NO 2560 NM 005874 SEQ ID NO 1209Contig54425 SEQ ID NO 2561 NM 005876 SEQ ID NO 1210Contig54477 SEQ ID NO 2562 RC

NM 005880 SEQ ID NO 1211Contig54503RC SEQ ID NO 2563 NM 005891 SEQ ID NO 1212Contig54534RC SEQ ID NO 2564 NM 005892 SEQ ID NO 1213Contig54560RC SEQ ID NO 2566 NM 005899 SEQ ID NO 1214Contig54581RC SEQ ID NO 2567 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 005915 SEQ ID NO 1215Contig54609RC SEQ ID NO 2568 NM 005919 SEQ 1D NO 1216Contig54666RC SEQ ID NO 2569 NM 005923 SEQ ID NO 1217Contig54667RC SEQ ID NO 2570 NM 005928 SEQ ID NO 1218Contig54726RC SEQ ID NO 2571 NM 005932 SEQ ID NO 1219Contig54742RC SEQ ID NO 2572 NM 005935 SEQ ID NO 1220Contig54745 SEQ ID NO 2573 RC

NM 005945 SEQ ID NO 1221Contig54757RC SEQ ID NO 2574 NM 005953 SEQ ID NO 1222Contig54761RC SEQ ID NO 2575 NM 005978 SEQ ID NO 1223Contig54813RC SEQ ID NO 2576 NM 005990 SEQ ID NO 1224Contig54867RC SEQ ID NO 2577 NM 006002 SEQ ID NO 1225Contig54895RC SEQ ID NO 2578 NM 006004 SEQ ID NO 1226Contig54898RC SEQ ID NO 2579 NM 006005 SEQ ID NO 1227Contig54913RC SEQ ID NO 2580 NM 006006 SEQ ID NO 1228Contig54965RC SEQ ID NO 2582 NM 006017 SEQ ID NO 1229Contig54968RC SEQ ID NO 2583 NM 006018 SEQ ID NO 1230Contig55069RC SEQ ID NO 2584 NM 006023 SEQ ID NO 1231Contig55181RC SEQ ID NO 2585 NM 006027 SEQ 1D NO 1232Contig55188RC SEQ ID NO 2586 NM 006029 SEQ ID NO 1233Contig55221RC SEQ ID NO 2587 NM 006033 SEQ 1D NO 1234Contig55254RC SEQ ID NO 2588 NM 006051 SEQ ID NO 1235Contig55265RC SEQ ID NO 2589 NM 006055 SEQ ID NO 1236Contig55377RC SEQ ID NO 2591 NM 006074 SEQ 1D NO 1237Contig55397RC SEQ 1D NO 2592 NM 006086 SEQ ID NO 1238Contig55448RC SEQ ID NO 2593 NM 006087 SEQ ID NO 1239Contig55468RC SEQ ID NO 2594 NM 006096 SEQ ID NO 1240Contig55500RC SEQ ID NO 2595 NM 006101 SEQ ID NO 1241Contig55538RC SEQ ID NO 2596 NM 006103 SEQ ID NO 1242Contig55558RC SEQ ID NO 2597 NM 006111 SEQ ID NO 1243Contig55606RC SEQ ID NO 2598 NM 006113 SEQ ID NO 1244Contig55674RC SEQ ID NO 2599 NM 006115 SEQ ID NO 1245Contig55725RC SEQ ID NO 2600 NM 006117 SEQ ID NO 1246Contig55728RC SEQ ID NO 2601 NM 006142 SEQ ID NO 1247Contig55756RC SEQ ID NO 2602 NM 006144 SEQ ID NO 1248Contig55769RC SEQ ID NO 2603 NM 006148 SEQ ID NO 1249Contig55771RC SEQ ID NO 2605 NM 006153 SEQ ID NO 1250Contig55813RC SEQ ID NO 2607 NM 006159 SEQ ID NO 1251Contig55829RC SEQ ID NO 2608 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 006170 SEQ ID NO 1252Contig55852RC SEQ ID NO 2609 NM 006197 SEQ ID NO 1253Contig55883 SEQ ID NO 2610 RC

NM_006224 SEQ ID NO 1255 Contig55920_RC SEQ ID NO 2611
NM_006227 SEQ ID NO 1256 Contig55940_RC SEQ ID NO 2612
NM_006235 SEQ ID NO 1257 Contig55950_RC SEQ ID NO 2613
NM_006243 SEQ ID NO 1258 Contig55991_RC SEQ ID NO 2614
NM_006264 SEQ ID NO 1259 Contig55997_RC SEQ ID NO 2615
NM_006271 SEQ ID NO 1261 Contig56023_RC SEQ ID NO 2616
NM_006274 SEQ ID NO 1262 Contig56030_RC SEQ ID NO 2617
NM_006290 SEQ ID NO 1265 Contig56093_RC SEQ ID NO 2618
NM_006291 SEQ ID NO 1266 Contig56205_RC SEQ ID NO 2621
NM_006296 SEQ ID NO 1267 Contig56270_RC SEQ ID NO 2622
NM_006304 SEQ ID NO 1268 Contig56276_RC SEQ ID NO 2623
NM_006314 SEQ ID NO 1269 Contig56291_RC SEQ ID NO 2624
NM_006332 SEQ ID NO 1270 Contig56298_RC SEQ ID NO 2625
NM_006357 SEQ ID NO 1271 Contig56307 SEQ ID NO 2627
NM_006366 SEQ ID NO 1272 Contig56390_RC SEQ ID NO 2628
NM_006372 SEQ ID NO 1273 Contig56434_RC SEQ ID NO 2629
NM_006377 SEQ ID NO 1274 Contig56457_RC SEQ ID NO 2630
NM_006378 SEQ ID NO 1275 Contig56534_RC SEQ ID NO 2631
NM_006383 SEQ ID NO 1276 Contig56670_RC SEQ ID NO 2632
NM_006389 SEQ ID NO 1277 Contig56678_RC SEQ ID NO 2633
NM_006393 SEQ ID NO 1278 Contig56742_RC SEQ ID NO 2634
NM_006398 SEQ ID NO 1279 Contig56759_RC SEQ ID NO 2635
NM_006406 SEQ ID NO 1280 Contig56765_RC SEQ ID NO 2636
NM_006408 SEQ ID NO 1281 Contig56843_RC SEQ ID NO 2637
NM_006410 SEQ ID NO 1282 Contig57011_RC SEQ ID NO 2638
NM_006414 SEQ ID NO 1283 Contig57023_RC SEQ ID NO 2639
NM_006417 SEQ ID NO 1284 Contig57057_RC SEQ ID NO 2640

NM 006430 SEQ ID NO 1285Contig57076RC SEQ ID NO 2641 NM 006460 SEQ ID NO 1286Contig57081RC SEQ ID NO 2642 NM 006461 SEQ ID NO 1287Contig57091RC SEQ ID NO 2643 NM 006469 SEQ ID NO 1288Contig57138RC SEQ ID NO 2644 NM 006470 SEQ ID NO 1289Contig57173RC SEQ ID NO 2645 NM 006491 SEQ ID NO 1290Contig57230RC SEQ ID NO 2646 NM 006495 SEQ ID NO 1291Contig57258RC SEQ ID NO 2647 NM 006500 SEQ ID NO 1292Contig57270RC SEQ ID NO 2648 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 006509 SEQ ID NO 1293Contig57272RC SEQ ID NO 2649 NM 006516 SEQ ID NO 1294Contig57344RC SEQ ID NO 2650 NM 006533 SEQ ID NO 1295Contig57430RC SEQ ID NO 2651 NM 006551 SEQ ID NO 1296Contig57458RC SEQ ID NO 2652 NM 006556 SEQ ID NO 1297Contig57493RC SEQ ID NO 2653 NM 006558 SEQ ID NO 1298Contig57584RC SEQ ID NO 2654 NM 006564 SEQ ID NO 1299Contig57595 SEQ ID NO 2655 NM 006573 SEQ ID NO 1300Contig57602RC SEQ ID NO 2656 NM 006607 SEQ ID NO 1301Contig57609 SEQ ID NO 2657 RC

NM_006622 SEQ ID NO 1302 Contig57610_RC SEQ ID NO 2658
NM_006623 SEQ ID NO 1303 Contig57644_RC SEQ ID NO 2659
NM_006636 SEQ ID NO 1304 Contig57725_RC SEQ ID NO 2660
NM_006670 SEQ ID NO 1305 Contig57739_RC SEQ ID NO 2661
NM_006681 SEQ ID NO 1306 Contig57825_RC SEQ ID NO 2662
NM_006682 SEQ ID NO 1307 Contig57864_RC SEQ ID NO 2663
NM_006696 SEQ ID NO 1308 Contig57940_RC SEQ ID NO 2664
NM_006698 SEQ ID NO 1309 Contig58260_RC SEQ ID NO 2665
NM_006705 SEQ ID NO 1310 Contig58272_RC SEQ ID NO 2666
NM_006739 SEQ ID NO 1311 Contig58301_RC SEQ ID NO 2667
NM_006748 SEQ ID NO 1312 Contig58368_RC SEQ ID NO 2668
NM_006759 SEQ ID NO 1313 Contig58471_RC SEQ ID NO 2669
NM_006762 SEQ ID NO 1314 Contig58755_RC SEQ ID NO 2671

NM 006763 SEQ ID NO 1315Contig59120RC SEQ ID NO 2672 NM 006769 SEQ ID NO 1316Contig60157RC SEQ ID NO 2673 NM 006770 SEQ ID NO 1317Contig60864RC SEQ ID NO 2676 NM 006780 SEQ ID NO 1318Contig61254RC SEQ ID NO 2677 NM 006787 SEQ ID NO 1319Contig61815 SEQ ID NO 2678 NM 006806 SEQ ID NO 1320Contig61975 SEQ ID NO 2679 NM 006813 SEQ ID NO 1321Contig62306 SEQ ID NO 2680 NM 006825 SEQ ID NO 1322Contig62568RC SEQ ID NO 2681 NM 006826 SEQ ID NO 1323Contig62922RC SEQ ID NO 2682 NM 006829 SEQ ID NO 1324Contig62964RC SEQ ID NO 2683 NM 006834 SEQ ID NO 1325Contig63520RC SEQ ID NO 2685 NM 006835 SEQ ID NO 1326Contig63649RC SEQ ID NO 2686 NM 006840 SEQ ID NO 1327Contig63683RC SEQ ID NO 2687 NM 006845 SEQ ID NO 1328Contig63748RC SEQ ID NO 2688 NM 006847 SEQ ID NO 1329Contig64502 SEQ ID NO 2689 GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 006851 SEQ ID NO 1330Contig64688 SEQ ID NO 2690 NM 006855 SEQ ID NO 1331Contig64775RC SEQ ID NO 2691 NM 006864 SEQ ID NO 1332Contig65227 SEQ ID NO 2692 NM 006868 SEQ ID NO 1333Contig65663 SEQ ID NO 2693 NM 006875 SEQ ID NO 1334Contig65785RC SEQ ID NO 2694 NM 006889 SEQ ID NO 1336Contig65900 SEQ ID NO 2695 NM 006892 SEQ ID NO 1337Contig66219RC SEQ ID NO 2696 NM 006912 SEQ ID NO 1338Contig66705RC SEQ ID NO 2697 NM 006931 SEQ ID NO 1341Contig66759 SEQ ID NO 2698 RC

NM_006941 SEQ ID NO 1342 Contig67182_RC SEQ ID NO 2699

Table 2. 550 preferred ER status markers drawn from Table 1.

Identifier Correlation Name Description
NM_002051 0.763977 GATA3 GATA-binding protein 3
AB020689 0.753592 KIAA0882 KIAA0882 protein
NM_001218 0.753225 CA12 carbonic anhydrase XII
NM_000125 0.748421 ESR1 estrogen receptor 1
Contig56678_RC 0.747816 ESTs
NM_004496 0.729116 HNF3A hepatocyte nuclear factor 3, alpha
NM_017732 0.713398 FLJ20262 hypothetical protein FLJ20262
NM_006806 -0.712678 BTG3 BTG family, member 3
Contig56390_RC 0.705940 ESTs
Contig37571_RC 0.704468 ESTs
NM_004559 -0.701617 NSEP1 nuclease sensitive element binding protein 1
Contig50153_RC -0.696652 ESTs, Weakly similar to LKHU proteoglycan link protein precursor [H.sapiens]
NM_012155 0.694332 EMAP-2 microtubule-associated protein like echinoderm EMAP
Contig237_RC 0.687485 FLJ21127 hypothetical protein FLJ21127
NM_019063 -0.686064 C2ORF2 chromosome 2 open reading frame 2
NM_012219 -0.680900 MRAS muscle RAS oncogene homolog
NM_001982 0.676114 ERBB3 v-erb-b2 avian erythroblastic leukemia viral oncogene homolog
NM_006623 -0.675090 PHGDH phosphoglycerate dehydrogenase
NM_000636 -0.674282 SOD2 superoxide dismutase 2, mitochondrial
NM_006017 -0.670353 PROML1 prominin (mouse)-like 1
Contig57940_RC 0.667915 MAP-1 MAP-1 protein
Contig46934_RC 0.666908 ESTs, Weakly similar to JE0350 Anterior gradient-2 [H.sapiens]
NM_005080 0.665772 XBP1 X-box binding protein 1
NM_014246 0.665725 CELSR1 cadherin, EGF LAG seven-pass G-type receptor 1, flamingo (Drosophila) homolog
Contig54667_RC -0.663727 Human DNA sequence from clone RP1-187J11 on chromosome 6q11.1-22.33. Contains the gene for a novel protein similar to S. pombe and S. cerevisiae predicted proteins, the gene for a novel protein similar to protein kinase C inhibitors, the 3' end of the gene for a novel protein similar to Drosophila L82 and predicted worm proteins, ESTs, STSs, GSSs and two putative CpG islands
Contig51994_RC 0.663715 ESTs, Weakly similar to B0416.1 [C.elegans]

NM_016337 0.663006 RNB6 RNB6
NM_015640 -0.660165 PAI-RBP1 PAI-1 mRNA-binding protein
X07834 -0.657798 SOD2 superoxide dismutase 2, mitochondrial
NM_012319 0.657666 LIV-1 LIV-1 protein, estrogen regulated
Contig41887_RC 0.656042 ESTs, Weakly similar to Homolog of rat Zymogen granule membrane protein [H.sapiens]
NM_003462 0.655349 P28 dynein, axonemal, light intermediate polypeptide
Contig58301_RC 0.654268 Homo sapiens mRNA; cDNA DKFZp667D095 (from clone DKFZp667D095)
NM_005375 0.653783 MYB v-myb avian myeloblastosis viral oncogene homolog
NM_017447 -0.652445 YG81 hypothetical protein LOC54149
Contig924_RC -0.650658 ESTs
M55914 -0.650181 MPB1 MYC promoter-binding protein
NM_006004 -0.649819 UQCRH ubiquinol-cytochrome c reductase hinge protein
NM_000964 0.649072 RARA retinoic acid receptor, alpha
NM_013301 0.647583 HSU79303 protein predicted by clone
AB023211 -0.647403 PDI2 peptidyl arginine deiminase, type II
NM_016629 -0.646412 LOC51323 hypothetical protein
K02403 0.645532 C4A complement component 4A
NM_016405 -0.642201 HSU93243 Ubc6p homolog
Contig46597_RC 0.641733 ESTs
Contig55377_RC 0.640310 ESTs
NM_001207 0.637800 BTF3 basic transcription factor
NM_018166 0.636422 FLJ10647 hypothetical protein FLJ10647
AL110202 -0.635398 Homo sapiens mRNA; cDNA DKFZp586I2022 (from clone DKFZp586I2022)
AL133105 -0.635201 DKFZp434F hypothetical protein DKFZp434F2322
NM_016839 -0.635169 RBMS1 RNA binding motif, single stranded interacting protein 1
Contig53130 -0.634812 ESTs, Weakly similar to hyperpolarization-activated cyclic nucleotide-gated channel hHCN2 [H.sapiens]

NM_018014 -0.634460 BCL11A B-cell CLL/lymphoma 11A (zinc finger protein)
NM_006769 -0.632197 LMO4 LIM domain only 4
U92544 0.631170 JCL-1 hepatocellular carcinoma associated protein; breast cancer associated gene 1
Contig49233_RC -0.631047 Homo sapiens, Similar to nuclear receptor binding factor 2, clone IMAGE:3463191, mRNA, partial cds
AL133033 0.629690 KIAA1025 KIAA1025 protein
AL049265 0.629414 Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053)
NM_018728 0.627989 MYO5C myosin 5C
NM_004780 0.627856 TCEAL1 transcription elongation factor A (SII)-like 1
Contig760_RC 0.627132 ESTs
Contig399_RC 0.626543 FLJ12538 hypothetical protein FLJ12538 similar to ras-related protein
M83822 0.625092 CDC4L cell division cycle 4-like
NM_001255 -0.625089 CDC20 CDC20 (cell division cycle 20, S. cerevisiae, homolog)
NM_006739 -0.624903 MCM5 minichromosome maintenance deficient (S. cerevisiae) 5 (cell division cycle 46)
NM_002888 -0.624664 RARRES1 retinoic acid receptor responder (tazarotene induced) 1
NM_003197 0.623850 TCEB1L transcription elongation factor B (SIII), polypeptide 1-like
NM_006787 0.623625 JCL-1 hepatocellular carcinoma associated protein; breast cancer associated gene 1
Contig49342_RC 0.622179 ESTs

Identifier Correlation Name Description
AL133619 0.621719 Homo sapiens mRNA; cDNA DKFZp434E2321 (from clone DKFZp434E2321); partial cds
AL133622 0.621577 KIAA0876 KIAA0876 protein
NM_004648 -0.621532 PTPNS1 protein tyrosine phosphatase, non-receptor type substrate 1
NM_001793 -0.621530 CDH3 cadherin 3, type 1, P-cadherin (placental)
NM_003217 0.620915 TEGT testis enhanced gene transcript (BAX inhibitor 1)
NM_001551 0.620832 IGBP1 immunoglobulin (CD79A) binding protein 1
NM_002539 -0.620683 ODC1 ornithine decarboxylase 1
Contig55997_RC -0.619932 ESTs
NM_000633 0.619547 BCL2 B-cell CLL/lymphoma 2
NM_016267 -0.619096 TONDU TONDU
Contig3659_RC 0.618048 FLJ21174 hypothetical protein FLJ21174
NM_000191 0.617250 HMGCL 3-hydroxymethyl-3-methylglutaryl-Coenzyme A lyase (hydroxymethylglutaricaciduria)
NM_001267 0.616890 CHAD chondroadherin
Contig39090_RC 0.616385 ESTs

AF055270 -0.616268 HSSG1 heat-shock suppressed protein
Contig43054 0.616015 FLJ21603 hypothetical protein FLJ21603
NM_001428 -0.615855 ENO1 enolase 1, (alpha)
Contig51369_RC 0.615466 ESTs
Contig36647 0.615310 GFRA1 GDNF family receptor alpha
NM_014096 -0.614832 PRO1659 PRO1659 protein
NM_015937 0.614735 LOC51604 CGI-06 protein
Contig49790_RC -0.614463 ESTs
NM_006759 -0.614279 UGP2 UDP-glucose pyrophosphorylase
Contig53598_RC -0.613787 FLJ11413 hypothetical protein FLJ11413
AF113132 -0.613561 PSA phosphoserine aminotransferase
AK000004 0.613001 Homo sapiens mRNA for FLJ00004 protein, partial cds
Contig52543_RC 0.612960 Homo sapiens cDNA FLJ13945 fis, clone Y79AA1000969
AB032966 -0.611917 KIAA1140 KIAA1140 protein
AL080192 0.611544 Homo sapiens cDNA: FLJ21238 fis, clone COL01115
X56807 -0.610654 DSC2 desmocollin 2
Contig30390_RC 0.609614 ESTs

AL137362 0.609121 FLJ22237 hypothetical protein FLJ22237
NM_014211 -0.608585 GABRP gamma-aminobutyric acid (GABA) A receptor, pi
NM_006696 0.608474 SMAP thyroid hormone receptor coactivating protein
Contig45588_RC -0.608273 Homo sapiens cDNA: FLJ22610 fis, clone HS104930
NM_003358 0.608244 UGCG UDP-glucose ceramide glucosyltransferase
NM_006153 -0.608129 NCK1 NCK adaptor protein 1
NM_001453 -0.606939 FOXC1 forkhead box C1
Contig54666_RC 0.606475 oy65e02.x1 NCI_CGAP_CLL1 Homo sapiens cDNA clone IMAGE:1670714 3' similar to TR:Q29168 Q29168 UNKNOWN PROTEIN ;, mRNA sequence.
NM_005945 -0.605945 MPB1 MYC promoter-binding protein
Contig55725_RC -0.605841 ESTs, Moderately similar to T50635 hypothetical protein DKFZp762L0311.1 [H.sapiens]
Contig37015_RC -0.605780 ESTs, Weakly similar to U U AS3_HUMAN PROTEIN [H.sapiens]
AL157480 -0.604362 SH3BP1 SH3-domain binding protein
NM_005325 -0.604310 H1F1 H1 histone family, member
NM_001446 -0.604061 FABP7 fatty acid binding protein 7, brain
Contig263_RC 0.603318 Homo sapiens cDNA: FLJ23000 fis, clone LNG00194
Contig8347_RC -0.603311 ESTs
NM_002988 -0.603279 SCYA18 small inducible cytokine subfamily A (Cys-Cys), member 18, pulmonary and activation-regulated
AF111849 0.603157 HELO1 homolog of yeast long chain polyunsaturated fatty acid elongation enzyme 2
NM_014700 0.603042 KIAA0665 KIAA0665 gene product
NM_001814 -0.602988 CTSC cathepsin C

AF116682 -0.602350 PRO2013 hypothetical protein PRO2013
AB037836 0.602024 KIAA1415 KIAA1415 protein
AB002301 0.602005 KIAA0303 KIAA0303 protein
NM_002996 -0.601841 SCYD1 small inducible cytokine subfamily D (Cys-X3-Cys), member 1 (fractalkine, neurotactin)
NM_018410 -0.601765 DKFZp762E1312 hypothetical protein DKFZp762E1312
Contig49581_RC -0.601571 KIAA1350 KIAA1350 protein
NM_003088 -0.601458 SNL singed (Drosophila)-like (sea urchin fascin homolog like)
Contig47045_RC 0.601088 ESTs, Weakly similar to PROTEIN 1 [H.sapiens]
NM_001806 -0.600954 CEBPG CCAAT/enhancer binding protein (C/EBP), gamma
NM_004374 0.600766 COX6C cytochrome c oxidase subunit VIc
Contig52641_RC 0.600132 ESTs, Weakly similar to CENB_MOUSE MAJOR CENTROMERE AUTOANTIGEN B [M.musculus]
NM_000100 -0.600127 CSTB cystatin B (stefin B)
NM_002250 -0.600004 KCNN4 potassium intermediate/small conductance calcium-activated channel, subfamily N, member
AB033035 -0.599423 KIAA1209 KIAA1209 protein
Contig53968_RC 0.599077 ESTs
NM_002300 -0.598246 LDHB lactate dehydrogenase B
NM_000507 0.598110 FBP1 fructose-1,6-bisphosphatase
NM_002053 -0.597756 GBP1 guanylate binding protein 1, interferon-inducible, 67kD
AB007883 0.597043 KIAA0423 KIAA0423 protein
NM_004900 -0.597010 DJ742C19.2 phorbolin (similar to apolipoprotein B mRNA editing protein)
NM_004480 0.596321 FUT8 fucosyltransferase 8 (alpha (1,6) fucosyltransferase)
Contig35896_RC 0.596281 ESTs

NM_020974 0.595173 CEGP1 CEGP1 protein
NM_000662 0.595114 NAT1 N-acetyltransferase 1 (arylamine N-acetyltransferase)
NM_006113 0.595017 VAV3 vav 3 oncogene
NM_014865 -0.594928 KIAA0159 chromosome condensation-related SMC-associated protein 1
Contig55538_RC -0.594573 BA395L14. hypothetical protein bA395L14.2
NM_016056 0.594084 LOC51643 CGI-119 protein
NM_003579 -0.594063 RAD54L RAD54 (S.cerevisiae)-like
NM_014214 -0.593860 IMPA2 inositol(myo)-1 (or 4)-monophosphatase 2
U79293 0.593793 Human clone 23948 mRNA sequence
NM_005557 -0.593746 KRT16 keratin 16 (focal non-epidermolytic palmoplantar keratoderma)
NM_002444 -0.592405 MSN moesin
NM_003681 -0.592155 PDXK pyridoxal (pyridoxine, vitamin B6) kinase
NM_006372 -0.591711 NSAP1 NS1-associated protein 1
NM_005218 -0.591192 DEFB1 defensin, beta 1
NM_004642 -0.591081 DOC1 deleted in oral cancer (mouse, homolog) 1
AL133074 0.590359 Homo sapiens cDNA: FLJ22139 fis, clone HEP20959
M73547 0.590317 D5S346 DNA segment, single copy probe LNS-CAI/LNS-CAII (deleted in polyposis
Contig65663 0.590312 ESTs
AL035297 -0.589728 H.sapiens gene from PAC 747L4
Contig35629_RC 0.589383 ESTs

NM_019027 0.588862 FLJ20273 hypothetical protein
NM_012425 -0.588804 Homo sapiens Ras suppressor protein 1 (RSU1), mRNA
NM_020179 -0.588326 FN5 FN5 protein
AF090913 -0.587275 TMSB10 thymosin, beta 10
NM_004176 0.587190 SREBF1 sterol regulatory element binding transcription factor 1
NM_016121 0.586941 LOC51133 NY-REN-45 antigen
NM_014773 0.586871 KIAA0141 KIAA0141 gene product
NM_019000 0.586677 FLJ20152 hypothetical protein
NM_016243 0.585942 LOC51706 cytochrome b5 reductase 1 (B5R.1)
NM_014274 -0.585815 ABP/ZF Alu-binding protein with zinc finger domain
NM_018379 0.585497 FLJ11280 hypothetical protein FLJ11280
AL157431 -0.585077 DKFZp762 hypothetical protein DKFZp762A227
D38521 -0.584684 KIAA0077 KIAA0077 protein
NM_002570 0.584272 PACE4 paired basic amino acid cleaving system 4
NM_001809 -0.584252 CENPA centromere protein A (17kD)
NM_003318 -0.583556 TTK TTK protein kinase
NM_014325 -0.583555 COR01 coronin, actin-binding protein,
NM_005667 0.583376 ZFP103 zinc finger protein homologous to Zfp103 in mouse
NM_004354 0.582420 CCNG2 cyclin G2
NM_003670 0.582235 BHLHB2 basic helix-loop-helix domain containing, class B, 2
NM_001673 -0.581902 ASNS asparagine synthetase
NM_001333 -0.581402 CTSL2 cathepsin L2
Contig54295_RC 0.581256 ESTs

Contig33998_RC 0.581018 ESTs
NM_006002 -0.580592 UCHL3 ubiquitin carboxyl-terminal esterase L3 (ubiquitin thiolesterase)
NM_015392 0.580568 NPDC1 neural proliferation, differentiation and control, 1
NM_004866 0.580138 SCAMP1 secretory carrier membrane protein 1
Contig50391_RC 0.580071 ESTs
NM_000592 0.579965 C4B complement component 4B
Contig50802_RC 0.579881 ESTs
Contig41635_RC -0.579468 ESTs

NM_006845 -0.579339 KNSL6 kinesin-like 6 (mitotic centromere-associated kinesin)
NM_003720 -0.579296 DSCR2 Down syndrome critical region gene 2
NM_000060 0.578967 BTD biotinidase
AL050388 -0.578736 Homo sapiens mRNA; cDNA DKFZp564M2422 (from clone DKFZp564M2422); partial cds
NM_003772 -0.578395 JRKL jerky (mouse) homolog-like
NM_014398 -0.578388 TSC403 similar to lysosome-associated membrane glycoprotein
NM_001280 0.578213 CIRBP cold inducible RNA-binding protein
NM_001395 -0.577369 DUSP9 dual specificity phosphatase
NM_016229 -0.576290 LOC51700 cytochrome b5 reductase b5R.2
NM_006096 -0.575615 NDRG1 N-myc downstream regulated
NM_001552 0.575438 IGFBP4 insulin-like growth factor-binding protein 4
NM_005558 -0.574818 LAD1 ladinin 1
Contig54534_RC 0.574784 Human glucose transporter pseudogene
Contig1239_RC 0.573822 Human Chromosome 16 BAC clone
Contig57173_RC 0.573807 Homo sapiens mRNA for KIAA1737 protein, partial cds
NM_004414 -0.573538 DSCR1 Down syndrome critical region gene 1
NM_021103 -0.572722 TMSB10 thymosin, beta 10
NM_002350 -0.571917 LYN v-yes-1 Yamaguchi sarcoma viral related oncogene homolog
Contig51235_RC 0.571049 Homo sapiens cDNA: FLJ23388 fis, clone HEP17008
NM_013384 0.570987 TMSG1 tumor metastasis-suppressor
NM_014399 0.570936 NET-6 tetraspan NET-6 protein
Contig26022_RC -0.570851 ESTs

AB023152 0.570561 KIAA0935 KIAA0935 protein
NM_021077 -0.569944 NMB neuromedin B
NM_003498 -0.569129 SNN stannin
017077 -0.568979 BENE BENE protein
D86985 0.567698 KIAA0232 KIAA0232 gene product
NM_006357 -0.567513 UBE2E3 ubiquitin-conjugating enzyme E2E 3 (homologous to yeast UBC4/5)
AL049397 -0.567434 Homo sapiens mRNA; cDNA DKFZp586C1019 (from clone DKFZp586C1019)
Contig64502 0.567433 ESTs, Weakly similar to unknown [M.musculus]
Contig56298_RC -0.566892 FLJ13154 hypothetical protein FLJ13154
Contig46056_RC 0.566634 ESTs, Weakly similar to PROTEIN ZAP128 [H.sapiens]
AF007153 0.566044 Homo sapiens clone 23736 mRNA sequence
Contig1778_RC -0.565789 ESTs
NM_017702 -0.565789 FLJ20186 hypothetical protein FLJ20186
Contig39226_RC 0.565761 Homo sapiens cDNA FLJ12187 fis, clone MAMMA1000831
NM_000168 0.564879 GLI3 GLI-Kruppel family member (Greig cephalopolysyndactyly syndrome)
Contig57609_RC 0.564751 ESTs, Weakly similar to KDA SUBUNIT [H.sapiens]

045975 0.564602 PIB5PA phosphatidylinositol (4,5) bisphosphate 5-phosphatase, A
AF038182 0.564596 Homo sapiens clone 23860 mRNA sequence
Contig5348_RC 0.564480 ESTs, Weakly similar to 1607338A transcription factor BTF3a [H.sapiens]
NM_001321 -0.564459 CSRP2 cysteine and glycine-rich protein 2
Contig25362_RC -0.563801 ESTs
NM_001609 0.563782 ACADSB acyl-Coenzyme A dehydrogenase, short/branched chain
Contig40146 0.563731 wi84e12.x1 NCI_CGAP_Kid12 Homo sapiens cDNA clone IMAGE:2400046 3' similar to SW:RASD_DICDI P03967 RAS-LIKE PROTEIN RASD ;, mRNA sequence.
NM_016002 0.563403 LOC51097 CGI-49 protein
Contig34303_RC 0.563157 Homo sapiens cDNA: FLJ21517 fis, clone COL05829
Contig55883_RC 0.563141 ESTs
NM_017961 0.562479 FLJ20813 hypothetical protein FLJ20813
M21551 -0.562340 NMB neuromedin B
Contig3940_RC -0.561956 YWHAH tyrosine 3-monooxygenase/tryptophan monooxygenase activation protein, eta polypeptide
AB033111 -0.561746 KIAA1285 KIAA1285 protein
Contig43410_RC 0.561678 ESTs
Contig42006_RC -0.561677 ESTs
Contig57272_RC 0.561228 ESTs
626403 -0.561068 YWHAH tyrosine 3-monooxygenase/tryptophan monooxygenase activation protein, eta polypeptide
NM_ -0.560813 MCM6 minichromosome maintenance deficient (mis5, S. pombe)
NM_003875 -0.560668 GMPS guanine monophosphate synthetase
AK000142 0.559651 AK000142 Homo sapiens cDNA FLJ20135 fis, clone COL06818.

Identifier CorrelationName Description NM 002709 -0.559621PPP1 CB protein phosphatase 1, catalytic subunit, beta isoform NM 001276 -0.558868CH13L1 chitinase 3-like 1 (cartilage glycoprotein-39) NM 002857 0.558862 PXF peroxisomal farnesylated protein Contig33815 -0.558741FLJ22833 hypothetical protein FLJ22833 RC

NM 003740 -0.558491KCNK5 potassium channel, subfamily m K, ember 5 (TASK-2) Contig53646 0.558455 ESTs RC

NM_005538 -0.558350INHBC inhibin, beta C

NM 002111 0.557860 HD huntingtin (Huntington disease) NM 003683 -0.557807D2152056 DNA segment on chromosome (unique) 2056 expressed sequence NM 003035 -0.557380SIL TAL1 (SCL) interrupting locus Contig4388_RC-0.557216 Homo sapiens, Similar to integral 5 membrane protein 3, clone MGC:3011, mRNA, complete cds Contig38288_RC-0.556426 ESTs, Weakly similar to ISHUSS
protein disulfide-isomerase [H.sapiens]

NM 015417 0.556184 DKFZP434 DKFZP4341114 protein NM 015507 -0.556138EGFL6 EGF-like-domain, multiple AF279865 0.555951 KIF13B kinesin family member 13B

Contig31288 -0.555754 ESTs RC

NM 002966 -0.555620S100A10 S100 calcium-binding protein (annexin II ligand, calpactin I, light polypeptide (p11 )) NM 017585 -0.555476SLC2A6 solute carrier family 2 (facilitated g lucose transporter), member NM 013296 -0.555367HSU54999 LGN protein NM 000224 0.554838 KRT18 keratin 18 Contig49270 -0.554593KIAA1553 KIAA1553 protein RC

NM 004848 -0.554538ICB-1 basement membrane-induced gene NM 007275 0.554278 FUS1 lung cancer candidate NM 007044 -0.553550KATNA1 katanin p60 (ATPase-containing) subunit A 1 Contig1829 0.553317 ESTs AF272357 0.553286 NPDC1 neural proliferation, differentiation and control, 1 Identifier CorrelationName Description Contig57584_RC-0.553080 Homo sapiens, Similar to gene rich cluster, C8 gene, clone MGC:2577, mRNA, complete cds NM 003039 -0.552747SLC2A5 solute carrier family 2 (facilitated glucose transporter), member NM 014216 0.552321 ITPK1 inositol 1,3,4-triphosphate kinase NM 007027 -0.552064TOPBP1 topoisomerase (DNA) II binding p rotein AF118224 -0.551916ST14 suppression of tumorigenicity (colon carcinoma, matriptase, epithin) X75315 -0.551853HSRNASE seb4D

B

NM 012101 -0.551824ATDC ataxia-telangiectasia group D-a ssociated protein AL157482 -0.551329FLJ23399 hypothetical protein FLJ23399 NM 012474 -0.551150UMPK uridine monophosphate kinase Contig57081 0.551103 ESTs RC

NM 006941 -0.551069SOX10 SRY (sex determining region Y)-box NM 004694 0.550932 SLC16A6 solute carrier family 16 (monocarboxylic acid transporters), member 6 Contig9541 0.550680 ESTs RC

Contig20617 0.550546 ESTs RC

NM 004252 0.550365 SLC9A3R solute carrier family 9 1 (sodium/hydrogen exchanger), isoform 3 regulatory factor NM 015641 -0.550200DKFZP586 testin NM 004336 -0.550164BUB1 budding uninhibited by benzimidazoles 1 (yeast homology Contig39960 -0.549951FLJ21079 hypothetical protein FLJ21079 RC

NM 020686 0.549659 NPD009 NPD009 protein NM 002633 -0.549647PGM1 phosphoglucomutase 1 Contig30480 0.548932 ESTs RC

NM 003479 0.548896 PTP4A2 protein tyrosine phosphatase type IVA, member 2 NM 001679 -0.548768ATP1 B3 ATPase, Na+/K+ transporting, beta 3 polypeptide NM 001124 -0.548601ADM adrenomedullin NM 001216 -0.548375CA9 carbonic anhydrase IX

Identifier CorrelationName Description U58033 -0.548354MTMR2 myotubularin related protein NM 018389 -0.547875FLJ11320 hypothetical protein FLJ11320 F176012 0.547867 JDP1 J domain containing protein Contig66705 -0.546926ST5 suppression of tumorigenicity NM 018194 0.546878 FLJ10724 hypothetical protein FLJ10724 NM 006851 -0.546823RTVP1 glioma pathogenesis-related protein Contig53870 0.546756 ESTs RC

NM 002482 -0.546012NASP nuclear autoantigenic sperm protein (histone-binding) NM 002292 0.545949 LAMB2 laminin, beta 2 (laminin S) NM 014696 -0.545758KIAA0514 KIAA0514 gene product Contig49855 0.545517 ESTs AL117666 0.545203 DKFZP586 DKFZP586O1624 protein NM 004701 -0.545185CCNB2 cyclin B2 NM 007050 0.544890 PTPRT protein tyrosine phosphatase, r eceptor type, T

NM 000414 0.544778 HSD17B4 hydroxysteroid (17-beta) dehydrogenase 4 Contig52398_RC-0.544775 Homo sapiens cDNA: FLJ21950 fis, clone HEP04949 AB007916 0.544496 KIAA0447 KIAA0447 gene product Contig66219 0.544467 FLJ22402 hypothetical protein FLJ22402 RC

D87453 0.544145 KIAA0264 KIAA0264 protein NM 015515 -0.543929DKFZP434 DKFZP434G032 protein NM 001530 -0.543898HIF1A hypoxia-inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor) NM 004109 -0.543893FDX1 ferredoxin 1 NM 000381 -0.543871MID1 midline 1 (OpitzlBBB syndrome) Contig43983 0.543523 CS2 calsyntenin-2 RC

AL137761 0.543371 Homo sapiens mRNA; cDNA
DKFZp586L2424 (from clone DKFZp586L2424) NM 005764 -0.543175DD96 epithelial protein up-regulated c in arcinoma, membrane associated protein 17 Contig1838 0.542996 Homo sapiens cDNA: FLJ22722 RC c H fis, lone NM 006670 0.542932 5T4 5T4 oncofetal trophoblast glycoprotein Identifier CorrelationName Description Contig28552_RC-0.542617 Homo sapiens mRNA; cDNA
DKFZp434C0931 (from clone DKFZp434C0931 ); partial cds Contig 14284 0.542224 ESTs RC

NM 006290 -0.542115TNFAIP3 tumor necrosis factor, alpha-induced p rotein 3 AL050372 0.541463 Homo sapiens mRNA; cDNA
DKFZp434A091 (from clone DKFZp434A091 ); partial cds NM 014181 -0.541095HSPC159 HSPC159 protein Contig37141_RC0.540990 Homo sapiens cDNA: FLJ23582 fis, clone LNG13759 NM 000947 -0.540621PRIM2A primase, polypeptide 2A (58kD) NM 002136 0.540572 HNRPA1 heterogeneous nuclear ribonucleoprotein A1 NM_004494 -0.540543HDGF hepatoma-derived growth factor (high-mobility group protein 1-like) Contig38983 0.540526 ESTs RC

Contig27882 -0.540506 ESTs RC

211887 -0.540020MMP7 matrix metalloproteinase (matrilysin, uterine) NM 014575 -0.539725SCHIP-1 schwannomin interacting protein Contig38170 0.539708 ESTs RC

Contig44064 0.539403 ESTs RC

U68385 0.539395 MEIS3 Meis (mouse) homolog 3 Contig51967 0.538952 ESTs RC

Contig37562_RC0.538657 ESTs, Weakly similar to transformation-related protein 5 [H.sapiens]

Contig40500_RC0.538582 ESTs, Weakly similar to unnamed protein product [H.sapiens]

Contig1129 0.538339 ESTs RC

NM 002184 0.538185 IL6ST interleukin 6 signal transducer (gp130, oncostatin M receptor) AL049381 0.538041 Homo sapiens cDNA FLJ 12900 fis, clone NT2RP2004321 NM 002189 -0.537867IL15RA interleukin 15 receptor, alpha NM 012110 -0.537562CHIC2 cystein-rich hydrophobic domain 2 AB040881 -0.537473KIAA1448 KIAA1448 protein NM 016577 -0.537430RAB6B RAB6B, member RAS oncogene family NM 001745 0.536940 CAMLG calcium modulating ligand Identifier CorrelationName Description NM 005742 -0.536738P5 protein disulfide isomerase-related protein AB011132 0.536345 KIAA0560 KIAA0560 gene product Contig54898 0.536094 PNN pinin, RC p desmosome associated rotein Contig45049_RC-0.536043FUT4 fucosyltransferase 4 (alpha (1,3) fucosyltransferase, myeloid-specific) NM 006864 -0.535924LILRB3 leukocyte immunoglobulin-like receptor, subfamily B (with TM and ITIM
domains), member Contig53242_RC-0.535909 Homo piens cDNA FLJ11436 sa fis, clone MBA1001213 HE

NM 005544 0.535712 IRS1 insulin receptor substrate Contig47456_RC0.535493 CACNA1 calcium D channel, voltage-dependent, L type, alpha subunit Contig42751 -0.535469 ESTs RC

Contig29126 -0.535186 ESTs RC

NM 012391 0.535067 PDEF prostate t epithelium-specific Ets ranscription factor NM 012429 0.534974 SEC14L2 SEC14 (S.
cerevisiae)-like NM 018171 0.534898 FLJ10659 hypothetical protein Contig53047 -0.534773TTYH1 tweety RC (Drosophila) homolog Contig54968 0.534754 Homo RC sapiens cDNA

fis, clone Contig2099 -0.534694KIAA1691 KIAA1691 RC protein NM 005264 0.534057 GFRA1 GDNF
family receptor alpha NM 014036 -0.533638SBB142 BCM-like membrane protein precursor NM 018101 -0.533473FLJ10468 hypothetical protein Contig56765_RC0.533442 ESTs, Moderately similar to K02E10.2 [C.elegans]

AB006746 -0.533400PLSCR1 phospholipid scramblase NM 001089 0.533350 ABCA3 ATP-binding cassette, sub-family A
(ABC1 ), member NM 018188 -0.533132FLJ10709 hypothetical protein X 94232 -0.532925MAPRE2 microtubule-associated protein, RP/EB
family, member A F234532 -0.532910MY010 myosin X

C ontig292 RC 0.532853 FLJ22386 hypothetical protein N M 000101 -0.532767CYBA cytochrome 3 5 b-245, alpha polypeptide C ontig47814 -0.532656HHGP HHGP
RC protein Identifier CorrelationName Description NM 014320 -0.532430SOUL putative heme-binding protein NM 020347 0.531976 LZTFL1 leucine zipper transcription l factor-ike 1 NM 004323 0.531936 BAG1 BCL2-associated athanogene Contig50850 -0.531914 ESTs RC

Contig11648 0.531704 ESTs RC

NM 018131 -0.531559FLJ10540 hypothetical protein FLJ10540 NM 004688 -0.531329NMI N-myc (and STAT) interactor NM 014870 0.531101 KIAA0478 KIAA0478 gene product Contig31424 0.530720 ESTs RC

NM 000874 -0.530545IFNAR2 interferon (alpha, beta and r omega) eceptor 2 Contig50588 0.530145 ESTs RC

NM 016463 0.529998 HSPC195 hypothetical protein NM 013324 0.529966 CISH cytokine inducible SH2-containing 1 5 protein NM 006705 0.529840 GADD45G growth arrest and DNA-damage-inducible, gamma Contig38901 -0.529747 ESTs RC

NM 004184 -0.529635WARS tryptophanyl-tRNA synthetase NM_015955 -0.529538LOC51072 CGI-27 protein AF151810 0.529416 CGI-52 similar to phosphatidylcholine transfer protein 2 NM 002164 -0.529117INDO indoleamine-pyrrole 2,3 dioxygenase NM 004267 -0.528679CHST2 carbohydrate (chondroitin 6/keratan) sulfotransferase 2 Contig32185_RC-0.528529 Homo sapiens cDNA FLJ13997 fis, clone Y79AA1002220 NM 004154 -0.528343P2RY6 pyrimidinergic receptor P2Y, p G-rotein coupled, 6 NM 005235 0.528294 ERBB4 v-erb-a avian erythroblastic l eukemia viral oncogene homolog-like 4 Contig40208 -0.528062LOC56938 transcription factor BMAL2 RC

NM 013262 0.527297 MIR myosin regulatory light chain i nteracting protein NM 003034 -0.527148SIATBA sialyltransferase 8 (alpha-N-acetylneuraminate: alpha-2,8-sialytransferase, GD3 synthase) A

Identifier CorrelationName Description NM 004556 -0.527146NFKBIE nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, epsilon NM 002046 -0.527051GAPD glyceraldehyde-3-phosphate dehydrogenase NM 001905 -0.526986CTPS CTP synthase Contig42402 0.526852 ESTs RC

NM 014272 -0.526283ADAMTS7 a disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 7 AF076612 0.526205 CHRD chordin Contig57725_RC-0.526122 Homo sapiens mRNA for HMG-box transcription factor TCF-3, complete cds Contig42041 -0.525877 ESTs RC

Contig44656_RC-0.525868 ESTs, Highly similar to S02392 alpha-2-macroglobulin receptor precursor [H.sapiens]

NM 018004 -0.525610FLJ10134 hypothetical protein FLJ10134 Contig56434 0.525510 Homo sapiens cDNA FLJ13603 RC fis, _ clone PLACE1010270 D25328 -0.525504PFKP phosphofructokinase, platelet Contig55950 -0.525358FLJ22329 hypothetical protein FLJ22329 RC

_ NM 002648 -0.525211PIM1 pim-1 oncogene AL157505 0.525186 Homo sapiens mRNA; cDNA

DKFZp586P1124 (from clone DKFZp586P1124) AF061034 -0.525185FIP2 Homo sapiens FIP2 alternatively translated mRNA, complete cds.

NM 014721 -0.525102KIAA0680 KIAA0680 gene product NM 001634 -0.525030AMD1 S-adenosylmethionine decarboxylase 1 NM 006304 -0.524911DSS1 Deleted in split-hand/split-foot r egion Contig37778_RC0.524667 ESTs, Highly similar to HLHUSB

MHC class II histocompatibility antigen HLA-DP alpha-1 chain precursor [H.sapiens]

NM 003099 0.524339 SNX1 sorting nexin 1 AL079298 0.523774 MCCC2 methylcrotonoyl-Coenzyme A

carboxylase 2 (beta) NM 019013 -0.523663FLJ10156 hypothetical protein Identifier CorrelationName Description NM 000397 -0.523293CYBB cytochrome b-245, beta polypeptide ( chronic granulomatous disease) NM 014811 0.523132 KIAA0649KIAA0649 gene product Contig20600 0.523072 ESTs RC

NM 005190 -0.522710CCNC cyclin C

AL161960 -0.522574FLJ21324hypothetical protein FLJ21324 AL117502 0.522280 Homo sapiens mRNA; cDNA
DKFZp434D0935 (from clone DKFZp434D0935) AF131753 -0.522245 Homo sapiens clone 24859 mRNA
sequence NM 000320 0.521974 QDPR quinoid dihydropteridine reductase NM 002115 -0.521870HK3 hexokinase 3 (white cell) NM 006460 0.521696 HIS1 HMBA-inducible NM 018683 -0.521679ZNF313 zinc finger protein 313 NM 004305 -0.521539BIN1 bridging integrator 1 NM 006770 -0.521538MARCO macrophage receptor with collagenous structure NM 001166 -0.521530BIRC2 baculoviral IAP repeat-containing D42047 0.521522 KIAA0089KIAA0089 protein NM 016235 -0.521298GPRCSB G protein-coupled receptor, family C, group 5, member B

NM 004504 -0.521189HRB HIV-1 Rev binding protein NM 002727 -0.521146PRG1 proteoglycan 1, secretory granule AB029031 -0.520761KIAA1108KIAA1108 protein NM 005556 -0.520692KRT7 keratin 7 NM_018031 0.520600 WDR6 WD repeat domain 6 AL117523 -0.520579KIAA1053KIAA1053 protein NM 004515 -0.520363ILF2 interleukin enhancer binding 2 factor , 45kD

NM 004708 -0.519935PDCD5 programmed cell death 5 NM 005935 0.519765 MLLT2 myeloid/lymphoid or mixed-lineage l eukemia (trithorax (Drosophila) 0 homology; translocated to, Contig49289_RC-0.519546 Homo sapiens mRNA; cDNA
. DKFZp586J1119 (from clone DKFZp586J1119); complete cds NM 000211 -0.519342ITGB2 integrin, beta 2 (antigen l CD18 (p95), ymphocyte function-associated 5 antigen 1; macrophage antigen (mac-1 ) beta subunit) Identifier CorrelationName Description AL079276 0.519207 LOC58495 putative zinc finger protein from Contig57825 0.519041 ESTs RC

NM 002466 -0.518911MYBL2 v-myb avian myeloblastosis viral oncogene homolog-like 2 NM 016072 -0.518802LOC51026 CGI-141 protein AB007950 -0.518699KIAA0481 KIAA0481 gene product NM 001550 -0.518549IFRD1 interferon-related developmental r egulator 1 AF155120 -0.518221UBE2V1 ubiquitin-conjugating enzyme variant 1 Contig49849_RC0.517983 ESTs, Weakly similar to AF188706 1 g20 protein [H.sapiens]

NM 016625 -0.517936LOC51319 hypothetical protein NM 004049 -0.517862BCL2A1 BCL2-related protein A1 Contig50719 0.517740 ESTs RC

D80010 -0.517620LPIN1 lipin 1 NM 000299 -0.517405PKP1 plakophilin 1 (ectodermal dysplasia/skin fragility syndrome) AL049365 0.517080 FTL ferritin, light polypeptide Contig65227 0.517003 ESTs NM 004865 -0.516808TBPL1 TBP-like 1 Contig54813 0.516246 FLJ13962 hypothetical protein FLJ13962 RC

NM 003494 -0.516221DYSF dysferlin, limb girdle muscular d ystrophy 2B (autosomal recessive) NM 004431 -0.516212EPHA2 EphA2 AL117600 -0.516067DKFZP564 DKFZP564J0863 protein AL080209 -0.516037DKFZP586 hypothetical protein F2423 DKFZp586F2423 NM 000135 -0.515613FANCA Fanconi anemia, complementation g roup A

NM 000050 -0.515494ASS argininosuccinate synthetase NM 001830 -0.515439CLCN4 chloride channel 4 NM 018234 -0.515365FLJ10829 hypothetical protein FLJ10829 Contig53307_RC0.515328 ESTs, Highly similar to KIAA1437 protein [H.sapiens]

AL117617 -0.515141 Homo sapiens mRNA; cDNA
DKFZp564H0764 (from clone DKFZp564H0764) NM 002906 -0.515098RDX radixin Identifier CorrelationName Description NM 003360 -0.514427 UGTB UDP glycosyltransferase 8 (UDP-galactose ceramide galactosyltransferase) NM 018478 0.514332 HSMNP1 uncharacterized hypothalamus protein HSMNP1 M90657 -0.513908 TM4SF1 transmembrane 4 superfamily member 1 NM 014967 0.513793 KIAA1018KIAA1018 protein Contig1462_RC0.513604 C110RF1 chromosome 11 open reading 5 frame Contig37287 -0.513324 ESTs RC

NM 000355 -0.513225 TCN2 transcobalamin II; macrocytic a nemia AB037756 0.512914 KIAA1335hypothetical protein KIAA1335 Contig842 -0.512880 ESTs RC

NM 018186 -0.512878 FLJ10706hypothetical protein FLJ10706 NM 014668 0.512746 KIAA0575KIAA0575 gene product NM 003226 0.512611 TFF3 trefoil factor 3 (intestinal) Contig56457_RC-0.512548 TMEFF1 transmembrane protein with EGF-like and two follistatin-like domains 1 AL050367 -0.511999 Homo sapiens mRNA; cDNA
DKFZp564A026 (from clone 0 DKFZp564A026) NM 014791 -0.511963 KIAA0175KIAA0175 gene product Contig36312 0.511794 ESTs RC

NM 004811 -0.511447 LPXN leupaxin Contig67182_RC-0.511416 ESTs, Highly similar to epithelial V-like antigen precursor [H.sapiens]

Contig52723 -0.511134 ESTs RC

Contig17105_RC-0.511072 Homo sapiens mRNA for putative cytoplasmatic protein (ORF1-FL21 ) NM 014449 0.511023 A protein "A"

Contig52957 0.510815 ESTs RC

Contig49388_RC0.510582 FLJ13322hypothetical protein FLJ13322 NM 017786 0.510557 FLJ20366hypothetical protein FLJ20366 AL157476 0.510478 Homo sapiens mRNA; cDNA
DKFZp761 C082 (from clone DKFZp761 C082) NM 001919 0.510242 DCI dodecenoyl-Coenzyme A delta isomerase (3,2 trans-enoyl-5 Coenzyme A isomerase) Identifier CorrelationName Description NM 000268 -0.5107 NF2 neurofibromin 2 (bilateral 65 acoustic neuroma) NM 016210 0.510018 LOC51161 g20 protein Contig45816 -0.509977 ESTs RC

NM 003953 -0.509969MPZL1 myelin protein zero-like NM 000057 -0.509669BLM Bloom syndrome NM 014452 -0.509473DR6 death receptor 6 Contig45156_RG0.509284 ESTs, Moderately similar to motor domain of KIF12 [M.musculus]

NM 006943 0.509149 SOX22 SRY (sex determining region Y)-box NM 000594 -0.509012TNF tumor necrosis factor (TNF

superfamily, member 2) AL137316 -0.508353KIAA1609 KIAA1609 protein NM 000557 -0.508325GDF5 growth differentiation factor ( cartilage-derived morphogenetic protein-1) NM 018685 -0.508307ANLN anillin (Drosophila Scraps homology, a ctin binding protein Contig53401 0.508189 ESTs RC

NM 014364 -0.508170GAPDS glyceraldehyde-3-phosphate dehydrogenase, testis-specific Contig50297_RC0.508137 ESTs, Moderately similar to SX SEQUENCE CONTAMINATION

WARNING ENTRY [H.sapiens]

Contig51800 0.507891 ESTs, Weakly similar to SP SEQUENCE CONTAMINATION

WARNING ENTRY [H.sapiens]

Contig49098 -0.507716MGC4090 hypothetical protein MGC4090 RC

NM 002985 -0.507554SCYAS small inducible cytokine (RANTES) AB007899 0.507439 KIAA0439 KIAA0439 protein; homolog of yeast ubiquitin-protein ligase Rsp5 AL110139 0.507145 Homo sapiens mRNA; cDNA

DKFZp564O1763 (from clone DKFZp564O1763) Contig51117 0.507001 ESTs RC

NM 017660 -0.506768FLJ20085 hypothetical protein FLJ20085 NM 018000 0.506686 FLJ10116 hypothetical protein FLJ10116 NM 005555 -0.506516KRT6B keratin 6B

NM_005582 -0.506462 LY64 lymphocyte antigen 64 (mouse) homolog, radioprotective, 105kD
Contig47405_RC 0.506202 ESTs
NM_014808 0.506173 KIAA0793 KIAA0793 gene product
NM_004938 -0.506121 DAPK1 death-associated protein kinase 1
NM_020659 -0.505793 TTYH1 tweety (Drosophila) homolog
NM_006227 -0.505604 PLTP phospholipid transfer protein
NM_014268 -0.505412 MAPRE2 microtubule-associated protein, RP/EB family, member 2
NM_004711 0.504849 SYNGR1 synaptogyrin 1
NM_004418 -0.504497 DUSP2 dual specificity phosphatase
NM_003508 -0.504475 FZD9 frizzled (Drosophila) homolog

Table 3. 430 gene markers that distinguish BRCA1-related tumor samples from sporadic tumor samples
GenBank Accession Number    SEQ ID NO    GenBank Accession Number    SEQ ID NO
AB002301 SEQ ID NO 4    NM_012391 SEQ ID NO 1406
AB032966 SEQ ID NO 53    NM_014402 SEQ ID NO 1488
AB032988 SEQ ID NO 57    NM_014476 SEQ ID NO 1496
AF070536 SEQ ID NO 133    NM_014785 SEQ ID NO 1534
AJ272057 SEQ ID NO 203    NM_015937 SEQ ID NO 1582
AK001438 SEQ ID NO 229    NM_016018 SEQ ID NO 1593
NM_000969 SEQ ID NO 547    X57809 SEQ ID NO 1912
NM_001504 SEQ ID NO 620    Contig237_RC SEQ ID NO 1940
NM_001553 SEQ ID NO 630    Contig292_RC SEQ ID NO 1942
NM_001674 SEQ ID NO 646    Contig372_RC SEQ ID NO 1943
NM_001675 SEQ ID NO 647    Contig756_RC SEQ ID NO 1955
NM_001725 SEQ ID NO 652    Contig842_RC SEQ ID NO 1958
NM_001740 SEQ ID NO 656    Contig1632_RC SEQ ID NO 1977
NM_001756 SEQ ID NO 659    Contig1826_RC SEQ ID NO 1980
NM_001770 SEQ ID NO 664    Contig2237_RC SEQ ID NO 1988
NM_001797 SEQ ID NO 670    Contig2915_RC SEQ ID NO 2003
NM_001845 SEQ ID NO 680    Contig3164_RC SEQ ID NO 2007
NM_001873 SEQ ID NO 684    Contig3252_RC SEQ ID NO 2008
NM_001888 SEQ ID NO 687    Contig3940_RC SEQ ID NO 2018
NM_001892 SEQ ID NO 688    Contig9259_RC SEQ ID NO 2039
NM_001919 SEQ ID NO 694    Contig10268_RC SEQ ID NO 2041
NM_001946 SEQ ID NO 698    Contig10437_RC SEQ ID NO 2043
NM_001953 SEQ ID NO 699    Contig10973_RC SEQ ID NO 2044
NM_001960 SEQ ID NO 704    Contig14390_RC SEQ ID NO 2054
NM_001985 SEQ ID NO 709    Contig16453_RC SEQ ID NO 2060
NM_002023 SEQ ID NO 712    Contig16759_RC SEQ ID NO 2061
NM_002051 SEQ ID NO 716    Contig19551 SEQ ID NO 2070
NM_002053 SEQ ID NO 717    Contig24541_RC SEQ ID NO 2088
NM_002164 SEQ ID NO 734    Contig25362_RC SEQ ID NO 2093
NM_002200 SEQ ID NO 739    Contig25617_RC SEQ ID NO 2094
NM_002201 SEQ ID NO 740    Contig25722_RC SEQ ID NO 2096
NM_002213 SEQ ID NO 741    Contig26022_RC SEQ ID NO 2099
NM_002250 SEQ ID NO 747    Contig27915_RC SEQ ID NO 2114
NM_002512 SEQ ID NO 780    Contig28081_RC SEQ ID NO 2116
NM_002542 SEQ ID NO 784    Contig28179_RC SEQ ID NO 2118
NM_002561 SEQ ID NO 786    Contig28550_RC SEQ ID NO 2119
NM_002615 SEQ ID NO 793    Contig29639_RC SEQ ID NO 2127
NM_002686 SEQ ID NO 803    Contig29647_RC SEQ ID NO 2128
NM_002709 SEQ ID NO 806    Contig30092_RC SEQ ID NO 2130
NM_002742 SEQ ID NO 812    Contig30209_RC SEQ ID NO 2132

NM_002775 SEQ ID NO 815    Contig32185_RC SEQ ID NO 2156
NM_002975 SEQ ID NO 848    Contig32798_RC SEQ ID NO 2161
NM_002982 SEQ ID NO 849    Contig33230_RC SEQ ID NO 2163
NM_003104 SEQ ID NO 870    Contig33394_RC SEQ ID NO 2165
NM_003118 SEQ ID NO 872    Contig36323_RC SEQ ID NO 2197
NM_003144 SEQ ID NO 876    Contig36761_RC SEQ ID NO 2201
NM_003165 SEQ ID NO 882    Contig37141_RC SEQ ID NO 2209
NM_003197 SEQ ID NO 885    Contig37778_RC SEQ ID NO 2218
NM_003202 SEQ ID NO 886    Contig38285_RC SEQ ID NO 2222
NM_003217 SEQ ID NO 888    Contig38520_RC SEQ ID NO 2225
NM_003283 SEQ ID NO 898    Contig38901_RC SEQ ID NO 2232
NM_003462 SEQ ID NO 911    Contig39826_RC SEQ ID NO 2241
NM_003500 SEQ ID NO 918    Contig40212_RC SEQ ID NO 2251
NM_003561 SEQ ID NO 925    Contig40712_RC SEQ ID NO 2257
NM_003607 SEQ ID NO 930    Contig41402_RC SEQ ID NO 2265
NM_003633 SEQ ID NO 933    Contig41635_RC SEQ ID NO 2272
NM_003641 SEQ ID NO 934    Contig42006_RC SEQ ID NO 2280
NM_003683 SEQ ID NO 943    Contig42220_RC SEQ ID NO 2286
NM_003729 SEQ ID NO 949    Contig42306_RC SEQ ID NO 2287
NM_003793 SEQ ID NO 954    Contig43918_RC SEQ ID NO 2312
NM_003829 SEQ ID NO 958    Contig44195_RC SEQ ID NO 2316
NM_003866 SEQ ID NO 961    Contig44265_RC SEQ ID NO 2318
NM_003904 SEQ ID NO 967    Contig44278_RC SEQ ID NO 2319
NM_003953 SEQ ID NO 974    Contig44757_RC SEQ ID NO 2329
NM_004024 SEQ ID NO 982    Contig45588_RC SEQ ID NO 2349
NM_004053 SEQ ID NO 986    Contig46262_RC SEQ ID NO 2361
NM_004295 SEQ ID NO 1014    Contig46288_RC SEQ ID NO 2362
NM_004438 SEQ ID NO 1038    Contig46343_RC SEQ ID NO 2363
NM_004559 SEQ ID NO 1057    Contig46452_RC SEQ ID NO 2366
NM_004616 SEQ ID NO 1065    Contig46868_RC SEQ ID NO 2373
NM_004741 SEQ ID NO 1080    Contig46937_RC SEQ ID NO 2377
NM_004772 SEQ ID NO 1084    Contig48004_RC SEQ ID NO 2393
NM_004791 SEQ ID NO 1086    Contig48249_RC SEQ ID NO 2397
NM_004848 SEQ ID NO 1094    Contig48774_RC SEQ ID NO 2405
NM_004866 SEQ ID NO 1097    Contig48913_RC SEQ ID NO 2411
NM_005128 SEQ ID NO 1121    Contig48945_RC SEQ ID NO 2412
NM_005148 SEQ ID NO 1124    Contig48970_RC SEQ ID NO 2413
NM_005196 SEQ ID NO 1127    Contig49233_RC SEQ ID NO 2419
NM_005326 SEQ ID NO 1140    Contig49289_RC SEQ ID NO 2422
NM_005518 SEQ ID NO 1161    Contig49342_RC SEQ ID NO 2423
NM_005538 SEQ ID NO 1163    Contig49510_RC SEQ ID NO 2430
NM_005557 SEQ ID NO 1170    Contig49855 SEQ ID NO 2440
NM_005718 SEQ ID NO 1189    Contig49948_RC SEQ ID NO 2442
NM_005804 SEQ ID NO 1201    Contig50297_RC SEQ ID NO 2451
GenBank SEQ ID NO GenBank SEQ ID NO
Accession Accession Number Number NM 005824 SEQ ID NO 1203 Contig50669RC SEQ ID NO 2458 NM 005935 SEQ ID NO 1220 Contig50673RC SEQ ID NO 2459 NM 006002 SEQ ID NO 1225 Contig50838RC SEQ ID NO 2465 NM 006148 SEQ ID NO 1249 Contig51068RC SEQ ID NO 2471 NM 006235 SEQ ID NO 1257 Contig51929 SEQ ID NO 2492 NM 006271 SEQ ID NO 1261 Contig51953RC SEQ ID NO 2493 NM 006287 SEQ ID NO 1264 Contig52405 SEQ ID NO 2502 RC

NM 006296 SEQ ID NO 1267 Contig52543RC SEQ ID NO 2505 NM 006378 SEQ ID NO 1275 Contig52720RC SEQ ID NO 2513 NM 006461 SEQ ID NO 1287 Contig53281RC SEQ ID NO 2530 NM 006573 SEQ ID NO 1300 Contig53598RC SEQ ID NO 2537 NM 006622 SEQ ID NO 1302 Contig53757RC SEQ ID NO 2543 NM 006696 SEQ ID NO 1308 Contig53944RC SEQ ID NO 2545 NM 006769 SEQ ID NO 1316 Contig54425 SEQ ID NO 2561 NM 006787 SEQ ID NO 1319 Contig54547RC SEQ ID NO 2565 NM 006875 SEQ ID NO 1334 Contig54757RC SEQ ID NO 2574 NM 006885 SEQ ID NO 1335 Contig54916RC SEQ ID NO 2581 NM 006918 SEQ ID NO 1339 Contig55770RC SEQ ID NO 2604 NM 006923 SEQ ID NO 1340 Contig55801RC SEQ ID NO 2606 NM 006941 SEQ ID NO 1342 Contig56143RC SEQ ID NO 2619 NM 007070 SEQ ID NO 1354 Contig56160RC SEQ ID NO 2620 NM 007088 SEQ ID NO 1356 Contig56303 SEQ ID NO 2626 RC

NM 007146 SEQ ID NO 1358 Contig57023RC SEQ ID NO 2639 NM 007173 SEQ ID NO 1359 Contig57138RC SEQ ID NO 2644 NM 007246 SEQ ID NO 1366 Contig57609RC SEQ ID NO 2657 NM 007358 SEQ ID NO 1374 Contig58301RC SEQ ID NO 2667 NM 012135 SEQ ID NO 1385 Contig58512RC SEQ ID NO 2670 NM 012151 SEQ ID NO 1387 Contig60393 SEQ ID NO 2674 NM 012258 SEQ ID NO 1396 Contig60509RC SEQ ID NO 2675 NM 012317 SEQ ID NO 1399 Contig61254RC SEQ ID NO 2677 NM 012337 SEQ ID NO 1403 Contig62306 SEQ ID NO 2680 NM 012339 SEQ ID NO 1404 Contig64502 SEQ ID NO 2689 Table 4. 100 preferred markers from Table 3 distinguishing BRCAl-related tumors from sporadic tumors.
Identifier | Correlation | Sequence Name | Description
NM_001892 | -0.651689 | CSNK1A1 | casein kinase 1, alpha 1
NM_018171 | -0.637696 | FLJ10659 | hypothetical protein FLJ10659
Contig40712_RC | -0.612509 |  | ESTs
NM_001204 | -0.608470 | BMPR2 | bone morphogenetic protein receptor, type II (serine/threonine kinase)
NM_005148 | -0.598612 | UNC119 | unc119 (C.elegans) homolog
626403 | 0.585054 | YWHAH | tyrosine 3-monooxygenase/tryptophan monooxygenase activation protein, eta polypeptide
NM_015640 | 0.583397 | PAI-RBP1 | PAI-1 mRNA-binding protein
Contig9259_RC | 0.581362 |  | ESTs
AB033049 | -0.578750 | KIAA1223 | KIAA1223 protein
NM_015523 | 0.576029 | DKFZP566E | small fragment nuclease
Contig41402_RC | -0.571650 |  | Human DNA sequence from clone RP11-16L21 on chromosome 9. Contains the gene for NADP-dependent leukotriene B4 hydroxydehydrogenase, the gene for a novel DnaJ domain protein similar to Drosophila, C. elegans and Arabidopsis predicted proteins, the GNG10 gene for guanine nucleotide binding protein 10, a novel gene, ESTs, STSs, GSSs and six CpG islands
NM_004791 | -0.564819 | ITGBL1 | integrin, beta-like 1 (with EGF-like repeat domains)
NM_007070 | 0.561173 | FAP48 | FKBP-associated protein
NM_014597 | 0.555907 | HSU15552 | acidic 82 kDa protein mRNA
AF000974 | 0.547194 | TRIP6 | thyroid hormone receptor interactor
NM_016073 | -0.547072 | CGI-142 | CGI-142
Contig3940_RC | 0.544073 | YWHAH | tyrosine 3-monooxygenase/tryptophan monooxygenase activation protein, eta polypeptide
NM_003683 | 0.542219 | D21S2056E | DNA segment on chromosome (unique) 2056 expressed sequence
Contig58512_RC | -0.528458 |  | Homo sapiens pancreas tumor-related protein (FKSG12) mRNA, complete cds
NM_003904 | 0.521223 | ZNF259 | zinc finger protein 259
Contig26022_RC | 0.517351 |  | ESTs
Contig48970_RC | -0.516953 | KIAA0892 | KIAA0892 protein
NM_016307 | -0.515398 | PRX2 | paired related homeobox protein
AL137761 | -0.514891 |  | Homo sapiens mRNA; cDNA DKFZp586L2424 (from clone DKFZp586L2424)
NM_001919 | -0.514799 | DCI | dodecenoyl-Coenzyme A delta isomerase (3,2 trans-enoyl-Coenzyme A isomerase)
NM_000196 | -0.514004 | HSD11B2 | hydroxysteroid (11-beta) dehydrogenase 2
NM_002200 | 0.513149 | IRF5 | interferon regulatory factor
AL133572 | 0.511340 |  | Homo sapiens mRNA; cDNA DKFZp434I0535 (from clone DKFZp434I0535); partial cds
NM_019063 | 0.511127 | C2ORF2 | chromosome 2 open reading frame
Contig25617_RC | 0.509506 |  | ESTs
NM_007358 | 0.508145 | M96 | putative DNA binding protein
NM_014785 | -0.507114 | KIAA0258 | KIAA0258 gene product
NM_006235 | 0.506585 | POU2AF1 | POU domain, class 2, associating factor 1
NM_014680 | -0.505779 | KIAA0100 | KIAA0100 gene product
X66087 | 0.500842 | MYBL1 | v-myb avian myeloblastosis viral oncogene homolog-like 1
Y07512 | -0.500686 | PRKG1 | protein kinase, cGMP-dependent, type I
NM_006296 | 0.500344 | VRK2 | vaccinia related kinase 2
Contig44278_RC | 0.498260 | DKFZP434K114 | DKFZP434K114 protein
Contig56160_RC | -0.497695 |  | ESTs
NM_002023 | -0.497570 | FMOD | fibromodulin
M28170 | 0.497095 | CD19 | CD19 antigen
D26488 | 0.496511 | KIAA0007 | KIAA0007 protein
X72475 | 0.496125 |  | H.sapiens mRNA for rearranged Ig kappa light chain variable region (1.114)
K02276 | 0.496068 | MYC | v-myc avian myelocytomatosis viral oncogene homolog
NM_013378 | 0.495648 | VPREB3 | pre-B lymphocyte gene 3
X58529 | 0.495608 | IGHM | immunoglobulin heavy constant mu
NM_000168 | -0.494260 | GLI3 | GLI-Kruppel family member (Greig cephalopolysyndactyly syndrome)
NM_004866 | -0.492967 | SCAMP1 | secretory carrier membrane protein
NM_013253 | -0.491159 | DKK3 | dickkopf (Xenopus laevis) homolog
NM_003729 | 0.488971 | RPC | RNA 3'-terminal phosphate cyclase
NM_006875 | 0.487407 | PIM2 | pim-2 oncogene
NM_018188 | 0.487126 | FLJ10709 | hypothetical protein FLJ10709
NM_004848 | 0.485408 | ICB-1 | basement membrane-induced gene
NM_001179 | 0.483253 | ART3 | ADP-ribosyltransferase 3
NM_016548 | -0.482329 | LOC51280 | golgi membrane protein GP73
NM_007146 | -0.481994 | ZNF161 | zinc finger protein 161
NM_021242 | -0.481754 | STRAIT11499 | hypothetical protein STRAIT11499
NM_016223 | 0.481710 | PACSIN3 | protein kinase C and casein kinase substrate in neurons 3
NM_003197 | -0.481526 | TCEB1L | transcription elongation factor B (SIII), polypeptide 1-like
NM_000067 | -0.481003 | CA2 | carbonic anhydrase II
NM_006885 | -0.479705 | ATBF1 | AT-binding transcription factor 1
NM_002542 | 0.478282 | OGG1 | 8-oxoguanine DNA glycosylase
AL133619 | -0.476596 |  | Homo sapiens mRNA; cDNA DKFZp434E2321 (from clone DKFZp434E2321); partial cds
D80001 | 0.476130 | KIAA0179 | KIAA0179 protein
NM_018660 | -0.475548 | LOC55893 | papillomavirus regulatory factor
AB004857 | 0.473440 | SLC11A2 | solute carrier family 11 (proton-coupled divalent metal ion transporters), member 2
NM_002250 | 0.472900 | KCNN4 | potassium intermediate/small conductance calcium-activated channel, subfamily N, member
Contig56143_RC | -0.472611 |  | ESTs, Weakly similar to A54849 collagen alpha 1 (VII) chain precursor [H.sapiens]
NM_001960 | 0.471502 | EEF1D | eukaryotic translation elongation factor 1 delta (guanine nucleotide exchange protein)
Contig52405_RC | -0.470705 |  | ESTs, Weakly similar to SX SEQUENCE CONTAMINATION WARNING ENTRY [H.sapiens]
Contig30092_RC | -0.469977 |  | Homo sapiens PR-domain zinc finger protein 6 isoform B (PRDM6) mRNA, partial cds; alternatively spliced
NM_003462 | -0.468753 | P28 | dynein, axonemal, light intermediate polypeptide
Contig60393 | 0.468475 |  | ESTs
Contig842_RC | 0.468158 |  | ESTs
NM_002982 | 0.466362 | SCYA2 | small inducible cytokine (monocyte chemotactic protein 1, homologous to mouse Sig-je)
Contig14390_RC | 0.464150 |  | ESTs
NM_001770 | 0.463847 | CD19 | CD19 antigen
AK000617 | -0.463158 |  | Homo sapiens mRNA; cDNA DKFZp434L235 (from clone DKFZp434L235)
AF073299 | -0.463007 | SLC9A2 | solute carrier family 9 (sodium/hydrogen exchanger), isoform 2
NM_019049 | 0.461990 | FLJ20054 | hypothetical protein
AL137347 | -0.460778 | DKFZP761M | hypothetical protein
NM_000396 | -0.460263 | CTSK | cathepsin K (pycnodysostosis)
NM_018373 | -0.459268 | FLJ11271 | hypothetical protein FLJ11271
NM_002709 | 0.458500 | PPP1CB | protein phosphatase 1, catalytic subunit, beta isoform
NM_016820 | 0.457516 | OGG1 | 8-oxoguanine DNA glycosylase
Contig10268_RC | 0.456933 |  | Human DNA sequence from clone RP11-196N14 on chromosome Contains ESTs, STSs, GSSs and CpG islands. Contains three novel genes, part of a gene for a novel protein similar to protein serine/threonine phosphatase regulatory subunit 1 (PP4R1) and a gene for a novel protein with an ankyrin domain
NM_014521 | -0.456733 | SH3BP4 | SH3-domain binding protein
AJ272057 | -0.456548 | STRAIT11499 | hypothetical protein STRAIT11499
NM_015964 | -0.456187 | LOC51673 | brain specific protein
Contig16759_RC | -0.456169 |  | ESTs
NM_015937 | -0.455954 | LOC51604 | CGI-06 protein
NM_007246 | -0.455500 | KLHL2 | kelch (Drosophila)-like 2 (Mayven)
NM_001985 | -0.453024 | ETFB | electron-transfer-flavoprotein, beta polypeptide
NM_000984 | -0.452935 | RPL23A | ribosomal protein L23a
Contig51953_RC | -0.451695 |  | ESTs
NM_015984 | 0.450491 | UCH37 | ubiquitin C-terminal hydrolase
NM_000903 | -0.450371 | DIA4 | diaphorase (NADH/NADPH) (cytochrome b-5 reductase)
NM_001797 | -0.449862 | CDH11 | cadherin 11, type 2, OB-cadherin (osteoblast)
NM_014878 | 0.449818 | KIAA0020 | KIAA0020 gene product
NM_002742 | -0.449590 | PRKCM | protein kinase C, mu

Table 5. 231 gene markers that distinguish patients with good prognosis from patients with poor prognosis.
GenBank Accession Number | SEQ ID NO | GenBank Accession Number | SEQ ID NO
AB037745 | SEQ ID NO 75 | NM_014363 | SEQ ID NO 1480
AB037863 | SEQ ID NO 88 | NM_014750 | SEQ ID NO 1527
AJ224741 | SEQ ID NO 196 | NM_016337 | SEQ ID NO 1636
AL137514 | SEQ ID NO 327 | NM_018265 | SEQ ID NO 1766
... | ... | ... | ...
NM_001124 | SEQ ID NO 562 | Contig753_RC | SEQ ID NO 1954
NM_001168 | SEQ ID NO 566 | Contig1778_RC | SEQ ID NO 1979
NM_001216 | SEQ ID NO 574 | Contig2399_RC | SEQ ID NO 1989
NM_001280 | SEQ ID NO 588 | Contig2504_RC | SEQ ID NO 1991
NM_001282 | SEQ ID NO 589 | Contig3902_RC | SEQ ID NO 2017
NM_001333 | SEQ ID NO 597 | Contig4595 | SEQ ID NO 2022
NM_001673 | SEQ ID NO 645 | Contig8581_RC | SEQ ID NO 2037
NM_001809 | SEQ ID NO 673 | Contig13480_RC | SEQ ID NO 2052
NM_001827 | SEQ ID NO 676 | Contig17359_RC | SEQ ID NO 2068
NM_001905 | SEQ ID NO 691 | Contig20217_RC | SEQ ID NO 2072
NM_002019 | SEQ ID NO 711 | Contig21812_RC | SEQ ID NO 2082
NM_002073 | SEQ ID NO 721 | Contig24252_RC | SEQ ID NO 2087
NM_002358 | SEQ ID NO 764 | Contig25055_RC | SEQ ID NO 2090
NM_002570 | SEQ ID NO 787 | Contig25343_RC | SEQ ID NO 2092
NM_002808 | SEQ ID NO 822 | Contig25991 | SEQ ID NO 2098
NM_002811 | SEQ ID NO 823 | Contig27312_RC | SEQ ID NO 2108
NM_002900 | SEQ ID NO 835 | Contig28552_RC | SEQ ID NO 2120
NM_002916 | SEQ ID NO 838 | Contig32125_RC | SEQ ID NO 2155
NM_003158 | SEQ ID NO 881 | Contig32185_RC | SEQ ID NO 2156
NM_003234 | SEQ ID NO 891 | Contig33814_RC | SEQ ID NO 2169
NM_003239 | SEQ ID NO 893 | Contig34634_RC | SEQ ID NO 2180
NM_003258 | SEQ ID NO 896 | Contig35251_RC | SEQ ID NO 2185
NM_003376 | SEQ ID NO 906 | Contig37063_RC | SEQ ID NO 2206
NM_003600 | SEQ ID NO 929 | Contig37598 | SEQ ID NO 2216
NM_003607 | SEQ ID NO 930 | Contig38288_RC | SEQ ID NO 2223
NM_003662 | SEQ ID NO 938 | Contig40128_RC | SEQ ID NO 2248
NM_003676 | SEQ ID NO 941 | Contig40831_RC | SEQ ID NO 2260
NM_003748 | SEQ ID NO 951 | Contig41413_RC | SEQ ID NO 2266
NM_003862 | SEQ ID NO 960 | Contig41887_RC | SEQ ID NO 2276
NM_003875 | SEQ ID NO 962 | Contig42421_RC | SEQ ID NO 2291
NM_003878 | SEQ ID NO 963 | Contig43747_RC | SEQ ID NO 2311
NM_003882 | SEQ ID NO 964 | Contig44064_RC | SEQ ID NO 2315
NM_003981 | SEQ ID NO 977 | Contig44289_RC | SEQ ID NO 2320
NM_004052 | SEQ ID NO 985 | Contig44799_RC | SEQ ID NO 2330
NM_004163 | SEQ ID NO 995 | Contig45347_RC | SEQ ID NO 2344
NM_004336 | SEQ ID NO 1022 | Contig45816_RC | SEQ ID NO 2351
NM_004358 | SEQ ID NO 1026 | Contig46218_RC | SEQ ID NO 2358
NM_004456 | SEQ ID NO 1043 | Contig46223_RC | SEQ ID NO 2359
NM_004480 | SEQ ID NO 1046 | Contig46653_RC | SEQ ID NO 2369
NM_004504 | SEQ ID NO 1051 | Contig46802_RC | SEQ ID NO 2372
NM_004603 | SEQ ID NO 1064 | Contig47405_RC | SEQ ID NO 2384
NM_004701 | SEQ ID NO 1075 | Contig48328_RC | SEQ ID NO 2400
NM_004702 | SEQ ID NO 1076 | Contig49670_RC | SEQ ID NO 2434
NM_004798 | SEQ ID NO 1087 | Contig50106_RC | SEQ ID NO 2445
NM_004911 | SEQ ID NO 1102 | Contig50410 | SEQ ID NO 2453
NM_004994 | SEQ ID NO 1108 | Contig50802_RC | SEQ ID NO 2463
NM_005196 | SEQ ID NO 1127 | Contig51464_RC | SEQ ID NO 2481
NM_005342 | SEQ ID NO 1143 | Contig51519_RC | SEQ ID NO 2482
NM_005496 | SEQ ID NO 1157 | Contig51749_RC | SEQ ID NO 2486
NM_005563 | SEQ ID NO 1173 | Contig51963 | SEQ ID NO 2494
NM_005915 | SEQ ID NO 1215 | Contig53226_RC | SEQ ID NO 2525
NM_006096 | SEQ ID NO 1240 | Contig53268_RC | SEQ ID NO 2529
NM_006101 | SEQ ID NO 1241 | Contig53646_RC | SEQ ID NO 2538
NM_006115 | SEQ ID NO 1245 | Contig53742_RC | SEQ ID NO 2542
NM_006117 | SEQ ID NO 1246 | Contig55188_RC | SEQ ID NO 2586
NM_006201 | SEQ ID NO 1254 | Contig55313_RC | SEQ ID NO 2590
NM_006265 | SEQ ID NO 1260 | Contig55377_RC | SEQ ID NO 2591
NM_006281 | SEQ ID NO 1263 | Contig55725_RC | SEQ ID NO 2600
NM_006372 | SEQ ID NO 1273 | Contig55813_RC | SEQ ID NO 2607
NM_006681 | SEQ ID NO 1306 | Contig55829_RC | SEQ ID NO 2608
NM_006763 | SEQ ID NO 1315 | Contig56457_RC | SEQ ID NO 2630
NM_006931 | SEQ ID NO 1341 | Contig57595 | SEQ ID NO 2655
NM_007036 | SEQ ID NO 1349 | Contig57864_RC | SEQ ID NO 2663
NM_007203 | SEQ ID NO 1362 | Contig58368_RC | SEQ ID NO 2668
NM_012177 | SEQ ID NO 1390 | Contig60864_RC | SEQ ID NO 2676
NM_012214 | SEQ ID NO 1392 | Contig63102_RC | SEQ ID NO 2684
NM_012261 | SEQ ID NO 1397 | Contig63649_RC | SEQ ID NO 2686
NM_012429 | SEQ ID NO 1413 | Contig64688 | SEQ ID NO 2690

Table 6. 70 Preferred prognosis markers drawn from Table 5.
Identifier | Correlation | Sequence Name | Description
AL080059 | -0.527150 |  | Homo sapiens mRNA for KIAA1750 protein, partial cds
Contig63649_RC | -0.468130 |  | ESTs
Contig46218_RC | -0.432540 |  | ESTs
NM_016359 | -0.424930 | LOC51203 | clone HQ0310 PRO0310p1
AA555029_RC | -0.424120 |  | ESTs
NM_003748 | 0.420671 | ALDH4 | aldehyde dehydrogenase 4 (glutamate gamma-semialdehyde dehydrogenase; pyrroline-5-carboxylate dehydrogenase)
Contig38288_RC | -0.414970 |  | ESTs, Weakly similar to ISHUSS protein disulfide-isomerase [H.sapiens]
NM_003862 | 0.410964 | FGF18 | fibroblast growth factor 18
Contig28552_RC | -0.409260 |  | Homo sapiens mRNA; cDNA DKFZp434C0931 (from clone DKFZp434C0931); partial cds
Contig32125_RC | 0.409054 |  | ESTs
U82987 | 0.407002 | BBC3 | Bcl-2 binding component 3
AL137718 | -0.404980 |  | Homo sapiens mRNA; cDNA DKFZp434C0931 (from clone DKFZp434C0931); partial cds
AB037863 | 0.402335 | KIAA1442 | KIAA1442 protein
NM_020188 | -0.400070 | DC13 | DC13 protein
NM_020974 | 0.399987 | CEGP1 | CEGP1 protein
NM_000127 | -0.399520 | EXT1 | exostoses (multiple) 1
NM_002019 | -0.398070 | FLT1 | fms-related tyrosine kinase (vascular endothelial growth factor/vascular permeability factor receptor)
NM_002073 | -0.395460 | GNAZ | guanine nucleotide binding protein (G protein), alpha z polypeptide
NM_000436 | -0.392120 | OXCT | 3-oxoacid CoA transferase
NM_004994 | -0.391690 | MMP9 | matrix metalloproteinase 9 (gelatinase B, 92kD gelatinase, 92kD type IV collagenase)
Contig55377_RC | 0.390600 |  | ESTs
Contig35251_RC | -0.390410 |  | Homo sapiens cDNA: FLJ22719 fis, clone HS114307
Contig25991 | -0.390370 | ECT2 | epithelial cell transforming sequence 2 oncogene
NM_003875 | -0.386520 | GMPS | guanine monphosphate synthetase
NM_006101 | -0.385890 | HEC | highly expressed in cancer, rich in leucine heptad repeats
NM_003882 | 0.384479 | WISP1 | WNT1 inducible signaling pathway protein 1
NM_003607 | -0.384390 | PK428 | Ser-Thr protein kinase related to the myotonic dystrophy protein kinase
AF073519 | -0.383340 | SERF1A | small EDRK-rich factor 1A (telomeric)
AF052162 | -0.380830 | FLJ12443 | hypothetical protein FLJ12443
NM_000849 | 0.380831 | GSTM3 | glutathione S-transferase M3 (brain)
Contig32185_RC | -0.379170 |  | Homo sapiens cDNA FLJ13997 fis, clone Y79AA1002220
NM_016577 | -0.376230 | RAB6B | RAB6B, member RAS oncogene family
Contig48328_RC | 0.375252 |  | ESTs, Weakly similar to T17248 hypothetical protein DKFZp586G1122.1 [H.sapiens]
Contig46223_RC | 0.374289 |  | ESTs
NM_015984 | -0.373880 | UCH37 | ubiquitin C-terminal hydrolase
NM_006117 | 0.373290 | PECI | peroxisomal D3,D2-enoyl-CoA isomerase
AK000745 | -0.373060 |  | Homo sapiens cDNA FLJ20738 fis, clone HEP08257
Contig40831_RC | -0.372930 |  | ESTs
NM_003239 | 0.371524 | TGFB3 | transforming growth factor, beta 3
NM_014791 | -0.370860 | KIAA0175 | KIAA0175 gene product
X05610 | -0.370860 | COL4A2 | collagen, type IV, alpha 2
NM_016448 | -0.369420 | L2DTL | L2DTL protein
NM_018401 | 0.368349 | HSA250839 | gene for serine/threonine protein kinase
NM_000788 | -0.367700 | DCK | deoxycytidine kinase
Contig51464_RC | -0.367450 | FLJ22477 | hypothetical protein FLJ22477
AL080079 | -0.367390 | DKFZP564D0462 | hypothetical protein DKFZp564D0462
NM_006931 | -0.366490 | SLC2A3 | solute carrier family 2 (facilitated glucose transporter), member
AF257175 | 0.365900 |  | Homo sapiens hepatocellular carcinoma-associated antigen (HCA64) mRNA, complete cds
NM_014321 | -0.365810 | ORC6L | origin recognition complex, subunit 6 (yeast homolog)-like
NM_002916 | -0.365590 | RFC4 | replication factor C (activator 1) 4 (37kD)
Contig55725_RC | -0.365350 |  | ESTs, Moderately similar to hypothetical protein DKFZp762L0311.1 [H.sapiens]
Contig24252_RC | -0.364990 |  | ESTs
AF201951 | 0.363953 | CFFM4 | high affinity immunoglobulin epsilon receptor beta subunit
NM_005915 | -0.363850 | MCM6 | minichromosome maintenance deficient (mis5, S. pombe)
NM_001282 | 0.363326 | AP2B1 | adaptor-related protein complex 2, beta 1 subunit
Contig56457_RC | -0.361650 | TMEFF1 | transmembrane protein with EGF-like and two follistatin-like domains 1
NM_000599 | -0.361290 | IGFBP5 | insulin-like growth factor binding protein 5
NM_020386 | -0.360780 | LOC57110 | H-REV107 protein-related protein
NM_014889 | -0.360040 | MP1 | metalloprotease 1 (pitrilysin family)
AF055033 | -0.359940 | IGFBP5 | insulin-like growth factor binding protein 5
NM_006681 | -0.359700 | NMU | neuromedin U
NM_007203 | -0.359570 | AKAP2 | A kinase (PRKA) anchor protein
Contig63102_RC | 0.359255 | FLJ11354 | hypothetical protein FLJ11354
NM_003981 | -0.358260 | PRC1 | protein regulator of cytokinesis
Contig20217_RC | -0.357880 |  | ESTs
NM_001809 | -0.357720 | CENPA | centromere protein A (17kD)
Contig2399_RC | -0.356600 | SM-20 | similar to rat smooth muscle protein
NM_004702 | -0.356600 | CCNE2 | cyclin E2
NM_007036 | -0.356540 | ESM1 | endothelial cell-specific molecule 1
NM_018354 | -0.356000 | FLJ11190 | hypothetical protein FLJ11190

The sets of markers listed in Tables 1-6 partially overlap; in other words, some markers are present in multiple sets, while other markers are unique to a set (FIG. 1).
Thus, in one embodiment, the invention provides a set of 256 genetic markers that can distinguish between ER(+) and ER(-), and also between BRCA1 tumors and sporadic tumors (i.e., classify a tumor as ER(+) or ER(-) and BRCA1-related or sporadic). In a more specific embodiment, the invention provides subsets of at least 20, at least 50, at least 100, or at least 150 of the set of 256 markers, that can classify a tumor as ER(+) or ER(-) and BRCA1-related or sporadic. In another embodiment, the invention provides 165 markers that can distinguish between ER(+) and ER(-), and also between patients with good versus poor prognosis (i.e., classify a tumor as either ER(-) or ER(+) and as having been removed from a patient with a good prognosis or a poor prognosis). In a more specific embodiment, the invention further provides subsets of at least 20, 50, 100 or 125 of the full set of 165 markers, which also classify a tumor as either ER(-) or ER(+) and as having been removed from a patient with a good prognosis or a poor prognosis. The invention further provides a set of twelve markers that can distinguish between BRCA1 tumors and sporadic tumors, and between patients with good versus poor prognosis. Finally, the invention provides eleven markers capable of differentiating all three statuses. Conversely, the invention provides 2,050 of the 2,460 ER-status markers that can determine only ER status, 173 of the 430 BRCA1 v. sporadic markers that can determine only BRCA1 v. sporadic status, and 65 of the 231 prognosis markers that can only determine prognosis. In more specific embodiments, the invention also provides for subsets of at least 20, 50, 100, 200, 500, 1,000, 1,500 or 2,000 of the 2,050 ER-status markers that also determine only ER status. The invention also provides subsets of at least 20, 50, 100 or 150 of the 173 markers that also determine only BRCA1 v. sporadic status.
The invention further provides subsets of at least 20, 30, 40, or 50 of the 65 prognostic markers that also determine only prognostic status.
Any of the sets of markers provided above may be used alone or in combination with markers outside the set. For example, markers that distinguish ER-status may be used in combination with the BRCA1 vs. sporadic markers, or with the prognostic markers, or both. Any of the marker sets provided above may also be used in combination with other markers for breast cancer, or for any other clinical or physiological condition.
The relationship between the marker sets is diagramed in FIG. 1.

5.3.2 IDENTIFICATION OF MARKERS
The present invention provides sets of markers for the identification of conditions or indications associated with breast cancer. Generally, the marker sets were identified by determining which of 25,000 human markers had expression patterns that correlated with the conditions or indications.
In one embodiment, the method for identifying marker sets is as follows.
After extraction and labeling of target polynucleotides, the expression of all markers (genes) in a sample X is compared to the expression of all markers in a standard or control. In one embodiment, the standard or control comprises target polynucleotide molecules derived from a sample from a normal individual (i.e., an individual not afflicted with breast cancer).
In a preferred embodiment, the standard or control is a pool of target polynucleotide molecules. The pool may be derived from collected samples from a number of normal individuals. In a preferred embodiment, the pool comprises samples taken from a number of individuals having sporadic-type tumors. In another preferred embodiment, the pool comprises an artificially-generated population of nucleic acids designed to approximate the level of nucleic acid derived from each marker found in a pool of marker-derived nucleic acids derived from tumor samples. In yet another embodiment, the pool is derived from normal or breast cancer cell lines or cell line samples.
The comparison may be accomplished by any means known in the art. For example, expression levels of various markers may be assessed by separation of target polynucleotide molecules (e.g., RNA or cDNA) derived from the markers in agarose or polyacrylamide gels, followed by hybridization with marker-specific oligonucleotide probes. Alternatively, the comparison may be accomplished by the labeling of target polynucleotide molecules followed by separation on a sequencing gel.
Polynucleotide samples are placed on the gel such that patient and control or standard polynucleotides are in adjacent lanes. Comparison of expression levels is accomplished visually or by means of a densitometer. In a preferred embodiment, the expression of all markers is assessed simultaneously by hybridization to a microarray. In each approach, markers meeting certain criteria are identified as associated with breast cancer.
A marker is selected based upon significant difference of expression in a sample as compared to a standard or control condition. Selection may be made based upon either significant up- or down regulation of the marker in the patient sample.
Selection may also be made by calculation of the statistical significance (i.e., the p-value) of the correlation between the expression of the marker and the condition or indication.
Preferably, both selection criteria are used. Thus, in one embodiment of the present invention, markers associated with breast cancer are selected where the markers show both more than two-fold change (increase or decrease) in expression as compared to a standard, and the p-value for the correlation between the existence of breast cancer and the change in marker expression is no more than 0.01 (i.e., is statistically significant).
The expression of the identified breast cancer-related markers is then used to identify markers that can differentiate tumors into clinical types. In a specific embodiment using a number of tumor samples, markers are identified by calculation of correlation coefficients between the clinical category or clinical parameter(s) and the linear, logarithmic or any transform of the expression ratio across all samples for each individual gene.
Specifically, the correlation coefficient is calculated as

$$\rho = \frac{\vec{c} \cdot \vec{r}}{\lVert \vec{c} \rVert \cdot \lVert \vec{r} \rVert} \qquad \text{Equation (2)}$$

where $\vec{c}$ represents the clinical parameters or categories and $\vec{r}$ represents the linear, logarithmic or any transform of the ratio of expression between sample and control.
Markers for which the coefficient of correlation exceeds a cutoff are identified as breast cancer-related markers specific for a particular clinical type. Such a cutoff or threshold corresponds to a certain significance of discriminating genes obtained by Monte Carlo simulations. The threshold depends upon the number of samples used; the threshold can be calculated as $3 \times 1/\sqrt{n-3}$, where $1/\sqrt{n-3}$ is the distribution width and n = the number of samples. In a specific embodiment, markers are chosen if the correlation coefficient is greater than about 0.3 or less than about -0.3.
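The correlation-based selection and the sample-size-dependent cutoff described above can be sketched as follows. This is a minimal illustration, not the applicants' implementation: it assumes a Pearson correlation (both vectors mean-centered before the normalized dot product), a numeric 0/1 coding of the clinical category, and hypothetical function names (`pearson`, `correlation_cutoff`, `select_markers`).

```python
import math


def pearson(c, r):
    """Pearson correlation between clinical codes c and expression transforms r."""
    n = len(c)
    if n != len(r) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_c = sum(c) / n
    mean_r = sum(r) / n
    dc = [x - mean_c for x in c]
    dr = [y - mean_r for y in r]
    den = math.sqrt(sum(a * a for a in dc) * sum(b * b for b in dr))
    if den == 0:
        raise ValueError("a vector has zero variance")
    return sum(a * b for a, b in zip(dc, dr)) / den


def correlation_cutoff(n):
    """Threshold 3 * 1/sqrt(n - 3), where n is the number of samples."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    return 3.0 / math.sqrt(n - 3)


def select_markers(categories, log_ratios_by_gene, cutoff):
    """Keep genes whose correlation with the category exceeds the cutoff in magnitude."""
    selected = {}
    for gene, ratios in log_ratios_by_gene.items():
        rho = pearson(categories, ratios)
        if abs(rho) > cutoff:
            selected[gene] = rho
    return selected
```

With 103 samples the cutoff evaluates to 0.3, which is in line with the "about 0.3" magnitude cited in the text.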
Next, the significance of the correlation is calculated. This significance may be calculated by any statistical means by which such significance is calculated. In a specific example, a set of correlation data is generated using a Monte-Carlo technique to randomize the association between the expression difference of a particular marker and the clinical category. The frequency distribution of markers satisfying the criteria through calculation of correlation coefficients is compared to the number of markers satisfying the criteria in the data generated through the Monte-Carlo technique. The frequency distribution of markers satisfying the criteria in the Monte-Carlo runs is used to determine whether the number of markers selected by correlation with clinical data is significant. See Example 4.
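One way to picture the Monte-Carlo check is a label-permutation test: shuffle the clinical categories, recount how many markers pass the correlation cutoff, and see how often the shuffled count reaches the observed count. The sketch below is an assumption-laden illustration; the function names, the Pearson correlation, the fixed random seed and the add-one p-value estimate are choices made here, not details taken from the source.

```python
import math
import random


def _corr(c, r):
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(c)
    mc, mr = sum(c) / n, sum(r) / n
    dc = [x - mc for x in c]
    dr = [y - mr for y in r]
    den = math.sqrt(sum(a * a for a in dc) * sum(b * b for b in dr))
    return 0.0 if den == 0 else sum(a * b for a, b in zip(dc, dr)) / den


def count_selected(categories, expr_by_gene, cutoff):
    """Number of markers whose |correlation| with the categories exceeds the cutoff."""
    return sum(1 for ratios in expr_by_gene.values()
               if abs(_corr(categories, ratios)) > cutoff)


def monte_carlo_significance(categories, expr_by_gene, cutoff, runs=1000, seed=0):
    """Return (observed count, estimated probability of that count under shuffling)."""
    rng = random.Random(seed)
    observed = count_selected(categories, expr_by_gene, cutoff)
    shuffled = list(categories)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # randomize the marker/category association
        if count_selected(shuffled, expr_by_gene, cutoff) >= observed:
            hits += 1
    return observed, (hits + 1) / (runs + 1)
```

A small estimated probability suggests that the number of markers selected from the real clinical data is unlikely to arise by chance.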
Once a marker set is identified, the markers may be rank-ordered in order of significance of discrimination. One means of rank ordering is by the amplitude of correlation between the change in gene expression of the marker and the specific condition being discriminated. Another, preferred means is to use a statistical metric.
In a specific embodiment, the metric is a Fisher-like statistic:

$$t = \frac{\langle x_1 \rangle - \langle x_2 \rangle}{\sqrt{\left[\sigma_1^2 (n_1 - 1) + \sigma_2^2 (n_2 - 1)\right] / (n_1 + n_2 - 1) / (1/n_1 + 1/n_2)}} \qquad \text{Equation (3)}$$

In this equation, $\langle x_1 \rangle$ is the error-weighted average of the log ratio of transcript expression measurements within a first diagnostic group (e.g., ER(-)), $\langle x_2 \rangle$ is the error-weighted average of log ratio within a second, related diagnostic group (e.g., ER(+)), $\sigma_1^2$ is the variance of the log ratio within the ER(-) group and $n_1$ is the number of samples for which valid measurements of log ratios are available. $\sigma_2^2$ is the variance of log ratio within the second diagnostic group (e.g., ER(+)), and $n_2$ is the number of samples for which valid measurements of log ratios are available. The t-value represents the variance-compensated difference between two means.
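As a rough sketch, the Fisher-like statistic can be computed as below. The code follows the grouping of terms in Equation (3) as printed, that is, the pooled variance is divided by (n1 + n2 - 1) and then by (1/n1 + 1/n2); a textbook pooled-variance t statistic would normalize differently. The inverse-variance weighting used for the "error-weighted average", the use of the sample variance, and all function names are assumptions made for illustration.

```python
import math


def weighted_mean(values, errors=None):
    """Error-weighted average; assumes weights of 1/error^2 (plain mean if no errors)."""
    if errors is None:
        return sum(values) / len(values)
    weights = [1.0 / (e * e) for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def sample_variance(values):
    """Unbiased sample variance of the log ratios in one group."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def fisher_like_t(group1, group2, errors1=None, errors2=None):
    """Variance-compensated difference between the two group means."""
    n1, n2 = len(group1), len(group2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two valid log ratios")
    x1 = weighted_mean(group1, errors1)
    x2 = weighted_mean(group2, errors2)
    pooled = (sample_variance(group1) * (n1 - 1)
              + sample_variance(group2) * (n2 - 1)) / (n1 + n2 - 1)
    denom = math.sqrt(pooled / (1.0 / n1 + 1.0 / n2))
    if denom == 0:
        raise ValueError("zero variance in both groups")
    return (x1 - x2) / denom
```

Markers could then be rank-ordered by the absolute value of this statistic, largest first.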
The rank-ordered marker set may be used to optimize the number of markers in the set used for discrimination. This is accomplished generally in a "leave one out" method as follows. In a first run, a subset, for example 5, of the markers from the top of the ranked list is used to generate a template, where out of X samples, X-1 are used to generate the template, and the status of the remaining sample is predicted. This process is repeated for every sample until every one of the X samples is predicted once. In a second run, additional markers, for example 5, are added, so that a template is now generated from 10 markers, and the outcome of the remaining sample is predicted. This process is repeated until the entire set of markers is used to generate the template. For each of the runs, type 1 errors (false negative) and type 2 errors (false positive) are counted; the optimal number of markers is that number where the type 1 error rate, or type 2 error rate, or preferably the total of type 1 and type 2 error rate is lowest.
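The leave-one-out procedure above can be sketched in a few lines. The classifier used here is an assumption for illustration only: each class template is the mean expression profile of the training samples in that class, and the held-out sample is assigned to the class whose template it correlates with more strongly. Samples are hypothetical dictionaries mapping marker name to log ratio, and labels are 1 (positive) or 0 (negative).

```python
import math


def _corr(a, b):
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 0.0 if den == 0 else sum(x * y for x, y in zip(da, db)) / den


def _template(samples, markers):
    """Mean profile over the given markers for one class of training samples."""
    return [sum(s[m] for s in samples) / len(samples) for m in markers]


def loo_errors(samples, labels, markers):
    """Leave each sample out once; return (false negatives, false positives)."""
    fn = fp = 0
    for i, held_out in enumerate(samples):
        train = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        pos = _template([s for s, l in train if l == 1], markers)
        neg = _template([s for s, l in train if l == 0], markers)
        x = [held_out[m] for m in markers]
        predicted = 1 if _corr(x, pos) >= _corr(x, neg) else 0
        if labels[i] == 1 and predicted == 0:
            fn += 1  # type 1 error in the text's terminology
        elif labels[i] == 0 and predicted == 1:
            fp += 1  # type 2 error in the text's terminology
    return fn, fp


def optimal_marker_count(samples, labels, ranked_markers, step=5):
    """Grow the marker set from the top of the ranked list; keep the lowest total error."""
    sizes = list(range(step, len(ranked_markers) + 1, step))
    if not sizes or sizes[-1] != len(ranked_markers):
        sizes.append(len(ranked_markers))  # always evaluate the full set last
    best = None
    for k in sizes:
        fn, fp = loo_errors(samples, labels, ranked_markers[:k])
        if best is None or fn + fp < best[1]:
            best = (k, fn + fp)
    return best
```

The function returns the smallest marker count that achieves the minimum total of false negatives and false positives, since ties keep the earlier (smaller) set.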
For prognostic markers, validation of the marker set may be accomplished by an additional statistic, a survival model. This statistic generates the probability of tumor distant metastases as a function of time since initial diagnosis. A number of models may be used, including Weibull, normal, log-normal, log-logistic, log-exponential, or log-Rayleigh (Chapter 12 "Life Testing", S-PLUS 2000 GUIDE TO STATISTICS, Vol. 2, p. 368 (2000)).
For the "normal" model, the probability of distant metastases P at time t is calculated as

$$P = \alpha \times \exp\left(-t^2/\tau^2\right) \qquad \text{Equation (4)}$$

where $\alpha$ is fixed and equal to 1, and $\tau$ is a parameter to be fitted and measures the "expected lifetime".
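For illustration, the "normal" model and a simple way to fit its single parameter can be sketched as follows. The fitting method is an assumption made here: with alpha fixed at 1, -ln P is linear in t squared with slope 1/tau squared, so tau is estimated by least squares through the origin. A statistics package would more typically fit such a model by maximum likelihood with censoring, which this sketch does not attempt. The function names are hypothetical.

```python
import math


def normal_model(t, tau, alpha=1.0):
    """P = alpha * exp(-t^2 / tau^2), following Equation (4)."""
    return alpha * math.exp(-(t * t) / (tau * tau))


def fit_tau(times, probs):
    """Estimate tau from (time, P) pairs, assuming alpha = 1 and 0 < P < 1."""
    if len(times) != len(probs) or not times:
        raise ValueError("times and probs must be non-empty and of equal length")
    # Least squares through the origin for y = slope * x, with
    # x = t^2, y = -ln(P) and slope = 1 / tau^2.
    num = sum((t * t) * (-math.log(p)) for t, p in zip(times, probs))
    den = sum((t * t) ** 2 for t in times)
    if num <= 0 or den <= 0:
        raise ValueError("data do not support a positive tau")
    return math.sqrt(den / num)
```

For example, values generated from the model with tau = 8 at times 1, 2, 5 and 10 are recovered by `fit_tau` up to floating-point error.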
It will be apparent to those skilled in the art that the above methods, in particular the statistical methods described above, are not limited to the identification of markers associated with breast cancer, but may be used to identify sets of marker genes associated with any phenotype. The phenotype can be the presence or absence of a disease such as cancer, or the presence or absence of any identifying clinical condition associated with that cancer. In the disease context, the phenotype may be a prognosis such as a survival time, probability of distant metastases of a disease condition, or likelihood of a particular response to a therapeutic or prophylactic regimen. The phenotype need not be cancer, or a disease; the phenotype may be a nominal characteristic associated with a healthy individual.
5.3.3 SAMPLE COLLECTION
In the present invention, target polynucleotide molecules are extracted from a sample taken from an individual afflicted with breast cancer. The sample may be collected in any clinically acceptable manner, but must be collected such that marker-derived polynucleotides (i.e., RNA) are preserved. mRNA or nucleic acids derived therefrom (i.e., cDNA or amplified DNA) are preferably labeled distinguishably from standard or control polynucleotide molecules, and both are simultaneously or independently hybridized to a microarray comprising some or all of the markers or marker sets or subsets described above.
Alternatively, mRNA or nucleic acids derived therefrom may be labeled with the same label as the standard or control polynucleotide molecules, wherein the intensity of hybridization of each at a particular probe is compared. A sample may comprise any clinically relevant tissue sample, such as a tumor biopsy or fine needle aspirate, or a sample of bodily fluid, such as blood, plasma, serum, lymph, ascitic fluid, cystic fluid, urine or nipple exudate. The sample may be taken from a human, or, in a veterinary context, from non-human animals such as ruminants, horses, swine or sheep, or from domestic companion animals such as felines and canines.
Methods for preparing total and poly(A)+ RNA are well known and are described generally in Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989) and Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994).

RNA may be isolated from eukaryotic cells by procedures that involve lysis of the cells and denaturation of the proteins contained therein. Cells of interest include wild-type cells (i.e., non-cancerous), drug-exposed wild-type cells, tumor- or tumor-derived cells, modified cells, normal or tumor cell line cells, and drug-exposed modified cells.
Additional steps may be employed to remove DNA. Cell lysis may be accomplished with a nonionic detergent, followed by microcentrifugation to remove the nuclei and hence the bulk of the cellular DNA. In one embodiment, RNA is extracted from cells of the various types of interest using guanidinium thiocyanate lysis followed by CsCl centrifugation to separate the RNA from DNA (Chirgwin et al., Biochemistry 18:5294-5299 (1979)). Poly(A)+ RNA is selected with oligo-dT cellulose (see Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)). Alternatively, separation of RNA from DNA can be accomplished by organic extraction, for example, with hot phenol or phenol/chloroform/isoamyl alcohol.
If desired, RNase inhibitors may be added to the lysis buffer. Likewise, for certain cell types, it may be desirable to add a protein denaturation/digestion step to the protocol.
For many applications, it is desirable to preferentially enrich mRNA with respect to other cellular RNAs, such as transfer RNA (tRNA) and ribosomal RNA (rRNA). Most mRNAs contain a poly(A) tail at their 3' end. This allows them to be enriched by affinity chromatography, for example, using oligo(dT) or poly(U) coupled to a solid support, such as cellulose or Sephadex™ (see Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994)). Once bound, poly(A)+ mRNA is eluted from the affinity column using 2 mM EDTA/0.1% SDS.
The sample of RNA can comprise a plurality of different mRNA molecules, each different mRNA molecule having a different nucleotide sequence. In a specific embodiment, the mRNA molecules in the RNA sample comprise at least 100 different nucleotide sequences. More preferably, the mRNA molecules of the RNA sample comprise mRNA molecules corresponding to each of the marker genes. In another specific embodiment, the RNA sample is a mammalian RNA sample.
In a specific embodiment, total RNA or mRNA from cells is used in the methods of the invention. The source of the RNA can be cells of a plant or animal, human, mammal, primate, non-human animal, dog, cat, mouse, rat, bird, yeast, eukaryote, prokaryote, etc. In specific embodiments, the method of the invention is used with a sample containing total mRNA or total RNA from 1 × 10^6 cells or less. In another embodiment, proteins can be isolated from the foregoing sources, by methods known in the art, for use in expression analysis at the protein level.
Probes to the homologs of the marker sequences disclosed herein can be employed preferably wherein non-human nucleic acid is being assayed.
5.4 METHODS OF USING BREAST CANCER MARKER SETS
5.4.1 DIAGNOSTIC METHODS
The present invention provides for methods of using the marker sets to analyze a sample from an individual so as to determine the individual's tumor type or subtype at a molecular level, whether a tumor is of the ER(+) or ER(-) type, and whether the tumor is BRCA1-associated or sporadic. The individual need not actually be afflicted with breast cancer. Essentially, the expression of specific marker genes in the individual, or a sample taken therefrom, is compared to a standard or control. For example, assume two breast cancer-related conditions, X and Y. One can compare the level of expression of breast cancer prognostic markers for condition X in an individual to the level of the marker-derived polynucleotides in a control, wherein the level represents the level of expression exhibited by samples having condition X. In this instance, if the expression of the markers in the individual's sample is substantially (i.e., statistically) different from that of the control, then the individual does not have condition X. Where, as here, the choice is bimodal (i.e., a sample is either X or Y), the individual can additionally be said to have condition Y. Of course, the comparison to a control representing condition Y can also be performed. Preferably both are performed simultaneously, such that each control acts as both a positive and a negative control. The distinguishing result may thus either be a demonstrable difference from the expression levels (i.e., the amount of marker-derived RNA, or polynucleotides derived therefrom) represented by the control, or no significant difference.
Thus, in one embodiment, the method of determining a particular tumor-related status of an individual comprises the steps of (1) hybridizing labeled target polynucleotides from an individual to a microarray containing one of the above marker sets; (2) hybridizing standard or control polynucleotide molecules to the microarray, wherein the standard or control molecules are differentially labeled from the target molecules; and (3) determining the difference in transcript levels, or lack thereof, between the target and standard or control, wherein the difference, or lack thereof, determines the individual's tumor-related status. In a more specific embodiment, the standard or control molecules comprise marker-derived polynucleotides from a pool of samples from normal individuals, or a pool of tumor samples from individuals having sporadic-type tumors. In a preferred embodiment, the standard or control is an artificially-generated pool of marker-derived polynucleotides, which pool is designed to mimic the level of marker expression exhibited by clinical samples of normal or breast cancer tumor tissue having a particular clinical indication (i.e., cancerous or non-cancerous; ER(+) or ER(-) tumor; BRCA1- or sporadic-type tumor). In another specific embodiment, the control molecules comprise a pool derived from normal or breast cancer cell lines.
The present invention provides sets of markers useful for distinguishing ER(+) from ER(-) tumor types. Thus, in one embodiment of the above method, the level of polynucleotides (i.e., mRNA or polynucleotides derived therefrom) in a sample from an individual, expressed from the markers provided in Table 1, is compared to the level of expression of the same markers from a control, wherein the control comprises marker-related polynucleotides derived from ER(+) samples, ER(-) samples, or both. Preferably, the comparison is to both ER(+) and ER(-), and preferably the comparison is to polynucleotide pools from a number of ER(+) and ER(-) samples, respectively. Where the individual's marker expression most closely resembles or correlates with the ER(+) control, and does not resemble or correlate with the ER(-) control, the individual is classified as ER(+). Where the pool is not pure ER(+) or ER(-), for example, where a sporadic pool is used, a set of experiments using individuals with known ER status should be hybridized against the pool in order to define the expression templates for the ER(+) and ER(-) groups. Each individual with unknown ER status is hybridized against the same pool and the expression profile is compared to the template(s) to determine the individual's ER status.
The present invention provides sets of markers useful for distinguishing BRCA1-related tumors from sporadic tumors. Thus, the method can be performed substantially as for the ER(+/-) determination, with the exception that the markers are those listed in Tables 3 and 4, and the control markers are a pool of marker-derived polynucleotides from BRCA1 tumor samples, and a pool of marker-derived polynucleotides from sporadic tumors. A patient is determined to have a BRCA1 germline mutation where the expression of the individual's marker-derived polynucleotides most closely resembles, or is most closely correlated with, that of the BRCA1 control. Where the control is not pure BRCA1 or sporadic, two templates can be defined in a manner similar to that for ER status, as described above.
For the above two embodiments of the method, the full set of markers may be used (i.e., the complete set of markers for Tables 1 or 3). In other embodiments, subsets of the markers may be used. In a preferred embodiment, the preferred markers listed in Tables 2 or 4 are used.
The similarity between the marker expression profile of an individual and that of a control can be assessed a number of ways. In the simplest case, the profiles can be compared visually in a printout of expression difference data. Alternatively, the similarity can be calculated mathematically.
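In one embodiment, the similarity measure between two patients x and y, or patient x and a template y, can be calculated using the following equation:
$$S = 1 - \left[ \frac{\sum_{i=1}^{N} \frac{(x_i - \bar{x})}{\sigma_{x_i}} \frac{(y_i - \bar{y})}{\sigma_{y_i}}}{\sqrt{\sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma_{x_i}} \right)^2 \sum_{i=1}^{N} \left( \frac{y_i - \bar{y}}{\sigma_{y_i}} \right)^2}} \right]$$    Equation (5)
In this equation, x and y are two patients with components of log ratio $x_i$ and $y_i$, $i = 1, \ldots, N = 4{,}986$. Associated with every value $x_i$ is error $\sigma_{x_i}$. The smaller the value $\sigma_{x_i}$, the more reliable the measurement $x_i$. $\bar{x} = \sum_{i=1}^{N} \frac{x_i}{\sigma_{x_i}^2} \Big/ \sum_{i=1}^{N} \frac{1}{\sigma_{x_i}^2}$ is the error-weighted arithmetic mean.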
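The error-weighted similarity measure can be illustrated as follows. This is a sketch only, assuming Equation (5) is one minus an error-weighted Pearson-type correlation in which each deviation from the error-weighted mean is divided by its measurement error; the function names and sample values are hypothetical. Under this reading, S is 0 for perfectly correlated profiles and 2 for perfectly anti-correlated profiles.

```python
import math

def error_weighted_mean(values, errors):
    """Mean with weights 1 / sigma_i^2."""
    num = sum(v / (e * e) for v, e in zip(values, errors))
    den = sum(1.0 / (e * e) for e in errors)
    return num / den

def similarity(x, sigma_x, y, sigma_y):
    """S = 1 - weighted correlation of two log-ratio profiles."""
    x_bar = error_weighted_mean(x, sigma_x)
    y_bar = error_weighted_mean(y, sigma_y)
    dx = [(xi - x_bar) / si for xi, si in zip(x, sigma_x)]
    dy = [(yi - y_bar) / si for yi, si in zip(y, sigma_y)]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return 1.0 - num / den

# Made-up log ratios and errors for four markers.
x = [0.1, -0.4, 0.8, 0.3]
sx = [0.1, 0.2, 0.1, 0.3]
s_same = similarity(x, sx, x, sx)
s_opposite = similarity(x, sx, [-v for v in x], sx)
```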
In a preferred embodiment, templates are developed for sample comparison.
The template is defined as the error-weighted log ratio average of the expression difference for the group of marker genes able to differentiate the particular breast cancer-related condition. For example, templates are defined for ER(+) samples and for ER(-) samples.
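Template construction can be sketched as a per-marker, error-weighted average over the samples of one group. The weighting by 1/sigma^2 is an assumption consistent with the error-weighted mean used elsewhere in this section; the function name and data are hypothetical.

```python
def build_template(profiles, errors):
    """Error-weighted average log ratio per marker across samples.

    profiles[s][g] is the log ratio of marker g in sample s;
    errors[s][g] is the corresponding measurement error.
    """
    n_markers = len(profiles[0])
    template = []
    for g in range(n_markers):
        num = sum(p[g] / (e[g] ** 2) for p, e in zip(profiles, errors))
        den = sum(1.0 / (e[g] ** 2) for e in errors)
        template.append(num / den)
    return template

# Two made-up samples of one group, two markers each.
er_pos_profiles = [[1.0, 0.0], [3.0, 2.0]]
er_pos_errors = [[1.0, 1.0], [2.0, 1.0]]
er_pos_template = build_template(er_pos_profiles, er_pos_errors)
```

With equal errors the template reduces to the ordinary mean; a noisier measurement (larger error) pulls the template less.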
Next, a classifier parameter is calculated. This parameter may be calculated using either expression level differences between the sample and template, or by calculation of a correlation coefficient. Such a coefficient, $P_i$, can be calculated using the following equation:
$P_i = (\vec{z}_i \cdot \vec{y}) / (\lVert \vec{z}_i \rVert \cdot \lVert \vec{y} \rVert)$    Equation (1)
where $\vec{z}_i$ is the expression template i, and $\vec{y}$ is the expression profile of a patient.
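The correlation coefficient between a template and a patient profile can be sketched as the dot product divided by the product of the vector norms. Assigning the patient to the template with the highest coefficient is one plausible decision rule, shown here as an assumption; the function names, template labels, and values are hypothetical.

```python
import math

def correlation(z, y):
    """Dot product of template z and profile y over the product of norms."""
    dot = sum(a * b for a, b in zip(z, y))
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_z * norm_y)

def classify(profile, templates):
    """Return the label of the best-correlated template and all scores."""
    scores = {label: correlation(z, profile) for label, z in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up templates and patient profile (log ratios for three markers).
templates = {"ER(+)": [1.0, -1.0, 0.5], "ER(-)": [-1.0, 1.0, -0.5]}
label, scores = classify([0.8, -0.9, 0.4], templates)
```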
Thus, in a more specific embodiment, the above method of determining a particular tumor-related status of an individual comprises the steps of (1) hybridizing labeled target polynucleotides from an individual to a microarray containing one of the above marker sets; (2) hybridizing standard or control polynucleotide molecules to the microarray, wherein the standard or control molecules are differentially labeled from the target molecules; (3) determining the ratio (or difference) of transcript levels between two channels (individual and control), or simply the transcript levels of the individual; and (4) comparing the results from (3) to the predefined templates, wherein said determining is accomplished by means of the statistic of Equation 1 or Equation 5, and wherein the difference, or lack thereof, determines the individual's tumor-related status.
5.4.2 PROGNOSTIC METHODS
The present invention provides sets of markers useful for distinguishing samples from those patients with a good prognosis from samples from patients with a poor prognosis. Thus, the invention further provides a method for using these markers to determine whether an individual afflicted with breast cancer will have a good or poor clinical prognosis. In one embodiment, the invention provides for a method of determining whether an individual afflicted with breast cancer will likely experience a relapse within five years of initial diagnosis (i.e., whether an individual has a poor prognosis) comprising (1) comparing the level of expression of the markers listed in Table 5 in a sample taken from the individual to the level of the same markers in a standard or control, where the standard or control levels represent those found in an individual with a poor prognosis; and (2) determining whether the level of the marker-related polynucleotides in the sample from the individual is significantly different from that of the control, wherein if no substantial difference is found, the patient has a poor prognosis, and if a substantial difference is found, the patient has a good prognosis. Persons of skill in the art will readily see that the markers associated with good prognosis can also be used as controls. In a more specific embodiment, both controls are run. In case the pool is not pure 'good prognosis' or 'poor prognosis', a set of experiments of individuals with known outcome should be hybridized against the pool to define the expression templates for the good prognosis and poor prognosis groups. Each individual with unknown outcome is hybridized against the same pool and the resulting expression profile is compared to the templates to predict that individual's outcome.
Poor prognosis of breast cancer may indicate that a tumor is relatively aggressive, while good prognosis may indicate that a tumor is relatively nonaggressive.
Therefore, the invention provides for a method of determining a course of treatment of a breast cancer patient, comprising determining whether the level of expression of the 231 markers of Table 5, or a subset thereof, correlates with the level of these markers in a sample representing a good prognosis expression pattern or a poor prognosis pattern; and determining a course of treatment, wherein if the expression correlates with the poor prognosis pattern, the tumor is treated as an aggressive tumor.
As with the diagnostic markers, the method can use the complete set of markers listed in Table 5. However, subsets of the markers may also be used.
In a preferred embodiment, the subset listed in Table 6 is used.
Classification of a sample as "good prognosis" or "poor prognosis" is accomplished substantially as for the diagnostic markers described above, wherein a template is generated to which the marker expression levels in the sample are compared.
The use of marker sets is not restricted to the prognosis of breast cancer-related conditions, and may be applied in a variety of phenotypes or conditions, clinical or experimental, in which gene expression plays a role. Where a set of markers has been identified that corresponds to two or more phenotypes, the marker sets can be used to distinguish these phenotypes. For example, the phenotypes may be the diagnosis and/or prognosis of clinical states or phenotypes associated with other cancers, other disease conditions, or other physiological conditions, wherein the expression level data is derived from a set of genes correlated with the particular physiological or disease condition.
5.4.3 IMPROVING SENSITIVITY TO EXPRESSION LEVEL DIFFERENCES
In using the markers disclosed herein, and, indeed, using any sets of markers to differentiate an individual having one phenotype from another individual having a second phenotype, one can compare the absolute expression of each of the markers in a sample to a control; for example, the control can be the average level of expression of each of the markers, respectively, in a pool of individuals. To increase the sensitivity of the comparison, however, the expression level values are preferably transformed in a number of ways.
For example, the expression level of each of the markers can be normalized by the average expression level of all markers the expression level of which is determined, or by the average expression level of a set of control genes. Thus, in one embodiment, the markers are represented by probes on a microarray, and the expression level of each of the markers is normalized by the mean or median expression level across all of the genes represented on the microarray, including any non-marker genes. In a specific embodiment, the normalization is carried out by dividing by the median or mean level of expression of all of the genes on the microarray. In another embodiment, the expression levels of the markers are normalized by the mean or median level of expression of a set of control markers. In a specific embodiment, the control markers comprise a set of housekeeping genes.
In another specific embodiment, the normalization is accomplished by dividing by the median or mean expression level of the control genes.
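The two normalization options described above (dividing by the median of all genes on the array, or by the median of a set of control genes) can be sketched as follows. The function name, gene labels, and intensity values are hypothetical and serve only as an illustration.

```python
from statistics import median

def normalize_by_median(levels, control_genes=None):
    """Divide every expression level by a reference median.

    levels: dict mapping gene name to raw expression level.
    control_genes: optional list of gene names (e.g., housekeeping genes);
    if omitted, the median over all genes on the array is used.
    """
    if control_genes is None:
        reference = median(levels.values())
    else:
        reference = median(levels[g] for g in control_genes)
    return {gene: value / reference for gene, value in levels.items()}

# Made-up raw intensities: two markers, two housekeeping genes, one other gene.
raw = {"marker1": 200.0, "marker2": 50.0, "hk1": 80.0, "hk2": 120.0, "other": 400.0}
by_all_genes = normalize_by_median(raw)
by_controls = normalize_by_median(raw, control_genes=["hk1", "hk2"])
```

Replacing `median` with `statistics.mean` gives the mean-based variant mentioned in the text.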
The sensitivity of a marker-based assay will also be increased if the expression levels of individual markers are compared to the expression of the same markers in a pool of samples. Preferably, the comparison is to the mean or median expression level of each of the marker genes in the pool of samples. Such a comparison may be accomplished, for example, by dividing the expression level of each of the markers in the sample by the mean or median expression level of that marker in the pool. This has the effect of accentuating the relative differences in expression between markers in the sample and markers in the pool as a whole, making comparisons more sensitive and more likely to produce meaningful results than the use of absolute expression levels alone.
The expression level data may be transformed in any convenient way; preferably, the expression level data for all markers is log transformed before means or medians are taken.
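A comparison of a sample to a pool on log-transformed data can be sketched as a log ratio: the log of each marker's level in the sample minus the mean of the logs of that marker across the pool. The choice of base-10 logarithms, the function name, and the values are assumptions made for illustration.

```python
import math

def log_ratio_to_pool(sample, pool_samples):
    """log10(sample level) minus the mean log10 level in the pool, per marker.

    sample: dict mapping marker name to expression level.
    pool_samples: list of dicts with the same marker names.
    Logs are taken before averaging, as preferred in the text.
    """
    ratios = {}
    for marker, level in sample.items():
        pool_logs = [math.log10(p[marker]) for p in pool_samples]
        pool_mean = sum(pool_logs) / len(pool_logs)
        ratios[marker] = math.log10(level) - pool_mean
    return ratios

# Made-up levels for two markers in a sample and a two-member pool.
pool = [{"m1": 10.0, "m2": 100.0}, {"m1": 1000.0, "m2": 100.0}]
ratios = log_ratio_to_pool({"m1": 1000.0, "m2": 100.0}, pool)
```

Because the pool values can be stored once and reused, the same function applies to single-channel data from later samples.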
In performing comparisons to a pool, two approaches may be used. First, the expression levels of the markers in the sample may be compared to the expression level of those markers in the pool, where nucleic acid derived from the sample and nucleic acid derived from the pool are hybridized during the course of a single experiment.
Such an approach requires that new pool nucleic acid be generated for each comparison or limited numbers of comparisons, and is therefore limited by the amount of nucleic acid available.
Alternatively, and preferably, the expression levels in a pool, whether normalized and/or transformed or not, are stored on a computer, or on computer-readable media, to be used in comparisons to the individual expression level data from the sample (i. e., single-channel data).
Thus, the current invention provides the following method of classifying a first cell or organism as having one of at least two different phenotypes, where the different phenotypes comprise a first phenotype and a second phenotype. The level of expression of each of a plurality of genes in a first sample from the first cell or organism is compared to the level of expression of each of said genes, respectively, in a pooled sample from a plurality of cells or organisms, the plurality of cells or organisms comprising different cells or organisms exhibiting said at least two different phenotypes, respectively, to produce a first compared value. The first compared value is then compared to a second compared value, wherein said second compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said first phenotype to the level of expression of each of said genes, respectively, in the pooled sample. The first compared value is then compared to a third compared value, wherein said third compared value is the product of a method comprising comparing the level of expression of each of the genes in a sample from a cell or organism characterized as having the second phenotype to the level of expression of each of the genes, respectively, in the pooled sample. Optionally, the first compared value can be compared to additional compared values, respectively, where each additional compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having a phenotype different from said first and second phenotypes but included among the at least two different phenotypes, to the level of expression of each of said genes, respectively, in said pooled sample.
Finally, a determination is made as to which of said second, third, and, if present, one or more additional compared values, said first compared value is most similar, wherein the first cell or organism is determined to have the phenotype of the cell or organism used to produce said compared value most similar to said first compared value.
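The final step of this method, choosing the reference compared value most similar to the first compared value, can be sketched as a nearest-neighbor lookup. The text does not fix a similarity metric at this point; Euclidean distance is used here purely as an assumption, and the function name, phenotype labels, and values are hypothetical.

```python
import math

def most_similar_phenotype(first_value, reference_values):
    """Return the phenotype whose compared value is closest to first_value.

    first_value: list of per-gene compared values (e.g., log ratios to the pool).
    reference_values: dict mapping phenotype label to its compared value;
    it may hold two or more phenotypes.
    """
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    return min(reference_values,
               key=lambda label: distance(first_value, reference_values[label]))

# Made-up compared values for two genes and three phenotypes.
references = {
    "first phenotype": [0.5, -0.5],
    "second phenotype": [-0.5, 0.5],
    "additional phenotype": [2.0, 2.0],
}
assigned = most_similar_phenotype([0.4, -0.3], references)
```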
In a specific embodiment of this method, the compared values are each ratios of the levels of expression of each of said genes. In another specific embodiment, each of the levels of expression of each of the genes in the pooled sample are normalized prior to any of the comparing steps. In a more specific embodiment, the normalization of the levels of expression is carried out by dividing by the median or mean level of the expression of each of the genes or dividing by the mean or median level of expression of one or more housekeeping genes in the pooled sample from said cell or organism. In another specific embodiment, the normalized levels of expression are subjected to a log transform, and the comparing steps comprise subtracting the log transform from the log of the levels of expression of each of the genes in the sample. In another specific embodiment, the two or more different phenotypes are different stages of a disease or disorder. In still another specific embodiment, the two or more different phenotypes are different prognoses of a disease or disorder. In yet another specific embodiment, the levels of expression of each of the genes, respectively, in the pooled sample or said levels of expression of each of said genes in a sample from the cell or organism characterized as having the first phenotype, second phenotype, or said phenotype different from said first and second phenotypes, respectively, are stored on a computer or on a computer-readable medium.
In another specific embodiment, the two phenotypes are ER(+) or ER(-) status. In another specific embodiment, the two phenotypes are BRCA1 or sporadic tumor-type status. In yet another specific embodiment, the two phenotypes are good prognosis and poor prognosis.
Of course, single-channel data may also be used without specific comparison to a mathematical sample pool. For example, a sample may be classified as having a first or a second phenotype, wherein the first and second phenotypes are related, by calculating the similarity between the expression of at least 5 markers in the sample, where the markers are correlated with the first or second phenotype, to the expression of the same markers in a first phenotype template and a second phenotype template, by (a) labeling nucleic acids derived from a sample with a fluorophore to obtain a pool of fluorophore-labeled nucleic acids; (b) contacting said fluorophore-labeled nucleic acid with a microarray under conditions such that hybridization can occur, detecting at each of a plurality of discrete loci on the microarray a fluorescent emission signal from said fluorophore-labeled nucleic acid that is bound to said microarray under said conditions; and (c) determining the similarity of marker gene expression in the individual sample to the first and second templates, wherein if said expression is more similar to the first template, the sample is classified as having the first phenotype, and if said expression is more similar to the second template, the sample is classified as having the second phenotype.
5.5 DETERMINATION OF MARKER GENE EXPRESSION LEVELS
5.5.1 METHODS
The expression levels of the marker genes in a sample may be determined by any means known in the art. The expression level may be determined by isolating and determining the level (i.e., amount) of nucleic acid transcribed from each marker gene. Alternatively, or additionally, the level of specific proteins translated from mRNA transcribed from a marker gene may be determined.
The level of expression of specific marker genes can be determined by measuring the amount of mRNA, or polynucleotides derived therefrom, present in a sample. Any method for determining RNA levels can be used. For example, RNA is isolated from a sample and separated on an agarose gel. The separated RNA is then transferred to a solid support, such as a filter. Nucleic acid probes representing one or more markers are then hybridized to the filter by northern hybridization, and the amount of marker-derived RNA is determined. Such determination can be visual, or machine-aided, for example, by use of a densitometer. Another method of determining RNA levels is by use of a dot-blot or a slot-blot. In this method, RNA, or nucleic acid derived therefrom, from a sample is labeled. The RNA or nucleic acid derived therefrom is then hybridized to a filter containing oligonucleotides derived from one or more marker genes, wherein the oligonucleotides are placed upon the filter at discrete, easily-identifiable locations. Hybridization, or lack thereof, of the labeled RNA to the filter-bound oligonucleotides is determined visually or by densitometer. Polynucleotides can be labeled using a radiolabel or a fluorescent (i.e., visible) label. These examples are not intended to be limiting; other methods of determining RNA abundance are known in the art.
The level of expression of particular marker genes may also be assessed by determining the level of the specific protein expressed from the marker genes. This can be accomplished, for example, by separation of proteins from a sample on a polyacrylamide gel, followed by identification of specific marker-derived proteins using antibodies in a western blot. Alternatively, proteins can be separated by two-dimensional gel electrophoresis systems. Two-dimensional gel electrophoresis is well-known in the art and typically involves isoelectric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al., 1990, GEL ELECTROPHORESIS OF PROTEINS: A PRACTICAL APPROACH, IRL Press, New York; Shevchenko et al., Proc. Nat'l Acad. Sci. USA 93:1440-1445 (1996); Sagliocco et al., Yeast 12:1519-1533 (1996); Lander, Science 274:536-539 (1996). The resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, western blotting and immunoblot analysis using polyclonal and monoclonal antibodies.
Alternatively, marker-derived protein levels can be determined by constructing an antibody microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome. Preferably, antibodies are present for a substantial fraction of the marker-derived proteins of interest. Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane, 1988, ANTIBODIES: A LABORATORY MANUAL, Cold Spring Harbor, New York, which is incorporated in its entirety for all purposes). In one embodiment, monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequence of the cell. With such an antibody array, proteins from the cell are contacted to the array, and their binding is assayed with assays known in the art.
Generally, the expression, and the level of expression, of proteins of diagnostic or prognostic interest can be detected through immunohistochemical staining of tissue slices or sections.

Finally, expression of marker genes in a number of tissue specimens may be characterized using a "tissue array" (Kononen et al., Nat. Med 4(7):844-7 (1998)). In a tissue array, multiple tissue samples are assessed on the same microarray. The arrays allow in situ detection of RNA and protein levels; consecutive sections allow the analysis of multiple samples simultaneously.
5.5.2 MICROARRAYS
In preferred embodiments, polynucleotide microarrays are used to measure expression so that the expression status of each of the markers above is assessed simultaneously. In a specific embodiment, the invention provides for oligonucleotide or cDNA arrays comprising probes hybridizable to the genes corresponding to each of the marker sets described above (i.e., markers to determine the molecular type or subtype of a tumor; markers to distinguish ER status; markers to distinguish BRCA1 from sporadic tumors; markers to distinguish patients with good versus patients with poor prognosis; markers to distinguish both ER(+) from ER(-), and BRCA1 tumors from sporadic tumors; markers to distinguish ER(+) from ER(-), and patients with good prognosis from patients with poor prognosis; markers to distinguish BRCA1 tumors from sporadic tumors, and patients with good prognosis from patients with poor prognosis; and markers able to distinguish ER(+) from ER(-), BRCA1 tumors from sporadic tumors, and patients with good prognosis from patients with poor prognosis; and markers unique to each status).
The microarrays provided by the present invention may comprise probes hybridizable to the genes corresponding to markers able to distinguish the status of one, two, or all three of the clinical conditions noted above. In particular, the invention provides polynucleotide arrays comprising probes to a subset or subsets of at least 50, 100, 200, 300, 400, 500, 750, 1,000, 1,250, 1,500, 1,750, 2,000 or 2,250 genetic markers, up to the full set of 2,460 markers, which distinguish ER(+) and ER(-) patients or tumors. The invention also provides probes to subsets of at least 20, 30, 40, 50, 75, 100, 150, 200, 250, 300, 350 or 400 markers, up to the full set of 430 markers, which distinguish between tumors containing a BRCA1 mutation and sporadic tumors within an ER(-) group of tumors. The invention also provides probes to subsets of at least 20, 30, 40, 50, 75, 100, 150 or 200 markers, up to the full set of 231 markers, which distinguish between patients with good and poor prognosis within sporadic tumors. In a specific embodiment, the array comprises probes to marker sets or subsets directed to any two of the clinical conditions. In a more specific embodiment, the array comprises probes to marker sets or subsets directed to all three clinical conditions.

In yet another specific embodiment, microarrays that are used in the methods disclosed herein optionally comprise markers additional to at least some of the markers listed in Tables 1-6. For example, in a specific embodiment, the microarray is a screening or scanning array as described in Altschuler et al., International Publication WO 02/18646, published March 7, 2002 and Scherer et al., International Publication WO 02/16650, published February 28, 2002. The scanning and screening arrays comprise regularly-spaced, positionally-addressable probes derived from genomic nucleic acid sequence, both expressed and unexpressed. Such arrays may comprise probes corresponding to a subset of, or all of, the markers listed in Tables 1-6, or a subset thereof as described above, and can be used to monitor marker expression in the same way as a microarray containing only markers listed in Tables 1-6.
In yet another specific embodiment, the microarray is a commercially-available cDNA microarray that comprises at least five of the markers listed in Tables 1-6.
Preferably, a commercially-available cDNA microarray comprises all of the markers listed in Tables 1-6. However, such a microarray may comprise 5, 10, 15, 25, 50, 100, 150, 250, 500, 1000 or more of the markers in any of Tables 1-6, up to the maximum number of markers in a Table, and may comprise all of the markers in any one of Tables 1-6 and a subset of another of Tables 1-6, or subsets of each as described above. In a specific embodiment of the microarrays used in the methods disclosed herein, the markers that are all or a portion of Tables 1-6 make up at least 50%, 60%, 70%, 80%, 90%, 95% or 98% of the probes on the microarray.
General methods pertaining to the construction of microarrays comprising the marker sets and/or subsets above are described in the following sections.
5.5.2.1 CONSTRUCTION OF MICROARRAYS
Microarrays are prepared by selecting probes which comprise a polynucleotide sequence, and then immobilizing such probes to a solid support or surface. For example, the probes may comprise DNA sequences, RNA sequences, or copolymer sequences of DNA and RNA. The polynucleotide sequences of the probes may also comprise DNA and/or RNA analogues, or combinations thereof. For example, the polynucleotide sequences of the probes may be full or partial fragments of genomic DNA. The polynucleotide sequences of the probes may also be synthesized nucleotide sequences, such as synthetic oligonucleotide sequences. The probe sequences can be synthesized either enzymatically in vivo, enzymatically in vitro (e.g., by PCR), or non-enzymatically in vitro.

The probe or probes used in the methods of the invention are preferably immobilized to a solid support which may be either porous or non-porous. For example, the probes of the invention may be polynucleotide sequences which are attached to a nitrocellulose or nylon membrane or filter covalently at either the 3' or the 5' end of the polynucleotide. Such hybridization probes are well known in the art (see, e.g., Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)). Alternatively, the solid support or surface may be a glass or plastic surface. In a particularly preferred embodiment, hybridization levels are measured to microarrays of probes consisting of a solid phase on the surface of which are immobilized a population of polynucleotides, such as a population of DNA or DNA mimics, or, alternatively, a population of RNA or RNA mimics. The solid phase may be a nonporous or, optionally, a porous material such as a gel.
In preferred embodiments, a microarray comprises a support or surface with an ordered array of binding (e.g., hybridization) sites or "probes" each representing one of the markers described herein. Preferably the microarrays are addressable arrays, and more preferably positionally addressable arrays. More specifically, each probe of the array is preferably located at a known, predetermined position on the solid support such that the identity (i.e., the sequence) of each probe can be determined from its position in the array (i.e., on the support or surface). In preferred embodiments, each probe is covalently attached to the solid support at a single site.
Microarrays can be made in a number of ways, of which several are described below. However produced, microarrays share certain characteristics. The arrays are reproducible, allowing multiple copies of a given array to be produced and easily compared with each other. Preferably, microarrays are made from materials that are stable under binding (e.g., nucleic acid hybridization) conditions. The microarrays are preferably small, e.g., between 1 cm2 and 25 cm2, between 12 cm2 and 13 cm2, or 3 cm2. However, larger arrays are also contemplated and may be preferable, e.g., for use in screening arrays.
Preferably, a given binding site or unique set of binding sites in the microarray will specifically bind (e.g., hybridize) to the product of a single gene in a cell (e.g., to a specific mRNA, or to a specific cDNA derived therefrom). However, in general, other related or similar sequences will cross hybridize to a given binding site.
The microarrays of the present invention include one or more test probes, each of which has a polynucleotide sequence that is complementary to a subsequence of RNA or DNA to be detected. Preferably, the position of each probe on the solid surface is known.
Indeed, the microarrays are preferably positionally addressable arrays.
Specifically, each probe of the array is preferably located at a known, predetermined position on the solid support such that the identity (i.e., the sequence) of each probe can be determined from its position on the array (i.e., on the support or surface).
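The idea of a positionally addressable array can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probe identifiers, the row-major layout, and the function names `build_layout` and `probe_at` are hypothetical and do not appear in the specification.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (row, column) on the solid support


def build_layout(probe_ids: List[str], n_columns: int) -> Dict[Position, str]:
    """Assign each probe a fixed (row, column) position in row-major order.

    The row-major ordering is an assumption made for illustration only.
    """
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    return {divmod(i, n_columns): pid for i, pid in enumerate(probe_ids)}


def probe_at(layout: Dict[Position, str], row: int, column: int) -> str:
    """Recover the identity of a probe from its position alone."""
    return layout[(row, column)]


# Hypothetical probe names laid out two per row.
layout = build_layout(
    ["marker_A", "marker_B", "marker_C", "marker_D", "marker_E"], n_columns=2
)
```

Because every probe sits at a known, predetermined position, a signal measured at a given row and column can be traced back to a single probe sequence without any further labeling of the spot.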
According to the invention, the microarray is an array (i.e., a matrix) in which each position represents one of the markers described herein. For example, each position can contain a DNA or DNA analogue based on genomic DNA to which a particular RNA or cDNA transcribed from that genetic marker can specifically hybridize. The DNA or DNA analogue can be, e.g., a synthetic oligomer or a gene fragment. In one embodiment, probes representing each of the markers are present on the array. In a preferred embodiment, the array comprises the 550 of the 2,460 ER-status markers, 70 of the BRCA1/sporadic markers, and all 231 of the prognosis markers.
5.5.2.2 PREPARING PROBES FOR MICROARRAYS
As noted above, the "probe" to which a particular polynucleotide molecule specifically hybridizes according to the invention contains a complementary genomic polynucleotide sequence. The probes of the microarray preferably consist of nucleotide sequences of no more than 1,000 nucleotides. In some embodiments, the probes of the array consist of nucleotide sequences of 10 to 1,000 nucleotides. In a preferred embodiment, the nucleotide sequences of the probes are in the range of 10-200 nucleotides in length and are genomic sequences of a species of organism, such that a plurality of different probes is present, with sequences complementary and thus capable of hybridizing to the genome of such a species of organism, sequentially tiled across all or a portion of such genome. In other specific embodiments, the probes are in the range of 10-30 nucleotides in length, in the range of 10-40 nucleotides in length, in the range of 20-50 nucleotides in length, in the range of 40-80 nucleotides in length, in the range of 50-150 nucleotides in length, in the range of 80-120 nucleotides in length, and most preferably are 60 nucleotides in length.
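Sequential tiling of fixed-length probes across a sequence can be illustrated with a short Python sketch. The function name `tile_probes`, the `step` parameter, and the toy input sequence are assumptions introduced for illustration; the 60-nucleotide default reflects the most preferred probe length stated above.

```python
from typing import List, Tuple


def tile_probes(
    sequence: str, probe_length: int = 60, step: int = 60
) -> List[Tuple[int, str]]:
    """Return (start, probe) pairs tiled across `sequence`.

    `step` equal to `probe_length` gives end-to-end tiling; a smaller
    `step` gives overlapping probes. Trailing bases that cannot fill a
    whole probe are skipped in this sketch.
    """
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = len(sequence) - probe_length
    return [
        (start, sequence[start:start + probe_length])
        for start in range(0, last_start + 1, step)
    ]


# Toy 200-nucleotide sequence, for illustration only.
example_probes = tile_probes("ACGT" * 50)
```

With the 200-nucleotide toy sequence and the default settings, the sketch yields three non-overlapping 60-mers starting at positions 0, 60 and 120.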
The probes may comprise DNA or DNA "mimics" (e.g., derivatives and analogues) corresponding to a portion of an organism's genome. In another embodiment, the probes of the microarray are complementary RNA or RNA mimics. DNA mimics are polymers composed of subunits capable of specific, Watson-Crick-like hybridization with DNA, or of specific hybridization with RNA. The nucleic acids can be modified at the base moiety, at the sugar moiety, or at the phosphate backbone. Exemplary DNA mimics include, e.g., phosphorothioates.
DNA can be obtained, e.g., by polymerase chain reaction (PCR) amplification of genomic DNA or cloned sequences. PCR primers are preferably chosen based on a known sequence of the genome that will result in amplification of specific fragments of genomic DNA. Computer programs that are well known in the art are useful in the design of primers with the required specificity and optimal amplification properties, such as Oligo version 5.0 (National Biosciences). Typically each probe on the microarray will be between 10 bases and 50,000 bases, usually between 300 bases and 1,000 bases in length. PCR methods are well known in the art, and are described, for example, in Innis et al., eds., PCR PROTOCOLS: A GUIDE TO METHODS AND APPLICATIONS, Academic Press Inc., San Diego, CA (1990). It will be apparent to one skilled in the art that controlled robotic systems are useful for isolating and amplifying nucleic acids.
An alternative, preferred means for generating the polynucleotide probes of the microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using N-phosphonate or phosphoramidite chemistries (Froehler et al., Nucleic Acid Res. 14:5399-5407 (1986); McBride et al., Tetrahedron Lett. 24:246-248 (1983)). Synthetic sequences are typically between about 10 and about 500 bases in length, more typically between about 20 and about 100 bases, and most preferably between about 40 and about 70 bases in length.
In some embodiments, synthetic nucleic acids include non-natural bases, such as, but by no means limited to, inosine. As noted above, nucleic acid analogues may be used as binding sites for hybridization. An example of a suitable nucleic acid analogue is peptide nucleic acid (see, e.g., Egholm et al., Nature 363:566-568 (1993); U.S. Patent No. 5,539,083).
Probes are preferably selected using an algorithm that takes into account binding energies, base composition, sequence complexity, cross-hybridization binding energies, and secondary structure (see Friend et al., International Patent Publication WO 01/05935, published January 25, 2001; Hughes et al., Nat. Biotech. 19:342-7 (2001)).
A skilled artisan will also appreciate that positive control probes, e.g., probes known to be complementary and hybridizable to sequences in the target polynucleotide molecules, and negative control probes, e.g., probes known to not be complementary and hybridizable to sequences in the target polynucleotide molecules, should be included on the array. In one embodiment, positive controls are synthesized along the perimeter of the array. In another embodiment, positive controls are synthesized in diagonal stripes across the array. In still another embodiment, the reverse complement for each probe is synthesized next to the position of the probe to serve as a negative control. In yet another embodiment, sequences from other species of organism are used as negative controls or as "spike-in" controls.
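The reverse-complement negative control mentioned above is simple to derive from a probe sequence. The following Python sketch assumes plain DNA probes written with the letters A, C, G and T; the function name `reverse_complement` and the example sequence are illustrative only.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(probe: str) -> str:
    """Return the reverse complement of a DNA probe sequence."""
    return probe.translate(_COMPLEMENT)[::-1]


# Hypothetical short probe and its paired negative-control sequence.
probe = "ATGCCGTA"
negative_control = reverse_complement(probe)
```

Applying the function twice returns the original probe, which is a convenient sanity check when generating control sequences for a whole array.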

5.5.2.3 ATTACHING PROBES TO THE SOLID SURFACE
The probes are attached to a solid support or surface, which may be made, e.g., from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or other porous or nonporous material. A preferred method for attaching the nucleic acids to a surface is by printing on glass plates, as is described generally by Schena et al., Science 270:467-470 (1995). This method is especially useful for preparing microarrays of cDNA (see also, DeRisi et al., Nature Genetics 14:457-460 (1996); Shalon et al., Genome Res. 6:639-645 (1996); and Schena et al., Proc. Natl. Acad. Sci. U.S.A. 93:10539-11286 (1995)).
A second preferred method for making microarrays is by making high-density oligonucleotide arrays. Techniques are known for producing arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface using photolithographic techniques for synthesis in situ (see, Fodor et al., 1991, Science 251:767-773; Pease et al., 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al., 1996, Nature Biotechnology 14:1675; U.S. Patent Nos. 5,578,832; 5,556,752; and 5,510,270) or other methods for rapid synthesis and deposition of defined oligonucleotides (Blanchard et al., Biosensors & Bioelectronics 11:687-690). When these methods are used, oligonucleotides (e.g., 60-mers) of known sequence are synthesized directly on a surface such as a derivatized glass slide. Usually, the array produced is redundant, with several oligonucleotide molecules per RNA.
Other methods for making microarrays, e.g., by masking (Maskos and Southern, 1992, Nuc. Acids. Res. 20:1679-1684), may also be used. In principle, and as noted supra, any type of array, for example, dot blots on a nylon hybridization membrane (see Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)) could be used.
However, as will be recognized by those skilled in the art, very small arrays will frequently be preferred because hybridization volumes will be smaller.
In one embodiment, the arrays of the present invention are prepared by synthesizing polynucleotide probes on a support. In such an embodiment, polynucleotide probes are attached to the support covalently at either the 3' or the 5' end of the polynucleotide.
In a particularly preferred embodiment, microarrays of the invention are manufactured by means of an ink jet printing device for oligonucleotide synthesis, e.g., using the methods and systems described by Blanchard in U.S. Pat. No. 6,028,189; Blanchard et al., 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in SYNTHETIC DNA ARRAYS IN GENETIC ENGINEERING, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 111-123. Specifically, the oligonucleotide probes in such microarrays are preferably synthesized in arrays, e.g., on a glass slide, by serially depositing individual nucleotide bases in "microdroplets" of a high surface tension solvent such as propylene carbonate. The microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and are separated from each other on the microarray (e.g., by hydrophobic domains) to form circular surface tension wells which define the locations of the array elements (i.e., the different probes). Microarrays manufactured by this ink jet method are typically of high density, preferably having a density of at least about 2,500 different probes per 1 cm2. The polynucleotide probes are attached to the support covalently at either the 3' or the 5' end of the polynucleotide.
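To give a feel for the stated density, the following Python sketch converts a probe density into a center-to-center spacing. It assumes, purely for illustration, that the probes sit on a regular square grid; the specification does not state the grid geometry, and the function name `grid_pitch_um` is hypothetical.

```python
import math


def grid_pitch_um(probes_per_cm2: float) -> float:
    """Center-to-center spacing in micrometers for a square grid.

    Assumes a regular square grid, which is an illustrative
    simplification rather than a stated property of the arrays.
    """
    if probes_per_cm2 <= 0:
        raise ValueError("density must be positive")
    probes_per_cm = math.sqrt(probes_per_cm2)
    return 10_000.0 / probes_per_cm  # 1 cm = 10,000 micrometers


pitch = grid_pitch_um(2500)
```

Under that square-grid assumption, 2,500 probes per cm2 corresponds to 50 probes along each centimeter, i.e. a spacing of 200 micrometers between neighboring probe centers.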
5.5.2.4 TARGET POLYNUCLEOTIDE MOLECULES
The polynucleotide molecules which may be analyzed by the present invention (the "target polynucleotide molecules") may be from any clinically relevant source, but are expressed RNA or a nucleic acid derived therefrom (e.g., cDNA or amplified RNA derived from cDNA that incorporates an RNA polymerase promoter), including naturally occurring nucleic acid molecules, as well as synthetic nucleic acid molecules. In one embodiment, the target polynucleotide molecules comprise RNA, including, but by no means limited to, total cellular RNA, poly(A)+ messenger RNA (mRNA) or fraction thereof, cytoplasmic mRNA, or RNA transcribed from cDNA (i.e., cRNA; see, e.g., Linsley & Schelter, U.S. Patent Application No. 09/411,074, filed October 4, 1999, or U.S. Patent Nos. 5,545,522, 5,891,636, or 5,716,785). Methods for preparing total and poly(A)+ RNA are well known in the art, and are described generally, e.g., in Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989). In one embodiment, RNA is extracted from cells of the various types of interest in this invention using guanidinium thiocyanate lysis followed by CsCl centrifugation (Chirgwin et al., 1979, Biochemistry 18:5294-5299). In another embodiment, total RNA is extracted using a silica gel-based column, commercially available examples of which include RNeasy (Qiagen, Valencia, California) and StrataPrep (Stratagene, La Jolla, California). In an alternative embodiment, which is preferred for S. cerevisiae, RNA is extracted from cells using phenol and chloroform, as described in Ausubel et al., eds., 1989, CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, Vol III, Green Publishing Associates, Inc., John Wiley & Sons, Inc., New York, at pp. 13.12.1-13.12.5. Poly(A)+ RNA can be selected, e.g., by selection with oligo-dT cellulose or, alternatively, by oligo-dT primed reverse transcription of total cellular RNA. In one embodiment, RNA can be fragmented by methods known in the art, e.g., by incubation with ZnCl2, to generate fragments of RNA. In another embodiment, the polynucleotide molecules analyzed by the invention comprise cDNA, or PCR products of amplified RNA or cDNA.
In one embodiment, total RNA, mRNA, or nucleic acids derived therefrom, is isolated from a sample taken from a person afflicted with breast cancer.
Target polynucleotide molecules that are poorly expressed in particular cells may be enriched using normalization techniques (Bonaldo et al., 1996, Genome Res. 6:791-806).
As described above, the target polynucleotides are detectably labeled at one or more nucleotides. Any method known in the art may be used to detectably label the target polynucleotides. Preferably, this labeling incorporates the label uniformly along the length of the RNA, and more preferably, the labeling is carried out at a high degree of efficiency.
One embodiment for this labeling uses oligo-dT primed reverse transcription to incorporate the label; however, conventional implementations of this method are biased toward generating 3' end fragments. Thus, in a preferred embodiment, random primers (e.g., 9-mers) are used in reverse transcription to uniformly incorporate labeled nucleotides over the full length of the target polynucleotides. Alternatively, random primers may be used in conjunction with PCR methods or T7 promoter-based in vitro transcription methods in order to amplify the target polynucleotides.
In a preferred embodiment, the detectable label is a luminescent label. For example, fluorescent labels, bio-luminescent labels, chemi-luminescent labels, and colorimetric labels may be used in the present invention. In a highly preferred embodiment, the label is a fluorescent label, such as a fluorescein, a phosphor, a rhodamine, or a polymethine dye derivative. Examples of commercially available fluorescent labels include, for example, fluorescent phosphoramidites such as FluorePrime (Amersham Pharmacia, Piscataway, N.J.), Fluoredite (Millipore, Bedford, Mass.), FAM (ABI, Foster City, Calif.), and Cy3 or Cy5 (Amersham Pharmacia, Piscataway, N.J.). In another embodiment, the detectable label is a radiolabeled nucleotide.
In a further preferred embodiment, target polynucleotide molecules from a patient sample are labeled differentially from target polynucleotide molecules of a standard. The standard can comprise target polynucleotide molecules from normal individuals (i.e., those not afflicted with breast cancer). In a highly preferred embodiment, the standard comprises target polynucleotide molecules pooled from samples from normal individuals or tumor samples from individuals having sporadic-type breast tumors. In another embodiment, the target polynucleotide molecules are derived from the same individual, but are taken at different time points, and thus indicate the efficacy of a treatment by a change in expression of the markers, or lack thereof, during and after the course of treatment (i.e., chemotherapy, radiation therapy or cryotherapy), wherein a change in the expression of the markers from a poor prognosis pattern to a good prognosis pattern indicates that the treatment is efficacious.
In this embodiment, different timepoints are differentially labeled.
5.5.2.5 HYBRIDIZATION TO MICROARRAYS
Nucleic acid hybridization and wash conditions are chosen so that the target polynucleotide molecules specifically bind or specifically hybridize to the complementary polynucleotide sequences of the array, preferably to a specific array site, wherein its complementary DNA is located.
Arrays containing double-stranded probe DNA situated thereon are preferably subjected to denaturing conditions to render the DNA single-stranded prior to contacting with the target polynucleotide molecules. Arrays containing single-stranded probe DNA (e.g., synthetic oligodeoxyribonucleic acids) may need to be denatured prior to contacting with the target polynucleotide molecules, e.g., to remove hairpins or dimers which form due to self-complementary sequences.
Optimal hybridization conditions will depend on the length (e.g., oligomer versus polynucleotide greater than 200 bases) and type (e.g., RNA, or DNA) of probe and target nucleic acids. One of skill in the art will appreciate that as the oligonucleotides become shorter, it may become necessary to adjust their length to achieve a relatively uniform melting temperature for satisfactory hybridization results. General parameters for specific (i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989), and in Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994). Typical hybridization conditions for the cDNA microarrays of Schena et al. are hybridization in 5 X SSC plus 0.2% SDS at 65 °C for four hours, followed by washes at 25 °C in low stringency wash buffer (1 X SSC plus 0.2% SDS), followed by 10 minutes at 25 °C in higher stringency wash buffer (0.1 X SSC plus 0.2% SDS) (Schena et al., Proc. Natl. Acad. Sci. U.S.A. 93:10614 (1993)). Useful hybridization conditions are also provided in, e.g., Tijssen, 1993, HYBRIDIZATION WITH NUCLEIC ACID PROBES, Elsevier Science Publishers B.V.; and Kricka, 1992, NONISOTOPIC DNA PROBE TECHNIQUES, Academic Press, San Diego, CA.
Particularly preferred hybridization conditions include hybridization at a temperature at or near the mean melting temperature of the probes (e.g., within 5 °C, more preferably within 2 °C) in 1 M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium sarcosine and 30% formamide.
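The temperature window implied by "at or near the mean melting temperature" can be sketched as follows. The sketch assumes that a melting temperature has already been determined for each probe by some other means; it does not compute melting temperatures, and the function names and example values are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def hybridization_window(
    probe_tms_c: Sequence[float], tolerance_c: float = 5.0
) -> Tuple[float, float]:
    """Return the (low, high) temperature range, in degrees Celsius,
    lying within `tolerance_c` of the mean probe melting temperature."""
    center = mean(probe_tms_c)
    return (center - tolerance_c, center + tolerance_c)


def is_near_mean_tm(
    temperature_c: float, probe_tms_c: Sequence[float], tolerance_c: float = 5.0
) -> bool:
    """Check whether a proposed hybridization temperature falls in the window."""
    low, high = hybridization_window(probe_tms_c, tolerance_c)
    return low <= temperature_c <= high


# Hypothetical melting temperatures for three probes.
example_tms = [60.0, 62.0, 64.0]
```

Passing `tolerance_c=2.0` gives the narrower, more preferred window.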
5.5.2.6 SIGNAL DETECTION AND DATA ANALYSIS
When fluorescently labeled probes are used, the fluorescence emissions at each site of a microarray may be, preferably, detected by scanning confocal laser microscopy. In one embodiment, a separate scan, using the appropriate excitation line, is carried out for each of the two fluorophores used. Alternatively, a laser may be used that allows simultaneous specimen illumination at wavelengths specific to the two fluorophores and emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al., 1996, "A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization," Genome Research 6:639-645, which is incorporated by reference in its entirety for all purposes). In a preferred embodiment, the arrays are scanned with a laser fluorescent scanner with a computer controlled X-Y stage and a microscope objective. Sequential excitation of the two fluorophores is achieved with a multi-line, mixed gas laser and the emitted light is split by wavelength and detected with two photomultiplier tubes. Fluorescence laser scanning devices are described in Schena et al., Genome Res. 6:639-645 (1996), and in other references cited herein. Alternatively, the fiber-optic bundle described by Ferguson et al., Nature Biotech. 14:1681-1684 (1996), may be used to monitor mRNA abundance levels at a large number of sites simultaneously.
Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., using a 12 or 16 bit analog to digital board. In one embodiment the scanned image is despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image gridding program that creates a spreadsheet of the average hybridization at each wavelength at each site. If necessary, an experimentally determined correction for "cross talk" (or overlap) between the channels for the two fluors may be made. For any particular hybridization site on the transcript array, a ratio of the emission of the two fluorophores can be calculated. The ratio is independent of the absolute expression level of the cognate gene, but is useful for genes whose expression is significantly modulated in association with the different breast cancer-related condition.
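The two-channel ratio and the optional cross-talk correction can be sketched in Python as follows. The linear cross-talk model, the use of a base-10 logarithm, and all names (`log_ratio`, `crosstalk_into_sample`, `crosstalk_into_reference`) are assumptions made for illustration; the specification only states that an experimentally determined correction may be made and that a ratio of the two emissions can be calculated.

```python
import math


def log_ratio(
    sample_intensity: float,
    reference_intensity: float,
    crosstalk_into_sample: float = 0.0,
    crosstalk_into_reference: float = 0.0,
) -> float:
    """Return log10(sample / reference) for one hybridization site.

    Cross-talk is modeled (as an assumption) as a fixed fraction of the
    other channel's signal leaking into each channel; the fractions
    would have to be determined experimentally.
    """
    corrected_sample = sample_intensity - crosstalk_into_sample * reference_intensity
    corrected_reference = (
        reference_intensity - crosstalk_into_reference * sample_intensity
    )
    if corrected_sample <= 0 or corrected_reference <= 0:
        raise ValueError("corrected intensities must be positive")
    return math.log10(corrected_sample / corrected_reference)


# Hypothetical intensities: a tenfold higher signal in the sample channel.
example = log_ratio(1000.0, 100.0)
```

A log ratio of zero means equal signal in both channels; positive and negative values indicate higher signal in the sample or the reference channel, respectively, regardless of how bright the spot is overall.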
5.6 COMPUTER-FACILITATED ANALYSIS
The present invention further provides for kits comprising the marker sets above. In a preferred embodiment, the kit contains a microarray ready for hybridization to target polynucleotide molecules, plus software for the data analyses described above.

The analytic methods described in the previous sections can be implemented by use of the following computer systems and according to the following programs and methods. A computer system comprises internal components linked to external components. The internal components of a typical computer system include a processor element interconnected with a main memory. For example, the computer system can be an Intel 8086-, 80386-, 80486-, Pentium™, or Pentium™-based processor with preferably 32 MB or more of main memory.
The external components may include mass storage. This mass storage can be one or more hard disks (which are typically packaged together with the processor and memory). Such hard disks are preferably of 1 GB or greater storage capacity.
Other external components include a user interface device, which can be a monitor, together with an inputting device, which can be a "mouse", or other graphic input devices, and/or a keyboard. A printing device can also be attached to the computer.
Typically, a computer system is also linked to a network link, which can be part of an Ethernet link to other local computer systems, remote computer systems, or wide area communication networks, such as the Internet. This network link allows the computer system to share data and processing tasks with other computer systems.
Loaded into memory during operation of this system are several software components, which are both standard in the art and special to the instant invention. These software components collectively cause the computer system to function according to the methods of this invention. These software components are typically stored on the mass storage device. A software component comprises the operating system, which is responsible for managing computer system and its network interconnections.
This operating system can be, for example, of the Microsoft Windows® family, such as Windows 3.1, Windows 95, Windows 98, Windows 2000, or Windows NT. The software component represents common languages and functions conveniently present on this system to assist programs implementing the methods specific to this invention. Many high or low level computer languages can be used to program the analytic methods of this invention.
Instructions can be interpreted during run-time or compiled. Preferred languages include C/C++, FORTRAN and JAVA. Most preferably, the methods of this invention are programmed in mathematical software packages that allow symbolic entry of equations and high-level specification of processing, including some or all of the algorithms to be used, thereby freeing a user of the need to procedurally program individual equations or algorithms. Such packages include Matlab from Mathworks (Natick, MA), Mathematica® from Wolfram Research (Champaign, IL), or S-Plus® from MathSoft (Cambridge, MA).

Specifically, the software component includes the analytic methods of the invention as programmed in a procedural language or symbolic package.
The software to be included with the kit comprises the data analysis methods of the invention as disclosed herein. In particular, the software may include mathematical routines for marker discovery, including the calculation of correlation coefficients between clinical categories (i.e., ER status) and marker expression. The software may also include mathematical routines for calculating the correlation between sample marker expression and control marker expression, using array-generated fluorescence data, to determine the clinical classification of a sample.
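As a rough illustration of the second kind of routine, the following Python sketch correlates a sample's marker expression values with a control (template) profile and assigns a class by thresholding the correlation coefficient. The use of the Pearson coefficient, the threshold value, the class labels, and all function names are assumptions for illustration; they are not taken from this section.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("profiles must not be constant")
    return cov / (norm_x * norm_y)


def classify(
    sample: Sequence[float],
    template: Sequence[float],
    threshold: float = 0.0,
    similar_label: str = "template-like",
    dissimilar_label: str = "not template-like",
) -> str:
    """Label a sample by whether its correlation with the template
    exceeds a threshold (threshold and labels are hypothetical)."""
    if pearson(sample, template) > threshold:
        return similar_label
    return dissimilar_label


# Hypothetical log-ratio values for four markers.
template_profile = [0.8, -0.5, 0.3, -0.9]
sample_profile = [0.6, -0.4, 0.1, -0.7]
```

In practice the template would be built from control samples of known clinical category, and the threshold would be chosen on training data rather than fixed in advance.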
In an exemplary implementation, to practice the methods of the present invention, a user first loads experimental data into the computer system.
These data can be directly entered by the user from a monitor, keyboard, or from other computer systems linked by a network connection, or on removable storage media such as a CD-ROM, floppy disk (not illustrated), tape drive (not illustrated), ZIP® drive (not illustrated) or through the network. Next the user causes execution of expression profile analysis software which performs the methods of the present invention.
In another exemplary implementation, a user first loads experimental data and/or databases into the computer system. This data is loaded into the memory from the storage media or from a remote computer, preferably from a dynamic geneset database system, through the network. Next the user causes execution of software that performs the steps of the present invention.
Alternative computer systems and software for implementing the analytic methods of this invention will be apparent to one of skill in the art and are intended to be comprehended within the accompanying claims. In particular, the accompanying claims are intended to include the alternative program structures for implementing the methods of this invention that will be readily apparent to one of skill in the art.
6. EXAMPLES
Materials And Methods
117 tumor samples from breast cancer patients were collected. RNA samples were then prepared, and each RNA sample was profiled using inkjet-printed microarrays. Marker genes were then identified based on expression patterns; these genes were then used to train classifiers, which used these marker genes to classify tumors into diagnostic and prognostic categories. Finally, these marker genes were used to predict the diagnostic and prognostic outcome for a group of individuals.

1. Sample collection
117 breast cancer patients treated at The Netherlands Cancer Institute / Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands, were selected on the basis of the following clinical criteria (data extracted from the medical records of the NKI/AvL Tumor Register, Biometrics Department).
Group 1 (n=97, 78 for training, 19 for independent tests) was selected on the basis of (1) primary invasive breast carcinoma <5 cm (T1 or T2); (2) no axillary metastases (N0); (3) age at diagnosis <55 years; (4) calendar year of diagnosis 1983-1996; and (5) no prior malignancies (excluding carcinoma in situ of the cervix or basal cell carcinoma of the skin). All patients were treated by modified radical mastectomy (n=34) or breast conserving treatment (n=64), including axillary lymph node dissection.
Breast conserving treatment consisted of excision of the tumor, followed by radiation of the whole breast to a dose of 50 Gy, followed by a boost varying from 15 to 25 Gy. Five patients received adjuvant systemic therapy consisting of chemotherapy (n=3) or hormonal therapy (n=2); all other patients did not receive additional treatment. All patients were followed at least annually for a period of at least 5 years. Patient follow-up information was extracted from the Tumor Registry of the Biometrics Department.
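The five Group 1 selection criteria can be expressed as a simple filter. The Python sketch below is illustrative only: the `Patient` record and its field names are hypothetical, and the `prior_malignancy` flag is assumed to already exclude the two exempted conditions (carcinoma in situ of the cervix and basal cell carcinoma of the skin).

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical, chosen for this illustration.
    tumor_size_cm: float          # size of the primary invasive carcinoma
    positive_axillary_nodes: int  # 0 corresponds to N0
    age_at_diagnosis: int
    year_of_diagnosis: int
    prior_malignancy: bool        # assumed to ignore the exempted conditions


def meets_group1_criteria(p: Patient) -> bool:
    """Apply the five Group 1 selection criteria to one patient record."""
    return (
        p.tumor_size_cm < 5.0
        and p.positive_axillary_nodes == 0
        and p.age_at_diagnosis < 55
        and 1983 <= p.year_of_diagnosis <= 1996
        and not p.prior_malignancy
    )


# Hypothetical records, not data from the study.
eligible = Patient(2.1, 0, 47, 1990, False)
too_old = Patient(2.1, 0, 58, 1990, False)
```

Group 2, by contrast, was selected on germline mutation status and invasive carcinoma only, so none of these five checks would apply to it.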
Group 2 (n=20) was selected as: (1) carriers of a germline mutation in BRCA1 or BRCA2; and (2) having primary invasive breast carcinoma. No selection or exclusion was made based on tumor size, lymph node status, age at diagnosis, calendar year of diagnosis, or other malignancies. Germline mutation status was known prior to this research protocol.
Information about the individuals from which tumor samples were collected includes: year of birth; sex; whether the individual is pre- or post-menopausal; the year of diagnosis; the number of positive lymph nodes and the total number of nodes; whether there was surgery, and if so, whether the surgery was breast-conserving or radical; whether there was radiotherapy, chemotherapy or hormonal therapy. The tumor was graded according to the formula P=TNM, where T is the tumor size (on a scale of 0-5); N is the number of nodes that are positive (on a scale of 0-4); and M is metastases (0 = absent, 1 = present).
The tumor was also classified according to stage, tumor type (in situ or invasive; lobular or ductal; grade) and the presence or absence of the estrogen and progesterone receptors. The progression of the cancer was described by (where applicable): distant metastases; year of distant metastases, year of death, year of last follow-up; and BRCA1 genotype.

2. Tumors:
Germline mutation testing of BRCA1 and BRCA2 on DNA isolated from peripheral blood lymphocytes includes mutation screening by a Protein Truncation Test (PTT) of exon 11 of BRCA1 and exon 10 and 11 of BRCA2, deletion PCR of BRCA1 genomic deletion of exon 13 and 22, as well as Denaturing Gradient Gel Electrophoresis (DGGE) of the remaining exons. Aberrant bands were all confirmed by genomic sequencing analyzed on an ABI3700 automatic sequencer and confirmed on an independent DNA sample.
From all patients, tumor material was snap frozen in liquid nitrogen within one hour after surgery.
Of the frozen tumor material, an H&E (hematoxylin-eosin) stained section was prepared prior to and after cutting slides for RNA isolation. These H&E frozen sections were assessed for the percentage of tumor cells; only samples with >50% tumor cells were selected for further study.
For all tumors, surgical specimens fixed in formaldehyde and embedded in paraffin were evaluated according to standard histopathological procedures.
H&E stained paraffin sections were examined to assess tumor type (e.g., ductal or lobular according to the WHO classification); to assess histologic grade according to the method described by Elston and Ellis (grade 1-3); and to assess the presence of lymphangio-invasive growth and the presence of an extensive lymphocytic infiltrate. All histologic factors were independently assessed by two pathologists (MV and JL); consensus on differences was reached by examining the slides together. A representative slide of each tumor was used for immunohistochemical staining with antibodies directed against the estrogen and progesterone receptors by standard procedures. The staining result was scored as the percentage of positively staining nuclei (0%, 10%, 20%, etc., up to 100%).
3. Amplification, labeling and hybridization
The production of marker-derived nucleic acids and the hybridization of the nucleic acids to a microarray are outlined in FIG. 2. Thirty frozen sections of 30 µm thickness were used for total RNA isolation from each snap-frozen tumor specimen.
Total RNA was isolated with RNAzol™ B (Campro Scientific, Veenendaal, The Netherlands) according to the manufacturer's protocol, including homogenization of the tissue using a Polytron PT-MR2100 (Merck, Amsterdam, The Netherlands), and finally dissolved in RNase-free H2O. The quality of the total RNA was assessed by the A260/A280 ratio, which had to be between 1.7 and 2.1, as well as by visual inspection of the RNA on an agarose gel, which should indicate a stronger 28S ribosomal RNA band compared to the 18S ribosomal RNA band. Subsequently, 25 µg of total RNA was DNase treated using the Qiagen RNase-free DNase kit and RNeasy spin columns (Qiagen Inc, GmbH, Germany) according to the manufacturer's protocol. DNase-treated total RNA was dissolved in RNase-free H2O to a final concentration of 0.2 µg/µl.
5 µg of total RNA was used as input for cRNA synthesis. An oligo-dT primer containing a T7 RNA polymerase promoter sequence was used to prime first strand cDNA synthesis, and random primers (pdN6) were used to prime second strand cDNA synthesis by MMLV reverse transcriptase. This reaction yielded a double-stranded cDNA that contained the T7 RNA polymerase (T7RNAP) promoter. The double-stranded cDNA was then transcribed into cRNA by T7RNAP.
cRNA was labeled with Cy3 or Cy5 dyes using a two-step process. First, allylamine-derivatized nucleotides were enzymatically incorporated into cRNA products. For cRNA labeling, a 3:1 mixture of 5-(3-Aminoallyl)uridine 5'-triphosphate (Sigma) and UTP was substituted for UTP in the in vitro transcription (IVT) reaction.
Allylamine-derivatized cRNA products were then reacted with N-hydroxy succinimide esters of Cy3 or Cy5 (CyDye, Amersham Pharmacia Biotech). 5 µg of Cy5-labeled cRNA from one breast cancer patient was mixed with the same amount of Cy3-labeled product from a pool of equal amounts of cRNA from each individual sporadic patient.
Microarray hybridizations were done in duplicate with fluor reversals.
Before hybridization, labeled cRNAs were fragmented to an average size of ~50-100 nt by heating at 60 °C in the presence of 10 mM ZnCl2. Fragmented cRNAs were added to hybridization buffer containing 1 M NaCl, 0.5% sodium sarcosine and 50 mM MES, pH 6.5, the stringency of which was regulated by the addition of formamide to a final concentration of 30%. Hybridizations were carried out in a final volume of 3 ml at 40 °C on a rotating platform in a hybridization oven (Robbins Scientific) for 48 h. After hybridization, slides were washed and scanned using a confocal laser scanner (Agilent Technologies).
Fluorescence intensities on scanned images were quantified, normalized and corrected.
4. Pooling of samples
The reference cRNA pool was formed by pooling equal amounts of cRNA from each individual sporadic patient, for a total of 78 tumors.
5. 25k human microarray
Surface-bound oligonucleotides were synthesized essentially as proposed by Blanchard et al., Biosens. Bioelectron. 6(7):687-690 (1996); see also Hughes et al., Nature Biotech. 19(4):342-347 (2000). Hydrophobic glass surfaces (3 inches by 3 inches) containing exposed hydroxyl groups were used as substrates for nucleotide synthesis.
Phosphoramidite monomers were delivered to computer-defined positions on the glass surfaces using ink jet printer heads. Unreacted monomers were then washed away and the ends of the extended oligonucleotides were deprotected. This cycle of monomer coupling, washing and deprotection was repeated for each desired layer of nucleotide synthesis.
Oligonucleotide sequences to be printed were specified by computer files.
Microarrays containing approximately 25,000 human gene sequences (Hu25K microarrays) were used for this study. Sequences for microarrays were selected from RefSeq (a collection of non-redundant mRNA sequences, located on the Internet at nhn.nih.gov/LocusLink/refseq.html) and Phil Green EST contigs, which is a collection of EST contigs assembled by Dr. Phil Green et al. at the University of Washington (Ewing and Green, Nat. Genet. 25(2):232-4 (2000)), available on the Internet at phrap.org/est assembly/index.html. Each mRNA or EST contig was represented on the Hu25K microarray by a single 60mer oligonucleotide essentially as described in Hughes et al., Nature Biotech. 19(4):342-347 and in International Publication WO 01/06013, published January 25, 2001, and in International Publication WO 01/05935, published January 25, 2001, except that the rules for oligo screening were modified to remove oligonucleotides with more than 30% C or with 6 or more contiguous C residues.
Example 1: Differentially regulated gene sets and overall expression patterns of breast cancer tumors
Of the approximately 25,000 sequences represented on the microarray, a group of approximately 5,000 genes that were significantly regulated across the group of samples was selected. A gene was determined to be significantly differentially regulated with cancer of the breast if it showed more than a two-fold change in transcript level, either upwards or downwards, as compared to a sporadic tumor pool, and if the p-value for differential regulation (Hughes et al., Cell 102:109-126 (2000)) was less than 0.01, in at least five out of 98 tumor samples.
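By way of illustration only, the selection criterion described above may be sketched as follows. The function name, the argument layout and the use of base-10 log ratios are assumptions made for this sketch, as is the reading that the fold-change and p-value conditions must both hold in the same sample.

```python
import math

def is_significantly_regulated(log10_ratios, p_values,
                               min_fold=2.0, max_p=0.01, min_samples=5):
    """Return True if a gene changes by more than `min_fold` (up or down)
    with p < `max_p` in at least `min_samples` tumor samples.

    log10_ratios and p_values hold one entry per tumor sample, each measured
    against the reference pool (hypothetical input layout).
    """
    fold_threshold = math.log10(min_fold)
    hits = sum(
        1
        for ratio, p in zip(log10_ratios, p_values)
        if abs(ratio) > fold_threshold and p < max_p
    )
    return hits >= min_samples
```

Applying such a filter to each of the approximately 25,000 sequences would yield the reduced gene set used for clustering.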
An unsupervised clustering algorithm allowed us to cluster patients based on their similarities measured over this set of 5,000 significant genes. The similarity measure between two patients x and y is defined as

$$S = 1 - \left[ \sum_{i=1}^{N_v} \frac{(x_i - \bar{x})}{\sigma_{x_i}} \frac{(y_i - \bar{y})}{\sigma_{y_i}} \Bigg/ \sqrt{ \sum_{i=1}^{N_v} \left( \frac{x_i - \bar{x}}{\sigma_{x_i}} \right)^2 \sum_{i=1}^{N_v} \left( \frac{y_i - \bar{y}}{\sigma_{y_i}} \right)^2 } \right] \qquad \text{Equation (5)}$$

In Equation (5), x and y are two patients with components of log ratio $x_i$ and $y_i$, $i = 1, \ldots, N_v$, with $N_v = 5{,}100$. Associated with every value $x_i$ is error $\sigma_{x_i}$. The smaller the value $\sigma_{x_i}$, the more reliable the measurement $x_i$.

$$\bar{x} = \sum_{i=1}^{N_v} \frac{x_i}{\sigma_{x_i}^2} \Bigg/ \sum_{i=1}^{N_v} \frac{1}{\sigma_{x_i}^2}$$

is the error-weighted arithmetic mean.
The use of correlation as similarity metric emphasizes the importance of co-regulation in clustering rather than the amplitude of regulations.
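The error-weighted measure of Equation (5) may be sketched as follows, assuming it is read as one minus an error-weighted correlation in which each deviation from the error-weighted mean is divided by its measurement error. The function and argument names are hypothetical. Under this reading, a value of 0 indicates perfectly co-regulated profiles and a value of 2 indicates perfectly anti-correlated profiles.

```python
import math

def weighted_mean(values, errors):
    """Error-weighted arithmetic mean: sum(x/sigma^2) / sum(1/sigma^2)."""
    numerator = sum(v / (e * e) for v, e in zip(values, errors))
    denominator = sum(1.0 / (e * e) for e in errors)
    return numerator / denominator

def similarity(x, sigma_x, y, sigma_y):
    """S = 1 - error-weighted correlation between two log-ratio profiles."""
    x_bar = weighted_mean(x, sigma_x)
    y_bar = weighted_mean(y, sigma_y)
    zx = [(xi - x_bar) / s for xi, s in zip(x, sigma_x)]
    zy = [(yi - y_bar) / s for yi, s in zip(y, sigma_y)]
    numerator = sum(a * b for a, b in zip(zx, zy))
    denominator = math.sqrt(sum(a * a for a in zx) * sum(b * b for b in zy))
    return 1.0 - numerator / denominator
```

The same function could be applied to genes rather than patients by passing, for each gene, its log ratios across the tumor samples.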
The set of approximately 5,000 genes can be clustered based on their similarities measured over the group of 98 tumor samples. The similarity measure between two genes was defined in the same way as in Equation (5) except that now for each gene, there are 98 components of log ratio measurements.
The result of such a two-dimensional clustering is displayed in FIG. 3. Two distinctive patterns emerge from the clustering. The first pattern consists of a group of patients in the lower part of the plot whose regulations are very different from the sporadic pool. The other pattern is made of a group of patients in the upper part of the plot whose expressions are only moderately regulated in comparison with the sporadic pool. These dominant patterns suggest that the tumors can be unambiguously divided into two distinct types based on this set of 5,000 significant genes.
To help understand these patterns, they were associated with estrogen receptor (ER), progesterone receptor (PR), tumor grade, presence of lymphocytic infiltrate, and angioinvasion (FIG. 3). The lower group in FIG. 3, which features the dominant pattern, consists of 36 patients. Of the 39 ER-negative patients, 34 patients are clustered together in this group. From FIG. 4, it was observed that the expression of the estrogen receptor alpha gene ESR1 and a large group of co-regulated genes are consistent with this expression pattern.
From FIG. 3 and FIG. 4, it was concluded that gene expression patterns can be used to classify tumor samples into subgroups of diagnostic interest. Thus, genes co-regulated across 98 tumor samples contain information about the molecular basis of breast cancers. The combination of clinical data and microarray-measured gene abundance of ESR1 demonstrates that the distinct types are related to, or at least are reported by, the ER status.
Example 2: Identification of Genetic Markers Distinguishing Estrogen Receptor (+) From Estrogen Receptor (-) Patients
The results described in this Example allow the identification of expression marker genes that differentiate two major types of tumor cells: the "ER-negative" group and the "ER-positive" group. The differentiation of samples by ER status was accomplished in four steps: (1) identification of a set of candidate marker genes that correlate with ER level; (2) rank-ordering these candidate genes by strength of correlation; (3) optimization of the number of marker genes; and (4) classifying samples based on these marker genes.
1. Selection of candidate discriminating genes
In the first step, a set of candidate discriminating genes was identified based on gene expression data of training samples. Specifically, we calculated the correlation coefficient $\rho$ between the category numbers or ER level $\vec{c}$ and the logarithmic expression ratio $\vec{r}$ across all the samples for each individual gene:

$$\rho = \frac{\vec{c} \cdot \vec{r}}{\lVert \vec{c} \rVert \cdot \lVert \vec{r} \rVert} \qquad \text{Equation (2)}$$

The histogram of resultant correlation coefficients is shown in FIG. 5A as a gray line.
While the amplitude of correlation or anti-correlation is small for the majority of genes, the amplitude for some genes is as great as 0.5. Genes whose expression ratios either correlate or anti-correlate well with the diagnostic category of interest are used as reporter genes for the category.
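The correlation-based selection of reporter genes may be sketched as follows. This sketch assumes that the correlation coefficient takes the normalized dot-product form, with category numbers such as +1 and -1; the optional `center` flag, which subtracts the means first and thereby yields a Pearson correlation, is an addition of this sketch and not part of the described method. All names are hypothetical.

```python
import math

def correlation(categories, log_ratios, center=False):
    """Normalized dot product between category numbers and log ratios."""
    c, r = list(categories), list(log_ratios)
    if center:
        mean_c = sum(c) / len(c)
        mean_r = sum(r) / len(r)
        c = [v - mean_c for v in c]
        r = [v - mean_r for v in r]
    dot = sum(a * b for a, b in zip(c, r))
    norm = math.sqrt(sum(a * a for a in c)) * math.sqrt(sum(b * b for b in r))
    return dot / norm if norm else 0.0

def select_reporters(categories, expression_by_gene, threshold=0.3):
    """Split genes into correlated and anti-correlated reporter lists.

    expression_by_gene maps a gene name to its per-sample log ratios.
    """
    correlated, anti_correlated = [], []
    for gene, ratios in expression_by_gene.items():
        rho = correlation(categories, ratios)
        if rho > threshold:
            correlated.append(gene)
        elif rho < -threshold:
            anti_correlated.append(gene)
    return correlated, anti_correlated
```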
Genes having a correlation coefficient larger than 0.3 ("correlated genes") or less than -0.3 ("anti-correlated genes") were selected as reporter genes. The threshold of 0.3 was selected based on the correlation distribution for cases where there is no real correlation (one can use permutations to determine this distribution).
Statistically, this distribution width depends upon the number of samples used in the correlation calculation.
The distribution width for control cases (no real correlation) is approximately $1/\sqrt{n-3}$, where n = the number of samples. In our case, n = 98. Therefore, a threshold of 0.3 roughly corresponds to 3σ in the distribution ($3 \times 1/\sqrt{n-3}$).
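The arithmetic behind the choice of threshold can be checked with a short calculation, taking the 98 tumor samples of this Example as the sample count. The function name is hypothetical.

```python
import math

def null_correlation_sigma(n_samples):
    """Approximate width of the correlation distribution when no real
    correlation exists: 1 / sqrt(n - 3)."""
    return 1.0 / math.sqrt(n_samples - 3)

sigma = null_correlation_sigma(98)   # roughly 0.103
three_sigma = 3 * sigma              # roughly 0.308, close to the 0.3 threshold
```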
2,460 such genes were found to satisfy this criterion. In order to evaluate the significance of the correlation coefficient of each gene with the ER level, a bootstrap technique was used to generate Monte-Carlo data that randomize the association between gene expression data of the samples and their categories. The distribution of correlation coefficients obtained from one Monte-Carlo trial is shown as a dashed line in FIG. 5A. To estimate the significance of the 2,460 marker genes as a group, 10,000 Monte-Carlo runs were generated. The collection of 10,000 such Monte-Carlo trials forms the null hypothesis. The number of genes that satisfy the same criterion for Monte-Carlo data varies from run to run. The frequency distribution from 10,000 Monte-Carlo runs of the number of genes having correlation coefficients of >0.3 or <-0.3 is displayed in FIG. 5B. Both the mean and maximum value are much smaller than 2,460. Therefore, the significance of this gene group as the discriminating gene set between ER(+) and ER(-) samples is estimated to be greater than 99.99%.
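One way to randomize the association between expression data and categories is to shuffle the category labels among the samples, as in the sketch below. The shuffling approach, the function names, the fixed random seed and the normalized dot-product form of the correlation are assumptions of this sketch; the precise randomization procedure is not specified above.

```python
import math
import random

def correlation(c, r):
    """Normalized dot product between category numbers and log ratios."""
    dot = sum(a * b for a, b in zip(c, r))
    norm = math.sqrt(sum(a * a for a in c)) * math.sqrt(sum(b * b for b in r))
    return dot / norm if norm else 0.0

def count_reporters(categories, expression, threshold=0.3):
    """Number of genes whose |correlation| with the categories exceeds threshold.

    expression is a list of genes, each a list of per-sample log ratios.
    """
    return sum(
        1 for ratios in expression
        if abs(correlation(categories, ratios)) > threshold
    )

def monte_carlo_counts(categories, expression, n_runs=10000,
                       threshold=0.3, seed=0):
    """Reporter-gene counts obtained after shuffling the category labels."""
    rng = random.Random(seed)
    shuffled = list(categories)
    counts = []
    for _ in range(n_runs):
        rng.shuffle(shuffled)
        counts.append(count_reporters(shuffled, expression, threshold))
    return counts

def empirical_p_value(observed_count, null_counts):
    """Fraction of Monte-Carlo runs with at least as many reporter genes."""
    exceed = sum(1 for c in null_counts if c >= observed_count)
    return exceed / len(null_counts)
```

If none of 10,000 runs reaches the observed count, the empirical p-value is below 1/10,000, which corresponds to the significance of greater than 99.99% stated above.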
2. Rank-ordering of candidate discriminating genes
In the second step, genes on the candidate list were rank-ordered based on the significance of each gene as a discriminating gene. The markers were rank-ordered either by amplitude of correlation, or by using a metric similar to a Fisher statistic:

$$t = \frac{\langle x_1 \rangle - \langle x_2 \rangle}{\sqrt{\left[ \sigma_1^2 (n_1 - 1) + \sigma_2^2 (n_2 - 1) \right] / (n_1 + n_2 - 1) / (1/n_1 + 1/n_2)}} \qquad \text{Equation (3)}$$

In Equation (3), $\langle x_1 \rangle$ is the error-weighted average of log ratio within the ER(-) group, and $\langle x_2 \rangle$ is the error-weighted average of log ratio within the ER(+) group. $\sigma_1^2$ is the variance of log ratio within the ER(-) group and $n_1$ is the number of samples that had valid measurements of log ratios. $\sigma_2^2$ is the variance of log ratio within the ER(+) group and $n_2$ is the number of samples that had valid measurements of log ratios. The t-value in Equation (3) represents the variance-compensated difference between two means. The confidence level of each gene in the candidate list was estimated with respect to a null hypothesis derived from the actual data set using a bootstrap technique; that is, many artificial data sets were generated by randomizing the association between the clinical data and the gene expression data.
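The ranking metric may be sketched as follows, under the assumption that Equation (3) groups its terms as t = (⟨x1⟩ - ⟨x2⟩) / sqrt([σ1²(n1-1) + σ2²(n2-1)] / (n1+n2-1) / (1/n1 + 1/n2)). This grouping is a reading of the printed formula; a conventional pooled two-sample t statistic would instead multiply the pooled variance by (1/n1 + 1/n2). The use of the unweighted sample variance and all names are further assumptions of this sketch.

```python
import math
import statistics

def weighted_mean(values, errors):
    """Error-weighted average of log ratios."""
    numerator = sum(v / (e * e) for v, e in zip(values, errors))
    denominator = sum(1.0 / (e * e) for e in errors)
    return numerator / denominator

def rank_metric(group1, errors1, group2, errors2):
    """Variance-compensated difference between two group means,
    following the term grouping assumed in the lead-in."""
    n1, n2 = len(group1), len(group2)
    mean1 = weighted_mean(group1, errors1)
    mean2 = weighted_mean(group2, errors2)
    var1 = statistics.variance(group1)  # sample variance (assumption)
    var2 = statistics.variance(group2)
    pooled = (var1 * (n1 - 1) + var2 * (n2 - 1)) / (n1 + n2 - 1)
    return (mean1 - mean2) / math.sqrt(pooled / (1.0 / n1 + 1.0 / n2))

def rank_genes(metric_by_gene):
    """Order gene names by decreasing absolute value of the metric."""
    return sorted(metric_by_gene, key=lambda g: abs(metric_by_gene[g]), reverse=True)
```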
3. Optimization of the number of marker genes
The leave-one-out method was used for cross validation in order to optimize the discriminating genes. For a set of marker genes from the rank-ordered candidate list, a classifier was trained with 97 samples, and was used to predict the status of the remaining sample. The procedure was repeated for each of the samples in the pool, and the number of cases where the prediction for the one left out is wrong or correct was counted.
The above performance evaluation from leave-one-out cross validation was repeated by successively adding more marker genes from the candidate list. The performance as a function of the number of marker genes is shown in FIG. 6.
The error rates for type 1 and type 2 errors varied with the number of marker genes used, but were both minimal when the number of marker genes was around 550. Therefore, this set of 550 genes is considered the optimal set of marker genes that can be used to classify breast cancer tumors into the "ER-negative" group and the "ER-positive" group. FIG. 7 shows the classification of patients as ER(+) or ER(-) based on this 550 marker set. FIG. 8 shows the correlation of each tumor to the ER-negative template versus the correlation of each tumor to the ER-positive template.
4. Classification based on marker genes
In the fourth step, a set of classifier parameters was calculated for each type of training data set based on either of the above ranking methods. A template for the ER(-) group (called $\vec{z}_1$) was generated using the error-weighted log ratio average of the selected group of genes. Similarly, a template for the ER(+) group (called $\vec{z}_2$) was generated using the error-weighted log ratio average of the selected group of genes. Two classifier parameters ($P_1$ and $P_2$) were defined based on either correlation or distance. $P_1$ measures the similarity between one sample $\vec{y}$ and the ER(-) template $\vec{z}_1$ over this selected group of genes. $P_2$ measures the similarity between one sample $\vec{y}$ and the ER(+) template $\vec{z}_2$ over this selected group of genes. The correlation $P_i$ is defined as:

$$P_i = \frac{\vec{z}_i \cdot \vec{y}}{\lVert \vec{z}_i \rVert \cdot \lVert \vec{y} \rVert} \qquad \text{Equation (1)}$$

A "leave-one-out" method was used to cross-validate the classifier built based on the marker genes. In this method, one sample was reserved for cross validation each time the classifier was trained. For the set of 550 optimal marker genes, the classifier was trained with 97 of the 98 samples, and the status of the remaining sample was predicted. This procedure was performed with each of the 98 patients. The number of cases where the prediction was wrong or correct was counted. It was further determined that subsets of as few as approximately 50 of the 2,460 genes are able to classify tumors as ER(+) or ER(-) nearly as well as using the total set.
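A minimal sketch of the template-based classification follows, assuming the correlation form of the classifier parameters, i.e. the dot product of a template and a sample divided by the product of their norms. The function names, the data layout and the rule of assigning a sample to the template with the larger correlation are assumptions of this sketch.

```python
import math

def weighted_template(samples, errors):
    """Error-weighted average log ratio per marker gene over one group.

    samples and errors are lists of per-sample lists, one value per gene.
    """
    template = []
    for values, errs in zip(zip(*samples), zip(*errors)):
        weights = [1.0 / (e * e) for e in errs]
        template.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return template

def correlation(z, y):
    """P = (z . y) / (||z|| * ||y||)."""
    dot = sum(a * b for a, b in zip(z, y))
    norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def classify(sample, z1, z2):
    """Assign the sample to the template it correlates with more strongly."""
    p1 = correlation(z1, sample)  # similarity to the ER(-) template
    p2 = correlation(z2, sample)  # similarity to the ER(+) template
    label = "ER(-)" if p1 > p2 else "ER(+)"
    return label, p1, p2
```

In a leave-one-out setting, the two templates would be rebuilt from the remaining samples each time one sample is held out for prediction.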
In a small number of cases, there was disagreement between classification by the 550 marker set and a clinical classification. In comparing the microarray-measured log ratio of expression for ESR1 to the clinical binary decision (negative or positive) of ER status for each patient, it was seen that the measured expression is consistent with the qualitative category of clinical measurements (mixture of two methods) for the majority of tumors. For example, two patients who were clinically diagnosed as ER(+) actually exhibited low expression of ESR1 from microarray measurements and were classified as ER(-) by the 550 marker genes. Additionally, 3 patients who were clinically diagnosed as ER(-) exhibited high expression of ESR1 from microarray measurements and were classified as ER(+) by the same 550 marker genes. Statistically, however, microarray-measured gene expression of ESR1 correlates with the dominant patterns better than clinically determined ER status.
Example 3: Identification of Genetic Markers Distinguishing BRCA1 Tumors From Sporadic Tumors in Estrogen Receptor (-) Patients
The BRCA1 mutation is one of the major clinical categories in breast cancer tumors. It was determined that of the tumors of 38 patients in the ER(-) group, 17 exhibited the BRCA1 mutation, while 21 were sporadic tumors. A method was therefore developed that enabled the differentiation of the 17 BRCA1 mutation tumors from the 21 sporadic tumors in the ER(-) group.
1. Selection of candidate discriminating genes
In the first step, a set of candidate genes was identified based on the gene expression patterns of these 38 samples. We first calculated the correlation between the BRCA1-mutation category number and the expression ratio across all 38 samples for each individual gene by Equation (2). The distribution of the correlation coefficients is shown as a histogram defined by the solid line in FIG. 9A. We observed that, while the majority of genes do not correlate with BRCA1 mutation status, a small group of genes correlated at significant levels. It is likely that genes with larger correlation coefficients would serve as reporters for discriminating tumors of BRCA1 mutation carriers from sporadic tumors within the ER(-) group.
In order to evaluate the significance of each correlation coefficient with respect to a null hypothesis that such correlation coefficient could be found by chance, a bootstrap technique was used to generate Monte-Carlo data that randomizes the association between gene expression data of the samples and their categories. 10,000 such Monte-Carlo runs were generated as a control in order to estimate the significance of the marker genes as a group. A threshold of 0.35 in the absolute amplitude of correlation coefficients (either correlation or anti-correlation) was applied both to the real data and the Monte-Carlo data.
Following this method, 430 genes were found to satisfy this criterion for the experimental data. The p-value of the significance, as measured against the 10,000 Monte-Carlo trials, is approximately 0.0048 (FIG. 9B). That is, the probability that this set of 430 genes contained useful information about BRCA1-like tumors vs. sporadic tumors exceeds 99%.
2. Rank-ordering of candidate discriminating genes
In the second step, genes on the candidate list were rank-ordered based on the significance of each gene as a discriminating gene. Here, we used the absolute amplitude of correlation coefficients to rank order the marker genes.
3. Optimization of discriminating genes
In the third step, a subset of genes from the top of this rank-ordered list was used for classification. We defined a BRCA1 group template (called $\vec{z}_1$) by using the error-weighted log ratio average of the selected group of genes. Similarly, we defined a non-BRCA1 group template (called $\vec{z}_2$) by using the error-weighted log ratio average of the selected group of genes. Two classifier parameters ($P_1$ and $P_2$) were defined based on either correlation or distance. $P_1$ measures the similarity between one sample $\vec{y}$ and the BRCA1 template $\vec{z}_1$ over this selected group of genes. $P_2$ measures the similarity between one sample $\vec{y}$ and the non-BRCA1 template $\vec{z}_2$ over this selected group of genes.
For correlation, $P_1$ and $P_2$ were defined in the same way as in Equation (1).
The leave-one-out method was used for cross validation in order to optimize the discriminating genes as described in Example 2. For a set of marker genes from the rank-ordered candidate list, the classifier was trained with 37 samples and the remaining one was predicted. The procedure was repeated for all the samples in the pool, and the number of cases where the prediction for the one left out is wrong or correct was counted.
To determine the number of markers constituting a viable subset, the above performance evaluation from leave-one-out cross validation was repeated by cumulatively adding more marker genes from the candidate list. The performance as a function of the number of marker genes is shown in FIG. 10. The error rates for type 1 (false negative) and type 2 (false positive) errors (Bendat & Piersol, RANDOM DATA ANALYSIS AND MEASUREMENT PROCEDURES, 2D ED., Wiley Interscience, p. 89) reached optimal ranges when the number of marker genes was approximately 100. Therefore, a set of about 100 genes is considered to be the optimal set of marker genes that can be used to classify tumors in the ER(-) group as either BRCA1-related tumors or sporadic tumors.
The classification results using the optimal 100 genes are shown in FIGS. 11A and 11B. As shown in FIG. 11A, the co-regulation patterns of the sporadic patients differ from those of the BRCA1 patients primarily in the amplitude of regulation. Only one sporadic tumor was classified into the BRCA1 group. Patients in the sporadic group are not necessarily BRCA1 mutation negative; however, it is estimated that only approximately 5% of sporadic tumors are indeed BRCA1-mutation carriers.
Example 4: Identification of Genetic Markers Distinguishing Sporadic Tumor Patients with >5 Year Versus <5 Year Survival Times
78 tumors from sporadic breast cancer patients were used to explore prognostic predictors from gene expression data. Of the 78 samples in this sporadic breast cancer group, 44 samples were known clinically to have had no distant metastases within 5 years since the initial diagnosis ("no distant metastases group") and 34 samples had distant metastases within 5 years since the initial diagnosis ("distant metastases group"). A group of 231 markers, and optimally a group of 70 markers, was identified that allowed differentiation between these two groups.

1. Selection of candidate discriminating genes
In the first step, a set of candidate discriminating genes was identified based on gene expression data of these 78 samples. The correlation between the prognostic category number (distant metastases vs. no distant metastases) and the logarithmic expression ratio across all samples for each individual gene was calculated using Equation (2). The distribution of the correlation coefficients is shown as a solid line in FIG. 12A.
FIG. 12A also shows the result of one Monte-Carlo run as a dashed line. We observe that even though the majority of genes do not correlate with the prognostic categories, a small group of genes do correlate. It is likely that genes with larger correlation coefficients would be more useful as reporters for the prognosis of interest, that is, for distinguishing the distant metastases group from the no distant metastases group.
In order to evaluate the significance of each correlation coefficient with respect to a null hypothesis that such correlation coefficient can be found by chance, we used a bootstrap technique to generate data from 10,000 Monte-Carlo runs as a control (FIG. 12B). We then selected genes that either have the correlation coefficient larger than 0.3 ("correlated genes") or less than -0.3 ("anti-correlated genes"). The same selection criterion was applied both to the real data and the Monte-Carlo data. Using this comparison, 231 markers from the experimental data were identified that satisfy this criterion. The probability of this gene set for discriminating patients between the distant metastases group and the no distant metastases group being chosen by random fluctuation is approximately 0.003.
2. Rank-ordering of candidate discriminating genes
In the second step, genes on the candidate list were rank-ordered based on the significance of each gene as a discriminating gene. Specifically, a metric similar to a "Fisher" statistic, defined in Equation (3), was used for the purpose of rank ordering. The confidence level of each gene in the candidate list was estimated with respect to a null hypothesis derived from the actual data set using the bootstrap technique.
Genes in the candidate list can also be ranked by the amplitude of correlation coefficients.
3. Optimization of discriminating genes
In the third step, a subset of 5 genes from the top of this rank-ordered list was selected to use as discriminating genes to classify 78 tumors into a "distant metastases group" or a "no distant metastases group". The leave-one-out method was used for cross validation. Specifically, 77 samples defined a classifier based on the set of selected discriminating genes, and these were used to predict the remaining sample.
This procedure was repeated so that each of the 78 samples was predicted. The number of cases in which predictions were correct or incorrect was counted. The performance of the classifier was measured by the error rates of type 1 and type 2 for this selected gene set.
We repeated the above performance evaluation procedure, adding 5 more marker genes each time from the top of the candidate list, until all 231 genes were used. As shown in FIG. 13, the number of mis-predictions of type 1 and type 2 errors changes dramatically with the number of marker genes employed. The combined error rate reached a minimum when 70 marker genes from the top of our candidate list were used.
Therefore, this set of 70 genes is the optimal, preferred set of marker genes useful for the classification of sporadic tumor patients into either the distant metastases or no distant metastases group.
Fewer or more markers also act as predictors, but are less efficient, either because of higher error rates, or the introduction of statistical noise.
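The stepwise optimization described above may be sketched as follows. The sketch assumes a nearest-template classifier based on a normalized dot product, unweighted group means in place of error-weighted averages, and a gene ranking supplied as input; labels are 1 for the distant metastases group and 0 for the no distant metastases group, with type 1 counted as false negative and type 2 as false positive. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Normalized dot product of two profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_profile(rows):
    """Unweighted per-gene mean over a group of samples (simplification)."""
    return [sum(col) / len(col) for col in zip(*rows)]

def loo_errors(samples, labels, genes):
    """Leave-one-out error counts (type 1, type 2) using only `genes`."""
    type1 = type2 = 0
    for i, (sample, label) in enumerate(zip(samples, labels)):
        train = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        pos = mean_profile([[s[g] for g in genes] for s, l in train if l == 1])
        neg = mean_profile([[s[g] for g in genes] for s, l in train if l == 0])
        y = [sample[g] for g in genes]
        predicted = 1 if cosine(pos, y) > cosine(neg, y) else 0
        if label == 1 and predicted == 0:
            type1 += 1  # false negative
        elif label == 0 and predicted == 1:
            type2 += 1  # false positive
    return type1, type2

def sweep_marker_counts(samples, labels, ranked_genes, step=5):
    """Evaluate the top k genes for k = step, 2*step, ... and return the
    setting with the smallest combined error, plus all results."""
    results = []
    for k in range(step, len(ranked_genes) + 1, step):
        t1, t2 = loo_errors(samples, labels, ranked_genes[:k])
        results.append((k, t1, t2))
    best = min(results, key=lambda r: r[1] + r[2])
    return best, results
```

When several marker counts give the same combined error, this sketch returns the smallest of them.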
4. Reoccurrence probability curves
The prognostic classification of 78 patients with sporadic breast cancer tumors into two distinct subgroups was predicted based on their expression of the 70 optimal marker genes (FIGS. 14 and 15).
To evaluate the prognostic classification of sporadic patients, we predicted the outcome of each patient by a classifier trained by the remaining 77 patients based on the 70 optimal marker genes. FIG. 16 plots the distant metastases probability as a function of the time since initial diagnosis for the two predicted groups. The difference between these two reoccurrence curves is significant. Using the χ² test (S-PLUS 2000 Guide to Statistics, vol. 2, MathSoft, p. 44), the p-value is estimated to be 10^-9. The distant metastases probability as a function of the time since initial diagnosis was also compared between ER(+) and ER(-) individuals (FIG. 17), PR(+) and PR(-) individuals (FIG. 18), and between individuals with different tumor grades (FIGS. 19A, 19B). In comparison, the p-values for the differences between two prognostic groups based on clinical data are much less significant than that based on gene expression data, ranging from 10^-3 to 1.
To parameterize the reoccurrence probability as a function of time since initial diagnosis, the curve was fitted to one type of survival model, "normal":

$$P = \alpha \times \exp(-t^2/\tau^2) \qquad \text{Equation (4)}$$

For fixed α = 1, we found that τ = 125 months for patients in the no distant metastases group and τ = 36 months for patients in the distant metastases group. Using tumor grades, we found τ = 100 months for patients with tumor grades 1 and 2 and τ = 60 months for patients with tumor grade 3. It is accepted clinical practice that tumor grades are the best available prognostic predictor. However, the difference between the two prognostic groups classified based on 70 marker genes is much more significant than those classified by the best available clinical information.
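The "normal" model can be evaluated and fitted with a few lines of code, assuming the form P = α · exp(-t²/τ²) with t and τ in months. The fitting routine below, a least-squares fit through the origin of -ln(P) against t² with α fixed at 1, is only one possible approach; the fitting procedure actually used is not stated above. All names are hypothetical.

```python
import math

def model_probability(t_months, tau_months, alpha=1.0):
    """Evaluate P = alpha * exp(-t^2 / tau^2)."""
    return alpha * math.exp(-(t_months ** 2) / (tau_months ** 2))

def fit_tau(times, probabilities):
    """Estimate tau with alpha fixed at 1, using -ln(P) = t^2 / tau^2.

    Fits the slope b of -ln(P) versus t^2 through the origin; tau = 1/sqrt(b).
    Probabilities must lie strictly between 0 and 1.
    """
    numerator = sum((t ** 2) * (-math.log(p)) for t, p in zip(times, probabilities))
    denominator = sum(t ** 4 for t in times)
    return 1.0 / math.sqrt(numerator / denominator)

# Model values at 60 months for the two fitted time constants reported above.
p_no_metastases_group = model_probability(60, 125)
p_metastases_group = model_probability(60, 36)
```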
5. Prognostic prediction for 19 independent sporadic tumors
To confirm the proposed prognostic classification method and to ensure the reproducibility, robustness, and predicting power of the 70 optimal prognostic marker genes, we applied the same classifier to 19 independent tumor samples from sporadic breast cancer patients, prepared separately at The Netherlands Cancer Institute.
The same reference pool was used.
The classification results of 19 independent sporadic tumors are shown in FIG. 20. FIG. 20A shows the log ratio of expression regulation of the same 70 optimum marker genes. Based on our classifier model, we expected the misclassification of 19*(6+7)/78 = 3.2 tumors. Consistently, (1+3) = 4 of 19 tumors were misclassified.
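The expected number of misclassifications follows from scaling the training error rate to the size of the independent set, as the short calculation below illustrates. The function name is hypothetical; the counts are those given above.

```python
def expected_misclassifications(n_new, training_errors, n_training):
    """Scale the cross-validated error rate to a new set of n_new samples."""
    return n_new * training_errors / n_training

# 6 + 7 errors among the 78 training tumors, applied to 19 independent tumors.
expected = expected_misclassifications(19, 6 + 7, 78)
```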
6. Clinical parameters as a group vs. microarray data - Results of logistic regression
In the previous section, the predictive power of each individual clinical parameter was compared with that of the expression data. However, it is more meaningful to combine all the clinical parameters as a group, and then compare them to the expression data. This requires multivariate modeling; the method chosen was logistic regression.
Such an approach also demonstrates how much improvement the microarray approach adds to the results of the clinical data.
The clinical parameters used for the multivariate modeling were: (1) tumor grade; (2) ER status; (3) presence or absence of the progesterone receptor (PR); (4) tumor size; (5) patient age; and (6) presence or absence of angioinvasion. For the microarray data, two correlation coefficients were used. One is the correlation to the mean of the good prognosis group (C1) and the other is the correlation to the mean of the bad prognosis group (C2). When calculating the correlation coefficients for a given patient, this patient is excluded from either of the two means.
The logistic regression optimizes the coefficient of each input parameter to best predict the outcome of each patient. One way to judge the predictive power of each input parameter is by how much deviance (similar to Chi-square in linear regression; see, for example, Hosmer & Lemeshow, APPLIED LOGISTIC REGRESSION, John Wiley & Sons, (2000)) the parameter accounts for. The best predictor should account for most of the deviance. To fairly assess the predictive power, each parameter was modeled independently. The microarray parameters explain most of the deviance, and hence are powerful predictors.
The clinical parameters and the two microarray parameters were then modeled as a group. The total deviance explained by the six clinical parameters was 31.5, and the total deviance explained by the microarray parameters was 39.4. However, when the clinical data was modeled first, and the two microarray parameters added, the final deviance accounted for was 57.0.
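For orientation, the deviance of a binary-outcome model can be computed as minus twice the log-likelihood, as sketched below. The null deviance is not reported above; the value computed here assumes that the modeled outcome is membership in the good prognosis group (44 patients) versus the poor prognosis group (34 patients) and that the null model predicts the overall base rate for every patient. All names are hypothetical.

```python
import math

def binomial_deviance(outcomes, probabilities):
    """Deviance = -2 * log-likelihood for 0/1 outcomes and predicted
    probabilities strictly between 0 and 1."""
    log_likelihood = 0.0
    for y, p in zip(outcomes, probabilities):
        log_likelihood += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -2.0 * log_likelihood

# Assumed outcome vector: 44 good prognosis (1) and 34 poor prognosis (0).
outcomes = [1] * 44 + [0] * 34
base_rate = sum(outcomes) / len(outcomes)
null_deviance = binomial_deviance(outcomes, [base_rate] * len(outcomes))

# Additional deviance accounted for when the two microarray parameters are
# added to the six clinical parameters (values taken from the text).
added_by_microarray = 57.0 - 31.5
```

Under these assumptions, the deviance explained by a model is the difference between the null deviance and the deviance of the fitted model.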
The logistic regression computes the likelihood that a patient belongs to the good or poor prognostic group. FIGS. 21A and 21B show the sensitivity vs. (1-specificity).
The plots were generated by varying the threshold on the model predicted likelihood. The curve which goes through the top left corner is the best (high sensitivity with high specificity). The microarray outperformed the clinical data by a large margin.
For example, at a fixed sensitivity of around 80%, the specificity was ~80% from the microarray data, and ~65% from the clinical data for the good prognosis group. For the poor prognosis group, the corresponding specificities were ~80% and ~70%, again at a fixed sensitivity of 80%. Combining the microarray data with the clinical data further improved the results.
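The construction of such a curve can be sketched by sweeping the threshold over the model-predicted likelihoods. The scores, labels and thresholds below are hypothetical toy values, and a sample is counted as predicted positive when its score is at or above the threshold; that convention is an assumption.

```python
def roc_points(scores, is_positive, thresholds):
    """Return (threshold, sensitivity, 1 - specificity) for each threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, p in zip(scores, is_positive) if p and s >= t)
        fn = sum(1 for s, p in zip(scores, is_positive) if p and s < t)
        tn = sum(1 for s, p in zip(scores, is_positive) if not p and s < t)
        fp = sum(1 for s, p in zip(scores, is_positive) if not p and s >= t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((t, sensitivity, 1.0 - specificity))
    return points

# Hypothetical predicted likelihoods and true group membership.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.05]
is_positive = [True, True, True, False, True, False, False, False]
curve = roc_points(scores, is_positive, [0.0, 0.25, 0.5, 0.75, 1.01])
```

Plotting the second element of each point against the third gives a curve of the kind shown in FIGS. 21A and 21B; a model whose curve passes closer to the top left corner is the better predictor.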
The result can also be displayed as the total error rate as a function of the threshold, as in FIG. 21C. At all possible thresholds, the error rate from the microarray was always smaller than that from the clinical data. Adding the microarray data to the clinical data reduces the error rate further, as can be seen in FIG. 21C.
Odds ratio tables can be created from the prediction of the logistic regression. The probability of a patient being in the good prognosis group is calculated by the logistic regression based on different combinations of input parameters (clinical and/or microarray). Patients are divided into the following four groups according to the prediction and the true outcome: (1) predicted good and truly good, (2) predicted good but truly poor, (3) predicted poor but truly good, (4) predicted poor and truly poor. Groups (1) & (4) represent correct predictions, while groups (2) & (3) represent mis-predictions. The division for the prediction is set at a probability of 50%, although other thresholds can be used. The results are listed in Table 7. It is clear from Table 7 that microarray profiling (Tables 7.3 & 7.10) outperforms any single clinical parameter (Tables 7.4-7.9) and the combination of the clinical data (Table 7.2). Adding the microarray profiling to the clinical data gives the best results (Table 7.1).
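The division into four groups, and the summary statistics that follow from it, can be sketched as below. The counts for the clinical+microarray case are those of Table 7.1. The cross-product odds ratio is the standard definition and is an assumption here, since odds ratio values are not listed in Table 7; the function names and the toy probabilities are hypothetical.

```python
def classify_groups(prob_good, truly_good, threshold=0.5):
    """Count patients by (predicted, true) prognosis at the given threshold."""
    counts = {(pred, true): 0
              for pred in ("good", "poor") for true in ("good", "poor")}
    for p, t in zip(prob_good, truly_good):
        pred = "good" if p >= threshold else "poor"
        true = "good" if t else "poor"
        counts[(pred, true)] += 1
    return counts

def summarize(counts):
    """Return (error rate, cross-product odds ratio) for a 2x2 table."""
    gg = counts[("good", "good")]   # group (1): predicted good, truly good
    gp = counts[("good", "poor")]   # group (2): predicted good, truly poor
    pg = counts[("poor", "good")]   # group (3): predicted poor, truly good
    pp = counts[("poor", "poor")]   # group (4): predicted poor, truly poor
    error_rate = (gp + pg) / (gg + gp + pg + pp)
    odds_ratio = (gg * pp) / (gp * pg) if gp * pg else float("inf")
    return error_rate, odds_ratio

# Counts from Table 7.1 (clinical + microarray), keyed as (predicted, true).
table_7_1 = {("good", "good"): 39, ("poor", "good"): 5,
             ("good", "poor"): 4, ("poor", "poor"): 30}
error_7_1, odds_7_1 = summarize(table_7_1)

# Hypothetical predicted probabilities for four patients.
toy = classify_groups([0.9, 0.4, 0.6, 0.1], [True, True, False, False])
```

For Table 7.1 this gives 9 mis-predictions out of 78 patients and a cross-product odds ratio of 58.5.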

For microarray profiling, one can also make a similar table (Table 7.11) without using logistic regression. In this case, the prediction was simply based on C1-C2 (greater than 0 means good prognosis, less than 0 means bad prognosis).
Table 7.1 Prediction by clinical+microarray
              Predicted good   Predicted poor
true good           39                5
true poor            4               30

Table 7.2 Prediction by clinical alone
              Predicted good   Predicted poor
true good           34               10
true poor           12               22

Table 7.3 Prediction by microarray
              Predicted good   Predicted poor
true good           39                5
true poor           10               24

Table 7.4 Prediction by grade
              Predicted good   Predicted poor
true good           23               21
true poor            5               29

Table 7.5 Prediction by ER
              Predicted good   Predicted poor
true good           35                9
true poor           21               13

Table 7.6 Prediction by PR
              Predicted good   Predicted poor
true good           35                9
true poor           18               16

Table 7.7 Prediction by size
              Predicted good   Predicted poor
true good           35                9
true poor           13               21

Table 7.8 Prediction by age
              Predicted good   Predicted poor
true good           33               11
true poor           15               19

Table 7.9 Prediction by angioinvasion
              Predicted good   Predicted poor
true good           37                7
true poor           19               15

Table 7.10 Prediction by dC (C1-C2)
              Predicted good   Predicted poor
true good           36                8
true poor            6               28

Table 7.11 No logistic regression, simply judged by C1-C2
              Predicted good   Predicted poor
true good           37                7
true poor            6               28

Example 5. Concept of mini-array for diagnosis purposes.
All genes on the marker gene list for the purpose of diagnosis and prognosis can be synthesized on a small-scale microarray using ink jet technology. A microarray with genes for diagnosis and prognosis can respectively or collectively be made.
Each gene on the list is represented by single or multiple oligonucleotide probes, depending on its sequence uniqueness across the genome. This custom designed mini-array, in combination with sample preparation protocol, can be used as a diagnostic/prognostic kit in clinics.
Example 6. Biological Significance of diagnostic marker genes
The public domain was searched for the available functional annotations for the 430 marker genes for BRCA1 diagnosis in Table 3. The 430 diagnostic genes in Table 3 can be divided into two groups: (1) 196 genes that are highly expressed in the BRCA1-like group; and (2) 234 genes that are highly expressed in the sporadic group. Of the 196 BRCA1 group genes, 94 are annotated. Of the 234 sporadic group genes, 100 are annotated. The terms "T-cell", "B-cell" or "immunoglobulin" are involved in 13 of the 94 annotated genes, and in 1 of the 100 annotated genes, respectively. Of 24,479 genes represented on the microarrays, there are 7,586 genes with annotations to date. "T-cell", "B-cell" and "immunoglobulin" are found in 207 of these 7,586 genes. Given this, the p-value of the 13 "T-cell", "B-cell" or "immunoglobulin" genes in the BRCA1 group is very significant (p-value = 1.1 x 10^-6). In comparison, the observation of 1 gene relating to "T-cell", "B-cell", or "immunoglobulin" in the sporadic group is not significant (p-value = 0.18).
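An enrichment p-value of this kind can be sketched with a hypergeometric upper tail: the probability of finding at least 13 keyword-annotated genes among 94 annotated genes drawn at random from 7,586 annotated genes, 207 of which carry the keyword. The statistical test actually used to obtain the p-values above is not specified here, so the hypergeometric test is an assumption and the sketch need not reproduce the stated values exactly. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the keyword annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total

# BRCA1 group: 13 of 94 annotated genes carry an immune-related keyword;
# 207 of the 7,586 annotated genes on the microarray carry such a keyword.
p_brca1 = hypergeom_tail(13, 94, 207, 7586)
```

Under random selection only about 2.6 of the 94 genes (94 x 207 / 7,586) would be expected to carry such a keyword, which is why 13 is a strong over-representation.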
The observation that BRCA1 patients have highly expressed lymphocyte (T-cell and B-cell) genes agrees with what has been seen from pathology, namely that BRCA1 breast tumors are more frequently associated with high lymphocytic infiltration than sporadic cases (Chappuis et al., 2000, Semin Surg Oncol 18:287-295).
Example 7. Biological significance of prognosis marker genes
A search was performed for available functional annotations for the 231 prognosis marker genes (Table 5). The markers fall into two groups: (1) 156 markers that are highly expressed in the poor prognostic group; and (2) 75 genes that are highly expressed in the good prognostic group. Of the 156 markers, 72 genes are annotated; of the 75 genes, 28 genes are annotated.
Twelve of the 72 markers, but none of the 28 markers, are, or are associated with, kinases. In contrast, of the 7,586 genes on the microarray having annotations to date, only 471 involve kinases. On this basis, the presence of twelve kinase-related markers in the poor prognostic group is significant (p-value = 0.001). Kinases are important regulators of intracellular signal transduction pathways mediating cell proliferation, differentiation and apoptosis. Their activity is normally tightly controlled and regulated.
Overexpression of certain kinases is well known to be involved in oncogenesis. One example is vascular endothelial growth factor receptor 1 (VEGFR1 or FLT1), a tyrosine kinase in the poor prognosis group, which plays a very important role in tumor angiogenesis. Interestingly, vascular endothelial growth factor (VEGF), VEGFR's ligand, is also found in the poor prognosis group, which means both ligand and receptor are upregulated in poor prognostic individuals by an unknown mechanism.
Likewise, 16 of the 72 markers, and only two of the 28 markers, are, or are associated with, ATP-binding or GTP-binding proteins. In contrast, of the 7,586 genes on the microarray having annotations to date, only 714 and 153 involve ATP-binding and GTP-binding, respectively. On this basis, the presence of 16 GTP- or ATP-binding-related markers in the poor prognosis group is significant (p-values 0.001 and 0.0038).
Thus, the kinase- and ATP- or GTP-binding-related markers within the 72 markers can be used as prognostic indicators.
Cancer is characterized by deregulated cell proliferation. On the simplest level, this requires division of the cell, or mitosis. By keyword searching, we found "cell division" or "mitosis" included in the annotations of 7 genes among the 72 annotated markers from the 156 poor prognosis markers, but in none of the 28 annotated genes from the 75 good prognosis markers. Of the 7,586 microarray markers with annotations, "cell division" is found in 62 annotations and "mitosis" is found in 37 annotations. Based on these findings, the presence of seven cell division- or mitosis-related markers in the poor prognosis group is estimated to be highly significant (p-value = 3.5 x 10^-5). In comparison, the absence of cell division- or mitosis-related markers in the good prognosis group is not significant (p-value = 0.69). Thus, the seven cell division- or mitosis-related markers may be used as markers for poor prognosis.

Example 8: Construction of an artificial reference pool.
The reference pool for expression profiling in the above Examples was made by using equal amounts of cRNAs from each individual patient in the sporadic group. In order to have a reliable, easy-to-make reference pool available in large amounts, a reference pool for breast cancer diagnosis and prognosis can be constructed using synthetic nucleic acid representing, or derived from, each marker gene. Expression of marker genes for an individual patient sample is monitored only against the reference pool, not a pool derived from other patients.
To make the reference pool, 60-mer oligonucleotides are synthesized according to the 60-mer ink jet array probe sequence for each diagnostic/prognostic reporter gene, then made double-stranded and cloned into pBluescript SK- vector (Stratagene, La Jolla, CA), adjacent to the T7 promoter sequence. Individual clones are isolated, and the sequences of their inserts are verified by DNA sequencing. To generate synthetic RNAs, clones are linearized with EcoRI and a T7 in vitro transcription (IVT) reaction is performed according to the MegaScript kit (Ambion, Austin, TX). IVT is followed by DNase treatment of the product. Synthetic RNAs are purified on RNeasy columns (Qiagen, Valencia, CA). These synthetic RNAs are transcribed, amplified, labeled, and mixed together to make the reference pool. The abundances of those synthetic RNAs are adjusted to approximate the abundances of the corresponding marker-derived transcripts in the real tumor pool.
Example 9: Use of single-channel data and a sample pool represented by stored values.
1. Creation of a reference pool of stored values ("mathematical sample pool")
The use of ratio-based data in Examples 1-7, above, requires a physical reference sample. In the above Examples, a pool of sporadic tumor samples was used as the reference. Use of such a reference, while enabling robust prognostic and diagnostic predictions, can be problematic because the pool is typically a limited resource. A classifier method was therefore developed that does not require a physical sample pool, making application of this predictive and diagnostic technique much simpler in clinical applications.
To test whether single-channel data could be used, the following procedure was developed. First, the single channel intensity data for the 70 optimal genes, described in Example 4, from the 78 sporadic training samples, described in the Materials and Methods, was selected from the sporadic sample vs. tumor pool hybridization data. The 78 samples consisted of 44 samples from patients having a good prognosis and 34 samples from patients having a poor prognosis. Next, the hybridization intensities for these samples were normalized by dividing by the median intensity of all the biological spots on the same microarray. Where multiple microarrays per sample were used, the average was taken across all of the microarrays. A log transform was performed on the intensity data for each of the 70 genes, or on the average intensity for each of the 70 genes where more than one microarray was hybridized, and a mean log intensity for each gene across the 78 sporadic samples was calculated. For each sample, the mean log intensities thus calculated were subtracted from the individual sample log intensities. This figure, the mean subtracted log(intensity), was then treated as the two color log(ratio) for the classifier by substitution into Equation (5). For new samples, the mean log intensity is subtracted in the same manner as noted above, and a mean subtracted log(intensity) is calculated.
The creation of a set of mean log intensities for each gene hybridized creates a "mathematical sample pool" that replaces the quantity-limited "material sample pool."
This mathematical sample pool can then be applied to any sample, including samples in hand and ones to be collected in the future. This "mathematical sample pool" can be updated as more samples become available.
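The creation and use of the mathematical sample pool described above can be sketched as follows. The sketch assumes a base-10 logarithm (the base of the log transform is not stated), one microarray per sample (averaging across replicate microarrays is omitted), and hypothetical function names and toy intensities.

```python
from math import log10
from statistics import mean, median

def normalized_log_intensities(spot_intensities, marker_indices):
    """Divide by the median of all spots on the array, then log-transform
    the intensities of the marker genes only."""
    med = median(spot_intensities)
    return [log10(spot_intensities[i] / med) for i in marker_indices]

def build_mathematical_pool(training_logs):
    """Mean log intensity of each marker gene across the training samples."""
    n_genes = len(training_logs[0])
    return [mean(sample[g] for sample in training_logs)
            for g in range(n_genes)]

def mean_subtracted(sample_logs, pool):
    """Mean-subtracted log(intensity), used in place of the two-color log(ratio)."""
    return [x - m for x, m in zip(sample_logs, pool)]

# Hypothetical toy arrays: 5 spots each, marker genes at spots 0 and 1.
arrays = [
    [100.0, 400.0, 200.0, 200.0, 200.0],
    [400.0, 100.0, 200.0, 200.0, 200.0],
    [200.0, 200.0, 200.0, 200.0, 200.0],
]
marker_indices = [0, 1]
training_logs = [normalized_log_intensities(a, marker_indices) for a in arrays]
pool = build_mathematical_pool(training_logs)
centered = [mean_subtracted(s, pool) for s in training_logs]

# A new sample is processed against the stored pool in the same manner.
new_sample = mean_subtracted(
    normalized_log_intensities([800.0, 200.0, 200.0, 200.0, 200.0],
                               marker_indices), pool)
```

Only the list of per-gene means needs to be stored, which is what allows the pool to be reused for future samples and recomputed as more training samples become available.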
2. Results
To demonstrate that the mathematical sample pool performs a function equivalent to the sample reference pool, the mean-subtracted-log(intensity) (single channel data, relative to the mathematical pool) vs. the log(ratio) (hybridizations, relative to the sample pool) was plotted for the 70 optimal reporter genes across the 78 sporadic samples, as shown in FIG. 22. The ratio and single-channel quantities are highly correlated, indicating both have the capability to report relative changes in gene expression. A classifier was then constructed using the mean-subtracted-log(intensity) following exactly the same procedure as was followed using the ratio data, as in Example 4.
As shown in FIGS. 23A and 23B, single-channel data was successful at classifying samples based on gene expression patterns. FIG. 23A shows samples grouped according to prognosis using single-channel hybridization data. The white line separates samples from patients classified as having poor prognoses (below) and good prognoses (above). FIG. 23B plots each sample as its expression data correlates with the good (open circles) or poor (filled squares) prognosis classifier parameter. Using the "leave-one-out" cross validation method, the classifier predicted 10 false positives out of 44 samples from patients having a good prognosis, and 6 false negatives out of 34 samples from patients having a poor prognosis, where a poor prognosis is considered a "positive." This outcome is comparable to the use of the ratio-based classifier, which predicted 7 out of 44, and 6 out of 34, respectively.
In clinical applications, it is greatly preferable to have few false negatives, which results in fewer under-treated patients. To conform the results to this preference, a classifier was constructed by ranking the patient samples according to their coefficients of correlation to the "good prognosis" template, and choosing a threshold for this correlation coefficient that allows approximately 10% false negatives, i.e., classification of a sample from a patient with poor prognosis as one from a patient with a good prognosis. Out of the 34 poor prognosis samples used herein, this represents a tolerance of 3 out of 34 poor prognosis patients classified incorrectly. This tolerance limit corresponds to a threshold coefficient of correlation to the "good prognosis" template of 0.2727. Results using this threshold are shown in FIGS. 24A and 24B. FIG. 24A shows single-channel hybridization data for samples ranked according to the coefficients of correlation with the good prognosis classifier; samples classified as "good prognosis" lie above the white line, and those classified as "poor prognosis" lie below. FIG. 24B shows a scatterplot of sample correlation coefficients, with three incorrectly classified samples lying to the right of the threshold correlation coefficient value. Using this threshold, the classifier had a false positive rate of 15 out of the 44 good prognosis samples. This result is not very different from the error rate of 12 out of 44 for the ratio based classifier.
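The selection of a threshold that tolerates a fixed number of false negatives can be sketched as follows. The sketch assumes that a sample is classified as "good prognosis" when its correlation to the good prognosis template is strictly greater than the threshold; the function names and correlation values are hypothetical, and ties between samples are not handled.

```python
def choose_threshold(corr_to_good, is_poor, max_false_negatives):
    """Pick the threshold so that at most max_false_negatives poor prognosis
    samples have a correlation strictly above it."""
    poor_corrs = sorted((c for c, p in zip(corr_to_good, is_poor) if p),
                        reverse=True)
    if max_false_negatives >= len(poor_corrs):
        return float("-inf")
    return poor_corrs[max_false_negatives]

def classify(corr, threshold):
    """Good prognosis if the correlation exceeds the threshold, else poor."""
    return "good" if corr > threshold else "poor"

# Hypothetical correlations to the good prognosis template.
corr_to_good = [0.8, 0.6, 0.5, 0.4, 0.3, 0.1, -0.2, -0.5]
is_poor = [False, False, True, False, True, True, True, True]

threshold = choose_threshold(corr_to_good, is_poor, max_false_negatives=1)
calls = [classify(c, threshold) for c in corr_to_good]
false_negatives = sum(1 for c, p in zip(calls, is_poor) if p and c == "good")
false_positives = sum(1 for c, p in zip(calls, is_poor) if not p and c == "poor")
```

With the 34 poor prognosis samples described above, a tolerance of 3 would correspond to `max_false_negatives=3`. Lowering the tolerance moves the threshold up, which trades fewer false negatives for more false positives.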
In summary, the 70 reporter genes carry robust information about prognosis; the single channel data can predict the tumor outcome almost as well as the ratio based data, while being more convenient in a clinical setting.

7. REFERENCES CITED
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
Many modifications and variations of the present invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art.
The specific embodiments described herein are offered by way of example only, and the invention is to be limited only by the terms of the appended claims along with the full scope of equivalents to which such claims are entitled.

Claims (60)

1. A method for classifying a cell sample as ER(+) or ER(-) comprising detecting a difference in the expression by said cell sample of a first plurality of genes relative to a control, said first plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 1.
2. The method of claim 1, wherein said plurality consists of at least 50 of the genes corresponding to the markers listed in Table 1.
3. The method of claim 1, wherein said plurality consists of at least 100 of the genes corresponding to the markers listed in Table 1.
4. The method of claim 1, wherein said plurality consists of at least 200 of the genes corresponding to the markers listed in Table 1.
5. The method of claim 1, wherein said plurality consists of at least 500 of the genes corresponding to the markers listed in Table 1.
6. The method of claim 1, wherein said plurality consists of at least 1000 of the genes corresponding to the markers listed in Table 1.
7. The method of claim 1, wherein said plurality consists of each of the genes corresponding to the 2,460 markers listed in Table 2.
8. The method of claim 1, wherein said plurality consists of the 550 gene markers listed in Table 2.
9. The method of claim 1, wherein said control comprises nucleic acids derived from a pool of tumors from individual sporadic patients.
10. The method of claim 1, wherein said detecting comprises the steps of (a) generating an ER(+) template by hybridization of nucleic acids derived from a plurality of ER(+) patients within a plurality of sporadic patients against nucleic acids derived from a pool of tumors from individual sporadic patients;

(b) generating an ER(-) template by hybridization of nucleic acids derived from a plurality of ER(-) patients within said plurality of sporadic patients against nucleic acids derived from said pool of tumors from individual sporadic patients within said plurality;
(c) hybridizing nucleic acids derived from an individual sample against said pool; and (d) determining the similarity of marker gene expression in the individual sample to the ER(+) template and the ER(-) template, wherein if said expression is more similar to the ER(+) template, the sample is classified as ER(+), and if said expression is more similar to the ER(-) template, the sample is classified as ER(-).
11. A method for classifying a cell sample as BRCA1-related or sporadic, comprising detecting a difference in the expression of a first plurality of genes relative to a control, said first plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 3.
12. The method of claim 11, wherein said plurality consists of at least 50 of the genes corresponding to the markers listed in Table 3.
13. The method of claim 11, wherein said plurality consists of at least 100 of the genes corresponding to the markers listed in Table 3.
14. The method of claim 11, wherein said plurality consists of at least 200 of the genes corresponding to the markers listed in Table 3.
15. The method of claim 11, wherein said plurality consists of each of the genes corresponding to the 430 markers listed in Table 3.
16. The method of claim 11, wherein said plurality consists of each of the genes corresponding to the 100 markers listed in Table 4.
17. The method of claim 11, wherein said control comprises nucleic acids derived from a pool of tumors from individual sporadic patients.
18. The method of claim 11, wherein said detecting comprises the steps of (a) generating a BRCA1 template by hybridization of nucleic acids derived from a plurality of BRCA1 patients within a plurality of ER(-) patients against nucleic acids derived from a pool of tumors;
(b) generating a sporadic template by hybridization of nucleic acids derived from a plurality of sporadic patients within said plurality of ER(-) patients against nucleic acids derived from said pool of tumors;
(c) hybridizing nucleic acids derived from an individual sample against said pool; and (d) determining the similarity of marker gene expression in the individual sample to the BRCA1 template and the sporadic template, wherein if said expression is more similar to the BRCA1 template, the sample is classified as BRCA1, and if said expression is more similar to the sporadic template, the sample is classified as sporadic.
19. A method for classifying an individual as having a good prognosis (no distant metastases within five years of initial diagnosis) or a poor prognosis (distant metastases within five years of initial diagnosis), comprising detecting a difference in the expression of a first plurality of genes in a cell sample taken from the individual relative to a control, said first plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 5.
20. The method of claim 19, wherein said plurality consists of at least 20 of the genes corresponding to the markers listed in Table 5.
21. The method of claim 19, wherein said plurality consists of at least 100 of the genes corresponding to the markers listed in Table 5.
22. The method of claim 19, wherein said plurality consists of at least 150 of the genes corresponding to the markers listed in Table 5.
23. The method of claim 19, wherein said plurality consists of each of the genes corresponding to the 231 markers listed in Table 5.
24. The method of claim 19, wherein said plurality consists of the 70 gene markers listed in Table 6.
25. The method of claim 1, wherein said control comprises nucleic acids derived from a pool of tumors from individual sporadic patients.
26. The method of claim 19, wherein said detecting comprises the steps of:
(a) generating a good prognosis template by hybridization of nucleic acids derived from a plurality of good prognosis patients against nucleic acids derived from a pool of tumors from individual patients;
(b) generating a poor prognosis template by hybridization of nucleic acids derived from a plurality of poor prognosis patients against nucleic acids derived from said pool of tumors from said plurality of individual patients;
(c) hybridizing nucleic acids derived from an individual sample against said pool; and (d) determining the similarity of marker gene expression in the individual sample to the good prognosis template and the poor prognosis template, wherein if said expression is more similar to the good prognosis template, the sample is classified as having a good prognosis, and if said expression is more similar to the poor prognosis template, the sample is classified as having a poor prognosis.
27. The method of claim 1, wherein the cell sample is additionally classified as BRCA1-related or sporadic by detecting a difference in the expression of a second plurality of genes in a cell sample taken from the individual relative to a control, said second plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 3 or Table 4.
28. The method of claim 1, wherein the cell sample is additionally classified as taken from a patient with a good prognosis or a poor prognosis by detecting a difference in the expression of a second plurality of genes in a cell sample taken from the individual relative to a control, said second plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 5.
29. The method of claim 11, wherein the cell sample is additionally classified as taken from a patient with a good prognosis or a poor prognosis by detecting a difference in the expression of a second plurality of genes in a cell sample taken from the individual relative to a control, said second plurality of genes consisting of at least 20 of the genes corresponding to the markers listed in Table 5.
30. The method of claim 11, wherein the cell sample is additionally classified as ER(+) or ER(-) by detecting a difference in the expression of a second plurality of genes in a cell sample taken from the individual relative to a control, said second plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 1.
31. The method of claim 19, wherein the cell sample is additionally classified as ER(+) or ER(-) by detecting a difference in the expression of a second plurality of genes in a cell sample taken from the individual relative to a control, said second plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 1.
32. The method of claim 19, wherein the cell sample is additionally classified as BRCA1 or sporadic by detecting a difference in the expression of a second plurality of genes in a cell sample taken from the individual relative to a control, said second plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 3.
33. A method for classifying a sample as ER(+) or ER(-) by calculating the similarity between the expression of at least 5 of the markers listed in Table 1 in the sample to the expression of the same markers in an ER(-) nucleic acid pool and an ER(+) nucleic acid pool, comprising the steps of:
(a) labeling nucleic acids derived from a sample, with a first fluorophore to obtain a first pool of fluorophore-labeled nucleic acids;
(b) labeling with a second fluorophore a first pool of nucleic acids derived from two or more ER(+) samples, and a second pool of nucleic acids derived from two or more ER(-) samples;
(c) contacting said first fluorophore-labeled nucleic acid and said first pool of second fluorophore-labeled nucleic acid with a first microarray under conditions such that hybridization can occur, and contacting said first fluorophore-labeled nucleic acid and said second pool of second fluorophore-labeled nucleic acid with a second microarray under conditions such that hybridization can occur, wherein said first microarray and said second microarray are similar to each other, exact replicas of each other, or are identical, detecting at each of a plurality of discrete loci on the first microarray a first fluorescent emission signal from said first fluorophore-labeled nucleic acid and a second fluorescent emission signal from said first pool of second fluorophore-labeled genetic matter that is bound to said first microarray under said conditions, and detecting at each of the marker loci on said second microarray said first fluorescent emission signal from said first fluorophore-labeled nucleic acid and a third fluorescent emission signal from said second pool of second fluorophore-labeled nucleic acid;
(d) determining the similarity of the sample to the ER(-) and ER(+) pools by comparing said first fluorescence emission signals and said second fluorescence emission signals, and said first emission signals and said third fluorescence emission signals; and (e) classifying the sample as ER(+) where the first fluorescence emission signals are more similar to said second fluorescence emission signals than to said third fluorescent emission signals, and classifying the sample as ER(-) where the first fluorescence emission signals are more similar to said third fluorescence emission signals than to said second fluorescent emission signals.
34. The method of claim 33, wherein said similarity is calculated by determining a first sum of the differences of expression levels for each marker between said first fluorophore-labeled nucleic acid and said first pool of second fluorophore-labeled nucleic acid, and a second sum of the differences of expression levels for each marker between said first fluorophore-labeled nucleic acid and said second pool of second fluorophore-labeled nucleic acid, wherein if said first sum is greater than said second sum, the sample is classified as ER(-), and if said second sum is greater than said first sum, the sample is classified as ER(+).
35. The method of claim 33, wherein said similarity is calculated by computing a first classifier parameter P1 between an ER(+) template and the expression of said markers in said sample, and a second classifier parameter P2 between an ER(-) template and the expression of said markers in said sample, wherein said P1 and P2 are calculated according to the formula:

$$P_i = \frac{\vec{z}_i \cdot \vec{y}}{\|\vec{z}_i\| \, \|\vec{y}\|}$$

wherein $\vec{z}_1$ and $\vec{z}_2$ are ER(+) and ER(-) templates, respectively, and are calculated by averaging said second fluorescence emission signal for each of said markers in said first pool of second fluorophore-labeled nucleic acid and said third fluorescence emission signal for each of said markers in said second pool of second fluorophore-labeled nucleic acid, respectively, and wherein $\vec{y}$ is said first fluorescence emission signal of each of said markers in the sample to be classified as ER(+) or ER(-), wherein the expression of the markers in the sample is similar to ER(-) if P1 < P2, and similar to ER(+) if P1 > P2.
36. A method for determining a set of marker genes whose expression is associated with a particular phenotype, comprising the steps of:
(a) selecting a phenotype having two or more phenotype categories;
(b) identifying a plurality of genes wherein the expression of said genes is correlated or anticorrelated with one of the phenotype categories, and wherein the correlation coefficient for each gene is calculated according to the equation

$$\rho = \frac{\vec{c} \cdot \vec{r}}{\|\vec{c}\| \, \|\vec{r}\|}$$

wherein $\vec{c}$ is a number representing said phenotype category and $\vec{r}$ is the logarithmic expression ratio across all the samples for each individual gene, wherein if the correlation coefficient has an absolute value of 0.3 or greater, said expression of said gene is associated with the phenotype category, wherein said plurality of genes is a set of marker genes whose expression is associated with a particular phenotype.
37. The method of claim 36, wherein said set of marker genes is validated by:
(a) using a statistical method to randomize the association between said marker genes and said phenotype category, thereby creating a control correlation coefficient for each marker gene;
(b) repeating step (a) one hundred or more times to develop a frequency distribution of said control correlation coefficients for each marker gene;
(c) determining the number of marker genes having a control correlation coefficient of 0.3 or above, thereby creating a control marker gene set; and (d) comparing the number of control marker genes so identified to the number of marker genes, wherein if the p value of the difference between the number of marker genes and the number of control genes is less than a threshold, said set of marker genes is validated.
38. The method of claim 36, wherein said set of marker genes is optimized by the method comprising:
(a) rank-ordering the genes by amplitude of correlation or by significance of the correlation coefficients to create a rank-ordered list, and (b) selecting an arbitrary number n of marker genes from the top of the rank-ordered list.
39. The method of claim 38, wherein said set of marker genes is further optimized by the method comprising:
(a) calculating an error rate for said arbitrary number n of marker genes;
(b) increasing by 1 the number of genes selected from the top of the rank-ordered list;
(c) calculating an error rate for said number of genes selected from the top of the rank-ordered list;
(d) repeating steps (b) and (c) until said number of genes selected from the top of the rank-ordered list includes all genes included in said rank ordered list, and (e) identifying said number of genes selected from the top of the rank-ordered list for which the error rate is smallest, wherein said set of marker genes is optimized when the error rate is the smallest.
40. A method for assigning a person to one of a plurality of categories in a clinical trial, comprising determining for each said person the level of expression of at least five of the prognosis markers listed in Table 6, determining therefrom whether the person has an expression pattern that correlates with a good prognosis or a poor prognosis, and assigning said person to one category in a clinical trial if said person is determined to have a good prognosis, and a different category if that person is determined to have a poor prognosis.
41. A method of classifying a first cell or organism as having one of at least two different phenotypes, said at least two different phenotypes comprising a first phenotype and a second phenotype, said method comprising:
(a) comparing the level of expression of each of a plurality of genes in a first sample from the first cell or organism to the level of expression of each of said genes, respectively, in a pooled sample from a plurality of cells or organisms, said plurality of cells or organisms comprising different cells or organisms exhibiting said at least two different phenotypes, respectively, to produce a first compared value;
(b) comparing said first compared value to a second compared value, wherein said second compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said first phenotype to the level of expression of each of said genes, respectively, in said pooled sample;
(c) comparing said first compared value to a third compared value, wherein said third compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said second phenotype to the level of expression of each of said genes, respectively, in said pooled sample;
(d) optionally carrying out one or more times a step of comparing said first compared value to one or more additional compared values, respectively, each additional compared value being the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having a phenotype different from said first and second phenotypes but included among said at least two different phenotypes, to the level of expression of each of said genes, respectively, in said pooled sample; and
(e) determining to which of said second, third and, if present, one or more additional compared values, said first compared value is most similar;
wherein said first cell or organism is determined to have the phenotype of the cell or organism used to produce said compared value most similar to said first compared value.
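The comparison scheme of claim 41 can be illustrated with a short Python sketch. It assumes that each "compared value" is a vector of log10 expression ratios against the pooled sample and that "most similar" means the highest Pearson correlation; both are assumptions made for illustration, since the claim itself does not fix the similarity measure. All function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def log_ratios(sample: Sequence[float], pool: Sequence[float]) -> List[float]:
    """Compared value: per-gene log10 ratio of sample expression to pooled expression."""
    return [math.log10(s / p) for s, p in zip(sample, pool)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def classify(
    sample: Sequence[float],
    pool: Sequence[float],
    templates: Dict[str, Sequence[float]],
) -> str:
    """Return the phenotype whose compared value is most similar to the sample's."""
    first = log_ratios(sample, pool)  # step (a)
    # Steps (b)-(d): one compared value per reference phenotype.
    scores = {
        name: pearson(first, log_ratios(levels, pool))
        for name, levels in templates.items()
    }
    # Step (e): pick the most similar reference.
    return max(scores, key=lambda name: scores[name])


if __name__ == "__main__":
    pool = [10.0, 10.0, 10.0, 10.0]
    templates = {"first": [20.0, 5.0, 20.0, 5.0], "second": [5.0, 20.0, 5.0, 20.0]}
    print(classify([18.0, 6.0, 25.0, 4.0], pool, templates))
```

Additional phenotypes, as in optional step (d), are handled by adding more entries to `templates`.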
42. The method of claim 40, wherein said compared values are each ratios of the levels of expression of each of said genes.
43. The method of claim 40, wherein each of said levels of expression of each of said genes in said pooled sample are normalized prior to any of said comparing steps.
44. The method of claim 42 wherein normalizing said levels of expression is carried out by dividing each of said levels of expression by the median or mean level of expression of each of said genes or dividing by the mean or median level of expression of one or more housekeeping genes in said pooled sample.
45. The method of claim 42 wherein said normalized levels of expression are subjected to a log transform and said comparing steps comprise subtracting said log transform from the log of said levels of expression of each of said genes in said sample from said cell or organism.
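A minimal Python sketch of the normalization and log-subtraction described in claims 44 and 45 follows. It assumes normalization by the median level across the pooled sample's genes (the claims also allow the mean, or housekeeping genes) and a base-10 logarithm; the function names and numbers are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence


def normalize_by_median(levels: Sequence[float]) -> List[float]:
    """Divide each expression level by the median level (one option in claim 44)."""
    m = median(levels)
    if m <= 0:
        raise ValueError("median expression level must be positive")
    return [v / m for v in levels]


def log_difference(sample_levels: Sequence[float], pool_levels: Sequence[float]) -> List[float]:
    """Subtract the log of the normalized pool levels from the log of the sample levels."""
    pool_log = [math.log10(v) for v in normalize_by_median(pool_levels)]
    sample_log = [math.log10(v) for v in sample_levels]
    return [s - p for s, p in zip(sample_log, pool_log)]


if __name__ == "__main__":
    print(log_difference([10.0, 100.0, 1000.0], [10.0, 100.0, 1000.0]))
```

Subtracting logarithms is equivalent to taking the logarithm of the ratio, which is why this step is compatible with the ratio-based compared values of claim 42.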
46. The method of claim 40, wherein said at least two different phenotypes are different stages of a disease or disorder.
47. The method of claim 40, wherein said at least two different phenotypes are different prognoses of a disease or disorder.
48. The method of claim 40, wherein said levels of expression of each of said genes, respectively, in said pooled sample or said levels of expression of each of said genes in a sample from said cell or organism characterized as having said first phenotype, said second phenotype, or said phenotype different from said first and second phenotypes, respectively, are stored on a computer.
49. A microarray comprising at least 5 markers derived from any one of Tables 1-6, wherein at least 50% of the probes on the microarray are present in any one of Tables 1-6.
50. The microarray of claim 48, wherein at least 70% of the probes on the microarray are present in any one of Tables 1-6.
51. The microarray of claim 48, wherein at least 80% of the probes on the microarray are present in any one of Tables 1-6.
52. The microarray of claim 48, wherein at least 90% of the probes on the microarray are present in any one of Tables 1-6.
53. The microarray of claim 48, wherein at least 95% of the probes on the microarray are present in any one of Tables 1-6.
54. The microarray of claim 48, wherein at least 98% of the probes on the microarray are present in any one of Tables 1-6.
55. A microarray for distinguishing ER(+) and ER(-) cell samples comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a different gene, said plurality consisting of at least 20 of the genes corresponding to the markers listed in Table 1 or Table 2, wherein at least 50% of the probes on the microarray are present in Table 1 or Table 2.
56. A microarray for distinguishing BRCA1-related and sporadic cell samples comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a different gene, said plurality consisting of at least 20 of the genes corresponding to the markers listed in Table 3 or Table 4, wherein at least 50% of the probes on the microarray are present in Table 3 or Table 4.
57. A microarray for distinguishing cell samples from individuals having a good prognosis and cell samples from individuals having a poor prognosis, comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a different gene, said plurality consisting of at least 20 of the genes corresponding to the markers listed in Table 5 or Table 6, wherein at least 50% of the probes on the microarray are present in Table 5 or Table 6.
58. A kit for determining whether a sample contains a BRCA1 or sporadic mutation, comprising at least one microarray comprising probes to at least 20 of the genes corresponding to the markers listed in Table 3, and a computer readable medium having recorded thereon one or more programs for determining the similarity of the level of nucleic acid derived from the markers listed in Table 3 in a sample to that in a BRCA1 pool and a sporadic tumor pool, wherein the one or more programs cause a computer to perform a method comprising computing the aggregate differences in expression of each marker between the sample and BRCA1 pool and the aggregate differences in expression of each marker between the sample and sporadic pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the BRCA1 and sporadic pools, said correlation calculated according to Equation (3).
59. A kit for determining the ER-status of a sample, comprising at least one microarray comprising probes to at least 20 of the genes corresponding to the markers listed in Table 1, and a computer readable medium having recorded thereon one or more programs for determining the similarity of the level of nucleic acid derived from the markers listed in Table 1 in a sample to that in an ER(-) pool and an ER(+) pool, wherein the one or more programs cause a computer to perform a method comprising computing the aggregate differences in expression of each marker between the sample and ER(-) pool and the aggregate differences in expression of each marker between the sample and ER(+) pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the ER(-) and ER(+) pools, said correlation calculated according to Equation (3).
60. A kit for determining whether a sample is derived from a patient having a good prognosis or a poor prognosis, comprising at least one microarray comprising probes to at least 20 of the genes corresponding to the markers listed in Table 5, and a computer readable medium having recorded thereon one or more programs for determining the similarity of the level of nucleic acid derived from the markers listed in Table 5 in a sample to that in a pool of samples derived from individuals having a good prognosis and a pool of samples derived from individuals having a poor prognosis, wherein the one or more programs cause a computer to perform a method comprising computing the aggregate differences in expression of each marker between the sample and the good prognosis pool and the aggregate differences in expression of each marker between the sample and the poor prognosis pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the good prognosis and poor prognosis pools, said correlation calculated according to Equation (3).
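The "aggregate differences" computation recited for the programs in claims 58 to 60 can be sketched in Python as below. The sketch assumes the aggregate is the sum of absolute per-marker differences and that the sample is assigned to the pool with the smaller aggregate; neither detail is stated in the claims, and the marker names and values are hypothetical. The alternative correlation method is not sketched because Equation (3) is not reproduced in this section.

```python
from typing import Dict, Mapping, Tuple


def aggregate_difference(sample: Mapping[str, float], pool: Mapping[str, float]) -> float:
    """Sum of absolute per-marker expression differences between a sample and a pool."""
    missing = set(pool) - set(sample)
    if missing:
        raise KeyError(f"sample lacks markers: {sorted(missing)}")
    return sum(abs(sample[marker] - pool[marker]) for marker in pool)


def closest_pool(
    sample: Mapping[str, float],
    pools: Mapping[str, Mapping[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the pool with the smallest aggregate difference, plus all totals."""
    totals = {name: aggregate_difference(sample, profile) for name, profile in pools.items()}
    best = min(totals, key=lambda name: totals[name])
    return best, totals


if __name__ == "__main__":
    sample = {"m1": 1.0, "m2": -0.5}
    pools = {
        "good prognosis": {"m1": 0.8, "m2": -0.4},
        "poor prognosis": {"m1": -1.0, "m2": 0.6},
    }
    print(closest_pool(sample, pools))
```

The same two functions cover the ER(-)/ER(+) and BRCA1/sporadic cases by changing the pool names and marker profiles.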
CA2451074A 2001-06-18 2002-06-14 Diagnosis and prognosis of breast cancer patients Expired - Lifetime CA2451074C (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US29891801P 2001-06-18 2001-06-18
US60/298,918 2001-06-18
US38071002P 2002-05-14 2002-05-14
US60/380,710 2002-05-14
PCT/US2002/018947 WO2002103320A2 (en) 2001-06-18 2002-06-14 Diagnosis and prognosis of breast cancer patients

Publications (2)

Publication Number Publication Date
CA2451074A1 CA2451074A1 (en) 2002-12-27
CA2451074C CA2451074C (en) 2014-02-11

Family

ID=26970946

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2451074A Expired - Lifetime CA2451074C (en) 2001-06-18 2002-06-14 Diagnosis and prognosis of breast cancer patients

Country Status (11)

Country Link
US (5) US7514209B2 (en)
EP (1) EP1410011B1 (en)
JP (2) JP2005500832A (en)
AT (1) ATE503023T1 (en)
AU (1) AU2002316251A1 (en)
CA (1) CA2451074C (en)
CY (1) CY1111677T1 (en)
DE (1) DE60239535D1 (en)
DK (1) DK1410011T3 (en)
PT (1) PT1410011E (en)
WO (1) WO2002103320A2 (en)

Families Citing this family (177)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040241728A1 (en) * 1999-01-06 2004-12-02 Chondrogene Limited Method for the detection of lung disease related gene transcripts in blood
US20040265868A1 (en) * 1999-01-06 2004-12-30 Chondrogene Limited Method for the detection of depression related gene transcripts in blood
US7473528B2 (en) 1999-01-06 2009-01-06 Genenews Inc. Method for the detection of Chagas disease related gene transcripts in blood
US20040248169A1 (en) * 1999-01-06 2004-12-09 Chondrogene Limited Method for the detection of obesity related gene transcripts in blood
US20050123938A1 (en) * 1999-01-06 2005-06-09 Chondrogene Limited Method for the detection of osteoarthritis related gene transcripts in blood
US20040241726A1 (en) * 1999-01-06 2004-12-02 Chondrogene Limited Method for the detection of allergies related gene transcripts in blood
US20060134635A1 (en) * 2001-02-28 2006-06-22 Chondrogene Limited Method for the detection of coronary artery disease related gene transcripts in blood
US20040248170A1 (en) * 1999-01-06 2004-12-09 Chondrogene Limited Method for the detection of hyperlipidemia related gene transcripts in blood
US20030216558A1 (en) * 2000-12-22 2003-11-20 Morris David W. Novel compositions and methods for cancer
AUPR278301A0 (en) 2001-01-31 2001-02-22 Bionomics Limited A novel gene
US7026121B1 (en) * 2001-06-08 2006-04-11 Expression Diagnostics, Inc. Methods and compositions for diagnosing and monitoring transplant rejection
US6905827B2 (en) 2001-06-08 2005-06-14 Expression Diagnostics, Inc. Methods and compositions for diagnosing or monitoring auto immune and chronic inflammatory diseases
US7235358B2 (en) * 2001-06-08 2007-06-26 Expression Diagnostics, Inc. Methods and compositions for diagnosing and monitoring transplant rejection
EP1410011B1 (en) 2001-06-18 2011-03-23 Rosetta Inpharmatics LLC Diagnosis and prognosis of breast cancer patients
US20030104426A1 (en) * 2001-06-18 2003-06-05 Linsley Peter S. Signature genes in chronic myelogenous leukemia
US7171311B2 (en) * 2001-06-18 2007-01-30 Rosetta Inpharmatics Llc Methods of assigning treatment to breast cancer patients
DE60238143D1 (en) * 2001-09-18 2010-12-09 Genentech Inc COMPOSITIONS AND METHODS FOR THE DIAGNOSIS OF TUMORS
WO2003030725A2 (en) * 2001-10-11 2003-04-17 The Johns Hopkins University Pancreatic cancer diagnosis and therapies
US20040009495A1 (en) * 2001-12-07 2004-01-15 Whitehead Institute For Biomedical Research Methods and products related to drug screening using gene expression patterns
US20030198972A1 (en) * 2001-12-21 2003-10-23 Erlander Mark G. Grading of breast cancer
AU2003205913A1 (en) * 2002-02-20 2003-09-09 Ncc Technology Ventures Pte Limited Materials and methods relating to cancer diagnosis
ES2486265T3 (en) 2002-03-13 2014-08-18 Genomic Health, Inc. Obtaining gene expression profile in biopsied tumor tissues
ES2377720T3 (en) * 2002-05-10 2012-03-30 Purdue Research Foundation EPHA2 AGONISTIC ANTIBODIES AND METHODS OF USE OF THE SAME.
US20050152899A1 (en) * 2002-05-10 2005-07-14 Kinch Michael S. EphA2 agonistic monoclonal antibodies and methods of use thereof
JP4557714B2 (en) * 2002-05-10 2010-10-06 メディミューン,エルエルシー EphA2 monoclonal antibody and method of use thereof
WO2004046386A1 (en) 2002-11-15 2004-06-03 Genomic Health, Inc. Gene expression profiling of egfr positive cancer
US20060188889A1 (en) * 2003-11-04 2006-08-24 Christopher Burgess Use of differentially expressed nucleic acid sequences as biomarkers for cancer
JP2006515515A (en) * 2002-12-20 2006-06-01 アバロン ファーマシューティカルズ,インコーポレイテッド Amplified cancer target genes useful for diagnostic and therapeutic screening
US20040231909A1 (en) 2003-01-15 2004-11-25 Tai-Yang Luh Motorized vehicle having forward and backward differential structure
JP2007524362A (en) * 2003-02-14 2007-08-30 サイグレス ディスカバリー, インコーポレイテッド Therapeutic GPCR targets in cancer
WO2004074518A1 (en) * 2003-02-20 2004-09-02 Genomic Health, Inc. Use of intronic rna to measure gene expression
JP2006519620A (en) * 2003-03-04 2006-08-31 アークチュラス バイオサイエンス,インコーポレイティド ER status discrimination characteristics in breast cancer
US20060078893A1 (en) 2004-10-12 2006-04-13 Medical Research Council Compartmentalised combinatorial chemistry by microfluidic control
GB0307403D0 (en) 2003-03-31 2003-05-07 Medical Res Council Selection by compartmentalised screening
GB0307428D0 (en) 2003-03-31 2003-05-07 Medical Res Council Compartmentalised combinatorial chemistry
WO2004091375A2 (en) * 2003-04-11 2004-10-28 Medimmune, Inc. Epha2 and non-neoplastic hyperproliferative cell disorders
US20060263813A1 (en) * 2005-05-11 2006-11-23 Expression Diagnostics, Inc. Methods of monitoring functional status of transplants using gene panels
US20070248978A1 (en) * 2006-04-07 2007-10-25 Expression Diagnostics, Inc. Steroid responsive nucleic acid expression and prediction of disease activity
US7892745B2 (en) * 2003-04-24 2011-02-22 Xdx, Inc. Methods and compositions for diagnosing and monitoring transplant rejection
US7306910B2 (en) * 2003-04-24 2007-12-11 Veridex, Llc Breast cancer prognostics
EP1629284A2 (en) * 2003-05-30 2006-03-01 Rosetta Inpharmatics LLC. Methods for identifying modulators of kinesin activity
EP1638514A4 (en) 2003-06-06 2009-11-18 Medimmune Inc Use of epha4 and modulator or epha4 for diagnosis, treatment and prevention of cancer
EP1651775A2 (en) * 2003-06-18 2006-05-03 Arcturus Bioscience, Inc. Breast cancer survival and recurrence
ES2488845T5 (en) 2003-06-24 2017-07-11 Genomic Health, Inc. Prediction of the probability of cancer recurrence
CA2531967C (en) 2003-07-10 2013-07-16 Genomic Health, Inc. Expression profile algorithm and test for cancer prognosis
US20050112622A1 (en) * 2003-08-11 2005-05-26 Ring Brian Z. Reagents and methods for use in cancer diagnosis, classification and therapy
US20060003391A1 (en) * 2003-08-11 2006-01-05 Ring Brian Z Reagents and methods for use in cancer diagnosis, classification and therapy
ES2311852T3 (en) 2003-08-28 2009-02-16 Ipsogen IDENTIFICATION OF A SPECIFIC MODEL OF THE ERBB2 GENE EXPRESSION IN CANCER OF BREAST
GB0320648D0 (en) * 2003-09-03 2003-10-01 Randox Lab Ltd Molecular marker
AR045563A1 (en) 2003-09-10 2005-11-02 Warner Lambert Co ANTIBODIES DIRECTED TO M-CSF
US7504214B2 (en) 2003-09-19 2009-03-17 Biotheranostics, Inc. Predicting outcome with tamoxifen in breast cancer
EP1670946B1 (en) * 2003-09-19 2012-11-07 bioTheranostics, Inc. Predicting breast cancer treatment outcome
US9856533B2 (en) 2003-09-19 2018-01-02 Biotheranostics, Inc. Predicting breast cancer treatment outcome
GB0323226D0 (en) * 2003-10-03 2003-11-05 Ncc Technology Ventures Pte Lt Materials and methods relating to breast cancer diagnosis
EP1892306A3 (en) * 2003-10-06 2008-06-11 Bayer HealthCare AG Methods and kits for investigating cancer
WO2005064019A2 (en) 2003-12-23 2005-07-14 Genomic Health, Inc. Universal amplification of fragmented rna
MXPA06009545A (en) * 2004-02-20 2007-03-07 Johnson & Johnson Breast cancer prognostics.
US20050186577A1 (en) 2004-02-20 2005-08-25 Yixin Wang Breast cancer prognostics
US20060195266A1 (en) 2005-02-25 2006-08-31 Yeatman Timothy J Methods for predicting cancer outcome and gene signatures for use therein
CA2558808A1 (en) * 2004-03-05 2005-09-22 Rosetta Inpharmatics Llc Classification of breast cancer patients using a combination of clinical criteria and informative genesets
US20050221339A1 (en) 2004-03-31 2005-10-06 Medical Research Council Harvard University Compartmentalised screening by microfluidic control
WO2005100606A2 (en) 2004-04-09 2005-10-27 Genomic Health, Inc. Gene expression markers for predicting response to chemotherapy
US20050260659A1 (en) * 2004-04-23 2005-11-24 Exagen Diagnostics, Inc. Compositions and methods for breast cancer prognosis
US20120258442A1 (en) * 2011-04-09 2012-10-11 bio Theranostics, Inc. Determining tumor origin
EP2371969B1 (en) * 2004-06-04 2018-05-23 Biotheranostics, Inc. Identification of tumors
US7587279B2 (en) 2004-07-06 2009-09-08 Genomic Health Method for quantitative PCR data analysis system (QDAS)
ES2612482T3 (en) 2004-07-23 2017-05-17 Pacific Edge Limited Urine markers for bladder cancer detection
AU2005267756A1 (en) * 2004-07-30 2006-02-09 Rosetta Inpharmatics Llc Prognosis of breast cancer patients
EP1627923A1 (en) * 2004-08-18 2006-02-22 Het Nederlands Kanker Instituut Means and methods for detecting and/or staging a follicular lymphoma cells
US7645575B2 (en) * 2004-09-08 2010-01-12 Xdx, Inc. Genes useful for diagnosing and monitoring inflammation related disorders
US7899623B2 (en) * 2004-09-22 2011-03-01 Tripath Imaging, Inc. Methods and computer program products for analysis and optimization of marker candidates for cancer prognosis
US8747867B2 (en) 2004-09-30 2014-06-10 Ifom Fondazione Instituto Firc Di Oncologia Molecolare Cancer markers
GB0421838D0 (en) 2004-09-30 2004-11-03 Congenia S R L Cancer markers
US7968287B2 (en) 2004-10-08 2011-06-28 Medical Research Council Harvard University In vitro evolution in microfluidic systems
ES2384107T3 (en) 2004-11-05 2012-06-29 Genomic Health, Inc. Molecular indicators of breast cancer prognosis and treatment response prediction
CA3061785A1 (en) 2004-11-05 2006-05-18 Genomic Health, Inc. Predicting response to chemotherapy using gene expression markers
EP1829978B1 (en) * 2004-12-13 2011-09-21 Bio-Dixam LLC Method of detecting gene methylation and method of examining neoplasm by detecting methylation
US20060183893A1 (en) * 2005-01-25 2006-08-17 North Don A Nucleic acids for apoptosis of cancer cells
EP1851543A2 (en) 2005-02-24 2007-11-07 Compugen Ltd. Novel diagnostic markers, especially for in vivo imaging, and assays and methods of use thereof
WO2006110264A2 (en) * 2005-03-16 2006-10-19 Sidney Kimmel Cancer Center Methods and compositions for predicting death from cancer and prostate cancer survival using gene expression signatures
DE102005013846A1 (en) 2005-03-24 2006-10-05 Ganymed Pharmaceuticals Ag Identification of surface-associated antigens for tumor diagnosis and therapy
US7608413B1 (en) * 2005-03-25 2009-10-27 Celera Corporation Kidney disease targets and uses thereof
WO2006101273A1 (en) * 2005-03-25 2006-09-28 Takeda Pharmaceutical Company Limited Prophylactic/therapeutic agent for cancer
CN101297045A (en) 2005-06-03 2008-10-29 阿威亚拉德克斯股份有限公司 Identification of tumors and tissues
EP1910958A2 (en) * 2005-06-08 2008-04-16 Mediqual System and method for dynamic determination of disease prognosis
EP2177910A1 (en) * 2005-11-10 2010-04-21 Aurelium Biopharma Inc. Tissue diagnostics for breast cancer
US8014957B2 (en) * 2005-12-15 2011-09-06 Fred Hutchinson Cancer Research Center Genes associated with progression and response in chronic myeloid leukemia and uses thereof
CA2636855C (en) 2006-01-11 2016-09-27 Raindance Technologies, Inc. Microfluidic devices and methods of use in the formation and control of nanoreactors
NZ545243A (en) * 2006-02-10 2009-07-31 Pacific Edge Biotechnology Ltd Urine gene expression ratios for detection of cancer
US7888019B2 (en) 2006-03-31 2011-02-15 Genomic Health, Inc. Genes involved estrogen metabolism
US20090098538A1 (en) * 2006-03-31 2009-04-16 Glinsky Gennadi V Prognostic and diagnostic method for disease therapy
US7789081B2 (en) * 2006-04-20 2010-09-07 Whirlpool Corporation Modular frame chassis for cooking range
US9562837B2 (en) 2006-05-11 2017-02-07 Raindance Technologies, Inc. Systems for handling microfludic droplets
EP2047910B1 (en) 2006-05-11 2012-01-11 Raindance Technologies, Inc. Microfluidic device and method
DE102006027818A1 (en) * 2006-06-16 2007-12-20 B.R.A.H.M.S. Aktiengesellschaft In vitro multiparameter determination method for the diagnosis and early diagnosis of neurodegenerative diseases
EP2041313B1 (en) * 2006-07-14 2011-03-23 The Government of the United States of America as represented by the Secretary of the Department of Health and Human Services Methods of determining the prognosis of an adenocarcinoma
EP2077912B1 (en) 2006-08-07 2019-03-27 The President and Fellows of Harvard College Fluorocarbon emulsion stabilizing surfactants
US7993832B2 (en) * 2006-08-14 2011-08-09 Xdx, Inc. Methods and compositions for diagnosing and monitoring the status of transplant rejection and immune disorders
EP2090588A4 (en) 2006-10-23 2010-04-07 Neocodex S L In vitro method for prognosis and/or diagnosis of hypersensitivity to ooestrogens or to substances with ooestrogenic activity
KR100862972B1 (en) 2006-10-30 2008-10-13 한국과학기술연구원 Biomaker and screening method of volatile organic compounds having toxicity using thereof
WO2008140484A2 (en) * 2006-11-09 2008-11-20 Xdx, Inc. Methods for diagnosing and monitoring the status of systemic lupus erythematosus
WO2008073578A2 (en) 2006-12-08 2008-06-19 Iowa State University Research Foundation, Inc. Plant genes involved in nitrate uptake and metabolism
US20080312199A1 (en) * 2006-12-15 2008-12-18 Glinsky Gennadi V Treatments of therapy resistant diseases and drug combinations for treating the same
EP2094719A4 (en) * 2006-12-19 2010-01-06 Genego Inc Novel methods for functional analysis of high-throughput experimental data and gene groups identified therfrom
US8772046B2 (en) 2007-02-06 2014-07-08 Brandeis University Manipulation of fluids and reactions in microfluidic systems
US8030060B2 (en) * 2007-03-22 2011-10-04 West Virginia University Gene signature for diagnosis and prognosis of breast cancer and ovarian cancer
US20090062144A1 (en) * 2007-04-03 2009-03-05 Nancy Lan Guo Gene signature for prognosis and diagnosis of lung cancer
WO2008130623A1 (en) 2007-04-19 2008-10-30 Brandeis University Manipulation of fluids, fluid components and reactions in microfluidic systems
KR100886937B1 (en) 2007-06-21 2009-03-09 주식회사 랩 지노믹스 BRCA1 and BRCA2 germline mutations useful for predicting or detecting breast cancer or ovarian cancer
JP5303132B2 (en) 2007-09-20 2013-10-02 シスメックス株式会社 Method and apparatus for determining the presence or absence of cancer cells
EP2093567A1 (en) * 2008-02-21 2009-08-26 Pangaea Biotech, S.A. Brca1 mRNA expression levels predict survival in breast cancer patients treated with neoadjuvant chemotherapy
WO2009124251A1 (en) * 2008-04-03 2009-10-08 Sloan-Kettering Institute For Cancer Research Gene signatures for the prognosis of cancer
ES2338843B1 (en) * 2008-07-02 2011-01-24 Centro De Investigaciones Energeticas, Medioambientales Y Tecnologicas GENOMIC FOOTPRINT OF CANCER OF MAMA.
WO2010009365A1 (en) 2008-07-18 2010-01-21 Raindance Technologies, Inc. Droplet libraries
EP2159291A1 (en) 2008-09-01 2010-03-03 Agendia B.V. Means and method for determining tumor cell percentage in a sample
EP2202320A1 (en) * 2008-12-24 2010-06-30 Agendia B.V. Methods and means for typing a sample comprising colorectal cancer cells
US20100204973A1 (en) * 2009-01-15 2010-08-12 Nodality, Inc., A Delaware Corporation Methods For Diagnosis, Prognosis And Treatment
WO2010111231A1 (en) 2009-03-23 2010-09-30 Raindance Technologies, Inc. Manipulation of microfluidic droplets
EP2241634A1 (en) * 2009-04-16 2010-10-20 Université Libre de Bruxelles Diagnostic method and tools to predict the effiacy of targeted agents against IGF-1 pathway activation in cancer
CA2759079A1 (en) * 2009-04-20 2010-10-28 Medical College Of Georgia Research Institute, Inc. Breast cancer susceptibility gene gt198 and uses thereof
WO2011038400A1 (en) * 2009-09-28 2011-03-31 Institute For Systems Biology Use of gene expression signatures to determine cancer grade
WO2011042564A1 (en) 2009-10-09 2011-04-14 Universite De Strasbourg Labelled silica-based nanomaterial with enhanced properties and uses thereof
DE112010004125A5 (en) * 2009-10-21 2012-11-22 Basf Plant Science Company Gmbh METHOD OF GENERATING BIOMARKER REFERENCE PATTERNS
WO2011057125A2 (en) * 2009-11-05 2011-05-12 Myriad Genetics, Inc. Compositions and methods for determining cancer susceptibility
DK2504451T3 (en) * 2009-11-23 2019-08-05 Genomic Health Inc Methods for predicting the clinical course of cancer
EP2517025B1 (en) 2009-12-23 2019-11-27 Bio-Rad Laboratories, Inc. Methods for reducing the exchange of molecules between droplets
SG181806A1 (en) 2010-01-11 2012-07-30 Genomic Health Inc Method to use gene expression to determine likelihood of clinical outcome of renal cancer
US10351905B2 (en) 2010-02-12 2019-07-16 Bio-Rad Laboratories, Inc. Digital analyte analysis
US9366632B2 (en) 2010-02-12 2016-06-14 Raindance Technologies, Inc. Digital analyte analysis
US9399797B2 (en) 2010-02-12 2016-07-26 Raindance Technologies, Inc. Digital analyte analysis
EP2534267B1 (en) 2010-02-12 2018-04-11 Raindance Technologies, Inc. Digital analyte analysis
US20120196768A1 (en) 2010-04-22 2012-08-02 Toray Industries, Inc. METHOD FOR PREPARING aRNA AND METHOD FOR ANALYSIS OF GENE EXPRESSION
JP5725274B2 (en) 2010-04-22 2015-05-27 国立大学法人大阪大学 Breast cancer prognosis testing method
WO2012045012A2 (en) 2010-09-30 2012-04-05 Raindance Technologies, Inc. Sandwich assays in droplets
US20140030255A1 (en) * 2010-11-03 2014-01-30 Merck Sharp & Dohme Corp. Methods of predicting cancer cell response to therapeutic agents
SG190867A1 (en) * 2010-11-23 2013-07-31 Krisani Biosciences P Ltd Method and system for prognosis and treatment of diseases using portfolio of genes
CN103998064A (en) 2010-12-09 2014-08-20 生物诊断治疗公司 Post-treatment breast cancer prognosis
KR20140004174A (en) * 2011-01-18 2014-01-10 더 트러스티스 오브 더 유니버시티 오브 펜실바니아 Compositions and methods for treating cancer
EP2673614B1 (en) 2011-02-11 2018-08-01 Raindance Technologies, Inc. Method for forming mixed droplets
EP2675819B1 (en) 2011-02-18 2020-04-08 Bio-Rad Laboratories, Inc. Compositions and methods for molecular labeling
WO2012135845A1 (en) 2011-04-01 2012-10-04 Qiagen Gene expression signature for wnt/b-catenin signaling pathway and use thereof
DE202012013668U1 (en) 2011-06-02 2019-04-18 Raindance Technologies, Inc. enzyme quantification
US8748587B2 (en) 2011-06-02 2014-06-10 Novartis Ag Molecules and methods for modulating TMEM16A activities
US8841071B2 (en) 2011-06-02 2014-09-23 Raindance Technologies, Inc. Sample multiplexing
WO2012177945A2 (en) * 2011-06-21 2012-12-27 Children's Hospital Medical Center Diagnostic methods for eosinophilic esophagitis
US9175351B2 (en) 2011-07-13 2015-11-03 Agendia N.V. Means and methods for molecular classification of breast cancer
US8658430B2 (en) 2011-07-20 2014-02-25 Raindance Technologies, Inc. Manipulating droplet size
CA2844805A1 (en) * 2011-08-16 2013-02-21 Oncocyte Corporation Methods and compositions for the treatment and diagnosis of breast cancer
WO2013120089A1 (en) 2012-02-10 2013-08-15 Raindance Technologies, Inc. Molecular diagnostic screening assay
WO2013155567A1 (en) * 2012-04-20 2013-10-24 Mat Malta Advanced Technologies Limited Sex determination genes
EP3524693A1 (en) 2012-04-30 2019-08-14 Raindance Technologies, Inc. Digital analyte analysis
US9045546B2 (en) 2012-05-31 2015-06-02 Novartis Ag Molecules and methods for modulating TMEM16A activities
CA2875710C (en) 2012-06-22 2021-06-29 John Wayne Cancer Institute Molecular malignancy in melanocytic lesions
CN104583777A (en) * 2012-09-05 2015-04-29 和光纯药工业株式会社 Method for determining breast cancer
EP2986762B1 (en) 2013-04-19 2019-11-06 Bio-Rad Laboratories, Inc. Digital analyte analysis
JP2016521979A (en) 2013-05-30 2016-07-28 ジェノミック ヘルス, インコーポレイテッド Gene expression profiling algorithm for calculating recurrence score for patients with kidney cancer
EP3014505A4 (en) * 2013-06-28 2017-03-08 Nantomics, LLC Pathway analysis for identification of diagnostic tests
AU2014318499B2 (en) 2013-09-16 2019-05-16 Biodesix, Inc Classifier generation method using combination of mini-classifiers with regularization and uses thereof
US11901041B2 (en) 2013-10-04 2024-02-13 Bio-Rad Laboratories, Inc. Digital analysis of nucleic acid modification
EP3074532A1 (en) 2013-11-28 2016-10-05 Stichting Het Nederlands Kanker Instituut- Antoni van Leeuwenhoek Ziekenhuis Methods for molecular classification of brca-like breast and/or ovarian cancer
US9944977B2 (en) 2013-12-12 2018-04-17 Raindance Technologies, Inc. Distinguishing rare variations in a nucleic acid sequence from a sample
WO2015103367A1 (en) 2013-12-31 2015-07-09 Raindance Technologies, Inc. System and method for detection of rna species
EP3218521B1 (en) * 2014-11-12 2019-12-25 Hitachi Chemical Co., Ltd. Method for diagnosing organ injury
US10980519B2 (en) 2015-07-14 2021-04-20 Duke University Systems and methods for extracting prognostic image features
US10647981B1 (en) 2015-09-08 2020-05-12 Bio-Rad Laboratories, Inc. Nucleic acid library generation methods and compositions
WO2017083675A1 (en) 2015-11-13 2017-05-18 Biotheranostics, Inc. Integration of tumor characteristics with breast cancer index
WO2017123401A1 (en) 2016-01-13 2017-07-20 Children's Hospital Medical Center Compositions and methods for treating allergic inflammatory conditions
US10934590B2 (en) 2016-05-24 2021-03-02 Wisconsin Alumni Research Foundation Biomarkers for breast cancer and methods of use thereof
US10998178B2 (en) 2017-08-28 2021-05-04 Purdue Research Foundation Systems and methods for sample analysis using swabs
US11859250B1 (en) 2018-02-23 2024-01-02 Children's Hospital Medical Center Methods for treating eosinophilic esophagitis
EP3730941A1 (en) * 2019-04-23 2020-10-28 Institut Jean Paoli & Irène Calmettes Method for determining a reference tumor aggressiveness molecular gradient for a pancreatic ductal adenocarcinoma
US20230330223A1 (en) 2020-04-27 2023-10-19 Agendia N.V. Treatment of her2 negative, mammaprint high risk 2 breast cancer
WO2022029489A1 (en) 2020-08-06 2022-02-10 Agendia NV Systems and methods of using cell-free nucleic acids to tailor cancer treatment
WO2022029488A1 (en) 2020-08-06 2022-02-10 Agendia NV Systems and methods of assessing breast cancer
CN112927795B (en) * 2021-02-23 2022-09-23 山东大学 Breast cancer prediction system based on bagging algorithm
WO2023224487A1 (en) 2022-05-19 2023-11-23 Agendia N.V. Prediction of response to immune therapy in breast cancer patients
WO2023224488A1 (en) 2022-05-19 2023-11-23 Agendia N.V. DNA repair signature and prediction of response following cancer therapy

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5143854A (en) * 1989-06-07 1992-09-01 Affymax Technologies N.V. Large scale photolithographic solid phase synthesis of polypeptides and receptor binding screening thereof
US5545522A (en) * 1989-09-22 1996-08-13 Van Gelder; Russell N. Process for amplifying a target polynucleotide sequence using a single primer-promoter complex
US6342491B1 (en) * 1991-05-21 2002-01-29 American Home Products Corporation Method of treating estrogen receptor positive carcinoma with 17 α-dihydroequilin
US5677125A (en) * 1994-01-14 1997-10-14 Vanderbilt University Method of detection and diagnosis of pre-invasive cancer
WO1995019369A1 (en) * 1994-01-14 1995-07-20 Vanderbilt University Method for detection and treatment of breast cancer
US5578832A (en) * 1994-09-02 1996-11-26 Affymetrix, Inc. Method and apparatus for imaging a sample on a device
US5539083A (en) * 1994-02-23 1996-07-23 Isis Pharmaceuticals, Inc. Peptide nucleic acid combinatorial libraries and improved methods of synthesis
US5556752A (en) * 1994-10-24 1996-09-17 Affymetrix, Inc. Surface-bound, unimolecular, double-stranded DNA
ATE397097T1 (en) * 1995-03-17 2008-06-15 Wayne John Cancer Inst DETECTION OF BREAST METASTASIS USING A MULTIPLE MARKER TEST
US6455300B1 (en) * 1995-12-08 2002-09-24 Han Htun Method and compositions for monitoring DNA binding molecules in living cells
WO1998033450A1 (en) 1997-01-31 1998-08-06 Fred Hutchinson Cancer Research Center Prognosis of cancer patients by determining expression of cell cycle regulators p27 and cyclin E
US6028189A (en) * 1997-03-20 2000-02-22 University Of Washington Solvent for oligonucleotide synthesis and methods of use
NO972006D0 (en) * 1997-04-30 1997-04-30 Forskningsparken I Aas As New method for diagnosis of diseases
US6432707B1 (en) * 1997-12-24 2002-08-13 Corixa Corporation Compositions and methods for the therapy and diagnosis of breast cancer
US6358682B1 (en) * 1998-01-26 2002-03-19 Ventana Medical Systems, Inc. Method and kit for the prognostication of breast cancer
US6107034A (en) * 1998-03-09 2000-08-22 The Board Of Trustees Of The Leland Stanford Junior University GATA-3 expression in human breast carcinoma
US6218122B1 (en) * 1998-06-19 2001-04-17 Rosetta Inpharmatics, Inc. Methods of monitoring disease states and therapies using gene expression profiles
US6203987B1 (en) * 1998-10-27 2001-03-20 Rosetta Inpharmatics, Inc. Methods for using co-regulated genesets to enhance detection and classification of gene expression patterns
AU3395900A (en) 1999-03-12 2000-10-04 Human Genome Sciences, Inc. Human lung cancer associated gene sequences and polypeptides
US7648826B1 (en) * 1999-04-02 2010-01-19 The Regents Of The University Of California Detecting CYP24 expression level as a marker for predisposition to cancer
US6647341B1 (en) * 1999-04-09 2003-11-11 Whitehead Institute For Biomedical Research Methods for classifying samples and ascertaining previously unknown classes
US7013221B1 (en) 1999-07-16 2006-03-14 Rosetta Inpharmatics Llc Iterative probe design and detailed expression profiling with flexible in-situ synthesis arrays
AU6213600A (en) 1999-07-16 2001-02-05 Rosetta Inpharmatics, Inc. Methods for determining the specificity and sensitivity of oligonucleotides for hybridization
US6271002B1 (en) * 1999-10-04 2001-08-07 Rosetta Inpharmatics, Inc. RNA amplification method
AU2596901A (en) 1999-12-21 2001-07-03 Millennium Pharmaceuticals, Inc. Compositions, kits, and methods for identification, assessment, prevention, and therapy of breast cancer
AU2001229340A1 (en) 2000-01-14 2001-07-24 Millennium Pharmaceuticals, Inc. Genes compositions, kits, and methods for identification, assessment, prevention, and therapy of breast cancer
AU2001278076A1 (en) * 2000-07-26 2002-02-05 Applied Genomics, Inc. Bstp-5 proteins and related reagents and methods of use thereof
US6713257B2 (en) 2000-08-25 2004-03-30 Rosetta Inpharmatics Llc Gene discovery using microarrays
US7807447B1 (en) 2000-08-25 2010-10-05 Merck Sharp & Dohme Corp. Compositions and methods for exon profiling
WO2002085298A2 (en) 2001-04-20 2002-10-31 Millennium Pharmaceuticals, Inc. Method for detecting breast cancer cells
EP1410011B1 (en) 2001-06-18 2011-03-23 Rosetta Inpharmatics LLC Diagnosis and prognosis of breast cancer patients
US7171311B2 (en) * 2001-06-18 2007-01-30 Rosetta Inpharmatics Llc Methods of assigning treatment to breast cancer patients
CA2531967C (en) * 2003-07-10 2013-07-16 Genomic Health, Inc. Expression profile algorithm and test for cancer prognosis
CA2558808A1 (en) 2004-03-05 2005-09-22 Rosetta Inpharmatics Llc Classification of breast cancer patients using a combination of clinical criteria and informative genesets
AU2005267756A1 (en) 2004-07-30 2006-02-09 Rosetta Inpharmatics Llc Prognosis of breast cancer patients
WO2006084272A2 (en) 2005-02-04 2006-08-10 Rosetta Inpharmatics Llc Methods of predicting chemotherapy responsiveness in breast cancer patients

Also Published As

Publication number Publication date
US7514209B2 (en) 2009-04-07
JP2005500832A (en) 2005-01-13
ATE503023T1 (en) 2011-04-15
DK1410011T3 (en) 2011-07-18
EP1410011B1 (en) 2011-03-23
CY1111677T1 (en) 2015-10-07
US20090157326A1 (en) 2009-06-18
US7863001B2 (en) 2011-01-04
US20110301048A1 (en) 2011-12-08
JP2009131262A (en) 2009-06-18
AU2002316251A1 (en) 2003-01-02
EP1410011A4 (en) 2007-08-01
WO2002103320A2 (en) 2002-12-27
JP5237076B2 (en) 2013-07-17
US20180305768A1 (en) 2018-10-25
PT1410011E (en) 2011-07-25
WO2002103320A3 (en) 2003-07-31
DE60239535D1 (en) 2011-05-05
US20030224374A1 (en) 2003-12-04
US9909185B2 (en) 2018-03-06
EP1410011A2 (en) 2004-04-21
CA2451074C (en) 2014-02-11
US20130116145A1 (en) 2013-05-09

Similar Documents

Publication Publication Date Title
US20180305768A1 (en) Diagnosis and prognosis of breast cancer patients
US7171311B2 (en) Methods of assigning treatment to breast cancer patients
US8019552B2 (en) Classification of breast cancer patients using a combination of clinical criteria and informative genesets
JP6351112B2 (en) Gene expression profile algorithms and tests to quantify the prognosis of prostate cancer
EP1721159B1 (en) Breast cancer prognostics
EP2333112B1 (en) Breast cancer prognostics
US20090239214A1 (en) Prognosis of breast cancer patients
WO2008157277A1 (en) Methods for evaluating breast cancer prognosis
US20080052007A1 (en) Methods and Materials Relating to Breast Cancer Diagnosis
US8105777B1 (en) Methods for diagnosis and/or prognosis of colon cancer
ES2366178T3 (en) Diagnosis and prognosis of breast cancer patients

Legal Events

Date Code Title Description
EEER Examination request
MKEX Expiry

Effective date: 20220614