CA2225385A1

CA2225385A1 - Method and system for dna sequence determination and mutation detection

Info

Publication number: CA2225385A1
Application number: CA002225385A
Authority: CA
Inventors: John K. Stevens; Vrijmoed Chi; Gregory Dee; Rodney D. Gilchrist; Ronald J. Green
Original assignee: Individual
Current assignee: Visible Genetics Inc
Priority date: 1995-06-30
Filing date: 1996-06-28
Publication date: 1997-01-23
Also published as: JPH11509622A; EP0835442B1; US6303303B1; EP0835442A1; DE69601720T2; US5853979A; AU6403996A; AU700410B2; WO1997002488A1; DE69601720D1

Abstract

Normalization of experimental fragment patterns for nucleic acid polymers having putatively known sequences starts with obtaining at least one raw fragment pattern for the experimental sample. The raw fragment pattern represents the positions of a selected nucleic acid base within the polymer as a function of migration time or distance. This raw fragment pattern is conditioned using conventional baseline correction and noise reduction techniques to yield a clean fragment pattern. The clean fragment pattern is then evaluated to determine one or more "normalization coefficients". These normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking of the clean fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the clean fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied to the clean fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner. This method may be implemented in an apparatus comprising a computer processor programmed to determine normalization coefficients for an experimental fragment pattern. This computer may be separate from the electrophoresis apparatus, or part of an integrated unit.

Description

CA 02225385 1997-12-19 WO 97/02488 PCT/US96/11130
METHOD AND SYSTEM FOR DNA SEQUENCE DETERMINATION AND MUTATION DETECTION

DESCRIPTION

I. BACKGROUND OF THE INVENTION
This invention relates to a method and system of nucleotide sequence determination and mutation detection in a subject nucleic acid molecule for use with automated electrophoresis detection apparatus.
One of the steps in nucleotide sequence determination of a subject nucleic acid polymer is interpretation of the pattern of oligonucleotide fragments which results from electrophoretic separation of fragments of the subject nucleic acid polymer (the "fragment pattern"). The interpretation of the fragment pattern, colloquially known as "base-calling," results in determination of the order of four nucleotide bases, A (adenine), C (cytosine), G (guanine) and T (thymine) for DNA or U (uracil) for RNA in the subject nucleic acid polymer.
In the earliest method of base-calling, a method which is still commonly employed, the subject nucleic acid polymer is labeled with a radioactive isotope and either Maxam and Gilbert chemical sequencing (Proc. Natl. Acad. Sci. USA 74: 560-564 (1977)) or Sanger et al. chain termination sequencing (Proc. Natl. Acad. Sci. USA 74: 5463-5467 (1977)) is performed. The resulting four samples of nucleic acid fragments (terminating in A, C, G, or T(U) respectively in the Sanger et al. method) are loaded into separate loading sites at the top end of an electrophoresis gel. An electric field is applied across the gel, and the fragments migrate through the gel. During this electrophoresis, the gel acts as a separation matrix. The fragments, which in each sample are of an extended series of discrete sizes, separate into bands of discrete species in a channel along the length of the gel. Shorter fragments generally move more quickly than larger fragments. After a suitable separation period, the electrophoresis is stopped. The gel may now be exposed to radiation sensitive film for the generation of an autoradiograph. The pattern of radiation detected on the autoradiograph is a fixed representation of the fragment pattern. A researcher then manually base-calls the order of fragments from the fragment pattern by identifying the stepwise sequence of the order of bands across the four channels.

More recently, with the advent of the Human Genome Organization and its massive project to sequence the entire human genome, researchers have been turning to automated DNA sequencers to process vast amounts of DNA sequence information. Existing automated DNA sequencers are available from Applied Biosystems, Inc. (Foster City, CA), Pharmacia Biotech, Inc. (Piscataway, NJ), Li-Cor, Inc. (Lincoln, NE), Molecular Dynamics, Inc. (Sunnyvale, CA) and Visible Genetics Inc. (Toronto). Automated DNA sequencers are basically electrophoresis apparatuses with detection systems which detect the presence of a detectable molecule as it passes through a detection zone. Each of these apparatuses, therefore, is capable of real time detection of migrating bands of oligonucleotide fragments; the fragment patterns consist of a time based record of fluorescence emissions or other detectable signals from each individual electrophoresis channel. They do not require the cumbersome autoradiography methods of the earliest technologies to generate a fragment pattern.
The prior art techniques for computer-assisted base-calling for use in automated DNA sequencers are exemplified by the method of the Pharmacia A.L.F.(TM) sequencer. Oligonucleotide fragments are labeled with a fluorescent molecule such as fluorescein prior to the sequencing reactions. Sanger et al. sequencing is performed and samples are loaded into the top end of an electrophoresis gel. Under electrophoresis the bands of species separate, and a laser at the bottom end of the gel causes the fragments to fluoresce as they pass through a detection zone. The fragment patterns are a record of fluorescence emissions from each channel. In general, each fragment pattern includes a series of sharp peaks and low, flat plains; the peaks representing the passage of a band of oligonucleotide fragments; the plains representing the absence of such bands.
To perform computer-assisted base-calling, the A.L.F. system executes at least four discrete functions: 1) it smooths the raw data with a band-pass frequency filter; 2) it identifies successive maxima in each data stream; 3) it aligns the smoothed data from each of the four channels into an aligned data stream; and 4) it determines the order of the successive maxima with respect to the aligned data stream. The alignment process used in the apparatus depends on the existence of very little variability between the lanes of the gel. In this case, the fragment patterns from each lane can be superimposed by alignment to a presumed starting point in each pattern to provide a record of a continuous, non-overlapping series of sharp peaks, each peak representing a one nucleotide step in the subject nucleic acid. Where a distinct ordering of peaks cannot be made, the computer identifies the presence of ambiguities and fails to identify a sequence.
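Functions 2) and 4) above can be illustrated with a toy example. The following Python sketch is only an illustration under stated assumptions and is not the A.L.F. software: it assumes the four lanes are already smoothed and aligned, uses a fixed intensity threshold, and the names `find_peaks`, `call_bases` and the lane values are hypothetical.

```python
def find_peaks(signal, threshold):
    """Return indices of local maxima whose intensity exceeds threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i] > threshold and signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks.append(i)
    return peaks


def call_bases(lanes, threshold=0.5):
    """Order the maxima of all four aligned lanes by data point number.

    lanes maps a base letter to its (already smoothed, aligned) trace.
    """
    events = []
    for base, signal in lanes.items():
        for index in find_peaks(signal, threshold):
            events.append((index, base))
    events.sort()  # earlier data points correspond to shorter fragments
    return "".join(base for _, base in events)


# Hypothetical aligned traces, one sharp peak per lane.
lanes = {
    "A": [0, 1, 0, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 1, 0, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 0, 1, 0, 0, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 1, 0],
}
sequence = call_bases(lanes)
```

The sketch makes the dependence on alignment visible: if one lane were shifted by a few data points, the sorted order of the peaks, and therefore the called sequence, would change.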
Other published methods of computer-assisted base-calling include the methods disclosed by Tibbetts and Bowling (US Pat. No. 5,365,455) and Dam et al. (US Pat. No. 5,119,316), which patents are incorporated herein by reference, and Fujii (US Pat. No. 5,419,825). Tibbetts and Bowling disclose a method and system which relies on the second derivative of the peak slopes to smooth the data. The second derivative is used to provide an informative variable and an intensity variable to determine the nucleic acid sequence corresponding to the subject nucleic acid polymer. Dam et al. disclose a method of combining peak shapes from two signal spectrums derived from the same electrophoresis channel to determine the order of nucleotides in the subject nucleic acid polymer. Fujii discloses a base-calling method and apparatus which uses the peaks in a concurrently-run reference lane as a standard to provide calibration coefficients for correcting for differences in migration speed from lane to lane.
Three practical problems face all existing methods and systems of base-calling. The first is the inability to align shifted lanes of data. If the signal from the related data streams does not begin at approximately the same time, it is difficult, if not impossible, for these techniques to determine the correct alignment. Secondly, it is a challenge to resolve "compressions" in the fragment pattern: those anomalies wherein the signals from two or more nucleotides in a row are not distinguishably separated as compared to other nucleotides in the general vicinity. Compressions result most often from short hairpin loops at the end of a fragment which cause altered gel mobility features. The third problem is the inability to identify nucleotide sequences beyond the limits of single nucleotide resolution. Larger fragments tend to need longer electrophoresis runs to separate into discrete bands of fragments, in part because a one nucleotide addition to a 300 nt fragment is less significant than a one nucleotide addition to a 25 nt fragment. The limit of resolution is reached when individual bands cannot be usefully distinguished.
All of these problems limit the most crucial aspects of base-calling, which are speed, read-length and accuracy. Read-length is the number of fragment bands which can be identified from the fragment pattern. Greater read-length provides greater information about the DNA sequence in question. Accuracy measures the number of base-calling errors.

The advent of DNA sequence-based diagnosis provides new opportunities for improved speed, accuracy and read-length in computer-assisted base-calling. DNA sequence-based diagnosis is the routine sequencing of patient DNA to identify genotype and/or specific gene sequences of the patient, wherein the DNA sequence is reported back to the physician and patient in order to assist in diagnosis and treatment of patient conditions. One of the great advantages of DNA sequence-based diagnosis is that the DNA sequence being examined is largely known. As demonstrated by the instant invention, it is possible to use the known fragment pattern for each DNA sequence to assist in the interpretation of the fragment pattern obtained from a patient sample to obtain improved read-length and accuracy. It can also be used to increase the speed of sample analysis.
It is an object of the instant invention to provide a method and system for nucleotide sequence determination and mutation detection which can be used with DNA sequence-based diagnosis.
It is a further object of the instant invention to provide a method and system for nucleotide sequence determination and mutation detection when the fragment pattern demonstrates localized compressions.
It is a further object of the instant invention to provide a method and system for nucleotide sequence determination and mutation detection when the fragment pattern does not provide single nucleotide resolution.
It is a further object of the instant invention to provide a method and system of computer-assisted base-calling which can be used with fragment pattern records from high speed electrophoretic separations which demonstrate less than ideal separation characteristics.

II. SUMMARY OF THE INVENTION
These and other objects of the invention are realized by the application of a novel approach to the normalization of experimental fragment patterns for nucleic acid polymers having putatively known sequences. In this method, at least one raw fragment pattern is obtained for the experimental sample. The raw fragment pattern represents the positions of a selected nucleic acid base within the polymer as a function of migration time or distance.
This raw fragment pattern is conditioned using conventional baseline correction and noise reduction techniques to yield a clean fragment pattern. The clean fragment pattern is then evaluated to determine one or more "normalization coefficients." These normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking of the clean fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the clean fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied to the clean fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner.
In applying the present invention to the evaluation of nucleic acid polymers of putatively known sequence for the detection of well-characterized mutations in which one base is substituted for another at a constant site in the gene, it will generally be sufficient to determine normalization coefficients for a single fragment pattern reflecting the positions of either the normal or mutant base within the nucleic acid polymer. For more general applications, however, it is desirable to determine separate normalization coefficients for each of the four oligonucleotide fragment patterns obtained for the sample by correlating them with four standard fragment patterns.
The method of the invention is advantageously implemented in an apparatus comprising a computer processor programmed to determine normalization coefficients for an experimental fragment pattern. This computer may be separate from the electrophoresis apparatus, or part of an integrated unit.

III. BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates the effect of background subtraction and band-pass frequency filtration on the appearance of data.
Figs. 2A, 2B, and 2C illustrate the correlation method of the instant invention.
Figs. 3A, 3B and 3C illustrate the effect of increasing the number of segments into which the sample data is divided.
Fig. 4 is a plot of preferred correlation shift against data point number.
Figs. 5A and 5B illustrate alignment of data windows.
Fig. 6 shows the process of "reproduction" using Genetic Algorithms.
Fig. 7 shows a binary genotype useful for finding values for the coefficients of a second-order polynomial using Genetic Algorithms.
Fig. 8 illustrates the exercise of base-calling of aligned data, as obtained from a Pharmacia A.L.F.(TM) and processed using HELIOS(TM) software.
Figs. 9A and 9B illustrate a sequencing compression.
Fig. 10 illustrates a cross correlogram which plots maximum correlation against data point number of shifted Origin across the entire length of sample and standard fragment patterns.
Fig. 11 shows an apparatus in accordance with the invention.

IV. DETAILED DESCRIPTION OF THE INVENTION
The instant invention is designed to work with DNA sequence-based diagnosis or any other sequencing environment involving nucleotide sequence determination and/or mutation detection for the same region of DNA in a plurality of individual DNA-containing samples (human or otherwise). This "diagnostic environment" is unlike the vast majority of DNA sequence determination now occurring in which researchers are attempting to make an initial determination of the nucleotide sequence of unknown regions of DNA. DNA sequence-based diagnosis in which the DNA sequence of a patient gene is determined is one example of a technique performed within a diagnostic environment to which the present invention is applicable. Other examples include identification of pathogenic bacteria or viruses, DNA fingerprinting, plant and animal identification, etc.
The present invention provides a method for normalization of experimental fragment patterns for nucleic acid polymers with putatively known sequences which enhances the ability to interpret the information found in the fragment patterns. In this method, at least one raw fragment pattern is obtained for the experimental sample. As used in the specification and claims hereof, the term "raw fragment pattern" refers to a data set representing the positions of one selected nucleic acid base within the experimental polymer as a function of migration time or distance. Preferred raw fragment patterns which may be processed using the present invention include raw data collected using the fluorescence detection apparatus of automated DNA sequencers. However, the present invention is applicable to any data set which reflects the separation of oligonucleotide fragments in space or time, including real time fragment patterns using any type of detector, for example a polarization detector as described in US Patent Application No. 08/387,272 filed February 13, 1995 and incorporated herein by reference; densitometer traces of autoradiographs or stained gels; traces from laser-scanned gels containing fluorescently-tagged oligonucleotides; and fragment patterns from samples separated by mass spectrometry.
This raw fragment pattern is conditioned, for example using conventional baseline correction and noise reduction techniques, to yield a "clean fragment pattern." As is known in the art, three methods of signal processing commonly used are background subtraction, low frequency filtration and high frequency filtration.
Background subtraction eliminates the minimum constant noise recorded by the detector. The background is calculated as a measure of the minimum signal obtained over a selected number of data points. This measure differs from low frequency filtration which eliminates low period variations in signal that may result from variable laser intensity, etc.
High frequency filtration eliminates the small variations in signal intensity that occur over highly localized areas of signal. The result after base-line subtraction is a band-pass filter applied to the frequency domain:
F(f)=e ~ e /~~

where ~ determines the low-frequency cutoff, and o determines the high-frequency cutoff, respectively. Fig. 1 illustrates the effect of background subtraction, low and high frequency filtration on the appearance of data from a Visible Genetics MicroGene Blaster(TM), resulting in a clean fragment pattern useful in the invention.
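Two of the conditioning steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the filter used by any particular instrument: background is estimated as the minimum signal over a neighborhood of `span` data points, a simple moving average stands in for high frequency filtration, and low frequency filtration is omitted. The function names and the sample values are hypothetical.

```python
def subtract_background(signal, span):
    """Subtract from each point the minimum signal found within span points."""
    half = span // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(signal[i] - min(signal[lo:hi]))
    return out


def moving_average(signal, width):
    """Smooth highly localized variations by averaging over width points."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


# Hypothetical raw trace: two peaks riding on a constant background of 5.
raw = [5, 5, 6, 9, 6, 5, 5, 8, 5, 5]
baseline_free = subtract_background(raw, span=2 * len(raw))
clean = moving_average(baseline_free, width=3)
```

A wider averaging window removes more noise but also flattens closely spaced peaks, which is one way that conditioning can delete features of consequence.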
In accordance with the present invention, a "clean fragment pattern" may be obtained by the application of these signal-processing techniques singly or in any combination. In addition, other signal processing techniques may be employed to obtain comparable clean fragment patterns without departing from the present invention.
One note of caution concerning this conditioning step is the finding that signal conditioning or pre-processing may delete features of consequence in the preparation of the clean fragment pattern. It is possible to include a feedback mechanism in the system which adjusts the parameters of the filter mechanisms, based on the analysis of the degree of correlation, described below. The feedback mechanism adjusts the types of filters employed in signal processing to provide the maximum information about the subject nucleic acid sequence.
The next step in the method of the present invention is the comparison of the clean fragment pattern with a standard fragment pattern to determine one or more "normalization coefficients." The use of a "standard fragment pattern" takes advantage of the fact that in a diagnostic environment, there is a known fragment pattern that is expected from each test sample. As used in the specification and claims of this application, the term "standard fragment pattern" refers to a typical fragment pattern which results from sequencing a particular known region of DNA using the same technique as the experimental technique being employed. Thus, a standard fragment pattern may be a time-based fluorescence emission record as obtained from an automated DNA sequencer, or it may be another representation of the separated fragment pattern.
A standard fragment pattern used in the present invention includes all the less-than-ideal characteristics of nucleotide separation that may be associated with sequencing of any particular region of DNA. A standard fragment pattern may also tend to be idiosyncratic with the electrophoresis apparatus employed, the reaction conditions employed in sequencing and other factors. Fig. 2B illustrates a standard fragment pattern for the T lane of the first 260 nucleotides from the universal primer of pUC18 prepared using Sequenase 2.0 (United States Biochemical, Cleveland) and detected on a Visible Genetics MicroGene Blaster(TM). Four standard fragment patterns, one for each nucleotide, make up the standard fragment pattern set for a particular nucleic acid polymer.
A standard fragment pattern or fragment pattern set for a particular nucleic acid polymer may be generated by various methods. One such method is to obtain several to several hundred actual fragment patterns for the DNA sequence in question from samples wherein the DNA sequence is already known. From these trial runs, a human operator may select the trial run that is found to be the most typical fragment pattern. Because of slight gel or sample anomalies, and other anomalies, different fragment patterns may have slightly different separation characteristics, and slightly different peak amplitudes etc. The selected pattern generally should not show discrete peaks in an area where compressions and overlaps are regularly found. Similarly, the selected pattern generally must not show discrete separation of bases beyond the average single nucleotide resolution limit of the electrophoresis instrument used. An alternative method to select a standard fragment pattern is to generate a mathematically averaged result from a combination of the trial runs. As described below, the main use of the standard fragment pattern is as a basis for modifying and normalizing an experimental fragment pattern to enhance the reliability of the interpretation of the experimental data. Thus, the standard fragment pattern is not used as a comparator for identifying deviations from the expected or "normal" sequence, and in fact is used in a manner which assumes that the experimental sequence will conform to the expected sequence.
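The alternative of a mathematically averaged standard can be sketched as a point-by-point mean. The Python fragment below is an illustration only: it assumes the trial runs have already been brought to the same length and alignment, which the text does not specify, and the name `average_pattern` and the run values are hypothetical.

```python
def average_pattern(runs):
    """Return the point-by-point mean of equally long, pre-aligned trial runs."""
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all trial runs must have the same number of data points")
    return [sum(values) / len(runs) for values in zip(*runs)]


# Three hypothetical trial runs of the same known sequence.
trial_runs = [
    [0, 2, 4],
    [0, 4, 2],
    [0, 0, 0],
]
standard = average_pattern(trial_runs)
```

Averaging misaligned runs broadens and lowers the peaks, so in practice the runs would need to be aligned before such an average is meaningful.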
A feature of the standard fragment pattern which is important for some uses is that it results in a minimum of (and preferably no) ambiguities in base-calling when combined with the standard fragment patterns from the three other sequencing channels. The human operator may prefer to empirically determine which fragment patterns from which lanes work best together in order to determine the standard fragment patterns for each sequencing lane.
Additionally, it is well known in the art that a range of alleles for any gene may be present in a population. To be most useful, a standard fragment pattern should result from sequencing the dominant allele of a given population. Because of this, for some applications of the invention, multiple standard fragment patterns may exist for a specific gene, even within a single experimental environment.
A standard fragment pattern may be used in different ways to provide improved read-length, accuracy and speed of sample analysis. These improvements rely on comparison of an experimental sample fragment pattern with the standard fragment pattern to determine one or more "normalization coefficients" for the particular experimental fragment pattern.
The normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking of the clean fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the clean fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied to the clean fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner.
The process of comparing the clean fragment pattern and the standard fragment pattern to arrive at normalization coefficients can be carried out in any number of ways without departing from the present invention. In general, suitable processes involve consideration of a number of trial normalizations, and selection of the trial normalization which achieves the best fit in the model being employed. Several non-limiting examples of useful comparison procedures are set forth below. The procedures result in the development of normalization coefficients which, when applied to an experimental fragment pattern, shift, stretch or shrink the experimental fragment pattern to achieve a high degree of overlap with the standard fragment pattern.
It will be understood that the theoretical goal of achieving an exact overlap between an experimental fragment pattern and a standard fragment pattern may not be realistically achievable in practice, nor are repetitive and time consuming calculations to obtain perfect normalization necessary to the successful use of the invention. Thus, the term "high degree of normalization" refers to the maximization of the normalization which is achievable within practical constraints. As a general rule, a point-for-point correlation coefficient calculated for normalized fragment patterns and the corresponding standard fragment pattern of at least 0.8 is desirable, while a correlation coefficient of at least 0.95 is preferred.
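The acceptance rule in the preceding paragraph can be written down directly. The Python sketch below assumes that the "point-for-point correlation coefficient" is the ordinary Pearson coefficient, which the text does not state explicitly; the names `correlation_coefficient` and `grade` and the example traces are hypothetical.

```python
import math


def correlation_coefficient(a, b):
    """Pearson correlation of two equally long traces (assumed definition)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("traces must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat trace carries no positional information
    return cov / math.sqrt(var_a * var_b)


def grade(r):
    """Apply the thresholds given in the text: 0.8 desirable, 0.95 preferred."""
    if r >= 0.95:
        return "preferred"
    if r >= 0.8:
        return "desirable"
    return "insufficient"


standard = [0, 1, 4, 1, 0, 0, 3, 0]
well_normalized = [2 * v for v in standard]   # same peak positions, other gain
badly_shifted = standard[1:] + standard[:1]   # peaks one data point early

r_good = correlation_coefficient(well_normalized, standard)
r_bad = correlation_coefficient(badly_shifted, standard)
```

Because the coefficient is insensitive to overall gain, a normalized pattern whose peaks sit in the right places scores highly even if its amplitudes differ from the standard.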
Fig. 2 illustrates one correlation method of the instant invention. Fig. 2A illustrates a clean fragment pattern obtained using a Visible Genetics MicroGene Blaster(TM). The signal records the T lane of a pUC18 sequencing run over the first 260 nucleotides (nt) of the subject nucleic acid molecule. The Y axis is an arbitrary representation of signal intensity; the X axis represents a time of 0 to 5 minutes. In the sequencing run shown, the peaks are cleanly separated.
Fig. 2B represents the standard fragment pattern for the T lane of the first 260 nucleotides from the universal primer of pUC18 prepared using Sequenase 2.0 (United States Biochemical, Cleveland) and detected on a Visible Genetics MicroGene Blaster(TM). The standard sequence was selected by a human operator as the most typical fragment pattern from 25 trial runs.
The experimental fragment pattern of Fig. 2A may be compared with the standard fragment pattern of Fig. 2B according to the equation:

$$f(x) \circ g(x) = \sum_{m=0}^{M-1} f^{*}(m)\, g(x+m)$$

where f(x) and g(x) are two discrete functions, x = 0, 1, 2, ..., M-1, f* is the complex conjugate, and M is one less than the sum of the data points in f(x) and g(x). Alternatively, the equation may be described as CIOR(I1, 21) in the NextStep(TM) programming environment, where I1 = the experimental fragment pattern, and 21 = the standard fragment pattern.
Fig. 2C shows the correlation values of the entire window of Lane A against the entire window of Lane B as lane A is translated relative to lane B. (As the window is shifted, it effectively wraps around, such that the End and Origin points appear to be side by side.) The result shows maximum correlation at point P which corresponds to a preferred correlation shift of +40 data points.
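The search for a preferred correlation shift with wrap-around, as described for Fig. 2C, can be sketched as follows. This Python fragment is a minimal illustration for real-valued traces of equal length, so the complex conjugate plays no role; it uses circular indexing rather than zero padding, and the names `circular_cross_correlation` and `preferred_shift` and the toy traces are hypothetical.

```python
def circular_cross_correlation(f, g):
    """c[x] = sum over m of f[m] * g[(x + m) mod M], for real-valued traces."""
    size = len(f)
    if len(g) != size:
        raise ValueError("traces must have the same number of data points")
    return [
        sum(f[m] * g[(x + m) % size] for m in range(size))
        for x in range(size)
    ]


def preferred_shift(f, g):
    """Shift (in data points) of g relative to f that maximizes correlation.

    Shifts past the midpoint are reported as negative, reflecting wrap-around.
    """
    c = circular_cross_correlation(f, g)
    best = max(range(len(c)), key=c.__getitem__)
    if best > len(c) // 2:
        best -= len(c)
    return best


# Toy traces: g shows the same two peaks as f, two data points later.
f = [0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0]
g = [0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0]
shift = preferred_shift(f, g)
```

A direct double loop like this costs on the order of M squared operations; for traces of many thousands of data points a Fourier-based correlation would usually be chosen instead.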
Fig. 2 illustrates comparison of a complete experimental fragment pattern and a complete standard fragment pattern. In this case, the only normalization coefficient determined is the shift which results in the highest level of correlation. This simple model, however, lacks the robustness which is needed for general applicability. Thus for most purposes, a more complex analysis is required to obtain good normalization.
One way to take into account the experimental variability in migration rate caused by inconsistency of sample preparation chemistry, sample loading, gel material, gel thickness, electric field density, clamping/securing of gel in instrument, detection rate and other aspects of the electrophoresis process is to assign the data points of the clean fragment pattern to one or more segments or "windows." Each window includes an empirically determined number of data points, generally in the range of 100 to 10000 data points. Windows may be of variable size within a given data series, if desired. The starting data point of each window is designated Origin, the final data point in a window is designated End. Each window of the experimental fragment pattern is then compared with a comparable number of data points making up the standard fragment pattern using the same procedure described above.
Figs. 3A and 3B illustrate the effect of increasing the number of segments or windows into which the experimental data is divided. In Fig. 3A, the experimental fragment pattern from Fig. 2A was divided into three windows, and each was evaluated individually. Instead of the single offset of +40 data points found using a single window, the use of three windows results in an increasing degree of shift throughout the run, i.e., +24, +34 and +50 in the successive windows reading from right to left. Fig. 3B shows the use of five windows on the same experimental fragment pattern, and results in even clearer resolution, with successive shifts of +16, +23, +35, +48, and +51 for the windows. Simply put, the consequence of too few windows is a lack of precision in shifting information. This may cause problems in base-calling aligned data. It is therefore desirable to use more than one window in the correlation process.
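The window-by-window evaluation can be sketched as below. This is an illustrative Python fragment under stated assumptions, not the procedure used to produce Fig. 3: windows are of equal size, each window is tested against the standard at every lag up to a hypothetical `max_lag`, and the score is a plain sum of products. All names and the spike traces are invented for the example.

```python
def best_lag(window, standard, origin, max_lag):
    """Lag in [-max_lag, max_lag] maximizing overlap of window with standard."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(window):
            j = origin + i + lag
            if 0 <= j < len(standard):
                score += value * standard[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def window_shifts(sample, standard, n_windows, max_lag):
    """Return (Origin, preferred shift) for each window of the sample."""
    size = len(sample) // n_windows
    shifts = []
    for w in range(n_windows):
        origin = w * size
        end = len(sample) if w == n_windows - 1 else origin + size
        shifts.append((origin, best_lag(sample[origin:end], standard, origin, max_lag)))
    return shifts


def spikes(length, positions):
    """Build a toy trace with unit peaks at the given data point numbers."""
    trace = [0.0] * length
    for p in positions:
        trace[p] = 1.0
    return trace


# Toy data: the sample runs progressively ahead of the standard.
standard = spikes(40, [5, 13, 26, 35])
sample = spikes(40, [3, 11, 22, 31])
shifts = window_shifts(sample, standard, n_windows=2, max_lag=5)
```

In this toy data no single shift aligns all four peaks, whereas two windows recover a shift of +2 for the first half and +4 for the second, mirroring the increasing shifts reported for Figs. 3A and 3B.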

When more than one window is used in the analysis as illustrated in Fig. 3, it may become necessary to stretch or shrink some windows to obtain a continuous stream of data in the correlated data and to obtain a sufficiently high correlation. To calculate stretch or shrink ("elasticity") for a window where a plurality of windows are defined, one can use an "elasticity plot" of preferred correlation shift against data point number (Fig. 4). An increased number of windows increases the number of points on the elasticity plot, allowing more accurate determination of the slope and offset of the alignment line.
It has been found experimentally that in an elasticity plot, a linear equation representing the least mean square fit adequately represents the data. In this case the linear equation f(x) = mx + b will satisfy the line, where m is the slope of the line and b is the Y intercept ("offset") expressed in number of data points. The equation of the line is used to shift the sample data where the value at shifted(i) of the shifted data is given by:

shifted(i) = sample(i) + ((sample(i) * m) + b)

Note that, as illustrated in Fig. 5, when the peaks identified in the sample window (Lane B) do not align with the standard data (Lane A) (Fig. 5A), they may be aligned for analysis purposes by padding the elastically shifted data with zeros when the formula produces values outside of the sample data's range (Fig. 5B).
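The fit and the elastic shift can be illustrated together. The Python sketch below makes two assumptions that go beyond the text: sample(i) in the formula is read as the data point number of the i-th sample point, and the shifted trace is built by inverse mapping with nearest-neighbour lookup, writing zeros wherever the source position falls outside the sample data's range. The names `fit_line` and `apply_elastic_shift` and the elasticity points are hypothetical.

```python
def fit_line(points):
    """Least-squares slope m and offset b for (data point number, shift) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("at least two distinct data point numbers are required")
    m = (n * sxy - sx * sy) / denom
    b = (sy - m * sx) / n
    return m, b


def apply_elastic_shift(trace, m, b):
    """Move the value at position p to p + (p * m + b), zero-padding gaps.

    Implemented as an inverse mapping: output position j reads from source
    position (j - b) / (1 + m), rounded to the nearest data point.
    """
    out = []
    for j in range(len(trace)):
        source = (j - b) / (1 + m)
        if source < 0 or source > len(trace) - 1:
            out.append(0.0)  # outside the sample data's range: pad with zero
        else:
            out.append(trace[round(source)])
    return out


# Hypothetical elasticity plot: shift grows by 10 points per 100 data points.
m, b = fit_line([(0, 10), (100, 20), (200, 30)])

trace = [0.0] * 50
trace[10] = 1.0                      # one peak at data point 10
shifted = apply_elastic_shift(trace, m, b)
```

With m = 0.1 and b = 10, the peak at data point 10 moves to 10 + (10 * 0.1 + 10) = 21, and the first ten output points are zero padding.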
While the use of multiple windows increases the accuracy of the alignment, a potential problem arises when too many windows are used. As illustrated in Fig. 3C, when windows include too few features, the correlation between the data in the window and the standard fragment pattern becomes meaningless. In Fig. 3C, window size has dropped below 1000 data points. One window which includes a single peak is found to have highest correlation with a peak distantly removed from the location where it would otherwise be expected to correlate.
This situation demonstrates that thc human opcrator must be scnsitive to the uniquc circumstanccs of each standard fragment pattcrn to dctclmine the ~)Lilllulll number of data points per window.
One method wherein windows with fewer data points can be employed is to limit the amount of the standard fragment pattern against which the window is correlated. Again, such a limitation would be empirically determined as in the other data filters employed. It is found experimentally that correlation of a sample window with that region of the standard fragment pattern that falls approximately at the same number of data points from the start of signal, and includes twice as many data points as the sample window, is sufficient to obtain correlations which are not often spurious.
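A minimal sketch of this restricted search is given below. It assumes that correlation is scored as a plain sum of products and that the search region is centred on the window's own position; both choices, and the function name, are illustrative assumptions rather than details taken from the text.

```python
def best_shift_in_region(window, start, standard):
    """Find the shift of `window` (taken from index `start` of the sample)
    that best matches `standard`, searching only a region of the standard
    that is twice as long as the window and centred on the same position."""
    w = len(window)
    region_start = max(0, start - w // 2)
    region_end = min(len(standard), region_start + 2 * w)
    best_shift, best_score = 0, float("-inf")
    for pos in range(region_start, region_end - w + 1):
        # Sum-of-products correlation at this candidate position (assumption).
        score = sum(a * b for a, b in zip(window, standard[pos:pos + w]))
        if score > best_score:
            best_score, best_shift = score, pos - start
    return best_shift, best_score
```

Because positions far from the expected location are never scored, a small window holding a single peak cannot lock onto a distant peak in the standard.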
An alternative approach to the determination of normalization coefficients which is applicable whether the experimental fragment pattern is considered in one or several segments makes use of an adaptive computational method known as a "Genetic Algorithm." See Holland, J., Adaptation in Natural and Artificial Systems, The University of Michigan Press (1975).
Genetic Algorithms (GAs) are particularly good at solving optimization problems where traditional methods may fail. In the context of normalization of nucleic acid fragment patterns, GAs are particularly suited to use in experimental conditions where variations in velocity may exceed 5%.
The conceptual basis of GAs is Darwinian "survival of the fittest." In nature, individuals compete for resources (e.g., food, shelter, mates, etc.). Those individuals which are most highly adapted for their environment tend to produce more offspring. GAs attempt to mimic this process by "evolving" solutions to problems.
GAs operate on a "population" of individuals, each of which is a possible solution to a given problem. Each individual in the starting population is assigned a unique binary string which can be considered to represent that individual's "genotype." The decimal equivalent of this binary genotype is referred to as the "phenotype." A fitness function operating on the phenotype reflects how well a particular individual solves the problem.
Once the fitness of every individual in the starting population has been determined, a new generation is created through reproduction. Individuals are selected for reproduction from the starting population based on their fitness. The higher an individual solution's fitness, the greater the probability of it contributing one or more offspring to the next generation.
During reproduction, the number of individuals in a population is kept constant through all generations. "Reproduction" results from combining the genotypes of the individuals from the prior generation with the highest fitness. Thus, as shown in Fig. 6, in a population of four individuals, portions of the genotype of the two individuals with the highest fitness are exchanged at a randomly selected cross-over point to yield a new generation of offspring.
A further aspect of the Genetic Algorithm approach is the random use of "mutations" to introduce diversity into the population. Mutations are performed by flipping a single randomly selected bit in an individual's genotype.
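The cross-over and mutation operators described above can be illustrated on genotypes written as strings of "0" and "1". The function names and the string representation are assumptions made for the sake of a short example.

```python
import random


def crossover(parent_a, parent_b, point):
    """Exchange the tails of two equal-length genotypes at `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(genotype, rng=random):
    """Flip a single randomly selected bit of the genotype."""
    pos = rng.randrange(len(genotype))
    flipped = "1" if genotype[pos] == "0" else "0"
    return genotype[:pos] + flipped + genotype[pos + 1:]
```

For example, `crossover("11110000", "00001111", 3)` yields the offspring `"11101111"` and `"00010000"`.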
Arriving at a solution to a problem using GAs involves the repeated steps of fitness evaluation, reproduction (through cross-over), and possibly mutation. Each step is simply repeated in turn until the population converges within predetermined limits upon a single solution to the given problem. A pseudo-code implementation of this process is shown in Table 1.
Table 1: Genetic Algorithm in Pseudo-code

generate initial population
compute the fitness of each individual
WHILE NOT finished DO
    FOR (population_size / 2) DO
        select two individuals from old generation for mating
        recombine the two individuals to create two offspring
        randomly mutate offspring
        insert offspring into new generation
    END
    IF population has converged THEN
        finished := TRUE
    END
END

Beasley et al., "An Overview of Genetic Algorithms: Part I, Fundamentals", University Computing 15(2): 58-69 (1993).

It will be appreciated by persons skilled in the art that the task of aligning fragment patterns needs to take into account a great many experimental variations, including variations
in sample preparation chemistry; sample loading; gel material; gel thickness; electric field density; clamping/securing of gel in instrument; detection rate and other aspects of the electrophoresis process. We have found experimentally that applying a second-order polynomial f(x) = c + bx + ax^2, where c reflects the linear shift, b reflects the stretch or shrink, and a reflects the rate at which this stretch or shrink occurs, to a clean fragment pattern provides good normalization of the clean fragment pattern with a standard fragment pattern for experimental data having variations in velocity of up to 45%. Using GAs, the coefficients a, b, and c can be readily optimized. A suitable approach to this optimization uses a binary string as the genotype for each individual, which is divided into three sections representing the three coefficients as shown in Fig. 7. The size of each section is dependent on the range of possible values of each coefficient and the resolution desired. The phenotype of the individual is determined by decoding each section to the corresponding decimal value.
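One way to picture the effect of the second-order polynomial on a trace is the sketch below. It rests on an assumption the text does not spell out: f is treated as a map from each output data point i to a source position c + b*i + a*i^2 in the clean fragment pattern, and the value there is obtained by linear interpolation, with zeros outside the data. The function name is hypothetical.

```python
import math


def apply_polynomial(trace, a, b, c):
    """Resample `trace` so that output point i is read from c + b*i + a*i*i."""
    n = len(trace)
    out = [0.0] * n
    for i in range(n):
        pos = c + b * i + a * i * i
        lo = math.floor(pos)
        frac = pos - lo
        if 0 <= lo and lo + 1 < n:
            # Linear interpolation between the two neighbouring data points.
            out[i] = (1 - frac) * trace[lo] + frac * trace[lo + 1]
        elif lo == n - 1 and frac == 0:
            out[i] = trace[lo]
    return out
```

With a = 0, b = 1 and c = 0 the trace is returned unchanged; a non-zero c alone produces a pure shift, b different from 1 a uniform stretch or shrink, and a non-zero a a stretch that changes along the run.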
As shown in Fig. 7, a binary string for use in solving the problem presented by this invention may contain 32 bits, of which 8 bits specify the offset coefficient c, 13 bits specify the relative velocity b, and 11 bits specify the relative acceleration a. The objective function used to measure the fitness of an individual is the intersection of the standard fragment pattern and an experimental fragment pattern produced by applying the second-order polynomial to the experimental fragment pattern. The intersection is defined by the equation

$$f(x) = \sum_{i=0}^{n} \min(x_i, y_i)$$

where x is the experimental fragment pattern, y is the standard fragment pattern and n is the number of data points. The intersection will be greatest when the two sequences are perfectly aligned.
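The decoding of a 32-bit genotype and the intersection measure can be sketched as follows. The order of the three sections within the string (c in the top 8 bits, then b, then a) and the linear scaling of each raw value onto a working range are assumptions for illustration; Fig. 7 and the actual ranges are not reproduced here.

```python
def decode_genotype(genotype):
    """Split a 32-bit integer into raw (c, b, a) section values."""
    a_raw = genotype & 0x7FF             # low 11 bits: relative acceleration a
    b_raw = (genotype >> 11) & 0x1FFF    # next 13 bits: relative velocity b
    c_raw = (genotype >> 24) & 0xFF      # top 8 bits: offset c
    return c_raw, b_raw, a_raw


def scale(raw, bits, low, high):
    """Map a raw section value onto [low, high] (hypothetical range)."""
    return low + (high - low) * raw / ((1 << bits) - 1)


def intersection(x, y):
    """Sum of point-wise minima of two fragment patterns."""
    return sum(min(xi, yi) for xi, yi in zip(x, y))
```

The number of bits given to each section sets the resolution: 13 bits for b gives 8192 distinct values across whatever range is chosen for the relative velocity.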
Calculating the fitness of each individual is a three step process. First, the individual's genotype is decoded, producing the values (phenotypes) for the three coefficients. Second, the coefficients are plugged into the second-order polynomial and the polynomial is used to modify the clean fragment pattern. Third, the intersection of the modified fragment pattern and the standard pattern is calculated. The intersection value is then assigned to the individual as its fitness value. About 20 generations are needed to align the two sequences using a population of 50 individuals with a mutation probability of 0.001 (i.e., 1 out of every 1000 bits mutated after crossover). Using conventional computer equipment this can be accomplished in approximately 8 seconds. This time period is sufficiently short that all calculations can be run for a standardized period of time, rather than to a selected degree of convergence. This substantially simplifies experimental design.
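A compact sketch of such a run, using the stated population of 50, 20 generations and a per-bit mutation probability of 0.001, is shown below. The fitness function is passed in as a parameter and is assumed to return non-negative values (as the intersection does); fitness-proportional selection, the fixed generation count in place of a convergence test, and the tracking of the best individual seen are illustrative choices, not details confirmed by the text.

```python
import random


def run_ga(fitness, bits=32, pop_size=50, generations=20, p_mut=0.001, seed=None):
    """Evolve integer genotypes of `bits` bits; return (best, final population)."""
    rng = random.Random(seed)
    population = [rng.getrandbits(bits) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(g) for g in population]
        # random.choices needs a positive total weight; fall back to uniform.
        weights = scores if sum(scores) > 0 else None
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            point = rng.randrange(1, bits)      # cross-over point
            mask = (1 << point) - 1
            c1 = (p1 & ~mask) | (p2 & mask)
            c2 = (p2 & ~mask) | (p1 & mask)
            for child in (c1, c2):
                for bit in range(bits):         # each bit mutates with p_mut
                    if rng.random() < p_mut:
                        child ^= 1 << bit
                offspring.append(child)
        population = offspring[:pop_size]       # population size stays constant
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best, population
```

In the alignment setting, `fitness` would decode the genotype into a, b and c, apply the polynomial to the clean fragment pattern, and return the intersection with the standard pattern.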
Occasionally, the second-order polynomial will be unable to normalize the two sequences. This is due to variations in the velocity of the experimental fragment pattern which are greater than second-order. This is easily handled by using a higher order polynomial, for example a third- or fourth-order polynomial, and a larger binary genotype to include the extra coefficients, or by simply dividing the experimental fragment pattern into segments or windows such that each segment's variations are at most second-order.
Once normalized fragment patterns have been obtained, they may be used in various ways including base-calling and mutation detection. For purposes of determining the complete sequence of all four bases in the sample polymer, this will generally involve the superposition of the normalized fragment patterns for each of the four bases. This can be done by designating a starting point or other "alignment point" in each fragment, and aligning those points to position the aligned fragment patterns. Alternatively, the fragments can be aligned using a reference peak as disclosed in US Patent Application Serial No. 08/452,719 filed May 30, 1995, which is incorporated herein by reference.
Fig. 8 illustrates the exercise of base-calling of aligned data, as obtained from a Pharmacia A.L.F. Sequencer and processed using HELIOS (tm) software. Such base-calling may be by any method known in the prior art, using aligned fragment patterns for each of the four bases to provide a complete sequence.
Generally in base-calling, there are two steps, peak detection and sequence correlation. The minimum value used in peak detection varies with each sequence and must be set on a per-run basis. The well-known Fast Fourier Transform version of correlation is used to speed its calculation.
Once the peak maxima are identified and located in time, potential ambiguities are identified. Absent any potential ambiguities, a sequential record of the nucleotide represented by each sequential peak concludes the base-calling exercise. In the diagnostic environment to which the present application applies, however, the nucleotide sequence record can be utilized to detect specific mutations. This can be accomplished in a variety of ways, including amino acid translation, identification of untranslated signal sequences such as start codons, stop codons or splice site junctions. A preferred method involves determining correlations of the normalized fragment patterns against a standard to obtain specific diagnostic information about the presence of mutations.
To perform this correlation, a region around each identified peak in the standard fragment pattern is correlated with the corresponding region in the normalized fragment pattern. After normalization, the correlation will be low in locations where the two sequences differ, i.e., where there is a nucleotide variation, because of the high degree of alignment which normalization makes possible. However, correlation of a region extending approximately 20 data points on either side of a peak is desirable to compensate for small discrepancies which may remain. The correlation process is then repeated for each peak in the normalized fragment pattern. Instances of low correlation for any peak are indicative of a mutation.
The correlation of the peaks of the normalized fragment pattern with the standard fragment pattern can be performed in several ways. One approach is to determine a standard correlation, using the equation for correlation shown above. When two discrete functions are correlated in this manner, a single number is obtained. This number ranges in value from zero to some arbitrarily large number, the value of which depends upon the two functions being correlated, but which is not predictable a priori. This can create a problem in setting threshold levels defining high versus low correlation. It is therefore preferable to use a measure of correlation which has defined limits to the range of possible values.
One such measure of correlation is called the "coefficient of correlation" which can be calculated using the formula

$$r = \frac{1}{n} \sum_{i=1}^{n} \frac{(f_i - \bar{f})(g_i - \bar{g})}{f_{std}\, g_{std}}$$

where f_std is the standard deviation of function f, g_std is the standard deviation of function g, and \bar{f} and \bar{g} are the mean values of f and g. In this case, the output is normalized to a value of between -1 and 1, inclusively. A value of 1 indicates total correlation, and a value of -1 indicates complete non-correlation. Using this method, a gradient of correlation is supplied, and values which are above a pre-defined threshold, e.g., 0.8, could be flagged as suspect.
An alternative to determining the coefficient of correlation is to use the function

[equation illegible in source]

where x and y are the data points of the two fragment patterns being compared. This equation provides only a rough correlation value, but relies on little computation to do so. Given the large values frequently encountered, the error in this equation may be acceptable.
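The coefficient of correlation and its use on a region around each peak can be sketched as below. The population form of the standard deviation, the function names, and the choice to flag peaks whose coefficient falls below the threshold (following the statement that low correlation indicates a mutation) are assumptions; the default of 20 data points on either side and the 0.8 value are taken from the text as examples only.

```python
import math


def correlation_coefficient(f, g):
    """Coefficient of correlation between two equal-length sequences."""
    n = len(f)
    f_mean = sum(f) / n
    g_mean = sum(g) / n
    f_std = math.sqrt(sum((v - f_mean) ** 2 for v in f) / n)
    g_std = math.sqrt(sum((v - g_mean) ** 2 for v in g) / n)
    if f_std == 0 or g_std == 0:
        return 0.0  # a flat region carries no shape to correlate
    cov = sum((a - f_mean) * (b - g_mean) for a, b in zip(f, g)) / n
    return cov / (f_std * g_std)


def low_correlation_peaks(normalized, standard, peaks, half_width=20, threshold=0.8):
    """Return (peak index, coefficient) for peaks that correlate poorly."""
    flagged = []
    for p in peaks:
        lo = max(0, p - half_width)
        hi = min(len(standard), p + half_width + 1)
        r = correlation_coefficient(normalized[lo:hi], standard[lo:hi])
        if r < threshold:
            flagged.append((p, r))
    return flagged
```

Because the coefficient always lies between -1 and 1, a single threshold can be applied to every peak regardless of signal amplitude.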
To return to the advantages of the invention, it is also noted that use of the standard fragment pattern allows the resolution of nucleotide sequence where ambiguities occur, such as compressions and loss of single nucleotide resolution. Thus, the present invention permits automated analysis of many of the ambiguities which are simply rejected as uninterpretable by known sequencing techniques and equipment.
It is found experimentally that it is not always necessary to obtain precise signal maxima for each nucleotide in order to determine the presence or absence of mutation in the patient sample. Localized areas which fail to clearly resolve into peaks under high speed electrophoresis can still carry enough wave form information to allow accurate interpretation of the presence or absence of mutation in the patient sample when the sample fragment pattern has been normalized in accordance with the invention.
"Compressions" are localized areas of fragment pattern anomalies wherein a series of bands in aligned data are not separated to the same degree as other nearby bands. Compressions are thought to result from short hairpin hybridizations at one end of the nucleic acid molecule which tend to cause a molecule to travel faster through an electrophoresis gel than would be expected on the basis of size. The resulting appearance in the fragment pattern is illustrated in Fig. 9. These compressions may consist of overlapping peaks within one lane that give one large peak, or they may be peaks from different lanes that overlap when combined together in the alignment process.
Normally, a base-calling method is not able to determine the number or order of the bases in the compression, because it is unable to distinguish the correct ordering of bands. Examination reveals a peak (Peak A) which is clearly wider than a singleton peak (Peaks B and C) but is otherwise indefinable. The method and system of the instant invention, however, assigns the correct order and the correct number of nucleotides based on what is known about the standard fragment pattern.
As stated hereinabove, the standard fragment pattern includes regions of compressions that are typical of a given nucleotide sequence. A compression can be characterized by the following features (Fig. 9):

Peak Height (Ph)
Peak Width at half Ph (Pw)
Peak Area (Pa, not shown)
Centering of Ph on Pw (Cnt, not shown)

A compression is characterized in the trial runs by these features, and the ratios between the features. An average and standard deviation is calculated for each ratio. The more precise and controlled the trial runs have been, the lower the standard deviation will be.
The inclusiveness of the standard deviation must be broad enough to encompass the degree of accuracy sought in base-calling. A standard deviation which includes only 90% of samples will permit miscalling in 10% of samples, a number which may or may not be too high to be usefully employed. Once ascertained, the compression statistics are recorded in association with the ambiguous peak. These statistics are associated with each compression and herein called a "standard compression."
Each standard compression can be assigned a nucleotide base sequence upon careful investigation. Researchers resolve compressions by numerous techniques, which though more cumbersome or less useful, serve to reveal the actual underlying nucleotide sequence. These techniques include: sequencing from primers nearer to the compression, sequencing the opposite strand of DNA, electrophoresis in more highly denaturing conditions, etc. Once the actual base sequence is determined, it can be assigned as a group to the compression, thus relieving the researcher from further time consuming exercises to resolve it.
Regions of the normalized fragment patterns which do not show discrete peaks for base-calling are tested for the existence of known compressions. If no compression is known for the region, the area is flagged for the human operator to examine as a possible new mutation.
On the other hand, if the locus of the peak indicates that it lies in a region of known compression, the ratios of the peak are determined as above. If the peak falls within the standard deviation of all the ratios determined from the trial runs, it is then assigned the sequence of the standard compression. Figs. 9A and 9B identify the actual nucleotides assigned to a standard compression.

Where, however, the ratios determined for the compression fall outside of the standard deviation, there lies the possibility of mutation. In this case, the ratios of the compression are compared to all known and previously observed mutations in the standard compression. If the compression falls within any of the previously identified mutations in the region, it may be identified as corresponding to such a mutation. If the ratios fall outside of any known standard, the area is flagged for examination by the human operator as an example of a possible new and hitherto unobserved mutation.
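The decision sequence for a peak in a region of known compression can be sketched as follows. Peaks are represented as dictionaries keyed by the four feature names, all features are assumed to be non-zero, and the tolerance `n_std` (how many standard deviations count as "within") is a hypothetical parameter; the text itself speaks only of falling within the standard deviation.

```python
import statistics
from itertools import combinations

FEATURES = ("Ph", "Pw", "Pa", "Cnt")


def feature_ratios(peak):
    """Ratios between every pair of features (assumes non-zero features)."""
    return {(a, b): peak[a] / peak[b] for a, b in combinations(FEATURES, 2)}


def build_profile(trial_peaks):
    """Average and standard deviation of each ratio over the trial runs."""
    ratios = [feature_ratios(p) for p in trial_peaks]
    return {
        key: (statistics.mean([r[key] for r in ratios]),
              statistics.pstdev([r[key] for r in ratios]))
        for key in ratios[0]
    }


def matches(peak, profile, n_std=1.0):
    """True if every ratio of the peak lies within n_std deviations."""
    ratios = feature_ratios(peak)
    return all(abs(ratios[k] - mean) <= n_std * std
               for k, (mean, std) in profile.items())


def classify_compression(peak, standard_profile, mutation_profiles, n_std=1.0):
    """Return 'standard', the name of a known mutation, or a flag."""
    if matches(peak, standard_profile, n_std):
        return "standard"
    for name, profile in mutation_profiles.items():
        if matches(peak, profile, n_std):
            return name
    return "flag for operator"
```

A peak classified as "standard" would then be assigned the base sequence recorded for the standard compression, while a flagged peak is passed to the human operator.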
A further application of the present invention is for base-calling beyond the limits of single nucleotide resolution. In this case, the standard fragment pattern will define a region where single nucleotide resolution is not observed. In cases of poor sequencing conditions and a weakly resolving apparatus, resolution may fail around 200 nts. In excellent conditions, some apparatus are known to produce read-lengths of over 700 nts. In either case, there is a point at which single nucleotide resolution is lost and base-calling cannot be performed accurately.
The instant invention relies on normalization using a standard fragment pattern to resolve the ambiguous wave forms beyond the limit of single nucleotide resolution. The method is essentially the same as a series of compression analyses as described hereinabove. The wave forms beyond the limit of resolution in the standard fragment pattern are measured and ratios between all features are calculated, to create an extended series of standard peaks.
Normalized fragment patterns, prepared as described hereinabove, may be sequentially analyzed for consistency with the expected ratios of each peak-like feature. Any wave form which does not fall within the parameters of the standard peaks is classified as anomalous and flagged for further investigation.
As noted above, in some applications of the invention it is not necessary to perform base-calling for all four nucleotide-specific channels in order to detect mutations. According to the instant invention, it is possible to compare any base specific experimental fragment pattern, for example the T lane of the patient sample, to the base specific standard fragment pattern for that T lane. The features of the standard fragment pattern can be used to identify differences within the test lane of the sample and thus provide information about the sample.
This aspect of the invention follows the normalization step described hereinabove. The degree of correlation of a window at the preferred normalization is plotted against the shifted origin data point of the window, effectively describing a cross-correlogram. Fig. 10 shows the maximum and minimum correlation values obtained across the entire length of the standard fragment pattern, as determined from a plurality of trial runs. A standard deviation can be determined after a sufficient number of trial runs. Data from a test sample is also plotted. As illustrated, one window is found to deviate substantially from its expected degree of correlation. The failure to correlate as expected suggests that the window contains a mutation or other difference from the standard. The system of the invention would cause such a window to be flagged for closer examination by the human operator. Alternatively, the window could be reported directly to the patient file for use in diagnosis. In a further alternative, the window would not be reported to the human operator until base-calling had further confirmed that there was a mutation present in the area represented by the window.
In general, the cross-correlogram mutation detection is a method of "single lane base-calling" wherein the signal from a single nucleotide run is used to identify the presence or absence of differences from the standard fragment pattern.
A useful embodiment of this aspect of the invention is for identification of infectious diseases in patient samples. Many groups of diagnostically-significant bacteria, viruses, fungi and the like all contain regions of DNA which are unique to an individual species, but which are nevertheless amplifiable using a single set of amplification primers due to commonality of genetic code within related species. Diagnostic tests for such organisms may not quickly distinguish between species within such groups.
Using the method of the invention, however, it is possible to quickly classify a sample as belonging to one species within a group on the basis of the standard fragment patterns for one selected type of nucleotide. This eliminates the need to base-call four lanes of nucleotides and effectively allows a DNA sequencing apparatus to run four times as many samples in the same time period as before.
Thus, in accordance with the present invention, there is provided a method for classifying a sample of a nucleic acid as a particular species within a group of commonly-amplifiable nucleic acid polymers. The method utilizes at least one sample fragment pattern representing the positions of a selected type of nucleic acid base within the sample nucleic acid polymer. For each commonly-amplifiable species within the group, a set of one or more normalization coefficients is determined for the sample fragment pattern. These sets of normalization coefficients are then applied to the sample fragment pattern to obtain a plurality of trial fragment patterns, which are correlated with the corresponding standard fragment patterns. The sample is classified as belonging to the species for which the trial fragment pattern has the highest correlation with its corresponding standard fragment pattern, provided that the correlation is over a pre-defined threshold.
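The classification procedure can be outlined as in the sketch below. The `normalize` and `correlate` callables stand in for the normalization and correlation steps described earlier and are supplied by the caller; they, the function name, and the default threshold of 0.8 are hypothetical placeholders.

```python
def classify_sample(sample, standards, normalize, correlate, threshold=0.8):
    """Pick the species whose standard pattern best matches the sample.

    `standards` maps a species name to its standard fragment pattern.
    Returns (species, score), with species None if no score exceeds threshold.
    """
    best_species, best_score = None, float("-inf")
    for species, standard in standards.items():
        trial = normalize(sample, standard)   # trial fragment pattern
        score = correlate(trial, standard)
        if score > best_score:
            best_species, best_score = species, score
    if best_score > threshold:
        return best_species, best_score
    return None, best_score
```

Only one lane per sample is needed for this comparison, which is the basis for the four-fold increase in throughput mentioned above.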
This aspect of the invention is useful in identifying which allele of a group of alleles is present in a gene. The method is also useful in identifying individual species from among a group of genetic variants of a disease-causing microorganism, and in particular genetic variants of human immunodeficiency virus.
A further variation of the invention which may be useful in certain conditions is the reduction of the experimental and standard fragment patterns into square wave data. Square wave data is useful when the signal obtained is highly reproducible from run to run. The main advantage of a square wave data format is that it includes a maximum of information content and a minimum of noise.
The standard fragment pattern may be reduced to a square wave by a number of means. In one method, the transition from zero to one occurs at the inflection point on each slope of a peak. The inflection points are found by using the zero crossings of a function that is the convolution of the data function with a function that is the second derivative of a gaussian pulse that is about one half the width of a single base pair pulse in the original data sequence. This derives inflection points with relatively little addition of noise due to the differentiation process. Any data point value greater than the inflection point on that slope of the peak is assigned 1. Any value below the inflection point is assigned 0.
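A rough sketch of this reduction is given below. It assumes that the region between the two inflection points of a peak is where the smoothed second derivative is negative, that the kernel is truncated at three standard deviations, and that `pulse_width` is the standard deviation of a single base pair pulse; the small tolerance that keeps flat baseline at 0 is also an illustrative choice.

```python
import math


def second_derivative_gaussian(sigma):
    """Sampled second derivative of a gaussian, truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    return [((j * j) / sigma ** 4 - 1 / sigma ** 2)
            * math.exp(-(j * j) / (2 * sigma ** 2))
            for j in range(-radius, radius + 1)]


def to_square_wave(data, pulse_width):
    """Assign 1 between the inflection points of each peak, 0 elsewhere."""
    kernel = second_derivative_gaussian(pulse_width / 2.0)
    radius = len(kernel) // 2
    n = len(data)
    curvature = []
    for i in range(n):
        total = 0.0
        for j, k in enumerate(kernel):
            src = i - (j - radius)
            if 0 <= src < n:          # data outside the trace is taken as zero
                total += data[src] * k
        curvature.append(total)
    tol = 1e-6 * max((abs(v) for v in curvature), default=0.0)
    # Zero crossings of `curvature` mark the inflection points.
    return [1 if v < -tol else 0 for v in curvature]
```

Smoothing and differentiating in a single convolution is what limits the noise that a direct numerical second derivative would add.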
The peaks on the square wave are identified and assigned nucleotide sequences. Peaks may be assigned one or more nucleotides as determined by the human operator on the basis of the standard fragment pattern. Peaks are then given identifying characteristics such as a sequential peak number, a standard peak width, a standard gap width on either side of the peak and standard deviations with these characteristics.
When a sample fragment pattern is obtained, it is reduced to a square wave format, again on the basis of the inflection point data as described above. Peak numbers are assigned.
The sample square wave may then be used in different ways to identify mutations. In one method, it may be used to align the four different nucleotide data streams as in the method of the invention described hereinabove. Alternatively, analysis may be purely statistical. The peak width and gap width of sample can be directly compared to the standard square wave. If the sample characteristics fall within the standard deviation of the standard, taking into account p[illegible] elasticity of the peaks, then the sample is concluded to be the same as the standard. If the peaks of the sample can not be fit within the terms of the standard, then the presence of a mutation is concluded and reported.
The present invention is advantageously implemented using any multipurpose computer, including those generally referred to as personal computers and mini-computers, programmed to determine normalization coefficients by comparison of an experimental and a standard fragment pattern. As shown schematically in Fig. 11, such a computer will include at least one central processor 110, for example an Intel 80386, 80486, or Pentium(R) processor or Motorola 68030, Motorola 68040 or Power PC 601, a storage device, such as a hard disk 111, for storing standard fragment patterns, and means for receiving raw or clean experimental fragment patterns such as wire 112 shown connected to the output of an electrophoresis apparatus 113.
The processor 110 is programmed to perform the comparison of the experimental fragment pattern and the standard fragment pattern and to determine normalization coefficients based on the comparison. This programming may be permanent, as in the case where the processor is a dedicated EEPROM, or it may be transient, in which case the programming instructions are loaded from the storage device or from a floppy diskette or other transportable media.
The normalization coefficients may be output from the computer in print form using printer 114; on a video display 115; or via a communications link 116 to another processor 117. Alternatively or additionally, the normalization coefficients may be utilized by the processor 110 to normalize the experimental fragment pattern for use in base-calling or other diagnostic evaluation. Thus, the apparatus may also include programming for applying the normalization coefficients to the experimental fragment pattern to obtain a normalized fragment pattern, and for aligning the normalized fragment patterns and evaluating the nucleic acid sequence of the sample therefrom.

Claims

1. A method for determining the sequence of bases in a sample nucleic acid polymer putatively having a known sequence comprising the steps of:
(a) obtaining at least one raw fragment pattern representing the positions of one selected type of nucleic acid base within the sample nucleic acid polymer as a function of migration time or distance;
(b) conditioning the raw fragment pattern to obtain a clean fragment pattern;
(c) determining one or more normalization coefficients for the clean fragment pattern, said normalization coefficients being selected to provide a high degree of correlation between (i) a normalized fragment pattern obtained by applying the normalization coefficients to the clean fragment pattern and (ii) a standard fragment pattern representing the positions of the selected type of nucleic acid base in a standard nucleic acid polymer actually having the known sequence;
(d) applying the normalization coefficients to the clean fragment pattern to obtain the normalized fragment pattern; and
(e) evaluating the normalized fragment pattern to determine positions of at least the selected type of base within the sample nucleic acid polymer.

2. A method according to claim 1, wherein the normalization coefficients are coefficients of a second- or higher-order polynomial.

3. A method according to claim 1 or 2, wherein the normalization coefficients are determined using Genetic Algorithms.

4. A method according to any of claims 1 - 3, wherein the clean data fragment is divided into a plurality of windows, and wherein separate normalization coefficients are determined for each window.

5. A method according to claim 4, wherein each window contains 100-10,000 data points.

6. A method according to any of claims 1 - 5, wherein the clean data fragment is obtained using a feed-back loop to obtain a preferred band-pass filter.

7. A method according to any of claims 1 - 6, wherein the step of evaluating the normalized fragment pattern to determine positions of at least the selected type of base resolves a non-singleton peak in the normalized fragment pattern by statistical comparison of measurements of the non-singleton peak with standard values associated with a corresponding peak in the standard fragment pattern.

8. A method according to any of claims 1 - 6, wherein the clean fragment pattern and standard fragment pattern are reduced to square waves.

9. A method according to any of claims 1 - 8, wherein four raw fragment patterns are obtained, one for each nucleic acid base, and four normalized patterns are produced, further comprising the step of aligning the four normalized fragment patterns and then evaluating the aligned normalized fragment patterns to determine the positions of all base types in the sample nucleic acid polymer.

10. A method according to any of claims 1 - 8, wherein only one raw fragment pattern is obtained, and the positions of only the selected type of base within the sample nucleic acid polymer are determined.

11. A method according to claim 10, wherein the positions of the selected type of base within the sample nucleic acid polymer are determined by comparing the normalized fragment pattern to the standard fragment pattern and noting the presence or absence of each peak.

12. A method for evaluating the sequence of a nucleic acid polymer putatively having a known sequence wherein oligonucleotide fragments reflecting the position of nucleic acid bases within the nucleic acid polymer are separated in space or time and then detected as a fragment pattern which is evaluated to determine the sequence of the nucleic acid polymer, characterized by the steps of:
(a) determining one or more normalization coefficients for the fragment pattern, said normalization coefficients being selected to provide a high degree of correlation between (i) a normalized fragment pattern obtained by applying the normalization coefficients to the fragment pattern and (ii) a standard fragment pattern representing the positions of the selected nucleic acid base in a standard nucleic acid polymer actually having the known sequence; and
(b) applying the normalization coefficients to the fragment pattern prior to evaluation of the fragment pattern to determine the sequence of the nucleic acid polymer.

13. A method according to claim 12, wherein the normalization coefficients are coefficients of a second- or higher-order polynomial.

14. A method according to any of claims 12 or 13, wherein the normalization coefficients are determined using Genetic Algorithms.

15. A method according to any of claims 12 - 14, wherein the fragment pattern is divided into a plurality of windows, and wherein separate normalization coefficients are determined for each window.

16. A method for detecting mutations in a sample nucleic acid polymer having a putatively normal genetic sequence comprising the steps of:
(a) obtaining at least one sample fragment pattern representing the positions of a selected nucleic acid base within the sample nucleic acid polymer;
(b) determining one or more normalization coefficients for the sample fragment pattern, said normalization coefficients being selected to provide a high degree of correlation between (i) a normalized fragment pattern obtained by applying the normalization coefficients to the sample fragment pattern and (ii) a standard fragment pattern representing the positions of the selected nucleic acid base in a standard nucleic acid polymer actually having the known sequence;
(c) applying the normalization coefficients to the sample fragment pattern to obtain the normalized fragment pattern;
(d) dividing the normalized fragment pattern into a plurality of windows; and
(e) determining the correlation between each window and the standard fragment pattern; wherein a difference between the correlation for any window and predetermined standard correlation values reflects the presence of a mutation within that window.

17. A method according to claim 16, wherein the normalization coefficients are coefficients of a second- or higher-order polynomial.

18. A method according to claim 16 or 17, wherein the normalization coefficients are determined using Genetic Algorithms.
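
By way of illustration only, steps (d) and (e) of claim 16 might be sketched as follows. This is a hedged sketch, not the claimed implementation: the normalized and standard fragment patterns are assumed to be equal-length lists of intensities, correlation is taken to be the Pearson coefficient, and the "predetermined standard correlation values" are reduced to a single assumed threshold, min_corr. The names pearson, flag_windows and peaks are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def flag_windows(normalized, standard, size, min_corr):
    """Return the indices of windows whose correlation with the matching
    window of the standard pattern falls below the assumed threshold."""
    flagged = []
    for index, start in enumerate(range(0, len(standard), size)):
        r = pearson(normalized[start:start + size],
                    standard[start:start + size])
        if r < min_corr:
            flagged.append(index)
    return flagged

def peaks(centres, n=80, sigma=2.0):
    """Synthetic single-base fragment pattern: one Gaussian peak per centre."""
    return [sum(math.exp(-((t - c) ** 2) / (2 * sigma ** 2)) for c in centres)
            for t in range(n)]

standard = peaks([10, 30, 50, 70])
sample = peaks([10, 30, 70])  # the peak near position 50 is absent
print(flag_windows(sample, standard, 20, 0.9))
```

Here the synthetic sample lacks one peak of the selected base, so only the window that would contain that peak is expected to be flagged. The window size and threshold are illustrative values, not values taken from the specification.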

19. An apparatus for normalizing an experimental nucleic acid fragment pattern putatively representing a known nucleic acid sequence comprising:
(a) a computer processor;
(b) a storage device having stored thereon a standard fragment pattern for a nucleic acid polymer actually having the known sequence;
(c) means for receiving the experimental nucleic acid fragment pattern; and
(d) means for causing the computer processor to determine one or more normalization coefficients for the experimental fragment pattern, said normalization coefficients being selected to provide a high degree of correlation between (i) a normalized fragment pattern obtained by applying the normalization coefficients to the experimental fragment pattern and (ii) the standard fragment pattern.

20. An apparatus according to claim 19, wherein the means for causing the computer processor to determine one or more normalization coefficients is a program stored on the storage device.

21. An apparatus according to claim 19 or 20, wherein the normalization coefficients are coefficients of a second- or higher-order polynomial.

22. An apparatus according to any of claims 19 - 21, wherein the normalization coefficients are determined using Genetic Algorithms.

23. An apparatus according to any of claims 19 - 22, further comprising means for applying the normalization coefficients to the experimental fragment pattern to obtain a normalized fragment pattern.

24. An apparatus according to claim 23, further comprising means for aligning the normalized fragment patterns and evaluating the nucleic acid sequence of the experimental fragment pattern therefrom.

25. An apparatus according to any of claims 19 - 24, further comprising an electrophoresis apparatus, operatively coupled to the means for receiving an experimental fragment pattern.

26. An apparatus according to any of claims 19 - 25, wherein the means for causing the computer processor to determine one or more normalization coefficients is a program stored on the storage device.

27. A method for classifying a sample of a nucleic acid as a particular species within a group of commonly-amplifiable nucleic acid polymers, comprising the steps of:
(a) obtaining at least one sample fragment pattern representing the positions of a selected nucleic acid base within the sample nucleic acid polymer;
(b) for each commonly-amplifiable species within the group, determining a set of one or more normalization coefficients for the sample fragment pattern, said normalization coefficients being selected to provide a high degree of correlation between (i) a normalized fragment pattern obtained by applying the normalization coefficients to the sample fragment pattern and (ii) a standard fragment pattern representing the positions of the selected nucleic acid base within a standard nucleic acid polymer actually belonging to one of the commonly amplifiable species;
(c) applying the sets of normalization coefficients to the sample fragment pattern to obtain a plurality of trial fragment patterns; and
(d) correlating the trial fragment patterns with the corresponding standard fragment patterns, wherein the sample is classified as belonging to the species for which the trial fragment pattern has the highest correlation with its corresponding standard fragment pattern, provided that the correlation is over a pre-defined threshold.

28. A method according to claim 27, wherein the group of commonly-amplifiable nucleic acid polymers comprises a plurality of alleles of a single gene.

29. A method according to claim 27, wherein the group of commonly-amplifiable nucleic acid polymers comprises a plurality of genetic variants of a disease-causing microorganism.

30. A method according to claim 29, wherein the disease-causing microorganism is human immunodeficiency virus.
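
By way of illustration only, the classification of claim 27 might be sketched as follows. This is a simplified sketch under stated assumptions, not the claimed implementation: normalization is reduced to a whole-sample displacement (a single coefficient), correlation is taken to be the Pearson coefficient, and the species names, peak positions and threshold are invented for the example. The names pearson, shift, best_correlation, classify and peaks are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def shift(pattern, s):
    """Trial pattern: output sample t is input sample t + s (0.0 out of range)."""
    n = len(pattern)
    return [pattern[t + s] if 0 <= t + s < n else 0.0 for t in range(n)]

def best_correlation(sample, standard, shifts):
    """Highest correlation obtainable against one standard over the trial shifts."""
    return max(pearson(shift(sample, s), standard) for s in shifts)

def classify(sample, standards, shifts, threshold):
    """Return the species whose standard correlates best with its trial
    pattern, or None if even the best correlation is under the threshold."""
    scores = {name: best_correlation(sample, std, shifts)
              for name, std in standards.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] >= threshold else None

def peaks(centres, n=100, sigma=2.0):
    """Synthetic single-base fragment pattern: one Gaussian peak per centre."""
    return [sum(math.exp(-((t - c) ** 2) / (2 * sigma ** 2)) for c in centres)
            for t in range(n)]

# Two invented species differing in the position of one peak.
standards = {"allele_A": peaks([20, 50, 80]), "allele_B": peaks([20, 40, 80])}
sample = peaks([22, 42, 82])  # allele_B pattern displaced by two samples
print(classify(sample, standards, range(-5, 6), 0.9))
```

A sample whose best correlation against every standard stays under the threshold is left unclassified in this sketch, which mirrors the proviso at the end of claim 27.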