WO2017075179A1 - Sequencing by deconvolution

Info

Publication number: WO2017075179A1
Application number: PCT/US2016/059058
Authority: WO
Inventors: William Roy Glover, III; Stephen J. Lippard; Matthew MACMANES; Michael Schatz; William Kelley THOMAS
Original assignee: Zs Genetics, Inc.
Priority date: 2015-10-27
Filing date: 2016-10-27
Publication date: 2017-05-04

Abstract

Methods of identifying, sequencing and/or detecting nucleic acid polymers are disclosed, including deconvolution of sequence information.

Description

SEQUENCING BY DECONVOLUTION

RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. provisional application serial number 62/246,638, filed October 27, 2015, the disclosure of which is incorporated by reference herein in its entirety.

FIELD

Provided herein are methods for nucleic acid molecule identification, sequencing and/or detection, and more specifically, methods of identifying, sequencing and/or detecting nucleic acid molecules using a particle (e.g., electron) beam.

BACKGROUND

Current high-throughput nucleic acid sequencing techniques fall into one of two types: short-read sequencing and long-read sequencing.

Short-read sequencing provides read lengths that are inadequate for interrogation of key sections of the human genome. Short-read sequencing also requires massive amounts of data storage and results in significant bioinformatics challenges due to the need to resolve and stitch together the sequences of the many short reads. The bioinformatics required for short-read sequencing can take months to complete, resulting in very high cost and time burdens. Moreover, data gaps in short-read sequencing prevent researchers from tackling key sequencing issues like structural variance.

Long-read sequencing, on the other hand, allows for analyzing substantially longer fragments of DNA. However, high raw error rates in long-read sequencing mean that the final sequence frequently is inaccurate. Moreover, the study of key areas such as de novo assembly is made extremely difficult. Long-read sequencing also has high sequencing costs compared to short-read sequencing. The inherent technical limitations of long-read sequencing impose high bioinformatics and cost burdens for functional use, thereby limiting the ability of researchers to utilize the promise of long-read technology.

SUMMARY

Methods of identifying, sequencing and/or detecting nucleic acid molecules are provided herein that overcome the problems of existing short-read and long-read sequencing techniques.

According to one aspect, methods of determining the sequence of a nucleic acid molecule are provided. The methods include forming a first complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases; forming at least one second complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases in which at least one, but not all, of the heavy atom labeled bases are the same as used in forming the first complementary strand of the nucleic acid molecule; and identifying a sequence of bases in the first complementary strand and/or in the at least one second complementary strand using a particle beam.

According to another aspect, methods of determining the sequence of a nucleic acid molecule are provided. The methods include forming a first complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases; optionally forming at least one second complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases in which none of the heavy atom labeled bases are the same as used in forming the first complementary strand of the nucleic acid molecule; and identifying a sequence of bases in the first complementary strand and/or in the at least one second complementary strand using a particle beam.

In some embodiments, the methods further include comparing the pattern of heavy atom labeled bases in the first complementary strand with the pattern of heavy atom labeled bases in the at least one second complementary strand, if formed.

In some embodiments, the nucleic acid molecule is single stranded or double stranded DNA or single stranded or double stranded RNA.

In some embodiments, the complementary strand is DNA or RNA.

In some embodiments, the nucleic acid molecule and/or its complementary strand is formed by a nucleic acid polymerase enzyme.

In some embodiments, the complementary strand of the nucleic acid molecule is formed using polymerase chain reaction (PCR) or extension of a primer that hybridizes to the nucleic acid molecule. In some embodiments, base specific labels are incorporated in the nucleic acid molecule and/or the complementary strand during formation of the nucleic acid molecule and/or the first and/or at least one second complementary strands, or bases comprising attachment sites are incorporated in the nucleic acid molecule and/or the first and/or at least one second complementary strands and are modified after formation by covalently bonding heavy atom labels at the attachment sites.

In some embodiments, a single type of heavy atom label is used. In some embodiments, the heavy atom label is or contains platinum. In some embodiments, at least two types of bases are labeled with the same type of heavy-atom label. In some embodiments, one type of base is labeled in the first complementary strand and/or the at least one second complementary strand. In some embodiments, two types of bases are labeled in the first complementary strand and/or the at least one second complementary strand. In some embodiments, three types of bases are labeled in the first complementary strand and/or the at least one second complementary strand.

In some embodiments, the step of identifying a sequence of bases comprises generating a particle beam, exposing the first complementary strand and/or the at least one second complementary strand to the particle beam, and identifying the bases due to characteristic changes to the particle beam. In some embodiments, the step of identifying the bases comprises detecting characteristic changes to the particle beam. In some embodiments, the particle beam is an electron beam.

In some embodiments, the step of identifying the sequence of bases includes deconvolution that includes comparing the pattern of labeled bases in a first labeled complementary strand with the pattern of bases in at least one second labeled complementary strand, wherein the first labeled complementary strand and the at least one second labeled complementary strand are labeled on different sets of one or more bases.

In some embodiments, the patterns of labeled bases in 2, 3, 4, 5, 6, 7, 8, 9, 10, or more second complementary strands are compared.

Other aspects, embodiments and features of the invention will become apparent from the following detailed description when considered in conjunction with the accompanying drawing(s). The accompanying figures are schematic and are not intended to be drawn to scale. In the figures, each identical, or substantially similar component that is illustrated in various figures may be represented by a single numeral or notation (though not always). For purposes of clarity, not every component is labeled in every figure. Nor is every component of each embodiment shown where illustration is not necessary to allow those of ordinary skill in the art to understand the invention. All patent applications and patents incorporated herein by reference are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the use of two different labeling systems for sequence analysis by deconvolution of data obtained using electron microscopy. Depicted are results of two-track label detection in otherwise identical molecules.

FIG. 2 shows the use of three different labeling systems for sequence analysis by deconvolution of data obtained using electron microscopy. Depicted are results of three-track label detection in otherwise identical molecules, and application to error checking and redundancy.

FIG. 3 shows the use of three different labeling systems for sequence analysis by deconvolution of data obtained using electron microscopy, for enhanced disambiguation of consecutive unlabeled bases. Depicted are results of three-track label detection in otherwise identical molecules, and application to error checking and redundancy.

FIG. 4 shows the use of a single track labeling system for sequence analysis by deconvolution of data obtained using electron microscopy, for enhanced disambiguation of consecutive unlabeled bases.

FIG. 5 shows the use of two different labeling systems for sequence analysis by deconvolution of data obtained using electron microscopy, for enhanced disambiguation of the number of consecutive unlabeled bases. Depicted are results of two-track label detection in otherwise identical molecules.

DETAILED DESCRIPTION

Methods of identifying, sequencing and/or detecting nucleic acid molecules are provided herein that overcome the problems of existing short-read and long-read sequencing techniques. The methods provide a combination of electron microscope-based nucleic acid sequencing with the use of a single type of label to label more than one type or combinations of base(s) on the same molecule, followed by data deconvolution to provide nucleic acid sequence. As used herein, "deconvolution" (or "deconvolute") means using more than one track of data to infer information not directly available from evaluation of a single track.

In exemplary methods, nucleic acid molecules (DNA, RNA) are labeled using heavy atoms that can be visualized using an electron microscope. A nucleic acid molecule can be labeled by synthesizing the nucleic acid molecule by enzyme-based amplification (e.g., PCR) using a specific set of primers for the enzyme-based amplification, or primer extension using a primer that hybridizes to the nucleic acid molecule. In some embodiments, primer extension is carried out by first isolating single stranded DNA, followed by primer hybridization to the single stranded DNA, followed by extension of the primer by a polymerase using labeled bases. During amplification or primer extension, labeled bases are added to the newly synthesized complementary strand of the nucleic acid molecule.

Alternatively, during amplification or primer extension, bases having attachment points are added to the newly synthesized complementary strand of the nucleic acid molecule, and heavy atom labels subsequently are attached to the attachment points. Either method provides complementary strands of the nucleic acid molecules that are labeled at specific bases.

Heavy-atom labeling is accomplished in certain embodiments using a single, heavy-atom label to label 1, 2 or 3 bases. This permits analysis of sequence in a manner analogous to Maxam-Gilbert sequencing, in which more than one base type was identified under certain analytical conditions. A preferred heavy-atom label is or includes platinum (Pt), which provides high-contrast images via electron microscope imaging, but other atoms can be used, such as: Cl, Br, I, U, Os, Pb, Au, Ag, Fe, Eu, Pd, Co, Hg, Gd, Cd, Zn, Ac, W, Mo, Mn, Rb, Cs, Ra, Ba, or Sr. Labels can include one, two, three, four, or more heavy atoms.

The methods include the use of one, two or more labeling combinations for nucleic acid molecules that are nominally identical, representing the same region of the genome of interest. Such molecules can have a selected sequence of a complementary strand that is amplified by PCR using the same set of primer molecules or produced by primer extension, or molecules of cells selected from a multi-cell sample that are nominally identical, or from multiple copies of a single cell or single particle (viral) organism. As described below, advantageously, when using two or more labeling combinations, the labeling combinations include zero, one or two labeled base types in common. For example, in a first labeling combination, bases C and T of a complementary strand of the nucleic acid molecule are labeled with heavy atoms. In a second labeling combination, bases A and T of the same type of complementary strand of the nucleic acid molecule are labeled with heavy atoms. The two differently-labeled complementary strands of the nucleic acid molecule, which are identical other than the positions of the labels at one or more bases, are then analyzed as described herein to determine the pattern of labeled bases.
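As a minimal sketch of the example above, and assuming an idealized readout in which every labeled base is detected, the two labeling combinations can be modeled in Python as binary tracks in which 1 marks a position carrying a heavy atom label and 0 marks a dark position. The function name `label_track` and the example strand are hypothetical and are used for illustration only.

```python
def label_track(strand, labeled_bases):
    """Return a binary string: '1' where the base type is labeled, '0' otherwise."""
    labeled = set(labeled_bases)
    return "".join("1" if base in labeled else "0" for base in strand)

# Hypothetical complementary strand, used for illustration only.
strand = "GATTCAG"

track_ct = label_track(strand, "CT")  # first combination: C and T labeled
track_at = label_track(strand, "AT")  # second combination: A and T labeled

print(track_ct)
print(track_at)
```

The two strings differ only where C or A occurs, which is the information that a later comparison of the tracks can exploit.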

In other embodiments, more than two labeling combinations, out of the six possible couplet possibilities, are used. In other embodiments, one of the four possible variations of using three labeled bases at once are used as labeling combinations. The advantages of such embodiments include reducing the mean distance between labeled bases, thereby reducing "dark base spatial ambiguity."
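The counts of six couplets and four triplets, and the effect of labeling more base types on the mean distance between labeled bases, can be checked with a short Python sketch. The helper `mean_label_spacing` and the example strand are hypothetical, and the spacing is expressed in bases rather than in physical units.

```python
from itertools import combinations

BASES = "ACGT"

# Enumerate the possible two-base and three-base labeling combinations.
couplets = ["".join(c) for c in combinations(BASES, 2)]
triplets = ["".join(c) for c in combinations(BASES, 3)]

def mean_label_spacing(strand, labeled_bases):
    """Mean distance, in bases, between consecutive labeled positions."""
    positions = [i for i, base in enumerate(strand) if base in labeled_bases]
    if len(positions) < 2:
        return None  # spacing is undefined with fewer than two labels
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

strand = "GATTCAGGCA"  # hypothetical sequence, for illustration only
print(couplets)  # the six couplets
print(triplets)  # the four triplets
print(mean_label_spacing(strand, "A"))    # one base type labeled
print(mean_label_spacing(strand, "ACT"))  # three base types labeled
```

For this example strand, labeling three base types gives a smaller mean spacing than labeling one, which corresponds to shorter dark runs to be estimated.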

In certain embodiments, one base type is labeled. As used herein, "base" or "base types" mean enzymologically or chemically distinct base pair types, whether that difference is in the base, sugar, phosphate or analogous regions. In some embodiments, the base type can be any one of A, C, G, or T/U. In these embodiments, one track having one base type labeled in a complementary strand is analyzed for sequence data, e.g., only A, only C, only G, or only T/U. As used herein, "track" (or alternatively "lane") refers to a run of sequence of a labeled complementary strand that labels one, two, or more base types. More than one track, each of which has only one base labeled, can be combined for additional sequence information.

In other embodiments, two base types are labeled. In some embodiments, the base types labeled in a complementary strand are one pair from among the following combinations of two bases: A+C, A+G, A+T/U, C+G, C+T/U, or G+T/U. In some embodiments, one track having two base types labeled in a complementary strand is analyzed for sequence data, e.g., A+C, A+G, A+T/U, C+G, C+T/U, or G+T/U. However, in other embodiments, more than one track (e.g., two, three or four tracks), each of which has only two bases labeled in a complementary strand, can be combined for additional sequence information. See Figure 1 for an example of two tracks each having two bases labeled in a complementary strand. In some embodiments wherein more than one track is run, there is overlap between the two or more tracks (see Figure 1, wherein T is labeled in both tracks). As used herein, "overlap" means that in two or more tracks, at least one base type is labeled in each track. However, in other embodiments wherein more than one track is run, there is no overlap between the two or more tracks (see Figure 5, wherein no base is labeled in both tracks). In embodiments where there is an overlap or more than one overlap between and among tracks, examples of such overlap(s) include the following: overlap between track one and track two; overlaps between track two and track three; overlap between track one and track three; overlaps between track one and track two, and track one and track three; overlaps between track one and track two, and track two and track three; and overlaps between track one and track three, and track two and track three.
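Overlap as defined above amounts to a set intersection of the base types labeled in each track. The following Python sketch lists the shared base types for every pair of tracks in a hypothetical three-track design; the track names and the function `track_overlaps` are assumptions made for illustration.

```python
from itertools import combinations

def track_overlaps(tracks):
    """Map each pair of track names to the base types labeled in both tracks."""
    return {
        (name_a, name_b): set(tracks[name_a]) & set(tracks[name_b])
        for name_a, name_b in combinations(tracks, 2)
    }

# Hypothetical three-track design.
tracks = {"track1": "CT", "track2": "AT", "track3": "AG"}
overlaps = track_overlaps(tracks)
for pair, shared in overlaps.items():
    print(pair, sorted(shared) if shared else "no overlap")
```

In this design, track one and track two overlap at T, track two and track three overlap at A, and track one and track three do not overlap.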

In still other embodiments, three base types are labeled. In some embodiments, the base types labeled in a complementary strand are one set from among the following combinations of three bases: A+C+G, A+C+T/U, C+G+T/U, or A+G+T/U. In some embodiments, one track having three base types labeled in a complementary strand is analyzed for sequence data, e.g., A+C+G, A+C+T/U, C+G+T/U, or A+G+T/U. In such cases, the different labels provide sequence information based on base labeling, but the gaps in labeling also provide sequence information. See Figure 4 for an example of one track having three bases labeled in a complementary strand, in which sequence information for the non-labeled base is obtained.

However, in other embodiments, more than one track (e.g., two tracks), each of which has three bases labeled can be combined for additional sequence information. See Figure 1 for an example of two tracks each having two bases labeled. In some embodiments wherein more than one track is run, there is overlap between the two or more tracks (see Figure 1, wherein T is labeled in both tracks).

In still other embodiments, more than one track having one, two or three bases labeled can be combined for additional sequence information. See Figure 3 for an example of two tracks each having two bases labeled combined with one track having three bases labeled. Sequence information from such asymmetric tracks can be combined for sequence determination, error checking, disambiguation, etc.

It also is possible to label four bases or zero bases for controls (positive or negative), error checking, disambiguation, etc.

Sequence information obtained from the tracks can be deconvoluted as follows: (1) comparing where the labeled bases/positions match; (2) comparing where the unlabeled bases match; (3) comparing where the labeled bases match with unlabeled bases; and (4) using labeled spaces to reduce estimates of unlabeled spaces (e.g., homopolymer dark regions). Based on the sequence information for the complementary strand obtained in this manner, sequence information for both strands of the nucleic acid molecule is obtained.
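Comparisons (1) through (3) can be sketched in Python as an intersection of candidate sets: a label at a position restricts that position to the base types labeled in that track, and a dark position restricts it to the base types not labeled in that track. This is a simplified illustration that assumes the tracks are already aligned position by position and are error free, so it does not address estimation of dark run lengths in item (4). The names `deconvolute` and `to_sequence` are hypothetical.

```python
ALL_BASES = frozenset("ACGT")

def deconvolute(tracks):
    """Infer base candidates per position from several binary tracks.

    `tracks` is a list of (labeled_bases, pattern) pairs, where `pattern`
    is a string of '1' (label seen) and '0' (no label seen). All patterns
    are assumed to be aligned and of equal length.
    """
    length = len(tracks[0][1])
    if any(len(pattern) != length for _, pattern in tracks):
        raise ValueError("tracks must be aligned and of equal length")
    result = []
    for i in range(length):
        candidates = set(ALL_BASES)
        for labeled_bases, pattern in tracks:
            labeled = set(labeled_bases)
            # A label restricts the position to the labeled types;
            # a dark position restricts it to the unlabeled types.
            candidates &= labeled if pattern[i] == "1" else ALL_BASES - labeled
        result.append(candidates)
    return result

def to_sequence(candidate_sets):
    """Render resolved positions as a base and all other positions as 'N'."""
    return "".join(next(iter(c)) if len(c) == 1 else "N" for c in candidate_sets)

# Hypothetical observations from a C+T track and an A+T track.
tracks = [("CT", "0011100"), ("AT", "0111010")]
print(to_sequence(deconvolute(tracks)))
```

A position that remains ambiguous, or for which the tracks contradict one another, is rendered as N in this sketch rather than being guessed.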

Error checking can be included in any of the embodiments described herein, such as by comparing different tracks having overlapping labeling (e.g., see Figures 1, 2 or 3) or non-overlapping labeling (e.g., see Figure 5).

In any of the embodiments described herein, the coverage ratio (e.g., the average number of reads of a given range of bases in a sequence of interest) can be increased to build contig length, disambiguate spatial error, and/or reduce error via obtaining consensus sequence information.
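One simple way to obtain consensus information from increased coverage is a per-position majority vote over repeated reads of the same track. The Python sketch below assumes that the reads are already aligned and of equal length; the function name `consensus` and the example reads are hypothetical, and majority voting is only one of many possible consensus rules.

```python
from collections import Counter

def consensus(reads):
    """Majority vote per position over aligned, equal-length binary reads."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be aligned and of equal length")
    called = []
    for column in zip(*reads):
        # most_common(1) returns the most frequent symbol in this column.
        called.append(Counter(column).most_common(1)[0][0])
    return "".join(called)

# Five hypothetical reads of one track; two reads each contain one error.
reads = ["0011100", "0011000", "0011100", "0111100", "0011100"]
print(consensus(reads))
```

An odd number of reads avoids ties in this sketch; a fuller treatment would also need to handle insertions and deletions arising from spatial error.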

In an exemplary method, the labeled nucleic acid molecules (i.e., complementary strand(s)) are "linearized" (i.e., aligned) to form a nucleic acid molecule array on an imaging surface. High-throughput methods for performing linearization include nano-confinement of labeled nucleic acid molecules and molecular combing. In certain embodiments, the nucleic acid molecules are substantially straightened prior to identifying the sequence.

In the exemplary method, the linearized labeled nucleic acid molecules are then imaged using a high-resolution electron microscope, followed by image processing to translate images into sequence data, and long-molecule reassembly to determine the sequence of the DNA molecules.

Identifying the bases in the complementary strand of the nucleic acid molecules includes interpreting changes in the particle beam resulting from interactions with the heavy atom labels attached to the bases to detect the bases in the complementary strand(s) of the nucleic acid molecule(s), whereby the sequence of the nucleic acid molecule is determined. Preferably the bases are labeled as described herein. The changes in the particle beam include changes in absorbance, reflection, deflection, energy or direction. The changes in the particle beam also can be changes in a spatial pattern, for example, a one dimensional pattern, a two dimensional pattern or a three dimensional pattern.

Analysis can be performed first on multiple molecules within a single track, with the data from multiple tracks then being combined. Alternatively, information from multiple tracks is analyzed simultaneously. Alternatively, tracks can be evaluated singly, or in combination, at different points in the analytic process.

The identity of a particular base type is determined in part or completely by label association, position relative to other known locations/bases, or determined by a combination of inspection and inference. In further embodiments, the methods also include attaching the complementary strand(s) of the nucleic acid molecule(s) to a substrate. Preferably the attachment is by nucleic acid sequence-specific molecules, which preferably are oligonucleotides. In other preferred embodiments, the substrate is derivatized to provide attachment points that are sequence non-specific. The complementary strand(s) of the nucleic acid molecule(s) can be attached to the substrate in a grid pattern. Preferably the substrate includes a carbon thin film.

In preferred embodiments, Maxam-Gilbert style analysis utilizing a single, heavy-atom label on two or more combinations of labels (1, 2, or 3 types of bases being labeled) provides for highly efficient sample preparation and electron microscope processing and sequencing. Benefits of this sequencing technique, which is exemplified below, include simplified (binary) detection of label presence, and enzymological simplicity which permits analysis of very long molecules (50kb+).

As used herein, "heavy atom labels" are one heavy atom or a combination of more than one heavy atom. Heavy atoms are those atoms with an atomic number greater than found in natural DNA. A heavy atom label can include one or more atoms bound to one another directly or through intermediate bonds, or bound separately to the same base or base pair. Binding of a heavy atom label can be at the base, sugar, or phosphate, either directly or via a linker.

As used herein, "attachment points" (or "attachment sites") are chemical modifications of a base that may not include a heavy atom label, but that can be used to bind a heavy atom label in a subsequent step. For example, a phosphorothioate-modified dNTP can include attachment points that are one or more sulfur atoms replacing oxygen atoms in the backbone phosphate. These sulfur atoms can be used in a later step to bind heavy atom labels.

As used herein, "labels" is also meant to include "attachment points" or "attachment sites" that are used in a subsequent heavy atom or other label conjugation step.

The methods for nucleic acid sequencing can involve using a particle beam, such as an electron beam, or ion beam, to obtain information regarding the heavy-atom labeled nucleic acid molecule. Examples of such methods using particle beams for nucleic acid sequencing can be found in U.S. Patents 7,288,379, 7,291,467, 7,291,468, 7,604,942, 7,604,943, 7,910,311, and 8,697,432, each of which is incorporated by reference in its entirety. For example, a sample of heavy-atom labeled DNA can be exposed to a particle beam and changes in the beam resulting from interaction with the sample may form a pattern which can be interpreted to provide the information. In some embodiments, a particle beam instrument (e.g., an electron microscope) can be used to directly view samples of DNA. As described further herein, the methods can enable nucleic acid sequencing, identifying and/or detection at high speeds, low costs, and high accuracy, amongst other advantages.

In some embodiments, a complementary strand of a nucleic acid molecule may be analyzed to determine the sequence and/or presence of a nucleic acid molecule. In certain embodiments, it is preferred that the sample be formed of one or more complementary strands of the nucleic acid molecule. In other embodiments, the sample may be formed of one or more strands of the nucleic acid molecule along with or separate from the complementary strand.

Conventional techniques may be used to form a complementary strand of a nucleic acid molecule and/or the molecule itself. Typically, the first step in forming the complementary strand is to obtain a single strand of a nucleic acid molecule. Any suitable technique may be used to obtain a single strand. Standard denaturing processes (e.g., thermal, enzymatic) which break the hydrogen bonding between the strands may be used. In other embodiments, a single strand can be created by synthesizing it from a template. For example, polymerase chain reaction (PCR), primer extension, or reverse transcriptase processes that are well known in the art may be used. In other embodiments, a single strand may be chemically synthesized one base at a time, for example, in an oligonucleotide synthesis process. Such synthetic processes are well known in the art and can be automated. It is also possible to obtain a single strand by purifying it from a natural source, such as single stranded RNA from cells. Combinations of the foregoing (and other methods known to those of skill in the art) also can be used.

A complementary strand of a nucleic acid molecule can be created from the single strand using any suitable conventional technique. For example, standard polymerization techniques may be used including polymerase chain reaction (PCR) (e.g., standard PCR, long PCR protocols) and primer extension. The techniques generally involve exposing the single strand to an excess of bases under the proper reaction conditions. The bases may be labeled, as described herein. In some embodiments, single or multiple polymerase enzymes are used to facilitate reactions. Polymerase enzymes include DNA-dependent DNA polymerases (including thermostable enzymes such as Taq polymerase), RNA-dependent DNA polymerases (e.g., reverse transcriptases) and RNA-dependent RNA polymerases. In other embodiments, enzymes need not be used (e.g., in vitro chemical synthesis). Other suitable components (e.g., oligonucleotide primers, other enzymes such as primases, and the like) may also be present.

It should be understood that complementary strands may be modified to include other components that would not otherwise be present in a DNA strand. For example, the complementary strand may be modified to include labels (e.g., during formation) that facilitate detection and identification of bases in methods described herein. Labels (e.g., heavy atoms or molecules) when exposed to a particle beam create characteristic particle beam species that may be detected and identified using the systems and methods described herein. Similarly, the nucleic acid molecule also can be modified to include labels as described herein. This advantageously is done during synthesis of the nucleic acid, for example using PCR amplification, which typically results in the synthesis of both strands (i.e., the nucleic acid molecule and its complementary strand), or primer extension, which typically results in synthesis of only the complementary strand.

When labels are present, it may be preferable to attach the labels to bases of the complementary strand only or to both strands of the nucleic acid. Labels can be incorporated in the complementary strand only (e.g., using a single round of PCR or primer extension) or in both strands of the nucleic acid (e.g., using two or more rounds of PCR or primer extension). In certain embodiments, a single type of heavy atom label is attached to more than one type of nucleotide (e.g., cytidine triphosphate (CTP), adenosine triphosphate (ATP), thymidine triphosphate (TTP), uridine triphosphate (UTP), guanosine triphosphate (GTP); conventionally these nucleotides as incorporated into nucleic acid molecules are referred to by a single letter, e.g., A, C, G, T or U). Modified (non-natural) or atypical natural nucleotides also can be used, in which the bases, sugars or phosphate moieties can be different than those present in typical naturally occurring nucleotides (e.g., in A, C, G, T and U). Mixtures of the foregoing can be employed in the invention.

In certain embodiments, not all base types need to be labeled. For example, if three base types (e.g., C, A, T) are labeled and the fourth (e.g., G) is unlabeled, then each "unlabeled" type may readily be identified as the fourth base type (e.g., G). The position of the unlabeled bases can be inferred from observation of the distances between labeled bases, given the highly regular spacing of bases in nucleic acid molecules. In other embodiments, only two of the base types may be labeled. For example, a first set of sequencing data may be generated with two base types labeled (e.g., T, C) and a second set of sequencing data may be generated with a different set of two base types labeled (e.g., T, A). Both data sets may be processed to provide information regarding the entire sequence, such as is shown in Figure 1. Having a labeled base type in common provides a number of advantages, such as providing an internal control for deconvolution. This deconvolution also eliminates the need to discriminate between different labels. Hence, the detection is binary, i.e., it provides information of whether a label is found in a particular position or not, and thus does not require further analysis to determine what type of label it is. Thus, the labels can be identical, but do not have to be identical. The labels do not have to be different.

In another embodiment, only two of the base types are labeled in each of three different combinations. For example, a first set of sequencing data may be generated with two base types labeled (e.g., T, C), a second set of sequencing data may be generated with a different set of two base types labeled (e.g., T, A), and a third set of sequencing data may be generated with yet another different set of two base types labeled (e.g., A, G). All three data sets may be processed to provide information regarding the entire sequence, such as is shown in Figure 2.
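With the three example combinations above (T+C, T+A and A+G), each base type produces a distinct pattern of label presence across the three tracks, and the remaining patterns cannot arise from any single base type. The Python sketch below uses that observation for a simple form of error checking. It assumes aligned, equal-length tracks; the names `CODEBOOK` and `call_with_error_check` are hypothetical.

```python
TRACKS = ("TC", "TA", "AG")  # labeled base types per track, as in the example

# Each base type yields one expected pattern of label presence across tracks,
# e.g. T is labeled in the first two tracks only, giving (1, 1, 0).
CODEBOOK = {
    tuple(int(base in labeled) for labeled in TRACKS): base
    for base in "ACGT"
}

def call_with_error_check(patterns):
    """Call bases from aligned binary tracks, flagging impossible patterns as '?'."""
    calls = []
    for observed in zip(*patterns):
        key = tuple(int(symbol) for symbol in observed)
        calls.append(CODEBOOK.get(key, "?"))
    return "".join(calls)

# Hypothetical observations; the second set has one spurious label in track three.
clean = ("0011100", "0111010", "1100011")
noisy = ("0011100", "0111010", "1110011")
print(call_with_error_check(clean))
print(call_with_error_check(noisy))
```

Not every single error is detectable in this simple scheme, because some errors convert one valid pattern into another (for example, a missed label in the second track turns the pattern for A into the pattern for G).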

In another embodiment, either two or three of the base types are labeled in each of three different combinations. For example, a first set of sequencing data may be generated with two base types labeled (e.g., T, C), a second set of sequencing data may be generated with a different set of two base types labeled (e.g., T, A), and a third set of sequencing data may be generated with a set of three base types labeled (e.g., A, T, G). All three data sets may be processed to provide information regarding the entire sequence, such as is shown in Figure 3.

In another embodiment, enhanced disambiguation of consecutive unlabeled bases is provided using a single set of sequencing data generated with three base types labeled (e.g., A, T, C). The single data set may be processed to provide information regarding the entire sequence, and particularly for consecutive unlabeled bases, such as is shown in Figure 4.
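The inference of consecutive unlabeled bases from a single three-base track can be sketched in Python by dividing the measured distance between neighboring labels by the base-to-base spacing. The spacing value, the measured positions and the function names below are assumptions chosen for illustration and are not values from this disclosure. In the output, L marks a labeled position whose type (A, T or C) is not resolved by this single track, and G fills the inferred dark positions.

```python
def dark_bases_between(label_positions_nm, spacing_nm):
    """Estimate the number of unlabeled bases between consecutive labels.

    `label_positions_nm` are label centers measured along the straightened
    strand; `spacing_nm` is the assumed base-to-base distance.
    """
    counts = []
    for left, right in zip(label_positions_nm, label_positions_nm[1:]):
        steps = round((right - left) / spacing_nm)  # whole base steps between labels
        counts.append(max(steps - 1, 0))
    return counts

def fill_unlabeled(label_positions_nm, spacing_nm, unlabeled_base="G"):
    """Build a string with 'L' for each label and the unlabeled base in the gaps."""
    if not label_positions_nm:
        return ""
    out = ["L"]
    for gap in dark_bases_between(label_positions_nm, spacing_nm):
        out.append(unlabeled_base * gap)
        out.append("L")
    return "".join(out)

# Hypothetical measurements with an assumed spacing of 0.5 nm per base.
positions = [0.0, 0.52, 1.49, 3.01]
print(dark_bases_between(positions, 0.5))
print(fill_unlabeled(positions, 0.5))
```

Rounding to the nearest whole number of base steps tolerates small measurement error, but the estimate becomes less reliable as dark runs grow longer, which is one motivation for combining additional tracks.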

In another embodiment, enhanced disambiguation of consecutive unlabeled bases is provided. For example, a first set of sequencing data may be generated with two base types labeled (e.g., T, C) and a second set of sequencing data may be generated with a non-overlapping, different set of two base types labeled (e.g., G, A). Both data sets may be processed to provide information regarding the entire sequence, and particularly for consecutive unlabeled bases, such as is shown in Figure 5.
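When two non-overlapping tracks together label all four base types, as in the T+C and G+A example, every position should carry a label in exactly one track. A run of dark positions in one track therefore appears as a run of countable labels in the other. The Python sketch below illustrates this under the assumption of aligned, error-free tracks; the function names and example patterns are hypothetical.

```python
from itertools import groupby

def run_lengths(pattern):
    """Run-length encode a binary pattern as (symbol, length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(pattern)]

def check_complementary(track_a, track_b):
    """Return positions where the two tracks are both labeled or both dark."""
    return [i for i, (a, b) in enumerate(zip(track_a, track_b)) if a == b]

track_tc = "0011100"  # T and C labeled
track_ga = "1100011"  # G and A labeled

print(check_complementary(track_tc, track_ga))
print(run_lengths(track_tc))
print(run_lengths(track_ga))
```

Here the dark run of length three in the G+A track corresponds to three labels in the T+C track, so its length can be counted rather than estimated from distance alone, and any position reported by `check_complementary` indicates a disagreement between the tracks.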

The labels may be attached to bases in a variety of different locations. In some embodiments, labels are attached to the bases on, or within, the nitrogenous base (e.g., adenine, guanine, thymine, cytosine, uracil). For example, in these embodiments, labels may be attached to carbon/nitrogen rings in the base or may replace carbon or nitrogen atoms in the base. In other embodiments, labels are attached to the bases on, or within, the sugar molecule (e.g., ribose in RNA, or deoxyribose in DNA). In other embodiments, labels are attached on, or within, linking groups of the bases. For example, the labels may be attached on, or within, a phosphate linking group. The labels may be attached to oxygen substitutes, such as sulfur (e.g., alpha substituted phosphates, αS) or may replace the phosphorus atom at certain sites.

In certain embodiments, the labels are attached to the bases by covalent bonding. As described further below, covalent bonding provides strong attachment between labels and bases which can enable labeled samples to withstand exposure to relatively high particle beam energies (e.g., greater than about 50 kV for electron beams, for example about 80-120 kV) that may be important to detection and/or identification of nucleic acids.

In certain embodiments, it is preferable that the labels are attached to bases prior to the bases forming the complementary strand (and/or copies of the first strand of the nucleic acid molecule). In these embodiments, the labels may be selected from types, as described further below, that do not prevent polymerase reactions that form the complementary strand (and/or copies of the first strand of the nucleic acid molecule). Thus, in these cases, the complementary strand is labeled during its formation.

However, in other embodiments, it may be desired to attach labels (or additional labels) to bases after formation of the complementary strand (and/or copies of the first strand of the nucleic acid molecule). In these cases, the bases to be labeled (or additionally labeled) may have been modified (prior to formation of the complementary strand and/or copies of the first strand of the nucleic acid molecule) to include a suitable attachment site which can be bound, preferably covalently, to a desired label type. After formation, the nucleic acid strand(s) may be exposed to the labels which attach to the attachment sites.

In certain methods described herein, the complementary strand is separated from the first strand to form a single complementary strand which is used as the sample. The complementary strand may be separated from the first strand using conventional denaturing techniques (e.g., thermal, enzymatic). After separation, the first strand may be discarded, or may be retained and otherwise used.

In some cases, separation and use of the complementary strand can simplify detection and/or identification and/or quantitation in subsequent method steps. In some embodiments, however, the complementary strand and the first strand are not separated, and the double-stranded structure is used as a sample in the detection and/or identification steps.

In certain embodiments, when the complementary strand is separated from the first strand, the complementary strand is used as a template to create another strand which may be labeled. This can create a double-stranded structure which includes two labeled strands (i.e., the complementary strand and the new strand created from the complementary strand). In certain methods, this double-stranded structure is used as the sample in the detection and/or identification steps.

Methods described herein may involve attaching a sample (e.g., complementary strand, complementary strand and first strand, complementary strand and new strand), or more than one sample, to a substrate. When more than one sample is attached, the samples may be the same (i.e., based on the same sequence) or different. In general, the substrate should be suitable for exposure to a particle beam. In embodiments in which particle beam species transmitted through the sample are detected, the substrate should permit sufficient transmission of the particle beam.

The substrate is generally thin to enable sufficient particle beam transmission therethrough. For example, the substrate may be less than 5 nanometers (nm); in some cases, less than 2 nm; or, even less than 1.5 or 1.1 nm. The substrate may be formed of a single layer or multiple layers. In certain cases, the layer(s) may be cross-linked. Conventional techniques can be used to form the substrates including vapor deposition and FIB milling, amongst others.

Suitable substrate materials are known to those of skill in the art and can include carbon (e.g., pure carbon, graphene, diamond), boron nitride (e.g., having a cubic structure), aluminum and certain polymeric resins (e.g., FORMVAR® (polyvinyl formal)). In other embodiments, the substrate is formed from organic materials such as a lipid, natural protein or synthetic protein. The substrate material may be doped with chemicals, for example, to cross-link layers or to facilitate attachment of the sample as described further below.

Samples may be attached to the substrate by chemically bonding at least a portion of the sample to the substrate. Suitable techniques are known to those of skill in the art. For example, molecules present on the surface of the substrate (e.g., pre-existing as part of the substrate or following derivatization of the substrate) may be used to bind to the sample. The molecules may be nucleic acid sequence specific molecules (e.g., oligonucleotides). In other cases, the substrate surface may be derivatized to provide attachment points that are sequence non-specific. In other cases, electrical charge may be used to bind the sample to the substrate surface. The attachment points for the samples can be spaced apart in a predetermined pattern, such as a grid or microarray.

A portion, or portions, of a sample may be attached to the substrate. In some cases, both ends of the sample (e.g., complementary strand, complementary strand and first strand, complementary strand and new strand) may be attached; in other cases, only one end of the sample may be attached; in some cases, one or more non-end portions along the length of the sample may be attached. The attachment at the end(s) or along the length of the nucleic acid molecule(s) can be facilitated, if desired, by including in the nucleic acid during synthesis bases capable of forming bonds with the substrate.

Certain methods described herein involve substantially straightening a sample (e.g., labeled double strand) prior to, during, or even after, attachment to the substrate. This can facilitate detection and/or identification. The labeled double strand may be attached to the substrate, for example, via a linking bond to a bonding site as described further below.

Conventional techniques may be used to straighten the sample. For example, a sample may be straightened using fluid flow (e.g., molecular combing). The fluid may comprise one or more liquids, gases, or combinations thereof. In certain embodiments, the sample is attached and straightened by hybridization in a fluid flow to oligonucleotides present on the substrate surface. In some cases, electrical fields may be used (either in the presence of fluid flow, or alone) to promote sample straightening. In embodiments in which more than one sample is attached to the substrate, it may be preferred for each sample to be aligned substantially parallel to one another to facilitate exposure to the beam. Methods exist to perform molecular alignment of nucleic acid molecules in a thin or monolayer on a substrate. Some focus on isolating one or a few strands of materials and stretching them out for observation and genetic analysis. Examples of such methods are molecular combing using an air-water meniscus developed by the Pasteur Institute (e.g., US Patents to Bensimon et al. 5,840,862, 6,265,153 and 6,548,255) and a molecular alignment technique for optical mapping used by OpGen, Inc. Methods also exist to attach nucleic acid molecules in high density patterns on a substrate with a thickness of tens to millions of atoms. An example would be oligo synthesis or spotting on a microarray.

In certain embodiments, methods and compositions of the present disclosure may be combined with methods to perform high-density molecular alignment of nucleic acid molecules on substrates or surfaces as embodied in PCT Publication Nos. WO 2009/002506 A2 and WO 2010/144128 A2, entitled "High Density Molecular Alignment of Nucleic Acid Molecules," and "Molecular Alignment and Attachment of Nucleic Acid Molecules," respectively, both of which are incorporated herein by reference in their entirety.

One aspect of the disclosure provides a substrate having a combination of materials and dimensions that allows the substrate to have distinct physical properties. Specifically, in one embodiment, the materials and dimensions of the substrate allow it to be used for imaging samples with a particle beam instrument such as a transmission electron microscope. The substrate can include one or more ligands (e.g., nucleic acids, polypeptides, oligosaccharides, and synthetic polymers) which may form an array. Corresponding changes in labeling chemistry can allow for ligands, binding partners and other relevant materials to be identifiable, quantifiable, and even sequenceable via modified forms of electron microscopy. In certain embodiments, the array dimensions are on the order of nanometers per functional region rather than micrometers as in certain conventional arrays. With these dimensions, smaller amounts of sample material can be used and more accurate genetic analyses performed. These smaller substrate dimensions may also give rise to dramatically reduced production costs, amongst other advantages. The transparency of the substrate, due to thinness, material type and other factors, may provide a suitable contrast ratio between the labeled molecules and the substrate that results in higher quality readings and lower cost analysis than some conventional techniques.

Aspects of the present disclosure may be combined with the description of certain embodiments in U.S. Patent Publication Nos. 2006/0024716, 2006/0024717, 2006/0024718, 2006/0029957, which correspond to PCT Application Publication No. WO06/019903, all entitled, "Systems and Methods of Analyzing Nucleic Acid Polymers and Related Components," as well as 2007/0134699, which corresponds to WO07/120202, entitled "Nano-Scale Ligand Arrays on Substrates for Particle Beam Instruments and Related Methods," each of which is incorporated herein by reference in its entirety. These references may provide, for example, methods and devices for incorporating contrast heavy atom labels in a biologic sample that are designed to interfere with a beam from a particle beam instrument. In certain embodiments, the labeled sample materials are binding partners, which can be bound to ligands in an array on a suitable substrate. A particle beam may be directed through the array and the labels can create interference patterns that are then read by a detector instrument and processed by a data analysis module.

Methods described herein involve exposing the sample to a particle beam. In certain embodiments, it is preferred that the particle beam is a lepton beam such as an electron beam. In other cases, the particle beam may be an X-ray beam. Yet in other embodiments, the particle beam may be an ion beam such as a helium or gallium ion beam. When an electron beam is used, a beam generator produces a beam having a desired voltage which, for example, can be greater than 50 kV, e.g., 80-300 kV, preferably 80-120 kV. Beam energies are a function of both voltage and current. The beam current typically ranges between 5 and 25 μA, preferably between 8 and 15 μA. The specific beam energy depends, in part, on the specific analysis being performed.
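As a rough illustration of how the voltage and current ranges above relate, the following sketch computes the electrical power carried by an electron beam as the product of accelerating voltage and beam current, and checks a setting against the preferred ranges quoted in this paragraph. The function names and the example operating point are assumptions made for illustration; they are not prescribed by the disclosure.

```python
def beam_power_watts(voltage_kv: float, current_ua: float) -> float:
    """Return beam power in watts for a voltage in kV and a current in microamperes."""
    volts = voltage_kv * 1e3
    amperes = current_ua * 1e-6
    # Power is voltage times current (P = V * I).
    return volts * amperes


def in_preferred_range(voltage_kv: float, current_ua: float) -> bool:
    """Check a setting against the preferred ranges in the text (80-120 kV, 8-15 uA)."""
    return 80.0 <= voltage_kv <= 120.0 and 8.0 <= current_ua <= 15.0
```

For example, a hypothetical setting of 100 kV and 10 μA lies within both preferred ranges and corresponds to a beam power of about 1 W.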

The sample containing labeled nucleic acid molecules may be stabilized to reduce damage to the nucleic acid molecules from the particle beam. For example, embedding the sample containing labeled nucleic acid molecules in a material that preserves the spatial location of the labels allows for higher beam dosage than a labeled nucleic acid molecule itself can sustain. The labeled nucleic acid molecules can be damaged, but the sequence information is retained in the form of the location of the labels.

Methods can include properly focusing the beam on the sample using a lens arrangement as known to those of skill in the art. Methods may also include a calibration step. In certain cases, the system may be automatically calibrated based on known information from nucleic acid molecules in the sample (such as known molecular geometries and structures) using a feedback loop. For example, data obtained from a nucleic acid sample using an electron beam may include internucleotide (e.g., interlabel) distances. As used herein, an internucleotide distance is the distance from one nucleotide base in one strand to the adjacent nucleotide base in the same strand. While the internucleotide distances of, for example, a DNA molecule are generally known, the internucleotide distance in any given sample may not correspond to the generally known distance, but will typically be substantially uniform within a sample as affixed to a substrate, particularly a sample that has been straightened, e.g., by treatment using molecular combing or like methods. Thus, after obtaining a data read on a given sample, various aspects of the system can be calibrated or adjusted using a feedback control system. For example, knowing the internucleotide distances permits feedback relevant to focusing the particle beam and movement of the sample relative to the particle beam.
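One way the uniform internucleotide distance might be recovered from a data read is sketched below. The sketch assumes that label positions have already been reduced to one-dimensional coordinates (in nm) along a straightened strand, and that at least one pair of adjacent bases is labeled so that the smallest gap approximates one internucleotide distance. The function name, the data layout and the two-pass estimate are illustrative assumptions, not the calibration procedure of the disclosure.

```python
from typing import List, Tuple


def calibrate_pitch(positions_nm: List[float]) -> Tuple[float, List[int]]:
    """Estimate the internucleotide distance and assign a base index to each label.

    Assumes the smallest gap between neighboring labels spans exactly one base.
    """
    xs = sorted(positions_nm)
    if len(xs) < 2:
        raise ValueError("need at least two label positions")
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    rough = min(gaps)
    if rough <= 0:
        raise ValueError("label positions must be distinct")
    # First pass: express every gap as a whole number of base steps.
    steps = [max(1, round(g / rough)) for g in gaps]
    # Second pass: refine the pitch using the full span of the read.
    pitch = (xs[-1] - xs[0]) / sum(steps)
    indices = [0]
    for s in steps:
        indices.append(indices[-1] + s)
    return pitch, indices
```

The refined pitch could then serve as the feedback quantity mentioned above, for example by comparing it between successive reads of the same sample.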

Though systems used in the methods described herein may include several components similar to those of a conventional transmission electron microscope (e.g., beam generator, lens, etc.), certain systems may be simpler than typical conventional TEMs. For example, in some embodiments, the systems are simplified by limiting the magnification range, accelerating voltages, probe diameter, beam current, and sample flexibility, amongst other features. Also, problems related to spherical aberration in conventional TEMs may be limited, or eliminated, by using a lens arrangement that is pre-set for typical operating conditions for the system.

Characteristics of the particle beam are changed when the beam interacts with the sample. For example, one or more of the following characteristics of the particle beam may change: energy, direction, absorbance, reflection and deflection. Such changes may result from interactions between the particle beam and labels attached to bases as described above. Specific types of labels may produce specific or characteristic changes. Thus, a label (and, the specific base to which it is attached) may be identified by recognizing the specific or characteristic beam changes.

A detector collects particle beam species after the interaction between the particle beam and the sample. The detector typically collects beam species that have been transmitted through the sample, though also can collect beam species that are reflected and/or scattered. The detector may include a charge coupled device (CCD). The CCD may directly convert the beam species into digital information. Technologies other than CCD technology may be used to convert the beam species into digital information.

In some embodiments, a nucleic acid polymer may be detected, and/or sequenced and/or identified based on particle beam species detected by a detector (e.g., the detector described above). Particle beam species may result from exposure of a sample comprising a nucleic acid polymer and/or its complementary strand to a particle beam (e.g., a lepton beam such as an electron beam). Methods, systems, computers, computer systems, computer storage media, software, and components for analyzing digital information generated by a detector (e.g., a CCD) are known in the art, and are exemplified in U.S. Patent Publication Nos. 2006/0024716, 2006/0024717, 2006/0024718, 2006/0029957, which correspond to PCT Application Publication No. WO06/019903, each of which is incorporated by reference in its entirety.

EXAMPLES

Example 1: Sequencing via two track label detection using electron microscope

A DNA molecule is synthesized that includes heavy atom labels. The labels, or attachment points, are built into the DNA via dNTP precursors that contain the labels or attachment groups. The dNTP precursors are built into the DNA polymer using enzymatic molecular biology techniques such as PCR or primer extension.

In a given molecule, more than one base (or base pair) is labeled. The labels can be identical, or not. The molecule is imaged with an electron beam system, such as an Electron Microscope (EM). The labels identify the location of the bases/base pairs to which they are attached. It is not required to discriminate between the identity of the labels if they are not identical. The resulting image is analyzed, assigning partial base identification information to each position. In Figure 1, the second column shows a schematic of how bases T and C might be partially identified in such a scheme.

The bases corresponding to the labeled positions are partially identified. This information can be converted to partial sequence data and used directly.

In a second version of this technology, more than one labeling reaction is used on molecules of identical sequence in a given region. Figure 1 shows two different labeling systems. In one, T and C are labeled (see second column). In the other, T and A are labeled (see third column). Each gives partial sequence information. By comparing the two "tracks" of data, complete sequence information can be deduced.
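The comparison of the two tracks can be sketched as a lookup on the pair of labeled/unlabeled observations at each position. The sketch below assumes that the two tracks have already been aligned position by position, that every position is represented (so the number of unlabeled positions is known), and that the labels are not distinguished from one another. The function names and the boolean track representation are hypothetical.

```python
from typing import List, Sequence


def label_track(sequence: str, labeled: str) -> List[bool]:
    """Simulate a track: True where the base belongs to the labeled set."""
    return [base in labeled for base in sequence]


def decode_two_tracks(tc_track: Sequence[bool], ta_track: Sequence[bool]) -> str:
    """Deduce each base from a T/C-labeled track and a T/A-labeled track."""
    if len(tc_track) != len(ta_track):
        raise ValueError("tracks must cover the same positions")
    table = {
        (True, True): "T",    # labeled in both tracks
        (True, False): "C",   # labeled only in the T/C track
        (False, True): "A",   # labeled only in the T/A track
        (False, False): "G",  # labeled in neither track
    }
    return "".join(table[(bool(a), bool(b))] for a, b in zip(tc_track, ta_track))
```

For instance, simulating both tracks for the sequence GATTACA with `label_track` and passing them to `decode_two_tracks` returns GATTACA. The decoded bases are those of the imaged (labeled) strand.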

This style of data deconvolution is not limited to systems using only two labeled bases; the system can work with two or three labeled bases at a time. It is also not limited to comparing only two labeling combinations. A limit of this technology is that there can be a contiguous region of unlabeled bases/base pairs. This "dark region" could have an ambiguous number of unlabeled bases, complicating downstream data analysis. However, by combining enough different labeling combinations, such "dark regions" are eliminated. If each nucleotide/base pair type is labeled in at least one of the multiple parallel runs, the data can be combined to eliminate "dark regions" of ambiguous numbers of bases/base pairs.
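The condition stated above can be checked for any proposed set of labeling combinations. The following sketch tests two properties: whether every base type is labeled in at least one run (the "dark region" condition), and whether each base type produces a distinct labeled/unlabeled pattern across the runs. It assumes a four-letter DNA alphabet and labels that are not distinguished from one another; the function names are illustrative.

```python
from typing import Dict, Sequence, Tuple

BASES = "ACGT"  # assumed DNA alphabet; use "ACGU" for RNA


def signatures(label_sets: Sequence[str]) -> Dict[str, Tuple[bool, ...]]:
    """Map each base to its labeled/unlabeled pattern across the runs."""
    return {base: tuple(base in s for s in label_sets) for base in BASES}


def covers_all_bases(label_sets: Sequence[str]) -> bool:
    """True if every base type is labeled in at least one run."""
    return all(any(sig) for sig in signatures(label_sets).values())


def resolves_all_bases(label_sets: Sequence[str]) -> bool:
    """True if no two base types share the same pattern across the runs."""
    sigs = list(signatures(label_sets).values())
    return len(set(sigs)) == len(sigs)
```

Under these assumptions, the two combinations of Figure 1 (T/C and T/A) give every base a distinct pattern but leave G unlabeled in both runs, whereas adding a third run labeling A and G satisfies both checks.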

This method is compatible with both ssDNA and dsDNA, as well as other polymers including RNA.

Example 2: Sequencing via three track label detection using electron microscope

As mentioned in Example 1, the disclosed data deconvolution methods are not limited to systems of using only two labeled bases, but can work with two or three labeled bases at a time, and also are not limited to comparing only two labeling combinations.

One non-limiting embodiment of the use of more than one labeling reaction on molecules of identical sequence in a given region is depicted in Figure 2. This figure shows the use of three different labeling systems. In the first labeling system, T and C are labeled (see Track 1). In the second labeling system, T and A are labeled (see Track 2). In the third labeling system, A and G are labeled (see Track 3). Each labeling system gives partial sequence information. By comparing the three tracks of data, complete sequence information can be deduced.
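A minimal sketch of such a comparison for an arbitrary number of tracks is given below. It builds a table from each possible labeled/unlabeled pattern to the base or bases that produce it, then looks up the pattern observed at each position. The sketch assumes aligned tracks of equal length and indistinguishable labels; the names `build_decoder` and `decode_tracks`, and the use of "?" for a pattern that no base can produce, are illustrative choices.

```python
from typing import Dict, List, Sequence, Tuple

BASES = "ACGT"  # assumed DNA alphabet


def build_decoder(label_sets: Sequence[str]) -> Dict[Tuple[bool, ...], str]:
    """Map each labeled/unlabeled pattern to the base(s) that produce it."""
    decoder: Dict[Tuple[bool, ...], str] = {}
    for base in BASES:
        sig = tuple(base in s for s in label_sets)
        decoder[sig] = decoder.get(sig, "") + base
    return decoder


def decode_tracks(label_sets: Sequence[str],
                  tracks: Sequence[Sequence[bool]]) -> List[str]:
    """Return, for each position, the base(s) consistent with all tracks."""
    if len(tracks) != len(label_sets) or len({len(t) for t in tracks}) != 1:
        raise ValueError("need one equal-length track per labeling system")
    decoder = build_decoder(label_sets)
    calls = []
    for column in zip(*tracks):
        sig = tuple(bool(x) for x in column)
        # "?" marks a pattern that no base can produce, e.g. a misread label.
        calls.append(decoder.get(sig, "?"))
    return calls
```

With the three labeling systems of Figure 2 (T/C, T/A and A/G), each base has a distinct pattern, so every position decodes to a single base. Patterns that no base produces can flag possible read errors.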

Example 3: Sequencing via three track label detection using electron microscope

Another non-limiting embodiment of the use of more than one labeling reaction on molecules of identical sequence in a given region is depicted in Figure 3. This figure shows the use of three different labeling systems. In the first labeling system, T and C are labeled (see Track 1). In the second labeling system, T and A are labeled (see Track 2). In the third labeling system, A, T and G are labeled (see Track 3). Each labeling system gives partial sequence information. By comparing the three tracks of data, complete sequence information can be deduced.

Example 4: Disambiguation of consecutive unlabeled bases by single track label detection using electron microscope

As mentioned in Example 1, there can be one or more contiguous regions of unlabeled bases/base pairs. This "dark region" could have an ambiguous number of unlabeled bases, complicating downstream data analysis.

One non-limiting embodiment of the use of one labeling reaction is depicted in Figure 4. This figure shows the use of a single labeling system, in which A, T and G are labeled (see second column). This labeling system gives partial sequence information that provides enhanced disambiguation of consecutive unlabeled bases.
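With three base types labeled, every unlabeled position must be the one remaining base type, so a dark region can only be a run of that base. The sketch below illustrates this for the A/T/G labeling of this example. It assumes label positions measured in nm along a straightened strand and a known, uniform internucleotide distance (`pitch_nm`); both inputs and the function name are hypothetical. Labeled positions are reported with the IUPAC ambiguity code D (A, G or T) and unlabeled positions as C.

```python
from typing import List


def single_track_calls(label_positions_nm: List[float], pitch_nm: float) -> str:
    """Fill gaps between A/T/G labels with the only unlabeled base type, C."""
    if pitch_nm <= 0:
        raise ValueError("pitch must be positive")
    xs = sorted(label_positions_nm)
    if not xs:
        return ""
    calls = ["D"]  # D = A, G or T (labeled, identity not resolved)
    for prev, cur in zip(xs, xs[1:]):
        steps = round((cur - prev) / pitch_nm)
        if steps < 1:
            raise ValueError("labels closer than one internucleotide distance")
        # A gap of n steps hides n - 1 unlabeled bases, all of which must be C.
        calls.extend("C" * (steps - 1))
        calls.append("D")
    return "".join(calls)
```

Unlabeled bases before the first label or after the last label are not visible to this sketch.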

Example 5: Disambiguation of consecutive unlabeled bases by two track label detection using electron microscope

As mentioned in Examples 1 and 4, there can be one or more contiguous regions of unlabeled bases/base pairs. This "dark region" could have an ambiguous number of unlabeled bases, complicating downstream data analysis.

Another non-limiting embodiment of the use of more than one labeling reaction on molecules of identical sequence in a given region is depicted in Figure 5. This figure shows the use of two different labeling systems. In the first labeling system, T and C are labeled (see second column). In the second labeling system, G and A are labeled (see last column). Each labeling system gives partial sequence information. By comparing the two tracks of data, complete sequence information can be deduced that provides enhanced disambiguation of consecutive unlabeled bases.
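Because the two labeling systems in this example do not overlap, every position is labeled in exactly one track, and the labels of one track can be used to count the bases in the dark regions of the other. The sketch below merges the two tracks by position. It assumes that both tracks have been registered to a common coordinate frame (in nm) and that labels are not distinguished from one another, so it reports only the IUPAC class at each position (Y for T or C, R for G or A) together with the dark-run lengths. The function names are illustrative.

```python
from typing import List


def merge_complementary_tracks(tc_positions_nm: List[float],
                               ga_positions_nm: List[float]) -> str:
    """Interleave T/C labels (Y) and G/A labels (R) in positional order."""
    tagged = [(x, "Y") for x in tc_positions_nm]
    tagged += [(x, "R") for x in ga_positions_nm]
    tagged.sort(key=lambda item: item[0])
    return "".join(code for _, code in tagged)


def dark_run_lengths(calls: str, dark_code: str) -> List[int]:
    """Lengths of consecutive runs of positions that are dark in one track."""
    runs, count = [], 0
    for code in calls:
        if code == dark_code:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs
```

In the merged string, the runs of R give the number of unlabeled bases in each dark region of the T/C track, and the runs of Y do the same for the G/A track, without relying on a distance measurement across the dark region.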

Claims

1. A method of determining the sequence of a nucleic acid molecule comprising

forming a first complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases;

forming at least one second complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases in which at least one, but not all, of the heavy atom labeled bases are the same as used in forming the first complementary strand of the nucleic acid molecule; and

identifying a sequence of bases in the first complementary strand and/or in the at least one second complementary strand using a particle beam.

2. The method of claim 1, further comprising comparing the pattern of heavy atom labeled bases in the first complementary strand with the pattern of heavy atom labeled bases in the at least one second complementary strand.

3. The method of claim 1 or claim 2, wherein the nucleic acid molecule is single stranded or double stranded DNA or single stranded or double stranded RNA.

4. The method of claim 1 or claim 2, wherein the complementary strand is DNA or RNA.

5. The method of any one of claims 1-4, wherein the nucleic acid molecule and/or its complementary strand is formed by a nucleic acid polymerase enzyme.

6. The method of claim 5, wherein the complementary strand of the nucleic acid molecule is formed using polymerase chain reaction (PCR) or extension of a primer that hybridizes to the nucleic acid molecule.

7. The method of any one of claims 1-6, wherein base specific labels are incorporated in the nucleic acid molecule and/or the complementary strand during formation of the nucleic acid molecule and/or the first and/or at least one second complementary strands, or wherein bases comprising attachment sites are incorporated in the nucleic acid molecule and/or the first and/or at least one second complementary strands and are modified after formation by covalently bonding heavy atom labels at the attachment sites.

8. The method of any one of claims 1-7, wherein a single type of heavy atom label is used.

9. The method of any one of claims 1-8, wherein the heavy atom label is or contains platinum.

10. The method of any one of claims 1-9, wherein at least two types of bases are labeled with the same type of heavy-atom label.

11. The method of any one of claims 1-10, wherein one type of base is labeled in the first complementary strand and/or the at least one second complementary strand.

12. The method of any one of claims 1-11, wherein two types of bases are labeled in the first complementary strand and/or the at least one second complementary strand.

13. The method of any one of claims 1-12, wherein three types of bases are labeled in the first complementary strand and/or the at least one second complementary strand.

14. The method of any one of claims 1-13, wherein the step of identifying a sequence of bases comprises generating a particle beam, exposing the first complementary strand and/or the at least one second complementary strand to the particle beam, and identifying the bases due to characteristic changes to the particle beam.

15. The method of claim 14, wherein the step of identifying the bases comprises detecting characteristic changes to the particle beam.

16. The method of claim 15, wherein the particle beam is an electron beam.

17. The method of any one of claims 1-16, wherein the step of identifying the sequence of bases comprises comparing the pattern of labeled bases in a first labeled complementary strand with the pattern of bases in at least one second labeled complementary strand, wherein the first labeled complementary strand and the at least one second labeled complementary strand are labeled on different sets of one or more bases.

18. The method of claim 17, wherein the pattern of labeled bases in 2, 3, 4, 5, 6, 7, 8, 9, 10, or more second complementary strands are compared.

19. A method of determining the sequence of a nucleic acid molecule comprising

optionally forming at least one second complementary strand of the nucleic acid molecule using one, two or three types of heavy atom labeled bases in which none of the heavy atom labeled bases are the same as used in forming the first complementary strand of the nucleic acid molecule; and

20. The method of claim 19, further comprising comparing the pattern of heavy atom labeled bases in the first complementary strand with the pattern of heavy atom labeled bases in the at least one second complementary strand, if formed.

21. The method of claim 19 or claim 20, wherein the nucleic acid molecule is single stranded or double stranded DNA or single stranded or double stranded RNA.

22. The method of claim 19 or claim 20, wherein the complementary strand is DNA or RNA.

23. The method of any one of claims 19-22, wherein the nucleic acid molecule and/or its complementary strand is formed by a nucleic acid polymerase enzyme.

24. The method of claim 23, wherein the complementary strand of the nucleic acid molecule is formed using polymerase chain reaction (PCR) or extension of a primer that hybridizes to the nucleic acid molecule.

25. The method of any one of claims 19-24, wherein base specific labels are incorporated in the nucleic acid molecule and/or the complementary strand during formation of the nucleic acid molecule and/or the first and/or at least one second complementary strands, or wherein bases comprising attachment sites are incorporated in the nucleic acid molecule and/or the first and/or at least one second complementary strands and are modified after formation by covalently bonding heavy atom labels at the attachment sites.

26. The method of any one of claims 19-25, wherein a single type of heavy atom label is used.

27. The method of any one of claims 19-26, wherein the heavy atom label is or contains platinum.

28. The method of any one of claims 19-27, wherein at least two types of bases are labeled with the same type of heavy-atom label.

29. The method of any one of claims 19-28, wherein one type of base is labeled in the first complementary strand and/or the at least one second complementary strand.

30. The method of any one of claims 19-29, wherein two types of bases are labeled in the first complementary strand and/or the at least one second complementary strand.

31. The method of any one of claims 19-30, wherein three types of bases are labeled in the first complementary strand and/or the at least one second complementary strand.

32. The method of any one of claims 19-31, wherein the step of identifying a sequence of bases comprises generating a particle beam, exposing the first complementary strand and/or the at least one second complementary strand to the particle beam, and identifying the bases due to characteristic changes to the particle beam.

33. The method of claim 32, wherein the step of identifying the bases comprises detecting characteristic changes to the particle beam.

34. The method of claim 33, wherein the particle beam is an electron beam.

35. The method of any one of claims 19-34, wherein the step of identifying the sequence of bases comprises comparing the pattern of labeled bases in a first labeled complementary strand with the pattern of bases in at least one second labeled complementary strand, wherein the first labeled complementary strand and the at least one second labeled complementary strand are labeled on different sets of one or more bases.

36. The method of claim 35, wherein the pattern of labeled bases in 2, 3, 4, 5, 6, 7, 8, 9, 10, or more second complementary strands are compared.