WO2006125863A1

WO2006125863A1 - Analysis techniques for liquid chromatography/mass spectrometry

Info

Publication number: WO2006125863A1
Application number: PCT/FI2006/050208
Authority: WO
Inventors: Matej Oresic; Mikko Katajamaa
Original assignee: Valtion Teknillinen Tutkimuskeskus
Priority date: 2005-05-26
Filing date: 2006-05-24
Publication date: 2006-11-30
Also published as: FI20055253A0; FI20055253A

Abstract

A method for analyzing liquid chromatography/mass spectrometry [="LC/MS"] data comprises: preparing (1-2) a plurality of sample runs; processing (1-4) each of the prepared sample runs in an LC/MS spectrometer to obtain a spectrum in respect of each processed sample run; internally representing (1-10) each spectrum as a layout of mass/charge versus retention time; performing a first peak detection (1-12) to detect peaks of each spectrum; visualizing peaks of each spectrum, wherein the visualizing step comprises: mapping (1-22) each peak to be visualized to a coordinate system in which a first coordinate indicates mass/charge ratio and a second coordinate indicates retention time; and assigning (1-24) a specific visual attribute to each peak to be visualized.

Description

Analysis techniques for liquid chromatography/mass spectrometry

BACKGROUND OF THE INVENTION

The invention relates to processing techniques, including methods, equipment and software products, for analysis of mass spectrometry data as used in connection with liquid or gas chromatography. Later in this document, LC and MS are abbreviations for liquid chromatography and mass spectrometry, respectively. Although the invention will be described in connection with liquid chromatography, it is also applicable to gas chromatography.

Liquid chromatography coupled to mass spectrometry (LC/MS) has been widely used in proteomics and metabolomics research. In this context, the technology has been increasingly used for differential profiling, i.e., broad screening of biomolecular components across multiple samples or sample runs, which correspond to different conditions, interventions, or time points, in order to elucidate the observed phenotypes and discover biomarkers. One of the major challenges in this domain is the development of better solutions for processing of LC/MS data.

Typical LC/MS experiments include several analytical stages, starting with sample pre-treatment which commonly includes sample cleanup and extraction methods. The sample can then be introduced to an LC column where the molecules separate based on their size (size exclusion chromatography), affinity to stationary phase (affinity chromatography), polarity (ion exchange chromatography), and/or hydrophobicity (reversed phase chromatography). Retention time measures the time between the sample injection and the appearance of the compound peak maximum after chromatographic separation. In analyses of complex mixtures, it is likely that many analytes elute at the same time, and individual compound peaks cannot be resolved by LC techniques alone. Mass spectrometry (MS) can then be used to separate the co-elutants according to mass-to-charge ratio (m/z). The co-elutants enter the LC-MS interface where they are ionized and introduced into the mass spectrometer where the m/z ratio is measured. Several ionization methods exist; among the most commonly used are the soft ionization methods such as electrospray ionization (ESI) and atmospheric pressure chemical ionization (APCI). The principles of mass detection can also vary, with the most common instruments being triple quadrupole, (quadrupole) ion trap, and (quadrupole) time of flight mass spectrometers. Because of the large number of possible applications and approaches, it is a challenge to develop a generic solution for processing and analysis of LC/MS data.

One increasingly utilized type of LC/MS application is differential profiling, where the extraction, LC methods, and MS instrument setup are set to provide a broad coverage of compounds, with the main aim to enable relative quantitative comparisons for individual compounds across multiple samples. The applications of such approach can be found in domains of systems biology, functional genomics, and biomarker discovery. While such approaches cannot match targeted analytical measurements in ability to accurately quantify individual analytes, it is the role of data processing methods to enable comparative studies of analytes, even if they may be unknown.

The data processing for differential profiling comprises several stages. Smoothing (spectral filtering) aims at reducing the complexity of spectra and removing the noise. Peak detection finds the peaks corresponding to the compounds or fragments thereof.

One of the major challenges in the LC/MS domain is the development of better solutions for processing of LC/MS data. A particular problem is related to visualization of spectrum peaks.

BRIEF DESCRIPTION OF THE INVENTION

An object of the present invention is to provide a method, an apparatus and a computer program product for implementing the method so as to solve the above-mentioned problem. The objects of the invention are achieved by a method, program product and computer system which are characterized by what is stated in the independent claims. The preferred embodiments of the invention are disclosed in the dependent claims.

The invention is based on the idea of complementing conventional LC/MS spectrometry operations with an additional step of visualizing peaks of each spectrum, wherein the visualizing step comprises mapping each peak to be visualized to a coordinate system in which a first coordinate indicates mass/charge ratio and a second coordinate indicates retention time; and assigning a specific visual attribute to each peak to be visualized.

A preferred embodiment of the invention also comprises an alignment step and a gap filling step, or program instructions and data structures for executing these steps. Alignment aims at matching the corresponding peaks across multiple sample runs. Gap filling is a second peak detection operation, in which previously undetected peaks in a spectrum are searched by using knowledge of "neighbour" spectra, i.e., the remaining spectra in the set of aligned spectra.

According to another preferred embodiment of the invention, normalization may be used to reduce systematic errors, by adjusting the intensities within each sample run.

An advantage of the invention is improved processing of spectral data.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following the invention will be described in greater detail by means of preferred embodiments with reference to the attached drawings, in which

Figure 1 is a flow chart illustrating main phases in a method according to the invention;

Figure 2 is a block diagram illustrating principal software blocks in an exemplary object-based implementation of the invention;

Figures 3A and 3B illustrate peak detection methods;

Figure 4A shows total ion chromatograms from one elicited and one control sample;

Figure 4B shows a log-ratio view for the top 20% most intense peaks in the control samples shown in Figure 4A;

Figure 5A shows differences between elicited and control groups;

Figure 5B shows distribution within the elicited and control groups;

Figure 6 shows a view of an exemplary user interface.

DETAILED DESCRIPTION OF THE INVENTION

Figure 1 is a flow chart illustrating main phases in a method according to the invention. The invention relates to processing of spectral data from a plurality of sample runs. Each sample run produces a spectrum (spectral data) from a sample. The samples used in the different sample runs can be subsamples from a common larger sample, or they can derive from different samples altogether.

Reference numeral 1-2 denotes sample preparation steps which are known to those skilled in the art and which have been briefly discussed in the background section of this document. Reference numeral 1-4 denotes a step which comprises spectrometry operations, including recording of measured spectral data. Reference numeral 1-6 denotes an optional step in which the spectral data is converted from a vendor-specific data format to some open data format, such as netCDF. A benefit of this step, or the corresponding routine and data structures in the software product, is the ability to support a wide variety of spectrometry instruments. In a further optional step 1-8 the spectral data is smoothed to suppress noise and other spurious data. In some implementations this step may be performed by the spectrometer itself. In step 1-10 the spectral data is internally represented in two dimensions, wherein one dimension corresponds to mass-charge ratio m/z, while the other dimension corresponds to retention time rt. The term 'internal representation' means that a visualization of the spectral data is not necessary, at least not at this stage. Reference numeral 1-12 denotes a peak detection step in which peaks in the spectral data are detected.

Steps 1-2 through 1-12 are known to those skilled in the art and a detailed description is omitted for brevity. In these steps the several sample runs are typically processed serially, each sample run at a time. In the following steps the several sample runs are processed in parallel, interdependently.

Steps 1-14 to 1-18 relate to preferred embodiments and are not essential for the present invention. In step 1-14 data from the several sample runs are aligned such that there is a maximal correspondence between the peaks of the spectra. The verb 'align' may imply visualization, but visualization is not strictly necessary, and any equivalent data processing technique may be used. The alignment operation searches for corresponding peaks across different mass spectrometry runs. Peaks from the same compound usually match closely in m/z values, but retention time between the runs may vary. The retention time largely depends on the analytical method used.

After completion of the alignment process, it is likely that the master peak list has some empty gaps, because it is not certain that every peak is detected and aligned in every sample run. The need to deal with these missing values often complicates further statistical analyses, and for this reason, a method according to the invention comprises a second peak detection step 1-16, the purpose of which is to fill these gaps. In one implementation, the second peak detection step employs the m/z_m and rt_m values for estimating locations in which the missing peaks can be expected. A search is then conducted to find the highest local maximum over a range around the expected location in the raw spectral data. The search is performed over a search window which is preferably user-settable.
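The gap-filling search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the analysis system: the function name `fill_gap`, the scan layout (a list of `(rt, [(mz, intensity), ...])` tuples) and the tolerance parameters are hypothetical, and the sketch simply takes the most intense raw data point inside the search window as the recovered peak.

```python
def fill_gap(scans, expected_mz, expected_rt, mz_tol, rt_tol):
    """Search raw data for a missing peak near an expected location.

    scans: list of (rt, points), where points is a list of (mz, intensity).
    The expected location would come from the master peak list row
    (its average m/z and retention time). All names here are illustrative.
    Returns (rt, mz, intensity) of the most intense point in the window,
    or None if the window contains no data points.
    """
    best = None
    for rt, points in scans:
        # Skip scans outside the retention time window.
        if abs(rt - expected_rt) > rt_tol:
            continue
        for mz, intensity in points:
            # Skip data points outside the m/z window.
            if abs(mz - expected_mz) > mz_tol:
                continue
            if best is None or intensity > best[2]:
                best = (rt, mz, intensity)
    return best


raw_scans = [
    (10.0, [(100.0, 5.0), (100.1, 50.0)]),
    (12.0, [(100.05, 80.0), (200.0, 999.0)]),
    (100.0, [(100.0, 1000.0)]),
]
recovered = fill_gap(raw_scans, expected_mz=100.0, expected_rt=11.0,
                     mz_tol=0.2, rt_tol=5.0)
```

Both tolerances correspond to the user-settable search window mentioned in the text; a production version would probably also check that the recovered point is a genuine local maximum rather than the edge of a neighbouring peak.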

Step 1-18 is a normalization step which will be further described under a subheading "Normalization".

Steps 1-22 and 1-24 collectively constitute the visualization of peaks. In step 1-22 each peak to be visualized is mapped to a coordinate system in which a first coordinate indicates mass/charge ratio and a second coordinate indicates retention time. In step 1-24 a specific visual attribute is assigned to each peak to be visualized. The visualization steps will be described under a subheading "Visualization".

Figure 2 is a block diagram illustrating an exemplary object-based implementation of an analysis system according to the invention. The analysis system is generally denoted by reference numeral 200. The block diagram in Figure 2 follows the conventions of Unified Modelling Language (UML). This class model includes a set of core classes used for representing raw LC/MS data and interfaces for different types of data processing methods and visualization blocks. In the implementation shown here, the core objects for representing raw LC/MS data do not store the actual measurements inside them, but retrieve the data from disk when necessary. This makes it possible to visualize and process several large raw data files at the same time. New data processing methods can be added to the toolbox by implementing a suitable interface.

Input data formats and conversion

Input to the analysis system 200 should be unprocessed measurement data from the mass spectrometer. Such raw data is denoted by reference numeral 205. It is also possible to apply some pre-processing, such as centroiding, before loading the data to the analysis system 200. Such pre-processing may, for example, reduce the amount of storage space needed for the data and/or speed up data processing. This is particularly useful in connection with high-resolution mass spectrometers such as QTof and FTMS. However, the success of data processing with the analysis system depends on the quality of the input data and the pre-processing methods being used. For supporting input data files in NetCDF format, the NetCDF Java Library by the Unidata community can be used in the analysis system 200. Many mass spectrometer vendors provide converters for translating raw data files from their proprietary formats to this common presentation format. The system shown in Figure 2 can be made compatible with NetCDF proteomics and metabolomics data created from a wide variety of instruments, including Quattro Micro (Waters), QSTAR Pulsar (Applied Biosystems), LTQ-FTMS (Thermo Finnigan), and LCQ (Thermo Finnigan). The analysis system can be expanded to include support for upcoming new mass spectrometry data formats such as mzData and mzXML.

Smoothing aims to remove noise in the measured spectra, which facilitates further peak detection. Smoothing is an optional stage in data processing and can also be left out if the data is not noisy or if the input data is already available as centroids. For smoothing the spectral data, the analysis system 200 comprises a filter section 210, which can be implemented in any of a wide variety of techniques, including a moving average filter and a Savitzky-Golay filter.
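As an illustration of the simpler of the two filters named above, a moving average over a single spectrum might look like the sketch below. It is an assumption-laden example: the function name `moving_average` and the `half_width` parameter are hypothetical, and the window is expressed in data points, whereas a filter in the analysis system could equally define its window in m/z units.

```python
def moving_average(intensities, half_width):
    """Smooth a list of intensities with a centred moving average.

    Each output value is the mean of the input values within
    half_width points on either side; the window is truncated at
    the ends of the spectrum. Names are illustrative only.
    """
    n = len(intensities)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = intensities[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


noisy = [0, 0, 9, 0, 0]
smooth = moving_average(noisy, half_width=1)
```

A single-point spike is spread over its neighbours and reduced in height, which is why smoothing is best skipped for data that is already centroided.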

Peak detection

After the optional smoothing, a peak detector 215 finds the peaks in the spectral data. By way of example, the embodiment shown in Figure 2 implements two peak picking methods, namely a local maxima detection and a recursive threshold detection. Both of these detection methods operate in two steps: first, 1-dimensional m/z peaks are searched within each retention time scan separately, and then 1-dimensional peaks in successive spectra are joined together to form 2-dimensional peaks. The joining occurs only between those m/z peaks which are located in successive spectra, have similar m/z values according to pre-set threshold, and form together a well-shaped peak in the chromatographic direction.

Figures 3A and 3B show a simple example of these two peak detection steps.

The two peak picking methods differ in the implementation of the first step: the local maximum method picks each local maximum in a spectrum as an m/z peak, whereas the recursive threshold method considers only those maxima that have a suitable width, which differentiates noise peaks from real peaks. The choice of methods for smoothing and peak detection depends on the nature of the input data. If the data is already pre-processed and/or in centroid form, smoothing is not necessary and a peak detection method based on searching for local maxima is typically the best choice. With unprocessed data, the recursive threshold peak detection usually gives better results.
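The two-step scheme described above (one-dimensional m/z peaks per scan, then joining across successive scans) can be sketched with the local maxima method as follows. This is a simplified illustration, not the system's peak detector: the names `local_maxima` and `join_scans` and the data layout are hypothetical, and the check that the joined points form a well-shaped chromatographic peak is omitted.

```python
def local_maxima(spectrum, min_intensity=0.0):
    """Return (mz, intensity) points that are local maxima in one scan.

    spectrum: list of (mz, intensity) sorted by mz.
    """
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, inten = spectrum[i]
        if (inten > min_intensity
                and inten > spectrum[i - 1][1]
                and inten >= spectrum[i + 1][1]):
            peaks.append((mz, inten))
    return peaks


def join_scans(scans, mz_tol):
    """Join 1-D m/z peaks of successive scans into 2-D peaks.

    scans: list of (rt, spectrum) in retention time order.
    Returns a list of 2-D peaks, each a list of (rt, mz, intensity).
    """
    finished, open_peaks = [], []
    for rt, spectrum in scans:
        candidates = local_maxima(spectrum)
        used, still_open = set(), []
        for peak in open_peaks:
            last_mz = peak[-1][1]
            # Closest unused 1-D peak within the m/z tolerance.
            best = None
            for idx, (mz, _) in enumerate(candidates):
                if idx in used or abs(mz - last_mz) > mz_tol:
                    continue
                if best is None or abs(mz - last_mz) < abs(candidates[best][0] - last_mz):
                    best = idx
            if best is None:
                finished.append(peak)  # no continuation: peak has ended
            else:
                used.add(best)
                peak.append((rt, candidates[best][0], candidates[best][1]))
                still_open.append(peak)
        # Unmatched 1-D peaks start new 2-D peaks.
        for idx, (mz, inten) in enumerate(candidates):
            if idx not in used:
                still_open.append([(rt, mz, inten)])
        open_peaks = still_open
    return finished + open_peaks


example_scans = [
    (1.0, [(99.9, 0), (100.0, 10), (100.1, 0)]),
    (2.0, [(99.9, 0), (100.0, 30), (100.1, 0), (200.0, 5), (200.1, 0)]),
    (3.0, [(99.9, 0), (100.02, 12), (100.1, 0)]),
]
peaks_2d = join_scans(example_scans, mz_tol=0.05)
```

In this toy input the ion near m/z 100 is followed through all three scans, while the point at m/z 200 appears in a single scan only; a recursive threshold detector would additionally reject maxima whose width is unsuitable.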

After peak detection, the detected peaks are aligned by an aligner block 220. In one exemplary but non-restrictive implementation, the analysis system implements an alignment technique that matches each individual peak list against a master peak list. For every peak in an individual peak list, the best matching row in the master peak list is defined as the one having the smallest distance measure.

d = k · | m/z_p - m/z_m | + | rt_p - rt_m |     [1]

wherein m/z_p and rt_p are the m/z ratio and retention time, respectively, of a peak in an individual peak list, while m/z_m and rt_m are the average m/z ratio and retention time, respectively, of all peaks from different peak lists assigned to the same row of the master peak list. k is an adjustable parameter for controlling the balance between accuracy of m/z ratio and retention time values. Generally, k can be set to a larger number with increased resolution of the mass detector.

Initially, the master peak list is empty and all peaks of the first peak list are appended to new rows of the master peak list. When adding peaks of the subsequent peak lists to the master list, both | m/z_p - m/z_m | and | rt_p - rt_m | between a peak and the best-matching row should preferably be within a user-definable threshold level, or the peak needs to be appended to a new row at the end of the master peak list. If a single row of the master peak list is the best match for multiple peaks of a peak list, then only the peak with the smallest distance measure will be added to the best matching row of the master peak list, while the others will be assigned to their second best matches.
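The alignment procedure can be sketched as below. This is a hedged illustration rather than the aligner block 220 itself: it assumes the distance measure is a weighted sum of the absolute m/z and retention time differences, with k weighting the m/z term, and the function names, the default values of `k`, `mz_tol` and `rt_tol`, and the row layout are all hypothetical. Conflicts are resolved greedily by ascending distance, so that the closest peak wins a row and the others fall back to their next best match or to a new row.

```python
def distance(peak, row, k):
    """Assumed distance: k * |m/z difference| + |rt difference|."""
    return k * abs(peak[0] - row["mz"]) + abs(peak[1] - row["rt"])


def align(peak_lists, k=10.0, mz_tol=0.2, rt_tol=60.0):
    """Align peak lists of (mz, rt) tuples against a growing master list.

    Each master row holds the average mz and rt of its member peaks
    and a dict mapping peak list index to the member peak.
    """
    master = []
    for li, peaks in enumerate(peak_lists):
        n_rows = len(master)  # only rows that existed before this list
        candidates = []
        for pi, peak in enumerate(peaks):
            for ri in range(n_rows):
                row = master[ri]
                if (abs(peak[0] - row["mz"]) <= mz_tol
                        and abs(peak[1] - row["rt"]) <= rt_tol):
                    candidates.append((distance(peak, row, k), pi, ri))
        # Smallest distances are assigned first; each row accepts at
        # most one peak per peak list.
        candidates.sort()
        peak_to_row, taken_rows = {}, set()
        for _, pi, ri in candidates:
            if pi in peak_to_row or ri in taken_rows:
                continue
            peak_to_row[pi] = ri
            taken_rows.add(ri)
        for pi, peak in enumerate(peaks):
            if pi in peak_to_row:
                row = master[peak_to_row[pi]]
            else:
                # No acceptable match: append a new row.
                row = {"mz": 0.0, "rt": 0.0, "peaks": {}}
                master.append(row)
            row["peaks"][li] = peak
            members = list(row["peaks"].values())
            row["mz"] = sum(p[0] for p in members) / len(members)
            row["rt"] = sum(p[1] for p in members) / len(members)
    return master


runs = [
    [(100.0, 50.0), (200.0, 80.0)],
    [(100.05, 55.0), (300.0, 20.0)],
]
master_list = align(runs)
```

Here the two peaks near m/z 100 end up in the same row, while the peaks at m/z 200 and m/z 300 each occupy a row with a gap for the other run, which is the situation the second peak detection is meant to address.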

Normalization

The analysis system 200 also comprises a normalization block 225. The purpose of the normalization is to reduce the systematic error in data. The embodiment shown in Figure 2 implements two different normalization approaches: a rather straightforward set of linear normalization methods as well as a more ambitious approach that uses multiple internal standard compounds injected to the spectrometry samples.

Linear normalization methods divide all peak intensities of a single sample by some value calculated using data from that sample. By way of example, the linear normalization method in the analysis system may offer four different ways to calculate the normalization factor: average peak intensity, average squared peak intensity, maximum intensity and total raw signal. All of these methods work globally, which means that they normalize the entire sample using a single normalization factor.
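The four global normalization factors listed above can be sketched as follows. The function names and method labels are hypothetical, "average squared peak intensity" is read here as the mean of the squared intensities (an assumption), and the total raw signal is passed in by the caller because it is computed from the raw data rather than from the peak list.

```python
def normalization_factor(intensities, method, total_raw_signal=None):
    """Return a single normalization factor for one sample.

    intensities: peak intensities of the sample.
    method: "average", "average_squared", "maximum" or "total_raw_signal".
    """
    if method == "average":
        return sum(intensities) / len(intensities)
    if method == "average_squared":
        # Assumed reading: mean of squared intensities.
        return sum(x * x for x in intensities) / len(intensities)
    if method == "maximum":
        return max(intensities)
    if method == "total_raw_signal":
        if total_raw_signal is None:
            raise ValueError("total_raw_signal must be supplied")
        return total_raw_signal
    raise ValueError("unknown method: " + method)


def normalize_linear(intensities, method, total_raw_signal=None):
    """Divide every peak intensity of a sample by one global factor."""
    factor = normalization_factor(intensities, method, total_raw_signal)
    return [x / factor for x in intensities]


by_average = normalize_linear([2.0, 4.0, 6.0], "average")
by_signal = normalize_linear([2.0, 4.0], "total_raw_signal", total_raw_signal=8.0)
```

Because one factor is applied to the whole sample, these methods correct for overall differences in signal level between runs but not for effects that vary with m/z or retention time.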

As shown in Figure 2, the analysis system 200 also comprises a more ambitious normalization method which uses information from multiple standard compounds. This method assumes that some standard compounds are injected to each of the spectrometry samples in known concentrations prior to LC/MS analysis. The standard compound peaks can be used to calculate a set of normalization factors, one for each standard compound. There are several ways to use this information in normalization. One possibility is to determine which standard compound peak is closest to a peak, and normalize this peak using the corresponding normalization factor. The distance function is the same as in equation [1]. A variation of this method is a method based on normalization using a weighted contribution of each standard compound. In this method, the same distance metric as in equation [1] can be used to calculate the distance from a peak to each standard compound. The contribution of each standard to the final normalization factor can be weighted by the inverse of the distance between the peak and the standard as shown by equation [2]:

nf(p) = ( sum_{i=1}^{m} nf_i / d(p, IS_i) ) / ( sum_{i=1}^{m} 1 / d(p, IS_i) )     [2]

In equation [2], m is the number of injected standard compounds, nf_i is the normalization factor calculated using the standard compound with index i, and d(p, IS_i) is the distance between the peak to be normalized and the peak of the standard compound with index i. Both methods reduce to the common single-standard calibration when m=1, in which case only a single internal standard is used.
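Both multi-standard variants can be sketched as follows. The sketch rests on two assumptions: the distance is a weighted sum of absolute m/z and retention time differences with weight k on the m/z term, and the inverse-distance weights are divided by their sum, which is what makes the result collapse to the single factor when only one standard is used. The function names and the dictionary layout of a standard are hypothetical.

```python
def peak_distance(a, b, k=10.0):
    """Assumed distance between two (mz, rt) positions."""
    return k * abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_standard_factor(peak, standards, k=10.0):
    """Use the factor of the standard compound closest to the peak."""
    nearest = min(standards, key=lambda s: peak_distance(peak, s["pos"], k))
    return nearest["nf"]


def weighted_factor(peak, standards, k=10.0):
    """Inverse-distance weighted average of the standards' factors."""
    weighted = []
    for s in standards:
        d = peak_distance(peak, s["pos"], k)
        if d == 0.0:
            # The peak coincides with a standard: use its factor directly.
            return s["nf"]
        weighted.append((1.0 / d, s["nf"]))
    total_weight = sum(w for w, _ in weighted)
    return sum(w * nf for w, nf in weighted) / total_weight


standards = [
    {"pos": (100.0, 10.0), "nf": 2.0},
    {"pos": (100.0, 40.0), "nf": 4.0},
]
peak = (100.0, 20.0)
nearest = nearest_standard_factor(peak, standards)
blended = weighted_factor(peak, standards)
```

For the peak above the distances to the two standards are 10 and 20, so the blended factor lies between 2 and 4 but closer to the nearer standard's value of 2.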

After processing, the spectral data is ready to be exported from the analysis system as a peak intensity matrix. This matrix can then be further processed with proprietary or off-the-shelf mathematics packages, such as Matlab® (MathWorks, Inc.) or R Statistical Language, which already have a large collection of data analysis tools available for statistical analyses of multivariate data.
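A peak intensity matrix of this kind could be written out as tab-delimited text, which both of the packages named above can read. The following is only a sketch: the function name `export_matrix`, the row layout and the column headers are hypothetical, and missing values are written as empty fields.

```python
import csv
import io


def export_matrix(rows, sample_names):
    """Serialize aligned peaks as a tab-delimited intensity matrix.

    rows: list of dicts with "mz", "rt" and "intensities", the last
    mapping a sample name to the peak intensity in that sample.
    Returns the table as a string with one line per aligned peak.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["m/z", "rt"] + list(sample_names))
    for row in rows:
        values = [row["intensities"].get(name, "") for name in sample_names]
        writer.writerow([row["mz"], row["rt"]] + values)
    return buffer.getvalue()


table = export_matrix(
    [{"mz": 100.0, "rt": 50.0, "intensities": {"a": 1.5}}],
    ["a", "b"],
)
```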

Visualization

For visualization independently of external software packages, the analysis system 200 implements one or more visualization techniques for quickly previewing the processed results. These visualization techniques implement the idea of plotting the peak intensity matrix as a two-dimensional plot where one axis (e.g. the x-axis) is the retention time and the other (e.g. the y-axis) is the m/z ratio. Peaks are plotted at the intersection of the coordinates for retention time and m/z ratio using appropriate visual attributes, such as colour, shade, shape, size, line type, or the like.

Figure 4B shows a log ratio plot which is particularly useful for displaying differences between two groups of samples. Differences are measured using a log ratio value which is calculated between average peak intensities of two selected groups:

LR_p = log( I_p,1 / I_p,2 )     [3]

In equation [3], I_p,1 and I_p,2 are the average intensity of peak p in the first and second group of raw data files, respectively.

In the log ratio plot, visual coding, such as colour coding, can be used for visualizing the log ratio values. For example, a first shade or visual attribute (e.g. red or dark hue) can indicate positive log ratio values and a second shade or visual attribute (e.g. green or light hue) negative log ratio values.
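The mapping and attribute assignment for a log ratio plot can be sketched as below. The function names, the dictionary layout of a peak and the use of a base-2 logarithm are assumptions made for illustration; the red and green attributes follow the example given in the text, and grey for a ratio of exactly one is an added convention.

```python
import math


def log_ratio(group1, group2):
    """Log ratio of the average intensities of one peak in two groups.

    Base 2 is an assumption; any base gives the same sign.
    """
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    return math.log2(mean1 / mean2)


def colour_for(value):
    """Assign a visual attribute from the sign of the log ratio."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "grey"


def log_ratio_points(peaks):
    """Map each peak to plot coordinates plus a visual attribute.

    peaks: list of dicts with "rt", "mz", "group1" and "group2", the
    last two being lists of intensities in each sample group.
    Returns (rt, mz, colour, log ratio) tuples ready for plotting.
    """
    points = []
    for p in peaks:
        value = log_ratio(p["group1"], p["group2"])
        points.append((p["rt"], p["mz"], colour_for(value), value))
    return points


plot_points = log_ratio_points([
    {"rt": 120.0, "mz": 353.2, "group1": [8.0, 8.0], "group2": [2.0, 2.0]},
    {"rt": 300.0, "mz": 337.2, "group1": [1.0, 1.0], "group2": [4.0, 4.0]},
])
```

The resulting tuples can be handed to any plotting routine, with retention time on one axis and m/z on the other; a continuous colour scale driven by the numeric value would convey magnitude as well as direction.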

A mathematically precise log ratio is easy to define and implement, but other continuous or stepwise functions can be used. In order to present difference values over a wide range, the function should have a derivative which decreases when its argument increases.

Another useful visualization method is a coefficient of variation plot, which displays a variation of peak intensities within one group of samples. The coefficient of variation plot is drawn similarly as the log ratio plot, but colour or other visual coding is used for displaying the coefficient of variation between peak intensities within a selected group of samples:

CV_p = s_p / I_p     [4]

In equation [4], I_p is the average peak intensity and s_p is the standard deviation of peak intensities in the selected group of samples.
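The coefficient of variation for each peak can be computed with the Python standard library as sketched below. The function names and the peak layout are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption.

```python
import statistics


def coefficient_of_variation(intensities):
    """Standard deviation divided by mean for one peak in one group.

    Requires at least two intensities and a non-zero mean.
    """
    mean = statistics.mean(intensities)
    if mean == 0:
        raise ValueError("mean intensity is zero")
    return statistics.stdev(intensities) / mean


def cv_points(peaks):
    """Map peaks to (rt, mz, cv) tuples for a coefficient of variation plot.

    peaks: list of dicts with "rt", "mz" and "intensities", the last
    being the intensities of the peak within the selected group.
    """
    return [
        (p["rt"], p["mz"], coefficient_of_variation(p["intensities"]))
        for p in peaks
    ]


cv_plot = cv_points([
    {"rt": 120.0, "mz": 353.2, "intensities": [1.0, 2.0, 3.0]},
    {"rt": 300.0, "mz": 337.2, "intensities": [5.0, 5.0, 5.0]},
])
```

A peak whose intensity is identical in every sample has a coefficient of variation of zero, while larger values flag peaks that vary strongly within the group and may deserve a look at the raw data.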

These visualization techniques are particularly useful in quality control, because the analysis system permits the user to return to the raw data and visually verify the obtained results. A benefit of these techniques is the ability to clearly see the separation between two sample groups and/or differences and variability across different samples.

EXAMPLE: METABOLIC PROFILING

A concrete example of metabolic profiling of plant secondary compounds in Catharanthus roseus will be described. Studies of plant metabolites are a demanding area since plants produce a large number of metabolites of high chemical diversity, many of which are unknown. Plant secondary metabolites are produced as responses to changes in the environmental conditions. The biosynthetic pathways of secondary metabolites are largely unknown, and discovery-driven 'omics' approaches promise to enhance our knowledge in this domain. In order to illustrate the utility of the analysis system according to the invention, a demonstration of it in connection with metabolic profiling of cell cultures of the medicinal plant Catharanthus roseus will be described. This plant has been extensively studied due to the presence of terpenoid indole alkaloids (TIA), several of which are in high demand for pharmaceutical use. We focused on the fraction containing the most important secondary metabolites leading to TIA (Methods described in supplementary file). We profiled 20 samples, of which 10 were control strains and 10 were elicited strains. The replicates are the same strain in parallel cultures corresponding to the same time point, so they can be considered as biological replicates. We also injected an internal standard compound, vincamine (PubChem SID 390304).

Using the analysis system according to the invention with a moving average filter (m/z=0.3 window setting), recursive threshold peak detection (default settings), alignment (100 s tolerance in retention time, otherwise default settings), gap-filling (60 s tolerance in retention time), and normalization by total raw signal, 2175 peaks were detected. Representative total ion chromatograms from one elicited and one control sample are shown in Figure 4A. The log-ratio view for the top 20% most intense peaks is shown in Figure 4B. After exporting the processed data in tabular format, further analyses of the data matrix were performed in Matlab® using PLS Toolbox (Eigenvector Research, Inc.) and with R Statistical Language. Principal components analysis revealed clear differences between the elicited and control groups, as shown in Figure 5A. Using factor analysis (not shown), we found that two of the main contributors to the clustering of the elicited group were ajmalicine (PubChem SID 153482) and tabersonine (PubChem SID 163306). The compounds were identified using our internal spectral library based on molecular weight and retention time. Their distribution within the elicited and control groups shows the compounds are significantly upregulated after elicitation (see Figure 5B).

Figure 6 shows a view of an exemplary user interface 60. Reference numerals 60A to 60H denote various sections of the user interface. Section 60A is a list of data files. Section 60B shows a total ion chromatogram for a selected data file. Section 60C shows a spectrum based on a selected data file. Section 60D shows a list of peaks from the selected data file. Section 60E shows the spectral peaks plotted on a two-dimensional coordinate system. Section 60F shows a total ion chromatogram for another selected data file. Section 60G is a list of peaks after alignment, and Section 60H is a list of alignment results after the second peak processing for selected files.

Although the invention has been described in connection with liquid chromatography, it is equally applicable to gas chromatography. It will be apparent to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.

Claims

1. A method for analyzing liquid chromatography/mass spectrometry [="LC/MS"] data, the method comprising: preparing (1-2) a plurality of sample runs; processing (1-4) each of the prepared sample runs in an LC/MS spectrometer to obtain a spectrum in respect of each processed sample run; internally representing (1-10) each spectrum as a layout of mass/charge versus retention time; performing a first peak detection (1-12) to detect peaks of each spectrum; visualizing peaks of each spectrum, wherein the visualizing step comprises: mapping (1-22) each peak to be visualized to a coordinate system in which a first coordinate indicates mass/charge ratio and a second coordinate indicates retention time; and assigning (1-24) a specific visual attribute to each peak to be visualized.

2. A method according to claim 1, further comprising internally aligning (1-14) the detected peaks of each spectrum; and performing a second peak detection (1-16) to detect peaks missed in the first peak detection.

3. A method according to claim 2, wherein the second peak detection comprises local maxima detection.

4. A method according to claim 2 or 3, wherein the second peak detection comprises recursive threshold detection.

5. A method according to claim 1, further comprising normalizing the spectra.

6. A method according to claim 5, further comprising injecting one or more standard compounds with a predetermined concentration into each sample run prior to the processing step (1-4) in order to obtain a set of standard compound peaks for each injected standard compound.

7. A method according to claim 6, further comprising searching for the standard compound peak closest to a peak being analyzed and normalizing the peak being analyzed based on a distance measure of the distance between the peak being analyzed and said closest standard compound peak.

8. A method according to any one of claims 2 to 7, wherein the aligning step comprises (1-14): generating a peak list in respect of each spectrum; generating a master peak list; for each peak in each peak list, finding the corresponding peak in master peak list by using a predetermined distance measure.

9. A method according to claim 7 or 8, wherein the distance measure is based on a weighted combination of | m/z_p - m/z_m | and | rt_p - rt_m |, wherein m/z_p and rt_p are the mass-to-charge ratio and retention time, respectively, of a peak in an individual peak list, and m/z_m and rt_m are the average m/z ratio and retention time, respectively, of all peaks from different peak lists assigned to the same row of the master peak list.

10. A method according to any one of the preceding claims, further comprising visualizing peaks of each spectrum, wherein the visualizing step comprises: mapping each peak to be visualized to a coordinate system in which a first coordinate indicates mass/charge ratio and a second coordinate indicates retention time; and assigning a specific visual attribute to each peak to be visualized.

11. A method according to claim 10, further comprising visualizing peaks from a first group and a second group of samples, and the specific visual attribute is based on a ratio of average intensities of corresponding peaks in the first group and a second group.

12. A method according to claim 10, further comprising visualising peaks from a group of samples, and the specific visual attribute is based on a variation of peak intensities within the group of samples.

13. A computer program product, executable in a computer system, the computer program product comprising program code for instructing the computer system to perform the following steps on a plurality of spectra obtained from a spectrometer used to analyze a plurality of sample runs: internally representing (1-10) each spectrum as a layout of mass/charge versus retention time; performing a first peak detection (1-12) to detect peaks of each spectrum; and searching (1-18) for the standard compound peak closest to a peak being analyzed and normalizing (1-20) the peak being analyzed based on a distance measure of the distance between the peak being analyzed and said closest standard compound peak.

14. A computer system comprising the computer program product according to claim 13.