WO2006088350A1 - Method and system for selection of calibration model dimensionality, and use of such a calibration model - Google Patents

Method and system for selection of calibration model dimensionality, and use of such a calibration model

Info

Publication number: WO2006088350A1
Authority: WIPO (PCT)
Prior art keywords: model, calibration, dimensionality, risk, data
Application number: PCT/NL2005/000124
Other languages: French (fr)
Inventor: Nicolaas Maria Faber
Original Assignee: Chemometry Consultancy
Application filed by Chemometry Consultancy filed Critical Chemometry Consultancy
Priority to PCT/NL2005/000124
Publication of WO2006088350A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01J3/28 Spectrometry; Spectrophotometry; Monochromators; Investigating the spectrum
    • G01N21/274 Calibration, base line adjustment, drift correction
    • G01N21/35 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry, using infrared light
    • G01N21/3504 Using infrared light for analysing gases, e.g. multi-gas analysis
    • G01N21/3563 Using infrared light for analysing solids; preparation of samples therefor
    • G01N21/3577 Using infrared light for analysing liquids, e.g. polluted water
    • G01N21/359 Using near infrared light

Definitions

  • the present invention relates to a method for providing a calibration model comprising a quantitative relation between object (or sample) measurement data and object (or sample) property value(s).
  • the calibration model can also be used for classifying object property values, e.g. in case of gasoline classification, or for analysis of effects in controlled experiments.
  • the method comprises acquiring measurement data of a set of N training objects, the measurement data comprising K predictor values associated with the object measurement data and M predictand values associated with the sample property values for each sample, N, K and M being integer values, resulting in an N x K predictor matrix X and an N x M predictand matrix Y, obtaining calibration models interrelating the predictor matrix X and the predictand matrix Y, each calibration model having model terms, the number of which determines the model dimensionality, and validating the obtained calibration models (e.g. by obtaining a test statistic, such as a cross-validated error).
  • the present invention relates to a system for providing a calibration, classification or analysis of effects model comprising a quantitative relation between object measurement data and object property values
  • the system comprising a sample unit for acquiring measurement data of a set of N training objects, a predictor measurement unit connected to the sample unit for providing measurement data comprising K predictor values associated with the object measurement data, a predictand unit connected to the sample unit for providing M predictand values associated with the sample property values for each sample, N, K and M being integer values, resulting in an N x K predictor matrix X and an N x M predictand matrix Y, and a data processor unit connected to the predictor measurement unit and the predictand unit, the data processor unit being arranged for obtaining calibration models interrelating the predictor matrix X and the predictand matrix Y, each calibration model having model terms, the number of which determines the model dimensionality, and validating the obtained calibration models.
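The arrangement of training data into the two matrices described above can be sketched as follows. This is a minimal illustration only: the sizes N = 4, K = 3, M = 1 and all numerical values are made up and not taken from the disclosure.

```python
import numpy as np

# N = 4 training objects, K = 3 predictor values per object
# (e.g. absorbances at three wavelengths), M = 1 property value.
# All numbers are hypothetical, for illustration only.
X = np.array([[0.12, 0.34, 0.56],
              [0.11, 0.36, 0.55],
              [0.15, 0.30, 0.60],
              [0.13, 0.33, 0.58]])      # N x K predictor matrix
Y = np.array([[87.2],
              [86.9],
              [88.1],
              [87.5]])                  # N x M predictand matrix

assert X.shape == (4, 3)   # N x K
assert Y.shape == (4, 1)   # N x M
```

Each row of X holds the predictor values of one training object, and the corresponding row of Y holds its property value(s), as required by the claimed method.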
  • NIR: near-infrared.
  • Multivariate calibration is then used to develop a quantitative relation, i.e., a model, between the digitized spectra, stored in a data matrix X, and the concentrations, stored in a data matrix Y, as reviewed by H. Martens and T. Naes, Multivariate Calibration, Wiley, NY, 1989.
  • NIR spectroscopy is also increasingly used to infer other properties (stored in Y) of samples than concentrations, e.g., the strength and viscosity of polymers, the thickness of a tablet coating, and the octane rating of gasoline.
  • the first step towards constructing a multivariate calibration model is to remove undesirable features from the X data by pre-treatment techniques such as filtering or differentiation.
  • the next critical step serves to select the optimum model dimensionality, which is the number of terms that constitute the multivariate model. This step is equivalent to determining the optimum degree of a polynomial for fitting univariate (x,y)-data pairs.
  • it is a much harder problem to solve for multivariate calibration owing to the higher complexity of the data at hand and the often-tiny substructures to be discovered. Many methods have been developed to solve this problem, of which model validation is the most frequently applied one in practice.
  • validation amounts to assessing the ability of the model to predict the properties of interest for unknown future objects, e.g., chemical or biological samples, of the same type.
  • This assessment can be performed in two essentially different modes, namely externally and internally.
  • the adjective 'external' refers to the requirement that the validation objects be independent of the objects used for constructing the model, i.e., the training set, otherwise one does not properly assess the ability to predict for truly unknown future objects. For example, replicates are not allowed.
  • the predictive ability is estimated by applying the model to these independent validation objects and averaging the squared prediction errors, i.e., the differences between model prediction and the associated known value. The square root of this average squared error is known as the root mean squared error of prediction (RMSEP).
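The RMSEP defined above can be computed directly from paired predictions and known values. A minimal sketch in Python; the function name `rmsep` is chosen here for illustration and does not appear in the disclosure.

```python
import numpy as np

def rmsep(y_known, y_predicted):
    """Root mean squared error of prediction: the square root of the
    average of the squared differences between model predictions and
    the associated known values."""
    y_known = np.asarray(y_known, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean((y_predicted - y_known) ** 2)))
```

For example, prediction errors of 0, 0 and 2 give an RMSEP of sqrt(4/3).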
  • Internal validation differs from external validation in the sense that the validation objects are taken from the training set itself, i.e., the validation objects are not independent.
  • To execute an internal validation one has the choice between cross-validation and leverage correction.
  • cross-validation one constructs models after judiciously leaving out segments of objects. Then an estimate of RMSEP follows by averaging squared prediction errors for the left-out objects, as in external validation.
  • Cross-validation can be quite computer-intensive, depending on the size of the data sets and the number of segments. Leverage correction is a 'quick and dirty' alternative. Calibration model validation is problematic for various reasons. External validation is best in the sense that a closer assessment of RMSEP is possible.
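Segmented cross-validation as described above can be sketched generically. The function name, the fit/predict callables and the ordinary-least-squares demonstration below are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def cross_validated_rmsep(X, y, fit, predict, n_segments=5):
    """Internal validation: leave out each segment of objects in turn,
    construct the model on the remaining objects, predict the left-out
    objects, and pool the squared prediction errors."""
    indices = np.arange(len(y))
    sq_errors = []
    for segment in np.array_split(indices, n_segments):
        train = np.setdiff1d(indices, segment)
        model = fit(X[train], y[train])
        residuals = predict(model, X[segment]) - y[segment]
        sq_errors.extend(residuals ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

# Demonstration with ordinary least squares on exactly linear data,
# where the cross-validated RMSEP should be numerically zero.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 2))
y_demo = X_demo @ np.array([1.0, -2.0])
ols_fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
ols_predict = lambda coef, Xv: Xv @ coef
err = cross_validated_rmsep(X_demo, y_demo, ols_fit, ols_predict)
```

The computational cost noted in the text is visible here: one model is refitted per segment, so the expense grows with the number of segments and the size of the data.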
  • since the shape of the RMSEP curve depends on the mode (external or internal), the type of data and also on the particular data pre-treatment (first derivative, second derivative, etc.), ad hoc rules are used for visual interpretation.
  • the optimum dimensionality is found, for example, not only as the one that leads to a global minimum, which would be the 'logical' selection criterion, but also, depending on the shape incidentally encountered, as the first local minimum, a plateau, or the point where the curve 'levels off'.
  • a practitioner may use different selection criteria for different pre-treatments of the same data.
  • the present invention seeks to provide an improved method and system for providing a calibration model having a good predictive ability when challenged with future unknown objects of the same type as the objects used for constructing the model.
  • the method and system should be objective in selecting the optimum model dimensionality. As a result, the actual use of the model for predicting a sample property from measured, practical data sets will be much more trustworthy.
  • obtaining the calibration model comprises initialising the calibration model using a predetermined minimum number of model terms, adding an additional model term to increase the dimensionality of the calibration model, calculating a risk of over-fitting by the model associated with the current dimensionality of the model, and repeating the adding of an additional model term and calculating of the risk of over-fitting up to a predetermined dimensionality of the model.
  • the initial calibration model includes a predetermined minimum dimensionality. This may be a single model term, but in cases for which prior knowledge exists, e.g. from experimental design considerations, it may be advantageous to start with a higher dimensionality.
  • calculating a risk comprises calculating a cumulative risk for the calibration model up to and including the current dimensionality of the calibration model, and the repeating of adding and calculating is executed until the cumulative risk of over-fitting of the current calibration model exceeds a predetermined risk threshold value.
  • This embodiment makes it possible to objectively obtain an optimum calibration model dimensionality, based on a risk parameter indicating an acceptable risk of over-fitting.
  • This acceptable risk depends on circumstances, and may e.g. be chosen as 5%, 1%, or even very low (e.g. 0.01%) in case of forensic testing.
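The term-adding loop with a cumulative-risk stopping rule described in the preceding paragraphs can be sketched as follows. The per-term risk estimate is abstracted into a hypothetical callable `risk_of_term` (in practice it could come from, e.g., a randomization test), and the accumulation formula shown, which treats the per-term risks as independent events, is one plausible reading rather than the exact procedure of the invention.

```python
def select_dimensionality(risk_of_term, max_dim, risk_threshold=0.05):
    """Add model terms one at a time and stop as soon as the cumulative
    risk of over-fitting exceeds the acceptable threshold (e.g. 5%, 1%,
    or very low values such as 0.01% for forensic testing).

    risk_of_term(a) is assumed to return the estimated probability
    that term a merely represents noise.
    """
    cumulative_risk = 0.0
    selected = 0
    for a in range(1, max_dim + 1):
        p = risk_of_term(a)
        # Risk that at least one accepted term is noise, assuming
        # independent per-term risks (conservative otherwise).
        cumulative_risk = 1.0 - (1.0 - cumulative_risk) * (1.0 - p)
        if cumulative_risk > risk_threshold:
            break
        selected = a
    return selected

# Illustrative per-term risks: the third term is very likely noise,
# so the loop should settle on a dimensionality of 2.
chosen = select_dimensionality(lambda a: {1: 0.001, 2: 0.002, 3: 0.5}[a],
                               max_dim=3)
```

With a first-term risk already above the threshold, the loop returns zero, i.e. no term passes the test.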
  • the method further comprises ordering the additional model terms of the calibration model.
  • an explicit ordering of the model terms may be beneficial, e.g. in Principal Component Regression (PCR).
  • an implied ordering may already be present, e.g. in Partial Least Squares Regression (PLSR), but still an explicit ordering may provide advantages in this case.
  • the explicit ordering may be based on correlation, on a prediction coefficient, on the correlation relative standard deviation, or even on top-down ordering.
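As a sketch of one such explicit ordering step, the candidate terms can be ranked by the absolute correlation of their score vectors with the property of interest. The helper below and its name are illustrative assumptions, not the disclosed procedure.

```python
import numpy as np

def order_terms_by_correlation(T, y):
    """Rank candidate model terms (score vectors in the columns of T)
    by the absolute value of their correlation with the property of
    interest y, most relevant term first."""
    strength = [abs(np.corrcoef(T[:, a], y)[0, 1]) for a in range(T.shape[1])]
    return np.argsort(strength)[::-1]

# Column 1 equals y itself, so it should be ranked before column 0.
T = np.column_stack([[1.0, -1.0, 1.0, -1.0],
                     [1.0,  2.0, 3.0,  4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
order = order_terms_by_correlation(T, y)
```

Any of the other criteria named above could be substituted for the correlation without changing the structure of this step.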
  • the model dimensionality is limited to a maximum value. When all possible calibration models up to the maximum value are observed, this is particularly advantageous to demonstrate that the higher-numbered non-significant terms of the calibration models may safely be left out.
  • a statistical procedure is selected for quantifying the risk of over-fitting in a further embodiment.
  • the statistical procedure comprises a randomization test.
  • a randomization test parameter, e.g. the number of randomizations, may be specified in a further embodiment.
  • validating the obtained calibration models comprises obtaining a test statistic for each of the obtained calibration models. This may e.g. be based on correlation, model fit, internal validation, external validation, etc.
  • the present invention relates to a system as defined in the preamble, in which the data processor unit is further arranged for obtaining the calibration model by initialising the calibration model using a predetermined minimum number of model terms, adding an additional model term to increase the dimensionality of the calibration model, calculating a risk of over-fitting by the model associated with the current dimensionality of the model, and repeating the adding of an additional model term and calculating of the risk of over-fitting up to a predetermined dimensionality of the model.
  • the data processing unit may further be arranged for executing the various embodiments of the present method.
  • the present invention relates to the use of a calibration model obtained by the present method, comprising inputting measurement data of a sample to the model for obtaining at least one property value related to the sample.
  • FIG. 1 represents a schematic description of the various steps leading to a multivariate calibration model according to current practice
  • FIGS. 2a and 2b illustrate near-infrared (NIR) absorbance spectra, in FIG. 2a for the prediction of the octane rating of gasoline, and in FIG. 2b for the prediction of the hydrogen content of gas oil;
  • FIGS. 3a and 3b illustrate the dependence of the cross-validated RMSEP on the trial model dimensionality; FIG. 3a shows results for the octane data, and FIG. 3b for the gas oil data;
  • FIGS. 4a and 4b illustrate frequency histograms of the test statistic after randomization, in comparison with the value actually observed for the input data, for increasing trial model dimensionality; FIG. 4a shows results for the octane data, and FIG. 4b for the gas oil data;
  • FIGS. 5a and 5b illustrate the cumulative risk of over-fitting the data for increasing trial model dimensionality; FIG. 5a shows results for the octane data, and FIG. 5b for the gas oil data;
  • FIG. 6 represents a schematic description of one embodiment of the method according to the present invention.
  • FIG. 7 shows a calibration model providing system according to an embodiment of the present invention.

PREFERRED EMBODIMENTS OF THE INVENTION
  • the present invention is henceforth described as using input data from near- infrared (NIR) spectroscopy, but it should be appreciated that it is possible to enter input data from almost any technical field as long as the objectives set forth through the invention are fulfilled.
  • the method according to the present invention is not limited to spectroscopic input data.
  • the term property used in the present invention shall be given a broad interpretation: it may comprise properties of solid, semi-solid, fluid, vapor samples etc., such as concentration, density, elasticity, viscosity, strength, thickness, class membership (e.g. octane rating for gasoline classification) etc., but also predictions from probability input data (e.g. stock market information) or other input figures for prediction from any technical field.
  • upper-case bold characters are used for matrices, e.g., X and Y, lower-case bold characters for column vectors, e.g., t, italic characters for scalars, e.g., a and A.
  • Transposition of matrices and vectors will be denoted by a superscripted "T", e.g., P^T.
  • the multivariate calibration model under consideration approximates the data for the training set objects as a sum of outer products of vectors: X = t_1 p_1^T + t_2 p_2^T + ... + t_A p_A^T + E, where each t_a is an N x 1 score vector, each p_a is a K x 1 loading vector, E is an N x K matrix of residuals, and:
  • N is the number of training objects
  • K is the number of predictor variables
  • A is the model dimensionality.
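The sum-of-outer-products form of the model can be illustrated numerically. The sizes below are arbitrary, and the equivalence between the sum of outer products and the single matrix product T P^T is a standard linear-algebra identity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, A = 10, 50, 3          # arbitrary illustrative sizes

T = rng.normal(size=(N, A))  # columns are the score vectors t_a (N x 1)
P = rng.normal(size=(K, A))  # columns are the loading vectors p_a (K x 1)

# The modelled part of X is the sum of A outer products t_a p_a^T ...
X_model = sum(np.outer(T[:, a], P[:, a]) for a in range(A))

# ... which is identical to the single matrix product T P^T.
assert X_model.shape == (N, K)
assert np.allclose(X_model, T @ P.T)
```

Each added term thus contributes one rank-one layer to the approximation of X, which is why the number of terms A is called the model dimensionality.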
  • the predictor data of a single object constitute a row vector in X hence a single index suffices to characterize them.
  • Multiway calibration deals with predictor data of higher complexity than multivariate calibration (A. Smilde, R. Bro and P. Geladi, Multi-way Analysis. Applications in the Chemical Sciences, Wiley, Chichester, 2004).
  • the predictor data of a single object constitute an array hence a single index no longer suffices to characterize them.
  • the resulting predictor data are characterized by four indices, namely three spatial indices and one spectral index, all of which are independent, thus leading to a four-way array.
  • a multiway calibration model can, however, also be equally represented in terms of score vectors (t_a's) and loading vectors (p_a's and q_a's), but now there will exist constraints among the elements of the loading vectors: the single pseudo-index corresponds one-to-one to multiple physical indices in an ordered way. Although these constraints even depend on the estimation procedure deployed, the multiway calibration model can be represented as a sum of outer products of vectors, without loss of information. The problem of selecting the optimum multiway calibration model dimensionality is therefore, in principle, the same as selecting the optimum multivariate calibration model dimensionality. Thus, multivariate solutions are equally valid in the multiway domain.
  • the properties of interest, arranged in the rows of Y, are usually continuous variables. However, the same set-up of X and Y data can be used to predict integer-valued properties of interest. In the case where the integer codes for class membership, the purpose of the model would be classification, rather than calibration. The problem of selecting the optimum classification model dimensionality is therefore, in principle, the same as selecting the optimum calibration model dimensionality. Thus, calibration solutions are equally valid for classification, whether the predictor data are multivariate or multiway.
  • the predictors, arranged in the rows of X, are usually continuous variables. However, the same set-up of X and Y data can be used to predict from integer-valued predictors.
  • the purpose of the model would be analysis of effects, rather than calibration, see H. Martens and M. Martens, Multivariate Analysis of Quality. An Introduction, Wiley, Chichester, 2001.
  • the problem of selecting the optimum analysis of effects model dimensionality is therefore, in principle, the same as selecting the optimum calibration model dimensionality. Thus, calibration solutions are equally valid for analysis of effects.
  • FIG. 1 represents a schematic description of the various steps leading to a multivariate calibration model of the form under consideration, according to the currently most common practice.
  • in step 100, samples are taken from a substance or matter and subjected to a multivariate data source (step 110, measurement of K predictor values), for instance a spectrometer, a chromatograph, or an electrochemical instrument, i.e., an instrument that provides multidimensional data (vectors) as a result.
  • the predictor data are arranged in, for example, the rows of a matrix X with N x K elements (step 120).
  • in step 130, concentration or property measurements from the sample substance or matter yield the predictand matrix or vector Y of size N x M (step 140).
  • the trial models are ranked in step 170 according to the validation results.
  • the results of the best ranking trial model(s) are reported in step 180, e.g. on displaying means. It is seen that constructing a multivariate calibration model is essentially a trial and error process, because the best setting is usually not known in advance.
  • the main problem is to avoid over-fitting. In other words, it is desirable to stop adding terms to the model when they represent noise. Since terms are added sequentially, the actual risk of over-fitting increases monotonically with the number of terms. To decide about adding a term, the associated cumulative risk must be made precise in terms of a probability, which is a well-defined topic in statistics. Thus, to adequately control the actual risk of over-fitting, one should estimate it using a statistical procedure. For each term it must be determined whether the risk of over-fitting is acceptable, or not. If the risk is acceptable, the term under scrutiny passes the statistical test; otherwise one stops adding terms. The risk thus estimated should incorporate the risk of over-fitting for previous terms.
  • the practitioner must select a criterion for ordering the terms, when ordering is opted for. Clearly, the criterion must reflect the relevance of a term for the description of the property of interest (Y).
  • the present version of the COMODITE method uses the correlation between the score vectors and the property of interest, but any criterion based on model fit, internal or external validation etc. may be equally suitable. Consequently, the ordering is considered to be suitable for PLSR by construction, but an explicit ordering step is required for, for example, PCR.
  • the practitioner must set the overall acceptable risk of over-fitting, α.
  • the present version of the COMODITE method includes a randomization test for determining the risk, but any statistical procedure that does not make overly stringent assumptions about the data is suitable.
  • the particular choice of a randomization test implies that the model term under test must be constructed from the data from which the lower-numbered terms have been eliminated, i.e., residual data sets. The reason for this is that the previously tested terms would give a spurious contribution to the test statistic under the null hypothesis. Alternatives for the randomization test may require similar actions.
  • the practitioner must select the test statistic, T.
  • the present version of the COMODITE method uses the correlation between the score vectors and the property of interest, but any criterion based on model fit, internal or external validation etc. may be equally suitable.
  • the practitioner must determine how to update the risk of over-fitting.
  • the cumulative risk is calculated from a product of estimated probabilities, which is exact for independent events and conservative otherwise.
  • the multivariate PLSR calibration models must relate tiny substructures in the spectra to variations in the properties of interest.
  • the optimum model dimensionality depends on the property of interest when calibrating NIR spectra. In other words, limited prior knowledge is available, which makes the selection of model dimensionality a critical step.
  • the results of the method according to the present invention are shown in FIG. 4. Compared are the test statistic obtained for the actual data set (vertical dashed line) and the frequency histogram for the test statistic after randomizing the rows of Y. The number of randomizations is set to the rather high value of 1000 to obtain representative results. If a term contains real structure, then the test statistic for the actual data set should stand out. By contrast, if a term does not contain real structure, then it does not matter whether one scrambles the rows of Y_a (relative to the rows of X_a), since there is no relation between the rows of the residual matrices X_a and Y_a anyway.
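A minimal sketch of such a randomization test for a single term, using the absolute correlation between a score vector and the (residual) property as the test statistic. The function name, the seeding and the tail-counting rule are illustrative assumptions rather than the exact disclosed procedure.

```python
import numpy as np

def randomization_risk(t_a, y_a, n_randomizations=1000, seed=0):
    """Estimate the risk that a model term represents noise: compare
    the observed test statistic with its frequency distribution after
    repeatedly scrambling the elements of y_a, which destroys any real
    relation between t_a and y_a."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(t_a, y_a)[0, 1])
    exceed = 0
    for _ in range(n_randomizations):
        scrambled = rng.permutation(y_a)
        if abs(np.corrcoef(t_a, scrambled)[0, 1]) >= observed:
            exceed += 1
    # Fraction of randomized data sets in which the statistic stands
    # out at least as much as for the actual data.
    return exceed / n_randomizations

# A score vector perfectly related to the property should carry
# essentially no risk of over-fitting.
t = np.arange(30.0)
risk = randomization_risk(t, 2.0 * t + 1.0, n_randomizations=200)
```

If, by contrast, the term contained no real structure, the observed statistic would fall inside the randomized histogram and the returned risk would be large.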
  • the plots of FIG. 5 further illustrate that the method according to the present invention is able to highlight subtle aspects of the model dimensionality selection process. By contrast, cross-validation gives no indication that these data sets could require such vastly different dimensionalities.
  • a schematic description of the method according to the present invention is now provided with reference to FIG. 6. Selection of the model dimensionality in step 600 precedes the validation step 610 of the trial model. In principle, one only needs to validate the trial data pre-treatment for the optimum selected model dimensionality, which is computationally efficient. The rather conspicuous difference with FIG. 1 is that the essentially different tasks of selection of model dimensionality and model validation are disentangled.
  • the method according to the present invention may be implemented in a hardware environment, e.g. in the exemplary embodiment of a calibration model providing system 10 as shown in Fig. 7.
  • the system 10 comprises a sample unit 11 which is arranged to obtain measurement data from a group of N samples.
  • the sample unit 11 is connected to a predictor measurement unit 13, e.g.
  • the sample unit 11 is also connected to a predictand unit 12, which is arranged to provide the property values related to each sample, in order to obtain the N x M predictand matrix Y.
  • the predictand unit 12 may comprise an actual measurement apparatus, or it may comprise an input unit for entering a known property value (or values) for each of the N samples.
  • the system 10 comprises a data processing unit 14 connected to a memory unit 15.
  • the memory unit 15 may comprise suitable program code to control the data processing unit 14 to function according to the present method, and may comprise a semiconductor memory device, a magnetic memory device (such as a hard disk), an optical memory device, etc.
  • the data processing unit 14 may comprise a processor or multiple processors.
  • the memory unit 15 is also suitable for storing intermediate calculation results and for storing the eventually obtained calibration model.
  • the data processing unit 14 and memory unit 15 may be formed by a general purpose personal computer, arranged to interface with the predictand unit 12 and predictor measurement unit 13. As will be apparent to the skilled person, input/output devices will be part of the data processing unit 14 for controlling the system 10 by an operator.
  • the data processing unit 14 is furthermore connected with a reporting unit 16 for outputting data related to the present invention, and may be formed by a display, a printer, or further storage means.
  • the system 10 may be used to obtain the desired property values of further actual samples with unknown properties.
  • the predictand unit 12 is not used, and the data processing unit 14 is only used to obtain the desired property value (values) of samples from the related measurement data using the calibration model.

Abstract

Method and system for providing a calibration model comprising a quantitative relation between object measurement data and object property values. The system (10) comprises a sample unit (11), a predictor measurement unit (13), a predictand unit (12), and a data processor unit (14). The data processor unit (14) is arranged for obtaining calibration models, each calibration model having a number of model terms which determines the model dimensionality, and validating the obtained calibration models. The data processor unit (14) further obtains the calibration model by initialising the calibration model, adding an additional model term to increase the dimensionality of the calibration model, calculating a risk of over-fitting by the model associated with the current dimensionality of the model, and repeating the adding of an additional model term and calculating of the risk of over-fitting up to a predetermined dimensionality of the model.

Description

METHOD AND SYSTEM FOR SELECTION OF CALIBRATION MODEL DIMENSIONALITY, AND USE OF SUCH A CALIBRATION MODEL
TECHNICAL FIELD
The present invention relates to a method for providing a calibration model comprising a quantitative relation between object (or sample) measurement data and object (or sample) property value(s). The calibration model can also be used for classifying object property values, e.g. in case of gasoline classification, or for analysis of effects in controlled experiments. The method comprises acquiring measurement data of a set of N training objects, the measurement data comprising K predictor values associated with the object measurement data and M predictand values associated with the sample property values for each sample, N, K and M being integer values, resulting in an N × K predictor matrix X and an N × M predictand matrix Y, obtaining calibration models interrelating the predictor matrix X and the predictand matrix Y, each calibration model having model terms, the number of which determines the model dimensionality, and validating the obtained calibration models (e.g. by obtaining a test statistic, such as a cross-validated error).
In a further aspect, the present invention relates to a system for providing a calibration, classification or analysis of effects model comprising a quantitative relation between object measurement data and object property values, the system comprising a sample unit for acquiring measurement data of a set of N training objects, a predictor measurement unit connected to the sample unit for providing measurement data comprising K predictor values associated with the object measurement data, a predictand unit connected to the sample unit for providing M predictand values associated with the sample property values for each sample, N, K and M being integer values, resulting in an N × K predictor matrix X and an N × M predictand matrix Y, and a data processor unit connected to the predictor measurement unit and the predictand unit, the data processor unit being arranged for obtaining calibration models interrelating the predictor matrix X and the predictand matrix Y, each calibration model having model terms, the number of which determines the model dimensionality, and validating the obtained calibration models.
BACKGROUND OF THE INVENTION
US patent US-B-6,480,795 describes an automatic calibration method for a spectrometer for evaluating spectra. In this known method, models are generated and validated for all possible permutations of data pre-treatment methods, wavelength ranges and calibration methods. This method requires very extensive calculations, and still, the resulting quality values used to select a best model are difficult to interpret correctly. Multivariate calibration models play an important role in various technical fields. These models are applied in particular in the chemical, petrochemical, pharmaceutical, cosmetic, coloring, plastics, paper, rubber and foodstuffs industries, but also in forensic, environmental, medical, marketing and sensory applications. As an illustration, consider near-infrared (NIR) spectroscopy, which is increasingly used for the characterization of solid, semi-solid, fluid and vapor samples. Frequently, the objective with this characterization is to determine the value of one or several concentrations in the samples. Multivariate calibration is then used to develop a quantitative relation, i.e., a model, between the digitized spectra, stored in a data matrix X, and the concentrations, stored in a data matrix Y, as reviewed by H. Martens and T. Naes, Multivariate Calibration, Wiley, NY, 1989. NIR spectroscopy is also increasingly used to infer other properties (stored in Y) of samples than concentrations, e.g., the strength and viscosity of polymers, the thickness of a tablet coating, and the octane rating of gasoline.
Usually, the first step towards constructing a multivariate calibration model is to remove undesirable features from the X data by pre-treatment techniques such as filtering or differentiation. When the data have been made appropriate for the actual modeling process, the next critical step serves to select the optimum model dimensionality, which is the number of terms that constitute the multivariate model. This step is equivalent to determining the optimum degree of a polynomial for fitting univariate (x,y)-data pairs. However, it is a much harder problem to solve for multivariate calibration, owing to the higher complexity of the data at hand and the often-tiny substructures to be discovered. Many methods have been developed to solve this problem, of which model validation is the most frequently applied one in practice. In the calibration context, validation amounts to assessing the ability of the model to predict the properties of interest for unknown future objects, e.g., chemical or biological samples, of the same type. This assessment can be performed in two essentially different modes, namely externally and internally. The adjective 'external' refers to the requirement that the validation objects be independent of the objects used for constructing the model, i.e., the training set; otherwise one does not properly assess the ability to predict for truly unknown future objects. For example, replicates are not allowed. The predictive ability is estimated by applying the model to these independent validation objects and averaging the squared prediction errors, i.e., the differences between model prediction and the associated known value. The square root of this average squared error is known as the root mean squared error of prediction (RMSEP). Internal validation differs from external validation in the sense that the validation objects are taken from the training set itself, i.e., the validation objects are not independent.
To execute an internal validation, one has the choice between cross-validation and leverage correction. In cross-validation, one constructs models after judiciously leaving out segments of objects. Then an estimate of RMSEP follows by averaging squared prediction errors for the left-out objects, as in external validation. Cross-validation can be quite computer-intensive, depending on the size of the data sets and the number of segments. Leverage correction is a 'quick and dirty' alternative. Calibration model validation is problematic for various reasons. External validation is best in the sense that a closer assessment of RMSEP is possible. However, it is wasteful because the validation objects are not available for the construction of the model. Moreover, it requires substantial expertise from the practitioner to collect enough objects with sufficient spread to cover all relevant long-term variations. Cross-validation, on the other hand, ensures a more economic use of the available data, but it cannot, in a strict sense, be used if the data are designed. This drawback can be ignored if the training set is large enough (say 20-30 objects, depending on the design), but certainly precludes its use in certain sensory applications where the training set can be as small as five objects. In the latter case, an independent validation set will also be lacking. Moreover, cross-validation has the tendency to select too many terms because the same objects are used for both calibration and validation. In other words, with cross-validation one is vulnerable to over-fitting the training data. Over-fitting causes harm because one not only incorporates predictive features of the data in the model, but also noise. The consequence is degraded model performance in the prediction stage. Leverage correction is even more likely to lead to over-fitting models. It should only be used for constructing initial models, i.e., when pre-treatment is not yet fixed, not the final ones.
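The segmented cross-validation described above can be sketched in code. This is an illustrative outline only, not the patented procedure: the data are synthetic, and an ordinary least-squares fit stands in for an actual calibration method such as PLSR.

```python
import numpy as np

def cv_rmsep(X, y, fit, predict, n_segments=5):
    """Estimate RMSEP by segmented cross-validation.

    `fit` and `predict` are caller-supplied placeholders for the
    calibration method under study (e.g. PLSR).
    """
    idx = np.arange(len(y))
    sq_err = []
    for seg in np.array_split(idx, n_segments):
        train = np.setdiff1d(idx, seg)        # leave out one segment
        model = fit(X[train], y[train])
        y_hat = predict(model, X[seg])
        sq_err.extend((y[seg] - y_hat) ** 2)  # squared prediction errors
    return float(np.sqrt(np.mean(sq_err)))    # root mean squared error

# Synthetic illustration with least squares as the "calibration" model:
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)
ols_fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
ols_predict = lambda b, Xv: Xv @ b
rmsep = cv_rmsep(X, y, ols_fit, ols_predict, n_segments=5)
```

Because the left-out objects come from the training set itself, this is an internal validation in the sense used above; external validation would apply the model to an entirely independent set.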
Not only is calibration model validation problematic by itself, but validation-based selection of model dimensionality is problematic for additional reasons. This can be understood as follows. The validation yields RMSEP values for models of increasing dimensionality. Because the shape of the RMSEP curve depends on the mode (external or internal), the type of data and also on the particular data pre-treatment (first-, second-derivative, etc.), ad hoc rules are used for visual interpretation. The optimum dimensionality is found, for example, not only as the one that leads to a global minimum, which would be the 'logical' selection criterion, but also, depending on the shape incidentally encountered, as the first local minimum, a plateau, or where the curve 'levels off'. As a result, a practitioner may use different selection criteria for different pre-treatments of the same data. He or she may also have difficulties deciding which criterion to use for a particular incidentally encountered curve. In other words, (s)he may take different decisions on the basis of the same numbers. Conversely, (s)he may arrive at the same decision but on different grounds, which is equally disturbing. It follows that different practitioners may obtain different validation-based model dimensionalities for the same data. Moreover, depending on the number of validation objects, the individual RMSEP-values have a high intrinsic uncertainty, which may influence the selection. This uncertainty could be displayed as error bars to 'guide the eye' but this information is incomplete since the RMSEP-values are highly correlated. In short, model validation, whether executed in external or internal mode, is not objective, even for a highly trained expert.
It has been attempted to rationalize the validation-based selection of model dimensionality by comparing competing models in a pair-wise fashion, see H. van der Voet, Comparing the predictive accuracy of models using a simple randomization test, Chemometrics and Intelligent Laboratory Systems, 25 (1994) 313-323, F. Lindgren, B. Hansen, W. Karcher, M. Sjöström and L. Eriksson, Model validation by permutation tests: applications to variable selection, Journal of Chemometrics, 10 (1996) 521-532, and E.V. Thomas, Non-parametric statistical methods for multivariate calibration model selection and comparison, Journal of Chemometrics, 17 (2003) 653-659. However, the initial choice of competing model dimensionalities to be further scrutinized is left to the practitioner. As a result, a major source of subjectivity is not eliminated. For example, the finally selected model could still be over-fitting the data, because this fundamental issue is not addressed in any way - after all, each model entering this stage could over-fit. It stands to reason that the chain cannot be stronger than its weakest link. A recent survey in the spectroscopic field shows that there are no software packages that give a clear unambiguous automated warning when data is being over-fitted, see A.N. Davies, Analytical computing survey, Spectroscopy Europe, 16 (2004) 26-27.
SUMMARY OF THE INVENTION
Accordingly, the present invention seeks to provide an improved method and system for providing a calibration model having a good predictive ability when challenged with future unknown objects of the same type as the objects used for constructing the model. The method and system should be objective in selecting the optimum model dimensionality. As a result, the actual use of the model for predicting a sample property from measured, practical data sets will be much more trustworthy.
According to a first aspect of the present invention, a method according to the preamble defined above is provided, in which obtaining the calibration model comprises initialising the calibration model using a predetermined minimum number of model terms, adding an additional model term to increase the dimensionality of the calibration model, calculating a risk of over-fitting by the model associated with the current dimensionality of the model, and repeating the adding of an additional model term and calculating of the risk of over-fitting up to a predetermined dimensionality of the model.
This makes it possible to obtain not only insight into the validation quality of calibration models of different dimensionality (e.g. using cross-validated error values), but also insight into the risk that a calibration model of a certain dimensionality will result in over-fitting problems. Hence, an improved choice can be made for the best calibration model dimensionality. The initial calibration model includes a predetermined minimum dimensionality. This may be a single model term, but in cases for which prior knowledge exists, e.g. from experimental design considerations, it may be advantageous to start with a higher dimensionality.
In a further embodiment, calculating a risk comprises calculating a cumulative risk for the calibration model up to and including the current dimensionality of the calibration model, and the repeating of adding and calculating is executed until the cumulative risk of over-fitting of the current calibration model exceeds a predetermined risk threshold value.
This embodiment makes it possible to objectively obtain an optimum calibration model dimensionality, based on a risk parameter indicating an acceptable risk of over-fitting. This acceptable risk depends on circumstances, and may e.g. be chosen as 5%, 1%, or even very low (e.g. 0.01%) in case of forensic testing.
In a further embodiment, the method further comprises ordering the additional model terms of the calibration model. For certain estimation procedures, an explicit ordering of the model terms may be beneficial, e.g. in Principal Component Regression (PCR). For other estimation procedures, an implied ordering may already be present, e.g. in Partial Least Squares Regression (PLSR), but even then an explicit ordering may provide advantages. The explicit ordering may be based on correlation, on a prediction coefficient, on the correlation relative standard deviation, or even on top-down ordering. In an even further embodiment, the model dimensionality is limited to a maximum value. When all possible calibration models up to the maximum value are observed, this is particularly advantageous to demonstrate that the higher-numbered non-significant terms of the calibration models may be left out with impunity.
A statistical procedure is selected for quantifying the risk of over-fitting in a further embodiment. As an example, the statistical procedure comprises a randomization test. A randomization test parameter (e.g. number of randomizations) is advantageously set dependent on the predetermined risk threshold value.
In an even further embodiment, validating the obtained calibration models comprises obtaining a test statistic for each of the obtained calibration models. This may e.g. be based on correlation, model fit, internal validation, external validation, etc.
In a further aspect, the present invention relates to a system as defined in the preamble, in which the data processor unit is further arranged for obtaining the calibration model by initialising the calibration model using a predetermined minimum number of model terms, adding an additional model term to increase the dimensionality of the calibration model, calculating a risk of over-fitting by the model associated with the current dimensionality of the model, and repeating the adding of an additional model term and calculating of the risk of over-fitting up to a predetermined dimensionality of the model. The data processing unit may further be arranged for executing the various embodiments of the present method.
In still a further aspect, the present invention relates to the use of a calibration model obtained by the present method, comprising inputting measurement data of a sample to the model for obtaining at least one property value related to the sample.
BRIEF DESCRIPTION OF THE DRAWINGS
For a more complete understanding of the present invention, reference may now be had to the following description taken in conjunction with the accompanying drawings, in which: FIG. 1 represents a schematic description of the various steps leading to a multivariate calibration model according to current practice;
FIGS. 2a and 2b illustrate near-infrared (NIR) absorbance spectra, in FIG. 2a for the prediction of octane rating of gasoline, and in FIG. 2b for the prediction of hydrogen content of gas oil; FIGS. 3a and 3b illustrate the dependence of cross-validated RMSEP on trial model dimensionality, in FIG. 3a are results for the octane data, and in FIG. 3b for the gas oil data;
FIGS. 4a and b illustrate frequency histograms of the test statistic after randomization in comparison with the value actually observed for the input data for increasing trial model dimensionality, in FIG. 4a are results for the octane data, and in FIG. 4b for the gas oil data;
FIGS. 5a and 5b illustrate the cumulative risk of over-fitting the data for increasing trial model dimensionality, in FIG. 5a are results for the octane data, and in FIG. 5b for the gas oil data; FIG. 6 represents a schematic description of one embodiment of the method according to the present invention; and
FIG. 7 shows an embodiment of a calibration model providing system according to an embodiment of the present invention.
PREFERRED EMBODIMENTS OF THE INVENTION
The present invention is henceforth described as using input data from near- infrared (NIR) spectroscopy, but it should be appreciated that it is possible to enter input data from almost any technical field as long as the objectives set forth through the invention are fulfilled. Thus, the method according to the present invention is not limited to spectroscopic input data. Also, the term property used in the present invention shall be given a broad interpretation: it may comprise properties of solid, semi-solid, fluid, vapor samples etc. such as concentration, density, elasticity, viscosity, strength, thickness, class belonging (e.g. octane rating for gasoline classification) etc., but also predictions from probability input data (e.g. stock market information) or other input figures for prediction from any technical field etc.
As an illustration, a modeling of the octane rating of gasoline and the hydrogen content of gas oil, both in terms of their NIR absorbance spectra, is displayed in the present description.
In the present description of a preferred embodiment upper-case bold characters are used for matrices, e.g., X and Y, lower-case bold characters for column vectors, e.g., t, and italic characters for scalars, e.g., a and A. Transposition of matrices and vectors will be denoted by a superscripted "T", e.g., P^T. The multivariate calibration model under consideration approximates the data for the training set objects as a sum of outer products of vectors:
X ≈ Σ_{a=1}^{A} t_a p_a^T,    Y ≈ Σ_{a=1}^{A} t_a q_a^T,
where the rows of X (N × K) hold the predictor variables (e.g. NIR spectra), the rows of Y (N × M) hold the predictand variables (properties of interest, e.g., analyte concentrations), t_a (N × 1), p_a (K × 1), and q_a (M × 1) are named the score, x-loading and y-loading vector associated with dimension a (a = 1, …, A), respectively, N is the number of training objects, K is the number of predictor variables, M is the number of predictand variables, and A is the model dimensionality.
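The sum-of-outer-products form can be verified with a minimal numerical sketch; all dimensions and vectors below are invented for illustration.

```python
import numpy as np

# Illustrative dimensions: N training objects, K predictors,
# M predictands, A model terms
N, K, M, A = 6, 4, 2, 2
rng = np.random.default_rng(1)
T = rng.normal(size=(N, A))   # score vectors t_a as columns
P = rng.normal(size=(K, A))   # x-loading vectors p_a as columns
Q = rng.normal(size=(M, A))   # y-loading vectors q_a as columns

# Sum of outer products, term by term ...
X_approx = sum(np.outer(T[:, a], P[:, a]) for a in range(A))
Y_approx = sum(np.outer(T[:, a], Q[:, a]) for a in range(A))

# ... is identical to the compact matrix products T P^T and T Q^T
assert np.allclose(X_approx, T @ P.T)
assert np.allclose(Y_approx, T @ Q.T)
```

Adding one model term thus amounts to appending one column to each of T, P and Q, which is the sequential structure the dimensionality selection exploits.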
The particular form of the multivariate calibration model under consideration has the following implications with respect to applicability: • In the current formulation, the predictor data of a single object constitute a row vector in X, hence a single index suffices to characterize them. Multiway calibration deals with predictor data of higher complexity than multivariate calibration (A. Smilde, R. Bro and P. Geladi, Multi-way Analysis. Applications in the Chemical Sciences, Wiley, Chichester, 2004). In multiway calibration, the predictor data of a single object constitute an array, hence a single index no longer suffices to characterize them. Consider, for example, an imaging instrument that takes spectra of individual 'voxels' in 3D. The resulting predictor data are characterized by four indices, namely three spatial indices and one spectral index, all of which are independent, thus leading to a four-way array. A multiway calibration model can, however, also be equally represented in terms of score vectors (t_a's) and loading vectors (p_a's and q_a's), but now there will exist constraints among the elements of the loading vectors; the single pseudo-index corresponds one-to-one to multiple physical indices in an ordered way. Although these constraints even depend on the estimation procedure deployed, the multiway calibration model can be represented as a sum of outer products of vectors, without loss of information. The problem of selecting the optimum multiway calibration model dimensionality is therefore, in principle, the same as selecting the optimum multivariate calibration model dimensionality. Thus, multivariate solutions are equally valid in the multiway domain.
  • The properties of interest, arranged in the rows of Y, are usually continuous variables. However, the same set-up of X and Y data can be used to predict integer-valued properties of interest. In the case where the integer codes for class membership, the purpose of the model would be classification, rather than calibration. The problem of selecting the optimum classification model dimensionality is therefore, in principle, the same as selecting the optimum calibration model dimensionality. Thus, calibration solutions are equally valid for classification, whether the predictor data are multivariate or multiway.
  • The predictors, arranged in the rows of X, are usually continuous variables. However, the same set-up of X and Y data can be used to predict from integer-valued predictors. In the case where the integer codes for levels of an experimental design, the purpose of the model would be analysis of effects, rather than calibration, see H. Martens and M. Martens, Multivariate Analysis of Quality. An Introduction, Wiley, Chichester, 2001. The problem of selecting the optimum analysis of effects model dimensionality is therefore, in principle, the same as selecting the optimum calibration model dimensionality. Thus, calibration solutions are equally valid for analysis of effects.
FIG. 1 represents a schematic description of the various steps leading to a multivariate calibration model of the form under consideration, according to the currently most common practice.
In step 100 samples are taken from a substance or matter and subjected to a multivariate data source (step 110, measurement of K predictor values), for instance a spectrometer, a chromatograph, or an electrochemical instrument, i.e., an instrument that provides multiple dimensional data, vectors, as a result. The predictor data are arranged in, for example, the rows of a matrix X with N × K elements (step 120). Likewise, concentration or property measurements (step 130) from the sample substance or matter yield the predictand matrix or vector Y of size N × M (step 140). Next, data pre-treatments are tried in step 150, which are validated either internally or externally in step 160. The trial models are ranked in step 170 according to the validation results. Finally, the results of the best ranking trial model(s) are reported in step 180, e.g. on displaying means. It is seen that constructing a multivariate calibration model is essentially a trial and error process, because the best setting is usually not known in advance.
The various problems inherent to validation-based selection of model dimensionality lead to the following general features of the method according the present invention:
  • The principal reason for validating a calibration model is to assess its predictive ability when challenged with future unknown objects of the same type. The selection of model dimensionality is an entirely different task. Consequently, these tasks can be considered as separate steps of the calibration procedure.
  • Visual inspection of a plot, as in validation-based selection of model dimensionality, is, in principle, not objective. Since it is desirable that the selected model dimensionality be independent of the practitioner, visual inspection of plots should therefore, in principle, not play a decisive role in the model dimensionality selection process.
  • The main problem is to avoid over-fitting. In other words, it is desirable to stop adding terms to the model when they represent noise. Since terms are added sequentially, the actual risk of over-fitting increases monotonically with the number of terms. To decide about adding a term, the associated cumulative risk must be made precise in terms of a probability, which is a well-defined topic in statistics. Thus, to adequately control the actual risk of over-fitting, one should estimate it using a statistical procedure. For each term it must be determined whether the risk of over-fitting is acceptable, or not. If the risk is acceptable, the term under scrutiny passes the statistical test; otherwise one stops adding terms. The risk thus estimated should incorporate the risk of over-fitting for previous terms. This leads to a conditional model dimensionality test, henceforth abbreviated as COMODITE, where the null-hypothesis is formulated as "the term under test is not significant, given that previous terms have passed the test". This procedure is entirely objective in the sense that each practitioner would arrive at the same model dimensionality, once the overall acceptable risk is pre-defined. It truly yields an unambiguous warning that the data are being over-fitted, and this warning is easily automated, if desired.
The particular form of the multivariate calibration model under consideration leads to the following specific features of the method according the present invention:
  • The practitioner must decide about the ordering of the terms. Clearly, the quality of the A-dimensional approximation depends on the initial ordering of the t_a's, p_a's, and q_a's among the entire set of candidates. For what follows, it may therefore be expedient to divide potential estimation procedures into two classes, namely a class of procedures for which a suitable ordering is implied on one hand, and a class of procedures for which an explicit ordering step may be beneficial on the other hand. An example of an estimation procedure for which a suitable ordering is implied is partial least squares regression (PLSR). An example of an estimation procedure for which an explicit ordering step may be beneficial is principal component regression (PCR). For PCR, ordering criteria like the correlation (J. Sun, A correlation principal component regression analysis of NIR data, Journal of Chemometrics, 9 (1995) 21-29 and A.M.C. Davies, The best way of doing principal component regression, Spectroscopy Europe, 7 (1995) 36-38), the prediction coefficient (J.M. Sutter, J.H. Kalivas and P.M. Lang, Which principal components to utilize for principal component regression, Journal of Chemometrics, 6 (1992) 217-225), and correlation relative standard deviation (S.Z. Fairchild and J.H. Kalivas, PCR eigenvector selection based on correlation relative standard deviations, Journal of Chemometrics, 15 (2001) 615-625) have been used. An explicit ordering has been reported to give results that are equal or better than with the original top-down ordering (J. Verdú-Andrés and D.L. Massart, Comparison of prediction and correlation based methods to select the best subset of principal components for principal component regression and detect outlying objects, Applied Spectroscopy, 52 (1998) 1425-1434). It is noted that, even when a suitable ordering is implied by the estimation procedure, an alternative ordering may be beneficial.
For example, improved PLSR results have been reported after explicit ordering when suitable measurement variables are lacking (N.J. Messick, J.H. Kalivas and P.M. Lang, Microchemical Journal, 55 (1997) 200-207). In conclusion, an explicit ordering step cannot be ruled out in general.
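The correlation-based ordering of PCR terms discussed above can be sketched as follows. This is an illustrative outline under the correlation criterion cited in the text; the data and variable names are invented, and the example is rigged so that a low-variance component carries the predictive structure, making the correlation-based ordering differ from the default top-down (variance-based) ordering.

```python
import numpy as np

def order_components_by_correlation(X, y):
    """Order principal components of X by |correlation| of their scores with y.

    Sketch of the correlation criterion for PCR term ordering.
    Returns the component order (most y-relevant first) and the correlations.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                      # principal-component score vectors t_a
    r = np.array([np.corrcoef(scores[:, a], yc)[0, 1]
                  for a in range(scores.shape[1])])
    return np.argsort(-np.abs(r)), r    # sort by descending |correlation|

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
# Make y depend on the smallest-variance component (index 4), so the
# top-down ordering would rank the relevant term last.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
y = (U * s)[:, -1] + 0.05 * rng.normal(size=30)
order, r = order_components_by_correlation(X, y)
```

Here `order[0]` identifies the last (smallest-variance) principal component as the most relevant term, which a variance-based top-down ordering would have placed at the end.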
  • The practitioner must select a criterion for ordering the terms, when ordering is opted for. Clearly, the criterion must reflect the relevance of a term for the description of the property of interest (Y). The present version of the COMODITE method uses the correlation between the score vectors and the property of interest, but any criterion based on model fit, internal or external validation etc. may be equally suitable. Consequently, the ordering is considered to be suitable for PLSR by construction, but an explicit ordering step is required for, for example, PCR.
  • The practitioner must set the overall acceptable risk of over-fitting, α. Depending on the type of application at hand, one would pre-define the overall acceptable risk more or less stringently by requiring a certainty of 95% or 99% (say) that one does not over-fit the data. For example, it is conceivable that in forensic applications one would require a certainty close to 100%, i.e., set α close to zero.
  • The practitioner must set the minimum model dimensionality, A_min. Usually, all terms must be scrutinized, i.e., A_min = 1. However, prior knowledge may be available about the minimum number of terms in the model. If this prior knowledge is absolutely certain, e.g. from experimental design considerations, then one may start testing for A_min > 1, without making the procedure overly optimistic. Conversely, starting at A_min = 1 could lead to an unnecessarily conservative result in this situation.
  • The practitioner must set the maximum model dimensionality, A_max. In principle, one may stop adding terms once the overall risk exceeds the critical value α.
However, for presentation purposes (e.g. plots) it may be appropriate to scrutinize the higher-numbered (by necessity) non-significant terms as well.
  • The practitioner must select a statistical procedure for quantifying the risk of over-fitting. The present version of the COMODITE method includes a randomization test for determining the risk, but any statistical procedure that does not make overly stringent assumptions about the data is suitable. The practitioner must select the number of randomizations, R. This number depends on the critical value α. For example, R = 100 permits α-values down to α = 0.01. The particular choice of a randomization test implies that the model term under test must be constructed from the data from which the lower-numbered terms are eliminated, i.e., residual data sets. The reason for this is that the previously tested terms would give a spurious contribution to the test statistic under the null-hypothesis. Alternatives for the randomization test may require similar actions.
• The practitioner must select the test statistic, T. The present version of the COMODITE method uses the correlation between the score vectors and the property of interest, but any criterion based on model fit, internal or external validation etc. may be equally suitable.
• The practitioner must determine how to update the risk of over-fitting. The cumulative risk is calculated from a product of estimated probabilities, which is exact for independent events and conservative otherwise.
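The cumulative-risk update in the last bullet can be illustrated numerically. The per-term risk fractions F_a below are invented for the example; the recursion C_a = 1 − (1 − C_{a−1}) × (1 − F_a), starting from C_0 = 0, is equivalent to C_a = 1 − Π(1 − F_i) over i ≤ a.

```python
# Illustrative per-term risk fractions from a randomization test
F = [0.01, 0.02, 0.30]

# Recursive cumulative-risk update, starting from C_0 = 0
C = 0.0
risks = []
for Fa in F:
    C = 1.0 - (1.0 - C) * (1.0 - Fa)
    risks.append(C)

# Equivalent closed form: C_a = 1 - prod(1 - F_i) for i <= a
prod = 1.0
for Fa in F:
    prod *= 1.0 - Fa
assert abs(risks[-1] - (1.0 - prod)) < 1e-12
```

With an overall acceptable risk of α = 0.05, the first two terms would be accepted (cumulative risk 0.01, then 0.0298) and the procedure would stop at the third (cumulative risk 0.32086), illustrating how the conservative product rule accumulates the per-term risks.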
Below a step-by-step explanation of the present version of the COMODITE method is given.
1. Initialize, i.e., (1) set the model dimensionality under scrutiny, a, to zero, (2) optionally sort the terms according to a suitable criterion, (3) remove the terms up to A_min − 1 from the data, yielding residual data sets X_0 and Y_0, and (4) set the cumulative risk, C_a = C_0, to zero.
2. Increase the model dimensionality under scrutiny, a, by one. Compute the residual data sets as X_a = X_{a−1} − t_a p_a^T and Y_a = Y_{a−1} − t_a q_a^T.
3. Compute the test statistic T_a from X_a and Y_a.
4. Generate R randomizations (i.e., permutations) of the rows of Y_a, resulting in Y_a^r (r = 1, …, R). Compute for these constellations of data, X_a and Y_a^r (r = 1, …, R), the test statistic, T_a^r (r = 1, …, R).
5. Calculate the update of the cumulative risk as the fraction, Fa , of Ta r ( r = 1, • • • , R ) that exceeds the value under test, Ta . If Fa is zero, set it to 1 / R .
6. Compute the cumulative risk as C0 = 1 - (1 — C0-1 )x (l — Fa) . 7. When Ca exceeds the pre-defined risk α, and/or a = A1n^ stop; otherwise return to step 2.
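Steps 1-7 can be rendered as a short program. The sketch below assumes a single property (M = 1), a NIPALS-style PLS term construction, and the correlation test statistic; the function names and the mean-centering used as initialization are illustrative choices, not the patented implementation. The ordering is interpreted as: build term a from the residuals X_{a-1}, Y_{a-1}, test it, then deflate.

```python
import numpy as np

def pls_term(X, y):
    """Construct one PLS1 term (score t, loadings p, q) from residual data."""
    w = X.T @ y
    w = w / np.linalg.norm(w)          # weight vector
    t = X @ w                          # score vector t_a
    p = X.T @ t / (t @ t)              # X-loading p_a
    q = (y @ t) / (t @ t)              # y-loading q_a
    return t, p, q

def comodite(X, y, alpha=0.05, R=100, A_max=10, seed=0):
    """Return the selected model dimensionality following steps 1-7 (a sketch)."""
    rng = np.random.default_rng(seed)
    Xa = X - X.mean(axis=0)            # residual data sets X_0, Y_0
    ya = y - y.mean()                  # (mean-centering as initialization)
    C = 0.0                            # cumulative risk C_0
    for a in range(1, A_max + 1):      # step 2: increase a by one
        t, p, q = pls_term(Xa, ya)
        T = abs(np.corrcoef(t, ya)[0, 1])           # step 3: test statistic T_a
        exceed = 0
        for _ in range(R):                          # step 4: R randomizations
            yr = rng.permutation(ya)                # scramble rows of Y_a
            tr, _, _ = pls_term(Xa, yr)
            if abs(np.corrcoef(tr, yr)[0, 1]) > T:
                exceed += 1
        F = max(exceed, 1) / R                      # step 5: fraction, floored at 1/R
        C = 1.0 - (1.0 - C) * (1.0 - F)             # step 6: cumulative risk C_a
        if C > alpha:                               # step 7: stop criterion
            return a - 1                            # last term accepted within risk α
        Xa = Xa - np.outer(t, p)                    # deflate: X_a = X_{a-1} - t_a p_a^T
        ya = ya - t * q                             # deflate: Y_a = Y_{a-1} - t_a q_a^T
    return A_max
```

Applied to an N × K predictor matrix X and property vector y, the function returns the largest dimensionality whose cumulative risk of over-fitting stays below α.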
As an illustration, consider the PLSR modeling of the octane rating of gasoline (data set I) and the hydrogen content of gas oil (data set II), both in terms of their NIR absorbance spectra. For both data sets, M = 1. For data set I, N = 26 and K = 149. For further details, see K.H. Esbensen, S. Schonkopf and T. Midtgaard, CAMO, Trondheim, 1994. For data set II, N = 84 and K = 149. For further details, see J.A. Fernandez Pierna, L. Jin, F. Wahl, N.M. Faber and D.L. Massart, Estimation of partial least squares regression (PLSR) prediction uncertainty when the reference values carry a sizeable measurement error, Chemometrics and Intelligent Laboratory Systems, 65 (2003) 281-291.
It can be inferred from FIG. 2 that the multivariate PLSR calibration models must relate tiny substructures in the spectra to variations in the properties of interest. Usually, the optimum model dimensionality depends on the property of interest when calibrating NIR spectra. In other words, limited prior knowledge is available, which makes the selection of model dimensionality a critical step.
The results of cross-validation are shown in FIG. 3. For data set I, the calibration objects are randomly divided into five segments, while seven segments are used for data set II. A global optimum is observed in FIG. 3a, but the RMSEP curve 'levels off' after adding the second term. Similarly, a local optimum is observed in FIG. 3b, but the RMSEP curve 'levels off' after adding the second term. For both data sets, the decrease of RMSEP is 19% when adding the fifth term. It is problematic to give a clear judgement regarding the optimum value of A on the basis of these plots.
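For comparison, the segmented cross-validation underlying FIG. 3 can be sketched generically. The helper below is an illustrative stand-in (the `fit_predict` callback and the ordinary-least-squares demo model are assumptions, not the PLSR models used for the figures):

```python
import numpy as np

def rmsep_cv(X, y, fit_predict, n_segments=5, seed=0):
    """RMSEP by segmented (k-fold) cross-validation with randomly drawn segments."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    sq_err = []
    for seg in np.array_split(idx, n_segments):
        train = np.setdiff1d(idx, seg)                 # all objects outside the segment
        y_hat = fit_predict(X[train], y[train], X[seg])  # predict the held-out segment
        sq_err.extend((y[seg] - y_hat) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

# Stand-in model: ordinary least squares via the pseudoinverse.
def ols(X_tr, y_tr, X_te):
    b = np.linalg.pinv(X_tr) @ y_tr
    return X_te @ b
```

Evaluating `rmsep_cv` for models with 1, 2, ..., A_max terms and plotting the results yields RMSEP curves of the kind shown in FIG. 3, including the ambiguous 'levelling off' that motivates the present method.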
The results of the method according to the present invention are shown in FIG. 4. Compared are the test statistic obtained for the actual data set (vertical dashed line) and the frequency histogram for the test statistic after randomizing the rows of Y. The number of randomizations is set to the rather high value of 1000 to obtain representative results. If a term contains real structure, then the test statistic for the actual data set should stand out. By contrast, if a term does not contain real structure, then it does not matter whether one scrambles the rows of Y_a (relative to the rows in X_a), since there is no relation between the rows of the residual matrices X_a and Y_a anyway. For data set I, only the first two terms clearly stand out from the distribution under the null-hypothesis, see FIG. 4a. They are significant down to a level far below the rather stringent value α = 0.01. By contrast, the higher-numbered ones are not significant by the test for liberal values as high as α = 0.10, because the test statistic under scrutiny is exceeded for the third term in 25% of the randomizations. For data set II, the test statistic under scrutiny is exceeded for the third term in 4% of the randomizations, whereas it clearly stands out for the next two terms, see FIG. 4b. Five terms are significant at the common level α = 0.05, while two terms are significant at the rather stringent value α = 0.01.
The results displayed in the individual subplots in FIG. 4 are summarized in FIG. 5. These plots further illustrate that the method according to the present invention is able to highlight subtle aspects of the model dimensionality selection process. By contrast, cross-validation gives no indication that these data sets could require such vastly different dimensionalities.
A schematic description of the method according to the present invention is now provided with reference to FIG. 6. Selection of model dimensionality in step 600 precedes validation step 610 of the trial model. In principle, one only needs to validate the trial data pre-treatment for the optimum selected model dimensionality, which is computationally efficient. The rather conspicuous difference with FIG. 1 is that the essentially different tasks of selection of model dimensionality and model validation are disentangled. The method according to the present invention may be implemented in a hardware environment, e.g. in the exemplary embodiment of a calibration model providing system 10 as shown in FIG. 7. The system 10 comprises a sample unit 11 which is arranged to obtain measurement data from a group of N samples. The sample unit 11 is connected to a predictor measurement unit 13, e.g. a spectrometer, which measures the group of N samples to obtain the N x K predictor matrix X. The sample unit 11 is also connected to a predictand unit 12, which is arranged to provide the property values related to each sample, in order to obtain the N x M predictand matrix Y. The predictand unit 12 may comprise an actual measurement apparatus, or it may comprise an input unit for entering a known property value (or values) for each of the N samples.
Furthermore, the system 10 comprises a data processing unit 14 connected to a memory unit 15. The memory unit 15 may comprise suitable program code to control the data processing unit 14 to function according to the present method, and may comprise a semiconductor memory device, a magnetic memory device (such as a hard disk), an optical memory device, etc. The data processing unit 14 may comprise a processor or multiple processors. The memory unit 15 is also suitable for storing intermediate calculation results and for storing the eventually obtained calibration model. The data processing unit 14 and memory unit 15 may be formed by a general purpose personal computer, arranged to interface with the predictand unit 12 and predictor measurement unit 13. As will be apparent to the skilled person, input/output devices will be part of the data processing unit 14 for controlling the system 10 by an operator.
The data processing unit 14 is furthermore connected to a reporting unit 16 for outputting data related to the present invention; the reporting unit 16 may be formed by a display, a printer, or further storage means.
Once a calibration model is obtained according to the present invention, the system 10 may be used to obtain the desired property values of further actual samples with unknown properties. In this case, the predictand unit 12 is not used, and the data processing unit 14 is only used to obtain the desired property value (values) of samples from the related measurement data using the calibration model. It is thus believed that the operation and construction of the present invention will be apparent from the foregoing description. While the method shown or described has been preferred, it will be obvious that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined in the attached claims.

Claims

1. Method for providing a calibration model comprising a quantitative relation between object measurement data and object property values, the method comprising: - acquiring measurement data of a set of N training objects, the measurement data comprising K predictor values associated with the object measurement data and M predictand values associated with the sample property values for each sample, N, K and M being integer values, resulting in an NxK predictor matrix X and an NxM predictand matrix Y; - obtaining calibration models interrelating the predictor matrix X and the predictand matrix Y, each calibration model having model terms, the number of which determines the model dimensionality; and
- validating the obtained calibration models, characterised in that obtaining the calibration model comprises: - initialising the calibration model using a predetermined minimum number of model terms;
- adding an additional model term to increase the dimensionality of the calibration model;
- calculating a risk of over-fitting by the model associated with the current dimensionality of the model; and
- repeating the adding of an additional model term and calculating of the risk of over- fitting up to a predetermined dimensionality of the model.
2. Method according to claim 1, in which calculating a risk comprises calculating a cumulative risk for the calibration model up to and including the current dimensionality of the calibration model, and in which the repeating of adding and calculating is executed until the cumulative risk of over-fitting of the current calibration model exceeds a predetermined risk threshold value.
3. Method according to claim 1 or 2, further comprising ordering the additional model terms of the calibration model.
4. Method according to any one of the preceding claims, in which the model dimensionality is limited to a maximum value.
5. Method according to any one of the preceding claims, in which a statistical procedure is selected for quantifying the risk of over-fitting.
6. Method according to claim 5, in which the statistical procedure comprises a randomization test.
7. Method according to claim 6, in which a randomization test parameter is set dependent on the predetermined risk threshold value.
8. Method according to any one of the preceding claims, in which validating the obtained calibration models comprises obtaining a test statistic for each of the obtained calibration models.
9. System for providing a calibration model comprising a quantitative relation between object measurement data and object property values, the system (10) comprising:
- a sample unit (11) for acquiring measurement data of a set of N training objects; - a predictor measurement unit (13) connected to the sample unit (11 ) for providing measurement data comprising K predictor values associated with the object measurement data;
- a predictand unit (12) connected to the sample unit (11) for providing M predictand values associated with the sample property values for each sample, N, K and M being integer values, resulting in an N x K predictor matrix X and an N x M predictand matrix Y; and
- a data processor unit (14) connected to the predictor measurement unit (13) and the predictand unit (12), the data processor unit (14) being arranged for:
- obtaining calibration models interrelating the predictor matrix X and the predictand matrix Y, each calibration model having model terms, the number of which determines the model dimensionality; and
- validating the obtained calibration models, characterised in that the data processor unit (14) is further arranged for obtaining the calibration model by:
- initialising the calibration model using a predetermined minimum number of model terms; - adding an additional model term to increase the dimensionality of the calibration model;
- calculating a risk of over-fitting by the model associated with the current dimensionality of the model; and
- repeating the adding of an additional model term and calculating of the risk of over- fitting up to a predetermined dimensionality of the model.
10. System according to claim 9, in which the data processing unit (14) is further arranged to execute the method according to any one of the claims 1-8.
11. Use of a calibration model obtained by the method according to any one of claims 1-8, comprising inputting measurement data of a sample to the model for obtaining at least one property value related to the sample.
PCT/NL2005/000124 2005-02-21 2005-02-21 Method and system for selection of calibration model dimensionality, and use of such a calibration model WO2006088350A1 (en)


Publications (1)

Publication Number Publication Date
WO2006088350A1 (en) 2006-08-24

Family

ID=34960583



Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711606A (en) * 2018-12-13 2019-05-03 平安医疗健康管理股份有限公司 A kind of data predication method and device based on model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0415401A2 (en) * 1989-09-01 1991-03-06 Edward W. Stark Improved multiplicative signal correction method and apparatus
US6075594A (en) * 1997-07-16 2000-06-13 Ncr Corporation System and method for spectroscopic product recognition and identification
US6480795B1 (en) * 1998-03-13 2002-11-12 Buchi Labortechnik Ag Automatic calibration method
US20030143520A1 (en) * 2002-01-31 2003-07-31 Hood Leroy E. Gene discovery for the system assignment of gene function
WO2004003969A2 (en) * 2002-06-28 2004-01-08 Tokyo Electron Limited Method and system for predicting process performance using material processing tool and sensor data




Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05710903

Country of ref document: EP

Kind code of ref document: A1