PeakLab v1 Documentation
Full-Permutation GLM Modeling
Direct Spectral Modeling
Full permutation GLM (General Linear Model) is a statistical regression procedure applicable to any multivariate fitting problem. In PeakLab, this procedure was specifically designed to optimally model spectroscopic data – to generate the most accurate predictions possible, especially when the relationship with reference data, often chromatographic values, was complex and far from straightforward.
Stepwise Regression
To understand PeakLab's direct spectral modeling, we will begin by explaining the issues encountered in multivariate stepwise forward and stepwise backward regressions.
In stepwise forward regression, there is a |t|, F, or probability value to enter, and a separate one to remove, a predictor. When the entry criterion is first met while cycling through the X-value predictors or estimators, the model will consist of one predictor. The procedure then continues across the predictors in whatever order the modeling algorithm processes them, adding additional predictors that meet the entry criterion and removing any predictors that meet the removal criterion. When the cycle is complete, the forward regression model will consist of those predictors which were entered and survived removal. The count of predictors is determined by the entry-removal progression, and the count and the specific predictors will usually differ if the progression through the predictors occurs in reverse order, or is executed in a randomly scrambled sequence. In a stepwise forward procedure, if the x-spacing is too small relative to the resolution of the spectral data, noise is easily fitted at adjacent x-predictors: one wavelength may have a large positive coefficient and the adjacent one a large negative coefficient. Knowing how to address such issues, a skilled statistician can use a stepwise procedure and produce a 'good', sometimes a 'very good', model.
In stepwise backward regression, a similar progression occurs, except that the model begins with every predictor fitted (assuming the degrees of freedom in the modeling problem allow such). The least significant parameter is then removed, and the fit is repeated until all predictors meet the significance threshold. It is rare that forward and backward stepwise regression, even with identical significance criteria, produce the same model. The same issues with fitting noise are present in backward regression, and may well have a greater presence since the backward models usually have a higher predictor count, and overall weaker significances, than the forward models.
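The entry/removal cycle just described can be sketched in a few lines. The following is a generic Python/numpy illustration using |t| thresholds and a single cycle in column order; it is not PeakLab's or any commercial package's implementation, and the thresholds and data are hypothetical:

```python
# Generic forward stepwise sketch: enter a predictor if its |t| exceeds
# t_enter, then sweep out any entered predictor whose |t| has fallen below
# t_remove. Illustrative only; real packages use F or probability criteria.
import numpy as np

def t_stats(X, y):
    """OLS fit with intercept; return coefficients and their |t| statistics."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    s2 = resid @ resid / (n - p - 1)
    cov = s2 * np.linalg.inv(A.T @ A)
    return beta, np.abs(beta) / np.sqrt(np.diag(cov))

def forward_stepwise(X, y, t_enter=4.0, t_remove=3.0):
    """One cycle through the candidate columns in index order (so the result
    is order-dependent, as noted in the text)."""
    selected = []
    for j in range(X.shape[1]):
        _, t = t_stats(X[:, selected + [j]], y)
        if t[-1] > t_enter:                       # t[0] is the constant
            selected = selected + [j]
            while len(selected) > 1:              # removal sweep
                _, t = t_stats(X[:, selected], y)
                worst = int(np.argmin(t[1:]))
                if t[1:][worst] < t_remove:
                    selected.pop(worst)
                else:
                    break
    return selected

# Hypothetical data: 12 candidate predictors, true signal at columns 3 and 7.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.3, size=60)
print(forward_stepwise(X, y))
```

Running the columns in a different order, or scrambling them, can select a different model from the same data, which is exactly the order-dependence described above.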
Partial Least Squares
PLS (partial least squares) is often used for this type of chemometric modeling. The PLS algorithm works by an iterative set of factor (synthetic variable) extractions that process linear correlations between the dependent variable and every WL (wavelength) specified in the modeling. We have seen references to PLS algorithms that suggest a proficiency at 'latent variable' extraction, although when additional variables influencing the modeling are scientifically known, it is unlikely any of the factors extracted will track them. In a practical sense, the factor extractions are very close to a black box insofar as there are many different implementations of PLS algorithms, and each will produce its own set of coefficients and its own prediction errors.
To evaluate a PLS model, it must be reconverted to the original space, and in that process there will be a coefficient for every predictor specified in the modeling problem and, in all likelihood, a constant. Even if the modeling is fairly simple, a PLS model will contain p+1 coefficients, where p is the number of predictors or wavelengths in the spectra. To decrease the count of coefficients in the PLS algorithm, the wavelength band can be reduced or the spacing between the wavelengths modeled can be increased.
With the mapping to the factor space intrinsic to the algorithm, the modeler can focus on the count of factors, a simple optimization where the factor count with the best prediction performance is selected. The modeler must also address the specific band and density of the predictors, a more complex matter since PLS will not generally distinguish primary responses from indirect ones. We have seen PLS models where the correlations favored those indirect responses, such as hydrophobicity and inverse relationships with non-target materials. A PLS algorithm can produce very good predictions, but the modeler needs to understand the spectroscopy and chemistry well enough to know where the absorbances directly reflect the target compound(s). A lower factor count may be needed to keep the indirection in the modeling to a minimum. The issue with having inverse and moisture effects in a predictive model is that these may not consistently map to the target variables across different materials, labs, or instruments. In effect they may not be stable. A predictive model may suddenly stop working or produce much higher errors, and there will be no indication of what has gone wrong.
PLS models are also inefficient with respect to remote analyses across web servers. Because the half-dozen or so derived factors will map to every predictor, a web server must process up to hundreds of spectral values, one for each coefficient in the model, all of which must be transmitted across the web and then used in a dot-product computation on the server.
Principal Component Regression
PCR (principal component regression) is sometimes used as an alternative to PLS modeling. Instead of extracting factors based on correlations, principal components are extracted in an eigendecomposition and used for the weights in the iterative construction of the model. Like PLS, the indirect variables must ultimately be converted to spectral parameters where each wavelength in the modeling band of predictors has its own coefficient. In our testing of PCR models against PLS models for chromatographic Y - spectroscopic X modeling, we have rarely encountered an instance where PCR outperformed PLS in prediction accuracy. We explain our perspective on this in some depth in the last sections of the Modeling FTNIR Spectra white paper.
PCA (principal component analysis), also known as SSA (singular spectrum analysis), decomposes data into orthogonal, eigenvector-based components ordered by variance. PCA is not generally able to separate the target content from non-target spectral content when those overlap in a given spectral region. It can separate harmonic oscillations, such as sinusoids, in a spectacular way, but not spectroscopic peaks. If you ponder what a separation based on power and variance implies, in two overlapping peaks the first principal component will capture the highest-power peak at each point. If a given principal component changes peaks as the power of the two overlapping peaks swaps prominence, there can be discontinuities in the first derivative of that reconstructed principal component. We also note that the best principal component algorithms use lagged matrices which generate edge effects, meaning that you will need to process those eigendecompositions using a wider spectral band than will be used in the modeling. PCAR (principal component autoregressive) algorithms are also more complex in that they involve the fitting of an AR model using a covariance matrix.
We have observed some benefit in creating predictive models with the reconstructed data from a specific principal component. This is offered in PeakLab and should not be confused with PCR.
We have also realized more stable predictive models from fitting the design matrix of direct spectral models with an upper limit on principal components. This is also offered in PeakLab and similarly should not be confused with PCR.
Full Permutation GLM Modeling
The PeakLab General Linear Modeling (GLM) algorithm should be seen as basic least-squares multivariate modeling with an intelligent search procedure which offers a fast, full-permutation determination of the very best direct spectral predictive models.
Full-permutation GLM fitting is now possible with parallel multicore processing and the use of exceptionally fast fitting technologies originally authored when computers were far slower. In those instances where up to 8 predictors or wavelengths will fully address the direct spectral modeling problem, where the predictors cover a reasonably compact band of wavelengths, and where the resolution is such that the x-spacing or wavelength density is not too great, every possible permutation through 8 predictors can be realized within a reasonable fitting time.
For example, in the Modeling NIR Field-Site Data white paper, we illustrate direct spectral fitting using a 5 nm spacing in a 200 nm band from 1650-1850 nm. With every possible permutation of 1-8 predictors fitted, the total fitting time for a basic 4-core machine is a little over 1 minute for 123 million models fitted. This means you can readily find the very best predictive model for the predictors and prediction metric specified.
When the resolution is exceptionally high, as illustrated in the Modeling FTNIR Spectra white paper, or where a wider band of wavelengths is fitted, as in the moisture model in the Modeling NIR Field-Site Data white paper, PeakLab's X Predictor Filters are wisely used. A filter works to remove predictors unlikely to appear in the best-performing retained models. If a given wavelength could not make an appearance in the best 5-predictor models, for example, that predictor may be removed from all 6-predictor and higher fits. In the Modeling FTNIR Spectra white paper, where the 1650-1850 nm band was modeled at a 2 nm sampling interval, the estimated fitting time decreased from 1.5 days and 36 billion models to 29 seconds and 14.5 million models as a consequence of utilizing these intelligent filters.
Direct spectral modeling is the tried and true GLM, the multivariate modeling you will find in every statistical product in existence. PeakLab's contribution is the full permutation algorithm and the intelligent filtering of non-performing wavelengths. PeakLab's multivariate models can be reproduced to full precision in any professional regression software commercially available.
Because these truly are GLM conventional fits, they will have a GLM[] designation in the PeakLab Model List.
Intelligent Stepwise Fits
In our experience, a full-permutation direct spectral fit is more efficient and will require a couple of predictors fewer than the factor count for the optimum PLS prediction model. This means you should be able to achieve what is realized with a PLS model of about ten factors without going beyond eight predictors in the direct spectral fits. For instances where highly complex models do require more than eight predictors, and to address instances where filters removed a given predictor that should have been kept, we have the Step[] models in the PeakLab Model List.
The intelligent stepwise fits do not use the full permutation filters, so even if a WL is omitted from the permutation matrix by a filter, it may still appear within one of PeakLab's stepwise models.
These smart stepwise models are not the simple forward stepwise models you may have fitted in statistics software. Each of these models contains a multidimensional search that seeks to find the best models from a given full permutation model starting point. In this different form of algorithm, a WL predictor that is removed from a model can be added back further along in the process. In general, you will see some benefit in these stepwise models because they start with very effective full permutation models. These allow models of up to 15 predictors to be fitted, although the default maximum count of X-predictors is set to 12. You should increase this default only if you find that the Performance Analysis metrics are still improving all the way through 12 predictors.
A designation such as Step[8,4] means that there is an 8-predictor model that was built on a retained 4-predictor model as its starting point. The addition of further predictors in this algorithm is done in a multidimensional search that is run within a 5 x 5 matrix. This means there will be 25 passes of the stepwise regression's addition and removal of predictors. This procedure is effective. It cannot assure that you will find the very best 9-predictor model that exists, for example, but starting with what can be the very best of the 8-predictor models, there is a fair chance of finding such an optimal model of a higher predictor count. These intelligent stepwise models are probably most useful for 9-predictor and 10-predictor models, and for more fully covering the lesser predictor counts when filtering is implemented to hasten the fitting time.
Since this algorithm uses a proprietary search procedure, it is almost certain to never match the forward stepwise fits in commercial software. You should find the fits appreciably better, the significance a fair measure stronger, and the overfitting at adjacent wavelengths should generally disappear in the many iterations of the search procedure.
You should also see this procedure as an intelligent search variation for a conventional GLM multivariate model. Using the predictors identified by this type of algorithm, you can replicate the Step[] fits in any professional statistical or regression software package.
Sparse PLS Models
PeakLab offers 'PLS-like' models where the most significant direct spectral models are integrated into a weighted average prediction that uses a higher count of predictors.
One of the distinguishing properties of the indirection of PLS fitting, the fitting of derived factors instead of the spectra directly, is that it effectively includes in the model every WL in the prediction data, estimating a coefficient for its contribution to the overall predicted value. While direct spectral modeling precludes the generation of a coefficient for every WL in the predictors, sparse PLS[] models offer a higher count of component WLs.
You can think of sparse-PLS models as 'PLS-like' or 'PCR-like' insofar as they will offer a broader collection of WLs in the modeling band. Those will be sparse, however, in contrast with the PLS algorithm, which includes every predictor in the modeling. The sparse PLS models in PeakLab can consist of up to 15 predictors.
A sparse PLS[nsig,navg,npred] model is one where the nsig most significant predictors produce an optimal WL set, and it is populated with navg models, each consisting of npred predictors. A designation such as PLS[15,35,8] means that the model consists of 15 predictors (15 wavelengths, 16 estimated coefficients). These will be the 15 most significant predictors (wavelengths) in the retained 8-predictor models. This overall model will have been built by the statistical goodness of fit weighting of the 35 direct spectral fits of 8-predictor models where all of the wavelengths in each of these models appear within these 15 significant wavelengths. You can readily equate the npred predictor count with a factor count.
We acknowledge that the additional predictors may offer a more stable prediction, as we demonstrate in the example which follows. We also note the obvious tradeoff: more WLs soften an anomaly at any given predictor, but a greater count of predictors offers more places where such anomalies might appear. The assessment of the value of a sparse PLS model could perhaps be based on whether or not trace contaminants, as well as trace amounts of other species that may have been weakly represented or absent in the model data, might be likely to occur in subsequent blind predictions. In our experience, sparse-PLS models consistently offer slightly better prediction performance relative to the optimum PLS or PCR models, and they will do so with far fewer predictors, and with each predictor assured to be significant in a direct spectral model.
Because sparse PLS models only combine existing GLM fitted models, including these models in PeakLab's fitting adds virtually no additional fitting time to the modeling process.
A sparse PLS model can be replicated with any professional statistical software package, although the effort required may be considerable. You would need to assemble all of the retained models of a given predictor count where each of the wavelengths appear in a larger significant wavelength set, as shown in the Significance plot. Once you have determined the models which will comprise the overall model, you would then need to fit each of these individual multivariate models separately and compute a ppm statistical error of prediction for each. You would then create a composite or blended model by weighting the coefficients of these models by the inverse of this metric, adjusted to sum to 1.0.
An Intelligent Stepwise and Sparse PLS Fit Example
This intelligent stepwise procedure is effective, although at higher predictor counts it will seldom isolate the very best model in a full permutation. At eight predictors, in the example that follows, with just 51 wavelengths, a full-permutation 1-8 predictor fit, no filters, will implement 655 million fits, requiring about 3 minutes on a basic 4-core desktop machine. The odds of this intelligent stepwise procedure finding that one best-performing prediction amongst two-thirds of a billion possibilities are not high.
As an illustration of the intelligent stepwise and sparse PLS models, we will fit the moisture data in the Modeling NIR Field-Site Data white paper with all possible permutations through 8-predictors. Since there is no way for PLS or PCR models to remove non-relevant or secondary predictors short of not including them in the first place, we will limit the wavelengths in this modeling data to only those with direct water band information. In this case, that is 1550-1645 and 1805-1950 nm at a 5 nm spacing.
We will first see how many of those 100 retained (the default) top performing 8-predictor GLM models were replicated by one of these smart stepwise models that began with an optimum GLM model of 7 or fewer predictors. Given the 537 million 8-predictor models in the full permutation, we might reasonably expect to see none at all.
Indx nX Predr2 PredAvErr Model
21 8 0.9750529 0.0083683 GLM[8](1590,1610,1645,1850,1875,1910,1925,1950)
22 8 0.9750529 0.0083683 Step[8,6](1590,1610,1645,1850,1875,1910,1925,1950)
34 8 0.9749745 0.0084945 GLM[8](1550,1590,1610,1850,1875,1910,1925,1935)
35 8 0.9749745 0.0084945 Step[8,7](1550,1590,1610,1850,1875,1910,1925,1935)
43 8 0.9749464 0.0085145 GLM[8](1585,1610,1645,1850,1875,1910,1925,1935)
44 8 0.9749464 0.0085145 Step[8,6](1585,1610,1645,1850,1875,1910,1925,1935)
57 8 0.9748898 0.0083370 GLM[8](1590,1610,1850,1875,1905,1925,1930,1935)
58 8 0.9748898 0.0083370 Step[8,7](1590,1610,1850,1875,1905,1925,1930,1935)
94 8 0.9747463 0.0084381 Step[8,7](1585,1610,1645,1850,1880,1890,1925,1935)
95 8 0.9747463 0.0084381 GLM[8](1585,1610,1645,1850,1880,1890,1925,1935)
The smart stepwise procedure managed to find 5 of the best 100 8-predictor fits from the 537 million fitted in the full permutation. There were also 23 smart stepwise models that proceeded past eight predictors to 9 or 10 predictors, four of which outperformed the best 8-predictor model by r² of prediction:
Indx nX Predr2 PredAvErr Model
1 9 0.9756070 0.0082119 Step[9,8](1550,1590,1610,1850,1875,1910,1925,1930,1935)
5 9 0.9754578 0.0082796 Step[9,8](1550,1555,1590,1610,1850,1875,1910,1925,1950)
6 9 0.9754382 0.0084403 Step[9,8](1565,1580,1590,1610,1850,1875,1910,1925,1935)
13 9 0.9751332 0.0082805 Step[9,8](1590,1610,1625,1645,1850,1875,1905,1925,1950)
If we carefully fit this data in a statistical package for forward stepwise regression (Systat), using equal .01 probabilities for both entry and removal, we see a 6-predictor model. The r² of leave-one-out prediction is 0.9678. If we use .05 probabilities, we get a 9-predictor model with an r² of leave-one-out prediction of 0.9728. A skilled statistician can get a very good predictive model with the forward stepwise regression procedure.
If we fit this data to PLS models (also Systat, NIPALS), we see the optimum prediction performance at 10 factors, an r² leave-one-out prediction of 0.9741. For model evaluation, the 10 factors convert to a coefficient at each wavelength used in the modeling plus a constant, 52 parameters in all, each of the 51 wavelengths having some contribution, positive or negative, to the moisture prediction. By contrast, the 8-predictor direct spectral model has just nine parameters at the eight wavelengths shown in the model description.
If we specify the sparse PLS models during the direct spectral fitting, we see the following sparse PLS models in the Model List:
Indx nX Predr2 PredAvErr Model
3 15 0.9755030 0.0082837 PLS[15,31,7](1565,1585,1590,1595,1610,1850,1875,1880,1905,1910,1925,1930,1935,1945,1950)
4 14 0.9754684 0.0082818 PLS[14,18,7](1565,1585,1590,1595,1610,1850,1875,1880,1910,1925,1930,1935,1945,1950)
8 13 0.9752716 0.0082996 PLS[13,16,7](1585,1590,1595,1610,1850,1875,1880,1910,1925,1930,1935,1945,1950)
14 10 0.9751306 0.0083423 PLS[10,6,7](1585,1590,1610,1850,1875,1880,1910,1925,1935,1950)
16 11 0.9751034 0.0083624 PLS[11,9,7](1585,1590,1610,1850,1875,1880,1910,1925,1935,1945,1950)
20 12 0.9750584 0.0083560 PLS[12,13,7](1585,1590,1595,1610,1850,1875,1880,1910,1925,1935,1945,1950)
All are weighted averages of the 7-predictor models. The strongest predictive model is the one that uses the 15 most significant wavelengths of those retained 7-predictor models to construct a goodness-of-fit scaled average of 31 different 7-predictor retained models. This model is PLS-like in that it consists of a larger, though sparse, set of wavelengths, 15 in all, instead of all 51 which occur in the PLS model.