Multiple regression example from Molecular Modeling Pro Plus

Output from Molecular Modeling Pro Plus

Example of part of the printout from a multiple regression analysis. The model investigated here is flash point = intercept + b*(1/enthalpy of vaporization). A brute force regression search found the inverse transformation of enthalpy of vaporization to be the best one variable model for flash point in the data set investigated. The analysis below is a follow-up (explained in more detail in the tutorial in the help file).

Analysis of variance

------------------------------------------------------------------------------
Variation source      df    SS              MS              Statistics
------------------------------------------------------------------------------
Total (uncorrected)   360   6818859.7389                    F=1066.68276
Mean                  1     3523457.24432                   rsquare=0.74872
Total (corrected)     359   3295402.49458                   s=48.09447
Regression            1     2467320.53578   2467320.53578
Residual              358   828081.9588     2313.0781
------------------------------------------------------------------------------

Note: probability of significant F =<0.0001

Model coefficients and standard errors:

parameter        coefficient   standard error   t         prob
------------------------------------------------------------------
intercept =      259.139       5.52153          46.9325   <<0.00001
1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
                 -7722.05      236.437          32.6601   <<0.00001
------------------------------------------------------------------

note: response variable: Flash_Point__C_

Printout of response values, predicted values and residuals:

                      observed   predicted   residual
acetal                -21        27.8085     -48.8085
acetaldehyde          -40        -29.363     -10.637
acetic acid           40         51.613      -11.613
acetic anhydride      54         62.7094     -8.7094
acetol                56         90.1554     -34.1554
acetone               -17        -6.97324    -10.0268
acetone cyanohydrin   63         105.799     -42.7992
...(and so on)...

Figure 26. Analysis of variance table, model coefficients and partial print-out of the table of response, predicted and residual values for the flash point one variable model.

Analysis of variance table:

Abbreviations used: df = degrees of freedom; SS = sum of squares; MS = mean square (SS divided by df); F = Fisher's F test; r squared = proportion of variance accounted for by the model (in this example about 75%); s = model standard deviation (about 95% of the data lie within 2 standard deviations of the predicted value, thus within plus or minus about 96 degrees C).
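The statistics in the analysis of variance table can be re-derived from its SS and df columns. The short Python sketch below does this arithmetic with the numbers printed in Figure 26; the variable names are illustrative and are not part of Molecular Modeling Pro Plus.

```python
import math

# Values copied from the analysis of variance table (Figure 26).
ss_regression, df_regression = 2467320.53578, 1
ss_residual, df_residual = 828081.9588, 358
ss_total_corrected = 3295402.49458

# MS = SS / df for each source of variation.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F compares the regression mean square with the residual mean square.
f_statistic = ms_regression / ms_residual

# r squared is the share of the corrected total SS explained by the model.
r_squared = ss_regression / ss_total_corrected

# s is the square root of the residual mean square.
s = math.sqrt(ms_residual)

print(f"MS residual = {ms_residual:.4f}")
print(f"F           = {f_statistic:.3f}")
print(f"r squared   = {r_squared:.5f}")
print(f"s           = {s:.5f}")
print(f"2 * s       = {2 * s:.1f} degrees C")
```

The results should agree with the printed F, rsquare and s to within rounding, and 2 * s gives the "plus or minus about 96 degrees C" band quoted above.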

Coefficients table:

The model is: flash point (C) = 259.139 - 7722.05*(1/enthalpy of vaporization), with the enthalpy of vaporization at the boiling point in kJ/mole.

Both the intercept and the regression coefficient are highly statistically significant (prob <<0.00001).
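As a minimal sketch, the fitted model can be written as a small Python function. The function names and the 30 kJ/mole input are hypothetical and chosen only for illustration; the coefficients and standard errors are those printed in the coefficients table. The sketch also assumes that the t column is the absolute value of coefficient divided by standard error, which is consistent with the printed values.

```python
# Coefficients and standard errors from the coefficients table (Figure 26).
INTERCEPT = 259.139
SLOPE = -7722.05
INTERCEPT_SE = 5.52153
SLOPE_SE = 236.437


def predict_flash_point(enthalpy_kj_per_mole: float) -> float:
    """Predicted flash point (degrees C) from enthalpy of vaporization."""
    if enthalpy_kj_per_mole <= 0:
        raise ValueError("enthalpy of vaporization must be positive")
    return INTERCEPT + SLOPE * (1.0 / enthalpy_kj_per_mole)


def t_ratio(coefficient: float, standard_error: float) -> float:
    """Absolute t ratio (assumed definition: |coefficient| / standard error)."""
    return abs(coefficient) / standard_error


# Hypothetical input, not taken from the data set.
print(predict_flash_point(30.0))
print(t_ratio(INTERCEPT, INTERCEPT_SE))
print(t_ratio(SLOPE, SLOPE_SE))
```

Because the model uses the inverse of the enthalpy, the predicted flash point rises steeply at low enthalpies and levels off toward the intercept at high ones.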

Printout of predicted and residual values:

From this table, find the largest outliers (the compounds with the largest residuals) and determine whether they have something obviously in common that would lead to a better model. For instance, if all solvents are well accounted for but surfactants are poorly predicted, consider developing separate models for solvents and surfactants.
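One way to carry out this search outside the program is to sort the table by the absolute size of the residual. The Python sketch below uses only the seven rows shown in Figure 26; the full table would be handled the same way. The list and function names are illustrative.

```python
# (compound, observed, predicted) rows copied from the partial printout.
rows = [
    ("acetal", -21, 27.8085),
    ("acetaldehyde", -40, -29.363),
    ("acetic acid", 40, 51.613),
    ("acetic anhydride", 54, 62.7094),
    ("acetol", 56, 90.1554),
    ("acetone", -17, -6.97324),
    ("acetone cyanohydrin", 63, 105.799),
]


def largest_outliers(table, count=3):
    """Return the rows with the largest absolute residuals, largest first."""
    # residual = observed - predicted, matching the sign used in the printout
    with_residuals = [(name, obs - pred) for name, obs, pred in table]
    with_residuals.sort(key=lambda item: abs(item[1]), reverse=True)
    return with_residuals[:count]


for name, residual in largest_outliers(rows):
    print(f"{name:22s} {residual:10.4f}")
```

Among the rows shown, acetal, acetone cyanohydrin and acetol have the largest residuals, so they would be the first candidates to inspect for a common chemical feature.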

Contributions to PRESS (Predictive Residual Sum of Squares):

Compound                    Predictive discrepancy
--------------------------  ----------------------
acetal                      2405.442
acetaldehyde                115.3149
acetic acid                 135.8609
acetic anhydride            76.35809
acetol                      1173.177
acetone                     102.0248
acetone cyanohydrin         1842.063
acetonitrile                1.354055
...(and so on)...

Total PRESS = 843833.3

Sum of squares of response (SSY) = 8277222

Press/SSY = 0.1019464

The model appears to be reasonable and has passed the cross-validation test.

Figure 27. PRESS output for the one-variable flash point model. The larger the predictive discrepancy, the more influential the data point. Leaving out an influential data point can greatly affect the model. You may want to leave out some of the more influential points and rerun the analysis. Also, as with the table of residuals (Figure 26), you may want to look for obvious patterns in the types of chemistries that are influential.

At the bottom of the table is an assessment of whether the model meets the PRESS/SSY <= 0.4 cut-off. Failure would mean that just a few data points account for most of what the model explains; in that case you may want to remove some of these data points and redo the whole analysis.
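The sketch below illustrates the idea in Python, under the assumption that each predictive discrepancy is the squared error obtained when a compound is left out, the line is refitted, and the left-out compound is then predicted (the usual leave-one-out definition of PRESS; the printed values are consistent with it, but the program's exact method is not documented here). The small x, y data set is hypothetical; the PRESS and SSY totals are the ones printed in Figure 27.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press_contributions(xs, ys):
    """Squared leave-one-out prediction error for each data point."""
    contributions = []
    for i in range(len(xs)):
        # Refit the line without point i, then predict point i.
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predicted = intercept + slope * xs[i]
        contributions.append((ys[i] - predicted) ** 2)
    return contributions


def passes_cutoff(press, ssy, cutoff=0.4):
    """True when PRESS/SSY is at or below the cut-off."""
    return press / ssy <= cutoff


# Hypothetical data: a straight line with one deliberately discordant point.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 20.0]
for x, contribution in zip(xs, press_contributions(xs, ys)):
    print(f"x = {x:.1f}  predictive discrepancy = {contribution:.3f}")

# Totals printed by the program for the flash point model.
total_press = 843833.3
ssy = 8277222
print(total_press / ssy, passes_cutoff(total_press, ssy))
```

In the hypothetical data the last point should dominate the list of contributions, which is the pattern to look for in the program's own PRESS table.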

Figure 28. Plot of observed versus predicted values of flash point in the one variable model. Outliers (the data points farthest from the regression line) can be identified by clicking on them on the graph with the mouse. Make sure that outlying data points are not typos (check for errors). Consider whether the calculated field (1/enthalpy of vaporization) is likely to describe the outlying compounds well; if not, exclude them from the analysis and tell the users of the model that it is not reliable for those types of molecules.

Figure 29. Plot of predicted values of flash point against the residual values (residual = observed value minus predicted value). Optimally this plot is a scatter plot with a completely random arrangement of the data points. Curvature or patterns may indicate problems such as intercorrelation of the dependent and independent variables, or the need to transform one of the fields or to add a parabolic or cross-product term. Examination of the residuals is a critical part of model validation. Further discussion can be found in the on-line Help file.
