
Molecular Modeling Pro Plus Online Help (v. 8) 
Draw
Molecules  Optimize Geometry  Create Databases  Edit
Databases  Calculate Properties 

Quantum Mechanics  Automated Data Analysis Analyze data Manually (tutorial)  Graph data  

The
Drawing Window  menu item descriptions  The
Database Window  menu item descriptions 
This web page is a basic
introduction to the program. More
complete descriptions of the program and tutorials can be found in the printed manual
and the item by item descriptions of the Drawing Window menus and the Database Window menus.
Two tips on installation
1. The first time you use the
program you will want to initialize the links to companion programs. Below are some default settings (File
menu/Initialize Links). If you install
the program in some directory other than c:\molsuite\mmp then the filenames
below should be changed to reflect this.
MMP+ should do this automatically.
If you do not have ChemSite then that text box
should be left blank. If you have a
Brookhaven pdb file viewer put its filename in the
Protein data base viewer box below (Rasmol, found as
freeware on the internet is an example).
If you check Open in database mode and give a database file name, that
database will open every time you open MMP+.
Note: Your path name will not
be c:\molsuite\mmp if you installed the program in a different directory.
Drawing molecules is easy
with this program. Before starting note
that a blue message box often appears near the top of the screen which contains
information on the status of the drawing window and which will help guide you
through the drawing process.
Some rings are not in the library of course. You have three options for drawing these
rings. (a) select a ring from the
library list and change it, (2) type in the number of atoms in the ring and
have the generic ring builder create a ring which you will have to change where
atom types and bond types are not what you want (also accessed through the Ring
button) or (3) draw the atoms in the ring one by one as above and close the
ring using the "Connect" button near the upper right of the screen. Before connecting the atoms you will probably
need to rotate some torsional bonds (Rotate menu/bonds). With items 2 and 3 you will need to Minimize the molecule to get good ring geometry.
Step by step: Drawing benzamide and water together on the screen.


Figure 1 
Figure 2 
Minimizing energy to get good 3D
geometry:
The geometry minimizers are
accessed from the Geometry menu. Choose the menu item :Minimize."

Use of the the Minimize
Geometry window Full MM2 is the standard geometry minimization
procedure. It finds local minima. If you need to have sterochemistry
preserved AND find global minima, run conformational analyses (first menu
item on the Geometry menu) before running MM2. Quick MM2 finds global minima but may ruin stereochemistry If these methods don't work
we recommend the Use Simplex minimization method. This method is slow and sure. Sometimes combining it with MM2 iteratively
gives the best answers. The Moly
minimizer "Refine" routine may be a final resort if all else
fails but is the least recommended method. If you own the program ChemSite, the Amber
force field and minimizer found there is a standard minimizer for
macromolecules. 
Figure 3 

The
standard minimizer in the program is a version of MM2 (Molecular Mechanics)
developed by Norman Allinger and coworkers. The version here was programmed by P. Baracic and M. Mackov and has an
additional feature  the "Quick" minimizer. Use the "Full MM2" when you
want the stereochemistry to be preserved
and only desire a "local minimum" (e.g. the lowest energy model in
the current conformation). The Quick
minimizer finds a global minimum energy conformation by first taking the
molecule apart based on the location of torsional angles, then minimizes energy
of the parts, then reattaches the bonds with a conformational analysis around
the torsional bonds. In the process the
Quick minimizer uses a ring library to speed up the operation. The Quick minimizer can destroy the
stereochemistry that you carefully have built in the molecule, so keep this in
mind. Otherwise it is more powerful than
the "full MM2" despite its name.
The MM2 minimizer looks up optimum values for
bond lengths, bond angles and torsional bond angles. If an atom combination is missing from the
database it returns an error. In this
case you may have to go to the Edit menu and choose Undo from the menu to
retrieve your molecule. Molecules of
greater than 999 atoms cannot be run through the MM2 minimizer. Atom combinations not typically found in
organic chemistry should probably not be run through MM2 either. If you have some of these exceptions, use the
MOLY Simplex minimizer.
With
the MOLY and Simplex minimizers you can see the minimization take place on the
screen. With the MM2 routine you just
wait and the minimized molecule is eventually returned from the MM2 DLL.
Conformational Analysis

Conformational Analysis is
the first item on the Geometry menu.
It allows you to rotate torsional bonds while calculating strain
energy due to atom atom overlap, Coulomb charge
interactions and hydrogen bonding. In
the example below (figure 5) clicking on carbon atoms 1 and 2 rotates the
bond shown so that the isopropyl group rotates around the axis formed by the
bond. You can rotate two bonds at
once. At the end of the analysis the
program sets the molecule in the minimum conformational geometry and prints
out the chart in figure 6. Using the
conformational energy routine iteratively may be necessary if you have more
than 2 conformationally flexible bonds.
You also may often want to run this routine before the MM2
"full" minimizer in order to obtain global energy minima (e.g. the
best possible geometry). 
Figure 4 Figure 5 > 

Figure 6 > Conformational analysis of the molecule shown in
figure 5. Bond 1 is rotation of the
isopropyl group and bond 2 rotation of the amide group (the arylC bond forms
the axis of rotation). Strain energy
is in kcal/mole with lowest energy the best conformation. 
MOPAC
Geometry Optimization
Perhaps
the best geometry optimization method is using the PM3 force field in
MOPAC. The caveat being it too only
finds a local minimum. See the section
under Quantum
Mechanics for more details.
Geometry
Optimization of whole directories of molecules
You can optimize whole directories of molecules from
the menu item under the Geometry menu entitled minimize all molecules in a directory. The process is similar to that described
above for single molecules except you must additionally tell the program which
directory to use. Molecules must be
stored in Macromodel or MDL Molfile
formats. Any of the minimizers can be
used (MM2, Simplex, MOLY, ChemSite Amber). The MM2 Quick minimizer was especially
designed to turn 2D molecules into 3D molecules. Remember, though, that it changes sterochemistry, so take some care with 3D molecules where sterochemistry is important.
Adding
a molecule to a database is as simple as: a) select Save Database from the File menu of the drawing window. and Save this molecule to database from
the submenu; b) select the directory and type in the database name; c) if the
structure is saved in a separate file (this is an option described later) type
in the structural file name and the program will do the rest. However, if the database is not yet made
there are some additional things to do which are described next.
The
first step in creating a new database is the same as adding a molecule to an
existing one:
·
Select Save
Database from the File menu and Save this molecule to Database from the submenu
·
Select the
directory. Select the type of database
from the list at the bottom. The MAP
.csv file type is no longer recommended unless you are using the old program
Molecular Analysis Pro (it is limited to 72 numeric fields and MMP+ is capable
of creating databases with many times more fields). Microsoft ACCESS or XML are both fine for
creating databases. If you are going to
store reactions then use ACCESS or a clientserver database (SQL Server,
Oracle) that is capable of storing multiple interrelated tables. Type in the database name.
·
The next window
looks much like a file saving window, but consider carefully if you want to
store the structures within the database itself or as separate connection
tables (e.g. MDL Molfiles, Macromodel
files, Brookhaven pdb files etc.). Storing them internally will create one very,
very large file if you have a large database.
Storing them in connection table files will create potentially thousands
of small files in a standard file format such as the MDL Molfile.
·
Next to appear is
the window where you select which properties to store (figure 7).

Figure
7.
Property Selection Window after the Check All fast fields box has been checked. 
In
the example above the Check all fast fields (ones which are quickly calculated)
was checked which caused the majority of the fields on the screen to also be checked.
If
you check any of the "Fields requiring you to choose atoms during
saving" then if you later wish to add whole directories of molecules to
the databases you will have to sit by the computer and select atoms in each
molecule for the properties you select.
For instance if you select interatomic distance you will have to select
the two atoms involved. These properties
can be useful in QSAR studies, they will just slow down batch database
creation.
The
fields at the bottom of the screen with the exception of Bioavailability
require CNDO or MOPAC calculations and can take a minute or two to
complete. Including these fields in a
large database will add hours to the time it takes to calculate the
properties. However, the properties
calculated can be very useful.
Some
of the property boxes select multiple properties and place multiple fields in
the database. These include the Burden
CIM indices, Bioavailability, MOPAC/PM3 geometry + properties (which also does
a geometry optimization using MOPAC  which will be stored), polymer properties
and the Joback and Reid thermodynamics. It especially includes the substituent
indicators which will add over 160 fields (substructure fragments like the
number of carbons, number of ester groups, number of amides etc.) to the
database.
See
the Calculate
Properties section below for more
details on physical property calculations.
·
The program will
ask you to type in a name for the molecule.
This field cannot be blank.
·
The database will
then be created and you will be asked if you want to see it. You should take a look and make sure that the
program has done what you wanted it to do.
Besides adding molecules from the File menu of the drawing
window, you can also add molecules to the databases from the Edit menu of the
Database window. See The Database Window  menu
item descriptions.
Tip: A database of industrial chemicals called
industrial.mdb is installed with MMP+.
This file contains over 800 molecules and can be the starting point for
creation of a large database of commonly used chemicals. It also contains information about the
compounds from the literature that you may find useful.
Creating
databases in batch mode
Besides saving molecules one at a time, there are ways
to create databases or add molecules to existing databases many at a time. If molecules are stored in a directory in one
of the standard connection table formats that the program reads (e.g. MDL Molfiles, MACROMODEL files), then you can use the batch
procedure in MMP+ that will take all the molecules in the directory and add
them to the database. The procedure is
almost identical to saving an individual molecule. Go to the File menu of the drawing window and
select Save Database/All Macromodel files in a directory or Save Database/All Molfiles in a directory.
Creating
a QSAR database of closely related analogs from a substructure
Draw in the substructure. Select from the File menu of the Drawing
window  Save Database/Make QSAR database... You then select one or two sites on the
molecule to vary and tell the program what the database name should be and
where you want to store it. You can add
the molecules to an existing database or create a new one. The program automatically will add the
various substituents to the substructure drawn.
If you only vary one site it adds all the substituents listed in the
file aromqsar.txt (aromatic substituents) or alipqsar.txt (aliphatic subsituents). If you
vary two sites then it adds 900 molecules to the database (uses the first 30
substituents in the files aromqsar.txt or alipqsar.txt and 30x30=900). If you edit the text files be sure to save
them first to a backup file (you can add substituents by editing these files if
you have literature values for the fields in the files and have saved a
connection table). The physical property
fields created by the routine are somewhat different than those created by other
routines in MMP+. Many of the values are
substituent values from the literature (pi, mr, sterimol parameters, sigma) and are obtained from the
tables in the two text files.
Step by
Step  Creating a QSAR database
·
Draw benzamide. If you
need instructions, go here...
·
Go to the File
menu. Select Database Save and Make QSAR database
·
Click on the
hydrogen para to the amide group.
·
We will only vary
one substituent site, so hit "No" when it asks if you want to vary
another.
·
Name the database
HelpExample and save it to the /mmp
root directory.
·
Although it will
take a while, let's have the program minimize with MOPAC/PM3 and save some of
the quantum mechanics properties (check the third check box in figure 8
below). We are going to create one
indicator variable two so change the text box displaying a zero to a
"1". The QSAR database
creation adds the substituents with a crude (30 degree) conformational
analysis. MOPAC PM3 will take care of
the fine geometry minimization.

Figure
8. The QSAR database creation options
screen. Each check box will save
several additional calculated fields to the database. Checking polymer properties will save a
number of properties using van Krevelen's
methods. Thermodynamics saves the Joback and Reid method calculations. Bioavailability saves the Rule of 5
calculations (0 = uptake likely), Log P (Moriguchi
method) and the number of hydrogen bond acceptors and donors. Indicator variables are useful when running
this routine iteratively (e.g. using different substructure templates and
merging the data). The indicator
variable values will distinguish each iterative run from the next. 
·
The first three
letters of the Macromodel file names we will set to
bap (for benzamide and para).
·
The building
block structure we will call benzamide (p)
·
The value for
indicator 1 we will set to "1".
·
The database will
now be created. Most of the time spent
will be in the MOPAC calculations. If
the program asks if you would like to add some hydrogens, answer
"No".
·
When the program
asks if you would like to open the database answer "No".
We now will repeat the process with meta substitution:
·
Go to the File
menu. Select Database Save and Make QSAR database
·
Click on the
hydrogen
·
We will only vary
one substituent site, so hit "No" when it asks if you want to vary
another.
·
Name the database
HelpExample and save it to the /mmp
root directory.
·
When the program
asks if you want to overwrite the database select "No".
·
When the program
asks for the three letters of the Macromodel file
names set them to bam (for benzamide and meta).
·
The building
block structure we will call benzamide (m)
·
The value for
indicator 1 we will set to "0".
·
The database will
now be created. Most of the time spent
will be in the MOPAC calculations. If
the program asks if you would like to add some hydrogens, answer
"No".
·
When the program
asks if you would like to open the database answer "Yes".
·
In the database
window go to the View menu and select "File Card". Your screen should be displaying something
similar to figure 9 below.
Figure 9. QSAR database created by our step by step
tutorial. The following values are for
the substituent (not the molecule as a whole) and are literature values from
tables: pi (lipophilicity), MRef (molar
refractivity), Sigma_meta, Sigma_para,
L1 (length), B1 (minimum width), B4 (maximum width). Indiacator 1 is
"1" for para substitution and "0" for meta
substitution. The fields after
indicator_1 were calculated by MOPAC.
The values before pi are molecular properties calculated by MMP+. Tip: clicking on the gray area near the
bottom ["name for benzamide (p)(NH2) = benzamide (p)(NH2)] will bring it full screen. Clicking again will bring back the above
view. This allows you to view overly
long text fields.
Comparison of ClientServer and Desktop database
features
This section does not apply
to SQL Server or Oracle databases (ClientServer). If you have Molecular Modeling Pro Plus
Enterprise Edition go
here for instructions on opening
databases. The editing features of the
ClientServer and desktop database function almost identically with one
exception. For the Desktop applications,
to save changes to the databases you must go to the File/Save menu and tell the
program to save the data. In
ClientServer mode only authorized users can make changes to the database and
even authorized users may not have permission to edit all fields. Changes made to ClientServer databases are
made immediately to the database and require no further saving. Your system administrator should backup the
database daily or at least weekly to recover the database if someone makes a
catastrophic error. Desktop databases
are by default backed up every time you open one (to database filename.bak).
Basic Editing tasks
To change values simply type
into the spreadsheet cells. When you
have made the changes save the data (File menu).
Tip: Memo fields can contain
over a megabyte of data. Memo fields
include and field named "Name", SMILES, formula, general_information,
trade_name, safety_data,
and IUPAC_name.
To view all the information in one of these fields click on the gray
area near the bottom of the screen containing the text information. This will bring it full screen (you cannot
edit the information here however). To
bring back the spreadsheet view simply click on the text again.
Definitions: Record =
molecule; Field = molecular property
To add a new molecule to the
database draw in the molecule, then go to the database window's Edit menu and
choose Records(molecules) and Add Molecule. Follow the prompts.
To delete a molecule, select
the molecule in the database spreadsheet by clicking on it. Go to the Edit menu and choose Records(molecules) and Delete current molecule. Follow the prompts.
If not using a clientserver
database, record (molecule) additions and deletions do not take effect until
you save the data. If in clientserver
mode you may or may not have rights to add and delete molecules (flags are set
by the system administrator).
Reactions
Databases of reactions can be
made as well. These databases require
two tables so must be stored in a database format that uses multiple tables
like Microsoft ACCESS, Microsoft SQL Server or ORACLE. For more information on creating reaction
databases see the description HERE.
Substructure Searches
Load the database. Draw in
the substructure. Go to the Edit
menu. Choose Substructure Search. The program will ask you if you want to create
a dataset of the molecules containing the subset. Usually you will want to choose NO here,
especially if you have made changes to the database and not saved them. Choosing YES will create a new database
containing only the molecules containing the substructure (which you can make
permanent by saving). Choosing NO will
give you a list of molecules which contain the substructure from which you can
select molecules in the current database.
For more go HERE.
For a more complete
description of the menu items in the database window go HERE.
Calculation of
Chemical Properties from Structure
Physical properties are
calculated from structure using methods garnered from the literature, plus by
the MOPAC and CNDO programs, plus with methods developed internally by Norgwyn Montgomery Software Inc. Most of these properties are accessed from
the Calculate menu of the drawing window.
They also are calculated during database creation. The basic procedure is to draw
the molecule, then simply select the properties you wish to calculate from the
Calculate menu. MOPAC and CNDO properties are accessed from the Tools menu. A description of the properties calculated
are HERE. The menu
items found under the calculate menu are described in: The Database Window  menu
item descriptions. Figure 10 shows the Calculate menu.
With the advent of version 8 of the program, the old
methods are being replaced by new methods created with the MMP Substructure Analysis
routine which gives superior models of properties. These models are documented and are
downloadable from the COMMUNITY sharing page at https://www.norgwyn.com/models/Models.html.

Figure 10. The Calculate
menu is used to find calculated values for the molecule drawn. Most of the calculations have limitations
to organic molecules containing H, C, N, O, F, Si, P, S, Cl, Br and I. A description of the properties is found HERE. MOPAC and CNDO physical
property calculations and geometry optimization is done from the Tools menu. Many of the menu items
contain multiple property calculations.
For instance, the Solubility Parameter menu item is linked to
calculations for the solubility parameter (3 methods), HLB, Log octanol water
partition coefficient (Log P), water solubility, mean water of hydration,
polar surface area, % hydrophilic surface area, and the Hansen 3D solubility
parameters. 
Quantum
Mechanics
Molecular Modeling Pro Plus supports two public domain
semiempirical quantum mechanics programs: CNDO and
MOPAC (version 6 included). The program
MOPAC.exe included with this program was developed at the Air Force Academy and
the version here was adopted for Windows by Victor Lobanov
at the
The program CNDO (Complete Neglect of Differential
Overlap) is the earlier of the two programs and was developed by J. Pople and coworkers in the 1960s and 1970s. It is still thought one of the best ways to
calculate partial atomic charges and dipole moments. MMP+ also uses it to calculate LUMO (lowest
unoccupied molecular orbitals) and HOMO (highest occupied molecular
orbitals). For most other uses, though,
the MOPAC methods are more sophisticated and faster. Pople abandoned
semiempirical methods and worked on abinitio calculations after CNDO.
MOPAC was developed by J.J.P. Stewart and colleagues
at the Air Force Academy. It is actually
a collection of methods semiempirical quantum mechanics methods developed by
them and by M.J.S. Dewar and colleagues at the

·
Figure 11. Enter the MOPAC screen from the Tools menu
of the drawing window. Choose the
method to use. The AM1 and PM3 force
fields are more modern than the MNDO and MINDO methods. MNDO and PM3 support the most atom types. If you checke
POLAR  the polarizability and first and
second hyperpolarizabilities are to be
calculated. THERMO  Thermodynamics calculations can be performed on
molecules. The key words FORCE and ROT
also must be included. The combination of these three key words will give the
user additional thermodynamic property calculations: internal energy, heat capacity, partition function and entropy for translation,
rotation and vibrational energy over a range
of temperatures 
If you do more than one iteration (don't check 1SCF),
MOPAC will find a local geometry energy minimum (i.e. will return changed,
optimized atom 3D coordinates) and MMP+ will display the new geometry. MMP+ will also cause the MOPAC output to be
displayed in a Notepad window.
For a more complete description of MOPAC and CNDO see The Drawing Window  menu item descriptions . A description of the MOPAC output is found HERE.
Molecular Modeling Pro Plus has two methods for automated generation of
QSAR/QSPR models which were designed to let chemists with only a general
knowledge of statistics create usable models fairly quickly. You access these methods
by first opening a database. You must first add the experimental data you wish
to model to the database by typing it in or importing it from Excel. Then you
just select either "Substructure Analysis" or "Stepwise
regression" from the Analyze menu of the Database window. The Substructure
Analysis routine will analyze all the molecules in your database for the unique
substructures contained in the structures and make variables from the
molecular fragments to correlate (regress) against the experimental data. The
Stepwise regression model will use all the existing fields (calculated and
experimental properties) already contained in the database instead of
substructural fragments. A further explanation follows.
To access this
feature: Open a database. Go to the Analyze menu. Choose the first item,
"Substructure Analysis."
What this
analysis will do: After you choose the property to model, this method will
analyze all the structures in your database and create independent variables of
the substructures contained within the molecules. It will then use Stepwise
multiple regression to find the substructual features
that are significantly correlated with the property and places the model in
your \models subdirectory. You can then predict the property of any molecule
you draw from the Drawing window by selecting "user defined
properties" from the Calculate window.
What can be
analyzed: This method does not care what atom types you have in your database.
It creates the substructures from the molecules themselves, not from a lookup
table. You should be able to analyze any set of molecules as long as you have
input the structures correctly. Molecules should be drawn in a consistent
manner (e.g. if you draw nitro as N(=O)=O one time and
N+(=O)O the next it will not recognize them as the same substructure. Salts
and ionic molecules are another type of molecule that it is possible to draw in
many different ways.)
CAVEAT EMPTOR:
On large databases this method will take days to complete the task. On a
database of 72 structurally related molecules this method takes a few minutes
to complete its task on a slow XP computer. On a database of 800 molecules
containing mostly solvents and surfactants, this method took 2 hours to
complete its task on a slow XP computer. On a database containing over 4000
molecules with over 1300 unique substructural features on an Intel I5, Windows
10 computer the program took about a week to converge. Fortunately, the
computer was still usable as the task ran in the background. However, the
database being used was tied up for the duration of the analysis. The length of
the task is a function of your computing speed. The size of the database that
can be used is limited by RAM.
Details of the
Method:
The number of
variables that will be generated is function of the size and complexity of the
structures in the database. The default setting for maximum number of
substructural variables is 1500. If you get an "out of memory" error
go to the Options menu of the Database window and decrease the number of
variables allowed in the analysis. If using the mydata.mdb or chemelectrica.mdb
database you will need to set this to about 1500 because the program will find
about 1300 substructures than this number (as of January 2018).
Explanation of this window:
1. Maximum number of causal variables: The program
will calculate this based on the size and complexity of your database, up the
maximum allowed (see previous paragraph). You can adjust this number here and
in the Options menu.
2. Exclude by molecular weight (boxes 2 and 3): You
can limit the analysis to only consider molecules between a minimum and maximum
molecular weight.
3. Include only molecules with:  Check this box if
you want to only want to analyze molecules containing specific atom types. You
can modify which atoms you look for by adding or subtracting atomic symbols
from the comma separated list.
4. Consider transformations of the data:
If checked, the program will try to determine the best
transform for the data (i.e transform the experimental
response parameter.) It will do this
using only for the models of the count for each type of atom as independent
variables (i.e. number of carbons, number of hydrogens etc.). The transforms of response used by MMP are a
combination of: (a) multiply or divide by molecular weight and/or molecular
volume with raw data, and (b) log10, square, square root or logistic and the
inverse of these five transforms. A
total of 70 different combinations of these factors will be analyzed. The program will choose one, then proceed
with the rest of the analysis. If you do
not choose to consider the transformations, the program will still see if the
data is better fit by dividing all the atom counts (e.g. number of carbons,
number of oxygens, etc.) by molecular weight (it will not do this if you select
transformations, as multiplying the response by molecular weight is similar to
dividing the independent variables by molecular weight.)
5. Look at cross product and squared terms. After the model completes the stepwise
regression of all the independent variables, it will look the squares and cross
products of all the terms in the model to see if they improve it further.
6. Do not amalgamate CH3, OH, SH, NH2, NO2 or halogen.
If checked, the method will use a simpler, faster set
of fragments to analyze the data. You would complete the analysis sooner and
perhaps have a more general model, but lose some of
the structural information in the data.
MMP+ combines terminal groups, like H, OH, =O, =S, F,
Cl, Br, I, NH2, #N, NO2 and CH3 with the atoms they are attached to in the
default model. It also looks for concentrations of heteroatoms with structures:
O[or N]C(=O) [or =S]N [orO].
If you check this box the program will only attach H, =O and =S.
If you check the box the molecule CH3CH2C(=O)NHCH2CH3 would generate the following pairs for analysis:
Number of C(H)(H)C(H)(H)(H) pairs = 2; Number of
C(=O)C(H)(H) pairs = 1; Number of C(=O)N(H) pairs = 1; Number of C(H)(H)N(H)
pairs = 1
If you left the box unchecked it would generate:
Number of C(H)(H)(CH3)N(H)
pairs = 1; Number of C(=O)N(H) pairs = 1; Number of C(=O)C(H)(H)(CH3) pairs =
1
In a database containing all kinds of molecules
containing over 4000 values for log Kow, checking
this box generated 400500 unique combinations of pairs, while leaving the box
uncheck generated over 1300 substructural features. Leaving the box unchecked gives a model
containing more chemical information, with the danger that the model will be
less general.
7. Minimum probability allowed. I have set the default
at 0.05, the typical value used by statisticians. This appears to work even
with data acquired from multiple sources from the internet. Setting this value
correctly is a function of the accuracy of your data, and
setting it high or low will effect where the analysis stops (how many
independent variables are included in the model.) For highly accurate data this
value should be set very low (molecular weight, for instance, could be set at p
<=0.0001). For sharing in the literature, you may find resistance from
reviewers or regulators with p > 0.05. [p = probability that an independent
variable's correlation is due to chance. 0.05 means 5% probability.]
8. Use connectivity and valence indices in the
analysis. This will cause the method to consider connectivity and valence
indices 1 through 4 and the kappa shape index 2 in the analysis. If will not,
though, force them into the model as an included variable. This feature is
useful if you think the model should contain both whole molecule properties and
substructural fragments in the model.
9. Include Hydrogen Bonding Interaction terms. The
program will consider some additional variables if you check this box. It will
look to see if each molecule contains more than one of the following features
and will create a cross product of the number of each pair in every molecule. The
H bonding groups it looks for are NH, NH2, OH, =O, COOH, SH, S(=O) and P(=O).
In our example of CH3CH2C(=O)NHCH2CH3 this would
create one additional term: NH x C=O. It would also create this term for the
molecule O=CHCH2CH2NHCH3 (not just for amide.)
10. "Do not include molecules with formally
charged atoms." Checking this box will eliminate molecules with formally
charged atoms, for instance, Na+, Cl from the analysis. It will eliminate most
salts, for instance, from the analysis of the chemelectrica.mdb and mydata.mdb
data sets.
11. Alternate
between forward and backward regression.
When this is checked the program will for do forward regression for a
number of steps. Next it will add all
the substructural features that had p<0.05 (or the
significance selected in step 7 aobove) to the model
and do a backwards regression. It will
continue until no variable with p<0.05 or which increases r square by more
than 0.0001 can be found and then do a final backwards regression. The program will do about 10 forward steps
the first time, but thereafter will do the number of forward steps you specify
in the line after the check box (default = 3 steps.) For large data sets, we recommend you keep
this box checked, because it greatly speeds the analysis (answer in hours or
days instead of weeks.)
If you do
not check this box, then the model will do a forward regression all the way to
the end and do a backwards regression only after no more significant variables
can be added. Leaving this box unchecked
can lead to a very long analysis with large, complex data sets (lasting weeks.)
12. Transform by dividing the variables by each
other. Checking this box will cause the
program to consider dividing every significant independent variable by each other. In conjunction with repeating the process
(item 13 below) very complicated nonlinear models are possible. It adds one to the denominator to avoid
division by zero. A consequence of this
is that it will also evaluate the x/(x+1) transform (dividing a variable
by itself plus one which describes a curve converging on one.) Checking this
option will greatly increase the time of analysis and may make the model less
general (while better predicting that data in the database.) With repeated iterations (next paragraph) it
can be used to model very complicated internal molecular interactions.
13. Number of times to repeat the process (max 5): If
you check cross products and division, you will usually be able to get a better
model by repeating the process by going back to the step of doing a combination
of forward and backward regression, then repeating the combinations and
divisions steps. The process looks like
this:
Usually two iterations are sufficient (keep the default
2), but sometime three will do it.
Iterating the process will greatly lengthen the time of the analysis. Variables can become very complicated as
cross product terms and division terms can be further multiplied and divided by
each other.
14.
Comments. If you would like to
append a line of comments to the output file (author, references, explanation),
place it here. For texts longer than a
sentence or so, create a separate document.
After hitting the "Done" button, a new
window appears from which you will select one response parameter (usually
experimental data you wish to analyze.) Select a parameter and hit
"Done."
Next a window will appear asking if you wish to force
some variables from your database into the analysis. This allows you to combine
the molecular properties calculated by MMP with substructural features of your
dataset.
Next you will see all the molecules in the database
flash by as the program determines the substructural features of the
molecules. The program is not looking for specific features, but creates the
substructural variables from scratch, depending on what combinations of atoms
are contained in all the molecules in the database. After completing this, the log window appears
and tells you how many molecules and variables will be included in the
analysis. The program next looks for fields created that have the same exact
values as other fields and for fields which have all the same values and
eliminate these variables from the dataset. The program will display the names
of the eliminated fields in the log window. The program proceeds with forward and
backwards stepwise regression, adding parameters to the model until (a) no
variables can be added that do not meet your significance level and the model r
square does not improve by at least 0.0001 or (b) you hit the Cancel button.
After completion the model will do a final backwards Stepwise regression to
remove any variables that did not meet the level of significance that you
specified, one step at a time starting with the least significant variable. If you selected cross products and squared
terms, it will then look at all the combinations of the variables already in
the model.
The program will ask you to save the dataset created
by the program to a tab delimited text file. This will save the molecule names,
the response (dependent variable), the predicted values, the molecular formula,
the molecular weight, the connection table link and all the independent
variable values. You can open up this database later in MMP or in Excel or any
other program that opens tab delimited text files. Note this data will not be
particularly secure. The program will also separately save all the information
in the log window to a file called mmp.txt and open it with Notepad. If you
want to save this log file save the file in Notepad and give it a different
name.
After saving the data the program will display the
model itself, as well as charts of predicted versus response and predicted
versus residual values. Residuals are response minus expected values. Clicking
on the data points to the far left or right of the residuals chart will
identify the outliers (poorly predicted molecules.) Often these outliers are
inaccurate experimental results. The file containing the model itself is
automatically stored in the mmp\models subdirectory.
This file is parsed by the program and should not be changed, though you can
change the name of the file. If you share this file with other people and they
put it in their mmp\models directory, then they will
be able to use the file to predict properties of molecules using MMP.
To predict the value of the property for a molecule
not in the database or for one which does not have an experimental value do the
following:
1. Draw the molecule.
2. In the Drawing window go to the Calculate menu and
choose "User Defined Properties." A new window will appear with a
list of user defined properties. Click on the one you want and hit Run.
NOTE: MMP+ comes with models for boiling point,
specific gravity and log Kow already in the \models
folder. These models are used by the program to calculate these properties, so
overwriting or deleting these files will change the default method used by MMP
for calculating these properties. If you are dissatisfied with the way MMP
calculates these properties you can rerun the analysis by opening the chemelectrica\mydata.mdb database, adding or removing
molecules, editing the literature data fields, and rerunning the substructure
analysis. Running the log Kow analysis took about a
week on an Intel I5 processor.
COMMUNITY SHARING PAGE:
New models will be published periodically at www.norgwyn.com/models/Models.html. You can
download them there free of charge or submit new models for everyone to use that
will appear on this web page. From the
web page, just download the model text file into your mmp\models
subdirectory and they will be usable from the Calculate/User Defined Properties
menu of MMP.
A brief explanation of what is in the file for the
model:
====================================================================
The Total (uncorrected) number of 305 is the number of
molecules contained in the final model. (df stands
for degrees of freedom)
The number 112 next to “Regression” is the number of
variables. You usually want about 4x
more molecules than variables, depending on the quality of the data.
The larger the F score, the better the model. The probability of the F statistic being
significant is shown below the table.
R squared multiplied by one hundred is the percent of
the variance explained by the model.
S is the model standard deviation. The 95% confidence level is twice this, plus
or minus.
Below the analysis of variance table, you can see the
first terms in the model itself. The
format is, name of the variable, the coefficient that is multiplied by the
number of time the substructure appears in a molecule and the standard error, t
score and probability that the term is significant (0 is a term so significant,
the program could not determine the value.)
To predict the value of the property for a molecule
not in the database or for one which does not have an experimental value do the
following:
1. Draw the molecule.
2. In the Drawing window go to the Calculate menu and
choose "User Defined Properties." A new window will appear with a
list of user defined properties. Click on the one you want and hit Run.
NOTE: MMP+ comes with models for boiling point,
specific gravity and log Kow already in the \models
folder. These models are used by the program to calculate these properties, so
overwriting or deleting these files will change the default method used by MMP
for calculating these properties. If you are dissatisfied with the way MMP
calculates these properties you can rerun the analysis by opening the chemelectrica\mydata.mdb database, adding or removing
molecules, editing the literature data fields, and rerunning the substructure
analysis. Running the log Kow analysis took about a
week on an Intel I5 processor.
2.
Stepwise Regression.
What this analysis will do: After you choose the
property to model, this method will attempt to correlate it with a combination
of all the other properties in your database. It will then use Stepwise
multiple regression to find the properties that are significantly correlated
with the response you choose and places the model in your \stepwise
subdirectory. You can then predict the property of any molecule you draw from
the Drawing window by selecting "user defined properties" from the
Calculate window.
1. Maximum
number of causal variables: The program will calculate this based on the size
and complexity of your database, up the maximum allowed (see previous
paragraph). You can adjust this number here and in the Options menu.
2. Exclude by
molecular weight (boxes 2 and 3): You can limit the analysis to only consider
molecules between a minimum and maximum molecular weight.
3. Include
only molecules with:  Check this box if you want to only want to analyze
molecules containing specific atom types. You can modify which atoms you look
for by adding or subtracting atomic symbols from the comma separated list.
4. Use log
molar transform: After checking this box, your dependent variable (response
variable) will be transformed before analysis, i.e., y will become
log10(y/molecular weight). Note that the
data will be transformed back to the original transform when determining
properties of new molecules with unknown property values.
5. Take the
inverse of the response. After checking this box your dependent variable will
be transformed before analysis (i.e. y will become 1/y).
6. Minimum
probability allowed. I have set the default at 0.05, the typical value used by
statisticians. This appears to work even with data acquired from multiple
sources from the internet. Setting this value correctly is a function of the
accuracy of your data, and setting it high or low will
effect where the analysis stops (how many independent variables are included in
the model.) For highly accurate data this value should be set very low
(molecular weight, for instance, could be set at p <=0.0001). For sharing in
the literature, you may find resistance from reviewers or regulators with p
> 0.05. [p = probability that an independent variable's correlation is due
to chance. 0.05 means 5% probability.]
7. Alternate between forward and backward
regression. When this is checked the
program will for do forward regression for a number of steps. Next it will add all the substructural
features that had p<0.05 (or the significance selected in step 7 aobove) to the model and do a backwards regression. It will continue until no variable with
p<0.05 or which increases r square by more than 0.0001 can be found and then
do a final backwards regression. The
program will do about 10 forward steps the first time, but thereafter will do
the number of forward steps you specify in the line after the check box
(default = 3 steps.) For large data
sets, we recommend you keep this box checked, because it greatly speeds the
analysis (answer in hours or days instead of weeks.)
If you do not check this box, then the
model will do a forward regression all the way to the end and do a backwards
regression only after no more significant variables can be added. Leaving this box unchecked can lead to a very
long analysis with large, complex data sets (lasting weeks.)
8. Transform
by dividing the variables by each other.
Checking this box will cause the program to consider dividing every
significant independent variable by each other.
In conjunction with repeating the process (item 13 below) very
complicated nonlinear models are possible.
It adds one to the denominator to avoid division by zero. A consequence of this is that it will also
evaluate the x/(x+1)
transform (dividing a variable by itself plus one which describes a curve
converging on one.) Checking this option will greatly increase the time of
analysis and may make the model less general (while better predicting that data
in the database.) With repeated
iterations (next paragraph) it can be used to model very complicated internal molecular
interactions.
9. Number of
times to repeat the process (max 5): If you check cross products and division,
you will usually be able to get a better model by repeating the process by
going back to the step of doing a combination of forward and backward
regression, then repeating the combinations and divisions steps.
Usually two
iterations are sufficient (keep the default 2), but sometime three will do
it. Iterating the process will greatly
lengthen the time of the analysis.
Variables can become very complicated as cross product terms and division
terms can be further multiplied and divided by each other.
10. Comments.
If you would like to append a line of comments to the output file
(author, references, explanation), place it here. For texts longer than a sentence or so,
create a separate document.
After hitting
the "Done" button, a new window appears from which you will select
one response parameter (usually experimental data you wish to analyze.) Select
a parameter and hit "Done."
Next a window
will appear asking if you wish to force some variables from your database into
the analysis. This allows you to combine the molecular properties calculated by
MMP with substructural features of your dataset.
The list of
properties will appear again and this time you select properties to exclude
from the analysis (otherwise the program will look at every numeric field in
the database.)
The program
looks for fields created that have the same exact values as other fields and
for fields which have all the same values and eliminate these variables from
the dataset. The program will display the names of the eliminated fields in the
log window. The program proceeds with forward and backwards stepwise
regression, adding parameters to the model until (a) no variables can be added
that do not meet your significance level and the model r square does not
improve by at least 0.0001 or (b) you hit the Cancel button. After completion
the model will do a final backwards Stepwise regression to remove any variables
that did not meet the level of significance that you specified, one step at a
time starting with the least significant variable.
The program
will ask you to save the dataset created by the program to a tab delimited text
file. This will save the molecule names, the response (dependent variable), the
predicted values, the molecular formula, the molecular weight, the connection
table link and all the independent variable values. You can open up this
database later in MMP or in Excel or any other program that opens tab delimited
text files. Note this data will not be particularly secure. The program will
also separately save all the information in the log window to a file called
mmp.txt and open it with Notepad. If you want to save this log file save the
file in Notepad and give it a different name.
After saving
the data the program will display the model itself, as well as charts of
predicted versus response and predicted versus residual values. Residuals are
response minus expected values. Clicking on the data points to the far left or
right of the residuals chart will identify the outliers (poorly predicted
molecules.) Often these outliers are inaccurate experimental results. The file
containing the model itself is automatically stored in the mmp\stepwise
subdirectory. This file is parsed by the program and should not be changed,
though you can change the name of the file. If you share this file with other
people and they put it in their mmp\stepwise
directory, then they will be able to use the file to predict properties of
molecules using MMP.
After the analysis is done, to predict the value of
the property for a molecule not in the database or for one which does not have
an experimental value do the following:
1. Draw the molecule.
2. In the Drawing window go to the Calculate menu and
choose "User Defined Properties." A new window will appear with a
list of user defined properties. Click on "Go to Stepwise Directory"
(unless already clicked.) Click on the property you want and hit Run.
MMP+ has integrated into it statistical tools used in
molecular design and QSPR (quantitative structure property relationships  the
study of the relationship of experimental properties of molecules to chemical
structure). QSAR (quantitative structure activity relationships) is the
application of these techniques to biological and pharmaceutical properties of
molecules, such as the inhibition of an enzyme important in some human
pathology. MMP+ also contains some
graphing and charting features used in this sort of research (see next section). When applied
to chemical data instead of biological response data QSAR is called QSPR
(quantitative structureproperty relationships).
The statistical routines are accessed from the
Analysis menu of the Database Window. To
use the properties you must have first created a
database and opened
it. Alternatively you can use the
industrial.mdb database that comes with MMP+ for demonstration purposes. To learn about the data analysis features in
MMP+ open this database now (to open a database go to the File menu in the
Drawing Window, select Open Database, then choose ACCESS database from the file
types and open industrial.mdb in the MMP directory). The database window will
open as in Figure 12.

Figure 12.
Database window showing the Analyze menu. The industrial.mdb ACCESS database that
comes with the program has been opened. 
In this initial analysis we are going to use standard
multiple regression analysis. Multiple
regression is sensitive to the number of compounds relative to the number of
independent variables. Most QSAR/QSPR
data sets are characterized by containing approximately 15 compounds. In order to eliminate the possibility of
chance correlations, no more than 3 independent variables should be considered
for regression in the initial phase of the study. When these conditions exist, multiple
regression is the method of choice for the QSAR study. Otherwise you may want to use partial least
squares regression (part 3).
For purposes of demonstration we are going to model the
field Flash Point, which contains approximately 360 experimentally determined
values.
There are several possible ways to start the modeling
process. If you have no clue where to
start, the Regress with all combinations, the Stepwise regression procedure and
the partial least square (PLS) routines can all be considered. In this demonstration we will use linear
least squares regression and the routine Regress
with all combinations. Select this
option from the Analyze menu and check the two boxes so the options panel
appears as in Figure 13.

Figure 13.
Options panel for the Regress
with all combinations routine. We
will look at all 1, 2 and 3 variable combinations of variables and look at
transforming both the result parameter (Flash Point) and the causal
parameters. 
A list of variables appears and the program asks for
the Response parameter. Choose Flash_point_C from near the bottom of the list (and hit the
Done button). The list of fields appears
again, this time asking for parameters to include (force into the model). Just hit Done without selecting anything (no
included variables). The list of fields
appears a third time, but this time asks if you want to exclude anything
(figure 14). Click on the fields starting with name and ending with Trade
Names. Also select the fields starting
with Solvochromatic_cavity_term and ending with the
last field, connection_table_name as we do not want
to include the experimental, noncalculated fields.

Figure 14.
The field list in the Regress
with all combinations routine. The
example shows exclusion of text fields
not to be part of the analysis. 
The procedure will begin its calculations, considering
every combination of parameters. Since
we checked off the transformation and crossproducts boxes its will also consider
transform of both the result parameter (dependent variable) and the causal
parameters (independent variables). The
top 50 onevariable models, transformed variable model, crossproduct models
and two variable models (including transforms) will print out. Below is an abbreviated version of this
printout with only the best 5 models in each category listed (instead of 50).
Molecular Modeling Pro Plus 2004, Norgwyn Montgomery Software Inc. One variable models for Flash_Point_(C)
Variable df Intercept Coefficient r squared probability Critical_pressure_(bar)
360 215.768
3.97404 0.56163 << 0.00001  surface_tension_in_water 360 541.294 23.9295 0.546262 << 0.00001  boiling_point 360
47.3372 0.147412 0.461673 << 0.00001  water_solubility 360
93.4639 12.3763 0.388525 3.03687E40  Log_P 360 66.8162 8.48604 0.289003 1.99958E28  ==================================================
Transformed variable models for Flash_Point_(C) Variable df Intercept Coefficient r squared probability x transform:Flash_Point_(C)
1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
360 261.011
7787.1 0.744842 << 0.00001  x transform:Flash_Point_(C)
1/ln(connectivity_1) [+4.86681610107422] 360 459.642 860.804 0.7332 << 0.00001  x transform:Flash_Point_(C)^2
1/ln(kappa_2) [+8.8418798828125] 360 123880
297587 0.718872 << 0.00001  x transform:Flash_Point_(C)^2
1/ln(connectivity_1) [+4.86681610107422] 360 116635
232747 0.716677 << 0.00001  x transform:Flash_Point_(C)^2
1/ln(log_olive_oil_gas_partition_coefficient)
[+25.8770092773437] 360 167359
613474 0.714944 << 0.00001 ==================================================
Cross products, divisions and subtractions
of 2 variables for Flash_Point_(C) Variable df Intercept Coefficient r squared probability 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1)
[+4.86681610107422] 360 236.507
14276.4 0.743771 << 0.00001  1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*ln(boiling_point)
[+55.0082995986938] 360 284.957
1693.46 0.740526 << 0.00001  surface_tension_in_water*ln(boiling_point) [+55.0082995986938] 360 252.301 2.2992 0.734733 << 0.00001  surface_tension_in_water*1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
360 272.602
324.658 0.723927 << 0.00001  LUMOsurface_tension_in_water
337 696.529
29.7928 0.716809 << 0.00001  ==================================================
Two variable models for Flash_Point_(C)
Variable df Intercept Coefficient r squared probability 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1)
[+4.86681610107422] 360 271.411
24629.7 0.808152 << 0.00001 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1)
[+4.86681610107422]^2 510166  1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
337 231.512
4543.29 0.802904 << 0.00001 LUMOsurface_tension_in_water 15.9342  surface_tension_in_water*1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
337 248.001
187.422 0.801363 << 0.00001 LUMOsurface_tension_in_water 16.7771  1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
337
230.23 4639.37 0.800536 << 0.00001 HOMOsurface_tension_in_water 15.5848  1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*ln(boiling_point)
[+55.0082995986938] 337 207.114
1023.02 0.80015 << 0.00001 LUMOsurface_tension_in_water 15.7089 ==================================================
Two variable models for Flash_Point_(C)^2
Variable df Intercept Coefficient r squared probability 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1)
[+4.86681610107422] 360 67751.9
7887220 0.756959 << 0.00001 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1)
[+4.86681610107422]^2
215334000  surface_tension_in_water 360 29844.4 2420.96 0.755017 << 0.00001 1/ln(connectivity_1) [+4.86681610107422] 180281  surface_tension_in_water 360 28059.2 2571.22 0.751007 << 0.00001 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)1/ln(connectivity_1)
[+4.86681610107422] 195313  boiling_point 337
144783 29.222 0.749497 << 0.00001 LUMOsurface_tension_in_water 5805.75  boiling_point 337
146922 30.2855 0.74891 << 0.00001 HOMOsurface_tension_in_water 5735.66  
Figure 14.
Printout from the Regress all combinations routine (options selected:
2 independent variables, with transformations and crossproducts). See discussion. 
The Regress all
combinations routine gives us needs to be used with a degree of common
sense. It is only the first step in the
analysis of the data. In this real world
example, none of the one variable models is particularly satisfying. The best one variable model is:
Flash point (C)
= 215.768 + 3.97404*Critical_pressure_(bar)
N=360, r squared = 0.56163, p << 0.00001
Although from the probability (<<0.00001) we can
deduce that their is a significant relationship between flash point and criticial pressure, it does not make good chemical sense
that critical pressure is the main underlying cause in the variance in flash
point. It is more likely that they have,
in part, the same underlying cause, and are therefore intercorrelated, but not
cause and effect. The same can be said of
all the other one variable models. The
other statistics, N is the degrees of freedom (number of data points  number
of independent variables), and r squared is the proportion of the variance
accounted for by the model (i.e. in this example approximately 56%, while 44%
is still unaccounted for and in the model error term). The p <<0.00001 term indicates that
there is less than an 0.001% chance that the relationship is strictly due to
chance.
The best transformed model is much more satisfactory
from a chemical standpoint.
Flash point (C)
= 261.011 + 7787.1*1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
N = 360, r
squared = 0.744842, p << 0.00001
This model also accounts for more of the variance in
Flash point (74.5%).
The next section has models that are combinations of
two variables (multiply them, divide them, subtract them). These models are more complicated then the
transformed enthalpy of vaporization model and do not explain any more of the
variance. Because of this, we will
tentatively choose the 1/enthalpy of vaporization model as better than these
models. The best of these models is:
Flash point (C) = 236.507 
14276.4*(1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_index_1
+ 4.87) )
N = 360, r squared =
0.743771, p << 0.00001
The best two variable model is a parabolic version of
the crossproduct model above:
Flash point (C) = 271.411

24629.7*(1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1+4.87))
+
510166*(1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1+4.87))^2
N = 360 , r squared = 0.808152, p << 0.00001
Since this model is a fairly significant improvement,
it will come under consideration as we develop the model. It will be compared with the simpler one
variable transformed model (1/enthalpy of vaporization) as we refine the
model. Connectivity index 1 is a measure
of molecular size. So this model implies
a parabolic relationship between the inverse of the enthalpy of vaporization
multiplied by the inverse of the log of molecular size. In other words the lower the enthalpy of
vaporization and the lower the molecular size, the higher the flash point.
The final section of results are for models of the
square of the flash point, which was the best dependent variable transform
found. However none of these models
explain as much of the variance as the two variable model for the untransformed
values of flash point, so we can discard them.
At this point, we will graph the data (next section)
and further refine the models (Analyze Data, part 2).
Graph Data
Again, if you want to follow along and do what is
described open the industrial.mdb database that comes with the program. Go to
the database window. Go to the Graph
menu and select XY(2D) plots/Simple. For the y (response) axis select Flash_Point_C from the list that appears and for the x axis
select enthalpy_of_vaporization_at_the_boiling_point_kj/mole.

Figure 15. The "Simple" xy plot. Its one
advantage over the advanced plot is that you can click on a data point with
the mouse and the data point is identified (in this example the arrow points
to PPG24buteth27). This ability
make it easy to identify outliers not described by the model and wrongly
entered data points. 
You can clearly see in figure 15 that the flash point
data is poorly modeled by a straight line regression to enthalpy of
vaporization. However, we know from the
analysis in the previous section that there is a much better relationship between
flash point and the inverse of enthalpy of vaporization. We can see this relationship using the
advanced graphing feature.
Go to the Graph menu.
Select XY (2D) plots/Advanced.
For the y axis select Flash_Point_C from the
list that appears and for the x axis select enthalpy_of_vaporization_at_the_boiling_point_kj/mole. Answer the question about wanting a second y
axis "No". The options panel
appears as in figure 16. Fill in the
options as shown.

Figure 16.
The options panel for the "advanced" 2D XY plot. We are going to color the plot by a third
variable (connectivity index 1). The
minimum and maximum x and y values have been rounded for better looking graph
labels. We also are requesting that
the x axis undergo the 1/x transform. 
Hit the done button in the options panel. The list of fields again appears so you can
select the field to color the data points by.
Select connectivity_1 from the list.
The plot in figure 17 appears.


Figure 17.
Example of the XY (2D) "advanced" plot. Flash point is fairly well correlated with 1/enthalpy
of vaporization (the curve drawn through the data points is that made by the
linear least squares regression model).
Data points are colored by a third variable (connectivity index 1). It appears that enthalpy of vaporization
and connectivity index 1 are probably also intercorrelated as the colors of
the data points are clustered. The
model flash point = 1/enthalpy of vaporization accounts for about 75% of the
variance for 360 values of flash point of solvents and surfactants found in
the industrial.mdb database. Most of
the values with lower connectivity indices (dark blue) are solvents and are
found toward the left side of the graph.
Their behavior is different from surfactants. The Joback and
Reid enthalpy of vaporization calculation was built with solvents
wellrepresented and surfactants not represented. A question to ask about this model is that
is the inverse transform really needed or is it reflecting inaccuracy in the
calculation of enthalpy of vaporization.
Note that a very straightforward linear relationship appears to exist
between flash point and enthalpy of vaporization for solvents. These sorts of insights can only be found
by plotting the data. 

From figure 17 we note that The PPG24buteth27 point
previously noted in figure 15 is an outlier.
There is quite a bit of "slop" near the inflection point of
the curve as well. Most of the data
points with a low enthalpy of vaporization are tightly clustered in a linear
relationship with flash point. A question
that should be asked is if there should not be two separate models. One for small molecules with low connectivity
indices (dark blue data points) and a second model for the rest.
Next let's investigate the two variable models using
the standard 3D chart. With the
solubility database open and in the Database Window go to the Graph Menu and
from the 3D plot menu choose 3D Graph.
When the program asks for the y (response) parameter choose flash point. For the first x (causal parameter) choose enthalpy_of_vaporization_at_the_boiling_point_(kg/mol) and
for the second x parameter select connectivity_1. Next comes the options window. Under y (response parameter) leave all the
transformation boxes unchecked. Under
x1, check 1/x1. Also check the "Use
the square of the causals" box.

Figure 18.
The 3dimensional plot options box. 
Running this routines with the settings in figure 18
generates the analysis of variance chart in figure 19 and the 3D plot in
figure 20 as well as a plot of the model predicted vs. actural
values and model predicted vs. residuals (not shown).
Analysis of variance 
Variation source df SS MS Statistics 
Total (uncorrected) 361
6978060.7389 F=304.47018
Mean 1 3592866.3781 rsquare=0.8109
Total (corrected) 360 3385194.3608 s=42.46382 Regression 5
2745066.8301 549013.36602 Residual 355 640127.53069 1803.17614 
Note: probability of
significant F =<0.0001 Model coefficients and
standard errors: parameter coefficient standard error t
prob 
intercept = 340.345 12.8173 26.5534 0 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)
15998.7 999.898 16.0004 6.5861E44 connectivity_1 0.7123 0.17697 4.02498 0.0000694934 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*connectivity_1
42.1873 5.69143 7.41244 8.93892E13 1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)^2
175578 19755.1 8.88776 3.0396E17 connectivity_1^2 0.00115334 0.000391426 2.94649
0.00342328 
note: response variable: Flash_Point_(C) 
Figure 19. Example
of the Analysis of variance table and model printed out by the 3dimensional
plotting routines. 
In this analysis we are comparing a parabolic model
with the two variable model found in part one.
Here are the two models side by side:
Two variable model from part 1.
Flash point (C) = 271.411

24629.7*(1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1+4.87))
+
510166*(1/Enthalpy_of_vaporization_at_the_boiling_point_(kJ/mole)*1/ln(connectivity_1+4.87))^2
N = 360 , r
squared = 0.808152, p << 0.00001
Parabolic model (figure 19):
Flash point (C) = 340.345 +
15998.7*(1/enthalpy of vaporization) +
0.7123*connectivity index 1
+42.1873*((1/enthalpy of vaporization) * connectivity
index 1) +
+175578*(square of 1/enthalpy of vaporization) +
+0.001155334*(square of conntectivity
index 1)
N=360, r
squared = 0.81, model and all variables highly significant
Which model is to be preferred? The first model contains fewer terms with
about the same r squared value and is probably the one to use. However, as we develop the model, it will be
good to keep in mind that we may be able to substitute in the parabolic model
if the log(connectivity index 1) cross product begins to look suspect. Below
(figure 20) is the response surface for the parabolic model.


Figure 20.
Response surface generated by the 3D chart routine for the flash
point model discussed above. It is
evident from the plot that flash point rises with increasing connectivity
index 1 and decreasing 1/enthalpy of vaporization. The actual model used to construct this surface
is detailed in figure 19. 

An alternate way of displaying this same data is the
contour plot. To generate a contour plot
expressing this same flash point model of the data in the industrial.mdb go to
the Graph menu. Choose Contour plot from the 3D Plot menu. When the program asks for
the y (response) parameter choose flash
point. For the first x (causal
parameter) choose enthalpy_of_vaporization_at_the_boiling_point_(kg/mol) and
for the second x parameter select connectivity_1. Next comes the options window. Under y (response parameter) leave all the
transformation boxes unchecked. Under
x1, check 1/x1. Also check the "Use
the square of the causals" box. The options panel will again look like
figure 18 except that we will accept the default graph title. This generates the same statistics (figure
19) and will generate the plot in figure 21 instead of figure 20. If you will compare figures 20 and 21 you
will note that they really contain exactly the same information.

Figure 21.
The flash point model depicted as a contour plot. The same information is contained in figure
20. 
The last type of plot is a rotating 3D plot. Go to the Graph menu. From the 3D
plot menu item select Rotatable 3D
Plot. From the lists that appear select the same 3 fields as in the other
examples (enthalpy of vaporization at the boiling point (kg/mole) flash point
(C) and connectivity 1). The plot in
figure 21 appears.

Figure 22.
Investigating the relationship between flash point, enthalpy of
vaporization and connectivity index 1 using the rotating 3D plot. This plot can be rotated about and the data
points colored by values of a field to look for relationships in 3 and 4
dimensions. 
Use the rotate menu to rotate the plot. You can freeze the rotation at any point by
holding the mouse button down. Look for
linear relationships, clusters of points and other features that when explained
will give insight into the relationships between the variables.
The analysis of the flash point model continues in the
next two sections.
In this section we are going to critically examine the
models generated in part 1 with the brute force Regress all variables routine.
After doing this, we will briefly discuss what the next steps are beyond
the one and two variable models we found in part one (i.e. looking at more
complicated models). Some statistical
training will be helpful in this section.
Creation of transformed fields:
Open the industrial.mdb database that comes with the
program. Go to the database window. The first thing we are going to do is create some
new fields. Have patience here as some
of these steps will take the computer a few minutes to complete. The first new field we will create is
1/enthalpy of vaporizatiion.
Go to the database Window. Go to the Analyze menu. Select Transform
a Field. The options panel in figure 23 will appear. Select Enthalpy of vaporization... from the
list at the left and 1/(enthalpy of
vaporization... as shown.

Figure 23.
The field transform options panel. 
When the program asks which field to store the new
field end, scroll to the bottom of the list and select "New
Field." Wait. Save the database (File, Save) to the Microsoft
Access format using the file name industrial2.mdb.
Next we create the more complicated transformed field
used in the two variable model:
(1/enthalpy of vaporization)*(1/(ln(connectivity index
1 + 4.866816017422))
Creating this field is a multistep process using the
Analyze menu/Transform a field routine:
1. Create a field that adds 4.866816017422 to
connectivity_1
2. Create a field containing the log of the field
created in step 1
3. Create the inverse (1/x) of the field created in
step 2.
4. Multiply the field created in step 3 with
1/enthalpy of vaporization created in the first example (figure 23).
5. Save to industrial2.mdb.
Multiple Regression Analysis with
PRESS:
One variable model:
Let's look at the best one variable model from Part 1
in more detail (flash pt. = b0 + b1*1/enthalpy of vaporization). Go to the Analyze menu. Select Multiple
Regression Analysis. When the
program asks for the y (response) variable choose Flash Point (C) off the list
and check the box labeled Conduct PRESS
Analysis.


Figure 24. Select
y (dependent variable) from the list and check the PRESS analysis box and hit
Done. 
Figure 25.
Select x (independent variable) from the list and hit Done. 
Next select 1/enthalpy of vaporization from the causal
parameter list. The following analysis
will appear. Explanations and analysis
appear in the figure legends.
Analysis of variance 
Variation source df SS MS Statistics 
Total (uncorrected) 360 6818859.7389 F=1066.68276 Mean 1 3523457.24432 rsquare=0.74872
Total (corrected) 359 3295402.49458 s=48.09447 Regression 1
2467320.53578 2467320.53578 Residual 358 828081.9588 2313.0781 
Note: probability of significant F
=<0.0001 Model coefficients and standard errors: parameter coefficient standard error t
prob 
intercept = 259.139 5.52153 46.9325 <<0.00001 1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
7722.05 236.437 32.6601 <<0.00001 
note: response variable: Flash_Point__C_ Printout of response values, predicted
values and residuals: observed predicted residual acetal 21 27.8085 48.8085 acetaldehyde 40 29.363 10.637 acetic acid 40 51.613 11.613 acetic anhydride 54 62.7094 8.7094 acetol 56 90.1554 34.1554 acetone 17 6.97324 10.0268 acetone cyanohydrin 63 105.799 42.7992 ...(and so on)... 
Figure 26.
Analysis of variance table, model coefficients and partial printout
of the table of response, predicted and residual values for the flash point
one variable model. Analysis of variance table: Abbreviations used: df =
degrees of freedom; SS = sum of squares; MS = mean squared; F= Fischer's F
test; r squared = proportion of variance accounted for by the model (e.g. in this
example about 75%); S = model standard deviation (about 95% of the data lies
within 2 standard deviations  thus plus or minus about 96 degrees C); Coefficients table: The model is
 : flash point = 259.1  7722.05*(1/enthalpy of vaporization) Both the intercept and the regression coefficient
are highly statistically significant (prob
<<0.00001) Printout of predicted and residual values: From this table find the largest outliers (residuals) and determine if they have something
obviously in common that will lead to a better model. For instance, if all solvents are
wellaccounted for, but surfactants are poorly predicted, consider developing
separate models for solvents and surfactants. 
Contributions to PRESS (Predictive Residual
Sum of Squares):
Compound Predictive
discrepancy   acetal 2405.442 acetaldehyde 115.3149 acetic acid 135.8609 acetic anhydride 76.35809 acetol 1173.177 acetone 102.0248 acetone cyanohydrin 1842.063 acetonitrile 1.354055 ...(and so on)... Total PRESS = 843833.3 Sum of squares of response (SSY) = 8277222 Press/SSY = 0.1019464 The model appears to be reasonable and has
passed the crossvalidation test. 
Figure 27.
PRESS output for the onevariable flash point model. The larger the predictive descrepancy, the more influential the data point. Leaving out an influential data point can
greatly effect the model. You may want
to leave out some of the more influential point and rerun the analysis. Also, as with the table of residuals
(figure 26) you may want to look for obvious patterns in the types of
chemistries that are influential. At the bottom of the table is an assessment of
whether the model meets the Press/SSY <= 0.4 cutoff. Failure would mean that just a few data
points are contributing to most of what is accounted for by the model and you
may want to get rid of some of these data points and redo the whole analysis.


Figure 28.
Table of observed versus predicted values of flash point in the one
variable model. Outliers (data points
farthest removed from the regression line) can be identified by clicking on
them on the graph with the mouse.
Make sure that outlying data points are not typos (check for errors). Consider whether the calculated field
(1/enthalpy of vaporization) is likely to calculate the outlying data points
well and if not exclude them from the analysis and tell the users of the
model that it is not reliable for those types of molecules. 

Figure 29.
Plot of predicted values of flash point against the residual values
(residuals = difference between predicted and observed values). Optimally this plat is a scatter plot with
completely random arrangement of the data points. Curvature or patterns may indicate problems
like intercorellation of the dependent and
independent variables or the need for transformation of one of the fields or
addition of a parabolic or crossproduct term. Examination of the residuals is a critical
part of model validation. Compare with figure 32 and note the further
discussion there. 
A usual followup here would be to look at the second
and third best one variable models (or some other model which makes good
intuitive chemical sense). Particularly
we would want to be aware if these other models have one or two very large
outlying data points which if removed would give a model superior to the
1/enthalpy of vaporization model. If
they did it is likely that we would prefer one of these other models,
particularly if on examination the outlying data points have a logical reason
for exclusion.
Two variable model:
Analysis of variance 
Variation source df SS MS Statistics 
Total (uncorrected) 360 6818859.7389 F=770.74194 Mean 1 3523457.24432 rsquare=0.81196
Total (corrected) 359 3295402.49458 s=41.66302 Regression 2
2675719.231 1337859.6155 Residual 357 619683.26358 1735.80746 
Note: probability of significant F
=<0.0001 Model coefficients and standard errors: parameter coefficient standard error t
prob 
intercept = 269.413 5.25699 51.2485 <<0.00001 1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole___plus
24378.4 998.955 24.4039 <<0.00001 Square_1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole
503119 45561.2 11.0427 1.326E24 
note: response variable: Flash_Point__C_ Total PRESS = 631855.3 Sum of squares of response (SSY) = 8277222 Press/SSY = 7.633664E02 The model is excellent and passed the
crossvalidation test easily. 
Figure 30.
Two variable model of Flash point.
See figure 26 for an explanation of abbreviations used in the analysis
of variance table. The two variable model explains about 6% more of the
variance in the table compared to the one variable model in figure 26 (e.g. r
squared = 0.81 vs. r squared = 0.76) and has a significantly lower model
standard deviation (S = 41.66). For a
data set of this size (df = 359) this is a
significant and all things being equal, this model is preferred over the one
variable model. The PRESS/SSY
crossvalidation value has improved substantially as well. In the coefficients table check to see that
all coefficients are statistically significant (they are). 

Figure 31.
Plot of observed versus predicted values for the two variable model
for flash point. About 90% of the data
points lie quite close to the regression line. However the other 10% are poorly accounted
for. There are several possible explanations. (1) wrong selection of independent
variables; (2) need more terms in the model; (3) poor accuracy in flash point
determination; (4) poor accuracy in enthalpy of vaporization calculation; (5)
some molecule types fit in the model and some don't (examine the data points
for this). 

Figure 32.
Plot of predicted versus residual values for the two variable flash
point model. Data points on the far left or right of the graph are outliers. The well accounted for points are a scatter plot,
but the outliers form a pattern going from the
upper left to the lower right of the screen (i.e. as the predicted value
rises, the residuals become more negative).
This indicates that there is likely another term that needs to be
added to the model or that the field transformation we have selected is not
the right one. 
What next?
The analysis presented here is only the first
step. The next step is to build on what we
have found by going beyond 2 variable models.
We have plenty of degrees of freedom left (360) and with such a large
data set could comfortably add many
variables. A rule of thuimb
is that there should be at least 8 molecules per variable, although it depends
on what purpose the data is being used whether you can accept more or fewer
variables.
The process is to include the variables found in the
one and two variable models above in all the more complicated models we
build. For instance, we can look for 2
and 3 variable models of flash point in which one of the variables is always
1/enthalpy of vaporization. The analysis
is done exactly as in Part 1 with the Regress
all combinations routine, except when the program asks for included
variables where previously we checked nothing, we now check 1/enthalpy of
vaporization (results, figure 33).
Molecular Modeling Pro Plus 2004, Norgwyn Montgomery Software Inc. Variables included in every model are:
1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__ One variable models for Flash_Point__C_
Variable df Intercept Coefficient r squared probability 1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
359 0.0249783
6095.31 0.784338 << 0.00001 surface_tension_in_water 8.41897  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
359 221.921
7059.8 0.778774 << 0.00001 percent_hydrophilic_surface 0.575727  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
359 278.799
9964.1 0.771038 << 0.00001 Square_1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole
223906  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
359 260.272
7814.9 0.769263 << 0.00001 Hydrogen_bond_number 9.20856  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
359 207.119
7419.51 0.768355 << 0.00001 surface_tension 1.39539  ==================================================
Transformed variable models for Flash_Point__C_ Variable df Intercept Coefficient r squared probability x transform:Flash_Point__C_
1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 248.543
7919.01 0.831664 << 0.00001 sqrt(Log_P) 6.12759  x transform:sqrt(Flash_Point__C_) 1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
264 18.1367
437.486 0.805329 << 0.00001 sqrt(Log_P) 0.16568  x transform:Flash_Point__C_
1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
317 220.523
7183.72 0.8041 << 0.00001 sqrt(CNDO_LogP) 13.1627  x transform:Flash_Point__C_
1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
329 304.009
8328.6 0.792738 << 0.00001 sqrt(LUMO) 73.1153  x transform:Flash_Point__C_
1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
359 367.166
5176.06 0.790236 << 0.00001 1/ln(hydrophilic_surface_area)
[+10.3280102539062] 456.501  ==================================================
Cross products, divisions and subtractions
of 2 variables for Flash_Point__C_ Variable df Intercept Coefficient r squared probability 1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 168.192
7165.99 0.845318 << 0.00001 sqrt(Log_P)surface_tension_in_water^2
0.104649  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 210.892
6852.07 0.841105 << 0.00001 sqrt(Log_P)*surface_tension_in_water^2 0.0183651  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 153.8
7871.62 0.839776 << 0.00001 surface_tension_in_watersqrt(Log_P) 4.21706  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
292 274.532
8036.35 0.838247 << 0.00001 sqrt(Log_P)*1/connectivity_3 18.7291  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 258.244
8332.88 0.838074 << 0.00001 water_content_70%_RH__moles_sqrt(Log_P)
9.86567  ==================================================
Two variable models for Flash_Point__C_
Variable df Intercept Coefficient r squared probability 1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 37.8682
6105.12 0.856495 << 0.00001 water_solubility 4.28045 surface_tension_in_watersqrt(Log_P) 7.20953  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 43.2173
5512.44 0.856362 << 0.00001 Log_P 4.47645 surface_tension_in_watersqrt(Log_P) 9.30119  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 127.598
6009.68 0.854661
<< 0.00001 water_solubility 2.88661 sqrt(Log_P)surface_tension_in_water^2
0.123239  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 17.2268
4808.2 0.854414 << 0.00001 ln(1/_connectivity_1_) [+0.01] 34.3467 surface_tension_in_watersqrt(Log_P) 4.67267  1/_Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole__
303 208.448
7299.08 0.854263 << 0.00001 percent_hydrophilic_surface 0.632755 water_content_70%_RH__moles_sqrt(Log_P)
16.5484 
Figure 33.
Printout for Regress with All
Combinations routine for the next iteration of the flash point model with
1/enthalpy of vaporization included in all models. The next step would be to graph the best
models and run them through the detailed multiple regression analysis with
PRESS. Then include another variable
and reiterate and so on until it is impossible to add another variable with
statistical significance (p for every variable must be <0.05). The two variable model containing 1/enthalpy of
vaporization and square root of log P (r square = 0.83) looks like a possible
improvement over the two variable model previously investigated. 
Alternatively instead of the Regress with All Combinations tool we could use the Stepwise routine to quickly look for
models up to 10 or even 20 variables (the Regress al Combinations tool is much
more thorough, but takes several iterations of model checking and building the
Included variable list  i.e. it will take some time to build a large
model).
Always ask the question does the model make good
chemical sense. Be particularly
suspicious when a new independent variable is added that is highly correlated
to another independent variable, as it is likely that is explaining one or two
inaccurately measured data points.
Examine the plot of residuals to make sure they look random. If not you may have the wrong variables in
the model or need a transformation of the dependent variable or one or more
independent variables. As you add
variables and the model grows in complexity it may be harder to judge by
intuition whether the model makes sense.
Modeling a subset of the data
The flash point data for low molecular weight
molecules (and low connectivity index 1)
when graphed appeared better correlated to enthalpy of vaporization than
the whole data set. You can do a separate
analysis for just this subset. To subset
the data, go to the Edit menu and select Subset
by text or numeric value. Choose to
subset by connectivity_1 and choose the maximum value of 4.5 to subset by. Rerun the analysis. What you will find is that you will not have
to transform the enthalpy of vaporization values, that the model r squares are
much improved, but that the model standard deviation is significantly lower 
so the models obtained will be more accurate than the ones for the whole data
set when predicting flash point values of solvent sized molecules. You can restore the original data set by
going to the Edit menu and selecting Restore
subsetted data.
Restricting the size further by subsetting
the data with maximum molecular weight of 100 and allowing values from the
literature in the calculations gives these "best" models when Regressed with All Combinations:
Flash point
(C) = 53.3981 + 0.68443*Literature_Boiling_Point_C
N=227, r squared = 0.850624, p<< 0.00001
And
Flash point
(C) = 52.4996 12.0857*CIM_7 +
0.760654* Literature_Boiling_Point_C
N = 209, r squared =
0.913709, p<< 0.00001
Analyze
Data Manually, Part 3  PLS
In parts 1 and 2 we looked at linear least squares
regression techniques in generating a model for flash point. In this section we will look at use of
partial least squares regression to model the same flash point data as in parts
1 and 2.
Partial least squares (PLS) analysis is an alternative
to ordinary multiple regression analysis that has advantages if:
a) your causal (x, independent) variables are
intercorrelated. For instance here is
the correlation matrix for one of the models presented in Part 2 (correlation
matrix obtained from Principle Components Regression routine).
Correlation matrix for regression variables
Flash_Point 1/_Enthalpy sqrt(log P)
Flash_Point__C_ 1. 0.90519 0.73105
1/_Enthalpy_of_vaporization 0.90519 1. 0.78007
square_root(Log_P) 0.73105 0.78007 1.
As you can see, enthalpy of vaporization and
the square root of log P are highly intercorrelated.
b) you have a high ratio of causal (x, independent) variables
to observations (multiple regression works best if there are no less than 4
observations per causal (independent) variable).
Multiple regression may still be used in the above
cases, but answers may be “biased”. The
PLS technique, popularized in the late 1980s, attempts to address these
problems.
Application to the flash point model:
Like the other routines, this analysis uses the
industrial.mdb database that comes with the program. Open this database and go to the Analysis
menu of the database window. Select
Partial Least Squares Regression from the menu.
Select flash point from the list of fields and restrict the number of
principle components to 25. After quite
a while (these calculations take time) you will be presented with some preliminary
cross validation results, and the program will ask you to choose the number of
principle components to use.

Figure 34. It
is good for models have the crossvalidation ration (PRESS/sum of squares of
Y) of <0.5. In fact none of the
models here meet that criterion. The
"sweet spots" (minimized crossvalidation ratios) are for the 1
component and 4 component models. We
will examine the 4 component models more thoroughly below. 

Figure 35.
Normally we would not be interested in any of the higher principle
component models shown here because of the very high crossvalidation
ratios. In fact each additional
component is explaining the variance for one or fewer data points in these
models. However, we will ignore this
for purposes of demonstration and look at the 18 component model. 

Figure 36.
After you select the number of components the program will more
quickly calculate the final model.
Then it will offer the user a chance to look at the models graphically. You should always look at the plots of
observed vs. predicted values and observed or predicted values versus
residuals. 

Figure 37.
The plot of the four component models of flash point (observed versus
predicted). There is one obvious outlier that should be examined and explained or removed. 

Figure 38.
The 18 component model of flash point.
There are three data points clumped that may be outliers. If they are
similar types of molecules that occur no where else in the data set, they can
be removed as long as in the instructions to those using the model a warning
is given that the model does not apply to this type of molecule. 
PLS component 1 : X variance explained, this component: 0.558255525962759 accumulated:
0.558255525962759 Y variance explained, this component: 0.632356946005208 accumulated:
0.632356946005208 PLS component 2 : X variance explained, this component: 0.210855353598904 accumulated:
0.769110879561663 Y variance explained, this component: 5.97700779226291E02 accumulated:
0.692127023927837 PLS component 3 : X variance explained, this component: 0.104635496368257 accumulated:
0.87374637592992 Y variance explained, this component: 4.26706340999879E02 accumulated:
0.734797658027825 PLS component 4 : X variance explained, this component: 2.86296954681797E02 accumulated:
0.9023760713981 Y variance explained, this component: 7.50660083082563E02 accumulated:
0.809863666336081  PLS Model Regression Coefficients:  Y = Flash_Point__C_ 
Intercept: 137.9093  molecular_weight:
0.0275296  molecular_volume: 8.78747E04  molecular_length: 1.13613E03  molecular_width:
1.87833E03  molecular_depth: 1.01165E03  surface_area: 1.458387E03  Log_P:
2.320458E03  HLB: 6.647635E03  Hansen_polarity:
3.332011E03  Hansen_hydrogen_bonding:
1.253919E02  H_bond_acceptor:
2.942572E04  H_bond_donor:
3.558267E04  dipole_moment: 4.520623E04  percent_hydrophilic_surface: 3.796799E02  mean_water_of_hydration:
8.610905E04  boiling_point:
0.1456266 
connectivity_0: 5.591225E03 
connectivity_1: 1.293165E03 
connectivity_2: 1.053856E03 
connectivity_3: 1.748378E03 
valence_0: 7.125869E03 
valence_1: 3.463916E04 
valence_2: 4.808458E04 
valence_3: 2.566669E06  kappa_2:
6.578427E04  water_solubility:
2.305892E03 
connectivity_index_4: 1.428772E03 
valence_index_4: 4.468389E05  dipole_moment_from_CNDO: 1.795975E02  H_bond_acceptor_from_CNDO: 7.366852E05  H_bond_donor_from_CNDO:
1.266422E04 
HOMO: 1.2663E05  LUMO:
8.185258E05  CNDO_LogP:
1.281364E04  LogP__Crippen: 1.456506E03  surface_tension:
1.707584E02  hydrophilic_surface_area:
3.162643E03  surface_tension_in_water:
1.941342E03  Critical_Temperature__K_:
0.1099477  Critical_pressure__bar_:
1.194533E02  Critical_volume__cm^3/mole_:
3.065991E02  Normal_boiling_point__K_:
0.1845125  Normal_freezing_point__K_: 0.157135 
Enthalpy_of_formation_ideal_gas_at_298_K__kJ/mole_: 4.978785E02 
Gibbs_energy_of_formation_ideal_gas_unit_fugacity_at_298_K__kJ: 7.055692E02
 Enthalpy_of_vaporization_at_the_boiling_point__kJ/mole_: 2.585548E02  Enthalpy_of_fusion__kJ/mole_: 2.627288E03 
Liquid_viscosity__N_s/m^2_:
1.769484E03 
Heat_capacity_ideal_gas__J_mole_K_: 5.038143E02  Effective_number_of_torsional_bonds: 2.295547E03  Hydrogen_bond_number:
6.985848E06  entropy_of_boiling__J/K_mole_: 9.00867E03  Heat_capacity_change_at_boiling__J/K_mole_: 4.82065E03 
CIM_1: 1.034556E04 
CIM_2: 2.440436E04 
CIM_3: 1.272485E04 
CIM_4: 3.572252E05 
CIM_5: 1.473604E04 
CIM_6: 1.433094E04 
CIM_7: 2.177646E04 
CIM_8: 2.6544E04 
1/_Enthalpy_of_vaporization_at_the_boiling_point_kJ/mole__: 1.769506E05  square_root(Log_P): 1.338915E04 
Figure 39.
Part of the printout from the partial least squared routine. The "corrected" r squared values
and the model coefficients are listed in this part of the printout. Examination of the model coefficients (multiply
by the average size of the values) can lead to elimination of some of the
fields from subsequent analyses (eliminate the fields that contribute little
to the predicted flash point value). 
ClientServer
Note: ClientServer features are only available on the
Enterprise version of Molecular Modeling Plus.
The clientsever version of the program operates
exactly as the desktop version with the following exceptions.
First, security is set by the system administrator for
all users. The user levels are 
(1) no permission to view data
(2) read only
(3) write access restricted to the data_table
with certain columns offlimits. Can add
molecules and reactions, but cannot delete them.
(4) System administrator level: complete access to all
fields and tables including molecule and reaction deletion. Cannot add or
delete new fields (must be done by system administrator using SQL inside SQL
Server or Oracle).
Second, changes made to the database are instantly
applied (you do not need to Save the database from a menu).
As part of the installation package with the
clientserver version, a halfday seminar will be presented to the new users of
the program on how to use the program.
Substructure and Similarity Searches
Molecular Modeling Pro allows several kinds of
structural searching routines in the database window.
1. Load a database.
2. Draw a substructure in the Drawing window (e.g. a
substructure is a part of a molecule)
3. In the Edit Menu of the Database window select
substructure search. If you have not run
this before, the program will create a new database of substructural
keys for every molecule in the database (i.e. the number of carbons, the number
of double bonds, the number of amide groups etc.). Note that for very large databases, this will
take some time. Incidentally, you can
load and use this database in Molecular Modeling Pro Plus (load an ssf file from the File menu). The program will identify all the structures
in the database which contain the substructure drawn in the drawing
window. A list of matching structures
will be displayed. Click on the names to
4. Alternatively you can find an exact match of the
molecule drawn in the Drawing window by selecting “Exact Match for Structure”
from the Edit menu of the database window.
After the search is complete, the database record that matches will be
highlighted in the spreadsheet.
The program also looks for the most similar molecules
in a database to the molecule drawn in the drawing window. This is done using the Compare menu in the
database window.
1. Load a database
2. Draw a molecule in the drawing window (or get a
structure from the database).
3. There are then the following options for doing a
similarity search
a. Find the molecules which are closest in all the
calculated properties in the database
b. Find the molecules which are closest in the 3D
solubility parameters (dispersion, polarity and hydrogen bonding). This might be useful for solvent or
surfactant replacement in a formulation.
This routine is not affected by molecular conformation.
c. Find the molecules which are closest in the Rule of
Five bioavailability parameters (log P, molecular weight, hydrogen bond
acceptor and donor, polar surface area).
This might be useful for determining which molecules in a database of
analogs might have the closest property to a known drug for uptake. This
routine is not affected by molecular conformation.
d. Find the molecules which most closely match in 3D
structure – looks at size, shape and distribution of hydrogen bonding atoms. This search might be used as a screening tool
to see if any molecules in a large database should be tested for biological or
other activity similar to the molecule drawn in the Drawing Window. This may require the creation of a large
database of 3D properties as a preliminary step, so can take some time. This routine is affected by 3D
conformation. Similar molecules that are
stored with different shapes and dimensions due to being in different
conformations will not be found to be similar.
Molecular orientation, on the other hand, does not affect the outcome of
this search.
e. Find the molecules which have the most substructural
features in common. This should find the
closest analogs of the molecule drawn in the Drawing Window. This search is independent of conformation
and uses the substructural keys database.
If this database is not set up, you must do so by running the
Substructure Search routine from the Edit menu (can take some time for a very
large database).
f.
The final option,
“Find Matching Combinations of Molecules” is the most flexible structural
search done by MMP+. You will be asked
to select the properties to search under (MMP+ must be able to calculate these
fields). You will also be given the
opportunity to exclude molecules in the database from the search. This routine will not only print out the most
similar molecules, but will also look at combinations of two molecules. The combinations are calculated at 10%
intervals by simply multiplying the fractions times the property values and
adding them together. This routine may
have some utility in formulation research where the investigator may be
interested in replacing a formulation ingredient with a mixture of ingredients. The more similar the ingredients in the
mixture are, the more likely the result is likely to work for this use.
Other
programs
Molecular Modeling Plus has
builtin ties to several other programs.
1. Support of MDL molfiles over the clipboard. You can move molfiles
from one program to another over the clipboard between Molecular Modeling Pro
Plus and Chemsite, Accord, Chemical Inventory System
and other programs which support this file transfer in the public part of the
clipboard.
2. There are also builtin
access through the Tools, Geometry and Display menus to Chemsite. There are tieins to Microsoft Excel in the
File menu of the database window.
3. The companion database, ChemicaElectrica, was created
using Molecular Modeling Pro Plus and can be viewed, editing and added to using
MMP+. This database contains over 5500
pharmaceutical and industrial chemicals with information on use, trade names,
and over 110 fields of chemical properties.