A Comparative Analysis of the OCAT Method with Some Other Methods


Introduction

As we explain in our papers and in the description section of our data mining homepage, the OCAT approach forms two sets of rules: the positive and the negative rules. As a result, classifying a new (i.e., previously unclassified) observation leads to one of three outcomes: the classification is correct, incorrect, or an undecided case.  To provide a common ground for comparing our results to those of logistic regression, linear discriminant analysis and neural networks, we decided to fix the number of undecided cases (in the testing data) to that obtained by the OCAT approach. In this way, we could find the accuracy rate of each method on the same number of actual classifications.

Since logistic regression, linear discriminant analysis and neural networks use monotone classification functions, a fixed number of undecided cases corresponds to a single one-dimensional interval.  Note that this interval may be of many different sizes as long as it captures the given number of cases.  Also, note that each function is monotone in the sense that once the function has been determined, each variable has a monotone effect on the classification.  That is, each variable will have either a non-negative or a non-positive effect (not both) on the outcome.  In linear discriminant analysis this is probably the most apparent, because the space is split in two by a hyperplane.  The sigmoid function (used in logistic regression, as well as the transfer function in the neural network described below) is also monotone in each variable.

In the logistic regression, neural network and Fisher's linear discriminant cases, these intervals are also symmetric, which further simplifies this process. Note that the sigmoid function is symmetric about its inflection point, while Fisher's linear classification rule assumes equal variance-covariance matrices, which results in symmetric probabilities of misclassification.  As a result of this symmetry, to a fixed number of undecided observations there corresponds an interval lying between two groups of points.  Therefore, the number of correctly and incorrectly classified points is also fixed.  To find the interval corresponding to a fixed number of undecided cases, it is only necessary to find one of its borders (because, by symmetry, the middle point is already given).  To find an appropriate upper border point we performed a binary search.  Note that one may also start with 0 captured cases (i.e., at the cut-off point) and expand the interval, using its symmetry property, until it captures 20 points.
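As an illustration, the following Python sketch finds such a symmetric interval directly: instead of a binary search, it sorts the distances of the scores from the cut-off point, which has the same effect. The function and variable names are illustrative only; it assumes that the one-dimensional scores of the testing points (projections, probabilities, or network outputs) and the cut-off value are already available.

    import numpy as np

    def symmetric_undecided_interval(scores, cutoff, n_undecided):
        # Expand a symmetric interval about 'cutoff' until it captures
        # exactly 'n_undecided' of the given one-dimensional scores.
        dist = np.abs(np.asarray(scores, dtype=float) - cutoff)
        d = np.sort(dist)
        # Any half-width between the n-th and (n+1)-th smallest distances
        # captures exactly n points; take their midpoint.
        half_width = 0.5 * (d[n_undecided - 1] + d[n_undecided])
        return cutoff - half_width, cutoff + half_width

    # Hypothetical usage with the testing scores and a training cut-off value:
    # lower, upper = symmetric_undecided_interval(test_scores, cutoff, 20)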
 

The OCAT Approach

The OCAT rules classified all of the 192 training points correctly (by definition), while for the 66 testing observations the results were:

        Number of correct classifications         = 41
        Number of incorrect classifications       =  5
        Number of undecided cases                 = 20.

Therefore, the accuracy rate on the cases that were actually classified is 89.1% (41 correct out of 41 + 5 = 46). The number of undecided cases, 20, will be kept fixed for the remaining approaches in order to provide the same number of actual classifications.
 

Fisher's Linear Discriminant Analysis

A Quick Description:
Fisher's approach finds the linear projection (from the multivariate observations to single values) that maximizes the squared distance between the two projected group means, standardized by the pooled sample variance of the projection. A new observation is classified based on where it is projected onto this line.  If we want to place all possible observations into one of the two groups, then we can use the average of the two projected group means as the cut-off value.  However, we may also wish to describe a class of uncertain points (as with the undecided cases above) by an interval where the misclassification probability is high.  If the variance-covariance matrices of the two groups are equal, this interval is symmetric about the previous cut-off point. Note that Fisher's linear function estimates the common variance-covariance matrix, and therefore it implicitly assumes that the matrices are equal. The interested reader should consult our references (at the bottom of this page) for further insight.
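A minimal sketch of this rule, under the equal-covariance assumption, is given below. The two training groups are assumed to be available as the arrays group0 and group1 (rows are observations over the 10 variables); all names are illustrative only.

    import numpy as np

    def fisher_discriminant(group0, group1):
        # Pooled (common) variance-covariance matrix, since the rule assumes
        # the two group matrices are equal.
        m0, m1 = group0.mean(axis=0), group1.mean(axis=0)
        n0, n1 = len(group0), len(group1)
        S = ((n0 - 1) * np.cov(group0, rowvar=False) +
             (n1 - 1) * np.cov(group1, rowvar=False)) / (n0 + n1 - 2)
        w = np.linalg.solve(S, m1 - m0)           # discriminant coefficients
        cutoff = 0.5 * (w @ m0 + w @ m1)          # average of projected group means
        return w, cutoff

    def classify(x, w, cutoff, half_width=0.0):
        # Project x onto the discriminant direction; points falling within
        # 'half_width' of the cut-off are reported as undecided.
        y = x @ w
        if abs(y - cutoff) <= half_width:
            return "undecided"
        return 1 if y > cutoff else 0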

Results:
The linear discriminant function we obtained from the training data is:

Y = -0.0297X1 - 0.0121X2 + 0.0459X3 - 0.0775X4 + 0.0020X5 + 0.0459X6 + 0.0783X7 + 0.0124X8 - 0.0251X9 + 0.0228X10,

and it provided 86.0% accuracy on the training data using a cut-off value of 7.1959.

For the testing data, we found that the interval [6.6264, 7.7654] corresponds to 20 undecided cases.  That is, when fixing the number of undecided cases to 20, the linear discriminant rule will result in the following classifications on the remaining 46 testing observations:

        Number of correct classifications    = 39
        Number of incorrect classifications  =  7 .

Notice that even if we change this interval (while maintaining 20 points within it), the number of correctly and incorrectly classified points will not change, due to the monotonicity and symmetry properties discussed in the introduction. Also, observe that the interval edges are symmetrically displaced by 0.5695 about the original cut-off value of 7.1959.
 

Logistic Regression

A Quick Description:
Logistic regression models the logit of the probability of an observation belonging to class 0 (or 1) as a linear function of the variables. To obtain the maximum likelihood estimates of its parameters, a non-linear continuous optimization method has to be utilized. Once these estimates are obtained, we can easily find the classification probabilities corresponding to each observed point.  If we want to place all possible observations into one of the two groups, then using 0.5 as the cut-off value on the probabilities will minimize the misclassification probability. We may also determine the interval of uncertainty (where the misclassification probability is high) directly by choosing a maximum misclassification probability (less than 0.5).  That is, to fix the number of undecided cases in the testing set, we need to find a corresponding misclassification probability cut-off point.  The interested reader should consult our references (at the bottom of this page) for further insight.
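A minimal sketch of this rule is given below. It assumes that the fitted intercept theta and coefficient vector beta (such as the estimates listed in the results that follow) are already available; the class coding and all names are illustrative only.

    import numpy as np

    def logistic_probability(x, theta, beta):
        # Probability of one of the two classes under the fitted model:
        # p = 1 / (1 + exp(-(theta + beta'x))).
        return 1.0 / (1.0 + np.exp(-(theta + x @ beta)))

    def classify(x, theta, beta, band=0.0):
        # Classify with 0.5 as the cut-off value on the probability; points
        # whose probability falls within 'band' of 0.5 are undecided
        # (e.g., band = 0.1875 corresponds to the interval [0.3125, 0.6875]).
        p = logistic_probability(x, theta, beta)
        if abs(p - 0.5) <= band:
            return "undecided"
        return 1 if p > 0.5 else 0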

Results:
The parameter estimates we obtained from the training data are:
     Theta (intercept)    SE
        -9.2761         1.4505

     Beta (X1 to X10)     SE
         0.0375         0.0165
         0.0244         0.0429
        -0.0447         0.0492
         0.0720         0.0501
        -0.0101         0.0401
        -0.0435         0.0499
        -0.1318         0.0531
         0.0023         0.0475
         0.0266         0.0305
        -0.0223         0.0128

 
which provided 87.5% accuracy on the training data using a cut-off value of 0.5 on the probability.

For the testing data, we found that the interval [0.3125, 0.6875] corresponds to 20 undecided cases. That is, when fixing the number of undecided cases to 20, the logistic regression rule will result in the following classifications on the remaining 46 testing observations:

        Number of correct classifications    = 39
        Number of incorrect classifications  =  7 .

Even though the accuracy rate is equivalent to that of Fisher's linear discriminant, the actual classifications are not equivalent.  In fact, their classifications differ at 4 points.

Notice that even if we change this interval (while maintaining 20 points within it), the number of correctly and incorrectly classified points will not change, due to the monotonicity and symmetry properties discussed in the introduction. Also, observe that the interval edges are symmetrically displaced about 0.5.
 

A Neural Network

A Quick Description:
A Neural Net (NN) consists of processing elements (PEs) and weighted connections between them.  We used a feedforward net that consisted of an input layer, a hidden layer and an output layer.  The input layer corresponds to the 10 independent variables and a bias term.  The PE in the output layer corresponds to the binary classification.  The hidden layer is created to make the net capable of more complex classification rules; the more processing elements in the hidden layer, the more complex the rules can become.  Note that if the hidden layer is eliminated, the resulting rule would be a single hyperplane, as with Fisher's rule above.  We experimented with between 2 and 10 hidden-layer PEs and did not find much difference in performance, so we decided to go with 2 hidden PEs.

The connections between the three layers are represented by two matrices.  These matrices are first randomly initialized to some small values before they are trained to accommodate the observed classifications.  We did this using the well-known Backpropagation Algorithm.  It feeds the inputs forward through the net, and the errors are fed backward.  This enables one to update the weight matrices by gradient descent on the error, using the derivative of the transfer function.  We used the bipolar sigmoid transfer function, which not only provides a natural association (-infinity and infinity map to -1 and 1) but is also easily differentiable. Using bipolar variables seems more reasonable than binary ones, since the value 0 has a different effect than 1 in multiplication and we do not want the two classes to be treated differently.

As with any gradient-based optimization procedure, the Backpropagation Algorithm needs to determine the magnitude of the updates (the training rate).  The training rate is commonly set to some small value (< 1).  We experimented with different values and found that 0.1 worked well.  Often, a momentum term is included to improve the convergence rate; it did not seem to have much of an effect in this case, so we excluded it. When running the training algorithm, we observed that the accuracy on the training data reached a plateau around 82-85%, so we decided to terminate the algorithm when an accuracy of 85% or better was achieved (accuracy when using 0 as the cut-off value in the output PE).
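The following is a condensed sketch of this training procedure: a 10-2-1 feedforward net with bias terms, the bipolar sigmoid transfer function, plain backpropagation at a training rate of 0.1 with no momentum term, and termination once the training accuracy reaches 85%. It is only an illustration of the procedure described above; all names are hypothetical and details such as data handling are omitted.

    import numpy as np

    def bipolar_sigmoid(x):
        return 2.0 / (1.0 + np.exp(-x)) - 1.0       # maps (-inf, inf) to (-1, 1)

    def bipolar_sigmoid_deriv(f):
        return 0.5 * (1.0 + f) * (1.0 - f)          # derivative, written in terms of f(x)

    def train(X, t, n_hidden=2, rate=0.1, target_acc=0.85, max_epochs=1000):
        # X: training inputs (n x 10); t: bipolar targets (+1 or -1).
        # Returns the weight matrices w1 ((10+1) x n_hidden) and w2 ((n_hidden+1) x 1).
        rng = np.random.default_rng()
        n, p = X.shape
        w1 = rng.uniform(-0.5, 0.5, size=(p + 1, n_hidden))    # input  -> hidden
        w2 = rng.uniform(-0.5, 0.5, size=(n_hidden + 1, 1))    # hidden -> output
        for epoch in range(max_epochs):
            for i in range(n):
                x = np.append(X[i], 1.0)                # input with bias term
                h = bipolar_sigmoid(x @ w1)             # hidden activations
                hb = np.append(h, 1.0)                  # hidden activations with bias term
                y = bipolar_sigmoid(hb @ w2)[0]         # network output
                # Feed the error backward and update by gradient descent.
                delta_out = (t[i] - y) * bipolar_sigmoid_deriv(y)
                delta_hid = delta_out * w2[:-1, 0] * bipolar_sigmoid_deriv(h)
                w2 += rate * delta_out * hb[:, None]
                w1 += rate * np.outer(x, delta_hid)
            # Training accuracy with 0 as the cut-off value in the output PE.
            Xb = np.hstack([X, np.ones((n, 1))])
            Hb = np.hstack([bipolar_sigmoid(Xb @ w1), np.ones((n, 1))])
            pred = np.where(bipolar_sigmoid(Hb @ w2)[:, 0] > 0.0, 1.0, -1.0)
            if np.mean(pred == t) >= target_acc:
                break
        return w1, w2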

Once the trained weight matrices are obtained, we can classify the observations in the testing set. If we want to place all possible observations into one of the two groups, then we use 0 as the cut-off value for the bipolar sigmoid transfer function.  As with the logistic regression and discriminant analyses performed above, we want to fix the number of undecided cases in the testing set.  That is, we need to find a corresponding cut-off point.

The interested reader should consult our references (at the bottom of this page) for further insight into Neural Computing.

Results:
After running the training algorithm several times (different initial matrices give differing results), the most accurate classification on the testing data was found with the following weight matrices:

w1 (input to hidden, 11 x 2):
   -0.7761    0.6067
   -0.6696    0.5701
   -0.0742    0.0608
   -0.0105    0.0230
   -0.3816    0.3126
    0.4831   -0.3956
    1.4933   -1.2890
    1.4208   -1.1971
    0.9058   -0.7587
    0.5988   -0.5306
    0.4624   -0.4206

w2 (hidden to output, 3 x 1):
    0.4334
   -2.4373
    2.0423

after 7 epochs, and the accuracy on the training data improved as follows:

acc_trn =  0.4688    0.5938    0.7708    0.8073    0.8281    0.8438
 
reaching 85.4% accuracy on the training data when the algorithm was terminated.

For the testing data, we found the interval corresponding to 20 undecided cases. That is, when fixing the number of undecided cases to 20, the neural network rule will result in the following classifications on the remaining 46 testing observations:

        Number of correct classifications    = 41
        Number of incorrect classifications  =  5 .
 

* Notice that this is the best observed classification accuracy; we also observed results as poor as 37 correct classifications (and 9 incorrect) with the same accuracy on the training data.  In a situation where the new observations are not given, we could have obtained a classification accuracy anywhere between 37/46 and 41/46.
 
 

ACCURACY (on cases actually classified):

                                  Training    Testing
  Linear Discriminant Analysis:    86.0%      84.8%
  Logistic Regression:             87.5%      84.8%
  Neural Network:                  85.4%      80.4-89.1%*
  The OCAT (our) Approach:         100%       89.1%

References:

Fausett, L., "Fundamentals of Neural Networks," Prentice Hall, Upper Saddle River, NJ, U.S.A., 1994.

Fisher, R. A., "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188, 1936.

Johnson, R. A. and Wichern, D. W., "Applied Multivariate Statistical Analysis," Third Edition, Prentice Hall, Upper Saddle River, NJ, U.S.A., 1992.

Myers, R. H., "Classical and Modern Regression with Applications," Second Edition, Duxbury Press, Belmont, CA, U.S.A., 1990.
