Since logistic regression, linear discriminant analysis and neural networks use monotone classification functions, a fixed number of undecided cases corresponds to a single one-dimensional interval. Note that this interval may be of many different sizes, as long as it captures the given number of cases. Also, note that each function is monotone in the sense that, once the function has been determined, each variable has a monotone effect on the classification. That is, each variable has either a non-negative or a non-positive effect (not both) on the outcome. In linear discriminant analysis this is perhaps most apparent, because the space is split in two by a hyperplane. The sigmoid function (used in logistic regression, as well as the transfer function in the neural network below) is also monotone in each variable.
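For example, for logistic regression this monotonicity can be checked directly. Writing p(x) for the fitted probability of group membership with intercept theta and coefficient vector beta,

    p(x) = 1 / (1 + exp(-(theta + beta'x)))    and    dp/dx_j = beta_j * p(x) * (1 - p(x)),

so the effect of each variable x_j on the fitted probability always has the fixed sign of its coefficient beta_j.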
In the logistic regression, neural network and Fisher's linear discriminant
cases, these intervals are also symmetric, which further simplifies the
process. Note that the sigmoid function is itself symmetric, while Fisher's
linear classification rule assumes equal variance-covariance matrices,
which results in symmetric probabilities of misclassification. As
a result of this symmetry, a fixed number of undecided observations
corresponds to an interval lying between two groups of points.
Therefore, the number of correctly and incorrectly classified points is
also fixed. To find the interval corresponding to a fixed number
of undecided cases, it is only necessary to find one of its borders (because
of the symmetry, the middle point is given). To find an appropriate
upper border point we performed a binary search. Note that one may
also start with 0 captured cases (i.e., at the cut-off point) and expand
the interval, using its symmetry property, until one finds an interval
that captures 20 points.
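As an illustration, a minimal sketch of this binary search in Python (the names below are ours; the scores are simply the classifier outputs on the testing observations, and the half-width of the symmetric interval is adjusted until exactly the desired number of points falls inside it):

    import numpy as np

    def symmetric_undecided_interval(scores, cutoff, n_undecided, tol=1e-9):
        # Binary search for a half-width d such that (cutoff - d, cutoff + d)
        # captures exactly n_undecided of the classifier scores.
        # Assumes such a half-width exists (no ties straddling the edges).
        scores = np.asarray(scores, dtype=float)
        lo, hi = 0.0, np.max(np.abs(scores - cutoff)) + 1.0
        while hi - lo > tol:
            d = 0.5 * (lo + hi)
            captured = int(np.sum(np.abs(scores - cutoff) < d))
            if captured < n_undecided:
                lo = d          # interval too narrow: widen it
            elif captured > n_undecided:
                hi = d          # interval too wide: shrink it
            else:
                return cutoff - d, cutoff + d
        d = 0.5 * (lo + hi)
        return cutoff - d, cutoff + d

Here cutoff is the classification threshold of the method in question (7.1959 for the discriminant score below, 0.5 for the logistic probability, or 0 for the network output) and n_undecided = 20.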
Number of correct classifications = 41
Number of incorrect classifications = 5
Number of undecided cases = 20.
Therefore, the accuracy rate is 89.1% (41 correct out of 41+5 = 46) on the
cases that were actually classified. The number of undecided cases (20)
will be fixed for the remaining approaches, so that each provides the same
number of actual classifications.
Results:
The linear discriminant function we obtained from the training data (its coefficients are not reproduced here) provided 86.0% accuracy on the training data using a cut-off value of 7.1959.
For the testing data, we found that the interval [6.6264, 7.7654] captures 20 undecided cases. That is, when fixing the number of undecided cases to 20, the linear discriminant rule will result in the following classifications on the remaining 46 testing observations:
Number of correct classifications = 39
Number of incorrect classifications = 7.
Notice that even if we change this interval (while maintaining 20 points
within it), the number of correctly and incorrectly classified points will
not change, due to the monotonicity and symmetry properties discussed in the
introduction. Also, observe that the interval edges are symmetrically
displaced by 0.5695 about the original cut-off value.
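For completeness, a minimal sketch of how the resulting rule with an undecided band could be applied (Python; the coefficient vector a stands in for the fitted discriminant function, which is not reproduced above, and the group labels are illustrative only):

    import numpy as np

    CUTOFF = 7.1959
    LOWER, UPPER = 6.6264, 7.7654   # symmetric band containing 20 undecided cases

    def classify_lda(x, a):
        # a: fitted discriminant coefficients (hypothetical placeholder here)
        score = float(a @ x)        # linear discriminant score
        if score > UPPER:
            return 1                # assign to one group
        if score < LOWER:
            return 2                # assign to the other group
        return None                 # undecided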
Results:
The parameter estimates we obtained from the training data are:
Theta          SE
-9.2761        1.4505

Beta           SE
 0.0375        0.0165
 0.0244        0.0429
-0.0447        0.0492
 0.0720        0.0501
-0.0101        0.0401
-0.0435        0.0499
-0.1318        0.0531
 0.0023        0.0475
 0.0266        0.0305
-0.0223        0.0128
For the testing data, we found that the interval [0.3125, 0.6875] corresponds to 20 undecided cases. That is, when fixing the number of undecided cases to 20, the logistic regression rule will result in the following classifications on the remaining 46 testing observations:
Number of correct classifications = 39
Number of incorrect classifications = 7.
Even though the accuracy rate is equivalent to that of Fisher's linear discriminant, the actual classifications are not equivalent. In fact, their classifications differ at 4 points.
Notice that even if we change this interval (while maintaining 20 points
within it), the number of correctly and incorrectly classified points will
not change, due to the monotonicity and symmetry properties discussed in the
introduction. Also, observe that the interval edges are symmetrically
displaced by 0.1875 about the cut-off value of 0.5.
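Analogously, a minimal sketch of applying the fitted logistic rule with this band (Python; we take Theta as the intercept and the Beta coefficients in the order listed above, and the group assignment direction is illustrative only):

    import numpy as np

    theta = -9.2761
    beta = np.array([ 0.0375,  0.0244, -0.0447,  0.0720, -0.0101,
                     -0.0435, -0.1318,  0.0023,  0.0266, -0.0223])
    LOWER, UPPER = 0.3125, 0.6875   # band containing 20 undecided cases

    def classify_logistic(x):
        # Fitted probability from the logistic model
        p = 1.0 / (1.0 + np.exp(-(theta + beta @ x)))
        if p > UPPER:
            return 1                # assign to one group
        if p < LOWER:
            return 2                # assign to the other group
        return None                 # undecided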
The connections between the three layers are represented by two matrices. These matrices are first randomly initialized to some small values before they are trained to accommodate the observed classifications. We did this using the well-known Backpropagation Algorithm: the inputs are fed forward through the net, and the errors are fed backward. This makes it possible to update the weight matrices by gradient descent on the error, using the derivative of the transfer function. We used the bipolar sigmoid transfer function, which not only provides a natural association (-infinity and infinity map to -1 and 1) but is also easily differentiable. Using bipolar variables seems more reasonable than binary, since the value 0 has a different effect than 1 in multiplication and we do not want the two classes to be treated differently.
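As a small illustration, the bipolar sigmoid transfer function and its derivative can be written as follows (Python):

    import numpy as np

    def bipolar_sigmoid(x):
        # Maps (-infinity, infinity) onto (-1, 1)
        return 2.0 / (1.0 + np.exp(-x)) - 1.0

    def bipolar_sigmoid_deriv(x):
        # The derivative is conveniently expressed through the function value
        fx = bipolar_sigmoid(x)
        return 0.5 * (1.0 + fx) * (1.0 - fx)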
As with any gradient-based optimization procedure, the Backpropagation Algorithm needs a step size for the updates (the training rate). The training rate is commonly set to some small value (< 1). We experimented with different values and found that 0.1 worked well. Often a momentum term is included to improve the convergence rate; it did not seem to have much of an effect in this case, so we excluded it. When running the training algorithm, we observed that the accuracy on the training data reached a plateau around 82-85%, so we decided to terminate the algorithm once an accuracy of 85% or better was achieved (accuracy when using 0 as the cut-off value at the output PE).
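To make the procedure concrete, here is a minimal sketch of one Backpropagation update (Python). The 10-input, 2-hidden-PE, 1-output architecture with appended bias rows is our reading of the weight matrices reported below; targets are coded as -1 and +1, the training rate is 0.1, and no momentum term is used:

    import numpy as np

    f  = lambda v: 2.0 / (1.0 + np.exp(-v)) - 1.0       # bipolar sigmoid
    df = lambda v: 0.5 * (1.0 + f(v)) * (1.0 - f(v))     # its derivative

    def backprop_step(x, target, w1, w2, lr=0.1):
        # One gradient-descent update for a three-layer network.
        # w1: (11 x 2) input-to-hidden weights (last row = bias),
        # w2: (3 x 1) hidden-to-output weights (last row = bias).
        x_aug = np.append(x, 1.0)                # input plus bias
        net_h = x_aug @ w1                       # hidden net inputs
        h_aug = np.append(f(net_h), 1.0)         # hidden activations plus bias
        net_o = h_aug @ w2                       # output net input
        y = f(net_o)                             # network output in (-1, 1)
        delta_o = (target - y) * df(net_o)                  # output error signal
        delta_h = (w2[:-1].ravel() * delta_o) * df(net_h)   # error fed back to hidden PEs
        w2 += lr * np.outer(h_aug, delta_o)      # weight updates (no momentum)
        w1 += lr * np.outer(x_aug, delta_h)
        return y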
Once the trained weight matrices are obtained, we can classify the observations in the testing set. If we want to place every observation into one of the two groups, we use 0 as the cut-off value for the bipolar sigmoid transfer function. As with the logistic regression and discriminant analyses performed above, we want to fix the number of undecided cases in the testing set. That is, we need to find the corresponding interval around this cut-off point.
The interested reader should consult our references (at the bottom of this page) for further insight into Neural Computing.
Results:
After running the training algorithm several times (different initial
matrices give different results), the most accurate classification
on the testing data was found with the following weight matrices:
w1 =
   -0.7761    0.6067
   -0.6696    0.5701
   -0.0742    0.0608
   -0.0105    0.0230
   -0.3816    0.3126
    0.4831   -0.3956
    1.4933   -1.2890
    1.4208   -1.1971
    0.9058   -0.7587
    0.5988   -0.5306
    0.4624   -0.4206

w2 =
    0.4334
   -2.4373
    2.0423
These weights were obtained after 7 epochs, and the accuracy on the training data improved over the epochs as follows:
acc_trn = 0.4688  0.5938  0.7708  0.8073  0.8281  0.8438,
reaching 85.4% on the training data when the algorithm was terminated.
For the testing data, we found the interval corresponding to 20 undecided cases. That is, when fixing the number of undecided cases to 20, the neural network rule will result in the following classifications on the remaining 46 testing observations:
Number of correct classifications = 41
Number of incorrect classifications = 5.
* Notice that this is the best observed classification accuracy, and that
we observed as few as 37 correct classifications (and 9 incorrect) with
the same accuracy on the training data. In a situation where the
new observations are not given, we could have obtained a classification
accuracy anywhere between 37/46 and 41/46.
                                 Accuracy on       Accuracy on
                                 training data     testing data
Linear Discriminant Analysis:    86.0%             84.8%
Logistic Regression:             87.5%             84.8%
Neural Network:                  85.4%             80.4-89.1%*
The OCAT (our) Approach:         100%              89.1%
Fisher, R. A., "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188, 1936.
Johnson R. A. and Wichern D.W., "Applied Multivariate Statistical Analysis," Third Edition, Prentice Hall, Upper Saddle River, NJ, U.S.A., 1992.
Myers, R. H., "Classical and Modern Regression with Applications," Second Edition, Duxbury Press, Belmont, CA, U.S.A., 1990.