This article is for education and learning purposes. The aim is to understand how to start experimentation in the area of Neural Networks, how to compare results, and which features to consider while doing experiments. It is a handy reference for those self-learning Machine Learning, and a good starting point for educators who want to teach in these areas and want to know how to design assignments and what to expect. Further, the results are not benchmarks; they are elaborated for the explanations given ahead. For benchmark results, look into peer-reviewed research papers.
Here we have implemented the backpropagation algorithm and tested it on a subset of the original MNIST dataset for the 2-class problem of digit image recognition.
Backpropagation with MNIST Data for the binary case of the digits 3 and 8
Contents
1. Preprocessing the available data
2. Code Implementation Details
This article focuses only on the way experiments are to be performed and analyzed, with some results; it plays no role in benchmarking results or changing existing theories. It is an elaboration for academicians and students to learn how Neural Networks can be used for two-class classification of image data. Further, the results vary with changes in the number of hidden layers, initial weights, and other settings.
1. Preprocessing the available data.
I have written a script to collect a subset of the data for the digits 3 and 8, called mnist_38_2.mat. The script takes the training data, training labels, testing data and testing labels and combines them into one data file. It extracts the 3 and 8 digits from the training and testing data and creates a single .mat file containing both the training and the testing data along with their labels.
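A minimal sketch of such an extraction script is shown below, purely for illustration. It assumes the original MNIST data is already available in a .mat file holding matrices trainData/testData (one 784-pixel image per row) and label vectors trainLabels/testLabels; these variable names, and the file name mnist_original.mat, are assumptions and not the exact ones used in my script.

% Sketch: extract the 3 and 8 digits and save them as one .mat file.
% The source file name and variable names below are illustrative assumptions.
S = load('mnist_original.mat');                          % hypothetical source file

trainMask = (S.trainLabels == 3) | (S.trainLabels == 8);
testMask  = (S.testLabels  == 3) | (S.testLabels  == 8);

% keep only the 3/8 rows and relabel the digits: 3 -> 0, 8 -> 1 (see section 2)
train = [S.trainData(trainMask, :), double(S.trainLabels(trainMask) == 8)];
test  = [S.testData(testMask, :),  double(S.testLabels(testMask)  == 8)];

% one file holding both training and testing data, with the label in column 785
save('mnist38All.mat', 'train', 'test');   % the combined file loaded later by BackPropogation38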
Note: I have performed far fewer epochs in these experiments due to constraints of the computing device used. This is very low compared to what is possible when GPUs are used, but all of this is for illustrative purposes only. You can ask your students to run more iterations to reach a higher accuracy and to test various numbers of hidden layers.
2. Code Implementation Details
Here are the implementation details of the backpropagation algorithm that I have written for the binary case of classifying digit 3 versus digit 8. The code is written in Matlab; the Matlab file name is backpropogation_38. Appropriate self-explanatory comments are given in the code. Here is a brief summary of the algorithm being implemented.
The methods are as follows:
i. BackPropogation38
This is the main method of the file, from where execution starts. In this method data is read from the file mnist38All.mat. The code is for the two-class problem; for more than two classes the code changes. Digit 3 is taken as the positive class (labeled 0) and digit 8 as the other class (labeled 1). Ten folds of data are then created. Experiments were performed with crossvalind using a holdout of 0.1, 0.2 and 0.3, so that the testing set contains 10%, 20% or 30% of the elements and the remaining ones are used for training. Further, vtot is defined as a vector that stores the average values of the following quantities for the final result: [TP, TN, FP, FN, Precision, Recall, F-measure, Accuracy]. The numbers of hidden and output neurons are also defined in this method; the experiments reported here use 100 hidden neurons.
Inside the loop over the ten folds, a call to the backpropagation training function is made for each generated training set, and evaluation is then performed on the corresponding testing set: trainBP is used for training and testBP for testing inside the for loop.
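For readers new to crossvalind, here is a tiny, self-contained illustration of the holdout call used inside each fold (crossvalind comes from the Bioinformatics Toolbox; the labels below are made up):

% Toy illustration of the holdout split produced by crossvalind.
groups = [0; 0; 0; 0; 0; 1; 1; 1; 1; 1];                     % example class labels
[trainIdx, testIdx] = crossvalind('holdout', groups, 0.2);   % hold out about 20% for testing
trainingRows = find(trainIdx)                                % rows selected for training
testingRows  = find(testIdx)                                 % rows selected for testing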
ii. TrainBP
This is the main function that does the training of the network. It takes the following parameters and returns the arguments listed below, followed by the details of the procedure:
—————parameters————–
trainingSet: The training data as obtained by crossvalind
num_Hidden: The number of hidden nodes
num_Output: The number of output nodes
trainingLabels: The labels of the training data; in the case of one output the number of columns is one, otherwise it equals the number of output nodes
—————-Return Arguments——————–
weights_1_ij:weights from input to hidden
weights_2_ij:weights from hidden to output
biasInput: the bias from input to hidden
biasHidden: the bias from hidden to output
—————-Details———————-
Here in the training the parameters such as the learning rate are assigned; the learning rate is set equal to 1/sqrt(iteration). Then each training pattern is presented in a loop, one by one, in each iteration. The maximum number of iterations is set to 5000 because of the computationally large size of the dataset, which takes a long time per iteration. The loop over iterations runs until either the maximum number of iterations is reached or the error in the computed values is less than the permissible error of 0.001. Further, for each input the updated weights are computed as follows (a vectorized sketch of the same updates is given after the list):
- S1(j) = S1(j) + weights_1_ij(i,j) * x(i) ; w1.x, the net input at the jth hidden neuron
- S2(j) = S2(j) + weights_2_ij(i,j) * h(i) ; w2.h, the net input at the jth output neuron
- delta_2_weights_2_ij(j) = O(j)*(1-O(j))*(Y(k)-O(j)) ; delta 2 for each output neuron
- sum = sum + delta_2_weights_2_ij(l) * weights_2_ij(j,l) ; accumulating the back-propagated error for the jth hidden neuron
- delta_1_weights_1_ij(j) = h(j)*(1-h(j))*sum ; delta 1 for each hidden neuron
- weights_1_ij(i,j) = weights_1_ij(i,j) + eta * delta_1_weights_1_ij(j) * x(i) ; update of the weights from the input to the hidden layer
- weights_2_ij(i,j) = weights_2_ij(i,j) + eta * delta_2_weights_2_ij(j) * h(i) ; update of the weights from the hidden to the output layer
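The same per-pattern forward pass and updates can also be written in vectorized form. The sketch below only restates the equations above; it is not the attached code, which deliberately uses explicit loops. W1, W2, b1, b2, x and Y are illustrative names for the weight matrices, biases, one input row and its target.

% Vectorized restatement of the per-pattern update (illustrative names and dummy data).
num_Input = 784; num_Hidden = 100; num_Output = 1;
W1 = 0.01 * ones(num_Input, num_Hidden);   % input-to-hidden weights
W2 = 0.01 * ones(num_Hidden, num_Output);  % hidden-to-output weights
b1 = 0.01 * ones(1, num_Hidden);           % bias into the hidden layer
b2 = 0.01 * ones(1, num_Output);           % bias into the output layer
x  = double(rand(1, num_Input) > 0.5);     % one dummy binarized input pattern
Y  = 1;                                    % its target label
eta = 1;                                   % learning rate

sigmoid = @(z) 1 ./ (1 + exp(-z));
S1 = x * W1 + b1;                          % net input at the hidden layer (w1.x plus bias)
h  = sigmoid(S1);                          % hidden-layer outputs
S2 = h * W2 + b2;                          % net input at the output layer (w2.h plus bias)
O  = sigmoid(S2);                          % network output

delta2 = O .* (1 - O) .* (Y - O);          % delta for each output neuron
delta1 = h .* (1 - h) .* (delta2 * W2');   % delta for each hidden neuron

W1 = W1 + eta * (x' * delta1);             % input-to-hidden weight update
W2 = W2 + eta * (h' * delta2);             % hidden-to-output weight update
b1 = b1 + delta1;                          % bias updates (no eta factor,
b2 = b2 + delta2;                          % matching the loop version)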
iii. TestBP
This method tests the backpropagation network trained with the iterative methodology above (non-matrix-based weight updates). It is the main function that does the testing of the network. It takes the following parameters and returns the arguments listed below. Following are the details of the procedure:
—————parameters————–
weights_1_ij:weights from input to hidden
weights_2_ij:weights from hidden to output
biasInput: the bias from input to hidden
biasHidden: the bias from hidden to output
testingSet: the testing set as created by the 10-fold crossvalind
testLabel: the testing labels corresponding to the testingSet
num_Hidden: The number of hidden layer neurons
num_Output: number of output layer nodes
—————Return Arguments——————–
correctlyClassified: the number of correctly classified test patterns
count3: The number of misclassified class 3 elements
count8: The number of misclassified class 8 elements
unClassified: the matrix containing 5 misclassified data elements from each of the given classes
v: a vector returning TP, TN, FP, FN (i.e. the confusion matrix); it also returns precision, recall, F-value and accuracy
—————-Details———————-
In this method each pattern in the testingSet is tested. The net input to the hidden layer is calculated, the activation function is applied to it, and the outputs at the hidden layer are evaluated. Further, the net input at the output layer is computed and the activation function is applied to get the net output. The net output is then compared with the expected output to evaluate TP, FP, TN, FN, precision, recall and accuracy.
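For reference, the derived quantities stored in v follow directly from the confusion-matrix counts. A minimal sketch of these formulas, assuming TP, TN, FP and FN have already been counted as in testBP (the example numbers are taken from the results reported below):

% Precision, recall, F-value and accuracy from the confusion-matrix counts.
TP = 679; TN = 486; FP = 228; FN = 3;           % example counts
precision = TP / (TP + FP);
recall    = TP / (TP + FN);
fValue    = 2 * precision * recall / (precision + recall);
accuracy  = (TP + TN) / (TP + TN + FP + FN);
v = [TP, TN, FP, FN, precision, recall, fValue, accuracy]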
3. Evaluations and Results
The results are evaluated as explained above with the maximum number of epochs set to 50, as it was taking a long time to compute the results. The three functions above are executed and the results computed; 100 hidden neurons were used.
The following experiments were conducted, with results as follows:
Experiment | Number of Epochs | TP | TN | FP | FN | Precision | Recall | F-Value | Accuracy
0.1 holdout | 3000 | 682 | 0 | 714 | 0 | .4885 | 1 | .4885 | 0.4885
Average of 10 folds, for each fold | 100 | 682 | 0 | 714 | 0 | .4885 | 1 | .4885 | 0.4885
Only this many experiments were conducted because of the time it takes to run for a large number of epochs.
Once the settings of the initial weights and the training and testing data values were changed, the number of epochs required changed drastically and the accuracy increased considerably. Here are the results.
Experiment | Number of Epochs | TP | TN | FP | FN | Precision | Recall | F-Value | Accuracy
Average over all folds and 100 epochs with 100 hidden layer neurons | 100 | 679 | 486 | 228 | 3 | .74 | .99 | .846 | 0.83
The following accuracy was also obtained, but since it was taking too much time, its ten folds could not be computed. This accuracy was computed 2 times, with 350 epochs. I have attached the final weights for this run with the code.
Experiment | TP | TN | FP | FN | Precision | Recall | F-Value | Accuracy
BackPropagation with 100 hidden layer neurons | 682 | 697 | 17 | 0 | .9757 | 1 | .9877 | .9878
Code
function BackPropogation38()
disp('..Starting BackPropogation38 Algorithm....');
%reading data
A = load('mnist38All.mat');
%v = [TP, TN, FP, FN, Precision, Recall, F-measure, accuracy]
vtot = [0, 0, 0, 0, 0, 0, 0, 0];
data = [A.train; A.test];
%10 folds with 10-90 ratio of testing and training
for i = 1:10
    P = .1;
    groups = data(:,785);
    [train, test] = crossvalind('holdout', groups, P);
    train1 = data(train, 1:785);
    test1 = data(test, 1:785);
    num_Hidden = 100;
    num_Output = 1; % one output for the binary case, else 10
    [rowtrain, coltrain] = size(train1);
    [rowtest, coltest] = size(test1);
    %separating the pixel data and the labels (last column)
    trainingLabels(1:rowtrain) = train1(:, coltrain);
    trainingSet(1:rowtrain, 1:coltrain-1) = 0;
    trainingSet(1:rowtrain, 1:coltrain-1) = train1(1:rowtrain, 1:coltrain-1);
    trainingLabels(1:rowtrain) = train1(1:rowtrain, coltrain);
    testSet(1:rowtest, 1:coltest-1) = 0;
    testLabels(1:rowtest) = test1(:, coltrain);
    testSet(1:rowtest, 1:coltest-1) = test1(1:rowtest, 1:coltest-1);
    testLabels(1:rowtest) = test1(1:rowtest, coltest);
    %binarizing the pixel values
    for n1 = 1:rowtrain
        for n2 = 1:coltrain-1
            if trainingSet(n1,n2) > 0
                trainingSet(n1,n2) = 1;
            end
        end
    end
    for n1 = 1:rowtest
        for n2 = 1:coltest-1
            if testSet(n1,n2) > 0
                testSet(n1,n2) = 1;
            end
        end
    end
    [weights_1_ij, weights_2_ij, biasInput, biasHidden] = trainBP(trainingSet, num_Hidden, num_Output, trainingLabels, testLabels, testSet);
    [correctlyClassified, count3, count8, unClassified, v] = testBP(testSet, weights_1_ij, weights_2_ij, biasInput, biasHidden, testLabels, num_Hidden, num_Output);
    %v stores the output containing TP, FP etc.; vtot is the total of all
    %such values over the 10 folds
    vtot = vtot + v;
    correctlyClassified
    count3
    count8
end
%computing the average over the 10 folds
vtot = vtot ./ i
end
function [weights_1_ij, weights_2_ij, biasInput, biasHidden] = trainBP(trainingSet, num_Hidden, num_Output, trainingLabels, testLabels, testSet)
%learning rate
eta = 1;
maxEpochs = 100;
errorBound = 0.001;
[trainLengthRow, trainLengthCol] = size(trainingSet);
num_Input = trainLengthCol;
Y = trainingLabels;
%assigning initial weights
weights_1_ij(1:num_Input, 1:num_Hidden) = 0.01;
weights_2_ij(1:num_Hidden, 1:num_Output) = 0.01;
biasInput(1:num_Hidden) = 0.01;
biasHidden(1:num_Output) = 0.01;
delta_1_weights_1_ij(1:num_Hidden) = 0;
delta_2_weights_2_ij(1:num_Output) = 0;
epochs = 1;
error = 1;
while((epochs < maxEpochs) && (error > errorBound))
    eta = 1/sqrt(epochs);
    error = 0;
    epochs = epochs + 1
    for k = 1:trainLengthRow
        x = trainingSet(k,:);
        S1(1:num_Hidden) = 0;
        S2(1:num_Output) = 0;
        %forward pass: net input at the hidden layer
        for j = 1:num_Hidden
            for i = 1:num_Input
                S1(j) = S1(j) + weights_1_ij(i,j) * x(i);
            end;
            S1(j) = S1(j) + biasInput(j) * 1;
        end;
        h(1:num_Hidden) = 0;
        for j = 1:num_Hidden
            h(j) = 1/(1+exp(-1*S1(j)));
        end;
        %forward pass: net input at the output layer
        for j = 1:num_Output
            for i = 1:num_Hidden
                S2(j) = S2(j) + weights_2_ij(i,j) * h(i);
            end;
            S2(j) = S2(j) + biasHidden(j) * 1;
        end;
        O(1:num_Output) = 0;
        for j = 1:num_Output
            O(j) = 1/(1+exp(-1*S2(j)));
        end;
        %calculating the deltas
        for j = 1:num_Output
            delta_2_weights_2_ij(j) = O(j)*(1-O(j))*(Y(k)-O(j));
        end;
        for j = 1:num_Hidden
            sum = 0;
            for l = 1:num_Output
                sum = sum + delta_2_weights_2_ij(l) * weights_2_ij(j,l);
            end;
            delta_1_weights_1_ij(j) = h(j)*(1-h(j))*sum;
        end;
        %updating the input-to-hidden weights
        for i = 1:num_Input
            for j = 1:num_Hidden
                weights_1_ij(i,j) = weights_1_ij(i,j) + eta * delta_1_weights_1_ij(j) * x(i);
            end;
        end;
        %updating the output-layer bias
        for j = 1:num_Output
            biasHidden(j) = biasHidden(j) + delta_2_weights_2_ij(j) * 1;
        end;
        %updating the hidden-to-output weights
        for i = 1:num_Hidden
            for j = 1:num_Output
                weights_2_ij(i,j) = weights_2_ij(i,j) + eta * delta_2_weights_2_ij(j) * h(i);
            end;
        end;
        %updating the hidden-layer bias
        for j = 1:num_Hidden
            biasInput(j) = biasInput(j) + delta_1_weights_1_ij(j) * 1;
        end;
        % error as output approaching target
        error = error + sqrt( (O(1) - Y(k)) * (O(1) - Y(k)) );
    end
    %saving intermediate weights every 10 epochs
    if mod(epochs, 10) == 0
        save('weights.mat','weights_1_ij','weights_2_ij','biasInput','biasHidden','testLabels','testSet','trainingSet','trainingLabels');
    end
end
end
function [correctlyClassified,count3,count8,unClassified,v] = testBP(testingSet,weights_1_ij, weights_2_ij,biasInput, biasHidden, testLabel, num_Hidden,num_Output)
correctlyClassified = 0;
count3 = 0;
count8 = 0;
TP = 0; TN = 0; FP = 0; FN = 0;
P = 0; R = 0; F = 0;
[testLengthRow, testLengthCol] = size(testingSet);
unClassified(1:10, 1:testLengthCol) = 0;
% checking accuracy by the number of correctly classified patterns
for k = 1:testLengthRow
    x = testingSet(k,:);
    S1(1:num_Hidden) = 0;
    S2(1:num_Output) = 0;
    num_Input = testLengthCol;
    %forward pass through the trained network
    for j = 1:num_Hidden
        for i = 1:num_Input
            S1(j) = S1(j) + weights_1_ij(i,j) * x(i);
        end;
        S1(j) = S1(j) + biasInput(j) * 1;
    end;
    h(1:num_Hidden) = 0;
    for j = 1:num_Hidden
        h(j) = 1/(1+exp(-1*S1(j)));
    end;
    for j = 1:num_Output
        for i = 1:num_Hidden
            S2(j) = S2(j) + weights_2_ij(i,j) * h(i);
        end;
        S2(j) = S2(j) + biasHidden(j) * 1;
    end;
    O(1:num_Output) = 0;
    for j = 1:num_Output
        O(j) = 1/(1+exp(-1*S2(j)));
    end;
    % comparing the rounded output with the target label
    if sqrt( (round(O(1)) - testLabel(k)) * (round(O(1)) - testLabel(k)) ) == 0
        % correctly classified examples
        correctlyClassified = correctlyClassified + 1;
        %compute TP, TN
        if(testLabel(k)==1)
            TP = TP + 1;
        else
            TN = TN + 1;
        end
    else
        % wrongly classified examples
        if(testLabel(k)==1)
            FN = FN + 1;
        else
            FP = FP + 1;
        end
        %storing 5 misclassified samples from each class
        if(count8 < 5 && testLabel(k)==0)
            count8 = count8 + 1;
            unClassified(count8, 1:testLengthCol) = testingSet(k, 1:testLengthCol);
        end
        if(count3 < 5 && testLabel(k)==1)
            count3 = count3 + 1;
            unClassified(count3+5, 1:testLengthCol) = testingSet(k, 1:testLengthCol);
        end
    end
end
k
%computing precision, recall and F-value from the confusion-matrix counts
P = TP/(TP+FP);
R = TP/(TP+FN);
F = 2*P*R/(P+R);
%v stores TP, TN, FP, FN, Precision, Recall, F-value, accuracy
v = [TP, TN, FP, FN, P, R, F, correctlyClassified/testLengthRow]
disp('TP, TN, FP, FN, Precision, Recall, F-value, accuracy');
unClassified;
accuracy = correctlyClassified/testLengthRow;
accuracy
end
Backpropagation algorithm for learning all 10 digits
Applying the BPNN classifier on the original multiclass data with the same experimental settings as in the binary case above, the results are not as high, and only a few epochs could be performed, again due to limits on the computing capacity of the systems used for experimentation. This is because the complete Neural Network and the initial weights have to change with the change in problem. The outputs are 10 classes, represented as combinations of 1's and 0's over the ten digits. In conclusion, experiments on the parameters for the 10-class data still need to be performed.
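As a small illustration of that output encoding, the 10-class targets can be built as one-hot vectors of length ten. The variable names below are illustrative and not taken from the attached code.

% One-hot encoding of digit labels 0-9 for a network with 10 output neurons:
% each target row has a 1 in the column of its digit and 0 elsewhere.
labels = [3; 8; 0; 5];                     % example digit labels
numClasses = 10;
targets = zeros(numel(labels), numClasses);
for k = 1:numel(labels)
    targets(k, labels(k) + 1) = 1;         % digit d goes to column d+1
end
targets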
Misclassified Images for the 2-class digit recognition problem
Code was written in each of the algorithms to store 5 misclassified samples from each class in a mat file, and a script (in this folder) reads each misclassified row from that mat file and converts it into a 28 x 28 matrix in image form. The script is as follows:
%This is a script that reads data which was unclassified and stored in a
%mat file and displays the image file corresponding to that data.
data = load('mat38unclassified.mat');
A = data.unClassified;
[r, c] = size(A);
%read all r rows that are present in the unclassified mat file
for t = 1:r
    k = 1;
    x(1:c) = A(t, 1:c);
    a(1:28, 1:28) = 0;
    %convert the row back to matrix form
    for i = 1:28
        for j = 1:28
            a(i,j) = x(k);
            k = k + 1;
        end
    end
    % display each of the figures separately
    figure(t);
    imshow(a);
end
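As a side note on the conversion step, the inner double loop above can be replaced by a single reshape call; since Matlab's reshape fills column-wise, a transpose reproduces the row-wise order used by the loop.

% Equivalent one-step conversion of the 784 pixel values of row x into a 28 x 28 image.
a = reshape(x(1:784), 28, 28)';            % transpose to match the row-wise fill of the loop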
The following are misclassified samples from the backpropagation program for the 2-class problem on the digits 3 and 8.
[Images of misclassified digits]
The following are misclassified "3" digits.
[Images of misclassified "3" digits]
Comparison with other ML Techniques
Method | TP | TN | FP | FN | Precision | Recall | F-Value | Accuracy |
Logistic regression | 936 | 990 | 20 | 38 | .9791 | .9610 | .9699 | .9708 |
Naïve Bayes | 938 | 903 | 72 | 71 | .928 | .928 | .928 | .92 |
Comment: Logistic Regression performs the best, and Naïve Bayes also performs well.
RNN takes a large amount of time to learn, so its execution time is very high.
Naïve Bayes requires the extra overhead of discretizing the data. Otherwise, Naïve Bayes is a fast algorithm in terms of execution time, since its output is computed directly and does not require iterations.
The accuracy in these experiments is limited by execution time and system restrictions, which prevented full optimization of the error and hence a further increase in accuracy. This does not mean that a particular method does not perform well; it is just an experimental setup, for education and learning purposes, of how to go about experimentation.
Weka was used for Naïve Bayes, and since its settings cannot be changed there, the true positive rate and false positive rate come out to be 0.92 and 0.926 respectively, which corresponds to the upper right corner of the ROC graph.
Naïve Bayes ROC Curve: Using the Weka GUI it is not possible to create a full ROC curve, though the TPR and FPR are given as (.926, .92). Also see the screenshot below for details. Only a two-point plot is possible.
ROC for implemented Logistic Regression:
[ROC curve for the implemented Logistic Regression]
Note again: much fewer epochs were performed in the experiments due to constraints of the computing device used. This is very low compared to what is possible when GPUs are used, but all of this is for illustrative purposes only. Also, an iterative approach (loop-based weight updates) has been used in the implemented code for illustration. Modern-day approaches use matrix computations for efficiency in processing, given the advances in processors and computing facilities available for matrix computation.