Confusion Matrix - SkillBakery Studios

## Wednesday, July 29, 2020

It should not be that confusing, but most of the time when someone asks me to calculate the specificity, sensitivity, or accuracy of a model, I just end up googling the formulas. A confusion matrix is used to figure out which of our classification models (Logistic Regression, K-NN, or Random Forest) does a better job of classifying the data. Let’s suppose we have data on weight, blood circulation, heart rate, blocked arteries, etc. for a number of patients, and we have to predict whether each patient has heart disease or not. We split the data into training and test sets and run classification models such as Logistic Regression or K-NN on it.

The process goes like this: we first train the model on the training data and then test it by predicting the outcome for the test data. To summarize how the different models performed, we use the confusion matrix. For a two-class problem it is a 2×2 table. In the layout used here, the columns are the actual values for the patients, while the rows are the values predicted by the model.

True Positives (TP) is the number of times the model correctly predicted that the patient has heart disease. True Negatives (TN) is the number of times the model correctly predicted that the patient doesn’t have heart disease. False Positives (FP) is the number of times the model predicted that the patient has heart disease when in fact they don’t. False Negatives (FN) is the number of times the model predicted that the patient doesn’t have heart disease when in fact they do.
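As a minimal sketch of how these four counts come about, the snippet below tallies them from a list of actual labels and a list of predicted labels. The function name `confusion_counts`, the 1/0 encoding (1 = has heart disease, 0 = does not), and the six sample patients are all assumptions made for illustration, not data from this post.

```python
def confusion_counts(actual, predicted):
    """Tally TP, TN, FP, FN for binary labels (1 = heart disease, 0 = no heart disease)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted disease, patient has it
        elif p == 0 and a == 0:
            tn += 1  # predicted no disease, patient is healthy
        elif p == 1 and a == 0:
            fp += 1  # predicted disease, patient is healthy
        else:
            fn += 1  # predicted no disease, patient has it
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels for six patients, for illustration only.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```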

Now, let’s suppose we ran the classification for 100 patients and the confusion matrix for Logistic Regression came out as TP = 50, FN = 5, FP = 15, and TN = 30.

From these counts, we can calculate the accuracy, sensitivity, and specificity of the model.

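Accuracy = (TP + TN) / Total = (50 + 30) / 100 = 0.80

This tells us how often the classifier is correct overall. Note that accuracy is not the same as precision: precision is TP / (TP + FP), the share of predicted positives that really are positive.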
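The accuracy arithmetic can be checked with a few lines of Python. This is only a sketch: the helper name `accuracy` is made up here, and the counts (TP = 50, TN = 30, FP = 15, FN = 5) are the ones used in the worked formulas of this post.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# Counts from the Logistic Regression example in this post.
acc = accuracy(tp=50, tn=30, fp=15, fn=5)
print(acc)  # 0.8
```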

Sensitivity = TP / (TP + FN) = 50 / (50 + 5) = 0.909, i.e. approx. 91% of the patients with heart disease were correctly identified.

Specificity = TN / (TN + FP) = 30 / (30 + 15) = 0.667, i.e. approx. 67% of the patients without heart disease were correctly identified.
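The same two formulas, sketched in Python with the counts from the example above. The function names `sensitivity` and `specificity` are just illustrative helpers, not part of any library.

```python
def sensitivity(tp, fn):
    """Share of patients who actually have heart disease that the model caught."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of patients without heart disease that the model cleared."""
    return tn / (tn + fp)


# Counts from the Logistic Regression example in this post.
sens = sensitivity(tp=50, fn=5)
spec = specificity(tn=30, fp=15)
print(round(sens, 3), round(spec, 3))  # 0.909 0.667
```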

We can then calculate the same values for K-NN or any other classification method. Which model to use depends on what matters more to us: if it is more important that patients with heart disease are identified correctly, the model with the higher sensitivity should be used. If we care more about patients without heart disease being classified correctly, the model with the higher specificity should be chosen.
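To make the selection step concrete, here is a rough sketch that compares two models on either metric. The Logistic Regression counts are the ones from this post; the K-NN counts are invented purely for illustration (this post does not report them), and `pick_model` is a hypothetical helper.

```python
models = {
    # Counts from the Logistic Regression example in this post.
    "Logistic Regression": {"TP": 50, "FN": 5, "TN": 30, "FP": 15},
    # Hypothetical K-NN counts on the same 100 patients, made up for this sketch.
    "K-NN (hypothetical)": {"TP": 45, "FN": 10, "TN": 38, "FP": 7},
}


def sensitivity(c):
    return c["TP"] / (c["TP"] + c["FN"])


def specificity(c):
    return c["TN"] / (c["TN"] + c["FP"])


def pick_model(models, metric):
    """Return the name of the model that scores highest on the given metric."""
    return max(models, key=lambda name: metric(models[name]))


best_for_sensitivity = pick_model(models, sensitivity)
best_for_specificity = pick_model(models, specificity)
print(best_for_sensitivity)  # Logistic Regression
print(best_for_specificity)  # K-NN (hypothetical)
```

With these made-up numbers the two criteria pick different models, which is exactly the trade-off described above.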