Confusion matrix with the use case in cyber crime.

Armaan Zaheer
5 min read · Jun 6, 2021

The million dollar question — what, after all, is a confusion matrix?

Confusion matrix-

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

For a binary classification problem, we would have a 2 x 2 matrix with 4 values: true positives, false positives, false negatives and true negatives.

Let’s understand each term below-

True Positive (TP)

  • The predicted value matches the actual value
  • The actual value was positive and the model predicted a positive value

True Negative (TN)

  • The predicted value matches the actual value
  • The actual value was negative and the model predicted a negative value

False Positive (FP) — Type 1 error

  • The predicted value does not match the actual value
  • The actual value was negative but the model predicted a positive value
  • Also known as the Type 1 error

False Negative (FN) — Type 2 error

  • The predicted value does not match the actual value
  • The actual value was positive but the model predicted a negative value
  • Also known as the Type 2 error
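As a minimal sketch of the four terms above, the following counts them from a list of actual and predicted labels. The function name `confusion_counts`, the use of 1 for the positive class, and the sample labels are all assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif a != positive and p != positive:
            tn += 1  # actual negative, predicted negative
        elif a != positive and p == positive:
            fp += 1  # actual negative, predicted positive (Type 1 error)
        else:
            fn += 1  # actual positive, predicted negative (Type 2 error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels, only to exercise the function.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```

Each pair of labels falls into exactly one of the four cells, so the four counts always add up to the number of samples.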

Why Do We Need Them?

Classification models output one of several categories. Most error measures report only the total error of the model, so they cannot tell us which kinds of mistakes it makes. The model might misclassify some categories more than others, but a single accuracy figure does not show this.

Furthermore, suppose there is a significant class imbalance in the given data, i.e., one class has many more instances than the other classes. A model might then predict the majority class for all cases and still achieve a high accuracy score, even though it never predicts the minority classes. This is where confusion matrices are useful.
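The imbalance problem can be illustrated with a small made-up example: 95 negative samples, 5 positive samples, and a hypothetical model that always predicts the majority class. The numbers are assumptions chosen only to make the effect visible.

```python
# Hypothetical imbalanced data: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
predicted = [0] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# The confusion-matrix cells for the positive class expose the problem.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
true_positive_rate = tp / (tp + fn)

print(accuracy)            # 0.95
print(true_positive_rate)  # 0.0
```

The accuracy looks high, yet every positive sample ends up in the false negative cell, which the confusion matrix shows at a glance.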

A confusion matrix lays out all the predicted and actual values of a classifier in a table, which helps visualize the different outcomes of a classification problem.

Basic layout of confusion matrix
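For more than two classes the same idea gives an N x N table. The sketch below builds one in plain Python, assuming that rows stand for actual classes and columns for predicted classes (the opposite orientation is also in common use). The function name and the labels "a", "b", "c" are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Build an N x N table: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Made-up three-class example.
labels = ["a", "b", "c"]
actual = ["a", "a", "b", "b", "c", "c"]
predicted = ["a", "b", "b", "b", "c", "a"]
matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Correct predictions sit on the diagonal, and every off-diagonal cell tells us which class was mistaken for which.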

Use case on Cyber Crime -

In recent years, botnets have become one of the major threats to information security because they have been constantly evolving in both size and sophistication. A number of botnet detection measures, such as honeynet-based and Intrusion Detection System (IDS)-based ones, have been proposed. However, IDS-based solutions that use signatures seem to be ineffective because recent botnets are equipped with sophisticated code update and evasion techniques. Experimental results show that machine learning algorithms can be used effectively in botnet detection and that the random forest algorithm produces the best overall detection accuracy of over 90%.

Experimental Dataset-

To evaluate the classification performance of botnet domain names using machine learning algorithms, we use extracted and labeled domain name datasets, which include a set of benign domain names and a set of malicious domain names used by botnets. The set of benign domains includes the 30,000 top domain names ranked by Alexa. The domain names in the original datasets are processed to remove the top-level domain. For example, the domain name “example.com” becomes “example” after processing.
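The pre-processing step can be sketched as follows. This is only an assumption about how the top-level domain might be removed: it drops the last dot-separated label, and the text does not say how multi-part suffixes such as "co.uk" are handled. The helper name `strip_tld` is hypothetical.

```python
def strip_tld(domain):
    """Drop the last dot-separated label, assumed here to be the TLD."""
    name, sep, _tld = domain.rpartition(".")
    # If there is no dot at all, return the name unchanged.
    return name if sep else domain


print(strip_tld("example.com"))  # example
```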

Experimental Scenarios-

After pre-processing, the 18 features extracted from each domain name form one record in the experimental dataset. From the experimental dataset we selected three training datasets, namely T1, T2 and T3, and a testing dataset called TEST. Records from benign domain names are labeled “good” and records from malicious domain names are labeled “bad”. The testing data is not part of the training sets. Table 1 describes the dimensions of the components of the training datasets and the testing dataset.

The classification measures used in our experiments include PPV (Positive predictive value), FPR (False positive rate), TPR (True positive rate), ACC (Accuracy) and F1 (F1 measure). These measures are computed using the following formulas:
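These five measures have standard definitions in terms of the confusion matrix cells: PPV = TP / (TP + FP), FPR = FP / (FP + TN), TPR = TP / (TP + FN), ACC = (TP + TN) / (TP + TN + FP + FN) and F1 = 2TP / (2TP + FP + FN). The sketch below computes them using those textbook definitions. The function name and the counts are made up for illustration and are not the results of the experiment.

```python
def classification_measures(tp, tn, fp, fn):
    """Compute PPV, FPR, TPR, ACC and F1 from the four confusion-matrix cells.

    Assumes every denominator is non-zero.
    """
    return {
        "PPV": tp / (tp + fp),                   # positive predictive value
        "FPR": fp / (fp + tn),                   # false positive rate
        "TPR": tp / (tp + fn),                   # true positive rate
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # accuracy
        "F1": 2 * tp / (2 * tp + fp + fn),       # F1 measure
    }


# Hypothetical counts, not taken from the experiment.
measures = classification_measures(tp=90, tn=85, fp=15, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The F1 form used here is equivalent to the harmonic mean of PPV and TPR, 2 * PPV * TPR / (PPV + TPR).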

The confusion matrix -

Experimental Results-

The following are the model’s experimental results using 4 machine learning algorithms on the T1, T2 and T3 training datasets and the TEST testing dataset described above.

Classification performance of the detection model using T1 training set.

Classification performance of the detection model using T2 training set.

Classification performance of the detection model using T3 training set.

Conclusion -

The experimental results on the DGA botnet and FF botnet datasets show that most of the machine learning techniques used in the model achieved an overall classification accuracy of over 85%, among which the random forest algorithm gives the best result with an overall classification accuracy of 90.80%. Based on this result, we propose to select the random forest algorithm for the proposed botnet detection model, evaluated using the confusion matrix. In the future, we will continue to test the proposed model with larger datasets and analyze the effects of the domain name features on the detection accuracy, as well as research and propose new features to improve the detection accuracy of the proposed model.
