K-means Clustering and its real use-case in the security domain

5 min read · Sep 2, 2021

K-means Clustering:-

K-Means is one of the most popular “clustering” algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than any other centroid.

K-Means Clustering belongs to a group of learning algorithms called unsupervised learning. In this type of learning, the data points come without any label or category.
Each data point in your dataset is a vector of features, with no label that assigns it to a particular set or category. The algorithm learns on its own to group data points with similar features together.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
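As a minimal sketch of this alternation, assuming plain Python with no external libraries: the function name `kmeans`, its parameters `max_iter` and `seed`, and the sample points are illustrative choices, not part of the description above.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialise the centroids with k data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # (1) Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # (2) Update step: each centroid moves to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The result depends on the random initial centroids, so on harder data it is common to run the procedure several times with different seeds and keep the best run.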

Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means. In each iteration, we assign each training example to the closest cluster centroid (shown by “painting” the training examples the same color as the cluster centroid to which it is assigned); then we move each cluster centroid to the mean of the points assigned to it.

Implementation of K-Means Clustering:-

Step 1: Let’s choose the number of clusters k, say K = 2, to divide the data into groups. We then select two random points that will serve as the initial centroids of the clusters.

Step 2: We now assign each data point to its nearest centroid, based on its distance from each of the K centroids. This can be done by drawing a middle line between the two centroids.

Step 3: The points on the left side of the line are closer to the blue centroid, and the points on the right side of the line are closer to the yellow centroid. The points on the left form one cluster with the blue centroid, and the points on the right form another with the yellow centroid.

Step 4: Repeat the process by choosing new centroids. To choose them, we compute the center of gravity (the mean) of the points in each cluster, as shown below:

Step 5: Next, we reassign each data point to its nearest new centroid, repeating the same process as above (using the middle line). A yellow data point that falls on the blue side of the middle line will now be included in the blue cluster.

Step 6: Since points have been reassigned, we repeat the step above to compute new centroids.

Step 7: We repeat the process of finding the center of gravity of each cluster to get the new centroids, as shown below:

Step 8: After finding the new centroids, we redraw the middle line and reassign the data points, as in the steps above.

Step 9: Finally, we divide the points based on the middle line so that two groups are formed and no point lies on the wrong side of the line. At that point the assignments stop changing and the algorithm is finished.

The final clusters are as follows:
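For K = 2, splitting the points with the middle line gives the same result as assigning each point to its nearest centroid, because the middle line is the perpendicular bisector of the segment joining the two centroids. The sketch below checks this on a few made-up points; the centroid coordinates, the points and both function names are assumptions for illustration only.

```python
import math


def side_of_middle_line(point, c1, c2):
    """Return 0 if `point` lies on c1's side of the middle line, else 1.

    The middle line is the perpendicular bisector of the segment c1-c2.
    """
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    direction = [b - a for a, b in zip(c1, c2)]  # points from c1 towards c2
    # Project (point - midpoint) onto the c1 -> c2 direction.
    projection = sum((p - m) * d for p, m, d in zip(point, midpoint, direction))
    return 0 if projection <= 0 else 1


def nearest_centroid(point, c1, c2):
    """Return 0 if `point` is at least as close to c1 as to c2, else 1."""
    return 0 if math.dist(point, c1) <= math.dist(point, c2) else 1


blue, yellow = (2.0, 3.0), (7.0, 5.0)  # hypothetical centroids
points = [(1.0, 1.0), (3.0, 6.0), (4.0, 4.0), (5.0, 4.0), (8.0, 2.0), (6.0, 7.0)]

for p in points:
    # Both rules must agree for every point.
    assert side_of_middle_line(p, blue, yellow) == nearest_centroid(p, blue, yellow)
    print(p, "blue" if nearest_centroid(p, blue, yellow) == 0 else "yellow")
```

With more than two clusters there is no single dividing line, so implementations usually compare distances to all centroids directly.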

Choosing the Right Number of Clusters:-

The number of clusters we choose for the algorithm should not be arbitrary.

Each cluster is formed by calculating and comparing the distances of the data points within the cluster from its centroid.

We can select the appropriate number of clusters with the help of the Within-Cluster-Sum-of-Squares (WCSS) method.

WCSS is the sum of the squared distances of the data points in each cluster from that cluster’s centroid.
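Written as a formula, with notation assumed here for illustration ($C_j$ is the set of points in cluster $j$, $\mu_j$ is its centroid, and $x_i$ is a data point):

$$
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$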

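The main idea is to minimize the distance between the data points and the centroid of their cluster. Because WCSS keeps decreasing as more clusters are added, the process is repeated for increasing numbers of clusters until the decrease levels off (the “elbow”), and that number of clusters is chosen.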
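To see how WCSS behaves as the number of clusters changes, here is a small sketch. The data and the three hand-made partitions are hypothetical (in practice each partition would come from running k-means with that number of clusters), and `centroid` and `wcss` are illustrative helper names.

```python
def centroid(cluster):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def wcss(clusters):
    """Sum of squared distances of every point to its own cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in cluster)
    return total


# Hypothetical data: two compact groups of four points each.
a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]

print(wcss([a + b]))            # one cluster     -> 200.0
print(wcss([a, b]))             # two clusters    -> 4.0
print(wcss([a, b[:2], b[2:]]))  # three clusters  -> 3.0
```

The drop from one to two clusters is large, while going from two to three barely helps, which suggests two clusters for this data.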

Real Use-Case in Security Domain:-

Cyber-Profiling Criminals

The process of profiling criminals is often referred to as criminal investigation analysis or cyber-crime profiling. Criminal profiles are generated as data on the personal characteristics, tendencies, habits, and geographic characteristics of the offender (e.g. age, gender, socio-economic status, education, location). Building a criminal profile involves analyzing the physical evidence available at the crime scene, studying the victim (victimology), identifying the modus operandi (whether the crime was planned or unplanned), and tracing the marks the perpetrator deliberately left behind (signature).

A newer approach to cyber profiling uses k-means clustering to group web-based content according to users’ preference data. These preferences can be interpreted as an initial grouping of the data, so that the resulting clusters reveal user profiles.

A user profile can be seen as the result of the user’s interests, goals, characteristics, behavior and preferences. User profiles are built from the user’s background information. The user profile represents a concept model held by the user when searching for web information.
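Before k-means can group users, each user’s preferences have to be expressed as a numeric vector. The sketch below shows one possible way to do that, under loud assumptions: the browsing log, the user names and the content categories are all made up, and representing a profile as the share of visits per category is an illustrative choice, not the method of any particular study.

```python
from collections import Counter

# Hypothetical browsing log: (user, content category) pairs.
visits = [
    ("user_a", "news"), ("user_a", "news"), ("user_a", "sports"),
    ("user_b", "forums"), ("user_b", "forums"),
    ("user_b", "downloads"), ("user_b", "forums"),
    ("user_c", "news"), ("user_c", "sports"), ("user_c", "sports"),
]

# Fixed category order so every vector has the same layout.
categories = sorted({category for _, category in visits})


def preference_vector(user):
    """Share of the user's visits that fall into each category."""
    counts = Counter(category for u, category in visits if u == user)
    total = sum(counts.values())
    return [counts[c] / total for c in categories]


profiles = {user: preference_vector(user) for user in sorted({u for u, _ in visits})}

print(categories)
for user, vector in profiles.items():
    print(user, vector)
```

Vectors like these could then be passed to a k-means routine, so that users with similar browsing preferences end up in the same cluster.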

Thanks for Reading.. 😍