Applying Data Science to Malware —Part 1


May 14, 2024

With Malware exploding in numbers, I decided to learn and apply Data Science to Malware.

So first I need a number of Malware samples, which I obtained from

The following techniques can work on any set of Malware. If you’re a business or organization that is being targeted, or you’ve been following a certain group of Malware authors, you may want to see how their Malware is connected: whether the samples share resources, hosts, code, and so on. That would yield some interesting data and start to paint a picture.

Unfortunately, I don’t have access to those sets of Malware, but that doesn’t mean we can’t apply the techniques to Malware collected from honeypots.

Ransomware samples

From the Malware samples, the Ransomware folder looks to have a number of samples we could apply the techniques on.

Step one: unzip all the Malware within that dir:

find . -name "*.zip" | while read -r filename; do 7z x "$filename" -pinfected -aou; done

Step two: start building the script

I won’t post the whole script here; I’ll add a link at the bottom once I put it up on GitHub.

So let’s take a look at the interesting stuff:

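import os
import subprocess
import pefile

#args, network and find_hostname are defined elsewhere in the full script
for root, dirs, files in os.walk(args.target_path):
    for path in files:
        fullpath = os.path.join(root, path)
        #try opening the file with pefile to see if it's really a PE file
        try:
            pe = pefile.PE(fullpath)
        except pefile.PEFormatError:
            continue
        #extract printable strings from the target sample
        strings = subprocess.run(["strings", fullpath], capture_output=True,
                                 text=True, errors="replace").stdout
        #use the find_hostname function to find hostnames
        hostnames = find_hostname(strings)
        if hostnames:
            #add the nodes and edges for the bipartite network:
            #malware nodes are black (bipartite=0), hostname nodes are blue (bipartite=1)
            network.add_node(path, color="black", bipartite=0)
            for hostname in hostnames:
                network.add_node(hostname, color="blue", bipartite=1)
                network.add_edge(hostname, path)
            print("extracted hostnames from:", path)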

What this does is walk through each file in the given directory and check whether it has a PE header. If it does, we run the program “strings” on it and pass the resulting strings to a function called “find_hostname”. I’ve not posted that function here, but it uses a regex to clean up the input and then checks each candidate against a list of domain suffixes: if a string ends in one of those suffixes, it is accepted as a domain.
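
Since “find_hostname” isn’t shown in this post, here is a minimal sketch of how such a function could look. It is not the script’s actual implementation: the suffix list, both regexes and the extract_strings helper (a rough pure-Python stand-in for the “strings” program) are assumptions made purely for illustration.

```python
import re

# Hypothetical suffix list for illustration; a real script would use a much longer one.
DOMAIN_SUFFIXES = {"com", "net", "org", "info", "biz"}

# Runs of 4 or more printable ASCII bytes, roughly what `strings` reports by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Dot-separated labels made of letters, digits and hyphens, ending in a letters-only label.
HOSTNAME_CANDIDATE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_strings(data):
    """Return the printable ASCII runs found in a bytes object."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def find_hostname(text, suffixes=DOMAIN_SUFFIXES):
    """Return the sorted, de-duplicated hostname candidates whose last label is a known suffix."""
    hostnames = set()
    for candidate in HOSTNAME_CANDIDATE.findall(text):
        candidate = candidate.lower()
        # Things like "kernel32.dll" look like hostnames to the regex,
        # so only keep candidates that end in a known domain suffix.
        if candidate.rsplit(".", 1)[-1] in suffixes:
            hostnames.add(candidate)
    return sorted(hostnames)


if __name__ == "__main__":
    sample = b"\x00\x01MZ\x90\x00hello.example.com\x00ab\x00kernel32.dll\xff"
    print(find_hostname("\n".join(extract_strings(sample))))
```

The suffix check is what keeps the noise down: file names such as DLLs match the hostname pattern too, so without it the network would fill up with nodes that aren’t hosts at all.
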

Then we create our network.

If “find_hostname” returns a non-empty list of hostnames, we create a node for that malware sample.

Next we create a node for each hostname we found and an edge connecting that hostname to the malware sample.

And then we print the name of the sample to the screen.
The results are, to me anyway, interesting. We can see the following:

extracted hostnames from smb-b4tq2hti.bin

Note — I have removed some characters from the hostname, safety first :)

Now let’s save everything to a “.dot” file so that we can convert the network into a visual graph.

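from networkx.algorithms import bipartite
from networkx.drawing.nx_pydot import write_dot

#write the full network to disk (the argument name output_file is assumed)
write_dot(network, args.output_file)
#split the nodes into the malware side and the hostname side
malware = set(n for n, d in network.nodes(data=True) if d['bipartite']==0)
hostname = set(network)-malware
#use networkX's bipartite network projection function to produce the malware and hostname projections
malware_network = bipartite.projected_graph(network, malware)
hostname_network = bipartite.projected_graph(network, hostname)
#write the projected networks to disk as specified by the user
write_dot(malware_network, args.malware_projection)
#(the argument name hostname_projection is assumed)
write_dot(hostname_network, args.hostname_projection)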

So we will have 3 files, the first being the whole network, the second to show the connection between the Malware, and the third to show the connections between the hostnames.
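
To make the idea of a projection concrete, here is a small plain-Python sketch with made-up sample and host names. It is not how NetworkX implements projected_graph, and the build_bipartite and project helpers are hypothetical; it only illustrates the rule that two nodes on the same side get linked when they share at least one neighbour on the other side.

```python
from collections import defaultdict
from itertools import combinations


def build_bipartite(samples):
    """samples maps a malware file name to the hostnames found in it."""
    neighbours = defaultdict(set)
    for malware, hostnames in samples.items():
        for hostname in hostnames:
            neighbours[malware].add(hostname)
            neighbours[hostname].add(malware)
    return neighbours


def project(neighbours, nodes):
    """Link two nodes from `nodes` whenever they share at least one neighbour."""
    edges = set()
    for a, b in combinations(sorted(nodes), 2):
        if neighbours[a] & neighbours[b]:
            edges.add((a, b))
    return edges


# Made-up data: two samples share a host, a third one stands alone.
samples = {
    "sample_a.bin": ["c2.example.com", "cdn.example.net"],
    "sample_b.bin": ["c2.example.com"],
    "sample_c.bin": ["other.example.org"],
}

network = build_bipartite(samples)
malware = set(samples)
hostname = set(network) - malware

malware_edges = project(network, malware)
hostname_edges = project(network, hostname)

if __name__ == "__main__":
    print("malware projection:", malware_edges)
    print("hostname projection:", hostname_edges)
```

In the malware projection, sample_a.bin and sample_b.bin end up connected because both contain c2.example.com, which is exactly the kind of link we are hoping to spot between real samples.
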

python /home/osboxes/myAnalysis/malware-samples/Ransomware/extracted ./ ./ ./

We then use fdp (one of many choices but it is suited for a small network) to convert the “.dot” files into images:

fdp -T png -o orignal_ransomeware_image.png

And now let’s view the results:

Original network, showing the connection between the Malware and hostnames:

A network of Ransomware and hostnames

The blue circles represent hostnames and the black circles represent the Malware. Granted, it’s quite hard to actually see the connections at the size Medium displays the image, but this is what I was hoping to see.
The cluster on the left is from the “Wannacry” folder, and we would expect to see those files and hostnames linked.

Next, let’s view just the Malware connections:

Malware projection

And the hostname projection:

Hostname projection

Visually, the hostname projection doesn’t tell us much, so it’s going to take me some time to adjust that network to make it more visually pleasing and useful.

I won’t do a review of my findings, as the goal here is just to apply what I’ve learned to some real-world Malware. I’m quite happy with the results and itching to see how else we can feed what we learn from Malware analysis into these methods.

I did also try to build a graph based on image relationships for the ransomware, which is done by extracting the images embedded in the malware. The results were far less “exciting”, but that could be because the malware doesn’t use images or because they are obfuscated.
Either way, we can use the same methods on different samples to see what they yield.

Everything I learned and applied in the above is from the book “Malware Data Science”. I highly recommend it.

