Tuesday, November 29, 2016

Hunting for Malware Critical Process Impersonation

A popular technique for hiding malware running on Windows systems is to give it a name that's confusingly similar to a legitimate Windows process, preferably one that is always present on all systems. The processes that Windows normally starts during boot (which I call the critical system processes) make good targets. Probably the most stereotypical example of this is malware running as scvhost.exe (a misspelling of the actual svchost.exe process). As it turns out, though, if you are collecting information about processes running on your systems, this type of activity is pretty easy to spot.
When it comes to comparing strings, most of us are familiar with equality since this is a basic operation in most languages. If str1 == str2 then they match! But string matching isn't always a binary answer. In this case, we specifically don't want to find strings that match exactly, nor do we want to find things that are wildly different. We're really looking for a narrow window of string similarity. That is, we're trying to identify processes that are so similar that they could easily be confused with the real thing, while not actually being identical to the real thing.
As it turns out, there are a number of ways to compute string similarity. They all work a bit differently and have particular strengths and weaknesses, but they typically accept two strings, compare them, and produce some sort of similarity score, like so:
score = compare(str1, str2)
As long as you're using the same algorithm for all the comparisons, you can use the scores to judge which pairs of strings are more or less similar than the others. Probably the most well-known algorithm for this is the Levenshtein distance. The resultant score is simply a count of the minimum number of single character insertdelete or modify operations it takes to convert str1 into str2. For example, the Levenshtein distance between 'svchost.exe' and 'scvhost.exe' (our example above) is 2 (delete the 'v', then add a new 'v' just after the 'c').
Because the algorithm is so simple, the Levenshtein distance would make a pretty decent choice for most uses, though in this case we're going to use a variant known as the Damerau-Levenshtein distance. The only difference here is that the Damerau-Levenshtein distance adds the transpose (switch adjacent characters) operation. Since transposition is one of the common techniques for creating confusingly similar filenames, it makes sense to account for this in our distance algorithm. With the Damerau-Levenshtein algorithm, the distance between 'svchost.exe' and 'scvhost.exe' is 1 (transpose the 'v' and the 'c').
With that background in mind, let's see how we can apply it across our network to detect malware masquerading as critical system processes.

The Hypothesis

Processes whose names are confusingly similar to those of critical system processes are likely to be malicious.

The Data

In my example, I'm looking at the full path to the binary on disk for any process that actually ran. Binaries that never ran will not be included in my analysis.  It's important to note that I'm comparing the path to the binary on disk, not the command line the system says was run, which is arguably the more inclusive check. In reality, you probably want to check both, since it's also quite common for malware to lie about it's command line. However, the command lines in my dataset require a lot of parsing and normalization, while the binary paths are pre-normalized. Using only the binaries makes a much clearer demonstration of the technique. Just be aware that for production use, you probably should spend the effort to normalize your command lines and check them, too.

One final note: on medium to large networks, you'll find that it won't be easy to hold all your process execution data in memory at once. My dataset is rather small, so I don't have to worry about it (again, it makes for a cleaner demonstration) but for larger-scale use, you may need to page your search results, convert the code into Spark or do something else clever to break things up a bit.

The Hunt

First, import some important modules.
import re
from ntpath import basename 
from jellyfish import damerau_levenshtein_distance
If you are a Python programmer, you probably recognize most of these.  The jellyfish module contains a number of routines that compute the "distance" between two strings, which is what this whole hunt is based on.

The following is a nested dictionary that defines the system processes we consider to be the most likely targets of this type of process impersonation. 

CRITICAL_PROCESSES = {'svchost.exe': {"threshold": 2,
                      'smss.exe': {"threshold": 1,
                                      "whitelist":['c:\\program files (x86)\\internet explorer\\iexplore.exe',
                                             'c:\\program files\\internet explorer\\iexplore.exe']},

For each of these processes, we define a distance threshold (distances greater than zero but less than or equal to this value will be considered suspicious). We also keep a whitelist of the full path names to various legitimate binaries we found that would otherwise still register as suspicious. Feel free to tinker with these thresholds and whitelists, since they are likely to be very system dependent.  I had to run through several iterations with a subset of my data before I got values I was happy with.

The workhorse of this hunt is the similarity() function, which is defined as:

def similarity(proc, critical_procs=CRITICAL_PROCESSES):
    for crit_proc in critical_procs.keys():
        distance = damerau_levenshtein_distance(basename(proc).lower().decode('utf-8'), crit_proc.decode('utf-8'))
        if (distance > 0 and distance <= critical_procs[crit_proc]["threshold"]) and not proc in critical_procs[crit_proc]["whitelist"]:
            return (crit_proc, distance)

    return None

Pass similarity() the full path to a binary (e.g., c:\windows\system32\blah.exe) and the critical processes dictionary above, and it'll do the checks. If it finds something that's confusingly similar to one of the defined critical processes, it'll return a tuple containing the critical process to which it is similar and the distance score, like so: (crit_proc, distance). It'll stop checking the given process as soon as it finds a close match that's not whitelisted, so at most one result will be returned. If it finds no matches, it'll simply return None.

It's probably important to point out here that this function intentionally only compares the filenames and ignores the path.  In other words, it should find c:\windows\system32\scvhost.exe, but isn't concerned with things like c:\windows\temp\svchost.exe, where the filename is correct but the directory path is obviously bogus.  We can save this for a future hunt, though it's not hard to imagine combining the two!

At this point you need some process execution data!  The exact way you get this will vary from site to site, so I won't try to provide Python code for this.  Suffice it to say that my dataset is in a database, so I just queried the entire set via a SQL query:

SELECT process_guid, path FROM CarbonBlack WHERE type = 'ingress.event.procstart'

For each unique process execution, Carbon Black provides not only the full file path (the path value), but also a unique ID called the process_guid. If the process turns out to be suspicious, we'll need that so we can investigate further.  The code I used to pull this out of the database simply returns an list of tuples, like (process_guid, path), which is what I'll assume yours does, too.  If it comes some other way, the following code segment may need a bit of tweaking.

Finally, we just have to go through the results, weeding out the non-Windows processes (I have some Linux servers in my dataset) and compute similarity for each. If the similarity() function returns any results, the code prints them along with the GUID so we can follow up in our favorite threat hunting platform.

windows_regex = re.compile("^([a-z]:)|(\\SystemRoot)")

for i in processes_results:
    (guid, path) = i
    if windows_regex.match(path):
        res = similarity(path)
        if res:
            (crit_proc, distance) = res
            print "Process %s is suspiciously similar to %s (distance %d)" % (path, crit_proc, distance)
            print "\tGUID: %s" % guid

When it finds suspicious processes, the code will produce output similar to the following:

Process c:\windows\system32\csrrs.exe is suspiciously similar to csrss.exe (distance 1)
        GUID: 00000009-0000-0477-01d1-f282c48bc278
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-0580-01d1-d883914e50c7
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-057c-01d1-d87e79c39834
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-1330-01d1-d87e72d3d61e
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-13d4-01d1-d87d971196b0
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-1268-01d1-d30d8b3d6765
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-0254-01d1-d309333a60d1
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-158c-01d1-d3091a4e674b
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-1300-01d1-d3083db995cc

As you can see, we did find a few suspicious processes. One of them, csrrs.exe looks pretty suspicious! Fortunately, the rest were all the ls.exe binary provided as part of the OpenSSH package. Since OpenSSH is authorized on our network, this would be a good candidate to add to the lsm.exe whitelist so it doesn't keep coming up whenever we run our hunt.


Even though this hunt is looking at a very specific technique for hiding malware, it can still be effective.  If you have endpoint process execution data, looking for processes running with confusing names can be pretty helpful and fairly straightforward to implement.


  1. Fascinating post, David, thanks for sharing it! I like the idea of data reduction using something like this.

    On a side note, I was at the Naval Postgrad School, taking a course where we discussed the Hamming Distance in binary communications codes, only to take a seminar from Dr. Hamming himself! Also, I first showed up at NPS just a few months before Gary Kildall passed away, and walked by his office just about every day!

  2. This comment has been removed by the author.