Tuesday, November 29, 2016

Hunting for Malware Critical Process Impersonation

A popular technique for hiding malware running on Windows systems is to give it a name that's confusingly similar to a legitimate Windows process, preferably one that is always present on all systems. The processes that Windows normally starts during boot (which I call the critical system processes) make good targets. Probably the most stereotypical example of this is malware running as scvhost.exe (a misspelling of the actual svchost.exe process). As it turns out, though, if you are collecting information about processes running on your systems, this type of activity is pretty easy to spot.
When it comes to comparing strings, most of us are familiar with equality since this is a basic operation in most languages. If str1 == str2 then they match! But string matching isn't always a binary answer. In this case, we specifically don't want to find strings that match exactly, nor do we want to find things that are wildly different. We're really looking for a narrow window of string similarity. That is, we're trying to identify processes that are so similar that they could easily be confused with the real thing, while not actually being identical to the real thing.
As it turns out, there are a number of ways to compute string similarity. They all work a bit differently and have particular strengths and weaknesses, but they typically accept two strings, compare them, and produce some sort of similarity score, like so:
score = compare(str1, str2)
As long as you're using the same algorithm for all the comparisons, you can use the scores to judge which pairs of strings are more or less similar than the others. Probably the most well-known algorithm for this is the Levenshtein distance. The resultant score is simply a count of the minimum number of single character insertdelete or modify operations it takes to convert str1 into str2. For example, the Levenshtein distance between 'svchost.exe' and 'scvhost.exe' (our example above) is 2 (delete the 'v', then add a new 'v' just after the 'c').
Because the algorithm is so simple, the Levenshtein distance would make a pretty decent choice for most uses, though in this case we're going to use a variant known as the Damerau-Levenshtein distance. The only difference here is that the Damerau-Levenshtein distance adds the transpose (switch adjacent characters) operation. Since transposition is one of the common techniques for creating confusingly similar filenames, it makes sense to account for this in our distance algorithm. With the Damerau-Levenshtein algorithm, the distance between 'svchost.exe' and 'scvhost.exe' is 1 (transpose the 'v' and the 'c').
With that background in mind, let's see how we can apply it across our network to detect malware masquerading as critical system processes.

The Hypothesis

Processes whose names are confusingly similar to those of critical system processes are likely to be malicious.

The Data

In my example, I'm looking at the full path to the binary on disk for any process that actually ran. Binaries that never ran will not be included in my analysis.  It's important to note that I'm comparing the path to the binary on disk, not the command line the system says was run, which is arguably the more inclusive check. In reality, you probably want to check both, since it's also quite common for malware to lie about it's command line. However, the command lines in my dataset require a lot of parsing and normalization, while the binary paths are pre-normalized. Using only the binaries makes a much clearer demonstration of the technique. Just be aware that for production use, you probably should spend the effort to normalize your command lines and check them, too.

One final note: on medium to large networks, you'll find that it won't be easy to hold all your process execution data in memory at once. My dataset is rather small, so I don't have to worry about it (again, it makes for a cleaner demonstration) but for larger-scale use, you may need to page your search results, convert the code into Spark or do something else clever to break things up a bit.

The Hunt

First, import some important modules.
import re
from ntpath import basename 
from jellyfish import damerau_levenshtein_distance
If you are a Python programmer, you probably recognize most of these.  The jellyfish module contains a number of routines that compute the "distance" between two strings, which is what this whole hunt is based on.

The following is a nested dictionary that defines the system processes we consider to be the most likely targets of this type of process impersonation. 

CRITICAL_PROCESSES = {'svchost.exe': {"threshold": 2,
                      'smss.exe': {"threshold": 1,
                                      "whitelist":['c:\\program files (x86)\\internet explorer\\iexplore.exe',
                                             'c:\\program files\\internet explorer\\iexplore.exe']},

For each of these processes, we define a distance threshold (distances greater than zero but less than or equal to this value will be considered suspicious). We also keep a whitelist of the full path names to various legitimate binaries we found that would otherwise still register as suspicious. Feel free to tinker with these thresholds and whitelists, since they are likely to be very system dependent.  I had to run through several iterations with a subset of my data before I got values I was happy with.

The workhorse of this hunt is the similarity() function, which is defined as:

def similarity(proc, critical_procs=CRITICAL_PROCESSES):
    for crit_proc in critical_procs.keys():
        distance = damerau_levenshtein_distance(basename(proc).lower().decode('utf-8'), crit_proc.decode('utf-8'))
        if (distance > 0 and distance <= critical_procs[crit_proc]["threshold"]) and not proc in critical_procs[crit_proc]["whitelist"]:
            return (crit_proc, distance)

    return None

Pass similarity() the full path to a binary (e.g., c:\windows\system32\blah.exe) and the critical processes dictionary above, and it'll do the checks. If it finds something that's confusingly similar to one of the defined critical processes, it'll return a tuple containing the critical process to which it is similar and the distance score, like so: (crit_proc, distance). It'll stop checking the given process as soon as it finds a close match that's not whitelisted, so at most one result will be returned. If it finds no matches, it'll simply return None.

It's probably important to point out here that this function intentionally only compares the filenames and ignores the path.  In other words, it should find c:\windows\system32\scvhost.exe, but isn't concerned with things like c:\windows\temp\svchost.exe, where the filename is correct but the directory path is obviously bogus.  We can save this for a future hunt, though it's not hard to imagine combining the two!

At this point you need some process execution data!  The exact way you get this will vary from site to site, so I won't try to provide Python code for this.  Suffice it to say that my dataset is in a database, so I just queried the entire set via a SQL query:

SELECT process_guid, path FROM CarbonBlack WHERE type = 'ingress.event.procstart'

For each unique process execution, Carbon Black provides not only the full file path (the path value), but also a unique ID called the process_guid. If the process turns out to be suspicious, we'll need that so we can investigate further.  The code I used to pull this out of the database simply returns an list of tuples, like (process_guid, path), which is what I'll assume yours does, too.  If it comes some other way, the following code segment may need a bit of tweaking.

Finally, we just have to go through the results, weeding out the non-Windows processes (I have some Linux servers in my dataset) and compute similarity for each. If the similarity() function returns any results, the code prints them along with the GUID so we can follow up in our favorite threat hunting platform.

windows_regex = re.compile("^([a-z]:)|(\\SystemRoot)")

for i in processes_results:
    (guid, path) = i
    if windows_regex.match(path):
        res = similarity(path)
        if res:
            (crit_proc, distance) = res
            print "Process %s is suspiciously similar to %s (distance %d)" % (path, crit_proc, distance)
            print "\tGUID: %s" % guid

When it finds suspicious processes, the code will produce output similar to the following:

Process c:\windows\system32\csrrs.exe is suspiciously similar to csrss.exe (distance 1)
        GUID: 00000009-0000-0477-01d1-f282c48bc278
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-0580-01d1-d883914e50c7
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-057c-01d1-d87e79c39834
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-1330-01d1-d87e72d3d61e
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-13d4-01d1-d87d971196b0
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-1268-01d1-d30d8b3d6765
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-0254-01d1-d309333a60d1
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-158c-01d1-d3091a4e674b
Process c:\program files\openssh\bin\ls.exe is suspiciously similar to lsm.exe (distance 1)
 GUID: 00000009-0000-1300-01d1-d3083db995cc

As you can see, we did find a few suspicious processes. One of them, csrrs.exe looks pretty suspicious! Fortunately, the rest were all the ls.exe binary provided as part of the OpenSSH package. Since OpenSSH is authorized on our network, this would be a good candidate to add to the lsm.exe whitelist so it doesn't keep coming up whenever we run our hunt.


Even though this hunt is looking at a very specific technique for hiding malware, it can still be effective.  If you have endpoint process execution data, looking for processes running with confusing names can be pretty helpful and fairly straightforward to implement.

Monday, September 26, 2016

Detecting Data Staging & Exfil Using the Producer-Consumer Ratio

In their FloCon 2014 presentation PCR - A New Flow Metric, Carter Bullard and John Gerth introduced the idea of the Producer-Consumer Ratio (PCR) for measuring and tracking shifts in the typical pattern of network communication for each host. PCR is calculated on a per-host basis, like this:

(bytes sent - bytes recvd)
(bytes sent + bytes recvd)

This is an interesting metric, because it gives a good indication of the traffic pattern yet ignores many details that tend to complicate understanding, such as the actual volume of data sent or received, the number of flows, the amount of packets, etc. It boils everything down to one simple number in the range [-1.0,1.0]. They provided the following chart to give a rough idea how to interpret the PCR values:
PCRhost role
1.0pure push - FTP upload, multicast, beaconing
0.470:30 export - Sending Email
0.0Balanced Exchange - NTP, ARP probe
-0.53:1 import - HTTP Browsing
-1.0pure pull - HTTP Download

The idea is that you can track the PCR for each host over time and look for shifts to identify significant changes that might indicate possible data exfiltration. I recently came across this technique as I was reviewing contributions to The ThreatHunting Project and though it sounded like a fun thing to play around with. I decided to give it a try on some test data I had laying around, just to see how it would work.

The Hypothesis

By comparing a host's baseline PCR to it's current PCR and looking for large shifts, we should be able to identify hosts that are exfiltrating the data to the Internet. By extension, we should also be able to use the PCR shift to identify central staging points for data, where threat actors gather it in preparation for exfiltration.

The Data

The data I used comes from a test lab which features a realistic corporate Windows environment in small scale (about 40 hosts) as well as a simulated user population that does things like access file servers, send/receive email, browse the web, etc. There's also a simulated Internet. 
In this lab, we monitor our network with Bro, so I used Bro's connection logs (conn.log files) as my data source. The exact format of these files doesn't really matter here, and you can easily adapt this to any flow data you happen to have (argus, SANCP, etc).
I should also point out that in this attack scenario, the same host was used for both data staging and data exfil. This isn't much of a problem when calculating PCR, since the staging and exfil detections each calculate PCR on a different subset of the data (those flows traversing the Internet perimeter for the exfil, and those staying only internal for the staging). Therefore, the combination of big inbound and outbound data transfers don't interfere with each other. Were I to ignore this and just compute PCR across all flows in the dataset, I'd probably have gotten a much more balanced PCR ratio, and therefore the staging and exfil on the same host would cancel each other out. This all just goes to show that PCR-based methods should always take the network vantage point(s) into account, or risk missing things that are anomalous in both directions.

Dealing With Production Datasets

Since this is a test lab, I have both a "clean" dataset (no threat actor activity) and one that contains a mixture of legitimate use and attack traffic (we'll call this the "dirty" dataset). Most readers, though, probably aren't so lucky. If you're trying to do this with your own data pulled from a production network, try defining the dirty data as everything during the previous 7 days and the clean data as anything before that (perhaps to a maximum of 30 or 60 days). Even though your "clean" data may not actually be totally clean, the more you have, the less likely any transient fluctuations are to distort your baseline PCRs.

Exfil Detection

Exfil is data going from inside our network to the Internet, so I started by filtering my flows to select only those where the source (orig_h) is an internal IP and the dest (resp_h) is an Internet host. In plain language, I selected only at flows that crossed the network's perimeter (transit to/from Internet).  Then I simply summed up the bytes sent and bytes received for each host (Bro's orig_ip_bytes and resp_ip_bytes columns, respectively).  
Note that since Bro records bi-directional flows, I had to calculate PCR for any host that appeared as either a source or a destination.  Further, each "destination" host not only received bytes, but sent some of its own, so I had to sum the resp_ip_bytes twice: once as the bytes received by the src host and once as the bytes sent by the dest host.  Ditto for the orig_ip_bytes, but in reverse.  In psuedo code, it would look something like this:
srchost_bytes_sent = sum(orig_ip_bytes)
srchost_bytes_recvd = sum(resp_ip_bytes)

dsthost_bytes_sent = sum(resp_ip_bytes)
dsthost_bytes_recvd = sum(orig_ip_bytes)

I was looking for hosts that have a fairly large PCR shift in the positive direction. In the extreme case, the baseline value would be -1.0 (a pure consumer) and the dirty value would be +1.0 (a pure producer). To make those hosts show up nicely, I calculated the shift as (dirty PCR - baseline PCR). In the best case, the shift would therefore be 2.0.

I constructed a scatter plot of PCR shift for each host, showing the baseline PCR on the X axis, the "dirty" PCR on the Y axis, and the amount of PCR shift as the color. The trend line provides an easy reference to see where "no PCR shift" would fall on the graph, and makes it a bit easier to eyeball the distances for the outliers.
Most hosts will tend to adhere closely to their own baselines, given enough data points. Therefore I would expect that the plot should look very much like a straight diagonal line (the graph of the line y=x). Furthermore, most colors should come out as a nice neutral grey.
Red hosts are more likely to be exfiltration, since those hosts shifted the most in a positive direction (towards being a producer). Theoretically, very blue hosts could indicate the exfil destinations (they suddenly consumed a lot more data than normal). However, being Internet hosts, we can't count on that. I only plotted hosts that have PCR values in both the baseline and in the "dirty" dataset. If I'd never seen a particular Internet host before (as is probably the case with threat actor infrastructure), it wouldn't show up. In practice, once you know the exfil host it's probably not too difficult to identify the host(s) that received the data, but if this were an issue you could try to do something smarter here.
As you can see in the graph above, the most red host isn't actually all that red (the exfil package wasn't very large), but it is the host that exfiltrated the data in our attack scenario.

Staging Detection

Data staging is defined as data moving purely internal to a network. This should show up as a hosts consumer ratio becoming more negative (the staging host). It may also show as some hosts (the hosts from which the data was stolen) becoming more positive if a lot of data was stolen from them.
For this part of the analysis, I specifically filtered the flows to only those  that both originated and terminated within the internal network.  As before, I calculated the PCR shift for each src & dest host in the dataset.  I calculated the PCR shift slightly differently here, since I was looking for different activity. In fact, I was trying to find the inverse of what I was looking for before, so I inverted the shift calculation, too. That is, in the best case, a system would go from a pure producer (PCR 1.0) to being a pure consumer (PCR -1.0). I calculated PCR shift here as (baseline PCR - dirty PCR), which would again mean a PCR shift of 2.0 for a host staging data.  I could have skipped this inversion, but it's an easy way to make the most significant shift look red on the graph, which I like.
I then constructed a new scatter plot of PCR shift for each host, showing the baseline PCR on the X axis, the "dirty" PCR on the Y axis, and the PCR shift as the color, just like the previous graph.
Again, hosts adhering to their own baselines would tend to fall near the diagonal y=x line, and be colored grey. Red hosts are more likely to be unusually large consumers of data relative to their own baselines, and therefore may represent systems being used as data staging points. In fact, as indicated above, the most red host is actually the host that was used to stage the stolen data. It's very red, indicating that it experienced a large shift.
Unlike in the previous graph, where blue points weren't likely to be very useful, they could very well indicate the sources of the stolen data in this graph. If the attacker stole a significant amount of data from any given system, it may be that the act of that victim host transferring it's data to the staging point caused a significant PCR shift in the producer direction. If so, those hosts would tend to fall out of the expected diagonal and be more blue than grey. In fact, most of the blue points here actually were the ones from which data was stolen. Most only lost a little data, it seems, though one host is quite blue, indicating that it may be the source of the bulk of the data.


Based on my test data, it seems like this has promise. In both cases, the "most red" point on the graphs corresponded to the data exfil or staging host. For staging, we were actually able to derive some additional information about the likely sources of the data that was stolen. We may not be able to rely on this in all cases, and it's likely to be much more complicated in a real enterprise environment, but where it's present, it may be quite useful. At least in my small dataset, tracking PCR shift proved an effective method for identifying both data staging and data exfiltration.