Monday, September 26, 2016

Detecting Data Staging & Exfil Using the Producer-Consumer Ratio

In their FloCon 2014 presentation PCR - A New Flow Metric, Carter Bullard and John Gerth introduced the idea of the Producer-Consumer Ratio (PCR) for measuring and tracking shifts in the typical pattern of network communication for each host. PCR is calculated on a per-host basis, like this:


        (bytes sent - bytes recvd)
PCR =  ----------------------------
        (bytes sent + bytes recvd)

This is an interesting metric, because it gives a good indication of the traffic pattern yet ignores many details that tend to complicate understanding, such as the actual volume of data sent or received, the number of flows, the number of packets, etc. It boils everything down to one simple number in the range [-1.0, 1.0]. They provided the following chart to give a rough idea of how to interpret PCR values:
 PCR   Host Role
 1.0   pure push - FTP upload, multicast, beaconing
 0.4   70:30 export - Sending Email
 0.0   Balanced Exchange - NTP, ARP probe
-0.5   3:1 import - HTTP Browsing
-1.0   pure pull - HTTP Download
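The calculation itself is trivial. Here's a minimal Python sketch (the function name and the zero-traffic guard are my own additions):

def pcr(bytes_sent, bytes_recvd):
    """Producer-Consumer Ratio: +1.0 is a pure producer, -1.0 a pure consumer."""
    total = bytes_sent + bytes_recvd
    # Guard against hosts with no traffic at all; call them balanced.
    return (bytes_sent - bytes_recvd) / total if total else 0.0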

The idea is that you can track the PCR for each host over time and look for shifts to identify significant changes that might indicate possible data exfiltration. I recently came across this technique as I was reviewing contributions to The ThreatHunting Project and thought it sounded like a fun thing to play around with. I decided to give it a try on some test data I had lying around, just to see how it would work.

The Hypothesis

By comparing a host's baseline PCR to its current PCR and looking for large shifts, we should be able to identify hosts that are exfiltrating data to the Internet. By extension, we should also be able to use the PCR shift to identify central staging points for data, where threat actors gather it in preparation for exfiltration.

The Data

The data I used comes from a test lab that features a realistic corporate Windows environment at small scale (about 40 hosts), as well as a simulated user population that does things like access file servers, send/receive email, and browse the web. There's also a simulated Internet.
In this lab, we monitor our network with Bro, so I used Bro's connection logs (conn.log files) as my data source. The exact format of these files doesn't really matter here, and you can easily adapt this technique to any flow data you happen to have (Argus, SANCP, etc.).
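If you want to follow along, a rough sketch of loading a conn.log into pandas might look like this (the column list matches Bro 2.x defaults, but check the '#fields' header line in your own logs, since it varies with version and configuration):

import pandas as pd

FIELDS = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
          "proto", "service", "duration", "orig_bytes", "resp_bytes",
          "conn_state", "local_orig", "local_resp", "missed_bytes", "history",
          "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes",
          "tunnel_parents"]

# Bro logs are tab-separated, with '#'-prefixed header/footer lines and '-'
# standing in for unset fields.
flows = pd.read_csv("conn.log", sep="\t", comment="#", names=FIELDS,
                    na_values="-")
flows[["orig_ip_bytes", "resp_ip_bytes"]] = (
    flows[["orig_ip_bytes", "resp_ip_bytes"]].fillna(0).astype("int64"))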
I should also point out that in this attack scenario, the same host was used for both data staging and data exfil. This isn't much of a problem when calculating PCR, since the staging and exfil detections each calculate PCR on a different subset of the data (flows traversing the Internet perimeter for the exfil, and flows staying purely internal for the staging). The big inbound and outbound data transfers therefore don't interfere with each other. Had I ignored this and computed PCR across all flows in the dataset, I probably would have gotten a much more balanced PCR, with the staging and exfil on the same host canceling each other out. This all goes to show that PCR-based methods should always take the network vantage point(s) into account, or risk missing things that are anomalous in both directions.

Dealing With Production Datasets

Since this is a test lab, I have both a "clean" dataset (no threat actor activity) and one that contains a mixture of legitimate use and attack traffic (we'll call this the "dirty" dataset). Most readers, though, probably aren't so lucky. If you're trying to do this with your own data pulled from a production network, try defining the dirty data as everything from the previous 7 days and the clean data as anything before that (perhaps back to a maximum of 30 or 60 days). Even though your "clean" data may not actually be totally clean, the more of it you have, the less likely any transient fluctuations are to distort your baseline PCRs.
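As a sketch, splitting one production dataset into those two windows could be as simple as this (the 7- and 60-day cutoffs are just the arbitrary examples from above):

# conn.log timestamps are epoch seconds; use the newest flow as "now".
now = flows["ts"].max()
DAY = 24 * 3600
dirty = flows[flows["ts"] >= now - 7 * DAY]
baseline = flows[(flows["ts"] >= now - 60 * DAY) & (flows["ts"] < now - 7 * DAY)]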

Exfil Detection

Exfil is data going from inside our network to the Internet, so I started by filtering my flows to select only those where the source (orig_h) is an internal IP and the dest (resp_h) is an Internet host. In plain language, I selected only flows that crossed the network's perimeter (transit to/from the Internet). Then I simply summed up the bytes sent and bytes received for each host (Bro's orig_ip_bytes and resp_ip_bytes columns, respectively).
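Continuing the pandas sketch, that filter might look something like this (the internal range here is a hypothetical placeholder; substitute your own address space):

import ipaddress

INTERNAL = ipaddress.ip_network("10.0.0.0/8")   # hypothetical internal range

def is_internal(ip):
    return ipaddress.ip_address(ip) in INTERNAL

# Perimeter-crossing flows: internal source talking to an external destination.
perimeter = flows[flows["id.orig_h"].map(is_internal) &
                  ~flows["id.resp_h"].map(is_internal)]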
Note that since Bro records bi-directional flows, I had to calculate PCR for any host that appeared as either a source or a destination. Further, each "destination" host not only received bytes, but sent some of its own, so I had to sum the resp_ip_bytes twice: once as the bytes received by the src host and once as the bytes sent by the dest host. Ditto for the orig_ip_bytes, but in reverse. Continuing the sketch over the filtered perimeter flows, it would look something like this:
srchost_bytes_sent  = perimeter.groupby("id.orig_h")["orig_ip_bytes"].sum()
srchost_bytes_recvd = perimeter.groupby("id.orig_h")["resp_ip_bytes"].sum()

dsthost_bytes_sent  = perimeter.groupby("id.resp_h")["resp_ip_bytes"].sum()
dsthost_bytes_recvd = perimeter.groupby("id.resp_h")["orig_ip_bytes"].sum()
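Each host's true totals then combine its two roles, and the PCR falls out directly (pcr_per_host below is my own name for the result):

# A host may appear on both sides, so add the two views together.
bytes_sent  = srchost_bytes_sent.add(dsthost_bytes_sent, fill_value=0)
bytes_recvd = srchost_bytes_recvd.add(dsthost_bytes_recvd, fill_value=0)
pcr_per_host = (bytes_sent - bytes_recvd) / (bytes_sent + bytes_recvd)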

I was looking for hosts that have a fairly large PCR shift in the positive direction. In the extreme case, the baseline value would be -1.0 (a pure consumer) and the dirty value would be +1.0 (a pure producer). To make those hosts show up nicely, I calculated the shift as (dirty PCR - baseline PCR). In the best case, the shift would therefore be 2.0.
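In the sketch, that means running the per-host calculation once over each window (producing, say, pcr_baseline and pcr_dirty, hypothetical names) and differencing them:

# Positive shift = host moved toward being a producer between windows.
shift = (pcr_dirty - pcr_baseline).dropna()   # NaN where a host is in only one window
exfil_suspects = shift.sort_values(ascending=False)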

I constructed a scatter plot of PCR shift for each host, showing the baseline PCR on the X axis, the "dirty" PCR on the Y axis, and the amount of PCR shift as the color. The trend line provides an easy reference to see where "no PCR shift" would fall on the graph, and makes it a bit easier to eyeball the distances for the outliers.
Most hosts will tend to adhere closely to their own baselines, given enough data points. Therefore I would expect the plot to look very much like a straight diagonal line (the graph of y = x), with most colors coming out a nice neutral grey.
Red hosts are more likely to be exfiltration, since those hosts shifted the most in a positive direction (towards being a producer). Theoretically, very blue hosts could indicate the exfil destinations (they suddenly consumed a lot more data than normal), but since those are Internet hosts, we can't count on that. I only plotted hosts that have PCR values in both the baseline and the "dirty" dataset, so if I'd never seen a particular Internet host before (as is probably the case with threat actor infrastructure), it wouldn't show up at all. In practice, once you know the exfil host, it's probably not too difficult to identify the host(s) that received the data, but if this were an issue you could try to do something smarter here.
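A matplotlib sketch of that plot might look like this (joining on host, so only hosts present in both windows appear):

import matplotlib.pyplot as plt

both = pcr_baseline.rename("baseline").to_frame().join(
    pcr_dirty.rename("dirty"), how="inner")
shift = both["dirty"] - both["baseline"]

# coolwarm runs blue -> grey -> red, so a positive shift shows up red.
plt.scatter(both["baseline"], both["dirty"], c=shift, cmap="coolwarm",
            vmin=-2, vmax=2)
plt.plot([-1, 1], [-1, 1], color="grey", linewidth=1)   # y = x: no shift
plt.xlabel("Baseline PCR")
plt.ylabel("Dirty PCR")
plt.colorbar(label="PCR shift")
plt.show()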
As you can see in the graph above, the most red host isn't actually all that red (the exfil package wasn't very large), but it is the host that exfiltrated the data in our attack scenario.

Staging Detection

Data staging is data moving purely internally to the network. This should show up as a host's PCR becoming more negative (the staging host). It may also show up as some hosts (the hosts from which the data was stolen) becoming more positive, if a lot of data was taken from them.
For this part of the analysis, I specifically filtered the flows to only those that both originated and terminated within the internal network. As before, I calculated the PCR shift for each src & dest host in the dataset, but I calculated the shift slightly differently here, since I was looking for different activity. In fact, I was trying to find the inverse of what I was looking for before, so I inverted the shift calculation, too. That is, in the best case, a system would go from being a pure producer (PCR 1.0) to being a pure consumer (PCR -1.0). I calculated the PCR shift here as (baseline PCR - dirty PCR), which would again mean a shift of 2.0 for a host staging data. I could have skipped this inversion, but it's an easy way to make the most significant shift look red on the graph, which I like.
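In the sketch, only the filter and the sign of the shift change (pcr_baseline_int and pcr_dirty_int stand in for the per-host PCRs recomputed over the internal-only flows):

# Flows that stay entirely inside the network: both endpoints internal.
internal_only = flows[flows["id.orig_h"].map(is_internal) &
                      flows["id.resp_h"].map(is_internal)]

# Inverted shift: a move toward pure consumer (a staging host) scores +2.0.
staging_shift = (pcr_baseline_int - pcr_dirty_int).dropna()
staging_suspects = staging_shift.sort_values(ascending=False)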
I then constructed a new scatter plot of PCR shift for each host, showing the baseline PCR on the X axis, the "dirty" PCR on the Y axis, and the PCR shift as the color, just like the previous graph.
Again, hosts adhering to their own baselines would tend to fall near the diagonal y=x line, and be colored grey. Red hosts are more likely to be unusually large consumers of data relative to their own baselines, and therefore may represent systems being used as data staging points. In fact, as indicated above, the most red host is actually the host that was used to stage the stolen data. It's very red, indicating that it experienced a large shift.
Unlike in the previous graph, where blue points weren't likely to be very useful, here they could very well indicate the sources of the stolen data. If the attacker stole a significant amount of data from any given system, the act of that victim host transferring its data to the staging point may have caused a significant PCR shift in the producer direction. If so, those hosts would tend to fall off the expected diagonal and be more blue than grey. In fact, most of the blue points here actually were the hosts from which data was stolen. Most only lost a little data, it seems, though one host is quite blue, indicating that it may be the source of the bulk of the data.

Conclusions

Based on my test data, it seems like this has promise. In both cases, the "most red" point on the graphs corresponded to the data exfil or staging host. For staging, we were actually able to derive some additional information about the likely sources of the data that was stolen. We may not be able to rely on this in all cases, and it's likely to be much more complicated in a real enterprise environment, but where it's present, it may be quite useful. At least in my small dataset, tracking PCR shift proved an effective method for identifying both data staging and data exfiltration.