Wednesday, November 5, 2014

Triage Any Alert With These Five Weird Questions!

(OK, so I went all "BuzzFeed" on the title.  My alternate was going to be "What kind of alert are you?  Take this quiz and find out!" so be thankful.)

Introduction

There are few things more frustrating to users than a tool that doesn't support (or is even at odds with) their processes.  Tools should be designed to support our workflows, and the more often we perform a workflow, the more important it is that our tools support it.  As analysts, our most commonly exercised workflow is probably alert triage.  Alert triage is the process of going through all of your alerts, investigating them, and either closing them or escalating them to incidents.  Escalated incidents lead directly to incident responses (IRs), of course, and there's not always a distinct handoff where triage ends and response begins.  Therefore, some of the basic IR tasks are part of this workflow as well.

About five years ago, I was tasked with training a group of entry-level security analysts to do alert triage.  Previously, I'd just "done it" and never thought much about how it worked.  After mulling it over for a while, though, I realized that the entire process really boiled down to a set of questions the analyst needs to have answers for.

  1. Was this an actual attack?
  2. Was the attack successful?
  3. What other assets were also compromised?
  4. What activities did the attacker carry out?
  5. How should my organization respond to this attack?

If you start from the assumption that the analyst sitting in front of an alert console is going to be answering these five questions over and over again, it seems pretty clear that the console needs to make it as easy and as quick as possible for the analyst to get those answers.  We may not need to answer every one of these questions every time, nor do we always tackle them in the same order.  In general, though, this is the process we start from, adapting it on the fly to meet our needs.
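To make this concrete, here's a minimal sketch (in Python) of how a single pass through the first two questions might be structured.  The alert and console objects, and the helper methods on them, are purely hypothetical stand-ins for whatever your own console actually exposes.

    from enum import Enum

    class Disposition(Enum):
        CLOSE_FALSE_POSITIVE = "close: false positive"
        CLOSE_FAILED_ATTEMPT = "close: unsuccessful attempt"
        ESCALATE = "escalate to incident"

    def triage(alert, console):
        """Walk a single alert through the first two triage questions.

        Both arguments are hypothetical objects; a real console would
        supply its own equivalents of these lookups.
        """
        # Question 1: Was this an actual attack, or a false positive?
        if not console.is_true_positive(alert):
            return Disposition.CLOSE_FALSE_POSITIVE

        # Question 2: Was the attack successful?
        if not console.attack_succeeded(alert):
            return Disposition.CLOSE_FAILED_ATTEMPT

        # Questions 3-5 (scoping and response planning) happen after escalation.
        return Disposition.ESCALATE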

I thought it might be interesting to examine these questions in a little more detail.

Was this an actual attack?

Of course, this is the first answer we need.  You could restate this question as "Is this alert a false positive?" and it would mean the same thing.  As an industry, no one has yet figured out how to eliminate all the false positive (FP) alerts while still keeping all the true positive (TP) alerts.  Given that we know that there will be FPs (probably a substantial percentage of them), we need to make it as easy as possible for the analyst to distinguish between FP and TP, and to do it quickly.

The keys here are:

  • Providing the user with the context around the alert (what scenario is it intended to detect, what do actual examples of the TPs look like, etc.)
  • Identifying what other information (stuff that's not already in the alert) the analyst needs to see, and providing quick, easy access to it (e.g., pivoting to the PCAP associated with an alert; see the sketch below)
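As one illustration of that second point, here's a minimal sketch of a PCAP pivot using scapy.  The capture file name and the two addresses are placeholders that would come from the alert being triaged.

    # Minimal PCAP pivot sketch; assumes scapy is installed and that the
    # file name and addresses below were pulled from the alert itself.
    from scapy.all import rdpcap, IP

    ALERT_SRC = "203.0.113.10"   # attacker address from the alert
    ALERT_DST = "192.0.2.25"     # targeted host from the alert

    packets = rdpcap("alert.pcap")
    related = [
        pkt for pkt in packets
        if pkt.haslayer(IP)
        and {pkt[IP].src, pkt[IP].dst} == {ALERT_SRC, ALERT_DST}
    ]

    print(len(related), "packets between", ALERT_SRC, "and", ALERT_DST)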

Was the attack successful?

If you've ever monitored the IDS/NSM console of an Internet-facing asset, you know that the vast majority of exploit attempts probably fail.  These are mostly automated scans, probes, or worms.  They attack targets indiscriminately, without regard to OS, software stack, or version numbers.

Unfortunately, if you alert on exploit attempts, the sheer number of these alerts becomes a substantial burden on the analysts.  This is one reason I tend to focus my attention on detections further down the Kill Chain: it cuts down on the number of unsuccessful attack attempts the analysts have to wade through.

Even though we try to avoid alerting on unsuccessful attempts as much as possible, there are still plenty of situations where we are alerting on something that may or may not have been a success.  For example, if we alert on a drive-by download or a watering hole attack, the analyst then has to see whether the browser reached back out to download the malicious payload.  The key information they need will be much the same as before:

  • Context
  • Quick & easy access to related information

If the answer to this question is "Yes, the attack was successful," the next step is usually to escalate the alert to a full-blown incident.
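For the drive-by example above, one way to check for success is to look for the victim host reaching back out to the payload server shortly after the alert fired.  The sketch below does this against a proxy log; the file name, column names, addresses, and times are all hypothetical, so substitute whatever your own proxy or web logs actually provide.

    import csv
    from datetime import datetime, timedelta

    # Hypothetical proxy log exported as CSV with "timestamp", "client_ip",
    # and "url" columns; real proxy log formats will differ.
    VICTIM_IP = "192.0.2.25"
    PAYLOAD_HOST = "evil.example.net"
    ALERT_TIME = datetime(2014, 11, 5, 14, 30)
    WINDOW = timedelta(minutes=10)

    with open("proxy_log.csv", newline="") as f:
        for row in csv.DictReader(f):
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            if (row["client_ip"] == VICTIM_IP
                    and PAYLOAD_HOST in row["url"]
                    and ALERT_TIME <= ts <= ALERT_TIME + WINDOW):
                print("Possible successful payload download:", row)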

What other assets were also compromised?

This is where things start to get really interesting.  Assuming the alert indicates a successful attack, you have to start doing what we call scoping the incident.  That is, before you can take any actions, you need to gather some basic information with which to plan and make response decisions.

The first step in determining the scope of the incident is to assemble a list of assets involved.  "Assets" in this case can be any object in your organization's IT space: computers, network devices, user accounts, email addresses, files or directories, etc.  Anything that an attacker would want to attack, compromise or steal would be considered an asset.

For example, you may see an intruder try a short list of compromised usernames and passwords across a long list of hosts, looking for which credentials work on which hosts.  In this case, your asset list would include not only every one of the compromised user accounts, but also all of the hosts they were tried against.  If the intruder performed this action from within your corporate network, the source of those login attempts would also be on the list.
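As a rough sketch of how that initial list might be assembled from logs, here's one way to do it in Python.  The CSV authentication log, its column names, and the account names are all hypothetical; real sources such as Windows security events or syslog would need their own parsing.

    import csv
    from collections import defaultdict

    # Hypothetical authentication log exported as CSV with "username",
    # "src_host", and "dst_host" columns.
    COMPROMISED_ACCOUNTS = {"jsmith", "svc_backup"}   # known-bad credentials

    assets = defaultdict(set)
    with open("auth_log.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["username"] in COMPROMISED_ACCOUNTS:
                assets["accounts"].add(row["username"])
                assets["target hosts"].add(row["dst_host"])
                assets["source hosts"].add(row["src_host"])

    for category, items in sorted(assets.items()):
        print(category + ":", ", ".join(sorted(items)))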

It's also worth noting that the asset list will almost certainly change over the course of the incident.  For smaller incidents, you may be able to both assemble the asset list and filter out the assets which weren't actually compromised, all in one step.  For larger incidents, it's pretty common to create the list and then parcel it out to a group of incident responders to verify.  As the analysts determine that assets were not compromised, those assets may drop off the list.  Similarly, as new information comes to light during the course of the investigation, new assets will probably be added.  The asset list is one of the most volatile pieces of information associated with any active incident response, and keeping it current is absolutely critical.

The keys to helping analysts deal with asset tracking are:

  • Creating a template "asset list" that can be cloned for each alert/incident to track compromise status (see the sketch after this list).  At a minimum, you probably need to track the following info for each asset:
    • Asset name (hostname, username, email address, etc)
    • Date/time the attacker was first known to interact with that asset
    • Date/time the attacker was last known to interact with that asset
    • Date/time an analyst detected the activity
    • Date/time the asset was added to the list
    • Date/time the analyst investigated the asset and determined it to be compromised or not
    • The results of that investigation (often COMPROMISED, NOT COMPROMISED, PENDING ANALYSIS)
    • Brief notes about the investigative findings 
  • Sharing this list with other responders, in such a way as to make it easy to interact with (e.g., via a wiki)
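If it helps, here's a minimal sketch of that template as a Python data structure, plus a CSV export that could be pasted into a wiki or opened as a spreadsheet.  The field names are just illustrations of the items above; rename or extend them to fit your own process.

    import csv
    from dataclasses import dataclass, asdict, fields

    @dataclass
    class Asset:
        name: str                         # hostname, username, email address, etc.
        first_attacker_contact: str       # attacker first known to interact with it
        last_attacker_contact: str        # attacker last known to interact with it
        detected: str                     # when an analyst detected the activity
        added_to_list: str                # when it was added to this list
        investigated: str                 # when the investigation of it concluded
        status: str = "PENDING ANALYSIS"  # or COMPROMISED / NOT COMPROMISED
        notes: str = ""                   # brief investigative findings

    def write_asset_list(assets, path="asset_list.csv"):
        # Dump the current list to CSV so it can be shared and kept up to date.
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(Asset)])
            writer.writeheader()
            writer.writerows(asdict(a) for a in assets)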

What activities did the attacker carry out?

The next big piece of scoping the incident is just trying to come up with what I call the "narrative" of the attack:  What did the attacker try to do, when, and to what?  The asset list answers the "to what" question, but even that needs to be put into context with everything else the attacker did.

To help answer these questions, incident responders typically create timelines of the significant events that occur during the attack and the IR.  For example, you might start with a simple chart that calls out the highlights of the incident, with dates and times for:

  • The attacker's first exploit attempt
  • All the alerts generated by the attack
  • When the alerts were triaged and escalated to incidents
  • When the affected assets were contained
  • When the affected assets were remediated
  • When the incident was closed

As your investigation gathers more details about what happened, the timeline grows.  As more confirmed malicious events are discovered, and incident response milestones achieved, these are added to the timeline.  Just like the asset list I mentioned before, the timeline typically changes a lot during the course of the investigation.

Once you have gathered the timeline data, you may want to display it in different ways, depending on what you're trying to accomplish or who you are sharing it with.  For example, a table view is common when you're editing or managing the events, but some sort of graphical view or interactive visualization is a much nicer way to show the narrative to others or to include in your incident reports.  There are some timelining tools available out there, but to start with, I recommend trying just a plain spreadsheet (or similar format) and seeing how that works before getting too complicated.

The keys to helping the response team track activities are:

  • Creating a timeline template that can be cloned for each incident (see the sketch after this list).  Key fields to track here might include:
    • A description of the entry (attacker action, milestone reached, etc)
    • Date/time that entry occurred
  • Sharing this list with other responders, in such a way as to make it easy to interact with (e.g., via a wiki)
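Along the same lines, here's a minimal sketch of a timeline kept as plain data and written out to a CSV for the spreadsheet view mentioned earlier.  The entries shown are made-up placeholders, and the two fields mirror the list above.

    import csv
    from operator import itemgetter

    # Made-up placeholder entries; the "when"/"entry" fields mirror the template
    # above and are easy to extend (e.g., add an "analyst" or "source" column).
    timeline = [
        {"when": "2014-11-01 09:12", "entry": "First exploit attempt against web01"},
        {"when": "2014-11-01 09:15", "entry": "IDS alert fired on the exploit attempt"},
        {"when": "2014-11-01 10:02", "entry": "Alert triaged and escalated to an incident"},
        {"when": "2014-11-01 13:40", "entry": "web01 contained (network access removed)"},
    ]

    def write_timeline(entries, path="timeline.csv"):
        # Write the timeline, sorted by time, to a CSV for the spreadsheet view.
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["when", "entry"])
            writer.writeheader()
            writer.writerows(sorted(entries, key=itemgetter("when")))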

How should my organization respond to this attack?

This is the big question!  Once you have assembled the list of compromised assets and the timeline of events, you exit the scoping phase and are ready to begin planning your incident response strategy.  In most cases, a typical strategy involves containment (removing the attacker's access to the compromised assets) followed by remediation (bringing the compromised assets back to a known-good production state).  The topic of incident response strategies is far too detailed to get into here, but it's worth noting that this is not a one-size-fits-all situation.  Different types of incidents require different plans (sometimes called playbooks).  A well-managed CIRT will have a number of standard IR plans ready to go, but even then they often need to be tailored to the individual incident.  And it's still not uncommon to find incidents that don't fit any of the standard plans exactly, in which case the response team needs to create an entirely new plan on the fly (using pieces of existing plans, if they're lucky).

The creation and application of these plans still requires a high degree of skill and experience, something that a lot of organizations don't have enough of in-house.  If this is an issue, you may consider engaging a consultant with experience building CIRT teams to help guide you through this process.

The keys here are:

  1. Identifying who in your organization has the experience necessary to come up with good playbooks, or getting outside help.
  2. Creating playbooks that are specific to your organization's IT environment, policies and security goals (a bare-bones sketch of one possible structure follows this list).
  3. Sharing these playbooks among all the incident responders.
  4. Training with and testing the playbooks on a regular basis.
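As a bare-bones illustration of what a playbook skeleton might look like when kept as structured data, here's one possible shape.  The phases, steps, and notification targets are generic placeholders, not a recommended plan for any particular incident type.

    # One possible shape for a playbook kept as structured data; every value
    # below is a generic placeholder rather than a recommendation.
    playbook = {
        "name": "Compromised workstation",
        "applies_to": "A single user workstation confirmed compromised",
        "containment": [
            "Remove the host from the network (switch port, NAC, or VPN revoke)",
            "Disable or reset any credentials used on the host",
        ],
        "remediation": [
            "Preserve forensic images as required by policy",
            "Rebuild the host from a known-good image",
            "Restore user data from backup and re-enable accounts",
        ],
        "notify": ["IT operations", "Asset owner", "Management, per policy"],
    }

    for phase in ("containment", "remediation"):
        print(phase.upper())
        for step in playbook[phase]:
            print("  -", step)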

What does this all mean?

To a large extent, an organization's ability to detect and respond to security events depends on the quality of its tools.  It's not enough to just go out and get a bunch of "solutions" if they don't support the way you work.  We need to be able to leverage those tools to get our work done faster and better.  By identifying the common workflows we go through, we can design our toolset to help us be more productive.  This translates into faster, more effective responses, which in turn means better protection for our organizations.

I know that other very experienced incident responders will read this.  I'd love to hear your feedback, so please leave a comment!