An Introduction to Exploratory Data Analysis with Network Forensics

"Universal law is for lackeys; context is for kings."
-Capt. Gabriel Lorca, Star Trek: Discovery

Workflows are often not as clearly defined in reality as they are in the organizational charts and docs, if those even exist. As a result, the more dismal side of network and threat investigation is often carried out as an iterative process between a data analyst and a threat researcher. Before other engineers, executives, and data scientists hear about a given investigation or start building models to explain the observed network behavior, these first two often have to work together to turn their data into an easily understood story. There are no assumptions or hypotheses to “test” just yet; the battle is still in the trenches of data analysis, where the long fight to convert data into actionable business objectives has not yet been won. This is where we decide what to test and why, because the cloud isn’t free, and neither is our time. The purpose of this post is to show how we can understand and visualize network data in order to better steer a threat analyst down a path of focus when attacks occur.

A more subtle facet of this process that I would like to focus on in this post is shared context between a data analyst and a threat researcher: we don’t know yet what questions to ask or what the problem space is, so establishing that context is a good place to start. If "big data is anything you can't fit in Excel,"[1] the example I use in this post is admittedly small data, but that doesn’t mean it can’t be "good data" for the purposes of investigation, especially when members of a team may make assumptions that these data don’t necessarily support. In my experience, grander campaigns will fail if consensus on the most basic goals isn’t reached first, because errors multiply over the life of a project. For example, you may have collected data on sessions containing some set of network-based detections, but how old was the intelligence that identified those threats? If the answer is "over a year old," and the threat was identified in the last week, does that change the team's confidence in the threat content of the data that was gathered? The answer to this question will often be “yes.”

Let’s look at an example to illustrate some of these ideas. The data illustrated by the row below were recorded during a WannaCry attack that lasted from April 26, 2018 until May 5, 2018. A single row, or record, is an IDS signature (sid) on a given conversation between a source (src) and destination (dst) IP address. There are 4,818 such sid-conversation records during the attack. Each record also carries the timestamps of the first (start) and last (end) IDS hits generated from the conversation; the observation timestamp range (OTR), which is the difference between start and end and represents the length, in seconds, of "known bad" traffic; the total number of observations (n_obs) within the OTR; and the directionality of the conversation (direction). Note that in the table below, I have sanitized the first two columns in order to preserve anonymity:


Though the data aggregated above are intentionally basic, thanks to other investigations we have high confidence that all of the 2.46 million observations generated from the 5 unique signatures that identified this traffic are true positives. One of the main purposes of this basic exploratory data analysis (EDA) is to determine which of these many conversations a threat researcher should focus on.
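
To make the record layout concrete, here is a minimal sketch of one sid-conversation record in Python. The class name `SidConversation`, the `otr_seconds` property, and every value in the sample record are my own illustrative assumptions; only the field names (src, dst, sid, start, end, n_obs, direction) and the definition of OTR come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SidConversation:
    """One record: an IDS signature seen on one src-dst conversation."""
    src: str          # source IP (sanitized)
    dst: str          # destination IP (sanitized)
    sid: int          # IDS signature ID
    start: datetime   # timestamp of the first IDS hit
    end: datetime     # timestamp of the last IDS hit
    n_obs: int        # number of IDS hits between start and end
    direction: str    # e.g. "internal-to-internal"

    @property
    def otr_seconds(self) -> float:
        """Observation timestamp range: end minus start, in seconds."""
        return (self.end - self.start).total_seconds()


# Hypothetical record: the addresses, times, and counts are invented.
record = SidConversation(
    src="10.0.0.1",
    dst="10.0.0.2",
    sid=2024216,
    start=datetime(2018, 4, 26, 8, 0, 0),
    end=datetime(2018, 4, 26, 14, 30, 0),
    n_obs=120,
    direction="internal-to-internal",
)

print(record.otr_seconds)          # 23400.0
print(record.otr_seconds / 3600)   # 6.5 (hours)
```

A record whose first and last hits share a timestamp has an OTR of zero, which is worth keeping in mind once the axes are logged later on.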

Before we get to visualizing these data, we can make a few high-level statements about the data with some basic grouping and counting:

  1. Out of all 4,818 conversations, only 32 are internal-to-external.
  2. Of those 4,818, 4,010 (83.2%) are unique pairs of source and destination IP addresses.
  3. In all, there are 234 unique source IPs, and 469 unique destination IPs.
  4. The top 5 source IPs account for between 63 and 68 non-unique conversations, each.
  5. The top 5 destination IPs account for between 150 and 215 non-unique conversations, each.
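
The grouping and counting behind statements like these needs nothing fancy. As a sketch, assuming each record is reduced to a `(src, dst, sid, direction)` tuple, the standard library's `Counter` and a set are enough. The five records below are invented (including the placeholder signature 9999999), so the printed numbers are not the ones from the list above.

```python
from collections import Counter

# Hypothetical (src, dst, sid, direction) records; all values are invented.
records = [
    ("10.0.0.1", "10.0.0.2", 2024216, "internal-to-internal"),
    ("10.0.0.1", "10.0.0.2", 9999999, "internal-to-internal"),
    ("10.0.0.1", "10.0.0.3", 2024216, "internal-to-internal"),
    ("10.0.0.4", "10.0.0.2", 2024216, "internal-to-internal"),
    ("10.0.0.4", "203.0.113.9", 2830018, "internal-to-external"),
]

# 1. Conversations per direction.
by_direction = Counter(direction for _, _, _, direction in records)

# 2. Unique src-dst pairs, regardless of signature.
unique_pairs = {(src, dst) for src, dst, _, _ in records}

# 3-5. Conversations per source and per destination IP.
by_src = Counter(src for src, _, _, _ in records)
by_dst = Counter(dst for _, dst, _, _ in records)

print(by_direction["internal-to-external"])   # 1
print(len(unique_pairs), len(records))        # 4 5
print(len(by_src), len(by_dst))               # 2 3
print(by_src.most_common(5))                  # top source IPs by record count
print(by_dst.most_common(5))                  # top destination IPs by record count
```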

We can also count the number of unique src-dst IP pairs (#2) for each signature to find that signature 2024216 (an Emerging Threats Open signature to detect a DOUBLEPULSAR beacon response) represents the largest group with 3,280 (68.1%) of the non-unique conversations:
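
A per-signature breakdown is the same counting idea with a share attached. The helper name `signature_shares` and the sample labels are assumptions for illustration (9999999 is a placeholder, not a real signature from this data set); the last line simply re-checks the 68.1% figure quoted above.

```python
from collections import Counter


def signature_shares(sids):
    """Return (sid, count, share of all records), largest group first."""
    counts = Counter(sids)
    total = sum(counts.values())
    return [(sid, n, n / total) for sid, n in counts.most_common()]


# Hypothetical signature labels, one per sid-conversation record.
sample = [2024216] * 7 + [9999999] * 2 + [2830018] * 1

for sid, n, share in signature_shares(sample):
    print(f"{sid}: {n} ({share:.1%})")

# The same arithmetic on the figures quoted in the text:
print(f"{3280 / 4818:.1%}")   # 68.1%
```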


Other ways of graphing these conversations are a function of the other types of data we have. For the purposes of this post, I chose the data under discussion for a couple of reasons: number of observations and OTR are two continuous and relatively easy-to-understand variables; there is a reasonable belief that longer conversations will generate more observations; and, regardless of other empirical reasoning, the patterns I’m reviewing here are what I found when I first did this almost a year ago. That may sound hand-wavy, but the point is that during EDA you often don’t know exactly what you’re looking for or what assumptions to make about the data yet—in this case, these data started a lot of valuable conversations.

Next, I’ll make a scatter plot of the number of IDS hits on a conversation against the OTR; a scatter plot is a good choice because of the continuous nature of the variables I’m currently inspecting. I’m not trying to assert a relationship between these variables, and certainly not a causal one. We are still exploring the characteristics of the data itself; building models and drawing conclusions isn’t only “down the road”: we don’t even know yet what road we’re on! My goal here is to explore any potential variation in the data that might be meaningful, which is another great reason to choose a scatter plot: I want to find patterns that even a five-year-old could point at and say “that looks different.” Only with a collection of meaningful patterns illustrating the data generation processes would I start a conversation with a threat researcher to see if those patterns are expected and if our assumptions are shared.

So, what do these data look like when displayed on a scatter plot?


There is a ton of variation in both of these variables with which we can try to tell a story. Certainly we could do something like run a regression to quantify what looks like a positive relationship between these two features, but since it’s not immediately clear if such a relationship would be forensically useful even if it existed, let’s first add some color to the output from above:


This looks like it might be a bit more informative, but there’s still some bunching going on around the axes, which prevents us from seeing if anything in the smaller ranges is worth telling our threat researchers about. One thing we can do is take the natural log of each of the two dimensions plotted on the axes. This has the effect of scaling the data in such a way that small and large values are better able to be compared visually, since the scale is more compact. In fact, I displayed the above scatter plot with an x-axis in hours, despite OTR being in seconds; it's easier to accept that scaling because we're so familiar with it. Taking the natural log of a number is the inverse operation of exponentiating it; the scales are likewise logarithmic. Put more simply, successively much bigger numbers are scaled down to smaller and closer values. This is why a log curve rises fast for small values and then flattens out for larger ones; by logging observation count and OTR, we’re applying the same principle to the scale of our axes. You can see this effect in the axes of some of the graphs below, but for now consider this table, which gives the first six orders of magnitude as both their linear and naturally logged values:
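
The compression is easy to reproduce. This sketch assumes "the first six orders of magnitude" means 10^0 through 10^5, which is my reading rather than something stated above, and prints each value next to its natural log.

```python
import math

# Powers of ten from 10^0 to 10^5 (an assumed range), with their natural logs.
rows = [(10 ** k, math.log(10 ** k)) for k in range(6)]

print(f"{'linear':>8}  {'ln':>7}")
for value, logged in rows:
    print(f"{value:>8}  {logged:>7.3f}")

# Each tenfold step adds the same amount on the log scale: ln(10), about 2.303.
steps = [later[1] - earlier[1] for earlier, later in zip(rows, rows[1:])]
print([round(step, 3) for step in steps])
```

Every step from one power of ten to the next covers the same distance on the logged axis, which is why values that differ by orders of magnitude can share one readable plot.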


Besides condensing larger scales to smaller scales, we get a bit of a conceptual bonus when using these naturally logged values. For values that are “close” to each other, the difference of natural logs can be interpreted as an approximation of their relative proportions, as a percentage. For example, ln(10.001) - ln(10) = 0.000099995, which is approximately equal to 0.001 / 10, or 0.0001. You can multiply either that difference or that quotient by 100 and you will get 0.01%, and that relationship is a big reason to use log-scale: we can often do some mental arithmetic to get a sense for percentage difference. Let’s go ahead and log these two dimensions to see the visual effect this rescaling has on our data. I’ll also color this log-log scatter plot by the signature label on the conversation:
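
The reason this works is that ln(a) - ln(b) = ln(1 + (a - b)/b), and ln(1 + x) is approximately x when x is small. Here is a quick numerical check; the helper names `log_diff` and `relative_diff` are mine, and the second and third pairs are extra examples I added to show where the shortcut stops being trustworthy.

```python
import math


def log_diff(a: float, b: float) -> float:
    """Difference of natural logs, ln(a) - ln(b)."""
    return math.log(a) - math.log(b)


def relative_diff(a: float, b: float) -> float:
    """Relative difference of a with respect to b, (a - b) / b."""
    return (a - b) / b


for a, b in [(10.001, 10), (10.5, 10), (20, 10)]:
    print(f"a={a:<7} b={b}  ln diff={log_diff(a, b):.9f}  "
          f"relative={relative_diff(a, b):.9f}")
```

For 10.001 against 10 the two columns agree to about eight decimal places; for 20 against 10 the log difference is roughly 0.69 while the true relative difference is 1.0, so the mental arithmetic only holds for values that really are close.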


Neat! There’s some interesting grouping and clustering in these data points. It’s easy to see the type of scaling applied by the natural log: the visual distance between 10 and 150 observations is roughly the same as the distance between 150 and 3000. There’s no more bunching, but there is some peculiar banding along single values of both OTR and number of observations. I’ll discuss this more below, and add one more feature to this graphic by faceting the scatter plot by the direction of traffic:


Ah-ha! Even though some conversations triggering Signature 2830018 are internal-to-internal, that signature alone was triggered by all conversations that were internal-to-external. The fact that only this signature hit on internal-to-external conversations may help a threat researcher determine the origin or extent of the infection. These are the 32 conversations referenced in our earlier list of high-level observations (#1), but now we also have a more precise sense of how long these sessions were, how many observations we generated on them, as well as which signature they triggered. The minimum OTR in this group is 6.1 hours and the median is 117 hours; we can see this is quite different from the group as a whole.
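
The same check can be made without a plot. The sketch below assumes records reduced to `(sid, direction, otr_seconds)` tuples; the six records are invented, so the printed minimum and median are not the 6.1 and 117 hours reported above.

```python
from statistics import median

# Hypothetical (sid, direction, otr_seconds) records; all values are invented.
records = [
    (2024216, "internal-to-internal", 0),
    (2024216, "internal-to-internal", 5400),
    (2830018, "internal-to-internal", 3600),
    (2830018, "internal-to-external", 7200),
    (2830018, "internal-to-external", 36000),
    (2830018, "internal-to-external", 108000),
]

outbound = [r for r in records if r[1] == "internal-to-external"]

# Which signatures fired on outbound conversations?
outbound_sids = {sid for sid, _, _ in outbound}

# OTR is stored in seconds; convert to hours for readability.
otr_hours = [otr / 3600 for _, _, otr in outbound]

print(outbound_sids)        # {2830018}
print(min(otr_hours))       # 2.0
print(median(otr_hours))    # 10.0
```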

Note that none of the outbound conversations were in the largest group by signature, and in fact all outbound traffic is in the smallest signature group (2830018). Let’s look for other differences between these signature groups using a box plot of OTR lined up with the column graph from above. We already know that these aren’t balanced groups (only 32 of our 4,818 conversations are signature 2830018), but now we can clearly identify other issues of skew in our data, namely that 3 of our 5 signatures triggered on very short sessions. Many of the rest of the conversations were technically outliers in their respective group (at least by the “1.5x the interquartile range” rule):
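
For reference, the "1.5x the interquartile range" rule can be sketched in a few lines. The function name `iqr_outliers` and the sample values are assumptions for illustration, and plotting libraries differ in how they compute quartiles, so the fences here may not match a given box plot exactly.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical OTR values (hours) for one signature group: mostly very
# short sessions plus one long-lived conversation.
otr_hours = [0, 0, 1, 2, 2, 3, 4, 250]

print(iqr_outliers(otr_hours))   # [250]
```

When most sessions in a group are very short, the interquartile range is tiny, so nearly any long-lived conversation lands outside the fences. That is the skew described above.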


The analyst or data scientist for this project still has some other work to do, however. There are fascinating artifacts in the left facet of the log-log scatter plot, above; the striations along both axes likely imply something about the intelligence gathering mechanisms used. Furthermore, this odd pattern seems to be present in every signature group except for the outbound facet. Also, Signature 2024216 appears to have at least two very distinct clusters of conversations, as well as at least one of the aforementioned graphical artifacts.
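
One possible first step toward explaining the striations is simply to list which exact values repeat. This is a sketch of that idea only, not the method used in the original analysis; the helper name `repeated_values`, the `min_count` threshold, and the sample counts are all assumptions.

```python
from collections import Counter


def repeated_values(values, min_count=3):
    """Return {value: occurrences} for values seen at least min_count times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= min_count}


# Hypothetical n_obs values for one signature group.
n_obs = [1, 1, 1, 1, 7, 12, 12, 40, 300, 300, 300]

print(repeated_values(n_obs))   # {1: 4, 300: 3}
```

Values that recur far more often than their neighbors are candidates for the bands visible in the plot, and a short list of them is something concrete to bring to a threat researcher.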

A good story should attract listeners from all walks of life, and those we tell at work are no different. The curation and preparation of these data started important conversations: if there were one goal of this kind of analysis, this would be it. We started in the hypothetical world where we had just discovered a threat or received word from a client that “something didn’t look right” on their network. The next step is to realize that wherever you go with the story you start telling will shape the entire discussion—permanently. Even if it’s wrong and you or someone else has to start over. As Graham Shaw said in The Art of Business Communication, “the amazing power of pictures to stay in the memory is well-documented.” Given this fact, it’s important to make small, impactful moves in the way we shape this narrative; for this reason I advocate for the validation of basic beliefs about a problem and its setting. The analyst’s role in this phase is to inform, not persuade—and for this reason I would rather have something of seemingly small yet inarguable importance that I can reproduce scientifically than some earth-shattering but controversial finding. A good example was how we found the internal-to-external conversations above. It’s also important to note that the counts underlying the column graph do not hint at the observation timestamp range of those conversations; you need to inspect the boxplot to see that really only two signature groups, 2024216 and 2830018, contribute very much to the overall variation in OTR.

Above all else, network security professionals are entrusted with protecting value on client networks, and we’re not here to massage egos or make names for anyone. We don’t need complicated or contentious assumptions (yet): we’re here to present the facts, provide their relevant context, and tell this story early, quickly, and effectively.

  1. Stéphane Hamel, author of the Online Analytics Maturity Model