Thursday, November 6, 2014

Leveraging Hadoop for Security Analysis in Networks

I have been trying to stay active on the blog, but as you can all tell, we have become somewhat busy over the past few weeks. This is to be expected when we are in a predictive mode instead of a reactive one. Earlier tonight I was showing another security engineer how we can pull various data points into our Hadoop cluster and analyze things more quickly and efficiently than we can on a single computer. Over the past 5 years I have built cloud environments for many purposes. Some ended up being long-term solutions, and some went away when administrators realized that operating a cloud requires a specific set of skills and the right people in the right places. A single person can maintain a cloud computing environment with the right software management in place, but the work can be all-consuming until you automate everything.

As part of our business model, the cloud is integral to what we do. From collecting news and intelligence information for analysis to running batch jobs that do tasks that would take a security engineer hours upon hours to complete manually, the cloud makes life easier. There are some tradeoffs when utilizing the cloud: you have to fully understand what each piece of software does and how the pieces interact with one another to get a job to a completed state.

So why this article? Because as security engineers you can't possibly look at all the data in an enterprise. You can, however, build watch lists in a cloud to notify you when certain content is being ingested into the system. This saves the engineers time. Here's an example. Say a new malware variant comes on the scene and you want to find out how many machines are infected with it. You will need a few things to get the task done in a reasonable time frame. Doing the task manually may take weeks or even months, but by leveraging Hadoop, Solr, Cassandra, Hive, Pig, etc., you can do the same task in under a day. I like to let the cloud work while I sleep. I wake up feeling somewhat productive if the tasks are completed when I start my work day. But let's get back to our example. In order to find that elusive new malware variant you need some things. Let me list them out.

Computing Power (CPUs) - Processing gigabytes or petabytes of data requires heavy usage of CPUs. By leveraging multicore CPUs in a cluster of machines you spread that load out and keep it from becoming a bottleneck.

Memory (RAM) - You have to have memory. Things are read and written much more quickly in memory than on disk drives (think solid state drives here). If you can afford them, put SSDs in your cloud. Your batch jobs will thank you.

Disk Space - You have to have the space to store things. If you can't store the petabytes of data, you can't analyze them either. You need a place to store your vectors, configurations, investigative files, etc.

Vectors - You have to have points of data to work with. In our malware example let's say we will use the MD5 hashes of the malware to detect it. That's a vector of identification. Once we identify it we need to process it.

Scripts, Parsers, Libraries and such - You have to have a consistent and standardized way of doing things so your jobs are repeatable. You want predictable results without error. You will use a multitude of scripts, MapReduce jobs, indexes (to speed searches and queries), and parsers to find what you're looking for.
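To make the MD5 vector and the "repeatable script" idea a little more concrete, here is a minimal Python sketch of a watch list check. It is only an illustration under stated assumptions: the watch list is assumed to be a plain text file with one MD5 hash per line, and the names `load_watch_list`, `md5_of_file` and `find_matches` are hypothetical helpers, not part of Hadoop or any of the tools mentioned above. On a real cluster the same logic would live inside a batch job rather than a single-machine directory walk.

```python
import hashlib
from pathlib import Path


def load_watch_list(path):
    """Read a hypothetical watch list file: one MD5 hex digest per line."""
    with open(path, "r", encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large samples do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_matches(root, watch_list):
    """Return the paths under root whose MD5 hash appears on the watch list."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and md5_of_file(path) in watch_list:
            matches.append(str(path))
    return matches
```

Calling `find_matches("/path/to/samples", load_watch_list("watchlist.txt"))` (both paths are placeholders) returns the files that match. Keeping the hash set separate from the scanning code means the same script can be rerun unchanged when the watch list is updated.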

Now that we have found the malware on the network, what do we do with it? One of the most likely things you will do with a cloud is build statistics. In this case we want to build a list of infected IP addresses and domain names so we know which entities are infected and can report on them (probably on a blog such as this one). However, we don't want to sit and sift through petabytes of data, so we write our process out as a MapReduce job or in some other language and let it work while we sleep.
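The statistics step described above can be sketched as a map step and a reduce step. The following is a small in-memory Python illustration of that pattern, not an actual Hadoop job: the record format (`timestamp source_ip md5_hash`, separated by whitespace) is an assumption made for the example, and `map_record`, `reduce_counts` and `infected_hosts` are hypothetical names. On a cluster, the framework would handle distributing the records and grouping the keys between the two steps.

```python
from collections import defaultdict


def map_record(line, watch_list):
    """Map step: emit (source_ip, 1) when the record's hash is on the watch list.

    Assumes a hypothetical record format: "timestamp source_ip md5_hash".
    Malformed lines are skipped rather than raising an error.
    """
    fields = line.split()
    if len(fields) != 3:
        return []
    _timestamp, source_ip, md5_hash = fields
    if md5_hash.lower() in watch_list:
        return [(source_ip, 1)]
    return []


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key (here, each IP)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def infected_hosts(lines, watch_list):
    """Run the map step over every record, then reduce to per-IP hit counts."""
    pairs = []
    for line in lines:
        pairs.extend(map_record(line, watch_list))
    return reduce_counts(pairs)
```

The keys of the returned dictionary are the infected IP addresses, and the values are how many matching records each one produced, which is the kind of list the notification step needs.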

Live Streaming Data - In order to find malware infections in near real time we need to have near real time data. Products such as Sqoop and Flume help in this regard. So we pull in things such as network pcaps, honeypot logs, malware submission reports, etc.

So using all these various tools and data points we begin collecting statistics, but that's not all we have to do. We want to identify the IP address or domain owners so we can notify them (just like we do when a patient's information gets released to the public). We have to know whom to notify, so identification is imperative.

Tracking - Once you have the information in hand, you need a way to track the outcome of your notifications. This is where old technology such as pen and paper, an electronic notebook or, hell, maybe even one of those fancy trouble ticketing systems would work.

In order to make it in this fast-moving world you have to do things more quickly and more accurately, and get the information out there before your competition. This is what a cloud does for us and what it could do for your organization.

Happy Hadooping!

About the author: Kevin Wetzel has been a leading researcher and cloud engineer since 2006. He has worked for various organizations to include the Department of Defense, Department of Homeland Security, various Health Care and Insurance organizations, business owners and politicians as well as private parties. Mr. Wetzel is a fan of cloud computing to make business processes run more efficiently. SLC Security Services LLC relies heavily on this type of technology in many of our services and products. Cloud computing can mean the difference between just getting the job done and getting the job done efficiently and before your competition. Kevin is a CCHA (yeah I got the certification before they changed it to CCAH), a licensed Investigator and Counterintelligence Specialist with SLC Security Services LLC. For more information on SLC Security you can visit the company website at
