Wednesday, October 8, 2014

It's all about the vectors, baby! Oh, and a healthy dose of metadata...

When we first started building intelligence systems for this program over two years ago, we had several goals in mind. Some of those goals are listed below:
  1. Read RSS content and store the full text and images from the RSS feeds long term
  2. Do the same for Facebook and Twitter: users can delete their posts, but we want an artifact copy
  3. Using strings and keywords, search the data to find relevant information as fast as possible
  4. Customized dashboards for our security analysts
  5. The ability to tag data and feeds so we can create categories that would be of interest to analysts
  6. Take user suggestions and make them a reality
What we ended up with was a good product that lets us store information for a set time frame and lets analysts research the links between artifacts. Today the technology is changing, and fast. Back in 2012 our only options were to use Hadoop for distributed computing to gain the advances in speed and searching (using MapReduce) or to write our own cloud-based system. So where are we in 2014?

Today we have landed somewhere in the middle. While we use Hadoop to search content for keywords and build relational data, most of what we do is based on our category tagging. Only one of our products is proprietary. 95% of the software we use is either open source or commercial software that we have purchased or licensed for a particular purpose. All of our intelligence software is open source. How we process the data is the proprietary portion of the equation. Using a string library that we wrote in house, we are able to do some really interesting things, such as archiving chat logs, following users on IRC, and finding relevant terms from a manually generated hit list in several sources of data, including Facebook, RSS and other Internet locations. You see, most of what you need to build an intelligence system is already out there. You just have to list your requirements, find open source libraries that provide that functionality, and build a process flow that uses those products to their potential. Most of the software that we use was not originally developed to do what we are using it for, and that's OK. We will modify the software to do what we want it to do.
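To make the hit list idea concrete, here is a minimal sketch in Python. It is not our in-house string library; the function name `find_hits`, the artifact IDs, and the sample text are all made up for illustration, and it assumes simple case-insensitive whole-word matching.

```python
import re

def find_hits(text, hit_list):
    """Return the hit-list terms that appear in text as whole words.

    Hypothetical helper for illustration: case-insensitive, whole-word
    matching only. A real pipeline would likely add stemming, aliases, etc.
    """
    found = set()
    for term in hit_list:
        # re.escape keeps punctuation in a term from being read as regex syntax
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return found

# Made-up artifacts from different sources (RSS, IRC, Facebook)
artifacts = {
    "rss-001": "New password dump posted to a paste site overnight",
    "irc-042": "anyone seen the dump from last week?",
    "fb-007": "Happy birthday to my favorite cousin!",
}

# Manually generated hit list (made-up terms)
hit_list = ["password", "dump", "paste site"]

# Tag every artifact with the hit-list terms it contains
tags = {artifact_id: find_hits(text, hit_list)
        for artifact_id, text in artifacts.items()}

for artifact_id, terms in tags.items():
    print(artifact_id, sorted(terms))
```

The source of the text does not matter to the matcher, which is the point: once everything is reduced to strings, the same hit list can be run over every feed.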

It doesn't matter where the information comes from at all. What matters is how you analyze that data for information and connect it to other information in your data store. If you can't connect the dots, the data is nothing more than a checkpoint in time. If you can't pull out keywords and terms, you don't know how to classify the artifact. If you can't build links through matching terms, you can't figure out which artifacts are related to other artifacts. This is data science at its best. And what about the metadata in files? I can't tell you the number of times that we were able to get a computer name out of dumped password disclosures, or to find similarities between different leaks through dialect and other specific details that allowed us to connect the data to a group or individual.
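Building links through matching terms can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not our production process: the function `link_artifacts`, the artifact IDs, and the extracted terms (a computer name, a spelling habit) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def link_artifacts(terms_by_artifact, min_shared=1):
    """Link artifacts that share extracted terms.

    Hypothetical helper for illustration. Returns a dict mapping each
    (artifact_a, artifact_b) pair to the set of terms they share,
    keeping only pairs with at least min_shared terms in common.
    """
    # Inverted index: term -> artifacts that contain it
    index = defaultdict(set)
    for artifact_id, terms in terms_by_artifact.items():
        for term in terms:
            index[term].add(artifact_id)

    # Every pair of artifacts under the same term shares that term
    shared = defaultdict(set)
    for term, artifact_ids in index.items():
        for a, b in combinations(sorted(artifact_ids), 2):
            shared[(a, b)].add(term)

    return {pair: terms for pair, terms in shared.items()
            if len(terms) >= min_shared}

# Made-up terms pulled from three leaks: keywords plus file metadata
terms_by_artifact = {
    "leak-a": {"WORKSTATION-7", "colour"},
    "leak-b": {"WORKSTATION-7", "colour", "admin"},
    "leak-c": {"admin"},
}

links = link_artifacts(terms_by_artifact)
strong_links = link_artifacts(terms_by_artifact, min_shared=2)

for pair, terms in sorted(links.items()):
    print(pair, sorted(terms))
```

Raising `min_shared` is a crude way to separate a coincidental overlap (one common word) from a pair of artifacts that share several distinctive details.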

Honestly, in the security analysis field the research is all about the vectors and the metadata. Use these clues to connect the dots. You may be surprised at what you find.
