On the DBIR, data analysis and information security

I read through the DBIR from Verizon yesterday and I believe (actually, I hope) this report will be a turning point in the way that we handle information security and data breach reporting.

I don’t say this because of the refreshing lack of pie charts, or the meme density on the analysis text being over Nine Thousand. I say this because of how the data was being handled in the report. It was being handled… like data.

By the numbers

The DBIR team made bold analytical decisions on how to manipulate what they had. Clustering the attack/breach information based on behavior and other features has not been done before as far as I can tell. When you look at it, the outcome seems logical and natural.

Now this, boys and girls, is what good data analysis looks like: you formulate a hypothesis and use the data to validate it. The only interference by the analysts (unless there was some huge cherry-picking involved, unlikely given the unprecedented size of the data set) was to decide where to slice off the dendrogram in the hierarchical clustering output.
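For anyone who has not played with hierarchical clustering before, here is a minimal sketch of what "deciding where to slice the dendrogram" means in practice. Everything in it is an assumption made for illustration: the incidents and their feature tags are invented, and the Jaccard distance and average linkage are simply common choices, not necessarily what the DBIR team used.

```python
from itertools import combinations

# Hypothetical incidents, each described by a set of behavior tags.
incidents = [
    {"malware", "pos", "ram_scraper"},
    {"malware", "pos", "brute_force"},
    {"web_app", "sqli", "stolen_creds"},
    {"web_app", "stolen_creds", "phishing"},
    {"insider", "privilege_misuse"},
]

def jaccard_distance(a, b):
    """Distance between two tag sets: 0 means identical, 1 means disjoint."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def average_linkage(c1, c2, items):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(items[i], items[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(items, cut):
    """Agglomerative clustering that stops merging once the two closest
    clusters are farther apart than `cut` (the height of the slice)."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        # Find the closest pair of clusters (indices a < b).
        (a, b), dist = min(
            (((x, y), average_linkage(clusters[x], clusters[y], items))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > cut:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# The same data gives different groupings depending on where you cut.
for cut in (0.4, 0.6, 1.0):
    print(cut, cluster(incidents, cut))
```

With this toy data, a cut at 0.4 leaves every incident on its own, a cut at 0.6 yields three groups (the two point-of-sale incidents, the two web application incidents, and the insider case), and a cut at 1.0 lumps everything together. The algorithm does the grouping; the analyst only picks the height of the slice.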

And while this may seem “Oh! Magical!” to us Infosec people, an experienced data analyst would go: “Meh, so what? That is literally the first thing you do when you have a new data set to analyze”.

Yes, my data analyst friend, it is. And I really hope it is not the last thing we do with it.

Data will set you free

I was about to wrap this up when I came across a few criticisms on Twitter (from people whose work I absolutely respect) discrediting the results, in the spirit of “We already knew that! Cut out the marketing stunt!”

I can totally get behind the pushback on marketing, as I have been guilty of it many times myself. The “same-ism” and inaccuracies in these breach reports are staggering. However, we are dealing with a different animal here, and I wish the naysayers would read through it and draw a more informed conclusion.

The simple fact that we can analyze a sample of the unknown population of breaches worldwide and reach the same conclusions as experts who have access to classified info and secret squirrel intel fills me with hope. Hope that we may be ready to move away from the cargo cult and shamanism that pervades our industry today. Hope that data can start to be used more meaningfully to drive our decision making in Infosec.

Information Security is not a special snowflake. We are not “different”. If you have enough data to analyze, the patterns will emerge.

Suggested reading: The Hedgehog and the Fox


Tactical and Strategic Data

One of the best outcomes of my recent experiences at BlackHat and DEFCON was becoming more aware of the greater data-driven community and their opinions on, and experiences in, dealing with data.

As such, I was exposed to the ideas of some of our risk luminaries, and have been catching up on a lot of very good blog posts. Two recent posts in particular caught my eye, both of them seemingly inspired by a very long discussion on the SIRA mailing list, which I was not a part of until after the carnage discussion had ended.

Both posts deal with the role of different kinds of data in the bigger picture of Information Security analysis, and my personal belief lies somewhere in the intersection of the two.

Alex’s post approaches it in a more econometric and philosophical way, as he layers the kinds of data by their level of strategic usefulness, from packet to log data, information risk and finally operational risk data, in a fair comparison to micro and macro economics.

Allison’s is much more pragmatic. It seeks to differentiate big data storage and processing technologies from our more traditional SIEM relational database infrastructure (while demonstrating their similarities), and shows how the same techniques, or at the very least the same data-driven concepts, could be applied to both, in a fair comparison to errr… fruit salad.

And both reach the exact same point in very different ways: it is all data, and it can help your security objectives through data analysis and probabilistic techniques.

I have made this point over and over again in face-to-face meetings and in my talks: there are few individuals and organizations in InfoSec that are embracing the new capabilities made available in our era of almost infinite storage and incredible computing power. We are always very late to the party in all technological and procedural advances because of a misguided belief that “InfoSec is different”, and then we are dragged kicking and screaming to the new reality by sub-optimal vendor interpretations of what we should need.

The main reason I chose to begin the development of MLSec Project on “simple” SIEM and Log Management data is that every single organization has a lot of this data lying around. If quick wins can be demonstrated with this kind of data, maybe we can awaken the curiosity and the appetite of those organizations so that bigger questions can then be asked and answered satisfactorily.

The truth is that as much as I would love to tackle the more strategic and broad problems first, the necessary data is just not there. We are still taking such a beating in the lower levels of tactical and reaction-based security that there is sadly no time to build a shelter from the rain to try to do more noble and holistic work.

Maybe we can help turn this tide a bit. Let me know what you think in the comments.

MLSec – Using Machine Learning to support Information Security

For the next few weeks, I will be discussing the main thesis of my BSides Las Vegas talk on this blog, in order to build the frame of mind and the discourse I need to be well prepared for the talk.

If anyone is reading as of now, this is a great opportunity to make some comments and let me know what you think of the subject matter and opinions. I’ll probably just ignore you, but hey, at least you tried!

I will be identifying those posts with MLSec in the title, for reasons that should be made clear pretty soon, and the full list of posts can be found below (as soon as there are any, that is).

Preparing an InfoSec Presentation

Ever since being accepted to the Proving Grounds in the BSides Las Vegas 2013 edition, I have been toying with the idea of documenting the process of putting this presentation together, as a way of helping me put the ideas that I need on paper before evolving them into their final form.

I am obviously very excited to be presenting, but also a little scared, especially because:

  • It has been some time (2+ years) since I had to present to a large InfoSec audience. The audience was very different as well, so that adds to the unfamiliarity.
  • I have never presented a talk on Machine Learning before, being a relative newcomer to the subject myself. I am not really sure how to make it interesting, especially for a crowd that is potentially hungry for “0day” and “pwns”.

But the most terrifying thing is that this time I seem to CARE more about it, being my research and all. Not that I didn’t appreciate the other talks I did, but I was mostly regurgitating information from other people and trying to make it pleasant to the viewers. Now I would like them to care about my research too.

Should be an interesting experience. Stay tuned.