On the DBIR, data analysis and information security

I read through Verizon’s DBIR (Data Breach Investigations Report) yesterday, and I believe (actually, I hope) this report will be a turning point in the way we handle information security and data breach reporting.

I don’t say this because of the refreshing lack of pie charts, or because the meme density of the analysis text is over Nine Thousand. I say this because of how the data was handled in the report. It was handled… like data.

By the numbers

The DBIR team made bold analytical decisions about how to handle the data they had. As far as I can tell, clustering the attack/breach information based on behavior and other features has not been done before, and once you look at it, the outcome seems logical and natural.

Now this, boys and girls, is what good data analysis looks like: you formulate a hypothesis and use the data to validate it. The only analyst intervention (unless there was some huge cherry-picking involved, which is unlikely given the unprecedented size of the data set) was deciding where to slice the dendrogram in the hierarchical clustering output.
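To make the idea concrete, here is a minimal sketch of that workflow in Python. This is my own illustration, not the DBIR team's actual pipeline: the toy feature matrix, the Ward linkage choice, and the distance threshold are all assumptions made purely for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

# Toy incident matrix: rows are incidents, columns are hypothetical binary
# behavioral features (e.g. phishing used, malware installed, data exfiltrated).
incidents = rng.integers(0, 2, size=(20, 6))

# Agglomerative hierarchical clustering (Ward linkage, an assumed choice)
# builds the dendrogram bottom-up from the incident feature vectors.
Z = linkage(incidents, method="ward")

# "Slicing the dendrogram": cut the tree at a chosen distance threshold,
# assigning every incident a cluster label. Where you cut is the analyst's
# main judgment call.
labels = fcluster(Z, t=2.5, criterion="distance")
print(labels)
```

Moving the threshold `t` up or down yields fewer, coarser clusters or more, finer ones; that single knob is the "interference" referred to above.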

And while this may seem “Oh! Magical!” to us Infosec people, an experienced data analyst would go: “Meh, so what? That is literally the first thing you do when you have a new data set to analyze”.

Yes, my data analyst friend, it is. And I really hope it is not the last thing we do with it.

Data will set you free

I was about to wrap this up when I came across a few criticisms on Twitter (from people whose work I absolutely respect) discrediting the results, in the spirit of “We already knew that! Cut out the marketing stunt!”

I can totally get behind the pushback on marketing, as I have been guilty of that many times myself. The “same-ism” and inaccuracies in these breach reports are staggering. However, we are dealing with a different animal here, and I wish the naysayers would read through it and draw a more informed conclusion.

The simple fact that we can analyze a sample of the unknown population of breaches worldwide and reach the same conclusions as experts who have access to classified info and secret squirrel intel fills me with hope. Hope that we may be ready to move away from the cargo cult and shamanism that pervades our industry today. Hope that data can start to be used more meaningfully to drive our decision making on Infosec.

Information Security is not a special snowflake. We are not “different”. If you have enough data to analyze, the patterns will emerge.

Suggested reading: The Hedgehog and the Fox

Tactical and Strategic Data

One of the best outcomes of my recent experiences at BlackHat and DEFCON was becoming more aware of the greater data-driven community and its opinions and experiences in dealing with data.

As such, I was exposed to the ideas of some of our risk luminaries, and have been catching up on a lot of very good blog posts. Two recent posts in particular caught my eye, both seemingly inspired by a very long discussion on the SIRA mailing list, which I only joined after the carnage had ended.

Both posts deal with the role of different kinds of data on the greater analysis picture of Information Security, and my personal belief lies somewhere in the intersection of these two posts.

Alex’s post approaches it in a more econometric and philosophical way, layering the kinds of data by their level of strategic usefulness, from packet and log data to information risk and finally operational risk data, in a fair comparison to micro- and macroeconomics.

Allison’s is much more pragmatic, seeking to differentiate (while demonstrating the similarities between) big data storage and processing technologies and our more traditional SIEM relational database infrastructure, and showing how the same techniques, or at the very least the same data-driven concepts, could be applied to both, in a fair comparison to errr… fruit salad.

And both reach the exact same point in very different ways: it is all data, and it can help your security objectives through data analysis and probabilistic techniques.

I have made this point over and over again, in face-to-face meetings and in my talks: few individuals and organizations in InfoSec are embracing the new capabilities made available by our era of almost infinite storage and incredible computing power. We are always very late to the party on technological and procedural advances, held back by a misguided belief that “InfoSec is different”, and then we are dragged kicking and screaming into the new reality by sub-optimal vendor interpretations of what we should need.

The main reason I chose to begin the development of MLSec Project with “simple” SIEM and Log Management data is that every single organization has a lot of this data lying around. If quick wins can be demonstrated with this kind of data, maybe we can awaken the curiosity and appetite of those organizations, so that bigger questions can then be asked and answered satisfactorily.

The truth is that as much as I would love to tackle the more strategic and broad problems first, the necessary data is just not there. We are still taking such a beating in the lower levels of tactical and reaction-based security that there is sadly no time to build a shelter from the rain to try to do more noble and holistic work.

Maybe we can help turn this tide a bit. Let me know what you think in the comments.