What is an IOC?

An Indicator of Compromise (IOC) is an artifact observed on a network or in an operating system that indicates, with high confidence, an intrusion.

IOC Extractors

The past few months have seen many new projects for extracting IOCs from various sources. A quick online search turns up plenty of them. Most of these projects are essentially regular-expression search programs.

They use regular expressions to search for interesting artifacts such as:

  • IP Addresses
  • Emails
  • Domains
  • Hashes (MD5/SHA/etc)
  • Other actionable artifacts and metadata
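
To make the idea concrete, here is a minimal sketch of what such an extractor boils down to. The patterns and the names `IOC_PATTERNS` and `extract_iocs` are my own illustrative assumptions, not taken from any particular project, and the regular expressions are deliberately simple rather than exhaustive.

```python
import re

# Illustrative patterns only (assumed for this sketch); production
# extractors handle many more cases, such as defanged indicators.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "domain": re.compile(
        r"\b(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_iocs(text):
    """Return a dict mapping each IOC type to its sorted, de-duplicated matches."""
    found = {}
    for name, pattern in IOC_PATTERNS.items():
        found[name] = sorted({m.group(0) for m in pattern.finditer(text)})
    return found
```

Note that the domain pattern also picks up the domain part of every email address, which is a first hint of how little these matches know about where they came from.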

I needed a program like this, so I wrote a short one for myself. I was tempted to spend more time on it and add more features. In the end, however, I resisted and contributed to another project instead.

Text extraction

The first problem is extracting text from various document formats. While this is technically doable, building a program that handles most documents is hard. Many libraries in various languages do this, but most of the ones I looked at are not 100% reliable. As a simple test, try extracting text from PDFs reliably. It is not straightforward, and it fails on many documents. There is also the problem of false positives (see below).
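
As a rough illustration of the extraction step, here is a sketch for HTML rather than PDF, since HTML can be handled with Python's standard library alone while PDF needs a third-party library. The class `TextExtractor` and the function `html_to_text` are hypothetical names for this example. It keeps only visible text and drops markup, scripts, and styles, which is exactly the kind of structure a regular expression would otherwise happily match.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)
```

Even this simple case needs decisions about what counts as text. Attribute values such as script URLs are dropped here, which avoids one class of false positives but would also hide an indicator that only appears in a link target.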

Blind data extraction

Extracting patterns that match a regular expression is easy, but capturing the context around each match is hard. Without that context, a human still needs to review the extracted data. Recovering context is a difficult problem that calls for natural language processing (NLP). There are entire companies built around just this. It takes a long time to build a system that does this, and does it the right way.

Any other approach is not going to be strong enough to deploy with high confidence. Typically, once the program extracts the IOCs, a human has to verify that they are correct and not parts of the document's structure itself. In the end, if a human has to look at the output anyway, it defeats the whole purpose.
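
A small sketch shows how quickly blind matching goes wrong. The patterns, the sample text, and the helper `matches_with_context` are all assumptions made up for this illustration. The helper returns each match together with a window of surrounding characters, which is about the best a pure regular-expression tool can offer a reviewer.

```python
import re

# Deliberately naive patterns, assumed for this sketch.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN = re.compile(r"\b(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}\b")


def matches_with_context(pattern, text, width=30):
    """Return (match, surrounding text) pairs so a reviewer can judge each hit."""
    results = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        results.append((m.group(0), text[start:end]))
    return results


# Made-up report text: one real indicator, two look-alikes.
sample = (
    "Generated with ReportTool 4.2.0.1. "
    "See appendix.pdf for details. "
    "The implant beacons to 198.51.100.23."
)

for hit, context in matches_with_context(IPV4, sample):
    print(hit, "|", context)
for hit, context in matches_with_context(DOMAIN, sample):
    print(hit, "|", context)
```

On this sample the IP pattern flags the version number 4.2.0.1 alongside the real address 198.51.100.23, and the domain pattern flags the file name appendix.pdf. The context window makes the mistake obvious to a person, but the program itself has no way to tell the three apart.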

References

  1. Indicator of compromise