11.9.22

Cyber Ontology Stamp Collection

Just a short and shallow note about knowledge representation, specifically in the domain of cybersecurity. As usual, I am the sole owner of my opinions.

Ontology and Knowledge Graphs

Ontology is the word you use when you get tired of saying "machine-readable representation of concepts" too many ways. The machine-readable part is important: over the decades, everyone working in large organizations runs into the need for modeling domains in software, but software devs need to work with crisp, stable definitions. The people working in the domain (the domain experts) do not have ready and sometimes keep deliberately flexible. So defining an ontology is not a philosophical exercise in the nature of things, but actually tied to the goal of helping the domain experts do their work.

An ontology is like a controlled vocabulary. As software devs know, there is rarely a universally accepted standard machine-readable representation. Everyone defines their own databases, schema definitions etc. Therefore, ontology enters the picture when there are multiple (typically large) independent organizations and we are interested in exchanging information. Unlike XML or file formats, ontology classes of objects (entities) and binary relations. This comes from description logic as well as semantic web / RDF which I won't go into.

The essential thing is that in the description logic / semantic web / etc approach, knowledge representation can be separated into terminology definitions (TBox) and assertions/statements (ABox). The terminology definitions are similar to a schema or object model and definewhat kind of objects and relations we can talk about. For instance, we may want to talk about blog, author, post and connections such as isOwner, wrote, linksTo. The assertion "Author Burak owns Blog Uncut Blocks of Wood" can then use these definitions.

Of course none of this really works really well. Barry Smith, an expert on ontology, in the first 10 mins of his entertaining talk Basic Formal Ontology (BFO), where he goes "people go and create thousands of ontologies and reproduced the silos that ontologies were supposed to overcome". This is reality.

Cybersecurity: STIX and ATT&CK

Let's shift gears and pick an interesting domain now, one where we really all are interested in progress. Would it not be nice if the various words that people in charge of cybersecurityare throwing around would be organized in a consistent manner and actionable, useful information ("intelligence") shared between organizations? Would this exchange information, and rapidly so, and realize the dream of having good actors build a united front against the bad guys?

Let's not get ahead of ourselves. If you listened to Barry Smith, using ontologies is not the silver bullet. It is nevertheless interesting to see what organizations are throwing out there. And if you are here for the knowledge representation, you can still see cybersecurity as an interesting case study of how these things can play out in practice.

When we talk about controlled vocabulary, we are talking about a form of standardization. Richard Struse (Homeland Security, MITRE, Oasis, ...) saw the need to and opportunity for defining a controlled vocabulary that goes by the name Structured Threat Information Expression (STIX™). The MITRE ATT&CK "framework" is a collection of objects describing attacker behavior.

STIX is defining the terminology, the objects of interest.

For the technically minded, there are JSON Schemas for the domain objects and likewise for the relations. It is an exercise to build code that reads these things. There is also a protocol called TAXII for realizing the information exchange.

STIX Objects

We have 19 domain objects, and relations between them. I will not list them all, but give an example of the "targets" relation: Let's consider a statement: The attack-pattern 'SQL Injection' targets the identity 'domain administrators'.
  • We have objects of type attack pattern
  • We have objects of type identity (more on this later)
  • We can have a relation (directed edge) of type targets between an attack-pattern object and an identity object.

You get the idea. There are various relationships, and in the end we can imagine a graph with objects being the nodes and the relations being the edges.

MITRE ATT&CK

As mentioned: MITRE ATT&CK is a knowledge base that makes use of the STIX vocabulary. It is a collection of statements that uses the terminology. It is called a knowledge base because unlike relational database, the knowledge and added value comes from the connections.

It seems that the cybersecurity industry picked up MITRE ATT&CK as a buzzword to signal features/capability, something to enhance credibility of cyber services and products. It is very interesting that the vocabulary definitions - the ontology - is not something that generated as much buzz. To some extent, this is expected: the domain model is something left to domain experts, and the service industry sells the expertise and need not burden the customers with "the details". If you want to get into cybersecurity, you have the option of learning from the ATT&CK "framework", but also to pick up the lingo from the underlying interchange format. It is not meant as a good resource for learning for beginners, but the open source style availability of all this is clearly a good thing.

Socio-technical limits of knowledge exchange

This approach to knowledge graph is entirely standard, but an obvious thing to point out is that it is very limiting that relations are binary. But since it is standard in these approaches to ontologies, let's not get hung up on this.

We have a shared/standardized vocabulary, what is not to like? Why did we not crush the bad guys yet? There are some practical obstacles to "information sharing" in general, as Phil Venables lays out in a recent post Crucial Questions from Governments and Regulators. There is clearly more obstacles such as trust, but no need to go further.

Let's instead narrow this down to organizations that are capable of doing their own software and who are committed to keeping threat intel. We get to more specific socio-technical obstacles. This shines new light on the silo problem already mentioned above.

Challenging the vocabulary. Take the identity object type. The docs say "Identities can represent actual individuals, organizations, or groups (e.g., ACME, Inc.) as well as classes of individuals, organizations, systems or groups (e.g., the finance sector)." What if I want to organize my threat intelligence that doesn't lump them all together?

Reasonable arguments can be made to defend this choice: that a more precise ontology that distinguishes between groups of people or roles is beyond the scope of STIX. That this is for exchange in the end and therefore some kind of mapping is to be expected. However, it looks like there are some real limits that are not easily overcome by any machine-readable representation.

Are the statements useful? In the end, all this information is for people. People who may work with different capabilities, different models, different ideas on granularity and naming. What if the organization supposed to consume this information cannot do anything with the attack pattern "SQL Injection"? Or, they could very much but chooise to not represent SQL Injection as its own thing, because it is useless for their purposes? Every organization is subject to their own boundaries. A list of attack patterns is not by itself a systematic approach: it is literally just a list of stuff.

What are the processes that will make use of this information? In any case, it seems that adaptation, mapping and possibly human curation is necessary if one wants the exchange information to have practical benefits.

Detection

There are other types of information that can be shared. For example, here are collections of detection rules:

At least here, it is very clear what the process is: enhance security monitoring so that it covers the relevant (!) part of these open source repositories. Once again: not everything in there might be relevant. The same curation/adaption may be necessary, and maybe mapping the log formats and internal representation so these rules make sense. However, due to the narrower scope of detection, it is immediately clear what one would use these things for.

Wrap up. Socrates strikes.

I promised a shallow post, and here we are. Knowledge representation is exciting technical challenge, and one that is relevant to many industries and it is happening, picked up in many large organizations (e.g. the use of ontology in life sciences, genetics, or the military). Alas, we can also see how using ontology as a domain model comes with limitations: relations are binary, and representing knowledge has to be tied to some actual human process in order to be useful.

I will end with this Socrates quote (source Socrates on the forgetfulness that comes with writing):

Well, then, those who think they can leave written instructions for an art, as well as those who accept them, thinking that writing can yield results that are clear or certain, must be quite naive and truly ignorant of [Thamos’] prophetic judgment: otherwise, how could they possibly think that words that have been written down can do more than remind those who already know what the writing is about?

I can get excited about information sharing when it comes to education, "speaking the same language." When it comes to practical cybersecurity processes, the usefulness of machine-readable knowledge seems to stand and fall with the tools and how they support the people who need to make use of this information. I have doubts that ontology can serve as the defining elements of a community or movement of the good guys. However, it also seems better than nothing in a field like cybersecurity that needs to deal with a fast-changing environment of evolving technology and adversaries. Exchanging threat intelligence with STIX may help those who "already know what the writing is about," and after due mapping, adaptation, curation to their own model, serve as one step towards coordinated response.