11.9.22

Cyber Ontology Stamp Collection

Just a short and shallow note about knowledge representation, specifically in the domain of cybersecurity. As usual, I am the sole owner of my opinions.

Ontology and Knowledge Graphs

Ontology is the word you use when you get tired of saying "machine-readable representation of concepts" in too many ways. The machine-readable part is important: over the decades, everyone working in a large organization runs into the need for modeling their domain in software, and software devs need to work with crisp, stable definitions. The people working in the domain (the domain experts) do not have such definitions ready and sometimes keep them deliberately flexible. So defining an ontology is not a philosophical exercise in the nature of things, but actually tied to the goal of helping the domain experts do their work.

An ontology is like a controlled vocabulary. As software devs know, there is rarely a universally accepted standard machine-readable representation. Everyone defines their own databases, schema definitions etc. Therefore, ontology enters the picture when there are multiple (typically large) independent organizations and we are interested in exchanging information. Unlike XML or file formats, an ontology defines classes of objects (entities) and binary relations between them. This comes from description logic as well as the semantic web / RDF, which I won't go into.

The essential thing is that in the description logic / semantic web / etc approach, knowledge representation can be separated into terminology definitions (TBox) and assertions/statements (ABox). The terminology definitions are similar to a schema or object model and define what kind of objects and relations we can talk about. For instance, we may want to talk about blog, author, post and connections such as isOwner, wrote, linksTo. The assertion "Author Burak owns Blog Uncut Blocks of Wood" can then use these definitions.
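
To make the TBox/ABox split concrete, here is a minimal sketch in plain Python (no RDF machinery), with class and relation names made up along the lines of the blog example:

# Minimal sketch of the TBox/ABox split, using plain Python data structures.
# Class and relation names (Blog, Author, isOwner, ...) are made up for this example.

# TBox: terminology - which classes and relations may appear in statements.
TBOX = {
    "classes": {"Blog", "Author", "Post"},
    "relations": {
        # relation name -> (domain class, range class)
        "isOwner": ("Author", "Blog"),
        "wrote":   ("Author", "Post"),
        "linksTo": ("Post", "Post"),
    },
}

# ABox: assertions - concrete statements that use the terminology.
ABOX = [
    # (subject, relation, object) triples plus class assertions
    ("Burak", "a", "Author"),
    ("Uncut Blocks of Wood", "a", "Blog"),
    ("Burak", "isOwner", "Uncut Blocks of Wood"),
]

def check(abox, tbox):
    """Reject assertions that use relations not defined in the terminology."""
    for s, rel, o in abox:
        if rel != "a" and rel not in tbox["relations"]:
            raise ValueError(f"unknown relation: {rel}")

check(ABOX, TBOX)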

Of course none of this works really well. Barry Smith, an expert on ontology, makes this point in the first 10 mins of his entertaining talk Basic Formal Ontology (BFO), where he goes "people go and create thousands of ontologies and reproduce the silos that ontologies were supposed to overcome". This is reality.

Cybersecurity: STIX and ATT&CK

Let's shift gears and pick an interesting domain now, one where we really all are interested in progress. Would it not be nice if the various words that people in charge of cybersecurity are throwing around were organized in a consistent manner, and actionable, useful information ("intelligence") shared between organizations? Would this not let organizations exchange information, and rapidly so, and realize the dream of having good actors build a united front against the bad guys?

Let's not get ahead of ourselves. If you listened to Barry Smith, using ontologies is not a silver bullet. It is nevertheless interesting to see what organizations are throwing out there. And if you are here for the knowledge representation, you can still see cybersecurity as an interesting case study of how these things can play out in practice.

When we talk about a controlled vocabulary, we are talking about a form of standardization. Richard Struse (Homeland Security, MITRE, Oasis, ...) saw the need for and the opportunity of defining a controlled vocabulary that goes by the name Structured Threat Information Expression (STIX™). The MITRE ATT&CK "framework" is a collection of objects describing attacker behavior.

STIX defines the terminology, the objects of interest.

For the technically minded, there are JSON Schemas for the domain objects and likewise for the relations. It is an exercise to build code that reads these things. There is also a protocol called TAXII for realizing the information exchange.
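
As a rough illustration of what "code that reads these things" could look like, here is a small Python sketch that checks an object against a JSON Schema. The schema below is a deliberately simplified toy, not the official STIX schema; the real, much richer published schemas can be used the same way with the jsonschema package:

# Sketch: how one could check STIX-style objects with JSON Schema in Python.
# The schema below is a deliberately simplified toy, NOT the official STIX schema.
import jsonschema  # pip install jsonschema

toy_schema = {
    "type": "object",
    "required": ["type", "id", "name"],
    "properties": {
        "type": {"const": "attack-pattern"},
        "id": {"type": "string", "pattern": "^attack-pattern--"},
        "name": {"type": "string"},
    },
}

obj = {
    "type": "attack-pattern",
    "id": "attack-pattern--d0e05b3d-0e2f-4b5f-9c1e-3b8f0d1a2c3e",  # made-up id
    "name": "SQL Injection",
}

jsonschema.validate(instance=obj, schema=toy_schema)  # raises ValidationError on mismatch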

STIX Objects

We have 19 domain objects, and relations between them. I will not list them all, but give an example of the "targets" relation. Consider the statement: the attack-pattern 'SQL Injection' targets the identity 'domain administrators'.
  • We have objects of type attack-pattern
  • We have objects of type identity (more on this later)
  • We can have a relation (directed edge) of type targets between an attack-pattern object and an identity object.

You get the idea. There are various relationships, and in the end we can imagine a graph with objects being the nodes and the relations being the edges.
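
To make this concrete, here is a hedged sketch of that statement as STIX-style objects in Python; the ids are made up and the field sets are abbreviated, so consult the spec for what a complete object requires:

# Sketch: the statement "attack-pattern 'SQL Injection' targets identity
# 'domain administrators'" as three STIX-style objects (fields abbreviated,
# ids made up) and as a tiny directed graph.
attack_pattern = {
    "type": "attack-pattern",
    "id": "attack-pattern--1111aaaa-0000-4000-8000-000000000001",
    "name": "SQL Injection",
}
identity = {
    "type": "identity",
    "id": "identity--2222bbbb-0000-4000-8000-000000000002",
    "name": "domain administrators",
}
relationship = {
    "type": "relationship",
    "id": "relationship--3333cccc-0000-4000-8000-000000000003",
    "relationship_type": "targets",
    "source_ref": attack_pattern["id"],
    "target_ref": identity["id"],
}

# Objects are nodes, relationships are labeled directed edges.
nodes = {o["id"]: o for o in (attack_pattern, identity)}
edges = [(relationship["source_ref"],
          relationship["relationship_type"],
          relationship["target_ref"])]

for src, label, dst in edges:
    print(f"{nodes[src]['name']} --{label}--> {nodes[dst]['name']}")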

MITRE ATT&CK

As mentioned: MITRE ATT&CK is a knowledge base that makes use of the STIX vocabulary. It is a collection of statements that uses the terminology. It is called a knowledge base because, unlike in a relational database, the knowledge and added value come from the connections.

It seems that the cybersecurity industry picked up MITRE ATT&CK as a buzzword to signal features/capability, something to enhance the credibility of cyber services and products. It is very interesting that the vocabulary definitions - the ontology - did not generate as much buzz. To some extent, this is expected: the domain model is something left to domain experts, and the service industry sells the expertise and need not burden the customers with "the details". If you want to get into cybersecurity, you have the option of learning from the ATT&CK "framework", but also of picking up the lingo from the underlying interchange format. It is not meant as a learning resource for beginners, but the open source style availability of all this is clearly a good thing.

Socio-technical limits of knowledge exchange

This approach to knowledge graphs is entirely standard, but an obvious thing to point out is that restricting relations to binary ones is very limiting. Since this is standard in these approaches to ontologies, though, let's not get hung up on it.

We have a shared/standardized vocabulary, what is not to like? Why did we not crush the bad guys yet? There are some practical obstacles to "information sharing" in general, as Phil Venables lays out in a recent post Crucial Questions from Governments and Regulators. There are clearly more obstacles, such as trust, but no need to go further.

Let's instead narrow this down to organizations that are capable of building their own software and that are committed to keeping threat intel. We get to more specific socio-technical obstacles. This shines new light on the silo problem already mentioned above.

Challenging the vocabulary. Take the identity object type. The docs say "Identities can represent actual individuals, organizations, or groups (e.g., ACME, Inc.) as well as classes of individuals, organizations, systems or groups (e.g., the finance sector)." What if I want to organize my threat intelligence in a way that doesn't lump them all together?

Reasonable arguments can be made to defend this choice: that a more precise ontology that distinguishes between groups of people or roles is beyond the scope of STIX. That this is for exchange in the end and therefore some kind of mapping is to be expected. However, it looks like there are some real limits that are not easily overcome by any machine-readable representation.

Are the statements useful? In the end, all this information is for people. People who may work with different capabilities, different models, different ideas on granularity and naming. What if the organization supposed to consume this information cannot do anything with the attack pattern "SQL Injection"? Or, they very much could, but choose not to represent SQL Injection as its own thing, because it is useless for their purposes? Every organization is subject to its own boundaries. A list of attack patterns is not by itself a systematic approach: it is literally just a list of stuff.

What are the processes that will make use of this information? In any case, it seems that adaptation, mapping and possibly human curation are necessary if one wants the exchanged information to have practical benefits.

Detection

There are other types of information that can be shared. For example, there are collections of detection rules, shared as open source repositories.

At least here, it is very clear what the process is: enhance security monitoring so that it covers the relevant (!) part of these open source repositories. Once again: not everything in there might be relevant. The same curation/adaptation may be necessary, and maybe a mapping between log formats and the internal representation so these rules make sense. However, due to the narrower scope of detection, it is immediately clear what one would use these things for.
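
As a toy illustration of that mapping step (the rule, the field names and the log records below are all made up; nothing here comes from an actual rule repository):

# Toy sketch of adapting a shared detection rule to internal log data.
# The rule, field names and log records are made up for illustration.
rule = {
    "title": "Suspicious use of certutil",
    "condition": {"field": "process_name", "contains": "certutil"},
}

# The shared rule talks about "process_name"; our internal logs call it "proc".
FIELD_MAP = {"process_name": "proc"}

internal_logs = [
    {"proc": "chrome.exe", "host": "ws-01"},
    {"proc": "certutil.exe -urlcache", "host": "ws-02"},
]

def matches(rule, record):
    cond = rule["condition"]
    internal_field = FIELD_MAP.get(cond["field"], cond["field"])
    return cond["contains"] in record.get(internal_field, "")

hits = [r for r in internal_logs if matches(rule, r)]
print(hits)  # -> only the ws-02 record matches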

Wrap up. Socrates strikes.

I promised a shallow post, and here we are. Knowledge representation is an exciting technical challenge, one that is relevant to many industries, and it is happening: it has been picked up in many large organizations (e.g. the use of ontology in life sciences, genetics, or the military). Alas, we can also see how using ontology as a domain model comes with limitations: relations are binary, and representing knowledge has to be tied to some actual human process in order to be useful.

I will end with this Socrates quote (source: Socrates on the forgetfulness that comes with writing):

Well, then, those who think they can leave written instructions for an art, as well as those who accept them, thinking that writing can yield results that are clear or certain, must be quite naive and truly ignorant of [Thamos’] prophetic judgment: otherwise, how could they possibly think that words that have been written down can do more than remind those who already know what the writing is about?

I can get excited about information sharing when it comes to education, "speaking the same language." When it comes to practical cybersecurity processes, the usefulness of machine-readable knowledge seems to stand and fall with the tools and how they support the people who need to make use of this information. I have doubts that ontology can serve as the defining element of a community or movement of the good guys. However, it also seems better than nothing in a field like cybersecurity that needs to deal with a fast-changing environment of evolving technology and adversaries. Exchanging threat intelligence with STIX may help those who "already know what the writing is about," and after due mapping, adaptation and curation to their own model, serve as one step towards coordinated response.

7.6.22

There is only stream processing.

This post is about what to call stream processing. Strangely enough, it has little to do with terminology, but almost everything to do with perspective.

If you are pressed for time (ha!), the conclusion is that "stream processing vs batch processing" is a false dichotomy. Batch processing is merely an implementation of stream processing. Every time people use these as opposites, they are choosing the wrong abstraction.

Can a reasonable definition be general?

A reasonable definition of a data stream is a sequence of data chunks, where the length of the sequence is either "large" or unbounded.

This is a pretty general definition, especially if we leave open what "large" means.

  • Anything transmitted over TCP, say a file or HTTP response
  • a Twitter feed
  • system or application logs, say metadata from any user's visit to this blog
  • the URLs of websites you visit
  • location data acquired from your mobile phone
  • personal details of children who start school every year

It also stands to reason that stream processing is processing of a data stream. A data stream is data that is separated in time.
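
In code, this general definition is nothing more exotic than an iterator. Here is a minimal Python sketch, with a made-up source, of a data stream whose chunks are separated in time:

# Minimal sketch: a data stream is just a (possibly unbounded) iterator of chunks
# that become available over time. The "source" here is simulated with a sleep.
import time
from typing import Iterator

def log_stream() -> Iterator[dict]:
    visit = 0
    while True:                      # unbounded sequence
        time.sleep(1.0)              # chunks are separated in time
        visit += 1
        yield {"visit": visit, "path": "/blog"}

def process(stream: Iterator[dict]) -> None:
    for chunk in stream:             # stream processing: consume chunks as they arrive
        print("got", chunk)

# process(log_stream())  # runs forever; a file read in blocks fits the same shape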

When a definition is that general, engineers get suspicious. Should it not make a difference whether a chunk of data arrives every year or every millisecond? Every conversation about stream processing happens in a context where we know a lot more about the temporal nature of the sequence.

The same engineers, when dealing with distributed systems, have no problem referring to large data sets as a whole, even if they know full well that the data may be distributed over many machines. We may well call a large data set data that is separated in space. A data stream could also be both, separated in time and space (maybe we could call that a "distributed data stream").

At the warehouse

Where am I going with this? Let's quickly establish that "batch processing" is but a form of stream processing, in the above definition.

What do people call batch processing? The etymology of this goes back to punch cards, but it is not about those.

Batch processing is frequently found in data warehousing, in the form of extract-transform-load (ETL) processes. This is a setup that has been around since the 1970s, and that does not make it bad. What is essential is that data is periodically ingested (extract), say in the form of large files. It is then turned (transform) into a uniform representation that is suitable for various kinds of querying (load).

Accumulating data and processing it periodically - we have seen this before. Does the data fit the general definition of a data stream? Surely, since there is a notion of new data coming in.
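
Here is a sketch of that claim, with made-up paths and a stubbed transform: the classic nightly ETL job is just a consumer of a slow stream whose chunks happen to be files.

# Sketch: periodic ("batch") ingestion seen as stream processing.
# Directory name and transform are made up; the point is the shape of the loop.
import time
from pathlib import Path

INCOMING = Path("/data/incoming")   # new files land here (extract)
seen = set()

def transform_and_load(path):
    # turn the file into the uniform representation and load it (stub)
    print("loading", path.name)

while True:
    for f in sorted(INCOMING.glob("*.csv")):
        if f not in seen:           # each new file is one chunk of the data stream
            transform_and_load(f)
            seen.add(f)
    time.sleep(24 * 3600)           # "nightly" is just a very coarse arrival rate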

What could be alternative terminology? Tyler Akidau uses the word "batch engine" in this post on stream processing. So the good old periodic processing could be called "stream processing done with a batch engine."

I promised that this is not only about terminology. When did people first feel the need to distinguish batch processing from stream processing?

Data stream management systems

The people who systematically needed to distinguish processing of data streams from processing of simply large data sets were the database community. Dear academics, I don't mean to hurt any feelings but I will just count all papers on "data stream processing" or "complex event processing" as database community.

The database researchers implemented efficient evaluation of a (reasonably) standard, high level, declarative language: relational queries (SQL). Efficient evaluation and performance meant making the best use of available, bounded resources. As part of this journey, architectures appeared that already look a lot like streaming: operators passing along streams of partial results (such as Volcano, or the iterator model).

Take a look at the relational query (SELECT becomes $\pi$, WHERE becomes $\sigma$, JOIN becomes $\Join$). What would happen if we flipped the direction of the arrows? Instead of bounded data like the "Employee" and "Building" tables, we could have continuous streams that go through the machinery.
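
For concreteness, here is a made-up query over those two tables (the column names are invented for this example): SELECT name FROM Employee JOIN Building ON Employee.building_id = Building.id WHERE Building.city = 'Zurich' corresponds to

$\pi_{\mathrm{name}}\big(\sigma_{\mathrm{city} = \text{'Zurich'}}(\mathrm{Employee} \Join_{\mathrm{building\_id} = \mathrm{id}} \mathrm{Building})\big)$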

Exercise: draw the diagram above, with all arrows reversed, and think about how this could be a useful stream processing system. Maybe there is a setup procedure that has to happen in each building before an engineer starts working from there and they would like to know about arrivals and new departures.

Data stream management systems (DSMS) became a thing 20 years ago when implementors realized that most of their stuff would continue to work when tuples come in continuously, mutatis mutandis.

  • Michael Stonebraker, of PostgreSQL fame, built a system called Aurora and founded a company, StreamBase Systems
  • At Stanford they wrote about the Stanford Stream Data Management System and Continuous Query Language (CQL)
  • (I'm not going to do a survey here)

Academic authors will delineate their field using descriptions as follows.

  • "high" rate of updates
  • requirement to provide results in "real time" (bounded time)
  • limited resources (e.g. memory)

We are now getting closer to the more specific meaning people attach to "stream processing:" not only do we receive the data in chunks at different times, we also need to produce results in a "short" amount of time, or with "bounded" machine resources.

In order to understand why database folks got into streaming, one should know that evaluating queries in a DBMS with high performance is also a form of stream processing. When a user types a SQL query at the prompt (SELECT ... FROM ..., the all-caps giving that distinctive pre-history feel), the result should come in as quickly as possible, and there are surely some limits on the machine resources. So generalizing query evaluation to continuously arriving data really was a logical next step to overcome limitations.

Interlude: Windowing and micro-batching

If you did the exercise, you may wonder how a join is supposed to work when we take a streaming perspective. There are multiple answers.

A particularly nice and useful scenario is when the join can be considered a simple lookup, for example when each row in a data stream is enriched with something that is looked up via a service.

Another scenario is the windowed stream join: here we consider bounded segments (windows) of two data streams and join what we find in those. Usually, this requires some guarantees: the streams are a priori not synchronous. They may either be synchronous enough, or one may use some amount of buffering and periodically process what is in the buffer.

Wait - did I just write "periodic processing"? That is right, so it looks like when stream processing is hardcore enough, it contains periodic processing again. This is where usually people will say things like "micro-batch." There are simply scenarios in stream processing that cannot be done without periodic processing (everything that involves windows in processing time).
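
Here is a minimal Python sketch of such a windowed, micro-batched join; the event shapes and the join key are made up, and the buffers stand for whatever arrived during one processing-time window:

# Sketch: a processing-time windowed join of two streams via periodic buffering.
# Event shapes and the join key ("user") are made up for illustration.
from collections import defaultdict

def windowed_join(clicks_window, purchases_window):
    """Join whatever the two buffers contain for one window, keyed by user."""
    by_user = defaultdict(list)
    for c in clicks_window:
        by_user[c["user"]].append(c)
    return [
        (c, p)
        for p in purchases_window
        for c in by_user.get(p["user"], [])
    ]

# One "micro-batch": the buffers collected during, say, the last 10 seconds.
clicks_buffer = [{"user": "alice", "page": "/pricing"}]
purchases_buffer = [{"user": "alice", "amount": 42}]

print(windowed_join(clicks_buffer, purchases_buffer))
# In a real system, a timer would flush and join the buffers every window,
# which is exactly the periodic processing discussed above.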

A horizontal argument

Now, the words "limited machine resources" meant something different 20 years ago. We can (and do) build systems that involve many machines and communication, sharding or "horizontal scaling". From the good old MapReduce to NoSQL and Craig Chambers' FlumeJava (which lives happily on as Apache Beam) to Spark and Cloud Dataflow, there is a series of systems and APIs that deal with stream processing in a distributed manner.

Tying the knot

The irony of talking about "batch processing" today is that periodic processing of large files also involves distributed stream processing underneath. When a large input data set is processed with a MapReduce, it is distributed across a set of workers that, in the map phase, locally produce partial results. The shuffle phase then takes care of getting all partial results to the right place for the reduce phase. The "separation in space" is also a "separation in time:" a particular partition can be "done" before others, which means that the result is itself a result stream.
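
A small sketch of that last point, with made-up partitions and a thread pool standing in for distributed workers: results become available as partitions finish, i.e. as a stream.

# Sketch: partitions processed in parallel ("separation in space") produce
# results at different times ("separation in time"), i.e. a result stream.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(4)]  # made-up data

def reduce_partition(part):
    time.sleep(random.random())      # partitions finish in no particular order
    return sum(part)

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(reduce_partition, p) for p in partitions]
    for done in as_completed(futures):            # consume results as they arrive
        print("partial result ready:", done.result())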

Depending on the level at which one is discussing a system, the performance expectations we have, the trade-offs, one may be able to ignore spatial and temporal separation. It seems that recognizing the temporal separation always brings advantages.

8.4.22

Polaroids from Swiss Cyber Security Days SCSD Day 1

Here are a few things I took home from Day 1 of the Swiss Cyber Security Days conference (SCSD), which was two days ago. I work as a software engineer and am used to dealing with technical angles, but cyber security is one of the areas where technological change impacts society at large. I made the decision to go shortly after the Russia-Ukraine war broke out; if cyber security was a pressing problem before, the new political era should make it impossible for society to ignore, and I wanted to find out first-hand what that means.

(Note: the title is a kind of metaphor that is resolved at the end. The public domain images I include here are random things from the interwebs, not actually from the conference.)

Swiss Officials and National Councillors

Day 1 was opened according to official protocol by Doris Fiala, member of the Swiss National Council, member of the parliamentary Security Policy Committee, and president of SCSD. She greeted many of the present officials and parliamentary members. The morning featured Swiss officials, including high-ranking ones who are directly responsible for dealing with cyber security.

The speakers provided perspectives on Swiss national policy, the role of the federal government and the armed forces. The officials spoke in the national languages German and French, often switching from one to the other in the same talk, and there was simultaneous translation (like in UN conferences). A few highlights:

Florian Schütz has been acting as Swiss federal Cyber Security Delegate since 2019 and reported on strategy and outlook. He talked about the activities of the National Cyber Security Center NCSC, and how it relates to the Financial Sector security center which was founded this week in Zurich. It is a fact that cyber security means something different in every sector and that a federal center would be limited in depth, which can be addressed by creating and supporting sector-specific cyber security centers where decision makers from companies in that sector can participate directly. He also pointed out that most companies in Switzerland are SMBs, and even if many are aware of the problem now, they do not know what concretely to do about the risk.

The majority of Swiss companies have less than 0.5M CHF revenue per year of which they might be able to set aside 3-5% for cyber security. This is not a lot! Getting companies to take the risk seriously and improve their security posture through awareness and training was picked up by other speakers.

The challenge for official institutions is that they have in general not been made aware of cyber security incidents, since companies are not incentivized to report breaches. There is a current proposal to make reporting mandatory, which would help improve visibility.

Major General Alain Vuitel is director of the Project Cyber Command of the Swiss Armed Forces, and talked about the process of establishing the Cyber Command. This is a significant step, since the army is structured into four existing commands and the Cyber Command will become the fifth one. Vuitel pointed out the importance of information in warfare and that this is a factor in Ukraine, with civil society suffering collateral damage from attacks on communication infrastructure like satellites and from disinformation campaigns.

The cyber attack on the KA-SAT satellite communications network on Feb 24th, the day Russia attacked Ukraine, led to loss of maintenance capabilities for a German operator of wind turbines. Also, in the weeks leading up to the outbreak of war, there were cyber attacks on Ukrainian infrastructure. War is no longer only armed conflict; covert operations and information war mean that in times of "peace", there are attacks going on, and this accepted fact was observed in real time. Vuitel also pointed out that modern communication also brings new potential: every citizen with a mobile phone can act as a sensor, providing information, provided they have the freedom to act, which underlines the importance of communication infrastructure.

(image: keyboard with lock, standing in for ransomware)

Finally, he pointed out that while recruitment is challenging, Switzerland has a system of mandatory military service and a militia system. This means the armed forces are able to reach large parts of the population and provide them with cyber security training, and it provided them with the means to recruit their first cyber battalion.

Judith Bellaiche, National Councillor and CEO of SWICO, described the evolution of cyber security as reflected in parliamentary discussions over the last 15 years. Political discourse in the Swiss parliament reflects society, and the topic of cyber security evolved from early years of becoming aware, demanding reports and making sense of the topic, to increasing political demands for new official competencies and laws. Most impressively, she summarized this as raw fear: the parliamentary representatives feel that the situation is out of control and demand action, with proposals to expand the role of the state. In order to validate this, representative surveys were conducted, and it turned out that the general population is even more afraid and leans towards expanding the role of the state. This is a political development which will likely have consequences. It would be a major paradigm change if the role of the state were expanded to protect private companies from cyber attacks; however, it does not look like the private sector can fix this themselves. The debate on expanding the technical and organizational competencies of the state is certain to affect Swiss society.

Practitioners and the Private Sector: Ransomware

In a podium discussion organized by the World Economic Forum on "What’s next for multistakeholder action against cybercrime," it was articulated that ransomware is the biggest topic in cyber security. Countless companies are victims of ransomware attacks, and the conference featured experience reports from incident responders. Jacky Fox, managing director Accenture Security, told a war story of Irish hospitals losing their entire infrastructure, doctors having to resort to pen and paper, unable to operate diagnostic machines and unable to effectively treat patients, and incident responders having to face that the impact of their decisions is a matter of life and death.

(image: keyboard with lock, standing in for ransomware)

Serge Droz, seasoned incident responder and director of FIRST (Forum of Incident Response and Security Teams), pointed out that case data is important. Investigating a single case of ransomware is impossible, while data from 20-30 cases of ransomware provides a basis for investigation, as one can recognize patterns. He also pointed out the difference in roles in a multistakeholder action: incident responders have as their first priority the return to normal conditions, while prosecution's first priority is identifying the attacker and forensics.

Maya Bundt, Cyber Practice Leader at Swiss Re and representing insurers, pointed out how the idea of data sharing is much debated but rarely gets concrete. There are many different kinds of "data" involved in ransomware attacks, from technical security-relevant data like indicators of compromise to case-specific data, such as how something happened or data specific to the victim company, and in the debate this is often all mixed up. For insurers, what matters is cost, and the size of the ransom is often only a small fraction of the cost to a company, which needs to deal with data loss, getting back to business and forensics. It was pointed out that the role of the insurance company is to make the risk transferable, while the risk associated with a cyber security attack is not completely transferable. Insurers do play an important role since they can demand that companies implement practices and improve their posture, and they have an interest in doing so, because if too many clients are attacked, the policies get so expensive that nobody will be able to afford them anymore.

The White House: Chris Inglis on getting left of the attack

Chris Inglis currently serves as the first US National Cyber Director and advisor to the President of the United States Joe Biden on cybersecurity, and of course the US perspective was a highlight. His talk was a balance of rallying support and putting "cyberspace" into perspective with a conceptual framework. Cyberspace is a permanent reality that affects everyone's life. It affects whether electricity is available, whether public transportation works and whether your local store is able to operate. Consequently, the defense of cyberspace is not about the defense of technical "stuff", but about the critical functions that are served by the stuff. From this, he derives three factors that cyber defense is about:
  • the stuff: already since the 60s-70s, we know about notions and challenges of technical quality
  • the people: not only the ones who are adjacent to stuff, but who work for and benefit from the critical functions that affect everyone's lives
  • the doctrine, or "roles and responsibilities" that we assign in society. He mentions the SolarWinds supply-chain attack and how many actors were aware but thought someone else was responsible, and that adversaries are actively looking out for weak doctrine.

Defense is fragmented, because cyberspace is not perceived as a universal, ambient sphere of society but as disconnected patches, which means adversaries can deal with targets one at a time and are not countered with a coordinated response.

He advocates for resilience by design, arguing that there are too many variables for anything to be called "secure." Actually defend, not the "stuff", but the critical functions; ensure that anomalies can be detected as early as possible and countered with maximal leverage, away from current practices that assume an ineffective and meaningless division of labor. He points out that cyberspace is a shared resource, and this type of defense was done successfully many times before: in the automobile industry, in the aviation industry, for food and drugs. When asked about the role of regulation, he answered that first comes a general understanding of what constitutes common practice, tailored to the sector of interest, and only then might there be regulation as a last resort, preferably coordinated and not in 50 overlapping variants.

Wrapping up

There was a lot more to the conference, but I will end here. It is clear that society at large, including Switzerland, is definitely aware, and things will happen. I noticed people mentioning "supply chain" and also SBOM and the recent NIST standard, how to improve security posture in healthcare, which depends on industry as well as certifications (security patches can apparently break certification), and much more. I picked "polaroid" as a title because, just like the good old notions, practices and challenges around software quality that in Chris Inglis's words have accompanied the software industry since the 60s, polaroid (instant pictures) is something that comes, goes bankrupt, and comes back, since we cannot really do without it.

I share the optimism of some of the speakers like Serge Droz, who say that after many years of helplessness, due to raised awareness and more coordinated action, we as a society should be able to tackle the ransomware phenomenon. Beyond technical questions, establishing practices and security standards will require everyone, society at large, to take part in the effort.

29.3.22

The ultimate cross language guide to modules and packages

Intent \ Language   go        java      rust              npm       racket
source file         -         -         -                 -         module
logical grouping    package   package   module            module    collection
unit of release     module    module    package (crate)   package   package

racket is special in that a racket package (*) may happily mess with multiple collections.