# Real World Applications of Automaton Theory

This term, I was teaching a course on intro to theory of computation, and one of the topics was finite automatons — DFAs, NFAs, and the like. I began by writing down the definition of a DFA:

A deterministic finite automaton (DFA) is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ where: $Q$ is a finite set of states…

I could practically feel my students falling asleep in their seats. Inevitably, a student asked the one question you should never ask a theorist:

“So… how is this useful in real life?”

## DFAs as a model of computation

I’ve done some theoretical research on formal language theory and DFAs, so my immediate response was why DFAs are important to theorists.

Above: A DFA requires O(1) memory, regardless of the length of the input.

You might have heard of Turing machines, which abstracts the idea of a “computer”. In a similar vein, regular languages describe what is possible to do with a computer with very little memory. No matter how long the input is, a DFA only keeps track of what state it’s currently in, so it only requires a constant amount of memory.

By studying properties of regular languages, we gain a better understanding of what is and what isn’t possible with computers with very little memory.

This explains why theorists care about regular languages — but what are some real world applications?

## DFAs and regular expressions

Regular expressions are a useful tool that every programmer should know. If you wanted to check if a string is a valid email address, you might write something like:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})\$/

Behind the scenes, this regular expression gets converted into an NFA, which can be quickly evaluated to produce an answer.

You don’t need to understand the internals of this in order to use regular expressions, but it’s useful to know some theory so you understand its limitations. Some programmers may try to use regular expressions to parse HTML, but if you’ve seen the Pumping Lemma, you will understand why this is fundamentally impossible.

## DFAs in compilers

In every programming language, the first step in the compiler or interpreter is the lexer. The lexer reads in a file of your favorite programming language, and produces a sequence of tokens. For example, if you have this line in C++:

cout << "Hello World" << endl;


The lexer generates something like this:

IDENTIFIER  cout
LSHIFT      <<
STRING      "Hello World"
LSHIFT      <<
IDENTIFIER  endl
SEMICOLON   ;


The lexer uses a DFA to go through the source file, one character at a time, and emit tokens. If you ever design your own programming language, this will be one of the first things you will write.

Above: Lexer description for JSON numbers, like -3.05

## DFAs for artificial intelligence

Another application of finite automata is programming simple agents to respond to inputs and produce actions in some way. You can write a full program, but a DFA is often enough to do the job. DFAs are also easier to reason about and easier to implement.

The AI for Pac-Man uses a four-state automaton:

Typically this type of automaton is called a Finite State Machine (FSM) rather than a DFA. The difference is that in a FSM, we do an action depending on the state, whereas in a DFA, we care about accepting or rejecting a string — but they’re the same concept.

## DFAs in probability

What if we took a DFA, but instead of fixed transition rules, the transitions were probabilistic? This is called a Markov Chain!

Above: 3 state Markov chain to model the weather

Markov chains are frequently used in probability and statistics, and have lots of applications in finance and computer science. Google’s PageRank algorithm uses a giant Markov chain to determine the relative importance of web pages!

You can calculate things like the probability of being in a state after a certain number of time steps, or the expected number of steps to reach a certain state.

In summary, DFAs are powerful and flexible tools with myriad real-world applications. Research in formal language theory is valuable, as it helps us better understand DFAs and what they can do.

# My First Research Paper: State Complexity of Overlap Assembly

My first research paper is completed and has been uploaded to Arxiv! It’s titled “State Complexity of Overlap Assembly”, by Janusz Brzozowski, Lila Kari, Myself, and Marek Szykuła. It’s in the area of formal language theory, which is an area of theoretical computer science. I worked on it as a part-time URA research project for two terms during my undergrad at Waterloo.

The contents of the paper are fairly technical. In this blog post, I will explain the background and motivation for the problem, and give a statement of the main result that we proved.

## What is Overlap Assembly?

The subject of this paper is a formal language operation called overlap assembly, which models annealing of DNA strands. When you cool a solution containing a lot of short DNA strands, they tend to stick together in a predictable manner: they stick together to form longer strands, but only if the ends “match”.

We can view this as a formal operation on strings. If we have two strings a and b such that some suffix of a matches some prefix of b, then the overlap assembly is the string we get by combining them together:

Above: Example of overlap assembly of two strings

In some cases, there might be more than one way of combining the strings together, for example, a=CATA and b=ATAG — then both CATATAG and CATAG are possible overlaps. Therefore, overlap assembly actually produces a set of strings.

We can extend this idea to languages in the obvious way. The overlap assembly of two languages is the set of all the strings you get by overlapping any two strings in the respective languages. For example, if L1={ab, abab, ababab, …} and L2={ba, baba, bababa, …}, then the overlap language is {aba, ababa, …}.

It turns out that if we start with two regular languages, then the overlap assembly language will always be regular too. I won’t go into the details, but it suffices to construct an NFA that recognizes the overlap language, given two DFAs recognizing the two input languages.

Above: Example of construction of an overlap assembly NFA (Figure 2 of our paper)

## What is State Complexity?

Given that overlap assembly is closed under regular languages, a natural question to ask is: how “complex” is the regular language that gets produced? One measure of complexity of regular languages is state complexity: the number of states in the smallest DFA that recognizes the language.

State complexity was first studied in 1994 by Sheng Yu et al. Some operations do not increase state complexity very much: if two regular languages have state complexities m and n, then their union has state complexity at most mn. On the other hand, the reversal operation can blow up state complexity exponentially — it’s possible for a language to have state complexity n but its reversal to have state complexity $2^n$.

Here’s a table of the state complexities of a few regular language operations:

Over the years, state complexity has been studied for a wide range of other regular language operations. Overlap assembly is another such operation — the paper studies the state complexity of this operation.

## Main Result

In our paper, we proved that the state complexity of overlap assembly (for two languages with state complexities m and n) is at most:

$2(m-1) 3^{n-1} + 2^n$

Further, we constructed a family of DFAs that achieve this bound, so the bound is tight.

That’s it for my not-too-technical summary of this paper. I glossed over a lot of the details, so check out the paper for the full story!