My First Research Paper: State Complexity of Overlap Assembly

My first research paper is completed and has been uploaded to Arxiv! It’s titled “State Complexity of Overlap Assembly”, by Janusz Brzozowski, Lila Kari, Myself, and Marek Szykuła. It’s in the area of formal language theory, which is an area of theoretical computer science. I worked on it as a part-time URA research project for two terms during my undergrad at Waterloo.

The contents of the paper are fairly technical. In this blog post, I will explain the background and motivation for the problem, and give a statement of the main result that we proved.

What is Overlap Assembly?

The subject of this paper is a formal language operation called overlap assembly, which models annealing of DNA strands. When you cool a solution containing a lot of short DNA strands, they tend to stick together in a predictable manner: they stick together to form longer strands, but only if the ends “match”.

We can view this as a formal operation on strings. If we have two strings a and b such that some suffix of a matches some prefix of b, then the overlap assembly is the string we get by combining them together:

1.pngAbove: Example of overlap assembly of two strings

In some cases, there might be more than one way of combining the strings together, for example, a=CATA and b=ATAG — then both CATATAG and CATAG are possible overlaps. Therefore, overlap assembly actually produces a set of strings.

We can extend this idea to languages in the obvious way. The overlap assembly of two languages is the set of all the strings you get by overlapping any two strings in the respective languages. For example, if L1={ab, abab, ababab, …} and L2={ba, baba, bababa, …}, then the overlap language is {aba, ababa, …}.

It turns out that if we start with two regular languages, then the overlap assembly language will always be regular too. I won’t go into the details, but it suffices to construct an NFA that recognizes the overlap language, given two DFAs recognizing the two input languages.

2.pngAbove: Example of construction of an overlap assembly NFA (Figure 2 of our paper)

What is State Complexity?

Given that overlap assembly is closed under regular languages, a natural question to ask is: how “complex” is the regular language that gets produced? One measure of complexity of regular languages is state complexity: the number of states in the smallest DFA that recognizes the language.

State complexity was first studied in 1994 by Sheng Yu et al. Some operations do not increase state complexity very much: if two regular languages have state complexities m and n, then their union has state complexity at most mn. On the other hand, the reversal operation can blow up state complexity exponentially — it’s possible for a language to have state complexity n but its reversal to have state complexity 2^n.

Here’s a table of the state complexities of a few regular language operations:


Over the years, state complexity has been studied for a wide range of other regular language operations. Overlap assembly is another such operation — the paper studies the state complexity of this operation.

Main Result

In our paper, we proved that the state complexity of overlap assembly (for two languages with state complexities m and n) is at most:

2(m-1) 3^{n-1} + 2^n

Further, we constructed a family of DFAs that achieve this bound, so the bound is tight.

That’s it for my not-too-technical summary of this paper. I glossed over a lot of the details, so check out the paper for the full story!