My First Research Paper: State Complexity of Overlap Assembly

My first research paper is completed and has been uploaded to Arxiv! It’s titled “State Complexity of Overlap Assembly”, by Janusz Brzozowski, Lila Kari, Myself, and Marek Szykuła. It’s in the area of formal language theory, which is an area of theoretical computer science. I worked on it as a part-time URA research project for two terms during my undergrad at Waterloo.

The contents of the paper are fairly technical. In this blog post, I will explain the background and motivation for the problem, and give a statement of the main result that we proved.

What is Overlap Assembly?

The subject of this paper is a formal language operation called overlap assembly, which models annealing of DNA strands. When you cool a solution containing a lot of short DNA strands, they tend to stick together in a predictable manner: they stick together to form longer strands, but only if the ends “match”.

We can view this as a formal operation on strings. If we have two strings a and b such that some suffix of a matches some prefix of b, then the overlap assembly is the string we get by combining them together:

1.pngAbove: Example of overlap assembly of two strings

In some cases, there might be more than one way of combining the strings together, for example, a=CATA and b=ATAG — then both CATATAG and CATAG are possible overlaps. Therefore, overlap assembly actually produces a set of strings.

We can extend this idea to languages in the obvious way. The overlap assembly of two languages is the set of all the strings you get by overlapping any two strings in the respective languages. For example, if L1={ab, abab, ababab, …} and L2={ba, baba, bababa, …}, then the overlap language is {aba, ababa, …}.

It turns out that if we start with two regular languages, then the overlap assembly language will always be regular too. I won’t go into the details, but it suffices to construct an NFA that recognizes the overlap language, given two DFAs recognizing the two input languages.

2.pngAbove: Example of construction of an overlap assembly NFA (Figure 2 of our paper)

What is State Complexity?

Given that overlap assembly is closed under regular languages, a natural question to ask is: how “complex” is the regular language that gets produced? One measure of complexity of regular languages is state complexity: the number of states in the smallest DFA that recognizes the language.

State complexity was first studied in 1994 by Sheng Yu et al. Some operations do not increase state complexity very much: if two regular languages have state complexities m and n, then their union has state complexity at most mn. On the other hand, the reversal operation can blow up state complexity exponentially — it’s possible for a language to have state complexity n but its reversal to have state complexity 2^n.

Here’s a table of the state complexities of a few regular language operations:

3.png

Over the years, state complexity has been studied for a wide range of other regular language operations. Overlap assembly is another such operation — the paper studies the state complexity of this operation.

Main Result

In our paper, we proved that the state complexity of overlap assembly (for two languages with state complexities m and n) is at most:

2(m-1) 3^{n-1} + 2^n

Further, we constructed a family of DFAs that achieve this bound, so the bound is tight.

That’s it for my not-too-technical summary of this paper. I glossed over a lot of the details, so check out the paper for the full story!

Robots in a line

I was given this interesting puzzle or thought experiment by a friend.

Consider a line of n identical robots with constant memory, all running the same program (except for the first and last robots). Each robot can communicate (send messages) to the robots adjacent to it, taking one second to do so.

To start, the first robot is active, and all the other robots idle. By a series of messages, how can you make it so that all the robots say “yay” at the same time?

First attempt: Counting downwards

My first intuition was making the robots count downwards. Pick any large integer k, for example 1000000.

What the first robot does is send the number k-1 to the next robot, wait for k seconds, and say “yay“. The next robot does the same, sending k-2 to the third robot and starting its own timer. This continues all the way down.

So if there are 20 robots, 19 seconds after the experiment begins the last robot receives a message, and begins to count down from 999981. At this time, the counter for every other robot is the same.

Therefore, 999981 seconds later, all robots should say “yay” at the same time, and we are done.

The problem with this approach is we don’t know what n is; we don’t know how many robots there are. If we choose a sufficiently large n so that n > k, the approach will obviously fail.

While it works for certain values of n, we need something that would work no matter how large n is. It is obvious that this first approach fails.

Second attempt: Counting upwards

Obviously we can’t count downwards if we don’t know how many robots there are! We can invent a scheme that allows us to count the number of robots.

From now I will drop the explicit message sending and describe the message sending procedure explicitly.

Let the first robot start at 1. Then the next robot would be 2, and so on until the last robot is n.

Now the last robot knows how many robots there are in the line. Now we can use what we had in the first attempt, by picking a number k that is greater than n. So once the number of robots is known, we use the timer and counting down method to make all the robots say “yay” at the same time.

However, there is still a fundamental flaw with this solution. Suppose that the robots have m bits of memory. Then the counter can only go up to 2^m, otherwise it will overflow.

If the number of robots is greater than 2^m, this solution will not work.

Solution: Breaking apart the problem

The trick is to break up the problem into smaller pieces. To do so, we have to have a way of reliably breaking apart the problem.

There is a simple way of finding the middle of the line of robots. We can send two signals. The green one goes at the regular speed, that is, one robot per second. It reaches the last robot in n seconds, then bounces back and goes the opposite direction.

The other signal, the yellow signal, goes at a third of the regular speed: that is, one robot every three seconds.

By the time the yellow signal has reached the middle, the green signal would have gone to the opposite end and back. Therefore, any robot receiving both the green and yellow signals knows that it is the middle robot.

Once the middle has been found, we can split the problem into two equal subproblems. Ignoring the multitude of possible off-by-one errors, two middle robots may act like the new end and start robot respectively. The two subproblems now execute at the same time.

The procedure is repeated several times. Each time the number of robots in a chain is halfed. The subdivision is continued until the chain is divided into sections of length 1. At this moment every robot says “yay” at the same time, so we are done.

This problem is often referred to as the Firing squad synchronization problem.