A discussion on matching string prefixes

Strings are an exceptionally flexible data type. They can be used to represent virtually any kind of data, or even behaviour – think of clear text script files.

With all this flexibility also come certain drawbacks however. Doing a lot of computations on or using strings can be very expensive – especially in languages like C#, where strings are immutable.

This is why especially in game development one has to find a balance between the use of strings, and more statically typed code. However, even if we optimise entirely for performance, there will always remain use cases for strings.

Today I want to look at one specific use case of strings: matching string prefixes.

What I mean by that is: To extract all strings from a large collection that start with a given sub-string, and to find the longest common prefix (starting sub-string) of all these.

Such features are commonly used in command line interfaces, combo boxes, and IDEs to allow the user to auto-complete what they are typing automatically or on a given keystroke (often TAB on the command line).

The program might automatically fill in the full string, if only a single one is found, or it might present the user with a list of possibilities.

While modern IDEs use a variety of fuzzy matching techniques, we will look at the simple prefix matching example to keep things simple.

Data structure requirements

The requirements for the data structure we are going to talk about are as follows.

We want it to be able to quickly give us:

  • a list of all strings starting with a given sub-string.
  • the longest shared prefix of all strings in that list.

For example, if our list contains the strings one, two, three, threshing, asking for all strings starting with the sub-string t will return two, three, threshing.
Similarly, t is the longest common prefix in that list, so trying to extend it will return the same. On the other hand if we try to extend the prefix th, we want to get thre, since all strings starting with th also start with thre.

Analysing and optimising the naive implementation

The simplest implementation of this data structure is a plain list of strings.

While certainly efficient space-wise, it does not perform well when we take a look at the runtime.

All strings with prefix

For trying to return all strings with a given prefix, we have to enumerate the entire list, even if only a few strings are returned, comparing the prefix to the beginning of each string.

This may look like a linear time algorithm, but in fact it is not.
Each string comparison has to compare multiple characters, up to the length of the prefix, so the run time is in fact O(n * m), where n is the length of the list and m the size of the prefix.

Note: For any smallish data set (even up to tens or hundreds of thousands), this simple approach may be more than fast enough. In this post we consider not only practicality, but also the theoretical aspects and very large data sets. Even if we end up using this simple solution – and I often do – it is good exercise to think about possible improvements and to stay aware of the limitations of our algorithms.

We can optimise this however: By sorting the list in advance, we can now use a binary search to find the first entry starting with our prefix, and then enumerate from there.

This results in a runtime of O((k + log n) * m) where k is the number of returned strings.

However, we can improve this even further by using a second binary search to find the last string that matches our prefix. With that change we no longer have to compare prefixes for k strings, but can simple enumerate the strings between the two indices blindly.

This reduces the runtime to O(k + (log n) * m), which is optimal with this data structure.

That is actually very good!

Unless we are dealing with millions of strings, this will be more than fast enough for almost any purpose. log n tends to be very small so our runtime is virtually O(k + m), which would be optimal for any kind of data structure solving the given problem: There is no way we can run faster while having to consider and return input and output of lengths m and k respectively.

However, while thinking of log n as being constant may be reasonable approach to many practical problems, we would like to do better.

So far we have only looked at our first query: Returning the list of all strings with a given prefix.

How fast can we determine the longest common prefix in that list?

Extending prefix

The completely naive solution performs rather badly, but we can skip straight ahead to a more optimal solution:

Since we already calculate the first and last index of our list of matching strings (in O((log n) * m) time), we can build our longest extended prefix in a rather efficient way as follows:

We first assume our longest common prefix to be the entire first matching string. We then iterate until the last matched string, and for each step, we cut off as many characters as we need for the current longest prefix to still be a prefix of the iterated string.

That algorithm can be implemented in O(k * i) time, where i is the length of the first matching string: We naturally have to iterate over k strings, and in the worst case we may have to compare O(i) characters for each – which is the case if the common prefix is very long.

Together this makes for a runtime of O((log n) * m + k * i) which is not ideal – though it is optimal for using this data structure.

One last optimisation we could perform is finding the shortest string in our matching list, and use it as a first longest prefix candidate. This will reduce the size of i to the length of that string. However, it does not affect the complexity in other ways – the additional O(k) search does not affect the runtime in big-O itself.

Edit: As Matty pointed out in a comment this can actually be optimized to run in O((log n) * m + j), where j is the length of the longest extending prefix. To do so, we simple get the first and last string matching out prefix (binary search). Since our list of strings is ordered, the shared prefix between these two strings is also the longest shared prefix in the matching list.

Again, for small data sets (and short strings) this is just fine, and I usually would not even bother to make it this complicated.

However, we can do a lot better in principle, and for large data sets, this can make a big difference.

Sneak peek

Next week I will discuss and implement a more efficient method of matching strings: Prefix tries.
Using that data structure, we will be able to reduce the runtime of our two queries as follows:

naive implementation prefix trie
all matches O((log n) * m + k) O(k + m)
extending prefix O((log n) * m + j) O(j) / O(m)

j here is the length of the extended prefix, i.e. the length of the returned string. The solution returning in O(m) is optional, since it requires additional pre-calculations, albeit minor ones.

Conclusion

Today we took a somewhat more theoretical approach than we usually do. Instead of writing code we analysed and optimised simple solutions to the presented problem of matching string prefixes.

Let me know what you think of these kinds of posts, and of course if you have any questions, suggestions or ideas.

Enjoy the pixels!

Leave a Reply

2 comments

  1. Matty says:

    Hmmmz, is it me, or can you do the extending prefix faster – in O((log n) * m + j), where j is the length of the extended prefix string (so shorter than i as length of the first string).

    We look at the first and last string with our given prefix (all strings are sorted).
    We now just loop forwards at the first and last string one character at the time. If these characters match, j = k+1, and we continue. If not, we found our extended prefix string.