String searching algorithms comparison.

With nature OPHELIA sick had. heel him my MARCELLUS the A with my in comes not sweet if! A means may too; that quantity prepare did! have would
not thou But do; thirty fortune, lament And are A of and havior There and. QUEEN am What worse kind. at might at wears that as That jig sinners be
A lord was hath of GERTRUDE HORATIO From hast away.

I’m not getting crazy!

This is a quote from a document generated by the random text generator algorithm I’ve been using to compare the performance of the Boyer-Moore and Knuth-Morris-Pratt algorithm. As usual have a look at my github for the code.

In order to perform a performance comparison I needed a big sample text file, and by big I man some hundreds MiB of data, unfortunately I have no book of this length available so I wrote an algorithm which perform some basic analysis of a given short text –The Shakespeare Hamlet in my case– and on the base of the collected data generate a much longer document, good for string searching performance comparison.

For the purpose of my analysis I prepared two more text generators, one which generate a long Fibonacci word and the other which generate a Thue-Morse word, both good for performance comparison purpose.

Why?

When I started studying text algorithms I was sure to discover incredibly fast algorithms for text searching, or at least to understand how the STL library or any of the modern text editors implement they searching functionalities.

When my second goal was meet and I now somehow understand how vim search for a given pattern or what std::string::find do, I was surprised to see that the two most popular algorithms are not that fast after all, or at least not for the average –even more than average– user. (This is not true, for Computational Biology or Genetics Data analysis, where most advanced algorithms make an enormous difference in terms of performance.)

For this reason I decided to run those little tests and see how the Boyer-Moore and Knuth-Morris-Pratt algorithms stand with a very naive implementation, which does not use any high-tech shift table.

But, when this is true for me and most probably for you, it isn’t true anymore for whom needs to search very long patterns with a simple structure in very long text documents, like for example searching for a protein sequence in the DNA.

About the algorithms.

I will test the algorithm implementation from Jewel of Stringology, the code is in my github, all the code will be compiled with g++ with full optimization and executed on Linux.

For the implementation of the Boyer-Moore algorithm both the bad character shift and good suffix shift were implemented, those are the heuristic described at page 41 of the book and should made the most common implementation of the algorithm.

The code which generate the shift tables is somehow cryptic, and I must admit the algorithm itself is pretty hard to understand, you may want to look here from some details.

If you want to have a look at the BM algorithm which implements only the bad character heuristic then look for the Boyer-Moore-Horspool algorithm, like here. This is my implementation of the procedure to compute the shifts for the searching algorithm:

std::vector compute_boyer_moore_shifts(const std::string& pat)
{
    std::vector suffix_table = compute_suffixes(pat);
    size_t pat_len{pat.size()};
    std::vector shifts(pat.size());
    for(long i{0};i<pat_len;i++)
         shifts[i]=pat_len;
    long j{0};
    for(long i{(long)pat_len-1};i>=0;i--)
    {
        if(suffix_table[i]==i+1)
        {
            for(;j<pat_len-1-i;j++)
            {
                if(shifts[j]==pat_len)
                {
                    shifts[j]=pat_len-1-i;
                }
            }
        }
    }
    for(long i{0};i<pat_len-1;i++)
        shifts[pat_len-1-suffix_table[i]] = pat_len-1-i;
    return shifts;
}

Note the second part of the algorithms (last three lines) is provided by the book, the first part is missing and is leaved as exercise for the reader (And since I’m a good reader, I made that code by myself.. almost..)

For sake of comparison in the repo you may find the Knuth-Morris-Pratt implementation provided in CLRS, this latter version of the algorithm is a little different by the one in Jewels, and does not perform as good as the Jewels one.

The brute force searching algorithm is just a very naive implementation of a left to right scan with overlapped pattern matching, which takes O(m*n) asymptotically but have a linear execution time on average:

std::vector<long> brute_force1(const std::string& text,
        const std::string& pat)
{
    size_t m{pat.size()},
           n{text.size()};
    std::vector results;
    for(long i{0};i<=n-m;i++)
    {
        long j{0};
        while(j<m&&pat[j]==text[j+i])
            ++j;
        if(j==m){
            results.push_back(i);
        }
    }
    return results;
}

About the data.

I’ve prepared three text generators for the purpose of this performance comparison:

  1. Pseudo-real random text generator: The algorithm load an existing real text and perform a word frequency count, then generate a mach bigger random text maintaining the same word frequencies proportion, adding punctuation to the results in order to fake a real text.
  2. Fibonacci text generator: Generate the well known Fibonacci words, this is an interesting type of data since Fibonacci words contains a large amount of periodicities  and symmetries.
  3. Thue-Morse text generator: Those have the property of being overlap-free and square-free.

The data are generated on the fly by the testing procedure. For each type of data I used different search patterns with different length, suited for the specific source test.

The benching procedure is relatively simple, just generate the data and execute for all the available patterns the searching procedures under test:

template
void bench(long text_size,const vector& search_patterns)
{
    TEXT_GEN generator(text_size);
    cout<<"Using "<<generator.get_generator_name()<<", text size: "<<text_size<<endl;
    string text=generator.get_text();
    tuple<string,search_function> functions[3] = {
        make_tuple("Naive",bind(&brute_force1,_1,_2)),
        make_tuple("Knuth-Morris-Pratt",bind(&knuth_morris_pratt,_1,_2)),
        make_tuple("Boyer-Moore",bind(&boyer_moore,_1,_2))
    };
    for(int pat_idx{0};pat_idx<search_patterns.size();pat_idx++)
    {
        for(int i{0};i<3;i++)
        {
            cout<<"Running "<<get<0>(functions[i])<<", pat length: "<<
                search_patterns[pat_idx].size()<<", time: ";
            auto time_start=chrono::high_resolution_clock::now();
            long c = get<1>(functions[i])(text,search_patterns[pat_idx]).size();
            auto time_stop=chrono::high_resolution_clock::now();
            cout<<chrono::duration_cast(
                    time_stop-time_start).count()<<"ms ("<<c<<")n";
        }
        cout<<endl;
    }
}

The lengths of the input text are:

  1. Pseudo-real random text generator: About 50000000 words.
  2. Fibonacci text generator: 40 Iterations of the Fibonacci recurrence, the output word takes about 450 MiB of memory.
  3. Thue-Morse text generator: 50000000 characters.

Results: Pseudo-random text.

The following tables are showing the results for the performance comparison of the three algorithms when searching a the text generated by the pseudo-random text generator.

Searching performance random-textPNG

As you can see the Knuth-Morris-Pratt algorithm is always slower than Boyer-Moore and is even slower than the brute force implementation which does not use any shift table!

That’s somehow surprising, I always believed that KMP is a pretty good choice for everyday searches in everyday text, but I was wrong. The Boyer-Moore algorithm is performing faster than the naive implementation four out of eight times, and is faster just by a very small delta.

Fibonacci word, results:

If after the big-random-text test the brute force algorithm seems to have not been beaten by the other two cleaver and complicated algorithms, let’s have a look at the results with a sequence which has a lot of repetitive structure and symmetry:

Searching performance Fibonacci word

The naive code is faster only in two tests, the single and the two character search! That make sense, none of the advanced shift tables are of any help here since the patter is very short here, calculating prefix/suffix &c is useless. Even with two characters length there’s not real advantage in processing the pattern in order to find some structure in it.

For long patterns the data shows how much efficient KMP and BM are if compared to the naive implementation, now I see why those two algorithms are so much venerated by Jewels, the added complexity of their implementation is worth the game.

Interestingly for very short pattern length the naive code is not that far behind KMP and BM. Even more interesting is that the Boyer-Moore algorithm is the faster one only in one test, seems that between the two is a better choice to go with KMP, but let’s see the last set of results.

Thue-Morse word, results:

Let’s see the data:

Searching performance Thue-Morse.PNG

Again for very short patterns the naive implementation is very fast, surprisingly much faster than KMP and BM for six test out of eight, only for very long patterns the other algorithms are able to perform better.

For long patterns BM wins hands down, is much faster that the naive code and significantly better than KMP, most probably repeating those tests with even longer patterns and even bigger text may reinforce this conclusion.

Long pattern and big text.

The last test I executed is with big patterns and even bigger text, this time I generated a text five hundreds millions word long for random text test, a Fibonacci word one billion character long and a Thue-Morse word of the same size.

Let’s see the results:

Searching performance Big pattern Random Text
Random text test, KMP is astonishingly slower than the naive algorithm!

Incredibly, for this first test KMP is the slower one! BM is way faster than the naive algorithm and almost 24 times faster than KMP. For this test I wasn’t able to build a very long pattern which may possibly find a match in the random text, well, this is somehow expected since the text is completely random.

Searching performance Big pattern Fibonacci
KMP a little faster or BM, naive code again way behind.

For the Fibonacci word experiment clearly the naive implementation is the slower, BM and KMP have very similar performance but BM is not that far.

Searching performance Big pattern Thue-Morse
Again the naive algorithm is far slower

Not very much to comment here, the naive implementation is clearly to be avoided when the pattern is very long.

Conclusion.

Use the standard library string searching algorithms! Very few people need to know how the library will implement those algorithms, and even fewer need to know what’s the difference between KMP, BM or other more advanced algorithms.

I do this because I like it, the day I will stop learning would be the first day of my last days on this planet.

Said that, if you’re still reading this it means that I must provide you with some conclusion which make sense. Well, the naive algorithm seems to be the perfect choice for short patterns and not very long text, otherwise it depends.. depends on your project and you’re needs.

For what I was able to see –and read– the Boyer-Moore algorithm is the choice for string searching in real human readable text, so go with it if you have…human readable documents… to analyze.

Thanks for reading!

 

 

1 Comment

  1. […] for sub string matching like Boyer-Moore or Knuth-Morris-Pratt are linear in the size of the input text plus the cost of building the shift tables for the given […]

Leave a Reply