Segment tree

What is the best thing you may possible do when having a eight hours train trip? Writing a blog post of course! That’s actually an interesting story by itself: Where I was and why it take that long to come back home!

Anyway! Let’s say you’ve an array of integers, as usual from index 0 to n-1, and let’s say that for any interval 0<=i<=j<=n-1 you’re willing to know which is the minimum or maximum value in that particular interval, and since you care about performance you want this query to be executed as fast as possible.

Similarly, let’s say that for the very same array of integers, you’re willing to know if a given number N from the set is in a given range, so to say, the membership problem for a subset of the original array.

Many other similar problem can be solved using segments tree, for now on let’s analyze how to solve the min/max problem, and let’s start with a naive solution.

Naïve implementation.

We want for any interval 0<=i<=j<=n-1 to know which is the minimum value, if there are multiple minimum values we need to return the index of just one of them, this problem is commonly known as range minimum query or RMQ.

Our first naive solution will take O(1) time for querying, O(n^2) space and O(n^3) for the computation of the table, what we want to do is just compute the minimum value for all the possible ranges i,j, and store that values in a table data_table[n][n]. Let’s have a look at the code:

This very naive implementation may be a little improved by noticing that once we’ve calculated the minimum value for the range i,j, then the minimum value for the next range i,j+1 is just min(input_data[data_table[i][j-1]],input_data[j]), clearly the old min value may hold also for j+1 or eventually the new element may be the minimum.

Here’s the O(n^2) code:

Those solutions works just fine, and we like the O(1) query time, but the construction of the table takes too long and we cannot accept the O(n^2) memory consumption as well, we will now develop a linear space solution with O(lg(n)) query complexity using segment trees, let’s start with some theory!

Some theory.

Let’s say we want to build a tree over a sequence of integers, the construction should proceed in a bottom-up fashion, and every node of our tree shall contain the minimum value in all the nodes in the left and right sub-tree, or in other words, each root for each sub-tree contains the min value between its childs.

As you should probably know, a binary heap is just a tree which can be nicely implemented using arrays, for the purpose of our implementation we need to master how those trees works in order to develop our segment tree, since from binary heaps we will borrow the tree representation and the way of indexing child’s of each node, for the purpose of our implementation no other properties from the binary heaps is going to hold.

Using a binary heap style of tree representation allows to reduce the space consumption to linear to the size of the input array, whereas the amortized query time will be O(lg(n)) as one may expect from a binary tree.

To proceed let’s build a binary tree for sixteen elements (leafs) and have a look at how is represented in a binary heap array:

IMG_5568.JPG
This is how looks like a binary heap, only the leafs are explicitly written

In our case we must consider the situation where the number of leafs is not a power of two, we’re not dealing with real heaps but just borrowing the array representation of them, let’s see how the procedure for creating a binary tree for ten elements looks like when applying the same procedure used to build the tree for the heaps:

partial heap.jpg
The result of applying the procedure for building a binary heap on a set of ten elements.

As you can see, the binary tree is not complete as it should be for a binary heap, but we do not care. The procedure stop the recursion when the range of indexes covered by is zero, or differently when i==j.

The leafs with indexes 16,17,9,20,21,11,24,25,13,14,15 are going to contain the original array, the internal nodes are going to contain the min value between the left and right sub-tree.

Let’s have a look at the sequence 2,5,9,44,1,4,5,22,6,5,11:

Segment tree.jpg
The indexes marked in red are the indexes from the original array, each of them points to the current minimum value.

As you can see there are vacant leafs, but this is just a minor issue, most of the space will be occupied by correct values, the memory consumption is four times n plus one, or O(n). The construction time is proportional to the time required to build a binary tree, this is actually linear to the size of the input array, the proof is actually trivial, just note that we’re a kind of visiting a tree of n nodes, accessing each node exactly once, which give us O(n).

How is possible to perform a query on this data structure? For a given interval i,j we need to traverse the tree looking into sub-trees which contain the currently followed pair of indexes, the process is the same as for the tree construction, but this time instead of putting elements we’re gathering elements from the data structure, let’s see how this looks like for a query of the min element between tree and seven:

segment tree query.jpg
Querying the data structure with 3,7.

When traversing the tree we will ignore all the sub-trees which are covering intervals out of the requested range, each time one of the sub-trees will be contained inside the interval we will stop further recursive processing and return the actual index, which is the minimum for the actual interval.

The intervals out of the researched interval are marked with a big X on the picture, in our code the algorithm will return -1 instead. This way of handling the queries will require no more than O(lg(n)) operations for each query, pretty good.

The code.

Let’s now have a look at the implementation of our data structure:

There’s no need to perform the copy of the input data array, but for some clarity in the code I decided to made the copy.

The code performs exactly the operation described in the theoretical part of the post, there’s nothing else going on. Just to be noted that the value -1 returned in query_impl is to cover the case when the current sub range is out of the original query range, this is exactly the case of the big X in the image available in the theoretical part.

Thanks for reading!

PS:

I wasn’t able to complete this post during my train journey as I was planning at the beginning, doing such kind of writing on Polish trains is just not possible 🙂

Leave a Reply