I wasn’t planning any post this week, I don’t have that much free time to write posts. But I was forced to write down my own implementation of a disjoint set data structure for the purpose of one of the projects I’m working on, so I thought why not share this piece of code with you, dear reader!
I hope to find the time to write about my latest project soon, or to be precise about my latest two projects! One is a game, and the other is the cat-control-device, for the latter I received many of the key Chinese electronic component lately, once I solved a couple of minor issues with my Pi I would be able to share some details.
But let’s have a look at what’s a Disjoint-set, shall we?
Theory.
Everybody know about sets, and if you don’t then you need to refresh your memory very soon. Frequently one needs a proper representation of a set, the STL provide us with std::set with is exactly what one will expect std::set is, a collection of unique objects.
Sadly the library doesn’t provide any data structure for representing disjoint sets, so we’re not able to efficiently perform union on disjoint sets, well, we can do it with a little of trickology, but that’s not going to be very efficient.
What we need is a data structure able to provided the efficiently the following operation:
- make_set
- Create a new set with one single element (singleton)
- find_set
- Find the set containing a given element
- union_set
- Perform a union of two sets
Two sets A and B are disjoint if they have no elements in common, that is, if , the union_set operation as the name suggest, performs an union on two disjoint set.
For each set our data structure must choose and use on unique element as representative for the whole set, so if is the representative element for the set i-th set S, the all the members
are said to be part of the set represented by r. In this way once we identified the proper set for a, we can say if b is a member of the very same set, of if otherwise they are disjoint.
Implementation.
We’re going to represent the disjoint set data structure using a simple forest, for each make_set we will create a new tree with one single element, the root. The representative for this tree will be the only available element.
The union_set function will perform a tree merge operation on the two trees representing the sets. To quickly lookup for the proper tree containing one of the members, we will use an hash table, once identified the tree in order to return the representative we’ve just to walk up till the root of the tree.
To speed up the union operation, we will use two heuristics well described in the algorithmic literature:
- Union by rank
- We want to merge the smallest tree to the root of the largest, we define the rank as the depth of the tree, but since this value will not reflect the depth after the path compression heuristic, we call it rank.
- Tree path compression.
- We want to make sure that the height of the tree is never higher than one, so we have a flat kind of tree, where the root is the only parent for all the childs!
Those two heuristic will help to achieve a linear running time. Let’s have a look at the code:
template<typename T> struct node; template<typename T> using node_pointer = std::shared_ptr<node<T>>; template<typename T> struct node { node_pointer<T> parent; T user_data; std::size_t rank; node() = default; explicit node(const T& node_data) : parent{nullptr},user_data{node_data}, rank{0} {} };
The node represent a tree node, user_data are the information stored by the user, basically the members of the sets.
The interface and the declaration of the data structure is the following:
template<typename T> class disjoint_set { std::unordered_map<typename T::set_id_type,node_pointer<T>> node_mapping; node_pointer<T> find_set_impl(node_pointer<T> node); public: disjoint_set() = default; void make_set(const T& new_user_data); T find_set(const typename T::set_id_type& set_id); void union_set(const typename T::set_id_type& a_set, const typename T::set_id_type& b_set); };
The make_set operation is the simplest one, we just need to create a new node and hash that node in order to have a fast retrieval.
template<typename T> void disjoint_set<T>::make_set(const T& new_user_data) { auto new_set = std::make_shared<node<T>>(new_user_data); new_set->parent = new_set; node_mapping.insert(std::make_pair(new_user_data.set_id,new_set)); }
The find_set operation is divided in two function, the first will just lookup for the node in the hash table, the second will search for the root of the tree.
As I mentioned earlier the root of each tree is the representative of the set, note that when the function rolls back the recursion then it sets the parent of each node to the current root, this will flatten the tree.
template<typename T> node_pointer<T> disjoint_set<T>::find_set_impl(node_pointer<T> node) { if(node->parent != node) node->parent = find_set_impl(node->parent); return node->parent; } template<typename T> T& disjoint_set<T>::find_set(const typename T::set_id_type& set_id) { auto val_it = node_mapping.find(set_id); //Assuming that the set always exist, better crash //the whole application than lie by returning some dummy value, //the client clearly fucked up if val_it is not valid. return find_set_impl(val_it->second)->user_data; }
And at the very least, the union_set operation:
template<typename T> void disjoint_set<T>::union_set(const typename T::set_id_type& a_set, const typename T::set_id_type& b_set) { auto a_it = node_mapping.find(a_set), b_it = node_mapping.find(b_set); if(a_it!=node_mapping.end() && b_it!=node_mapping.end()) { auto a_node = find_set_impl(a_it->second), b_node = find_set_impl(b_it->second); if(b_node->parent==a_node->parent) return; //Nothing to join here. if(a_node->rank > b_node->rank) b_node->parent = a_node; else { a_node->parent = b_node; if(a_node->rank == b_node->rank) ++a_node->rank; } } }
Just note that if the the ranks of the two sets are the same, the code will increase the rank of the resulting set after the union is completed.
A detailed running time analysis is not that straightforward, anyway using both the heuristics the running time is where
is a very slow growing function, and where m is a sequence of make-set,union and find-set operation, n of which are make-set, for what we care, it is linear in m.
If you’re still here and you want to know who’s the guy in the picture, then have a look here 🙂
Thanks for reading!