Details About Disjoint Set
The disjoint set ADT is a data structure that stores information about a collection of sets. It provides two basic methods:
Find - which, given an item, finds the 'name' of the set to which it belongs. The name returned is the item
at the root of the tree (the sets are stored as trees). In my implementation, if the item doesn't already
exist, it is added as a new set containing just that item.
Union - which takes the names (root nodes) of two sets and joins them into one.
(The Python class I made has a few more methods used in my game, but they aren't part of the classic disjoint set ADT definition.)
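A minimal version of this ADT can be sketched in Python. This is illustrative only - the class and method names are my own, not the exact code from the game:

```python
class DisjointSet:
    """Disjoint-set forest stored in dictionaries (illustrative sketch)."""

    def __init__(self):
        self.parent = {}   # child -> parent; a root maps to itself
        self.rank = {}     # root -> estimated tree height

    def find(self, item):
        # Unknown items become new singleton sets, as described above.
        if item not in self.parent:
            self.parent[item] = item
            self.rank[item] = 0
            return item
        # Walk up the inverted tree until we reach the root.
        while self.parent[item] != item:
            item = self.parent[item]
        return item

    def union(self, root_a, root_b):
        # Attach the shorter tree beneath the root of the taller one.
        if root_a == root_b:
            return root_a
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        # Heights were equal, so the merged tree is one level taller.
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1
        return root_a
```

A typical call sequence looks like `ds.union(ds.find(a), ds.find(b))`, since union expects the root nodes returned by find.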
My implementation uses a hash table (a Python dictionary, really) to store a tree structure. The tree is
inverted, though, with the leaf nodes pointing towards the root: each node holds a pointer to the node above it in the
tree. The root node stores a number called the rank, which is an estimate of the height of the tree. This estimate
is used in the union method to keep the trees as short as possible.
To union two items, we just find the root of each one, then point the root of the smaller tree at the root of the larger tree.
The find method actually returns the root of the tree, and if two nodes have the same root, they are in the same set. An interesting idea
is to compress the paths of the tree. Suppose we have a tree of three nodes: 3 -> 2 -> 1.
A find searches up the tree, from 3 to 2 to 1, returning 1 as the name (root) of the tree. Because all I am interested in with the find
operation is the root of the tree, the internal structure of the path does not matter. If, while searching, I set node 3 to point directly at node 1,
the next search from node 3 will be quicker. Path compression is a neat feature, but I didn't use it.
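Path compression can be sketched as a small standalone function over the parent dictionary (illustrative code; as noted, my own implementation skips this):

```python
def find_compressed(parent, item):
    """Return the root of item's tree, repointing visited nodes at the root."""
    # First pass: walk up to the root.
    root = item
    while parent[root] != root:
        root = parent[root]
    # Second pass: redirect every node on the path straight at the root.
    while parent[item] != root:
        parent[item], item = root, parent[item]
    return root

# Using the three-node chain 3 -> 2 -> 1 from above:
parent = {1: 1, 2: 1, 3: 2}
find_compressed(parent, 3)   # returns 1; afterwards parent[3] == 1
```

After the call, node 3 points directly at node 1, so the next find from 3 takes a single step.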
As you may know if you've done some algorithm analysis, a search on a tree usually takes O(log n) operations. (On average, a search
requires log n operations, where n is the number of items in the tree.) Therefore, performing m searches on a tree requires
O(m log n) operations. If our disjoint set uses path compression (together with union by rank), it can be proven that m find operations require instead
O(m log* n) operations. log* is the iterated logarithm: the number of times log must be applied to n before the result drops to 1.
(An even tighter classical bound involves the inverse of Ackermann's function, which grows more slowly still.) This effectively makes log* a constant
for pretty much any conceivable value of n. Note: this doesn't mean I've implemented a tree search which runs in constant time;
the path compression results in a cost that is amortized across the m find operations. Any individual search may still take O(log n) operations,
but over m finds the average cost per search approaches a constant value.
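To see just how slowly log* grows, here is a quick sketch (not from the reference; log_star is an illustrative helper):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied before n <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# Even for astronomically large n, log* n stays tiny:
# log_star(16) is 3, and log_star(2**65536) is only 5.
```

Since no realistic input comes anywhere near 2**65536 items, treating log* n as a constant is reasonable in practice.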
In my implementation, I didn't bother with path compression. I started off with a simple search, and when I switched to path compression, I found
my algorithm ran no faster. After some empirical testing with the cave generator, it appeared that both methods stayed pretty much linear. I think this is
because the cave generator tends to add items to the set in a way that keeps the tree heights fairly low, so path compression isn't much of an
advantage. The issue could also be specific to Python - find with path compression carries more overhead, and perhaps the dataset I need isn't large enough
for the divergence in running times to become apparent.
Reference
Weiss, Mark Allen. Data Structures and Algorithm Analysis in C++. Addison-Wesley, 1999.