This is an implementation of a persistent binary search tree for the GLib open-source data structures library.
You may download the source code in an independent library of its own.
My implementation uses a Treap as the underlying binary search tree, and then uses the node-copying method as desribed by Driscol, et al. to make it persistent. My motivation for developing this, and for my choices in how/why to implement it are outlined in my initial email to the gtk-devel mailing list on the topic:
A persistent binary search tree is simply a BST that is versioned over changes, such that you can do a lookup in any previous version of the tree. An efficient implementation of this data structure requires O(1) additional space per insertion into/deletion from the tree. That is, after n insertions, creating n different versions of the tree, the total space required would be roughly 2*(space for a single version), while allowing you to search in any version of the tree. Searching in any version takes the same number of comparisons and pointer redirections as a tree with only a single version.
Persistent BSTs are used for operations such as a Plane Sweep in Computational Geometry. At the same time, the code behind the persistence can be generalized to provide an additional data structure that performs Fractional Cascading for no extra cost to the BST implementation.
There are currently no readily available open source implementations of a persistent BST. This would be a great tool for students, academic researchers, and other developers.
I have noticed that GLib has a Binary Search Tree data structure, which currently uses an AVL tree. AVL trees can be made persistent, but the space requirement becomes O(log n) per insert/delete instead of O(1).
There are 2 other BSTs that would work effectively to provide a persistent BST, while also maintaining a very efficient implementation for when persistence is not used. They are:
- Red-Black Trees
Treaps can be expected to outperform Red-Black Trees for any sized tree. The constant in the search cost for a Treap is Approximately equal to AVL trees (and slightly less for large trees), and is always smaller than Red-Black Trees. (It's a small constant regardless.) Treaps are also much simpler to implement, requiring less complicated code. However Treaps are a randomized data structure, which some people try to stay away from.
The first implementation decision was to replace the AVL tree behind the GTree data structure with a Treap. An AVL tree is not suitable for a persistant structure since deleting a node may require up to log n rotations, on a path up to the root, in order to rebalance the tree. A Treap only requires a constant average number of rotations both on insert and delete.
While implementing a Treap, I added the
g_tree_search_related() functions to the GTree interface.
These methods perform a search for a
query key q and can return one of three things:
The original implementation used a
pointer in each node of the tree, but no parent pointer. These left and right
child pointers were also used, in the leaves, to point to the previous and
next node in sorted order, so that tree traversal could be done
incrementally without knowing
the parent. To make the tree persistent, each node must be aware of who is
maintaining pointers to it, that is, it's incoming pointers. Under this model,
a node would need to look at its previous/next node at all times, to check if
it is a leaf and has a pointer to it. Finding the previous/next node can take
O(log n) time, so doing this multiple times on an insert is not
desireable. Instead, a node would need to keep a pointer to its next and
previous nodes. This adds two new pointers to the data structure. Instead,
I opted to add a parent pointer, and not use the left and right child pointers
in this way at the leaves. This removed the need for one more pointer, and
also removed the booleans,
right_child, that stated if each child was present or not.
When multiple copies of a node can exist due to persistence, they need to all
share the same
value attributes. Instead of putting these
two pointers in every copy of a node, I opted to move them into an auxilerary
GTreeNodeData structure, which is shared between all nodes through
data pointer. Since there are multiple nodes pointing to
the same data, we can't free the data and call the key and value destroy
functions the first time a node is freed. I use a reference count,
GTreeNodeData that tracks how many nodes
are using that key/value pair. Lastly, the
stolen attribute was
added to the
GTreeNodeData structure to keep track of which nodes
are removed from the tree or stolen from it. Stolen nodes
do not have the destroy functions called for their key/value pair once the
key/value pair is freed. Since stealing a key/value pair from a persistent
GTree may leave other older nodes which still need the key/value pair, the
stolen flag needs to be remembered until all such nodes are removed from the
tree. The reference count and stolen flag add another two 32-bit values when
compiled under my test environment with gcc and "-O2".
In order to make the tree persistent, a node keeps a version for each set
of pointers it holds, and has a table of
sets of these version+pointers. The
right pointers were moved out of the
structure, and into
GTreeNodeVersion, with a version added along
with them. An array of this structure was then added back into the
GTreeNode, allowing the node to have an array of pointers to
its neighbours, with a version attached to each one. This version is
equivalent to the version stamp described in .
I chose to make
this a dynamically allocated array (requiring the pointer
GTreeNode) so to reduce the memory
footprint of a tree without any persistence. If a tree remains in its
first version, all nodes are created with an array of only one
GTreeNodeVersion. Once the tree leaves the first version,
any future arrays are created to hold instead
This increases the space nodes in a persistent tree by a pointer and a counter,
but reduces the footprint in a tree without persistence by three structures
of three pointers and a 32-bit integer each. You can see this difference in
space in Table 2 when persistence is used.
The size of a persistent
GTreeNode is roughly twice the size of
an upstream node, when persistence is not being used. We removed
from upstream, which saves 4+2*2 bytes in my test environment. However we
ref_count and the
stolen flag. When multiple versions are used, some of these
things are shared between copies of a node, but without persistence they all
belong solely to a single node. In my 64-bit test environment, these all
together add 48 bytes to a node structure, giving each node a total size of
upstream-removed+added = 44-12+48 = 80 bytes.
When persistence is being used, each node's table of
GTreeNodeVersion structures will contain four sets of
instead of one. This grows each node by 3*(3*8+4)=84 bytes, to a total size of
During development, I considered using a single GTreeNode for all persistent
versions of a node, removing the need for the
pointer and redirection to find a node's key or value. This node would then
have an ever-increasing table of sets of version+pointers. However, it
is required that at each node in a path, the next parent/child pointer can be
found in O(1) time, and so an index into the target node's table
would need to be stored along with each
This change would be equivalent to increasing the size of every
GTreeNode pointer inside the tree, increasing the total space
used in the end, while also increasing the complexity of the code. Because
of this I chose to instead allocate a new GTreeNode each time a node needs to
add a pointer to its table,
v, but does not have room.
Once a tree is persistant, it may have multiple root nodes. Each time the root
node changes, a new pointer into the tree needs to be remembered. This is done
with an array of
GTreeRootVersion structures, called
r, in the
tracks how many
GTreeRootVersion structures are in the array.
contains a version, which is the version at which the node it points to
became root. Thus all versions afterwards will also have the same node as a
root, until another
GTreeRootVersion is added to the array
r. The number of possible root nodes is unbounded, but on
average will be no more than O(log n), as a new node only becomes the
root if it has the lowest priority in the entire tree. Searching for the
correct node based on a version is done with a binary search, and so is
expected to take O(log log n) time, and does not significantly impact
the running time of a search in older versions of the tree.
GTreeNode holds an array of
GTreeNodeVersion structures, each of which has a version. Since
all insert and delete operations take place in the latest version of a GTree,
it can be expected that nodes must find their newest set of pointers more often
than their older versions, and so I wanted to optimize this operation. The
v is stored with the newest (largest) version first, in
index 0. All other versions are then stored in increasing order in indices
1 to n-1. Thus if we stored versions 1 to 5, they would be stored in order
"51234". This saves one pointer dereference and one pointer addition at
each node when traversing the newest tree, as the node's
attribute does not need to be read and the pointer
directly to the first element of the array.
GTree, which is an array of
GTreeRootVersion structures is treated in the same way, with
the newest (largest) version first, and all others sorted in increasing order
in the successive array cells.
I performed tests of my implementation of a GTree against the one currently in the upstream GLib library to determine the performance impact of having persistence available. To make a fair evaluation, I compare the amount of time and the number of rotations required to insert, query, and delete nodes in a GTree which does not use any persistence. This way the comparison is only for functionality which existed upstream beforehand, and is a measure of how applying these patches to the GLib library would impact current users of the GTree structure.
However I was also interested in how much of a cost would be involved to use persistence in the GTree. For this test I compared the amount of time to insert, query, and delete nodes in a GTree which does use persistence versus one which does not, but using my persistant implementation in both cases. In the case which uses persistence, I tested two different scenarios. The first test spreads the inserted keys across approximately four different versions, and the second one creates a new version for every key inserted. This is a measure of the cost involved to use persistence for users interested in using the feature.
The machine which these tests were performed on has a 64-bit Intel processor, with 4KB of cache.
I have two source trees, one for my persistent GTree implementation, and one
which is simple the current upstream git head
commit 6b7b7a76020e76370e416d794eceb99937b9ed33), which I will
refer to as
For testing purposes, I applied a patch to both source trees in order to allow me to count the number of rotations being performed. The compiler I used is the 64-bit version of gcc (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4).
$ patch -p1 < count_rotations.diff
$ ./configure --prefix=$HOME/local-glib-upstream/ CFLAGS="-O2"
$ make install
The last step was needed to get my test code to compile, as
$ patch -p1 < count_rotations.diff
$ ./configure --prefix=$HOME/local-glib-persistent/ CFLAGS="-O2"
$ make install
$ cp glibconfig.h $HOME/local-glib-persistent/include/glib-2.0/
make installdid not copy the header file there, but
This creates two installations of GLib in my home directory, one which uses
the persistent GTree, and one which uses the upstream non-persistent GTree,
both compiled without extra debug info, and with optimization level
I used these test programs, test-single.c,
Makefile, along with this script
test-all, to compare the two implementations.
Note that dynamic linking is required for this test to correctly use both
When comparing the
In the first table we compare the upstream GTree implementation to the
persistent GTree, when only using funtionality in the upstream version, or
without making use of persistence. The full results show clearly what is
summarized in this table. Inserting an element in the persistent tree takes
1.4 times longer than in upstream, which is a fairly
small constant amount, independent of the number of elements in the tree. The
different number of rotations is because we use a Treap versus the AVL tree
in upstream. The AVL tree does slightly more rotations, creating a slightly
better balanced tree in these tests. However, when multiple versions are used
later, the balance of the persistent GTree improves significantly, leading me
to believe that the hashing priority function in the Treap implementation has
room for improvement in generating more random-like priorities. The query time
in the persistent GTree is nearly identical on average to the upstream tree,
and this difference is independent of the size of the tree,
though we can see the Treaps unpredictability in the
full results, as adding more elements to the tree can improve the query time.
Deletions require more rotations in the treap, but generally take less time
than in the upstream AVL tree, possibly because they don't need to modify nodes
all the way to the root. The only real impact from using the persistent GTree
comes in the memory footprint. By using the persistent tree with only 1
version, the memory foot print approximately doubles, as the structure has
additional pointers and variables which are not all needed until additional
versions are added to the tree.
In the second table we see the cost of using versions in a persistent GTree. From the table it is clear that the costs for using persistence are quite reasonable. The time to insert an element increases, but only by a constant amount, completely independent of the size of the tree, or the number of versions in the tree. This cost comes from the fact that finding the right child pointer in each node along a path requires a search for the correct version, and that more memory allocations may be done when inserting and rotating the new node into place. The number of rotations per insert increases, about on par with the upstream GTree, and query time drops significantly. This appears to be due to priority function in the Treap giving better pseudo-random values since we are doing more memory allocations, pushing memory addresses further apart. The time to delete and the number of rotations needed to delete also drop as the tree is more balanced, and there is only a constant amount of increased overhead due to having multiple versions.
The last two columns in this table, the time to destroy the tree, and the size of the tree, are both averaged over the number of elements inserted into the tree plus the number of versions in the tree divided by four. The number of versions is divided by four because a new piece of memory is allocated for a node after it is modified over four different versions. It is clear that the size of the GTree, and thus the time to destroy it, should be proportional to the number of elements and the number of versions, and the results show the constant factor relating to them. When using one version, or n versions, regardless of the number of elements inserted or number of versions, the size value shown in this table remains constant, and is very close to the size of a single node, implying that on average one node is added to the GTree structure per insert, even when adding a new version.
In summary, adding persistence to the GTree data structure did not impact the performance of the structure in any significant way, while adding greater functionality to the structure. Insertion in particular became slower, but it did so only by an epsilon constant factor, while other operations are not affected negatively at all (and some seemed to improve). The memory footprint required to add this additional functionality does increase the size of the structure by a factor of approximately two, as shown in the description of the implementation details. However, this additional memory factor would become less significant when the structure is holding pointers to real data, and the GTree itself is a smaller percentage of the structure's overall footprint. To put this in perspective, if a program uses a tree to hold pointers to 8-byte blocks of data (one for each key in the tree), the total memory footprint would increase by a factor of 1.8 to use the persistent GTree implementation. If the tree holds 32-byte blocks of data (e.g. four 64-bit longs), the total footprint would increase by a factor of 1.6. When the GTree is storing anything more than the tree itself, the memory increase becomes much less significant.
The method of persistence described by  does work well in practice, producing a data structure that remains fast, and without requiring excessive memory overhead, especially as the number of versions stored increases.
Having a single tree structure in a library that provides both non-persistent and persistent functionality allows for a single interface to use a binary search trees, allowing for easy migration for application developers to using persistence, and requires only one tree implementation to be maintained within the library. For a pointer-based data structure, the persistent GTree is near-optimal in terms of average speed and space (within reasonable constant factors), and so it adds considerable value to make the tree implementation persistent. Persistence does, however, require that the tree remain a pointer-based structure in order to work effectively. Whereas a non-persistent tree implementation may be moved to an optimal cache-oblivious data structure such as this cache-oblivious B-tree. A cache-oblivious data structure has the potential to out-perform a pointer-based structure in practice because of their efficiency using the memory heirarchy at each level (i.e. each level of CPU cache, and main memory if the data structure exceeds the physical memory size and must partially reside on disk). For this reason, it may be desirable to keep a persistent search tree as a separate implementation in a data structures library, so that the performance of the non-persistent tree may be further improved without breaking its interface in teh future.
I have a git repository that is simply a fork off the current upstream glib with my persistent GTree implementation included. You can access the code in my git repository here: http://git.icculus.org/?p=dana/cg-glib.git.
I have submitted patches to the GLib bugzilla here: https://bugzilla.gnome.org/show_bug.cgi?id=602255.
You may also download the implementation in a stand-alone library until it is available through GLib. The LibADDS library (Open Source Advanced Data Structures Library) consists only of this persistent binary search tree at the time of this writing.
The full results can be seen here. The file
test-output.txt contains timing
results, while the various
mem.type.seed.num files contain
heap profiling output for each type of GTree, with
num elements inserted into it. The
show-peaks script can be used to show the
peak memory usage of all the massif output files.
These entire results can be seen more easily in
the table below.