Binary search is a wonderful algorithm: theoretically optimal, fast in practice, and discoverable by school children playing guessing games. Usually we think of binary search as being applied to a sorted array, and we teach it this way.
It turns out that, because of cache effects, sorted order is usually not the most efficient way to store data for searching. In this project, we study the performance of different memory layouts for binary searching:
sorted: A usual sorted array with binary search applied to it.

eytzinger: An implicit binary search tree packed into an array using the Eytzinger (a.k.a. BFS) layout usually seen with binary heaps.

btree: An implicit B-tree packed into an array using the obvious generalization of the Eytzinger layout. The value B here is chosen so that B-1 data items fit into 64 bytes (the most common cache line width); with 4-byte integers, for example, B-1 = 16, so B = 17.

veb: An implicit binary search tree packed into an array using the van Emde Boas layout seen in the cache-oblivious literature.

So which layout is fastest? The answer is complicated, and it seems to depend on the data size, the cache size, the cache line width, and the relative cache speed. In many settings B-trees (with a properly chosen value of B) are best. In others, the Eytzinger layout wins. In others still, the van Emde Boas layout is the winner (at least for large enough array sizes).
For an example, consider the following two graphs, generated by running the same code on two different Intel machines. In the left graph, the Eytzinger layout is almost as slow as a plain sorted array, while the van Emde Boas and B-tree layouts are more than twice as fast. In the right graph, the Eytzinger and B-tree layouts are the fastest, the sorted array is still the slowest, and the vEB layout is somewhere in between (at least for larger array sizes).
This is why I need your help. I have some theories that may explain this complicated behaviour, but they need to be validated on a larger sample of hardware. If you have a Linux machine and would like to contribute to this effort, then please follow the instructions below. First, download arraylayout.tgz and unpack it:
$ tar xzvf arraylayout.tgz
arraylayout-0.0/
arraylayout-0.0/Makefile
arraylayout-0.0/eytzinger_array.h
arraylayout-0.0/main.cpp
arraylayout-0.0/btree_array.h
arraylayout-0.0/sorted_array.h
arraylayout-0.0/cacher.cpp
arraylayout-0.0/README
arraylayout-0.0/veb_array.h
arraylayout-0.0/base_array.h

and then run it like this:
$ cd arraylayout-0.0; make data

This will create a file called run_data.tgz that you can just email me. If you want to be extra helpful, you can also run:
$ sudo dmidecode --type 17

and include the output in your email. This will give me detailed information about the amount, type, and speed of memory installed in your computer, as described here.
One of these theories concerns the Eytzinger layout. Its index calculations, 2*i+1 and 2*i+2, are as simple as those done in binary search, and the layout is friendly to the prefetcher: after accessing index i, the next array index accessed is 2*i+1 or 2*i+2, both of which are probably in the same cache line. Looking even two levels down, the next access is in the range 4*i+3,…,4*i+6. A good prefetcher can get this lined up two or even three levels ahead of time. If that's not the case, then using the __builtin_prefetch directive might help.
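To make the index arithmetic concrete, here is a minimal standalone sketch of a branch-free search over an Eytzinger-ordered array. The function name and signature are mine for illustration; the package's eytzinger_array.h is templated and differs in details:

    #include <cstddef>

    // Search the Eytzinger-ordered array b[0..n-1] for x.  Returns the
    // (Eytzinger-order) index of the smallest element >= x, or n if
    // every element is smaller than x.
    std::size_t eytzinger_search(const int *b, std::size_t n, int x) {
        std::size_t i = 0;
        while (i < n)
            i = (x <= b[i]) ? 2*i + 1 : 2*i + 2;  // left child : right child
        // i+1 records the comparison path in binary: a left turn appends
        // a 0 bit and a right turn appends a 1 bit.  The answer sits at
        // the last left turn, so strip the trailing 1 bits and one 0 bit.
        std::size_t j = (i + 1) >> __builtin_ffsll((long long)~(i + 1));
        return (j == 0) ? n : j - 1;
    }

Note that the loop body is just one comparison and one multiply-add per level, which is what makes the layout attractive in the first place.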
Update: I confirmed this by using an explicit prefetch, __builtin_prefetch(a+16*i+22), at the top of the loop (a being the array; __builtin_prefetch takes an address, not an index). The descendants of index i four levels down occupy indices 16*i+15,…,16*i+30, so this fetches the cache line in the middle of that range. On the 4790K, this gives a modest speedup when we run out of L2 cache. On the Atom 330 (which is the figure on the left, above) it gives a substantial speedup, making the Eytzinger layout competitive again for a while, at least. Here are the graphs.
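For reference, this is what the search sketch from above looks like with that prefetch added. The constant 16 assumes 4-byte values, sixteen of which fill one 64-byte cache line; real code would derive it from the value and line sizes:

    // Same sketch as above, with an explicit prefetch four levels ahead.
    std::size_t eytzinger_search_prefetch(const int *b, std::size_t n, int x) {
        std::size_t i = 0;
        while (i < n) {
            // Hint the line holding i's descendants four levels down
            // (indices 16*i+15,...,16*i+30).  A prefetch is only a hint
            // and does not fault on typical targets, so running past the
            // end of the array here is harmless.
            __builtin_prefetch(b + 16*i + 22);
            i = (x <= b[i]) ? 2*i + 1 : 2*i + 2;
        }
        std::size_t j = (i + 1) >> __builtin_ffsll((long long)~(i + 1));
        return (j == 0) ? n : j - 1;
    }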