Binary search is a wonderful algorithm: theoretically optimal, fast in practice, and discoverable by school children playing guessing games. Usually we think of binary search as being applied to a sorted array, and we teach it this way.
It turns out that, because of cache effects, sorted order is usually not the most efficient way to store data for searching. In this project, we study the performance of different memory layouts for binary searching:
sorted: A usual sorted array with binary search applied to it.

eytzinger: An implicit binary search tree packed into an array using the Eytzinger (a.k.a. BFS) layout usually seen with binary heaps.

btree: An implicit B-tree packed into an array using the obvious generalization of the Eytzinger layout. The value B here is chosen so that B-1 data items fit into 64 bytes (the most common cache line width); with 4-byte integers, for example, B-1 = 16, so B = 17.

veb: An implicit binary search tree packed into an array using the van Emde Boas layout seen in the cache-oblivious literature.

So which layout is fastest? The answer is complicated, and it seems to depend on the data size, the cache size, the cache line width, and the relative cache speed. In many settings B-trees (with a properly chosen value of B) are best. In others, the Eytzinger layout wins. In others still, the van Emde Boas layout is the winner (at least for large enough array sizes).
For an example, consider the following two graphs, generated by running the same code on two different Intel machines. In the left graph, the Eytzinger layout is almost as slow as a plain sorted array, while the van Emde Boas and B-tree layouts are more than twice as fast. In the right graph, the Eytzinger and B-tree layouts are the fastest, the sorted array is still the slowest, and the vEB layout is somewhere in between (at least for larger array sizes).
This is why I need your help. I have some theories that may explain this complicated behaviour, but they need to be validated on a larger sample of hardware. If you have a Linux machine and would like to contribute to this effort, then please follow the instructions below. First, download arraylayout.tgz and unpack it:
$ tar xzvf arraylayout.tgz
arraylayout-0.0/
arraylayout-0.0/Makefile
arraylayout-0.0/eytzinger_array.h
arraylayout-0.0/main.cpp
arraylayout-0.0/btree_array.h
arraylayout-0.0/sorted_array.h
arraylayout-0.0/cacher.cpp
arraylayout-0.0/README
arraylayout-0.0/veb_array.h
arraylayout-0.0/base_array.h

and then run it like this:
$ cd arraylayout-0.0; make data

This will create a file called run_data.tgz that you can just email me. If you want to be extra helpful, you can also run:
$ sudo dmidecode --type 17

and include the output in your email. This will give me detailed information about the amount, type, and speed of memory installed in your computer, as described here.
One of these theories concerns the Eytzinger layout. Its index calculations, 2*i+1 and 2*i+2, are as simple as those done in binary search, and the layout is friendly to the prefetcher: after accessing index i, the next array index accessed is 2*i+1 or 2*i+2, both of which are probably in the same cache line. Looking even two levels down, the next access is in the range 4*i+3,…,4*i+6. A good prefetcher can get this lined up two or even three levels ahead of time. If that's not the case, then using the __builtin_prefetch directive might help.
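To make the index arithmetic concrete, here is a minimal standalone sketch of a branch-free search over an Eytzinger-ordered array. The function name and signature are mine for illustration; the package's eytzinger_array.h is templated and differs in details:

    #include <cstddef>

    // Search the Eytzinger-ordered array b[0..n-1] for x.  Returns the
    // (Eytzinger-order) index of the smallest element >= x, or n if
    // every element is smaller than x.
    std::size_t eytzinger_search(const int *b, std::size_t n, int x) {
        std::size_t i = 0;
        while (i < n)
            i = (x <= b[i]) ? 2*i + 1 : 2*i + 2;  // left child : right child
        // i+1 records the comparison path in binary: a left turn appends
        // a 0 bit and a right turn appends a 1 bit.  The answer sits at
        // the last left turn, so strip the trailing 1 bits and one 0 bit.
        std::size_t j = (i + 1) >> __builtin_ffsll((long long)~(i + 1));
        return (j == 0) ? n : j - 1;
    }

Note that the loop body is just one comparison and one multiply-add per level, which is what makes the layout attractive in the first place.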
Update: I confirmed this by using an explicit prefetch, __builtin_prefetch(a+16*i+22), at the top of the loop (a being the array; __builtin_prefetch takes an address, not an index). The descendants of index i four levels down occupy indices 16*i+15,…,16*i+30, so this fetches the cache line in the middle of that range. On the 4790K, this gives a modest speedup when we run out of L2 cache. On the Atom 330 (which is the figure on the left, above) it gives a substantial speedup, making the Eytzinger layout competitive again for a while, at least. Here are the graphs.
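For reference, this is what the search sketch from above looks like with that prefetch added. The constant 16 assumes 4-byte values, sixteen of which fill one 64-byte cache line; real code would derive it from the value and line sizes:

    // Same sketch as above, with an explicit prefetch four levels ahead.
    std::size_t eytzinger_search_prefetch(const int *b, std::size_t n, int x) {
        std::size_t i = 0;
        while (i < n) {
            // Hint the line holding i's descendants four levels down
            // (indices 16*i+15,...,16*i+30).  A prefetch is only a hint
            // and does not fault on typical targets, so running past the
            // end of the array here is harmless.
            __builtin_prefetch(b + 16*i + 22);
            i = (x <= b[i]) ? 2*i + 1 : 2*i + 2;
        }
        std::size_t j = (i + 1) >> __builtin_ffsll((long long)~(i + 1));
        return (j == 0) ? n : j - 1;
    }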