A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.

Early computers were most limited by their paltry memory sizes. von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single type written page (much more memory than the few hundred words available to the Eniac). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.

SDC 920 computer, Computer History Museum, Mountain View CA

Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce. So in one sense databases and Map Reduce different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick).In our web age, the typical big data problems are inverting indices (for fast search lookup) and computing term frequencies (for TF/IDF scoring or for things like Naive Bayes classifiers). Since these are over-worked examples we will use a mathematical problem from “Additive Combinatorics”, Terence Tao, Van Vu, (ISBN-13: 9780521853866; ISBN-10: 0521853869)

We take one problem from the field of additive combinatorics: sum sets. For two sets of integers A = {a_1, … a_s} and B {b_1, …, b_t} the sum set is defined as the set (without repetition) A + B = { a_i + b_j | i = 1,…s, j=1…t }. For sets of integers the size of A+B (denoted as |A+B|) can vary from |A| + |B| – 1 to |A| * |B| depending on the relations between the numbers in A and B (or the structure of A and B). If instead of working with integers we work with integers modulo p where p is a prime number (or equivalently we treat all numbers as remainders of division by p) then by the Cauchy-Davenport inequality we have |A + B| ≥ min(|A|+|B|-1,p) (so essentially the same result, except when we run out of possible integers modulo p).

For example we would say (working modulo 19) that [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18]. In fact there are 19 pairs of sets that add up to [0, 1, 10, 11, 12, 14, 15, 16, 18] ( for instance [5, 6, 9, 10] + [5, 6, 9, 10] is another such pair). Just to move forward assume we were interested in determining how many ways a set can be written as the sum of a pair of sets (each of size 4). For a given sum result we might try search or integer programming to find all possible summands. However, if we want the statistics on all sums simultaneously, we can work much quicker and without need for big gun mathematics.

The straightforward solution is this case is a bit of code like:

```
```for set A from all possible sets of 4 integers from 0 to 18
for set B from all possible sets of 4 integers from 0 to 18
let set C = A + B modulo 19
use set C as a key and add the pair (A,B) to the list associated with C
for all key sets C tracked above
compute the size of the list of summand pairs found for C
print how many result sets C have a given number of summand pairs

The relations C which have a summand of form A can be collected by any bit of Java code implementing the interface below (just call `insertReln(C,(A,B))`

to store the relations and then `entries()`

to get them back). A small interface that declares the needed methods is given below:

```
```public interface RelnCollector<A,B> {
void insertReln(A a, B b) throws IOException;
Iterable<Map.Entry<C,Iterable<B>>> entries() throws IOException, InterruptedException;
void close() throws IOException;
}

An in-memory relation collector is trivially implemented by a nested map adjusted to declare the above interface, as we see in the next code snippet:

public final class InMemoryRelnCollector<A,B> implements RelnCollector<A,B> { private final DataAdapter<A> adapterA; private final DataAdapter<B> adapterB; private Map<A,Iterable<B>> atoBs; public InMemoryRelnCollector(final DataAdapter<A> adapterA, final DataAdapter<B> adapterB) { this.adapterA = adapterA; this.adapterB = adapterB; atoBs = new TreeMap<A,Iterable<B>>(this.adapterA); } @Override public void insertReln(final A a, final B b) { Set<B> set = (Set<B>) atoBs.get(a); if(null==set) { set = new TreeSet<B>(adapterB); atoBs.put(a,set); } if(!set.contains(b)) { set.add(b); } } @Override public Iterable<Map.Entry<A,Iterable<B>>> entries() { return atoBs.entrySet(); } @Override public void close() { atoBs = null; } }

The great savings in time is that we work from summands to results sums (but keep many sets of results indexed by result sets). Thus we don’t have to figure out how to invert the sum operation (as we do our bookkeeping forward). However, this very bookkeeping may overwhelm us. As we can see below, a Java implementation of the above procedure runs out of memory when trying to characterize which sets of integers modulo 19 can be split into two sets of size four (and how many ways each such set can be split). However, this was with the deliberately small default allocation of memory available to Java processes (so for this particular instance we could avoid trouble by allocating more memory, we ran out of allocation not system memory). What happens when we don’t manage memory is illustrated below:

Start com.winvector.consolidate.impl.InMemoryRelnCollector Tue Dec 06 10:04:38 PST 2011 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.TreeMap.put(TreeMap.java:554) at java.util.TreeSet.add(TreeSet.java:238) at com.winvector.consolidate.example.AdditiveSets.sum(AdditiveSets.java:25) at com.winvector.consolidate.example.AdditiveSets.main(AdditiveSets.java:55)

An out of core solution can solve the entire problem without needing any additional system memory (just some disk space which is still of a much greater size than primary memory). The complete calculated result is given below:

Examining sums of 4 integers chosen from 0 through 18 modulo 19. Start com.winvector.consolidate.impl.FileRelnCollector Tue Dec 06 09:54:20 PST 2011 Inserted 15023376 relations. [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [0, 1, 15, 16] + [0, 14, 15, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [0, 3, 4, 18] + [11, 12, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [0, 14, 15, 18] + [0, 1, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [1, 2, 5, 6] + [9, 10, 13, 14] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [1, 2, 16, 17] + [13, 14, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [2, 3, 6, 7] + [8, 9, 12, 13] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [2, 3, 17, 18] + [12, 13, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [3, 4, 7, 8] + [7, 8, 11, 12] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [4, 5, 8, 9] + [6, 7, 10, 11] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [5, 6, 9, 10] + [5, 6, 9, 10] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [6, 7, 10, 11] + [4, 5, 8, 9] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [7, 8, 11, 12] + [3, 4, 7, 8] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [8, 9, 12, 13] + [2, 3, 6, 7] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [9, 10, 13, 14] + [1, 2, 5, 6] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [10, 11, 14, 15] + [0, 1, 4, 5] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [11, 12, 15, 16] + [0, 3, 4, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [12, 13, 16, 17] + [2, 3, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18] [13, 14, 17, 18] + [1, 2, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18] Examined 128820 sums and 15023376 summands. found 3705 sums with 19 distinct summands found 39900 sums with 38 distinct summands found 26847 sums with 76 distinct summands found 22230 sums with 114 distinct summands found 10602 sums with 152 distinct summands found 8892 sums with 190 distinct summands found 2736 sums with 228 distinct summands found 5016 sums with 266 distinct summands found 2736 sums with 304 distinct summands found 1710 sums with 342 distinct summands found 171 sums with 361 distinct summands found 1710 sums with 380 distinct summands found 855 sums with 418 distinct summands found 342 sums with 456 distinct summands found 342 sums with 532 distinct summands found 342 sums with 570 distinct summands found 171 sums with 722 distinct summands found 171 sums with 760 distinct summands found 171 sums with 912 distinct summands found 171 sums with 1026 distinct summands Done: com.winvector.consolidate.impl.FileRelnCollector elapsed time: 618473MS Tue Dec 06 10:04:38 PST 2011

We performed the calculation be using a different implementation of `RelnCollector`

called `FileRelnCollector`

. What this implementation does is write relations to a file as they are made available. That is FileRelnCollector implementation of `insertReln`

is literally a `println()`

. Something not more more complicated than the following:

```
``` @Override
public void insertReln(final A a, final B b) {
System.out.println("" + a + "t" + b);
}

The heavy lifting is done when `entries()`

is called. When the entries are wanted the `FileRelnCollector`

calls GNU sort on the saved file to get all the results ordered by result sum (instead of by summand). GNU sort can sort files larger than memory by a split and merge strategy involving temporary files. We provide such a file plus GNU sort based implementation of RelnCollector.

Note that this runtime can be deceptively low. If running on a machine with a modern operating system and enough memory the file being used as “external storage” actually gets cached into memory (and gets near memory speed performance). To get a reliable timing you need to test a problem of the size you are interested in on the size machine you are going to deploy on (not on a larger machine).

For better or worse this method should seem familiar as a lot of science has been done using the Unix text tools (sort, join and a few more). This is also the basis of Map Reduce and we demonstrate a Hadoop implementation of RelnCollector as well. Or we can link up with the other technology designed for beyond memory size data manipulation and get a database based implementation of RelnCollector.

In all cases the implementations we call depend on journaling (in the sense of keeping a sequential log of operations to be done instead of immediately performing the operations), scattering (splitting into multiple temp files and structures) and merging (combining data form multiple ordered files). We could write our own code to perform all of these operations (obliviating any need for GNU sort, Hadoop or a database), but it is much less code to do as we have here and write an adapter to use existing implementations.

The sum-set example is deliberately artificial. More common examples are, as we mentioned, index inversion and term frequency calculation. All of our example code is available here: https://github.com/WinVector/OutOfCore including JUnit tests and an example program. The code depends on libraries for JUnit 4.10, h2 database, Hadoop 0.21.0 for the various implementations.

The main trick is basing your code on a very thin storage abstraction (like the `RelnCollector`

interface, instead of explicitly known data structures) and then using this abstraction to hide all of the details away from the rest of your code (keeping complexity at a manageable level). The two things to avoid are either infecting your code with too much knowledge of your storage plans (i.e. pushing implementation details into your important code to “speed things up”) or being forced to re-design your entire project to fit within some framework (like re-writing all of your code as a database stored procedure or an explicit Hadoop map/reduce pair as this over-commits you to one technology).

Categories: Coding Computer Science Programming Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.