Note February 11, 2020: this articles is out of date, we suggest using the methods of Using PostgreSQL in R: A quick how-to instead.
We discuss a “medium scale data” technique that we call “SQL Screwdriver.”
Previously we discussed some of the issues of large scale data analytics. A lot of the work done at the MapReduce scale is necessarily limited to mere aggregation and report generation. But what of medium scale? That is data too large to perform all steps in your favorite tool (R, Excel or something else) but small enough that you are expected to produce sophisticated models, decisions and analysis. At this scale, if properly prepared, you don’t need large scale tools and their limitations. With extra preparation you can continue to use your preferred tools. We call this the realm of medium scale data and discuss a preparation tool style we call “screwdriver” (as opposed to larger hammers).
We stand the “no SQL” movement on its head and discuss the beneficial use of SQL without a server (as opposed to their vision of a key-value store without SQL). Database servers can be a nuisance- but that is not enough reason to give up the power of relational query languages.
One of the tenants of the MAD analytics movement is that you have to constantly move new data towards your decision problem. This wisdom is compatible with the machine learning rule that 90% of your effort is spent on feature design and another 90% is needed for data-tubing. As an example: suppose you are attempting to predict or model the probability that a candidate will make a given purchase or take out a given loan. You might want to merge what you know about candidates (both past candidates which form your training data and future candidates that you are trying to characterize) with other data sources before you start your machine learning process. An example “other data source” is the Census ZCTA arranged data which is aggregated census data (age, income, education and so on) keyed by ZCTA (ZIP Code Tabulation Areas). If you had a table of per-person data called “people” then what you want is to merge the ZCTA data into this table as additional columns. By far the most reasonable way to express this known as a “join”. The SQL expression for such a join looks like the following:
SELECT people.*, zctaSummaries.* FROM people LEFT JOIN ( zctaSummaries ) ON ( people.ZIPCODE = zctaSummaries.ZCTA )
This is the SQL way of saying “make a new table where each row is a row from my people data with the ZCTA data appended as additional columns.” Notice, and
this is the one grace of SQL, that we do not have to specify how this assembly is to be done (no loops, variables or explicit sorting). This is what we want. What we don’t want is the pain of setting up a persistent database server just so we can run some SQL. A server involves processes listening on ports, passwords, provisioning, maintenance and so on. The idea is we should only specify how we want our data prepared for analysis (or de-normalized in database terms), we shouldn’t have to specify how it is done or manage a server just to get it done.
Luckily there are a number of no-server databases that implement SQL. In particular we call out H2. H2 is a pure Java SQL engine, so any place we can run Java we can run H2. And one of the better graphical database clients (SQuirreL SQL) is both free and compatible with H2. So not only can you run your SQL- you can explore your data interactively!
To go further with this example you need a few minor database tools we are releasing under the GPL3 Affero License. The latest copy of SQLScrewdriver.jar contains both compiled Java classes and Java 6 compatible source code. This jar includes the tools that talk to H2 (or any other JDBC compatible database) that embody our SQL Screwdriver idea. To complete the tutorial you will need to download the WinVector Logistic jar, a H2 distribution (contains h2-1.2.147.jar) and the pre-processed ZCTA data (zctaSummaries.tsv).
Next we build a Java XML style properties file describing our database. In our case we leave user and password blank, specify use of the H2 embedded driver and name the file that will be the backing store for our small database (in this case H2TestDB).
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd"> <properties> <comment>testdb</comment> <entry key="user"></entry> <entry key="password"></entry> <entry key="driver">org.h2.Driver</entry> <entry key="url">jdbc:h2:H2TestDB/H2DB ;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0</entry> </properties>
Once we have done this we can transfer the tab-separated data into our database by running the following java command:
java -cp SQLScrewdriver.jar:h2-1.2.147.jar com.winvector.db.LoadTable file:h2Test.xml t file:zctaSummaries.tsv zctaSummaries
The four final arguments are the location of the XML file we just created, “t” to denote tab separated data (“|” for pipe separated), the location of the ZCTA tab separated data and what name to give the new table in the database. Once this has been run the contents of the ZCTA table are in the database. To see that we start up Squirrel SQL, configure Squirrel SQL’s H2 driver to point to our h2-1.2.147.jar and configure a database alias identical to our XML file. One of the great merits of Squirrel SQL is it uses JDBC configuration just like any other Java program- so if you can get it to run in some other program you can get Squirrel SQL to work (and vise versa).
Once Squirrel SQL is up we can examine our table by running a simple select:
select * from zctaSummaries
And we see our data:
We can now use another small tool to dump our freshly joined data into a tab separated format ready for use by an analysis tool (like R):
java -cp SQLScrewdriver.jar:h2-1.2.147.jar com.winvector.db.DBDump file:h2Test.xml "SELECT people.*, zctaSummaries.* FROM people LEFT JOIN ( zctaSummaries ) ON ( people.ZIPCODE = zctaSummaries.ZCTA )" mergedData.tsv
The last three arguments being: the database definition XML again, the exact query we want and the name of a file to write tab separated results into. At this point we are done. The file “mergedData.tsv” can be moved into R for modeling and analysis. We can, for neatness or security, now dispose of the database file if we wish. We have deliberately avoided the built-in bulk table import and export tools as they tend to be finicky and database implementation dependent (our screwdriver tools can be used to move in and out of persistent databases like MySQL just by specifying the correct JDBC driver, URL and driver jars). But the tools are not as important as the attitude of using powerful relational tools (e.g. SQL) in a batch manner on transient data stores (very different than OLTP and OLAP which have high maintenance costs and tend to trap data).
What we have done (with minimal preparation) is: brought the full relational power of SQL to perform the joins required to bring new data into an analysis. We can execute arbitrary SQL (much more powerful than Unix command line “join” and typical R table manipulation tools) and quickly get our data organized for machine learning analysis. We can work on datasets larger than machine memory if needed and have not incurred the cost of configuring or getting access to a server.
Instead of “no SQL” we say: “no server” (which is appropriate in the medium sized data regime so common in predictive analytics).
Note Dec-15-2011: We have moved the code distribution to github.com/WinVector/SQLScrewdriver .
Note Feb-2-2013: We have added a new main() called com.winvector.db.LoadFiles that takes four or more command line arguments in order: dbProbsXMLURI, sepChar(t=tab), tableName and then one or more data source URIs. Both LoadTable and LoadFiles drop the table on load, so LoadFiles is practical way to load more than one file with the same header structure into a table at one time. Both programs can read directly from *.gz files and both now add some data provenance columns and up-case all columns names (to avoid needless column name quoting down the road).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
The question has been asked: what if you have a bit more time to prepare? In that case I would still use the screwdriver tools (they are much more forgiving of real-world issues than the standard data import and export tools that come with database, plus you can patch them if you have special requirements). But I would consider using a real database (MySQL, PostgreSQL or even Greenplum Database Single-Node Edition).
Recent experience with SQL Screwdriver: millions of rows of data scattered in 15 files, some with incomplete keying. 15 way left join took care of it without a problem. Really coming to believe the best place to do your ETL heaving lifting is inside your database (where you have more power and tools). So you use lightweight tools (like the SQL Screwdriver) to get your data in there early and then work for a while inside the database (and then export to your analytics tools, like R).