A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.
In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.
In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).
Containerized DB
The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)
“Despite all my rage I am still just a rat in a cage!”
(image credit).
In data science we get distracted by shiny things. We (wrongly) end up thinking of data centers, software packages, systems, and lines of code as being direct benefits or assets. Really they are like weight on an aircraft: the price you pay for being able to transform data, but a cost to minimize.
You can get cheap access to industrial strength SQL implementations in many ways:
- Use SQL directly on R data frames using the sqldf package
- Figure out how to install and talk to a “no server” database such as SQLite or H2.
- Use a containerized database. This is what Nina did in her article, and is the topic of this article.
- Use a remote service like Amazon redshift. Pay commodity prices for somebody else to worry about the all so important infrastructure.
- Install and maintain the database software on your machine (instructions here).
(something we have blogged about and included in our book Practical Data Science with R).
As we said, we are going to write about containerized databases. Nina pointed out a pre-made containerized version of PostgreSQL for Apple OSX: Postgres.app. What it does is trap the database in a user application (the user can stop and start at will) that then declares it is a “server.” From that point on you use the database for data services: transforming and aggregating data. When you are done you close the app and it is gone. The merit of the zero-install is this “no install” database doesn’t interfere with any other system installed databases or services. Installing a database usually triggers a bunch of dependencies and re-configurations of system services. Historically you were never sure if you had successfully uninstalled a PostgreSQL from an Apple Mac, so it is nice to run one without installing.
Basically database systems are jerks. Give them an inch and they take a mile. The strategy of containerization is: let them be lords in a pocket universe (the container) that we can close and even throw away. Let the database implementation do all of the non-negotiable important things it wants inside a container to keep your real machine clean.
We are now going to show how to use PostgreSQL in a containerized fashion on Microsoft Windows. Our strategy is to use a pre-made standard docker PostgreSQL container. This means we have to put up with the foibles of the container system (in this case Docker), but not a tick more than that. Containers and virtualization systems are also jerks, but we are going to pay our price once and be done.
Below are the steps.
- Install Docker on your Windows machine (instructions here). You must install and run docker toolbox to make any progress. How this works depends on your system (and note you can not run Docker inside some virtualized environments such as VirtualBox). Basically you are front-paying with installation frustration for later flexibility. I found starting Docker on non-Linux environments (Apple OSX, and Microsoft Windows 8) a bit hit-or miss (requiring a re-try).
- Once you have the Docker Quickstart up you should see the following:
## . ## ## ## == ## ## ## ## ## === /"""""""""""""""""___/ === ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ / ===- ~~~ ______ o __/ __/ ___________/ docker is configured to use the default machine with IP 192.168.99.100 For help getting started, check out the docs at https://docs.docker.com
Write down the IP address in the message (in our case
192.168.99.100
). Then type the following into the Docker Quickstart shell:docker run -p 5432:5432 --name pg -e POSTGRES_PASSWORD=pg -d postgres
- Now in your R or RStudio type (substituting the IP address you saw):
library('RPostgreSQL') pg <- dbDriver("PostgreSQL") con <- dbConnect(pg, user="postgres", password="pg", host="192.168.99.100", port=5432)
(As always runninginstall.packages('RPostgreSQL')
ifRPostgreSQL
is not already installed.)
And you are good to go. You are running against a PostgreSQL database trapped in a docker image (itself implemented in a VirtualBox virtual machine). Treat this database as a transient resource used to run complex SQL and dplyr queries efficiently (and not as a data store of record).
For anything more we suggest a good Docker reference (and would in fact like recommendations!).
Categories: Coding Exciting Techniques Opinion Pragmatic Data Science
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Great article. Looks like a convenient way to work with temporary database on local system (provided they fit, of course). Can you include a brief description in your article on how to remove the database from Docker?
Thanks Uden,
It gets a bit ugly- but with Docker it is at least disposable (beyond installing Docker none of the detailed ugliness is happening on your host system).
I believe you can shutdown and remove the database by typing the following into the Docker Quickstart window:
To shutdown the virtual machine (and delete state/data) that is supplying Linux emulation (needed on Windows and OSX, not present on Linux): run VirtualBox (should be found in
C:\Program Files\Docker
on Windows, and/Applications
on OSX) and “power off” by right-clicking on the machine image in the left panel and selecting “Close -> Save State”.To restart you re-run
Docker Quickstart Terminal
and re-launch the database as in the article.The thing to be aware of is on Windows and OSX all of the Docker stuff is stored in a VirtualBox virtual machine. So to find how big a file system the database is trapped in look at the settings of the Virtual Machine in VirtualBox (by right-clicking on the machine and then choosing Setting-Storage). Resizing the virtual drive is a bit involved (but can be done http://stackoverflow.com/questions/11659005/how-to-resize-a-virtualbox-vmdk-file ).
I totally agree, I also use docker for starting postgres dev environment. Below function can be handy as it uses docker’s postgres env vars, so you have a single place to maintain setup for host, port, db, credentials. It is a part of my `pg` package https://github.com/jangorecki/pg
“`
pgConnect = function(host = Sys.getenv(“POSTGRES_HOST”, “127.0.0.1”), port = Sys.getenv(“POSTGRES_PORT”, “5432”), dbname = Sys.getenv(“POSTGRES_DB”, “postgres”), user = Sys.getenv(“POSTGRES_USER”, “postgres”), password = Sys.getenv(“POSTGRES_PASSWORD”, “postgres”)) DBI::dbConnect(RPostgreSQL::PostgreSQL(), host = host, port = port, dbname = dbname, user = user, password = password)
“`
That is awesome, thanks!
Also I have found sometimes one of ‘127.0.0.1’ or ‘localhost’ works as the hostname (which one depends on some details of the port forwarding I imaging).
check this for postgres helpers: https://jangorecki.gitlab.io/pg/html/pg.html
and here for transactional R logging on postgres: https://jangorecki.gitlab.io/logR/html/logR.html
logR is quite mature already.