I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds.One thing nobody admits to is: Excel data is hard to read reliably into R. Admitting this means admitting there is something Excel is better at than R is (even if it is just reading its own data). Cran suggests several methods to read data into R, but most of them fail on non-trivial (but not even large) data. For example cran suggests:
- Methods that depend on connecting to a Windows service in a 32 bit environment (like RODBC).
The xlsx package which even when you add more memory to the required Java dependency (
.jinit(parameters="-Xmx6G")) runs out of memory parsing a 44 megabyte file.
- The gdata package which uses Perl for the Excel parsing.
- The XLConnect package which seems to not install on R2.15.0
- The Omegahat package RExcelXML (not part of cran).
Of these packages we have had the best luck with gdata (and already this is not simple). However we have seen the package silently truncate data on read. Also, strings are converted to factors (though this can be fixed by passing the appropriate read.table() settings through the read.xls() call). Digging into the gdata package we see it uses a Perl script to try and export Excel data into a quoted format- but the package does not get all of the various possibilities of embedded line-breaks, embedded quotes, escapes and comments correct. The wrong string-valued field can silently poison the results.
Additionally we have found is R’s read.table() and read.xls() command do not fully understand all the complexities of Excel export quoting (which allows actual quotes, fields separators and line separators inside fields, relying on a non-Unix escape sequence of doubling quotes). This is a problem for any data set which includes text extracts (bringing in line breaks, false field separators and problematic punctuation). To add insult to injury R’s read table is in no way guaranteed to be able to read the output of R’s own write.table(), unless you remember to set enough options about quoting, escaping, line breaks and comments both during writing and reading. Finally Excel’s native save as text or save as CSV has the amazing bug that any fields that is too big to display (and thus displayed as hash-marks) is in fact exported as hash-marks! So even having a copy of Excel isn’t enough to solve the Excel export problem.
If needless complexity is causing problems- let us try avoiding the complexity. I have modified the Perl script found in the gdata package to a new version that exports to what I call “strong TSV.” In strong TSV fields never contain the fields separator (tab) or a line-separator (CR or LF) and there is no quoting, escaping or comments (quoting is no longer needed, so we just treat quotes as a standard character like any other). To export to strong TSV we just use a simple Perl regexp to replace all field whitespace with standard space (slightly damaging the field, but saving a lot of down-stream complexity). The file converted in this manner can then be read reliably using R’s read.table command:
d <- read.table('file.tab', quote='',comment.char='', allowEscapes=F,sep='t', header=T,as.is=T, stringsAsFactors=F)
The modified perl-script ( xls2goodTSV.pl ) is placed in the Perl section of the gdata library (in my case
Library/R/2.15/library/gdata/perl) and run in that environment.
Some of these weaknesses are flaws in the cran-suggested R libraries. But some of the problems are due to the complexity of Excel and even quoting in exported Excel documents (and the hash-bug). All of these steps could be skipped if the data were available in a true machine readable format like strong TSV, JSON or SQL dump (all of which get quoting correct). My suggestion is to try strong TSV as it is very simple and interoperates well with most tools (R, linux command line tools and even Excel itself). The time you save may be your own.
Update: forgot to mention a cool example where insisting on a simple format that avoids too much quoting and escaping has a big payoff: Apache Hadoop Map Reduce TextInputFormat.
Because Apache Hadoop Map Reduce TextInputFormat format is simple the Hadoop master node can split a large file of records by just specifying seek locations (see org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits()) and then letting the reader skip one (usually partial) line to re-align the reader to valid record boundaries (see org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize()). This allows file splitting that does not stress the master node (as the master node does not need to scan the file) and efficient job distribution as each split destination only needs to look at the file interval it is responsible for plus at most one more contiguous line. If the master node had to scan the how file for record breaks (which formats that allow general quoting and escaping would entail) there would be little point in the split and distribute parallelism- you might as well complete all your work on the master node as you have already forced it to scan the whole input file.
The details being: if Apache allowed a file format here that required a complicated state machine to determine line breaks (tracking things like are we in a quoted or commented environment) then you could not tell where record boundaries are without scanning from the start of the file every time! Because even if you see properly formatted records you need to know that there is not in an outer quoting context that says all of your records are just a single spreadsheet included with a single cell of an outer spreadsheet (so it is not just details of any one quoting mechanism that is working against you, it is the desire to faithfully quote; which makes some desired information like true record boundary locations essentially non-local).
The take-away is if you are going to dedicate all of your effort to supporting complicated quoting and legacy formats: that is all you will ever achieve. You will be held hostage to the considerations and trade-offs that went into those formats and not be able to achieve new things (like nearly free parallelism).
Also I would like to re-emphasize that a lot of tools that claim to correctly read complicated formats often only work correctly on the simple or limited cases that the original tool builder needed during original tool construction. These libraries often fail to appreciate the large amount of complexity quickly and correctly parsing formats with many interacting features (quoting, escaping, comments, field separators, line separators, character encodings, localization and many more unrewarding features). So you can try libraries, just be prepared to waste a lot of time being disappointed.
Edit: obviously post 2015 we have a lot more options (including the
readr package). And I also wanted to call out a format I am seeing a lot of in finance:
HTML with an “
.xlsx” or “
.xls” suffix. These files tend to have data in a an
HTML table that Excel is willing to read after if figures out the file-suffix is a lie (and infers the file type on its own). The
R HTML capable packages (such as
xml2) tend to be able to extract the table into a
data.frame fairly handily, but there are issues with column type inference.
Categories: Coding data science Opinion Programming Rants Statistics
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
I understand the desire to use text formats when the data will be stored in a version control system. But for general computing, I don’t understand why real broadly-accepted scientific data formats like netcdf and hdf5 have not gained more traction.
Okay, so you are giving here a method to convert an excel file to something nice (strong TSV).
I agree with you that we should avoid excel. I hate it when someone complains about my dataset, and then admits that they opened my csv file in excel.
Since I am usually the one delivering data, I suppose it would be nice to be able to run something like this directly out of Oracle (or whatever database there is). That is, replace your excel file with a database query.
So I can deliver TSV, and no one ever needs to see an excel file.
Thinking from the perspective of someone who spends more time delivering data than receiving data.
This is weirdly put: “One thing nobody admits to is: Excel data is hard to read reliably into R. Admitting this means admitting there is something Excel is better at than R is (even if it is just reading its own data)”. No, I’m sure a lot of people will agree that Excel data is hard to read reliably in R, or pretty much anything other than Excel. I wouldn’t call that “admitting that Excel is better than R [at reading xls files]”, rather blaming Microsoft for such an obscure format.
>> Finally Excel’s native save as text or save as CSV has the amazing bug that any fields that is too big to display (and thus displayed as hash-marks) is in fact exported as hash-marks! >>
I cannot reproduce this behavior with Excel 2007 /Windows.
Happens to me reliably on Excel 2008 for the Mac. It could also be there is some field size above what causes hash-display that triggers hash export (so there may be an intermediate size that displays as hashes, but exports okay). The size in question is about 4 to 8 kb (news article extracts).
XLConnect should install and run fine with R 2.15.0. If you experience any issues I would be interested in the details (error messages etc). If you are running on Mac OS X, make sure to install XLConnect from source as there is no Mac binary.
Thanks Martin, I am on Mac OSX and the message “Warning message: package ‘XLConnect’ is not available (for R version 2.15.1)” didn’t make me try source. I got it to install: “install.packages(‘XLConnect’,type=’source’)” (I already have XCode, XCode command line tools and Mac Ports all installed). But it could not load by 43MB 100000line spreadsheet (can’t share the sheet as it is client proprietary, sorry) with 2G of memory. The commands I issued were:
wb <- loadWorkbook("sheet.xlsx", create=F)
And I get: "Error: OutOfMemoryError (Java): Java heap space"
xlsx files are quite memory hungry when being processed. In short, the reason is that xlsx files are a set of compressed (verbose) XML files (xls files perform a bit better in this regards as it’s a binary format). Currently, when processing “large” xlsx files, you will need to bump up your Java heap space. Make sure to do that BEFORE any JVM process is initialized, i.e. before any package requiring Java is loaded. In the context of XLConnect, this means you have to set options(java.parameters = “-Xmx2G”) before the XLConnect package is loaded. It’s hard to tell how much heap space you will require for your specific example though.
We are currently also investigating ways of limitting memory requirements by processing large Excel files in “sliding windows”.