Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft Excel
spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using Excel
-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of gdata
‘s underlying Perl
script to work around a bug).
But one thing that continues to confound us is how hard it is to read Excel
data correctly. When Excel
exports into CSV/TSV
style formats it uses fairly clever escaping rules about quotes and new-lines. Most CSV/TSV
readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is Excel
itself often transforms data without any user verification or control. For example: Excel
routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable Excel
transform: changing the strings “TRUE
” and “FALSE
” into 1 and 0 inside the actual “.xlsx
” file. That is Excel
does not faithfully store the strings “TRUE
” and “FALSE
” even in its native format. Most Excel
users do not know about this, so they certainly are in no position to warn you about it.
This would be a mere annoyance, except it turns out Libre Office
(or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.
We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.
Our example Excel
spreadsheet was produced using Microsoft Excel
2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the TRUE/FALSE
cells. The spreadsheet looks like the following:
Original Excel
spreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells).
You can also download the spreadsheet here.
On OSX
Apple Numbers
can read the sheet correctly. We demonstrate this below.
Sheet looks okay in Apple Numbers.
However, Libre Office
doesn’t reverse the encoding (as it may not know some details of Excel
‘s encoding practices) and also shows corrupted data as we see below.
TRUE/FALSE
represented as 1/0
in Libre Office
, and third row damaged.
In practice we have seen the data damage is pervasive and not limited to columns who’s original value was FALSE
. It may be a presentation problem as examining individual cells shows “=TRUE()
” and “=FALSE()
” as the contents of the affected cells (and apparently in the correct positions independent of what is being displayed).
Apple Preview
and Quick Look
both also fail to understand the Excel
data encoding, as we show below.
Sheet damaged in Apple Preview (same for Apple Quick Look).
Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of TRUE/FALSE
to 1/0
):
R appears to load what was stored correctly.
But what is going on? It turns out Excel
.xlsx
files are actually zip
archives storing a directory tree of xml
artificts. By changing the file extension from .xlsx
to .zip
we can treat the spreadsheet as a zip
archive and inflate it to see the underlying files. The inflated file tree is shown below.
The file tree representing the Excel
workbook on disk.
Of particular interest are the files xl/worksheets/sheet1.xml
and xl/sharedStrings.xml
. sheet1.xml
contains the worksheet data and sharedStrings.xml
is a shared string table containing all strings used in the worksheet (the worksheet stores no user supplied strings, only indexes into the shared string table). Let’s look into sheet1.xml
:
The XML representing the sheet data.
The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the <c>
tag has has an attributed set to t="s"
(probably denoting type is “string” and to use the <v>
value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the <v>
tag. Further notice that the last three columns are stored as 0/1
and have the attribute t="b"
set. My guess is this is declaring the type is “boolean” which then must have the convention that 1
represents TRUE
and 0
represents FALSE
.
This doesn’t seem that complicated, but clearly of all the “Excel
compatible” tools we tried only Apple Numbers
knew all of the details of this encoding (and was able to reverse it). Other than Numbers
only R
‘s gdata
package was able to extract usable data (and even it only recovered the encoded version of the field, not the original user value).
And these are our issue with working with data that has passed through Excel
.
Excel
has a lot of non-controllable data transforms including booleans, and dates (in fact mangling string fragmentsExcel
even suspects could be made into dates). Some of these transforms are non-faithful or not reversible.- Very few tools that claim to interoperate with
Excel
actually get the corner cases right. Even for simple well-documented data types likeExcel
CSV
export. And definitely not for the native.xlsx
format.
These transforms and conventions make exporting data harder (and riskier) than it has to be. To add insult to injury you often run into projects that are sharing Excel
.xlsx
spreadsheets where neither the reader nor the writer is Excel
, so neither end is even good at working with the format. Because working with data that has passed through Excel
is hard to get right, data that has passed through Excel
is often wrong.
(Note: I definitely feel we do need to be thankful to open source and free software developers. These teams in addition to generously supplying software without charge are also working to preserve user freedoms and often the only way to read older data. However, when we are using software for work we do need it to work correctly and be faithful to data. This problem is small when you detect it, but large if hidden in a larger project.)
Categories: Opinion Pragmatic Data Science Public Service Article Tutorials
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
One thing that can help – whenever possible create a system to generate your templates rather than copy & paste. You then test the system generating them carefully and then have safely created templates.
Great article on this at Huffington Post – http://www.huffingtonpost.com/david-thielen/one-ridiculously-expensiv_b_5741486.html
I have gotten around to submitting this as a bug to Libre Office: https://bugs.freedesktop.org/show_bug.cgi?id=86456 .
Why I wrote the blog article first: I needed to warn some partners and a rich-media post under my control was a much more useful for that purpose than a bug report where I can’t even edit comments.
It’s actually an instance of this bug reported in July:
https://www.libreoffice.org/bugzilla/show_bug.cgi?id=81350
The booleans that should display as TRUE/FALSE display as 0/1, and if there are consecutive values that are the same in the same column, only the first value is shown.
Oddly, if you just copy/paste the cells elsewhere, they display correctly.
Nice article.
Gnumeric reads correctly the spreadsheet, with this message:
Sheet1!E4 : Encountered uninterpretable “ext” extension in namespace “{64002731-A6B0-56B0-2670-7721B7C09600}”
The interesting thing is that if you open the file with Libre Office, then save it as xls and open that file with Libre Office again it works fine… Weirdness.
You seem to be right.
I re-examined the sheet and in the boolean section I see “
=TRUE()
” and “=FALSE()
” in the formula editor independent of what the cell is displaying as (and apparently in a pattern matching the original data). I’ve updated this post (as I originally thought that it was not a presentation problem). Of course if the UI layer is confused that is enough to confuse the user (and possibly enough to confuse other tools).If you save from
Libre Office
as.xlsx
(instead of.xls
) and then re-load in MicrosoftExcel
you see theTRUE/FALSE
cells have been changed to=TRUE()/=FALSE()
, and you get an “inconsistent formulas” warning. So the sheet has been damaged, but is likely in a regular patten that is repairable. So it looks like a combination of a bad transformation on load plus a presentation layer bug.Just tried the current
Open Office
. It looks like it convertsTRUE/FALSE
to=TRUE()/=FALSE()
, but besides that looks okay. I confirmed that MicrosoftExcel
does not display the functional versions in its formula bar.I also re-captured the Apple
Numbers
screen-grab to show the bottom left inspector cell (which looks correct).The obvious conclusion is I don’t know all about all the types hiding in Excel (number, string, boolean, formula, cross-ref, and who knows what), but it looks like a lot of software that claims to interoperate with
Excel
data types are also missing some of the details.I think one reason why Excel handles boolean values like that is localization. TRUE/FALSE is translated to WAHR/FALSCH in german, as are the formula functions =TRUE() and =FALSE(). So mapping this data type to 1/0 and not use shared strings is not a bad idea, because they depend on the document and the language it was created/last saved with. It still does not explain why other libraries fail to load the values as boolean though. I recently started to use the xlsx package to write and read xlsx files. Boolean values from R are encoded as strings when writing excel sheets using ‘ TRUE’ and ‘FALSE’ (notice the space before TRUE), which leads to problems if you want to load data to R, which you stored before. But at least you get consistent behavior for users, even if they use LibreOffice or another Excel localization.
Thank you for the nice post.
I just opened your spreadsheet using LibreOffice (LibreOffice 3.5.7.2, Build-ID: 350m1(Build:2)) on my Ubuntu machine and everything works fine.
What I see is a pefectly correct sheet with localization (to German, as explained by Sebastian) where the cells actually contain =WAHR() which is displayed as WAHR (which translates to TRUE) and =FALSCH() (i.e., FALSE).
So I agree that Excel (and similar) spreadsheets are sometimes unpredictable but have to take sides of LibreOffice which seems to have fixed the described problem (at least in my version).
I am glad it appears to be working for you. But that may not last. First I suspect the representation has been changed from how
Excel
stores booleans (WAHR
) to a function call (WAHR()
). This may lead to trouble down the road (ifExcel
intended to storeWAHR
asWAHR()
it likely would have done so itself; and it is likely easier to confirm values are identical than to prove observed differences are safe and inconsequential). At least this is what I see examining files on my side. This seems to be closer toOpen Office
‘s behavior.Loading the sheet back into Excel gives confusing “Inconsistent Formula” warnings (which seem to be advisory stating that not all formulas in a region are identical).
Finally I am using
LibreOffice
Version: 4.3.4.1
, so worse versions of this bug are likely waiting for you when you or Ubuntu trigger an upgrade. I thinkBuild ID: bc356b2f991740509f321d70e4512a6a54c5f243
Libre Office
3.5.7.2
comes from late 2012.