Menu Home

Excel spreadsheets are hard to get right

Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using Excel-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of gdata‘s underlying Perl script to work around a bug).

But one thing that continues to confound us is how hard it is to read Excel data correctly. When Excel exports into CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is Excel itself often transforms data without any user verification or control. For example: Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable Excel transform: changing the strings “TRUE” and “FALSE” into 1 and 0 inside the actual “.xlsx” file. That is Excel does not faithfully store the strings “TRUE” and “FALSE” even in its native format. Most Excel users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.

Our example Excel spreadsheet was produced using Microsoft Excel 2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the TRUE/FALSE cells. The spreadsheet looks like the following:

Workbook1Excel
Original Excel spreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells).

You can also download the spreadsheet here.

On OSX Apple Numbers can read the sheet correctly. We demonstrate this below.

Workbook1Numbers
Sheet looks okay in Apple Numbers.

However, Libre Office doesn’t reverse the encoding (as it may not know some details of Excel‘s encoding practices) and also shows corrupted data as we see below.

Workbook1LibreOffice
TRUE/FALSE represented as 1/0 in Libre Office, and third row damaged.

In practice we have seen the data damage is pervasive and not limited to columns who’s original value was FALSE. It may be a presentation problem as examining individual cells shows “=TRUE()” and “=FALSE()” as the contents of the affected cells (and apparently in the correct positions independent of what is being displayed).

Apple Preview and Quick Look both also fail to understand the Excel data encoding, as we show below.

Workbook1Preview
Sheet damaged in Apple Preview (same for Apple Quick Look).

Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of TRUE/FALSE to 1/0):

Workbook1R
R appears to load what was stored correctly.

But what is going on? It turns out Excel .xlsx files are actually zip archives storing a directory tree of xml artificts. By changing the file extension from .xlsx to .zip we can treat the spreadsheet as a zip archive and inflate it to see the underlying files. The inflated file tree is shown below.

Workbook1Tree
The file tree representing the Excel workbook on disk.

Of particular interest are the files xl/worksheets/sheet1.xml and xl/sharedStrings.xml. sheet1.xml contains the worksheet data and sharedStrings.xml is a shared string table containing all strings used in the worksheet (the worksheet stores no user supplied strings, only indexes into the shared string table). Let’s look into sheet1.xml:

Workbook1SheetDataXMLThe XML representing the sheet data.

The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the <c> tag has has an attributed set to t="s" (probably denoting type is “string” and to use the <v> value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the <v> tag. Further notice that the last three columns are stored as 0/1 and have the attribute t="b" set. My guess is this is declaring the type is “boolean” which then must have the convention that 1 represents TRUE and 0 represents FALSE.

This doesn’t seem that complicated, but clearly of all the “Excel compatible” tools we tried only Apple Numbers knew all of the details of this encoding (and was able to reverse it). Other than Numbers only R‘s gdata package was able to extract usable data (and even it only recovered the encoded version of the field, not the original user value).

And these are our issue with working with data that has passed through Excel.

  • Excel has a lot of non-controllable data transforms including booleans, and dates (in fact mangling string fragments Excel even suspects could be made into dates). Some of these transforms are non-faithful or not reversible.
  • Very few tools that claim to interoperate with Excel actually get the corner cases right. Even for simple well-documented data types like Excel CSV export. And definitely not for the native .xlsx format.

These transforms and conventions make exporting data harder (and riskier) than it has to be. To add insult to injury you often run into projects that are sharing Excel .xlsx spreadsheets where neither the reader nor the writer is Excel, so neither end is even good at working with the format. Because working with data that has passed through Excel is hard to get right, data that has passed through Excel is often wrong.

(Note: I definitely feel we do need to be thankful to open source and free software developers. These teams in addition to generously supplying software without charge are also working to preserve user freedoms and often the only way to read older data. However, when we are using software for work we do need it to work correctly and be faithful to data. This problem is small when you detect it, but large if hidden in a larger project.)

Categories: Opinion Pragmatic Data Science Public Service Article Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

10 replies

  1. Nice article.

    Gnumeric reads correctly the spreadsheet, with this message:

    Sheet1!E4 : Encountered uninterpretable “ext” extension in namespace “{64002731-A6B0-56B0-2670-7721B7C09600}”

  2. The interesting thing is that if you open the file with Libre Office, then save it as xls and open that file with Libre Office again it works fine… Weirdness.

    1. You seem to be right.

      I re-examined the sheet and in the boolean section I see “=TRUE()” and “=FALSE()” in the formula editor independent of what the cell is displaying as (and apparently in a pattern matching the original data). I’ve updated this post (as I originally thought that it was not a presentation problem). Of course if the UI layer is confused that is enough to confuse the user (and possibly enough to confuse other tools).

      If you save from Libre Office as .xlsx (instead of .xls) and then re-load in Microsoft Excel you see the TRUE/FALSE cells have been changed to =TRUE()/=FALSE(), and you get an “inconsistent formulas” warning. So the sheet has been damaged, but is likely in a regular patten that is repairable. So it looks like a combination of a bad transformation on load plus a presentation layer bug.

      1. Just tried the current Open Office. It looks like it converts TRUE/FALSE to =TRUE()/=FALSE(), but besides that looks okay. I confirmed that Microsoft Excel does not display the functional versions in its formula bar.

        I also re-captured the Apple Numbers screen-grab to show the bottom left inspector cell (which looks correct).

        The obvious conclusion is I don’t know all about all the types hiding in Excel (number, string, boolean, formula, cross-ref, and who knows what), but it looks like a lot of software that claims to interoperate with Excel data types are also missing some of the details.

  3. I think one reason why Excel handles boolean values like that is localization. TRUE/FALSE is translated to WAHR/FALSCH in german, as are the formula functions =TRUE() and =FALSE(). So mapping this data type to 1/0 and not use shared strings is not a bad idea, because they depend on the document and the language it was created/last saved with. It still does not explain why other libraries fail to load the values as boolean though. I recently started to use the xlsx package to write and read xlsx files. Boolean values from R are encoded as strings when writing excel sheets using ‘ TRUE’ and ‘FALSE’ (notice the space before TRUE), which leads to problems if you want to load data to R, which you stored before. But at least you get consistent behavior for users, even if they use LibreOffice or another Excel localization.

  4. Thank you for the nice post.

    I just opened your spreadsheet using LibreOffice (LibreOffice 3.5.7.2, Build-ID: 350m1(Build:2)) on my Ubuntu machine and everything works fine.

    What I see is a pefectly correct sheet with localization (to German, as explained by Sebastian) where the cells actually contain =WAHR() which is displayed as WAHR (which translates to TRUE) and =FALSCH() (i.e., FALSE).

    So I agree that Excel (and similar) spreadsheets are sometimes unpredictable but have to take sides of LibreOffice which seems to have fixed the described problem (at least in my version).

    1. I am glad it appears to be working for you. But that may not last. First I suspect the representation has been changed from how Excel stores booleans (WAHR) to a function call (WAHR()). This may lead to trouble down the road (if Excel intended to store WAHR as WAHR() it likely would have done so itself; and it is likely easier to confirm values are identical than to prove observed differences are safe and inconsequential). At least this is what I see examining files on my side. This seems to be closer to Open Office‘s behavior.

      Loading the sheet back into Excel gives confusing “Inconsistent Formula” warnings (which seem to be advisory stating that not all formulas in a region are identical).

      Finally I am using LibreOffice Version: 4.3.4.1
      Build ID: bc356b2f991740509f321d70e4512a6a54c5f243
      , so worse versions of this bug are likely waiting for you when you or Ubuntu trigger an upgrade. I think Libre Office 3.5.7.2 comes from late 2012.

%d bloggers like this: