
Worry Over Columns, not Rows

I say: if you are a data scientist or working on an analytics project, worry over columns not rows.

In analytics “rows” are instances, and “columns” are possible measurements. For example: each click on a website might generate a row recording the visit, and this row would be populated with columns describing what was clicked on (and if you are lucky there are more records recording what else was presented and not clicked on).

My experience is that operations groups are very used to worrying over individual instances or “rows.” They are “on the hook” for answering questions such as: “how did this item get into our warehouse?”, “how did this item fail to get into our warehouse?”, or “why doesn’t this record match this record?”

However, statistics (analytics, machine learning, or even “AI”) works by extracting shared properties or trends from many data instances (or rows). In this setting the rare extra, or missed, row really does not matter. What matters are the measurements (columns), or the lack of them. A poorly curated column (e.g.: “this column records the amount spent, unless there was a voucher”) breeds complicated analysis code. A missing measurement (or column) (e.g.: “we don’t record items not clicked on”) makes some tasks harder, or even impossible.
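To make the “amount spent, unless there was a voucher” example concrete, here is a minimal Python sketch. The field names (`order_id`, `amount_spent`, `voucher_used`) and the choice to store 0.0 for voucher orders are assumptions made for illustration; the point is only that every downstream analysis has to carry the caveat.

```python
from statistics import mean

# Hypothetical purchase records; all field names are illustrative assumptions.
# Assumed convention: when a voucher was used, amount_spent holds 0.0
# rather than the real value of the order.
rows = [
    {"order_id": 1, "amount_spent": 20.0, "voucher_used": False},
    {"order_id": 2, "amount_spent": 0.0, "voucher_used": True},
    {"order_id": 3, "amount_spent": 40.0, "voucher_used": False},
]


def naive_mean_spend(rows):
    """Treat the column as if it always meant 'amount spent'."""
    return mean(r["amount_spent"] for r in rows)


def careful_mean_spend(rows):
    """Remember the caveat: only rows without a voucher are usable."""
    usable = [r["amount_spent"] for r in rows if not r["voucher_used"]]
    return mean(usable)


print(naive_mean_spend(rows))    # 20.0, pulled down by the voucher row
print(careful_mean_spend(rows))  # 30.0, voucher row excluded
```

One extra or missing row barely moves either number. The ambiguous column definition is what changes the answer, and the special case has to be repeated in every analysis that touches the column.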

My point is: as an analyst working with operational teams you are going to have to understand they naturally value data in terms of rows, while your project values data in terms of columns.

Categories: Opinion

John Mount

2 replies

  1. On the slightly different but relevant subject of “Set-Based” programming, one should also be more concerned with columns than rows. In fact, I have this little snippet of simple advice in my signature line of posts over on …

    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW… think, instead, of what you want to do to a COLUMN.

    1. Set based thinking is simply missing from most app developers' way of thinking... I've rewritten a stored proc that ran for 8 hrs overnight, RBARing its way through. There were only 4 different updates being done, based on the contents of the row it was on. I rewrote it to do those 4 updates set based rather than RBAR. Testers thought it was broken, it ran so fast: something like 40 seconds. Yet I still can't make the developers see the light...
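The set-based advice in the replies above can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the two update rules are hypothetical and chosen only for illustration. This is not the stored procedure the commenter describes, and it makes no claim about timings.

```python
import sqlite3


def make_db():
    """Build a small in-memory table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, status, amount) VALUES (?, ?, ?)",
        [(1, "new", 10.0), (2, "vip", 20.0), (3, "new", 30.0), (4, "vip", 40.0)],
    )
    return conn


def update_row_by_row(conn):
    """RBAR style: fetch every row, decide per row, one UPDATE per row."""
    fetched = conn.execute("SELECT id, status, amount FROM orders").fetchall()
    for row_id, status, amount in fetched:
        if status == "vip":
            new_amount = amount * 0.5
        else:
            new_amount = amount + 1.0
        conn.execute(
            "UPDATE orders SET amount = ? WHERE id = ?", (new_amount, row_id)
        )


def update_set_based(conn):
    """Set-based style: one statement per rule, applied to the column."""
    conn.execute("UPDATE orders SET amount = amount * 0.5 WHERE status = 'vip'")
    conn.execute("UPDATE orders SET amount = amount + 1.0 WHERE status <> 'vip'")


def amounts(conn):
    """Return the amount column ordered by id."""
    return [a for (a,) in conn.execute("SELECT amount FROM orders ORDER BY id")]


rbar_db = make_db()
update_row_by_row(rbar_db)

set_db = make_db()
update_set_based(set_db)

print(amounts(rbar_db) == amounts(set_db))  # both approaches agree
```

Both functions produce the same column values. The set-based version says what should happen to the `amount` column under each rule and leaves the iteration over rows to the database engine.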
