I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. I like Pandas, and I thank the authors and maintainers for their efforts.
Now I kind of wonder what Pandas is, or what it wants to be.
The 1.3.0 package seems to be marking natural ways of working with a data frame as “low performance” and issuing warnings (in some situations, over and over again).
It is now considered rude to insert a column into a Pandas data frame. I find this odd, as I pretty much came to Pandas for a structure I can easily add and remove columns from.
Let’s work an example.
First we try an example that simulates what might happen as a data scientist works with a data frame: some columns get added. Here they are added all at once, in one place, as this is just simulation code.
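A minimal sketch of the kind of cell that triggers the warning. The insert statement is the one quoted in the warning below, but nrow, ncol, and the timeit harness are my own stand-ins, not the original values:

```python
import numpy
import pandas
import timeit

nrow = 100000  # assumed row count (stand-in for the original)
ncol = 100     # assumed number of added columns (stand-in)

def f_insert():
    # Start with a small frame, then add columns one at a time,
    # the way columns tend to accumulate during an analysis.
    d = pandas.DataFrame({'y': numpy.zeros(nrow)})
    for i in range(ncol):
        d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d

print(timeit.timeit(f_insert, number=10))
```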
```
<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)

2.707611405
```
The above warning only occurred once in this context. In other applications I have seen it repeat very many times, overwhelming the worksheet. I guess I could add %%capture to each and every cell to try to work around this.
The demanded alternative is something like the following.
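Roughly this, continuing the definitions from the sketch above: build all the new columns up front and attach them with a single concat.

```python
def f_concat():
    d = pandas.DataFrame({'y': numpy.zeros(nrow)})
    # Build every new column in one dict comprehension ...
    new_cols = pandas.DataFrame({
        'var_' + str(i).zfill(4): numpy.zeros(nrow)
        for i in range(ncol)})
    # ... then attach them in one step instead of ncol separate inserts.
    return pandas.concat([d, new_cols], axis=1)

print(timeit.timeit(f_concat, number=10))
```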
Yes, concat is faster. But it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb, ready to spill warnings once some threshold is crossed?
I guess one could keep a dictionary mapping column names to numpy 1-d arrays and work with that, if one wants a column-oriented data structure.
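A sketch of that dict-of-arrays idea, again reusing nrow and ncol from above: insertion into a plain dict is cheap, and the data frame is built once at the end.

```python
cols = {'y': numpy.zeros(nrow)}
for i in range(ncol):
    # Plain dict insert: no block manager, no fragmentation warning.
    cols['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
d = pandas.DataFrame(cols)  # single construction at the end
```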
(Note: how to suppress this warning can be found here: https://github.com/twopirllc/pandas-ta/issues/340#issuecomment-879450854 )
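One way to do that is to filter the warning class named in the message above; a sketch:

```python
import warnings
import pandas

# Silence pandas fragmentation warnings process-wide.
warnings.simplefilter(action='ignore',
                      category=pandas.errors.PerformanceWarning)
```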