
I think Pandas may have “lost the plot.”

I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. I like Pandas, and I thank the authors and maintainers for their efforts.

Now I kind of wonder what Pandas is, or what it wants to be.

(Image: a “Not sure if I’ve lost the plot, or he has lost the plot” meme.)

The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnings (in some situations over and over again).

It is now considered rude to insert a column into a Pandas data frame. I find this odd, as I pretty much come to Pandas for a structure I can easily add and remove columns from.

Let’s work an example.
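The code cells did not survive in this copy of the post; the cell that produced the ‘1.3.0’ output below was presumably a simple version check along these lines (a sketch, not the original cell):

```python
import pandas

# confirm which pandas we are running; the post shows '1.3.0'
pandas.__version__
```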

'1.3.0'

First we try an example that simulates what might happen in the case of a data scientist working with a data frame. Some columns get added. Here they all happen at once, in one place, as this is just simulation code.
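The original cell is missing here; judging from the warning message below, it likely resembled the following sketch (the sizes `nrow` and `ncol`, the function name `f_insert`, and the timing call are my guesses, not the post’s exact code):

```python
import timeit

import numpy
import pandas

nrow = 100000  # assumed sizes; the original cell is not shown
ncol = 100


def f_insert():
    # build a frame, then add columns one at a time -- the "natural" way
    d = pandas.DataFrame({'y': numpy.zeros(nrow)})
    for i in range(ncol):
        d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d


# the timing number in the post came from some such call,
# though the exact repetition count is unknown
t_insert = timeit.timeit(f_insert, number=1)
```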

<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider using pd.concat instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)





2.707611405 (insert-at-a-time timing)

The above warning only occurred once in this context. In other applications I have seen it repeat very many times, overwhelming the worksheet. I guess I could add %%capture to each and every cell to try to work around this.

The demanded alternative is something like the following.
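Again the original cell is missing; the concat-based version was presumably along these lines (same assumed `nrow` and `ncol` as the earlier sketch; `f_concat` is my name for it):

```python
import timeit

import numpy
import pandas

nrow = 100000  # assumed sizes, matching the earlier sketch
ncol = 100


def f_concat():
    # build every new column first, then combine in one concat call
    d = pandas.DataFrame({'y': numpy.zeros(nrow)})
    new_columns = pandas.DataFrame({
        'var_' + str(i).zfill(4): numpy.zeros(nrow)
        for i in range(ncol)})
    return pandas.concat([d, new_columns], axis=1)


t_concat = timeit.timeit(f_concat, number=1)
```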

1.4440117099999998 (single-concat timing)

Yes, concat is faster, but it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb that will spill warnings once some threshold is crossed?

I guess one could keep a dictionary mapping column names to numpy 1-d arrays and work with that, if one wants a column oriented data structure.
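That idea, sketched (the names and sizes here are mine, not from the post): keep the working data as a plain dict of numpy arrays, so no insert ever touches a data frame, and only build a DataFrame once at the very end.

```python
import numpy
import pandas

nrow = 100000  # assumed sizes, as in the earlier sketches
ncol = 100


def f_dict():
    # a dict maps column names to numpy 1-d arrays; inserting a key is cheap
    cols = {'y': numpy.zeros(nrow)}
    for i in range(ncol):
        cols['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    # convert once, at the end, when a real DataFrame is needed
    return pandas.DataFrame(cols)
```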

(note: how to suppress this warning can be found here: https://github.com/twopirllc/pandas-ta/issues/340#issuecomment-879450854 )
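Following that link, the suppression amounts to filtering pandas’ PerformanceWarning category (a sketch; whether you want to do this globally is a judgment call):

```python
import warnings

from pandas.errors import PerformanceWarning

# ignore only the fragmentation/performance warnings, not all warnings
warnings.simplefilter(action='ignore', category=PerformanceWarning)
```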

Categories: data science, Rants


John Mount

1 reply

  1. Part of my concern is that there is a chance the warning is just chasing people
    into the nightmare “too many concats” pattern, which is likely much slower than any
    other alternative (probably quadratic run time in the number of inserts). The following
    code is obviously silly, but that is made clear because all the column additions are
    near each other (which is often not the case in actual application code).

    def f_nightmare():
        d = pandas.DataFrame({
            'y': numpy.zeros(nrow)
        })
        for i in range(ncol):
            # each concat copies all columns accumulated so far,
            # so total work is quadratic in the number of inserts
            d = pandas.concat(
                [d,
                 pandas.DataFrame({'var_' + str(i).zfill(4): numpy.zeros(nrow)})],
                axis=1)
        return d
    

    I really think a package should supply a simple, recommended, safe, best-practice API and stand by it, not push optimization guilt onto its users (who are likely in the middle of something else when this pops up).
