Menu Home

Unrolling the ROC Curve

Most readers of this blog are likely familiar with the use of the ROC (Receiver Operating Characteristic) curve (or, at least, the area under that curve) for evaluating the quality of binary decision processes. One example of such a process is a binary classification model; another example is an A/B […]

Schemas for Python Data Frames

The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database. Though, Pandas data frames are typically manipulated through methods, instead of with a relational query language. One can […]

Doing Better than the Average

The standard way to estimate the an expected value of a population from a sample of values v1 … vn is to compute the average (1/n) sumi = 1…nvi. It is well known in statistics that for grouped data, there are other estimators that can have smaller expected square error. […]

Data Algebra over Polars Ready for Production Use

The data algebra is a system for composing data manipulation tasks in Python. In the data algebra, operator pipelines (or even directed acyclic graphs) are the primary objects. Applying operations composes small data pipelines into larger ones. This allows the fluid specification, inspection, and sharing of data processing and data […]