It usually gives us a chuckle when we find some natural and seemingly easy data science question is NP-hard. For instance we have written that variable pruning is NP-hard when one insists on finding a minimal sized set of variables (and also why there are no obvious methods for exact large permutation tests).

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.

Problem: Minimum Cardinality Primary Key

Instance: Vectors x_{1}through x_{m}elements of {0,1}^{n}and positive integer k.

Question: Is there a “primary key” of size no more then k? That is: is there a subset P of {1, …, n} with |P| ≤ k such that for any integers a, b with 1 ≤ a < b ≤ n we can find an j in P such that x_{a}(j) doesn’t equal x_{b}(j) (i.e. x_{a}and x_{b}differ at some index named in P, and hence can be distinguished or “told apart”).

Now the standard reference on NP-hardness (Garey and Johnson, *Computers and Intractability*, Freeman, 1979) does have some NP-hard database examples (such as SR26 Minimum Cardinality Key). However the stated formulations are a bit hard to decode, so we will relate the above problem directly to a more accessible problem: SP8 Hitting Set.

Problem: SP8 Hitting Set

Instance: Collection C of subsets of a finite set S, positive integer K ≤ |S|.

Question: Is there a subset S^{‘}contained in S with |S^{‘}| ≤ K such that S^{‘}contains at least one element from each subset in C?

The idea is: SP8 is thought to be difficult to solve, so if we show how Minimum Cardinality Primary Key could be used to easily solve SP8 this is then evidence Minimum Cardinality Primary Key is also difficult to solve.

So suppose we have an arbitrary instance of SP8 in front of us. Without loss of generality assume S = {1, …, n}, C = {C_{1}, …, C_{m}}, and all of the C_{i} are non-empty and distinct.

We build an instance of the Minimum Cardinality Primary Key problem by defining a table with columns named s_{1} through s_{n} plus d_{1} through d_{m}.

Now we define the rows of our table:

- Let r
_{0}be the row of all zeros. - For i from 1 to m let z
_{i}be the row with z_{i}(d_{i}) = 1 and all other columns equal to zero. - For i from 1 to m let x
_{i}be the row with x_{i}(d_{i}) = 1, x_{i}(s_{j}) = 1 for all j in C_{i}, and all other columns equal to zero.

Now let’s look at what sets of columns form primary keys for the collection of rows r_{0}, z_{i}, x_{i}.

We must have all of d_{i} in P, as each d_{i} is the unique index of the only difference between z_{i} and r_{0}. Also, for any i we must have a j such that z_{i}(d_{j})=1 and j in C_{i}, as if there were none we could not tell z_{i} from x_{i} (as they differ only in indices named by C_{i}).

This lets us confirm a good primary key set P is such that S^{‘} = {j | s_{j} in P} is itself a good hitting set for the SP8 problem. And for any hitting set S^{‘} we have P = {s_{j} | j in S^{‘}} union {d_{i}, … d_{m}} is a good solution for the Minimum Cardinality Primary Key problem (the d_{i} allow us to distinguish r_{0} from z_{i}, the z_{i} from themselves, r_{0} from x_{i}, and the x_{i} from them selves; the set hitting property lets us distinguish z_{i} from the corresponding x_{i}, completing the unique keying of rows by the chosen column set). And the solution sizes are always such that |P| = |S’| + m.

So: if we had a method to solve arbitrary instances of the Minimum Cardinality Primary Key problem, we could then use it to solve arbitrary instances of the SP8 Hitting Set Problem. We would just re-encode the SP8 problem as described above, solve the Minimum Cardinality Primary Key problem, and use the strong correspondence between solutions to these two problems to map the solution back to the SP8 problem. Thus the Minimum Cardinality Primary Key problem is itself NP-hard.

What made the problem hard was, as is quite common, is: the solution size constraint. Without that constraint the problem is trivial. The set of all columns either forms a primary key or does not, and it is simple calculation to check that. As with the variable pruning problem we can even try step-wise deleting columns to explore subsets of columns that are also primary table keys, moving us to a non-redundant key set (but possibly not of minimal size).

Categories: Computer Science Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.