When are two SQL queries the same?
We say that two queries are semantically equivalent when their execution results agree on all possible inputs. Some queries are obviously equivalent:
SELECT att1 FROM table;
SELECT att1 FROM (SELECT att1, att2 FROM table);
With complex queries, however, the problem of establishing equivalence becomes a challenging one. The problem of disproving equivalence is often substantially easier: all we need is one counterexample. We cannot, however, experiment on all possible inputs, and so establishing a positive result must rely on a formal proof.
Without a formal proof, an equivalence claim is, at best, an educated guess. And even brilliant people make wrong educated guesses.
The COUNT bug was discovered originally in the context of rewriting nest queries for optimization (Ganski–Wong, 1987):
SELECT pnum FROM parts WHERE qoh = (SELECT COUNT(shipdate) FROM supply WHERE supply.pnum = parts.pnum AND shipdate < 80);
WITH temp AS SELECT pnum, COUNT(shipdate) as ct FROM supply WHERE shipdate < 80 GROUP BY pnum SELECT pnum FROM parts, temp WHERE parts.qoh = temp.ct AND parts.pnum = temp.pnum;
Ostensibly, both queries retrieve the part numbers (
pnum) of those parts whose quantities on hand (
qoh) equal the number of shipments of those parts (
COUNT(shipdate)) before 80. Nevertheless, the two queries fail to be semantically equivalent. TO see this, we consider the following dataset:
In the first query, the subquery
SELECT COUNT(shipdate) FROM supply WHERE supply.pnum = parts.pnum AND shipdate < 80;
returns 2 for
pnum = 3, 1 for
pnum = 10, and 0 for
pnum = 8. Of those,
pnum = 10 and
pnum = 8 have the matching
qoh counts, so the full query returns
On the other hand, the
temp subquery in the second query returns
and so the full query returns
It follows that the two queries are inequivalent.
While the above example may seem frivolous, it is worth noting that the database research community failed to discover the bug for 5 years. A typical database management system runs many query rewrite jobs for optimization purposes. So long as we rely on human intuition for correctness, there will be errors.