When are two SQL queries the same?
We say that two queries are semantically equivalent when their execution results agree on all possible inputs. Some queries are obviously equivalent:
SELECT att1 FROM table;
SELECT att1 FROM (SELECT att1, att2 FROM table);
With complex queries, however, the problem of establishing equivalence becomes a challenging one. The problem of disproving equivalence is often substantially easier: all we need is one counterexample. We cannot, however, experiment on all possible inputs, and so establishing a positive result must rely on a formal proof.
Without a formal proof, an equivalence claim is, at best, an educated guess. And even brilliant people make wrong educated guesses.
For example, consider the so-called COUNT bug, discovered originally in the context of rewriting nest queries for optimization (Ganski–Wong, 1987):
SELECT pnum FROM parts WHERE qoh = (SELECT COUNT(shipdate) FROM supply WHERE supply.pnum = parts.pnum AND shipdate < 80);
WITH temp AS SELECT pnum, COUNT(shipdate) as ct FROM supply WHERE shipdate < 80 GROUP BY pnum SELECT pnum FROM parts, temp WHERE parts.qoh = temp.ct AND parts.pnum = temp.pnum;
Ostensibly, both queries retrieve the part numbers (
pnum) of those parts whose quantities on hand (
qoh) equal the number of shipments of those parts (
COUNT(shipdate)) before 80. Nevertheless, the two queries fail to be semantically equivalent. TO see this, we consider the following dataset:
In the first query, the subquery
SELECT COUNT(shipdate) FROM supply WHERE supply.pnum = parts.pnum AND shipdate < 80;
returns 2 for
pnum = 3, 1 for
pnum = 10, and 0 for
pnum = 8. Of those,
pnum = 10 and
pnum = 8 have the matching
qoh counts, so the full query returns
On the other hand, the
temp subquery in the second query returns
and so the full query returns
It follows that the two queries are inequivalent.
While the above example may seem frivolous, it is worth noting that the database research community failed to discover the bug for 5 years. A typical database management system runs many query rewrite jobs for optimization purposes. So long as we rely on human intuition for correctness, there will be errors.
How do we establish semantic equivalence, then? In this post, we study the basic theory of semantic equivalence and survey a few tools for automatic verification of semantic equivalence.
2. Semantic Equivalence
In this section, we formalize the notion of semantic equivalence. We start with a quick review of the relational algebra formalism and define semantic equivalence in the context of relational algebra.
2.1. Relational Algebra
Recall from the theory of relational algebra that a relational algebra is a relation scheme, equipped with union, difference, select, projection, and join. We briefly review these notions below. See Section 2.2 for the formal definition of a relational algebra.
A relation scheme is a finite set of attribute names . For each attribute name , we define the domain to be a nonempty, countable set. A relation on is a finite set of functions from to such that for each and every . The functions are called tuples.
Let be the set of all possible relations on with respect to domains .
Given two relations , the union of and is the relation containing tuples in either or . The difference between and is the relation containing tuples that are in but not in .
The selection operator is defined by the formula
for each attribute , value , and relation , provided that . In other words, selection returns all tuples that satisfy a certain condition.
To define the projection operator, we need a few preliminary definitions.
Given a relation scheme with fixed domains , we define a subscheme of to be a subset of , with the same domains. We write to denote the set of all possible relations on respect to the domains inherited from , and to denote the union of domains
The restriction of a tuple over is a function such that for all . Note that, by definition, for all , whence is a tuple on . The restricton of a relation on is the set
By definition, is a relation on .
Let be a relation scheme and recall that the powerset of is the collection of all subsets of . The projection operator is defined by the formula
Lastly, we define the join operator.
We let and be relation schemes with fixed domains, and and be sets of all relations on and , respectively. The union is a set of attributes (with fixed domains), and thus a relation scheme. We let denote the set of all relations on .
Given tuples and over and , respectively, we define the join of and to be a tuple such that on and on , provided that on .
The join operator is defined by the formula
2.2. Expression Values
Formally, we define a relational algebra by specifying the following (Codd 1972):
- a set of attributes, called the universe,
- a set of domains,
- a function that assigns a domain to each attribute,
- a set of relation schemes, where for each ,
- a set of relations, where is a relation on for each ,
- a set of operators union, difference, select, project, and join, defined on the attributes in .
Thinking of a relation as a two-dimensional table with tuples as rows, we can understand a relation scheme to be a collection of tables that share the same table headers. Since the definition of a relational algebra attaches precisely one relation to each relation scheme, a relational algebra can be also be considered a collection of tables, but without the restriction that they share the same headers. In this manner, the notion of a relational algebra coincides with the notion of a database, with each table represented by a relation on a relation scheme.
To further explore this connection, we now assume that a primary is always chosen for each relation scheme. Let us write, for notational convenience, to denote the primary key of a relation scheme .
The set of relation schemes is said to be a relational database scheme if
and if whenever . In other words, a collection of relation schemes is a relational database scheme if it consists of distinct relation schemes that, collectively, make use of every attributes in the universe .
When is a relational database scheme, we say that is a relational database in case