Semantic Equivalence of SQL Queries (Draft)

1. Introduction

When are two SQL queries the same?

We say that two queries are semantically equivalent when their execution results agree on all possible inputs. Some queries are obviously equivalent:

SELECT att1 FROM table;
SELECT att1 FROM (SELECT att1, att2 FROM table);

With complex queries, however, the problem of establishing equivalence becomes a challenging one. The problem of disproving equivalence is often substantially easier: all we need is one counterexample. We cannot, however, experiment on all possible inputs, and so establishing a positive result must rely on a formal proof.

Without a formal proof, an equivalence claim is, at best, an educated guess. And even brilliant people make wrong educated guesses.

For example, consider the so-called COUNT bug, discovered originally in the context of rewriting nest queries for optimization (Ganski–Wong, 1987):

SELECT pnum
  FROM parts
  WHERE qoh =
    (SELECT COUNT(shipdate)
      FROM supply
      WHERE supply.pnum = parts.pnum
        AND shipdate < 80);
WITH temp AS
  SELECT pnum, COUNT(shipdate) as ct
    FROM supply
    WHERE shipdate < 80
    GROUP BY pnum
SELECT pnum
  FROM parts, temp
  WHERE parts.qoh = temp.ct
    AND parts.pnum = temp.pnum;

Ostensibly, both queries retrieve the part numbers (pnum) of those parts whose quantities on hand (qoh) equal the number of shipments of those parts (COUNT(shipdate)) before 80. Nevertheless, the two queries fail to be semantically equivalent. TO see this, we consider the following dataset:

  • parts(pnum, qoh) =
  • supply(pnum, shipdate) =

In the first query, the subquery

SELECT COUNT(shipdate)
  FROM supply
  WHERE supply.pnum = parts.pnum
    AND shipdate < 80;

returns 2 for pnum = 3, 1 for pnum = 10, and 0 for pnum = 8. Of those, pnum = 10 and pnum = 8 have the matching qoh counts, so the full query returns

pnum
10
8

On the other hand, the temp subquery in the second query returns

pnum ct
3 2
10 1

and so the full query returns

pnum
10

It follows that the two queries are inequivalent.

While the above example may seem frivolous, it is worth noting that the database research community failed to discover the bug for 5 years. A typical database management system runs many query rewrite jobs for optimization purposes. So long as we rely on human intuition for correctness, there will be errors.

How do we establish semantic equivalence, then? In this post, we study the basic theory of semantic equivalence and survey a few tools for automatic verification of semantic equivalence.

2. Semantic Equivalence

In this section, we formalize the notion of semantic equivalence. We start with a quick review of the relational algebra formalism and define semantic equivalence in the context of relational algebra.

2.1. Relational Algebra

Recall from the theory of relational algebra that a relational algebra is a relation scheme, equipped with union, difference, select, projection, and join. We briefly review these notions below. See Section 2.2 for the formal definition of a relational algebra.

A relation scheme is a finite set of attribute names . For each attribute name , we define the domain to be a nonempty, countable set. A relation on is a finite set of functions from to such that for each and every . The functions are called tuples.

Let be the set of all possible relations on with respect to domains .

Given two relations , the union of and is the relation containing tuples in either or . The difference between and is the relation containing tuples that are in but not in .

The selection operator is defined by the formula

for each attribute , value , and relation , provided that . In other words, selection returns all tuples that satisfy a certain condition.

To define the projection operator, we need a few preliminary definitions.

Given a relation scheme with fixed domains , we define a subscheme of to be a subset of , with the same domains. We write to denote the set of all possible relations on respect to the domains inherited from , and to denote the union of domains

The restriction of a tuple over is a function such that for all . Note that, by definition, for all , whence is a tuple on . The restricton of a relation on is the set

By definition, is a relation on .

Let be a relation scheme and recall that the powerset of is the collection of all subsets of . The projection operator is defined by the formula

.

Lastly, we define the join operator.

We let and be relation schemes with fixed domains, and and be sets of all relations on and , respectively. The union is a set of attributes (with fixed domains), and thus a relation scheme. We let denote the set of all relations on .

Given tuples and over and , respectively, we define the join of and to be a tuple such that on and on , provided that on .

The join operator is defined by the formula

2.2. Expression Values

Formally, we define a relational algebra by specifying the following (Codd 1972):

  • a set of attributes, called the universe,
  • a set of domains,
  • a function that assigns a domain to each attribute,
  • a set of relation schemes, where for each ,
  • a set of relations, where is a relation on for each ,
  • a set of operators union, difference, select, project, and join, defined on the attributes in .

Thinking of a relation as a two-dimensional table with tuples as rows, we can understand a relation scheme to be a collection of tables that share the same table headers. Since the definition of a relational algebra attaches precisely one relation to each relation scheme, a relational algebra can be also be considered a collection of tables, but without the restriction that they share the same headers. In this manner, the notion of a relational algebra coincides with the notion of a database, with each table represented by a relation on a relation scheme.

To further explore this connection, we now assume that a primary is always chosen for each relation scheme. Let us write, for notational convenience, to denote the primary key of a relation scheme .

The set of relation schemes is said to be a relational database scheme if

and if whenever . In other words, a collection of relation schemes is a relational database scheme if it consists of distinct relation schemes that, collectively, make use of every attributes in the universe .

When is a relational database scheme, we say that is a relational database in case