# Semantic Equivalence of SQL Queries (Draft)

## 1. Introduction

When are two SQL queries the same?

We say that two queries are **semantically equivalent** when their execution results agree on all possible inputs. Some queries are obviously equivalent:

```
SELECT att1 FROM table;
```

```
SELECT att1 FROM (SELECT att1, att2 FROM table);
```

With complex queries, however, the problem of establishing equivalence becomes a challenging one. The problem of *disproving* equivalence is often substantially easier: all we need is one counterexample. We cannot, however, experiment on all possible inputs, and so establishing a positive result must rely on a formal proof.

Without a formal proof, an equivalence claim is, at best, an educated guess. And even brilliant people make wrong educated guesses.

For example, consider the so-called **COUNT bug**, discovered originally in the context of rewriting nest queries for optimization (Ganski–Wong, 1987):

```
SELECT pnum
FROM parts
WHERE qoh =
(SELECT COUNT(shipdate)
FROM supply
WHERE supply.pnum = parts.pnum
AND shipdate < 80);
```

```
WITH temp AS
SELECT pnum, COUNT(shipdate) as ct
FROM supply
WHERE shipdate < 80
GROUP BY pnum
SELECT pnum
FROM parts, temp
WHERE parts.qoh = temp.ct
AND parts.pnum = temp.pnum;
```

Ostensibly, both queries retrieve the part numbers (`pnum`

) of those parts whose quantities on hand (`qoh`

) equal the number of shipments of those parts (`COUNT(shipdate)`

) before 80. Nevertheless, the two queries fail to be semantically equivalent. TO see this, we consider the following dataset:

`parts(pnum, qoh)`

=`supply(pnum, shipdate)`

=

In the first query, the subquery

```
SELECT COUNT(shipdate)
FROM supply
WHERE supply.pnum = parts.pnum
AND shipdate < 80;
```

returns 2 for `pnum = 3`

, 1 for `pnum = 10`

, and 0 for `pnum = 8`

. Of those, `pnum = 10`

and `pnum = 8`

have the matching `qoh`

counts, so the full query returns

pnum |
---|

10 |

8 |

On the other hand, the `temp`

subquery in the second query returns

pnum | ct |
---|---|

3 | 2 |

10 | 1 |

and so the full query returns

pnum |
---|

10 |

It follows that the two queries are inequivalent.

While the above example may seem frivolous, it is worth noting that the database research community failed to discover the bug for 5 years. A typical database management system runs many query rewrite jobs for optimization purposes. So long as we rely on human intuition for correctness, there *will* be errors.

How do we establish semantic equivalence, then? In this post, we study the basic theory of semantic equivalence and survey a few tools for automatic verification of semantic equivalence.

## 2. Semantic Equivalence

In this section, we formalize the notion of semantic equivalence. We start with a quick review of the relational algebra formalism and define semantic equivalence in the context of relational algebra.

### 2.1. Relational Algebra

Recall from the theory of relational algebra that a **relational algebra** is a relation scheme, equipped with union, difference, select, projection, and join. We briefly review these notions below. See Section 2.2 for the formal definition of a relational algebra.

A **relation scheme** is a finite set of **attribute names** . For each attribute name , we define the domain to be a nonempty, countable set. A **relation** on is a finite set of functions from to such that for each and every . The functions are called **tuples**.

Let be the set of all possible relations on with respect to domains .

Given two relations , the **union** of and is the relation containing tuples in either or . The **difference** between and is the relation containing tuples that are in but not in .

The **selection operator** is defined by the formula

for each attribute , value , and relation , provided that . In other words, selection returns all tuples that satisfy a certain condition.

To define the projection operator, we need a few preliminary definitions.

Given a relation scheme with fixed domains , we define a **subscheme** of to be a subset of , with the same domains. We write to denote the set of all possible relations on respect to the domains inherited from , and to denote the union of domains

The **restriction** of a tuple over is a function such that for all . Note that, by definition, for all , whence is a tuple on . The **restricton** of a relation on is the set

By definition, is a relation on .

Let be a relation scheme and recall that the powerset of is the collection of all subsets of . The **projection operator** is defined by the formula

.

Lastly, we define the join operator.

We let and be relation schemes with fixed domains, and and be sets of all relations on and , respectively. The union is a set of attributes (with fixed domains), and thus a relation scheme. We let denote the set of all relations on .

Given tuples and over and , respectively, we define the **join** of and to be a tuple such that on and on , provided that on .

The **join operator** is defined by the formula

### 2.2. Expression Values

Formally, we define a **relational algebra** by specifying the following (Codd 1972):

- a set of attributes, called the
**universe**, - a set of domains,
- a function that assigns a domain to each attribute,
- a set of relation schemes, where for each ,
- a set of relations, where is a relation on for each ,
- a set of operators
*union*,*difference*,*select*,*project*, and*join*, defined on the attributes in .

Thinking of a relation as a two-dimensional table with tuples as rows, we can understand a relation scheme to be a collection of tables that share the same table headers. Since the definition of a relational algebra attaches precisely one relation to each relation scheme, a relational algebra can be also be considered a collection of tables, but without the restriction that they share the same headers. In this manner, the notion of a relational algebra coincides with the notion of a *database*, with each table represented by a relation on a relation scheme.

To further explore this connection, we now assume that a primary is always chosen for each relation scheme. Let us write, for notational convenience, to denote the primary key of a relation scheme .

The set of relation schemes is said to be a **relational database scheme** if

and if whenever . In other words, a collection of relation schemes is a relational database scheme if it consists of distinct relation schemes that, collectively, make use of every attributes in the universe .

When is a relational database scheme, we say that is a **relational database** in case