# Semantic Equivalence of SQL Queries (Draft)

## 1. Introduction

When are two SQL queries the same?

We say that two queries are semantically equivalent when their execution results agree on all possible inputs. Some queries are obviously equivalent:

SELECT att1 FROM table;

SELECT att1 FROM (SELECT att1, att2 FROM table);


With complex queries, however, the problem of establishing equivalence becomes a challenging one. The problem of disproving equivalence is often substantially easier: all we need is one counterexample. We cannot, however, experiment on all possible inputs, and so establishing a positive result must rely on a formal proof.

Without a formal proof, an equivalence claim is, at best, an educated guess. And even brilliant people make wrong educated guesses.

For example, consider the so-called COUNT bug, discovered originally in the context of rewriting nest queries for optimization (Ganski–Wong, 1987):

SELECT pnum
FROM parts
WHERE qoh =
(SELECT COUNT(shipdate)
FROM supply
WHERE supply.pnum = parts.pnum
AND shipdate < 80);

WITH temp AS
SELECT pnum, COUNT(shipdate) as ct
FROM supply
WHERE shipdate < 80
GROUP BY pnum
SELECT pnum
FROM parts, temp
WHERE parts.qoh = temp.ct
AND parts.pnum = temp.pnum;


Ostensibly, both queries retrieve the part numbers (pnum) of those parts whose quantities on hand (qoh) equal the number of shipments of those parts (COUNT(shipdate)) before 80. Nevertheless, the two queries fail to be semantically equivalent. TO see this, we consider the following dataset:

• parts(pnum, qoh) = $\{(3,6), (10,1), (8,0)\}$
• supply(pnum, shipdate) = $\{ (3, 79), (3, 78), (10, 78), (10, 81), (8, 83)\}$

In the first query, the subquery

SELECT COUNT(shipdate)
FROM supply
WHERE supply.pnum = parts.pnum
AND shipdate < 80;


returns 2 for pnum = 3, 1 for pnum = 10, and 0 for pnum = 8. Of those, pnum = 10 and pnum = 8 have the matching qoh counts, so the full query returns

pnum
10
8

On the other hand, the temp subquery in the second query returns

pnum ct
3 2
10 1

and so the full query returns

pnum
10

It follows that the two queries are inequivalent.

While the above example may seem frivolous, it is worth noting that the database research community failed to discover the bug for 5 years. A typical database management system runs many query rewrite jobs for optimization purposes. So long as we rely on human intuition for correctness, there will be errors.

How do we establish semantic equivalence, then? In this post, we study the basic theory of semantic equivalence and survey a few tools for automatic verification of semantic equivalence.

## 2. Semantic Equivalence

In this section, we formalize the notion of semantic equivalence. We start with a quick review of the relational algebra formalism and define semantic equivalence in the context of relational algebra.

### 2.1. Relational Algebra

Recall from the theory of relational algebra that a relational algebra is a relation scheme, equipped with union, difference, select, projection, and join. We briefly review these notions below. See Section 2.2 for the formal definition of a relational algebra.

A relation scheme $R$ is a finite set of attribute names $\{A_1,A_2,\ldots,A_n\}$. For each attribute name $A_i$, we define the domain $\mathcal{D}(A_i)$ to be a nonempty, countable set. A relation on $R$ is a finite set $r = \{t_1,t_2,\ldots,t_p\}$ of functions $t_k$ from $R$ to $D = \mathcal{D}(A_1) \cup \mathcal{D}(A_2) \cup \cdots \cup \mathcal{D}(A_n)$ such that $t_k(A_i) \subseteq \mathcal{D}(A_i)$ for each $1 \leq k \leq p$ and every $1 \leq i \leq n$. The functions $t_1,t_2,\ldots,t_p$ are called tuples.

Let $\mathcal{R}$ be the set of all possible relations on $R$ with respect to domains $\mathcal{D}(A_1),\ldots,\mathcal{D}(A_n)$.

Given two relations $r,s \in \mathcal{R}$, the union of $r$ and $s$ is the relation $r \cup s$ containing tuples in either $r$ or $s$. The difference between $r$ and $s$ is the relation $r \setminus s$ containing tuples that are in $r$ but not in $s$.

The selection operator $\sigma:R \times D \times \mathcal{R} \to \mathcal{R}$ is defined by the formula

for each attribute $A$, value $a$, and relation $r$, provided that $a \in \mathcal{D}(A)$. In other words, selection returns all tuples that satisfy a certain condition.

To define the projection operator, we need a few preliminary definitions.

Given a relation scheme $R = \{A_1,\ldots,A_n\}$ with fixed domains $\mathcal{D}(A_1),\ldots,\mathcal{D}(A_n)$, we define a subscheme of $R$ to be a subset of $R$, with the same domains. We write $\mathcal{R}_S$ to denote the set of all possible relations on $S$ respect to the domains inherited from $\mathcal{R}$, and $D_S$ to denote the union of domains

The restriction of a tuple $t$ over $R$ is a function $t_S:D_S$ such that $t_S(A) = t(A)$ for all $A \in S$. Note that, by definition, $t_S(A) \subseteq \mathcal{D}(A)$ for all $A \in S$, whence $t_S$ is a tuple on $S$. The restricton of a relation $r$ on $R$ is the set

By definition, $r_S$ is a relation on $S$.

Let $R = \{A_1,\ldots,A_n\}$ be a relation scheme and recall that the powerset $\mathscr{P}(R)$ of $R$ is the collection of all subsets of $R$. The projection operator $\pi:\mathscr{P}(R) \times \mathcal{R} \to \bigcup_{S \subseteq R} \mathcal{R}_S$ is defined by the formula

$\pi(S,r) = \pi_S(r) = r_S$.

Lastly, we define the join operator.

We let $R^1$ and $R^2$ be relation schemes with fixed domains, and $\mathcal{R}^1$ and $\mathcal{R}^2$ be sets of all relations on $R^1$ and $R^2$, respectively. The union $R^{12} = R^1 \cup R^2$ is a set of attributes (with fixed domains), and thus a relation scheme. We let $\mathcal{R}^{12}$ denote the set of all relations on $R$.

Given tuples $t^1$ and $t^2$ over $R^1$ and $R^2$, respectively, we define the join of $t^1$ and $t^2$ to be a tuple $t^{12} = t^1 \bowtie t^2$ such that $t^{12} = t^{12}_R$ on $R$ and $t^{12} = t^{12}_S$ on $S$, provided that $t^1 = t^2$ on $R \cap S$.

The join operator $\bowtie:\mathcal{R}^1 \times \mathcal{R}^2 \to \mathcal{R}^{12}$ is defined by the formula

### 2.2. Expression Values

Formally, we define a relational algebra by specifying the following (Codd 1972):

• a set $\mathfrak{U}$ of attributes, called the universe,
• a set $\mathfrak{D}$ of domains,
• a function $\mathcal{D}:\mathfrak{U} \to \mathfrak{D}$ that assigns a domain to each attribute,
• a set $\mathfrak{R} = \{R_1,\ldots,R_p\}$ of relation schemes, where $R_i \subseteq \mathfrak{U}$ for each $1 \leq i \leq p$,
• a set $\mathfrak{d} = \{r_1,\ldots,r_p\}$ of relations, where $r_i$ is a relation on $R_i$ for each $1 \leq i \leq p$,
• a set $\mathfrak{O}$ of operators union, difference, select, project, and join, defined on the attributes in $\mathfrak{U}$.

Thinking of a relation as a two-dimensional table with tuples as rows, we can understand a relation scheme to be a collection of tables that share the same table headers. Since the definition of a relational algebra attaches precisely one relation to each relation scheme, a relational algebra can be also be considered a collection of tables, but without the restriction that they share the same headers. In this manner, the notion of a relational algebra coincides with the notion of a database, with each table represented by a relation on a relation scheme.

To further explore this connection, we now assume that a primary is always chosen for each relation scheme. Let us write, for notational convenience, $k_R$ to denote the primary key of a relation scheme $R$.

The set $\mathfrak{R} = \{R_1,\ldots,R_p\}$ of relation schemes is said to be a relational database scheme if

and if $S_i \neq S_j$ whenever $i \neq j$. In other words, a collection of relation schemes is a relational database scheme if it consists of distinct relation schemes that, collectively, make use of every attributes in the universe $\mathfrak{U}$.

When $\mathfrak{R}$ is a relational database scheme, we say that $\mathfrak{d}$ is a relational database in case

Tags:

Categories:

Updated: