Poincaré Embeddings of Neural Networks (Draft)

Poincaré Embeddings for Learning Hierarchical Representations
M. Nickel, D. Kiela

Neural Embeddings of Graphs in Hyperbolic Space
B. P. Chamberlain, J. Clough, M. P. Deisenroth

1. Introduction

In studying a mathematical model, it is helpful to put the model in the context of a simpler, well-studied mathematical structure. For example, the Earth is a spherical object, and yet we can learn much about its surface by studying its Mercator projection on a two-dimensional plane.

The study of geometric methods of context transfer starts with embeddings, which realize exact copies of mathematical models inside another mathematical structure. More specifically, an embedding of a mathematical object $X$ of type $\mathcal{C}$ (or, formally, of category $\mathcal{C}$) in a mathematical object $Y$ of type $\mathcal{D}$ is a one-to-one mapping $\iota : X \to Y$ such that $\iota(X)$ is a copy of $X$ in $Y$, viz., an object of type $\mathcal{C}$ that shares the same structural properties as $X$.
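For instance, the unit circle $S^1$, viewed as a smooth manifold, embeds in the Euclidean plane via the one-to-one mapping (a standard example, included here for concreteness)

$$\iota : S^1 \to \mathbb{R}^2, \qquad \iota(\theta) = (\cos \theta, \sin \theta),$$

whose image $\iota(S^1)$ is a smooth curve in $\mathbb{R}^2$ carrying the same manifold structure as $S^1$ itself.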

Differential-geometric tools such as the Whitney embedding theorem (1944) and the Nash embedding theorems (1954–1966) allow us to view abstract geometric objects as concrete surfaces in Euclidean space. Moreover, discrete analogues such as Kuratowski’s theorem (1930) and the Hopcroft–Tarjan planarity-testing algorithm (1974), as well as results from metric embedding theory such as the Johnson–Lindenstrauss lemma (1984), suggest that embedding methods can be applied in quite general contexts.

Indeed, datasets can be represented as vectors or as weighted graphs: the former encodes each feature as a coordinate value, and the latter encodes relationships between data points as weights on the edges between nodes. The graph representation is a geometric object in its own right; the vector representation can be thought of as a sample of points on a manifold, a principal object of study in geometry, that describes the true nature of the dataset. As discussed above, both representations can be viewed as geometric objects embedded in a Euclidean space, allowing us to visualize the structure of a dataset.

Nevertheless, embedding in the strict sense is not terribly useful in data analysis, because such visualizations often take place in extremely high-dimensional spaces. For example, if we wish to encode color information, drawn from the set {red, orange, yellow, green, cyan, blue, violet}, as numeric vectors, we might apply one-hot encoding to represent

id color
1 red
2 orange
3 yellow
4 green

as

id  red?  orange?  yellow?  green?  cyan?  blue?  violet?
1   1     0        0        0       0      0      0
2   0     1        0        0       0      0      0
3   0     0        1        0       0      0      0
4   0     0        0        1       0      0      0

which is a collection of 7-dimensional vectors. As we consider more features, the requisite dimension grows rapidly. This results in the so-called curse of dimensionality, a heuristic principle referring to the various computational difficulties that worsen dramatically as the dimension increases.
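As a minimal sketch of how such an encoding might be computed in practice (plain Python; the helper name one_hot and the palette list are our own, for illustration only):

# One-hot encoding of the color table above.
PALETTE = ["red", "orange", "yellow", "green", "cyan", "blue", "violet"]

def one_hot(color):
    # Return a 7-dimensional 0/1 vector with a single 1 at the color's index.
    vector = [0] * len(PALETTE)
    vector[PALETTE.index(color)] = 1
    return vector

rows = {1: "red", 2: "orange", 3: "yellow", 4: "green"}
encoded = {row_id: one_hot(color) for row_id, color in rows.items()}
print(encoded[1])  # [1, 0, 0, 0, 0, 0, 0]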

It is thus beneficial to consider low-dimensional representations of datasets via approximate embeddings, which do not constitute exact copies but still contain enough information to be useful. For example, we might try to encode certain features of a dataset as geometric properties, such as the distance between two data points.

The method of data analysis via low-dimensional representations has a long history, significantly predating the formal development of embedding theory in mathematics. There are hints of early theoretical results found in the literature as early as 1880 (see Section 1.3 of de Leeuw–Heiser, 1982), and computational methods have been around for decades as well (see, for example, Sammon, 1969).

Nevertheless, it is the explosion of neural-network methods, backed by ever more powerful modern computers, that has given embedding methods the significance they hold today. A landmark example of the modern embedding method is Word2Vec (Mikolov et al., 2013), a computationally efficient technique for modeling similarities between words, built on the neural probabilistic language model (Bengio et al., 2003).

[short explanation about word2vec]

[problems with word2vec]

[why hyperbolic geometry]

[better results]

2. Neural Networks

In the business of data analysis, we are interested fundamentally in finding patterns in a dataset that will allow us to guess the properties of data points we have yet to collect. Supposing that we are interested in one particular feature of the data at hand, we can model the relationship between data points and their properties as a function

$$f : \mathbb{R}^n \to \mathbb{R},$$

where $\mathbb{R}^n$ represents data points (as vectors) and $\mathbb{R}$ a measurement of the feature in question. Our goal, then, is to find a function

$$\hat{f} : \mathbb{R}^n \to \mathbb{R}$$

that closely approximates $f$.

Knowing $f$ requires examining all possible data points, which is impractical. Therefore, such an $\hat{f}$ is constructed from a set of training data and validated against a set of test data that does not intersect with the training set. We, of course, assume that we know the true value of $f$ on both the training set and the test set.
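A minimal sketch of this train/validate protocol (the synthetic data, the candidate $\hat{f}$, and all names below are our own, chosen purely for illustration):

import random

def f(x):
    # A hypothetical ground truth, unknown in practice.
    return 3.0 * x + 1.0

# A finite sample of data points with their true values.
data = [(x, f(x)) for x in (random.uniform(-5, 5) for _ in range(100))]

# Disjoint training and test sets.
random.shuffle(data)
train, test = data[:80], data[80:]

def f_hat(x):
    # A candidate approximation; how to fit one is discussed below.
    return 2.9 * x + 1.2

# Validate on the held-out test set via mean-squared error.
mse = sum((y - f_hat(x)) ** 2 for x, y in test) / len(test)
print("test MSE:", mse)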

A classic method of constructing $\hat{f}$ is linear regression (Galton, 1894; Pearson, 1896; Pearson, 1930), which yields a family of functions of the form

$$\hat{f}_{\mathbf{w}}(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = w_1 x_1 + \cdots + w_n x_n.$$

To determine the optimal choice of the parameter $\mathbf{w}$, we minimize the mean-squared error

$$E(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$

on the training set of $m$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m \in \mathbb{R}^n$, where

$$\mathbf{y} = (y_1, \ldots, y_m) = \left( f(\mathbf{x}_1), \ldots, f(\mathbf{x}_m) \right)$$

is the vector of true values and

$$\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_m) = \left( \hat{f}_{\mathbf{w}}(\mathbf{x}_1), \ldots, \hat{f}_{\mathbf{w}}(\mathbf{x}_m) \right)$$

is the vector of approximate values. To this end, we first observe that

$$E(\mathbf{w}) = \frac{1}{m} \left\lVert \mathbf{y} - \hat{\mathbf{y}} \right\rVert_2^2,$$

where $\lVert \cdot \rVert_2$ is the Euclidean $2$-norm. Since the squared $2$-norm is strictly convex, a standard result in convex analysis forces $E$ to have precisely one minimum. It follows from a standard result in multivariable calculus that this global minimum of $E$ is attained at the point at which $\nabla E(\mathbf{w}) = 0$.
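It is convenient to record the gradient of $E$ explicitly. Writing $X$ for the $m \times n$ matrix whose rows are the training vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m$ (a notational convenience introduced here), we have

$$E(\mathbf{w}) = \frac{1}{m} \lVert \mathbf{y} - X \mathbf{w} \rVert_2^2, \qquad \nabla E(\mathbf{w}) = -\frac{2}{m} X^{\top} (\mathbf{y} - X \mathbf{w}),$$

so that the condition $\nabla E(\mathbf{w}) = 0$ amounts to the linear system $X^{\top} X \mathbf{w} = X^{\top} \mathbf{y}$.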

While it is possible to compute the closed-form solution of

$$\nabla E(\mathbf{w}) = 0,$$

the requisite operations are computationally expensive. Besides, we do not need to know the exact solution; a good enough estimate suffices.
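A sketch of the closed-form route, assuming NumPy and the matrix notation introduced above (the synthetic data and variable names are ours):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: m points in R^n with a noisy linear ground truth.
m, n = 200, 3
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Closed-form solution of grad E(w) = 0, i.e. the linear system X^T X w = X^T y.
# Solving it exactly costs on the order of m n^2 + n^3 operations, which grows
# quickly with the dimension; lstsq is used here for numerical stability.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated parameters:", w_hat)
print("training MSE:", np.mean((y - X @ w_hat) ** 2))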

The classical method of finding an approximate local minimum by finding a point at which the gradient is close to $0$ is called gradient descent (Cauchy, 1847). Starting at an arbitrary point $\mathbf{w}_0$, the gradient descent method produces a point closer to a local minimum of a function $E$ by the recursion

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \nabla E(\mathbf{w}_k);$$

the coefficient $\eta_k$ is called the $k$th step size, or learning rate. There are many ways to choose appropriate learning rates. We only discuss a classic method, called line search, based on Taylor’s theorem.
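A minimal sketch of gradient descent applied to the mean-squared error above, using a constant learning rate rather than line search (the data setup mirrors the previous sketch; all names are ours):

import numpy as np

rng = np.random.default_rng(1)

# The same kind of synthetic linear-regression problem as before.
m, n = 200, 3
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

def grad_E(w):
    # Gradient of E(w) = (1/m) * ||y - X w||^2.
    return -(2.0 / m) * X.T @ (y - X @ w)

# Gradient descent: w_{k+1} = w_k - eta * grad E(w_k),
# with a constant learning rate eta in place of line search.
w = np.zeros(n)
eta = 0.1
for _ in range(500):
    w = w - eta * grad_E(w)

print("estimated parameters:", w)
print("training MSE:", np.mean((y - X @ w) ** 2))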

Since