Aligning Temporal Word Embeddings with a Compass (with Code)

Federico Bianchi
7 min read · Sep 25, 2019


Introduction

Word embeddings are now used everywhere to capture word meaning and to study language. The theory in which word embeddings are grounded is distributional semantics, which roughly states that "similar words appear in similar contexts". Given a collection of textual documents as input, word embedding algorithms generate vector representations of the words.

Essentially, what word embedding algorithms do is place words that appear in similar contexts in close positions in a vector space. Each word, for example the word "amazon", has its own vector, and the similarity between words can be expressed by the distance between their vectors. See this article for a nice introductory tutorial and this article for some applications and ideas behind word embeddings.
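To make "similarity as distance" concrete, here is a minimal Gensim sketch; the toy sentences and the tiny training settings are placeholders, not a realistic setup:

```python
from gensim.models import Word2Vec

# Toy corpus: word2vec expects a list of tokenized sentences.
sentences = [
    ["the", "river", "flows", "through", "the", "forest"],
    ["the", "rainforest", "is", "a", "large", "forest"],
    ["the", "company", "sells", "books", "online"],
]

# Tiny model just to show the API; real corpora need far more text.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Similarity between two words is the cosine similarity of their vectors.
print(model.wv.similarity("forest", "rainforest"))
print(model.wv.most_similar("forest", topn=3))
```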

Examples of word embeddings projected into a two-dimensional vector space, from the TensorFlow website.

Since language evolves over time, it is important to find models that can deal with shifts in word meaning (think of how the word "amazon" has changed in meaning over the years). We would thus like to have a vector for each word in each specific time interval, so that we can study how that word has changed in meaning over time.

Time-specific representations allow us to tackle many practical tasks, such as evaluating the similarity of words across time and disambiguating terms in textual documents.

In this blog post, I will describe Temporal Word Embeddings with a Compass (TWEC) [1], a recent method for generating time-specific word embeddings (i.e., one vector representation per word per time period, such as the word "amazon" in 1983).

I will give some details of the model in this blog post, but you can also check out the paper online, where you will find the experiments we ran to evaluate the model.

In a Nutshell (TL;DR)

The vector of the word "amazon" in 1983 should be different from the vector of the word "amazon" in 2019. Temporal word embeddings are generated from diachronic text corpora: a corpus sliced by time (e.g., one slice with all the text from the year 1999, one slice with the text from 2000, …). You then generate an embedding for each word in each time slice.
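As a rough illustration of what "slicing by time" means in practice, here is a minimal sketch that groups hypothetical (year, text) pairs into per-year slices; the documents are made-up placeholders:

```python
from collections import defaultdict

# Hypothetical input: (year, text) pairs, e.g. dated news articles.
documents = [
    (1999, "clinton meets congress over the budget"),
    (2000, "bush wins the election in texas"),
    (2000, "amazon expands its online store"),
]

# Group the raw text into one diachronic slice per year.
slices = defaultdict(list)
for year, text in documents:
    slices[year].append(text.split())  # word2vec expects tokenized sentences

print({year: len(sentences) for year, sentences in slices.items()})
```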

In general, you cannot compare temporal word embeddings unless you align them. TWEC is an approach that aligns temporal word vector spaces both effectively and efficiently. See the code on GitHub and/or the paper.

Language Evolves Over Time

Languages evolve and words change their meaning over time. Take the word "amazon": before the company was created, the word mainly appeared in forest-related contexts; only in recent years has its meaning evolved so that it can also refer to the company Amazon. This effect is generally referred to as semantic shift [2]. The same happens with entity names, such as the names of US presidents: these names move closer to words like "president" at specific moments in time (e.g., Clinton was president from 1993 to 2001).

Cropped image from https://arxiv.org/pdf/1703.00607.pdf [3]: see how the words "apple" and "amazon" have shifted in meaning over time.

Temporal Word Embeddings

Generating time-specific vector representations of words is a good way to see how language evolves: you can, for example, track how words have changed in meaning over time by looking at their similarity to other words in the space.

If you have text corpora divided into temporal slices (e.g., one slice with all the text from the year 1999 and one slice with text from 2000), you can run word2vec on each of these slices to generate time-specific representations of words (e.g., amazon_1981 vs. amazon_2013), with the aim of comparing them. You would like a model that puts similar words in close positions (i.e., gives them similar vector representations) even if they come from different years.

For example, you would like the vectors of the names of US presidents to be close in the space in the years in which they were presidents, and you would probably like those vectors to be distant from each other when they refer to different periods of those presidents' lives.

Words move over time: the vectors of the names of US presidents move closer to words like "president" at specific moments in time (e.g., Clinton was president from 1993 to 2001).
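In a minimal sketch, the naive approach described above (one independent word2vec model per slice) looks like this; the slice contents are made-up placeholders:

```python
from gensim.models import Word2Vec

# Hypothetical pre-tokenized slices, one list of sentences per year.
slices = {
    1999: [["clinton", "meets", "congress"], ["the", "president", "speaks"]],
    2000: [["bush", "wins", "in", "texas"], ["amazon", "expands", "online"]],
}

# Naive approach: train one independent word2vec model per slice.
models = {
    year: Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)
    for year, sentences in slices.items()
}

# Each model gives time-specific vectors, e.g. "amazon" as used in 2000...
vec_amazon_2000 = models[2000].wv["amazon"]
# ...but the spaces of the different models are not directly comparable.
```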

The problem: one limitation of predictive word embedding algorithms (e.g., those based on neural networks) is that, because of their stochastic nature, they generate distributed representations with different coordinate systems each time, and thus the representations are not comparable. This happens even if you run word2vec on the same text twice: you cannot compare the vectors in one space with the vectors in the other.

If you generate the embeddings of two text slices (say, text from the year 2000 and text from the year 2001), the same words will end up with completely different vector representations.

Word embeddings trained on different slices cannot be compared. The word "bush" is closer to "texas" in the year 2000, but closer to "president" in 2001. The "president_x" vectors are distant from each other, while we would like them to be in the same position in the vector space.
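You can see the problem directly with a small experiment: train word2vec twice on the same toy sentences with different random seeds, and the two runs produce vectors in different coordinate systems, even though each space is internally consistent:

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "president", "speaks"], ["bush", "wins", "in", "texas"]] * 100

# Two runs on the *same* text, with different random seeds.
m1 = Word2Vec(sentences, vector_size=50, min_count=1, epochs=5, seed=1, workers=1)
m2 = Word2Vec(sentences, vector_size=50, min_count=1, epochs=5, seed=2, workers=1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same word gets a very different vector in the two runs...
print(cosine(m1.wv["president"], m2.wv["president"]))  # usually far from 1.0
# ...even though similarities *within* each space are consistent.
print(m1.wv.similarity("bush", "texas"), m2.wv.similarity("bush", "texas"))
```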

Intuition: we need a compass

Imagine two people drawing the map of an island, but starting from different locations and without a compass to guide them. They will probably end up with maps that are similar up to a rotation.

Once they are given a compass, they can align their representations and get maps that have the same orientation. The compass gives a point of reference to align the representations.

Temporal Word Embeddings with a Compass

In this work, the compass is the concatenation of all the text in the slices. We first train an embedding on this compass: this is the embedding to which all the slices will be aligned.
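Building the compass corpus is just concatenation; assuming one plain-text file per slice (the file names below are hypothetical placeholders), it can be as simple as:

```python
# Concatenate all temporal slices into a single compass corpus.
# File names are hypothetical placeholders.
slice_files = ["slices/text_1999.txt", "slices/text_2000.txt", "slices/text_2001.txt"]

with open("compass.txt", "w", encoding="utf8") as compass:
    for path in slice_files:
        with open(path, encoding="utf8") as f:
            compass.write(f.read())
            compass.write("\n")
```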

TWEC is a generalization of the Continuous Bag of Words (CBOW) model that handles temporal word embeddings. Essentially, TWEC initializes the second matrix of the CBOW network with the second matrix of a CBOW model learned over the complete corpus, and then freezes it; this frozen-compass CBOW is used to train on the specific slices, for which we only update the first matrix.

Side note: remember that the CBOW model is a neural network with one hidden layer; there are thus two weight matrices inside the architecture (see the following figure). We will refer to the first matrix, which contains the context embeddings, as C, and to the second matrix as U. The C matrix is the one that provides the word embeddings we use.

Continuous Bag of Words model; note that there are two matrix multiplications (image from https://arxiv.org/pdf/1411.2738.pdf).
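If you want to see the two matrices in practice, a Gensim Word2Vec model trained with CBOW and negative sampling exposes both of them; the attribute names below refer to Gensim 4.x, and the toy corpus is a placeholder:

```python
from gensim.models import Word2Vec

sentences = [["the", "president", "speaks", "to", "congress"]] * 50
model = Word2Vec(sentences, vector_size=50, min_count=1, sg=0, negative=5)

# First matrix C: input/context embeddings, the ones used as word vectors.
print(model.wv.vectors.shape)   # (vocabulary size, embedding dimension)

# Second matrix U: output weights used by negative sampling;
# this is the matrix that TWEC shares and freezes across slices.
print(model.syn1neg.shape)      # same shape as the first matrix
```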

More formally, assume that a corpus D can be divided into slices {Dt1, Dt2, …, DtN}. These slices are time-specific corpora (i.e., as we said above, one slice might contain only text written in 1992, while another contains text from 1993).

Process:

(i) Given N slices of time-specific text (e.g., text from 1991, text from 2000, …), we concatenate all the text and run the CBOW model on the entire corpus D.

(ii) We extract the second matrix (U) of this CBOW network: this is the compass.

(iii) For each slice of text Dt, we run CBOW again, but we initialize the second matrix of the slice-specific CBOW with the compass matrix from step (ii) and freeze it, thus stopping the updates of those weights.

At the end of each (separate) training, the embeddings are aligned with respect to the reference representation given by the compass.
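To make the freezing step concrete, here is a toy NumPy sketch of a single CBOW negative-sampling update. This is not the actual TWEC implementation (which builds on Gensim); it only illustrates which weights stop moving when U is frozen:

```python
import numpy as np

def cbow_step(C, U, context_ids, target_id, negative_ids, lr=0.025, freeze_U=True):
    """One CBOW update with negative sampling (toy illustration).

    C: (V, d) first matrix, context/word embeddings (updated per slice).
    U: (V, d) second matrix, output weights (the compass, frozen in TWEC).
    """
    h = C[context_ids].mean(axis=0)                   # hidden layer: mean of context vectors
    ids = np.concatenate(([target_id], negative_ids)) # positive + negative samples
    labels = np.zeros(len(ids)); labels[0] = 1.0      # 1 for the true target, 0 for negatives
    scores = 1.0 / (1.0 + np.exp(-U[ids] @ h))        # sigmoid of the dot products
    err = scores - labels                             # gradient at the output
    grad_h = err @ U[ids]                             # error backpropagated to the hidden layer
    if not freeze_U:                                  # plain word2vec would also update U...
        U[ids] -= lr * np.outer(err, h)
    C[context_ids] -= lr * grad_h / len(context_ids)  # ...TWEC only updates the context matrix C
    return C, U

# Toy usage: vocabulary of 10 words, 8-dimensional embeddings.
rng = np.random.default_rng(0)
C, U = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
C, U = cbow_step(C, U, context_ids=[1, 2, 4], target_id=3, negative_ids=[7, 8])
```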

Result

Once training is complete, the different vector spaces can be compared: you can compute the similarity between the word "clinton" in 1999 and the word "bush" in 2001. You would expect these two words to be similar, since in those years Clinton and Bush were the presidents in office (and this is indeed what we get using TWEC).
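For example, with two aligned slices (the model_1999 and model_2001 objects here are hypothetical, e.g. produced as in the code at the end of the post), the comparison is a plain cosine similarity across spaces:

```python
import numpy as np

def cross_slice_similarity(model_a, word_a, model_b, word_b):
    """Cosine similarity between vectors that live in two aligned slices."""
    a, b = model_a.wv[word_a], model_b.wv[word_b]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical aligned slice models:
# print(cross_slice_similarity(model_1999, "clinton", model_2001, "bush"))
```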

See the following figures to get an idea of what these embeddings allow you to do. The first image shows the time-specific vectors of the word "blair" over different years: notice how the years in which Blair was prime minister cluster together. The second image shows the time-specific word vectors of different US presidents: there is an underlying similarity that places those presidents close together in the space in the years in which they were in office.

First image: vectors of the word "blair" over different years; the years in which Blair was prime minister cluster together. Second image: the same, but for different US presidents; an underlying similarity places those presidents in close positions in the space.

Code

The code that can be used to align word embeddings is on GitHub.

Using TWEC is straightforward: it is built on the well-known Gensim library, so the TWEC embedding objects inherit all the methods defined by Gensim's Word2Vec class.

Here is an example of how to use TWEC:

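The embedded gist is not reproduced here, so the following is a minimal sketch of the usage, assuming the TWEC class and its train_compass / train_slice methods as exposed by the GitHub repository at the time of writing (file paths are placeholders); check the README there for the exact, up-to-date interface:

```python
# Sketch of the TWEC API (class and method names as assumed from the
# GitHub repository; file paths are hypothetical placeholders).
from twec.twec import TWEC

# size: embedding dimension, diter: compass epochs, siter: slice epochs.
aligner = TWEC(size=50, siter=10, diter=10, workers=4)

# 1. Train the compass on the concatenation of all the slices.
aligner.train_compass("data/compass.txt", overwrite=False)

# 2. Train each temporal slice against the frozen compass matrix.
model_1999 = aligner.train_slice("data/text_1999.txt", save=True)
model_2001 = aligner.train_slice("data/text_2001.txt", save=True)

# The returned objects behave like gensim Word2Vec models, and the
# resulting spaces are aligned, so cross-slice comparisons make sense.
print(model_1999.wv.most_similar("clinton"))
```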

References

[1] Di Carlo, V., Bianchi, F. & Palmonari, M. (2019). Training Temporal Word Embeddings with a Compass. AAAI.

[2] Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.

[3] Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018, February). Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 673–681). ACM.
