
Semantic Similarity Using Sentence Transformers
Mar 25, 2023   |   15 min read   |   technical
Sentence transformers create an embedding space in which the semantic content of encoded sentences is represented as vectors. Embeddings of sentences with similar meaning will be located near each other, allowing for easy content clustering, searching by meaning, topic identification, and more.
Sentence Transformers and Embeddings
A sentence transformer takes in text and outputs a vector, called an embedding, whose direction and magnitude seek to encode the sentence’s meaning. Text conveying similar information will be mapped to embeddings that are near each other in this vector space, even if the two texts do not employ the same words. This is an advantage over more traditional text-matching methods, which rely on overlaps of words in the texts being compared, or on the fraction of bigrams (pairs of words) and n-grams ($n$ words in a sequence) that they have in common.
The first step of using a sentence transformer is to tokenize the input text, as is common in the majority of natural language processing (NLP) tasks. Effectively, a dictionary mapping subwords to integers is used to break down the text into a sequence of numbers. Tokenizers are usually specific to a model, so different models map the same text to different sequences of numbers. For example, the two transformer models we will consider, BERT and all-mpnet-base-v2, tokenize the sentence “An example of a tokenized sentence” differently, as seen in the input_ids returned below.
import transformers
bert_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
all_mpnet_tokenizer = transformers.AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
>>> bert_tokenizer("An example of a tokenized sentence")
{'input_ids': [101, 2019, 2742, 1997, 1037, 19204, 3550, 6251, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
>>> all_mpnet_tokenizer("An example of a tokenized sentence")
{'input_ids': [0, 2023, 2746, 2001, 1041, 19208, 3554, 6255, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
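To see how the two vocabularies split the sentence differently, we can map the ids back to their subword strings with the tokenizers loaded above; the exact pieces depend on each model's vocabulary.
# Convert the ids back into their subword strings for comparison
print(bert_tokenizer.convert_ids_to_tokens(
    bert_tokenizer("An example of a tokenized sentence")["input_ids"]))
print(all_mpnet_tokenizer.convert_ids_to_tokens(
    all_mpnet_tokenizer("An example of a tokenized sentence")["input_ids"]))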
Each of these integers actually refers to a high-dimensional one-hot encoded vector, with the integer indicating the dimension that holds the single non-zero value of $1$. For example, $2023$ actually represents a vector $[0, 0, \ldots, 1, \ldots, 0, 0]$, where the $1$ is located at index $2023$, counting from $0$.
With the input represented as a sequence of vectors, it can now be understood by a neural network, which will take in the sequence and output a vector embedding, of dimension $768$ in this case. Sentence transformers are usually trained through minimization of a triplet loss,

$$\mathcal{L}_\epsilon = \max\left(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\; 0\right),$$

where the sentence embeddings $s_x$ are computed for an anchor sentence $a$, a “positive” sentence match $p$, and a “negative” sentence match $n$. The $\lVert \cdot \rVert$ refers to a distance metric, which in the original Sentence BERT paper is the Euclidean distance. For a given anchor sentence, the positive match can be a sentence with a similar meaning or, if building a sentence transformer primarily for information retrieval, a text that the anchor “query” should retrieve. The negative match is generally a sentence selected at random from the data corpus, and as such is very unlikely to have anything to do with the anchor sentence. This is the origin of the name “triplet loss”: the loss function requires these three inputs.
Notice what this loss function encourages as it is minimized. The first term, $\lVert s_a - s_p \rVert$, seeks to reduce the distance between the embeddings of the anchor sentence and the positive match. Meanwhile, the second term, $-\lVert s_a - s_n \rVert$, seeks to maximize the distance between the embeddings of the anchor sentence and the negative match. So, as training progresses, the embeddings of similar sentences will be pulled closer together on average, while unrelated sentences will tend to have embeddings that push each other away. The $\epsilon$ term induces a gap between the relative distances of the anchor sentence to the positive and negative matches. If $\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert$ is less than $-\epsilon$, indicating that $s_n$ is already more than $\epsilon$ farther away from $s_a$ than $s_p$ is, the loss function $\mathcal{L}_\epsilon$ achieves its minimum value of $0$ for this triplet, and there is no additional pressure from gradient descent to further pull $s_a$ and $s_p$ together while pushing $s_n$ further away. In the SBERT paper, $\epsilon = 1$.

The $\max$ operation and the $\epsilon$ are helpful for tracking progress during training, and for ensuring that the magnitudes of the embeddings do not explode. If the loss consisted only of the two distance terms, the pressure to push the negative embeddings away would tend to increase the norms of the embeddings, and the overall loss could decrease without bound towards negative infinity. With the floor at $0$, we can see that training will asymptotically reduce the loss towards $0$, focusing on the triplets that still do not meet the desired gap of $\epsilon$ while ignoring triplets that already satisfy this condition.
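As a concrete reference, here is a minimal sketch of this triplet loss written out in PyTorch. It is only an illustration of the formula above, not the actual SBERT training code.
import torch

def triplet_loss(s_a, s_p, s_n, epsilon=1.0):
    # Euclidean distances from the anchor embeddings to the positive and negative embeddings
    d_pos = torch.norm(s_a - s_p, dim=-1)
    d_neg = torch.norm(s_a - s_n, dim=-1)
    # Hinge at zero: triplets already separated by more than epsilon contribute no loss
    return torch.clamp(d_pos - d_neg + epsilon, min=0.0).mean()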
Sentence Transformers versus BERT
Unlike typical NLP encoder transformers like BERT, sentence transformers are primarily meant for comparing and contrasting input text instead of optimally encoding semantic information for a downstream task. For example, while BERT can be used as a base model for a classification head that categorizes the embeddings of input sentences into having a positive or negative sentiment, as we will show, it is not as well suited for determining if input sentences have the same meaning.
Let’s first try to get as far as we can with BERT. Consider the following nine sentences, pulled from three different Wikipedia pages:
0. The stained glass windows of Chartres Cathedral are held to be one of the best-preserved and most complete set of medieval stained glass.
1. This limited the number of windows, leading to a play of light and shade which builders compensated for by adding internal frescoes in bright colours.
2. In northern France buildings in this style would still be quite dark, with semi-circular arches not allowing large windows.
3. Gymnastics is a type of sport that includes physical exercises requiring balance, strength, flexibility, agility, coordination, and endurance.
4. This provides a firm surface that provides extra bounce or spring when compressed, allowing gymnasts to achieve greater height and a softer landing after the composed skill.
5. In Tumbling, athletes perform an explosive series of flips and twists down a sprung tumbling track.
6. Tinnitus is the perception of sound when no corresponding external sound is present.
7. While often described as a ringing, it may also sound like a clicking, buzzing, hissing or roaring.
8. A frequent cause is traumatic noise exposure that damages hair cells in the inner ear.
Each group of three, like 0, 1, 2 and 3, 4, 5, comes from the same Wikipedia page. As such, we expect sentences within a group to be more similar to each other than to sentences in the other groups, since each group covers a different topic.
We will consider four different approaches to constructing a semantic vector for comparison via cosine similarity:
CLS Hidden State
cls
- The [CLS] token in BERT is placed at the beginning of every input sequence, and during training its hidden state is used for the next-sentence-prediction binary task (whether the second sentence in the input really follows the first in the source text, or is unrelated). As a special token, it is never masked out during the other training task of masked language modeling, so it always attends to all other tokens in the sequence. We therefore expect the CLS hidden state to learn a semantic meaning of the entire sentence in order to complete this task effectively. The authors of the BERT paper also recommend using this hidden state as the input to downstream tasks.
embedding.last_hidden_state[:, 0, :]
Pooler Output
pooler_output
- The BERT architecture provides an additional representation specifically intended as input for downstream tasks. It is produced by passing the [CLS] token’s hidden state through an additional layer.
embedding.pooler_output
Mean Hidden State
mean
- Here we average all the hidden state vectors at BERT’s final layer. Intuitively, this is like finding the average meaning across all tokens. This loosely employs the observation that learned language model embeddings tend to allow for vector addition to approximate the composition of semantics, as in the original example of $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$ [Mikolov et al.].
embedding.last_hidden_state.mean(dim=1)
Max Pooling Hidden State
max
- This semantic vector is constructed by keeping the largest value for each index in the final layer hidden states. Intuitively, this is sort of like assuming that across all the hidden states of the input tokens, there are a few most salient contributions to meaning which are captured as large values in certain elements within the hidden state vectors. While the mean of the hidden states tends to pull these element values towards the average, we can think of max pooling as keeping the most extreme signals arising from each token.
embedding.last_hidden_state.max(dim=1).values
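Putting these together, a minimal sketch of computing all four semantic vectors from a single BERT forward pass could look like the following (the semantic_vectors helper name is our own, not part of the transformers API).
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

def semantic_vectors(sentence):
    # Tokenize and run a single forward pass; no gradients needed for inference
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        embedding = model(**inputs)
    return {
        "cls": embedding.last_hidden_state[:, 0, :],           # [CLS] hidden state
        "pooler_output": embedding.pooler_output,               # output of the extra pooling layer
        "mean": embedding.last_hidden_state.mean(dim=1),        # average over all tokens
        "max": embedding.last_hidden_state.max(dim=1).values,   # element-wise max over all tokens
    }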
BERT-Computed Similarities
Now let’s take a look at how the nine sentences above compare to each other in similarity matrices. The similarity matrices are constructed by computing the cosine similarity of the semantic vector of each sentence against that of every other sentence. The result is a matrix symmetric across the diagonal (as cosine similarity is symmetric), with intensities showing how aligned the semantic vectors of each pair of sentences are.
Ideally, if the sentence similarity algorithm is working well, we expect to see three 3x3 blocks light up along the diagonal, since each block corresponds to one of the related triplets of sentences from the same Wikipedia page. We find that the mean and max approaches start to reveal this pattern, while the cls and pooler_output approaches do not.
In addition, notice that the values of the cosine similarity are all quite large: they are greater than 0.50 in all the BERT approaches. This is not ideal, and it suggests that these BERT-based representations contain many elements that are unrelated to the meaning of the sentences. For example, they may contain dimensions related to the language of the sentence, grammatical information, and syntax. The cosine similarity does not distinguish between these dimensions and those responsible for meaning, leading to the high noise floor on semantic similarity present in all subplots of Figure 1.
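For reference, a similarity matrix like those in Figure 1 can be assembled in a few lines. A minimal sketch, assuming a list of the 1 x 768 semantic vectors constructed above:
import torch
import torch.nn.functional as F

def similarity_matrix(vectors):
    # vectors: list of 1 x 768 semantic vectors, one per sentence
    stacked = torch.cat(vectors, dim=0)        # shape (n_sentences, 768)
    normalized = F.normalize(stacked, dim=1)   # unit-length rows
    return normalized @ normalized.T           # pairwise cosine similarities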
Sentence-Transformer-Computed Similarities
When we perform the same semantic vector constructions with all-mpnet-base-v2, a sentence transformer, we find that the 3x3 similarity blocks jump out. In addition, notice that except for the max pooling construction, all off-block-diagonal elements are significantly closer to zero than in the BERT approaches. The specialized training employed for creating the sentence transformer has successfully aligned the vector representations with semantics.
Comparing the best-looking approaches of both models on a full color scale of 0.0 to 1.0, we can see the significant difference in the signal-to-noise ratio of the semantic vectors.
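One convenient way to reproduce this comparison is through the sentence-transformers library, which wraps the model, its pooling, and normalization in a single encode call. A minimal sketch, assuming the nine sentences above are stored in a list called sentences:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)  # one 768-dimensional vector per sentence
similarities = util.cos_sim(embeddings, embeddings)           # 9 x 9 cosine similarity matrix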
Word Order Matters
In the past, bag-of-words approaches were used to process and compare text in natural language. This simplification involves treating every text as a “bag of words”, i.e., the meaning of the text is approximated as only depending on the presence and relative quantity of the words. While we can see how this intuitively could work, a lot of meaning is lost when the order is ignored. A huge advantage of transformers in NLP is that each token is embedded while considering its context. So, the words surrounding a given word impart that word with clearer meaning. Let’s take a look at how much of a difference order can make. Below we generate five random permutations of the words in the original sentence:
Original: The stained glass windows of Chartres Cathedral are held to be one of the best-preserved and most complete set of medieval stained glass.
0 - complete be of glass. one set of are glass stained Cathedral medieval windows The Chartres most stained held of best-preserved and to the
1 - Chartres are set windows Cathedral complete glass of stained of of glass. to The one most stained be the and best-preserved held medieval
2 - medieval be one of stained windows held glass. complete best-preserved stained to of glass most set Chartres the of and are Cathedral The
3 - and Cathedral most glass medieval best-preserved held one the windows Chartres set stained are glass. of The complete to of stained be of
4 - medieval held and Chartres stained complete be Cathedral windows of to most the The are glass. one of best-preserved set of glass stained
Notice that the model registers semantic differences depending on the order. The similarities are still high, which makes sense given that someone could still glean most of the meaning of the jumbled sentence, but the remaining gap below 1.0 shows that the hidden states in the model care about the order of the words.
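The permutations and their similarities to the original sentence can be generated with a short script. A sketch, again assuming the all-mpnet-base-v2 sentence transformer via the sentence-transformers library:
import random
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

original = ("The stained glass windows of Chartres Cathedral are held to be one of the "
            "best-preserved and most complete set of medieval stained glass.")
words = original.split()

# Build five random shuffles of the words in the original sentence
permutations = []
for _ in range(5):
    shuffled = words[:]
    random.shuffle(shuffled)
    permutations.append(" ".join(shuffled))

# Compare each shuffled sentence against the original
embeddings = model.encode([original] + permutations, convert_to_tensor=True)
scores = util.cos_sim(embeddings[0:1], embeddings[1:])[0]
for i, (permuted, score) in enumerate(zip(permutations, scores)):
    print(f"{i} - {score.item():.2f} - {permuted}")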
Semantic Search
One potential use for sentence transformers is searching text by meaning. The semantic vector representations of a search query and of the corpus allow users to search for content without knowing the exact wording. Below we show an example of a search over the text of The Hobbit by J. R. R. Tolkien, implemented entirely using all-mpnet-base-v2 and cosine similarity. First, we compute the semantic vectors (using the mean approach) of every sentence in The Hobbit. Then, we compute the semantic vector of our query text and find the largest cosine similarities against the text.
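A sketch of the search itself, assuming the book has already been split into a list of sentences (corpus_sentences) and encoded ahead of time into corpus_embeddings; the semantic_search helper name is our own:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def semantic_search(query, corpus_sentences, corpus_embeddings, top_k=5):
    # Encode the query and rank every sentence in the book by cosine similarity
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top = scores.topk(k=top_k)
    return [(score.item(), corpus_sentences[idx]) for score, idx in zip(top.values, top.indices)]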
Let’s find the passage about the ultimate fate of the three trolls who Bilbo tricked into staying out until dawn:
Query: The trolls turned into stone statues.
Running our semantic search, we find the following sentences preceded by their semantic similarity to our query:
0.59 - And there they stand to this day, all alone, unless the birds perch on them; for trolls, as you probably know, must be underground before dawn, or they go back to the stuff of the mountains they are made of, and never move again.
0.46 - The bushes, and the long grasses between the boulders, the patches of rabbit-cropped turf, the thyme and the sage and the marjoram, and the yellow rockroses all vanished, and they found themselves at the top of a wide steep slope of fallen stones, the remains of a landslide.
0.46 - A nice pickle they were all in now: all neatly tied up in sacks, with three angry trolls (and two with burns and bashes to remember) sitting by them, arguing whether they should roast them slowly, or mince them fine and boil them, or just sit on them one by one and squash them into jelly; and Bilbo up in a bush, with his clothes and his skin torn, not daring to move for fear they should hear him.
0.46 - When they began to go down this, rubbish and small pebbles rolled away from their feet; soon larger bits of split stone went clattering down and started other pieces below them slithering and rolling; then lumps of rock were disturbed and bounded off, crashing down with a dust and a noise.
0.45 - Some were barrels really empty, some were tubs neatly packed with a dwarf each; but down they all went, one after another, with many a clash and a bump, thudding on top of ones below, smacking into the water, jostling against the walls of the tunnel, knocking into one another, and bobbing away down the current.
Not bad! The first retrieved sentence is exactly what we were looking for!
“And there they stand to this day, all alone, unless the birds perch on them; for trolls, as you probably know, must be underground before dawn, or they go back to the stuff of the mountains they are made of, and never move again.”
We have the word “trolls” in our query, but the match is clearly not relying on this alone, as the word appears in many other sentences in the book. Perhaps the model has latched onto the connection between “stone” and “the stuff of the mountains they are made of”, and between “statue” and both “stand” and “never move again”.
Conclusion
While LLMs will likely take over most NLP tasks (and perhaps more!), sometimes a simpler approach is what you really need, depending on your use case. For example, if you plan to sift through huge amounts of data, it may be sufficient to simply look for rough semantic similarity. This would be much faster and cheaper than performing an LLM call for the query paired with every chunk of the target corpus, while losing only a small fraction of the accuracy. Additionally, even if accuracy is paramount, simpler NLP approaches like this one can help filter down the tasks you want to use an LLM on, saving substantial compute with almost no impact on performance. This is already a common practice in retrieval-augmented generation (RAG), where the information retrieval step often uses a semantic search to pull in the top candidates for consideration by the LLM.
Try it out yourself! Below you can find the Jupyter Notebook for all of the above on GitHub.
Sentence Similarity Jupyter Notebook on GitHub
Cover image: Self-generated using Stable Diffusion and iterating on a prompt involving hobbits, embeddings, and neural networks.