Embedding Similarity: Distance Metrics

Embedding Similarity: Distance Metrics Embedding Similarity: Distance Metrics

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

Now that we have seen what embeddings are, we know that we can measure the similarity between two words by measuring the similarity between their embeddings. In the post on embeddings, we saw an example of using cosine similarity as a measure, but there are other similarity measures we can use, such as L2 squared, dot product similarity, cosine similarity, etc.

In this post, we are going to cover the three that we have mentioned.

L2 Square Similaritylink image 6

This similarity is derived from the Euclidean distance, which is the straight-line distance between two points in a multidimensional space, calculated using the Pythagorean theorem.

Euclidean distance

The Euclidean distance between two points p and q is calculated as:

d(p,q) = √((p1 - q1)2 + (p2 - q2)2 + ··· + (pn - qn)2) = √(∑i=1n (pi - qi)2)

The L2 similarity is the square of the Euclidean distance, that is:

similarity(p,q) = d(p,q)2 = ∑i=1n (pi - qi)2

Cosine Similaritylink image 7

If we remember what we learned about sines and cosines in school, we will recall that when two vectors have an angle of 0° between them, their cosine is 1, when the angle between them is 90°, their cosine is 0, and when the angle is 180°, their cosine is -1.

Therefore, we can use the cosine of the angle between two vectors to measure their similarity. It can be shown that the cosine of the angle between two vectors is equal to the dot product of the two vectors divided by the product of their magnitudes. Proving this is not the goal of this post, but if you want, you can see the proof here.

similarity(U,V) = U · V||U|| ||V||

Dot Product Similaritylink image 8

The dot product similarity is the dot product of two vectors

similarity(U,V) = U · V

As we have written the cosine similarity formula, when the length of the vectors is 1, that is, they are normalized, the cosine similarity is equal to the dot product similarity.

So, what is the use of the dot product similarity? It is used to measure the similarity between two vectors that are not normalized, that is, they do not have a length of 1.

For example, YouTube, to create the embeddings of its videos, makes the embeddings of the videos it classifies as higher quality longer than those of the videos it classifies as lower quality.

In this way, when a user performs a search, the dot product similarity will give higher similarity to higher quality videos, so it will provide the user with the highest quality videos first.

Which similarity system to uselink image 9

To choose the similarity system we are going to use, we must take into account the space in which we are working.

  • If we are working in a high-dimensional space, with normalized embeddings, cosine similarity works best. For example, OpenAI generates normalized embeddings, so cosine similarity works best.
  • If we are working on a classification system where the distance between two classes is important, the L2 squared similarity works best.
  • If we are working on a recommendation system where the length of the vectors is important, the dot product similarity works best.

Continue reading

Last posts -->

Have you seen these projects?

Gymnasia

Gymnasia Gymnasia
React Native
Expo
TypeScript
FastAPI
Next.js
OpenAI
Anthropic

Mobile personal training app with AI assistant, exercise library, workout tracking, diet and body measurements

Horeca chatbot

Horeca chatbot Horeca chatbot
Python
LangChain
PostgreSQL
PGVector
React
Kubernetes
Docker
GitHub Actions

Chatbot conversational for cooks of hotels and restaurants. A cook, kitchen manager or room service of a hotel or restaurant can talk to the chatbot to get information about recipes and menus. But it also implements agents, with which it can edit or create new recipes or menus

View all projects -->
>_ Available for projects

Do you have an AI project?

Let's talk.

maximofn@gmail.com

Machine Learning and AI specialist. I develop solutions with generative AI, intelligent agents and custom models.

Do you want to watch any talk?

Last talks -->

Do you want to improve with these tips?

Last tips -->

Use this locally

Hugging Face spaces allow us to run models with very simple demos, but what if the demo breaks? Or if the user deletes it? That's why I've created docker containers with some interesting spaces, to be able to use them locally, whatever happens. In fact, if you click on any project view button, it may take you to a space that doesn't work.

Flow edit

Flow edit Flow edit

FLUX.1-RealismLora

FLUX.1-RealismLora FLUX.1-RealismLora
View all containers -->
>_ Available for projects

Do you have an AI project?

Let's talk.

maximofn@gmail.com

Machine Learning and AI specialist. I develop solutions with generative AI, intelligent agents and custom models.

Do you want to train your model with these datasets?

short-jokes-dataset

HuggingFace

Dataset with jokes in English

Use: Fine-tuning text generation models for humor

231K rows 2 columns 45 MB
View on HuggingFace →

opus100

HuggingFace

Dataset with translations from English to Spanish

Use: Training English-Spanish translation models

1M rows 2 columns 210 MB
View on HuggingFace →

netflix_titles

HuggingFace

Dataset with Netflix movies and series

Use: Netflix catalog analysis and recommendation systems

8.8K rows 12 columns 3.5 MB
View on HuggingFace →
View more datasets -->