Embeddings are numerical representations of tokens:
- They are also known as embedding vectors, vector representations, or vectors.
- They aim to capture the semantics (meaning, context, patterns) of the embedded text.
- They represent a token and its relationships with other tokens based on context (contextualized token embeddings).
- They allow measuring the semantic similarity between two tokens, as sketched below.
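A minimal sketch of measuring similarity with cosine similarity (the vectors below are toy values, not real model output):
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: close to 1.0 means very similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.8, 0.2, 0.35])
car = np.array([0.1, 0.9, 0.6])

print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, car))     # lower: semantically distant
```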
Embeddings are used for:
- Text generation
- Text classification
- Text clustering
- ...
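As one example, text clustering typically runs a standard clustering algorithm directly on the embedding vectors; a minimal sketch with scikit-learn, using random placeholders in place of real embeddings:
```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholders standing in for real sentence embeddings: 6 texts, 384 dims each
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 384))

# Group the texts into 2 clusters based on the proximity of their embeddings
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # one cluster id per text, e.g. [0 1 0 0 1 1]
```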
Embeddings have the following characteristics:
- The size (number of dimensions) of the embeddings is fixed and depends on the underlying embedding model.
- Each dimension holds a numerical value (a property).
- Together, these properties (numerical values) represent the token.
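Concretely, an embedding is just a fixed-length array of numbers; a toy illustration (real models work the same way, only with far more dimensions):
```python
import numpy as np

# Toy 4-dimensional embedding; real models use hundreds of dimensions (e.g. 384)
embedding = np.array([0.12, -0.98, 0.33, 0.57])

print(embedding.shape)  # (4,) - the size is fixed by the model, not by the input
print(embedding[0])     # each dimension holds one numerical property: 0.12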
A tokenizer is used in the training process of its associated model:
- The model is linked with its tokenizer and can't be used with another tokenizer.
- The model holds an embedding for each token in the tokenizer's vocabulary.
- When a model is created, its embeddings and weights are randomly initialized.
- When a model is trained, its embeddings and weights are assigned proper values that capture the semantics of each token.
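This link between a model and its tokenizer can be inspected in Hugging Face Transformers; a sketch assuming bert-base-uncased (the model choice is illustrative):
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The embedding matrix has one row per token in the tokenizer's vocabulary
embedding_matrix = model.get_input_embeddings().weight
print(len(tokenizer))          # 30522 - tokens in the vocabulary
print(embedding_matrix.shape)  # torch.Size([30522, 768]) - one vector per token
```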
Types of embeddings:
- Embedding a token: creating one vector that represents the token.
- Embedding a sentence: creating one vector that represents the sentence (used for categorization, semantic search, RAG).
- Embedding a document: creating one vector that represents the document (used for categorization, semantic search, RAG).
When generating embeddings for text longer than a single token, the embeddings should capture the meaning of the whole text.
One way to generate such an embedding is to average the embeddings of all the tokens in the text, as sketched below.
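A minimal sketch of that averaging (mean pooling), using random placeholders for the per-token embeddings; in practice, padding tokens are usually masked out before averaging:
```python
import numpy as np

# Placeholders standing in for contextualized token embeddings: 7 tokens, 384 dims
token_embeddings = np.random.default_rng(0).normal(size=(7, 384))

# Average over the token axis to get one vector for the whole text
text_embedding = token_embeddings.mean(axis=0)
print(text_embedding.shape)  # (384,)
```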
Example: Generate token embeddings
Download the model:
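A sketch assuming sentence-transformers/all-MiniLM-L6-v2, chosen because its 384-dimensional embeddings match the sizes noted below; from_pretrained downloads and caches the model on first use:
```python
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model (384-dim embeddings)
tokenizer = AutoTokenizer.from_pretrained(model_name)  # downloads on first call
model = AutoModel.from_pretrained(model_name)
```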
Python code:
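Continuing the sketch above; the input sentence is an assumption, chosen so that it tokenizes into 7 tokens once the [CLS] and [SEP] markers are added:
```python
import torch

# Tokenize a five-word sentence; [CLS] and [SEP] bring the count to 7 tokens
tokens = tokenizer("The sky is blue today", return_tensors="pt")

# Run the model and take the contextualized token embeddings
with torch.no_grad():
    output = model(**tokens)

token_embeddings = output.last_hidden_state
print(token_embeddings.shape)
```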
Output:
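```
torch.Size([1, 7, 384])
```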
Note that the created embeddings have the following size: (1, 7, 384)
- 1: the batch dimension
- 7: seven tokens
- 384: each token is embedded in a vector of 384 values
The batch dimension can be larger than 1 when multiple sentences are given to the model to be processed at the same time, as sketched below.
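Continuing the sketch above with two sentences; the shorter one is padded so both fit in one tensor:
```python
tokens = tokenizer(
    ["The sky is blue today", "Hello world"],
    padding=True,  # pad the shorter sentence to the length of the longer one
    return_tensors="pt",
)
with torch.no_grad():
    output = model(**tokens)

print(output.last_hidden_state.shape)  # torch.Size([2, 7, 384]) - batch of 2
```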
Example: Generate sentence embeddings
Install the Sentence Transformers library:
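```bash
pip install sentence-transformers
```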
Python code:
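A minimal sketch with the same assumed model; SentenceTransformer handles tokenization and pooling internally and returns one vector per input sentence:
```python
from sentence_transformers import SentenceTransformer

# Assumed model; any Sentence Transformers model works the same way
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The sky is blue today", "It is sunny outside"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384) - one 384-dimensional vector per sentence
```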