Embeddings are numerical representations of tokens:
- They are also known as embedding vectors, vector representations, or vectors.
- They aim to capture the semantics (meaning, context, patterns) of the embedded text.
- They represent a token and its relationships with other tokens based on context (contextualized token embeddings).
- They allow measuring the semantic similarity between two tokens, as sketched below.
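A minimal sketch of measuring similarity with cosine similarity (the vectors below are toy values, not real model output):
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: close to 1.0 means very similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.8, 0.2, 0.35])
car = np.array([0.1, 0.9, 0.6])

print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, car))     # lower: semantically distant
```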
Embeddings are used for:
- Text generation
- Text classification
- Text clustering
- ...
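As one example, text clustering typically runs a standard clustering algorithm directly on the embedding vectors; a minimal sketch with scikit-learn, using random placeholders in place of real embeddings:
```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholders standing in for real sentence embeddings: 6 texts, 384 dims each
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 384))

# Group the texts into 2 clusters based on the proximity of their embeddings
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # one cluster id per text, e.g. [0 1 0 0 1 1]
```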
Embeddings have the following characteristics:
- The size (number of dimensions) of the embeddings is fixed and depends on the underlying embedding model.
- Each dimension holds a numerical value (a property).
- Together, these properties (numerical values) represent the token.
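Concretely, an embedding is just a fixed-length array of numbers; a toy illustration (real models work the same way, only with far more dimensions):
```python
import numpy as np

# Toy 4-dimensional embedding; real models use hundreds of dimensions (e.g. 384)
embedding = np.array([0.12, -0.98, 0.33, 0.57])

print(embedding.shape)  # (4,) - the size is fixed by the model, not by the input
print(embedding[0])     # each dimension holds one numerical property: 0.12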
A tokenizer is used in the training process of its associated model:
- The model is linked with its tokenizer and can't be used with another tokenizer.
- The model holds an embedding for each token in the tokenizer's vocabulary.
- When a model is created, its embeddings and weights are randomly initialized.
- When a model is trained, its embeddings and weights are assigned proper values that capture the semantics of each token.
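This link between a model and its tokenizer can be inspected in Hugging Face Transformers; a sketch assuming bert-base-uncased (the model choice is illustrative):
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The embedding matrix has one row per token in the tokenizer's vocabulary
embedding_matrix = model.get_input_embeddings().weight
print(len(tokenizer))          # 30522 - tokens in the vocabulary
print(embedding_matrix.shape)  # torch.Size([30522, 768]) - one vector per token
```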
Types of embeddings:
- Embedding a token: creating one vector that represents the token.
- Embedding a sentence: creating one vector that represents the sentence (used for categorization, semantic search, RAG).
- Embedding a document: creating one vector that represents the document (used for categorization, semantic search, RAG).
When generating embeddings for text longer than a single token, the embeddings should capture the meaning of the whole text.
One way to generate such an embedding is to average the embeddings of all the tokens in the text, as sketched below.
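A minimal sketch of that averaging (mean pooling), using random placeholders for the per-token embeddings; in practice, padding tokens are usually masked out before averaging:
```python
import numpy as np

# Placeholders standing in for contextualized token embeddings: 7 tokens, 384 dims
token_embeddings = np.random.default_rng(0).normal(size=(7, 384))

# Average over the token axis to get one vector for the whole text
text_embedding = token_embeddings.mean(axis=0)
print(text_embedding.shape)  # (384,)
```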
Example: Generate token embeddings
Download the model:
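A sketch assuming sentence-transformers/all-MiniLM-L6-v2, chosen because its 384-dimensional embeddings match the sizes noted below; from_pretrained downloads and caches the model on first use:
```python
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model (384-dim embeddings)
tokenizer = AutoTokenizer.from_pretrained(model_name)  # downloads on first call
model = AutoModel.from_pretrained(model_name)
```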
Python code:
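Continuing the sketch above; the input sentence is an assumption, chosen so that it tokenizes into 7 tokens once the [CLS] and [SEP] markers are added:
```python
import torch

# Tokenize a five-word sentence; [CLS] and [SEP] bring the count to 7 tokens
tokens = tokenizer("The sky is blue today", return_tensors="pt")

# Run the model and take the contextualized token embeddings
with torch.no_grad():
    output = model(**tokens)

token_embeddings = output.last_hidden_state
print(token_embeddings.shape)
```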
Output:
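```
torch.Size([1, 7, 384])
```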
Note that the created embeddings have the following size: (1, 7, 384)
- 1: the batch dimension
- 7: seven tokens
- 384: each token is embedded in a vector of 384 values
The batch dimension can be larger than 1 when multiple sentences are given to the model to be processed at the same time, as sketched below.
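Continuing the sketch above with two sentences; the shorter one is padded so both fit in one tensor:
```python
tokens = tokenizer(
    ["The sky is blue today", "Hello world"],
    padding=True,  # pad the shorter sentence to the length of the longer one
    return_tensors="pt",
)
with torch.no_grad():
    output = model(**tokens)

print(output.last_hidden_state.shape)  # torch.Size([2, 7, 384]) - batch of 2
```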
Example: Generate sentence embeddings
Install the Sentence Transformers library:
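```bash
pip install sentence-transformers
```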
Python code:
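A minimal sketch with the same assumed model; SentenceTransformer handles tokenization and pooling internally and returns one vector per input sentence:
```python
from sentence_transformers import SentenceTransformer

# Assumed model; any Sentence Transformers model works the same way
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The sky is blue today", "It is sunny outside"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384) - one 384-dimensional vector per sentence
```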