LLMs | Running Models
  1. Hugging Face Hub
  2. Run transformer models using Transformers
  3. Run transformer models using Transformers Pipelines
  4. Run transformer models using llama-cpp-python
  5. Key parameters of the transformer models
  6. Run a model with ChatGPT (OpenAI)
  7. Save the model and its associated tokenizer and configuration files
  8. Load the saved model and its associated tokenizer and configuration files

  1. Hugging Face Hub
    The Hugging Face Hub is an open platform with over 1 million models that can be used to process and generate text, images, audio, video, ...
    https://huggingface.co/models

    Selecting a model depends on:
    • The underlying architecture of the model (representation/generative model)
    • The size of the model
    • The performance of the model
    • The task to be executed by the model
    • The languages supported by the model
    • ...

    You can use the Hugging Face MTEB (Massive Text Embedding Benchmark) leaderboard. It compares embedding models across many tasks and languages.
    https://huggingface.co/spaces/mteb/leaderboard
  2. Run transformer models using Transformers
    You can use the Hugging Face CLI to download a model:
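    For example (a minimal sketch; the model ID below is only an illustration, any Hub model ID works):

      pip install -U "huggingface_hub[cli]"
      huggingface-cli download microsoft/Phi-3-mini-4k-instruct --local-dir ./Phi-3-mini-4k-instruct
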
    Sample Python code to generate text:
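    A minimal sketch (the model ID is an example; any causal language model from the Hub works):

      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "microsoft/Phi-3-mini-4k-instruct"  # example model ID

      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)

      # Tokenize the prompt and generate up to 50 new tokens.
      inputs = tokenizer("The capital of France is", return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=50)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
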
  3. Run transformer models using Transformers Pipelines
    Sample Python code to use a pipeline:
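    A minimal sketch (same example model as above):

      from transformers import pipeline

      # "text-generation" is one of many built-in pipeline tasks.
      generator = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

      output = generator("The capital of France is", max_new_tokens=50)
      print(output[0]["generated_text"])
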
  4. Run transformer models using llama-cpp-python
    Download this model:
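    As an illustration, assume a quantized model in the GGUF format that llama-cpp-python requires (the repository and file names below are examples):

      pip install llama-cpp-python
      huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-q4.gguf --local-dir .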

    Python code:
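    A minimal sketch, assuming the GGUF file downloaded above is in the current directory:

      from llama_cpp import Llama

      # Path to the downloaded GGUF file (adjust to your local path).
      llm = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf", n_ctx=2048)

      output = llm(
          "Q: What is the capital of France? A:",
          max_tokens=50,    # maximum number of tokens to generate
          temperature=0.7,  # sampling temperature
          stop=["Q:"],      # stop when the model starts a new question
      )
      print(output["choices"][0]["text"])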

  5. Key parameters of the transformer models
    There are a few parameters that can affect the output of the model (see the sketch after this list):

    • The model's context length:
      A model has a context length (a.k.a. the context window, context size, token limit):
      • The context length represents the maximum number of tokens the model can process.
      • Generative models are autoregressive: each generated token is appended to the context, so the number of tokens in the context grows as new tokens are generated.

    • return_full_text:
      If set to "False", only the model output is returned.
      Otherwise, the full text is returned; including the user prompt.

    • max_new_tokens:
      It sets the maximum number of tokens the model can generate.

    • do_sample:
      The model computes the probability of every possible next token and ranks the candidate tokens by their probability of being chosen.

      If the "do_sample" parameter is set to "False", the model selects the most probable next token; this leads to a more predictable and consistent response. Otherwise, the model will sample from the probability distribution, leading to more possible tokens that can be chosen by the model.

      When we set the "do_sample" parameter to true, we can also use the "temperature" parameter to make the output more "random". Hence we can get different output for the same prompt.

    • temperature:
      It controls how likely the model is to choose less probable tokens.

      When we set the "temperature" parameter to 0 (deterministic), the model should always generate the same response when given the same prompt.

      The closer the value of the "temperature" parameter is to 1 (high randomness), the more likely we are to get a random output.
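
    A minimal sketch showing these parameters in a Transformers pipeline call (the model ID is an example):

      from transformers import pipeline

      generator = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

      output = generator(
          "Write a one-line slogan for a coffee shop:",
          return_full_text=False,  # return only the generated text, not the prompt
          max_new_tokens=30,       # cap on the number of generated tokens
          do_sample=True,          # sample instead of always picking the top token
          temperature=0.8,         # higher values increase randomness
      )
      print(output[0]["generated_text"])
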
  6. Run a model with ChatGPT (OpenAI)
    ChatGPT (OpenAI) is built on proprietary models. These models can be accessed through OpenAI's API.
    You need to sign up and create an API key here: https://platform.openai.com/api-keys
    The API key will be used to communicate with OpenAI's API.

    To try out your new API key with curl:
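    A minimal sketch (assuming the key is exported in the OPENAI_API_KEY environment variable):

      curl https://api.openai.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $OPENAI_API_KEY" \
        -d '{
          "model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "Say hello!"}]
        }'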

    To try out your new API key with Python:

    Install the OpenAI Python SDK:
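      pip install openai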

    Check OpenAI Python SDK installation:
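      pip show openai
      python -c "import openai; print(openai.__version__)"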

    Execute the code below to generate a haiku using the gpt-4o-mini model.
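    A minimal sketch (the client reads the API key from the OPENAI_API_KEY environment variable):

      from openai import OpenAI

      client = OpenAI()  # picks up OPENAI_API_KEY from the environment

      completion = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user", "content": "Write a haiku about AI."}],
      )
      print(completion.choices[0].message.content)
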
  7. Save the model and its associated tokenizer and configuration files
    To save a model, tokenizer, and configuration files, we can use the "save_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will save all related files in the same folder.

    Note that saving the model also saves its configuration file.

    • Save the model and its associated configuration files:
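      A minimal sketch (GPT-2 is used as an example because its tokenizer produces the vocab.json and merges.txt files described below):

        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")  # example model
        model.save_pretrained("./saved-model")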

      This will create a directory containing:
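        config.json
        generation_config.json
        model.safetensors

      (typical contents for the example above; exact files vary by model)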

    • Save the model tokenizer files:
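      A minimal sketch (same example model, saved to the same folder):

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")  # example model
        tokenizer.save_pretrained("./saved-model")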

      This will create a directory containing:
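        tokenizer_config.json
        special_tokens_map.json
        vocab.json
        merges.txt
        tokenizer.json

      (typical contents for a GPT-2-style tokenizer; exact files vary)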

    • Save only the model configuration file:
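      A minimal sketch (same example model):

        from transformers import AutoConfig

        config = AutoConfig.from_pretrained("openai-community/gpt2")  # example model
        config.save_pretrained("./saved-model")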

      This will create a directory containing:
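        config.json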

    Files:
    • config.json: The configuration file of the model.

    • tokenizer_config.json: The configuration file of the tokenizer.

    • vocab.json, tokenizer.json: contain the vocabulary and the mapping of tokens to IDs.

    • special_tokens_map.json: contains the mapping of special tokens used by the tokenizer.

    • model.safetensors: contains the model's weights.

    • generation_config.json: contains the default generation parameters of the model.

    • merges.txt: contains the BPE merge rules used by the tokenizer.
  8. Load the saved model and its associated tokenizer and configuration files
    To load the saved model, tokenizer, and configuration files, we can use the "from_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will have saved all related files in the same folder.
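    A minimal sketch, assuming the files were saved to the "./saved-model" folder used in the previous section:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      model = AutoModelForCausalLM.from_pretrained("./saved-model")
      tokenizer = AutoTokenizer.from_pretrained("./saved-model")

      # Quick check that the reloaded model and tokenizer work.
      inputs = tokenizer("Hello, world!", return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=20)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))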
