Model deployment

After successfully training your small language model (SLM), the final step is to deploy it for inference. This page guides you through accessing your model and integrating it into your application using two different methods.

Downloading Your Model

Once training is complete, request the model download link from the API (see Get your AUTH_HEADER for the required authentication header):

import requests
from pprint import pprint

# Request download information for the trained model
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
    headers=AUTH_HEADER,
)

# The response contains the model download URL
pprint(response.json())

Use the link to download the model and then extract the tarball. After extraction, you will have a model directory containing your trained SLM with all necessary files for deployment.
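The exact shape of the response JSON may differ, but as a rough sketch the snippet below streams the archive to disk and unpacks it; the download_url key is a placeholder for whichever field the printed response actually uses.

import tarfile

import requests

# Hypothetical field name -- check the JSON printed above for the real key
download_url = response.json()["download_url"]

# Stream the tarball to disk
with requests.get(download_url, stream=True) as archive:
    archive.raise_for_status()
    with open("model.tar.gz", "wb") as f:
        for chunk in archive.iter_content(chunk_size=8192):
            f.write(chunk)

# Extract into the current directory, producing a "model/" folder
with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall(".")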

Deployment Option 1: Using Hugging Face Transformers

The most straightforward way to use your model is with the Hugging Face transformers library, which provides a simple, flexible interface for inference.

from pprint import pprint

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

# Load the model and tokenizer from the extracted directory
model = AutoModelForSequenceClassification.from_pretrained("model")
tokenizer = AutoTokenizer.from_pretrained("model", padding_side="left")

# Create a pipeline for easy inference; top_k=None returns scores for every label
llm = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=None)

# Run inference on an example input
answer = llm("I have a charge for cash withdrawal that I want to learn about")
pprint(answer)
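The pipeline returns a score for every label. The exact nesting of the output can vary with the transformers version and input type, so the following is a defensive sketch for reading off the top prediction:

# answer is either a list of {"label", "score"} dicts or a list wrapping such a list
scores = answer[0] if isinstance(answer[0], list) else answer

# Pick the label with the highest score
prediction = max(scores, key=lambda item: item["score"])
print(prediction["label"], prediction["score"])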

Deployment Option 2: Using vLLM

For production deployments with higher throughput requirements, vLLM offers significant performance improvements over standard transformers through PagedAttention and other optimizations.

Start the vLLM server with your fine-tuned model:

$ vllm serve model --api-key EMPTY
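Before sending traffic, you can check that the server is up by listing the served models through vLLM's OpenAI-compatible API. The sketch below assumes the default port 8000 and the EMPTY API key from the command above:

import requests

# vLLM exposes an OpenAI-compatible API; /v1/models lists what is being served
response = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer EMPTY"},
)
print(response.json())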

Once the server is running, query the model using the OpenAI client library. The example below is for a question-answering task:

from openai import OpenAI

# Define the prompt template
def get_prompt(
    question: str,
    context: str,
    task_description: str,
) -> list[dict[str, str]]:
    return [
        {
            "role": "system",
            "content": f"You are a question-answering model working on a problem described in task_description XML block:\n<task_description>{task_description}</task_description>\nYou will be given a single task with context in context XML block and the question in question XML block\nAnswer the question and put your answer between the answer XML block.",
        },
        {
            "role": "user",
            "content": f"Now for the real task, answer the question in question block based on the context in context block.\n<context>\n{context}\n</context>\n<question>\n{question}\n</question>\nPlace the answer in answer XML block",
        },
    ]

# Initialize the client with the local server URL
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Answer questions
task_description = "Answer the question based on the context"
context = "The Grapes of Wrath is a 1940 drama film directed by John Ford."
question = "Who directed The Grapes of Wrath"

chat_response = client.chat.completions.create(
    model="model",
    messages=get_prompt(question, context, task_description),
)
print(chat_response)
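Because the prompt asks the model to wrap its answer in an answer XML block, the completion typically needs light post-processing. A minimal sketch, assuming the model follows the <answer>...</answer> convention it was prompted with:

import re

# Pull the assistant message out of the chat completion
completion_text = chat_response.choices[0].message.content

# Extract the content of the <answer> block; fall back to the raw text if absent
match = re.search(r"<answer>(.*?)</answer>", completion_text, flags=re.DOTALL)
answer_text = match.group(1).strip() if match else completion_text.strip()
print(answer_text)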

Production Deployment Considerations

When deploying your model to production, consider:

  • Resource Requirements: Even small models benefit from GPU acceleration, especially for high-throughput applications (a GPU-enabled pipeline is sketched after this list).
  • Security: Apply appropriate access controls, especially if your model has access to sensitive information.
  • Container Deployment: Consider packaging your model with Docker for consistent deployment across environments.
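To illustrate the GPU point above, the transformers pipeline from Option 1 can be moved onto an accelerator when one is available; this is only a sketch of one possible setup:

import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

# Use the first GPU when one is present, otherwise stay on CPU
device = 0 if torch.cuda.is_available() else -1

model = AutoModelForSequenceClassification.from_pretrained("model")
tokenizer = AutoTokenizer.from_pretrained("model", padding_side="left")

# device=0 places the pipeline on the first GPU; device=-1 keeps it on CPU
llm = TextClassificationPipeline(
    model=model, tokenizer=tokenizer, top_k=None, device=device
)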