Model Deployment
After successfully training your small language model (SLM), the final step is to deploy it for inference. This page walks you through downloading your model and serving it in your application using two deployment options: Hugging Face Transformers and vLLM.
Downloading Your Model
Once your training is complete, you can download the model using the download link provided by the API (see Get your AUTH_HEADER):
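As a rough sketch, the tarball can be fetched with a streaming HTTP request. The URL and header values below are placeholders; use the download link and AUTH_HEADER returned by the API:

```python
# Rough sketch: stream the model tarball to disk.
# DOWNLOAD_URL and AUTH_HEADER are placeholders; use the values provided by the API.
import requests

DOWNLOAD_URL = "https://example.com/path/to/model.tar.gz"  # placeholder download link
AUTH_HEADER = {"Authorization": "Bearer <your-token>"}      # placeholder credentials

with requests.get(DOWNLOAD_URL, headers=AUTH_HEADER, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("model.tar.gz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)
```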
After downloading, extract the tarball. The extracted model directory contains your trained SLM with all the files needed for deployment.
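A minimal sketch of the extraction step, using Python's built-in tarfile module and assuming the archive was saved as model.tar.gz as in the snippet above:

```python
# Minimal sketch: extract the downloaded tarball into the current directory.
import tarfile

with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall(path=".")  # yields a ./model directory containing the trained SLM
```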
Deployment Option 1: Using Hugging Face Transformers
The most straightforward way to use your model is with the Hugging Face transformers library, which provides a simple, flexible interface for inference.
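A minimal sketch of loading the extracted checkpoint and generating text, assuming the directory is ./model, that it contains a standard causal-LM checkpoint, and that torch and accelerate are installed alongside transformers:

```python
# Minimal sketch: load the extracted checkpoint and run a single generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./model"  # path to the extracted model directory

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # half precision for GPU inference; use float32 on CPU
    device_map="auto",          # place the model on a GPU if one is available
)

prompt = "Summarize the key benefits of small language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```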
Deployment Option 2: Using vLLM
For production deployments with higher throughput requirements, vLLM offers significant performance improvements over standard transformers through PagedAttention and other optimizations.
Start the vLLM server with your fine-tuned model:
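A minimal sketch, assuming a recent vLLM release that ships the `vllm serve` entry point and that the extracted checkpoint is in ./model. The server is typically started from a shell with `vllm serve ./model --port 8000`; here it is launched as a subprocess so the process can be managed from Python:

```python
# Minimal sketch: launch vLLM's OpenAI-compatible server as a child process.
# Equivalent to running `vllm serve ./model --port 8000` from a shell.
import subprocess

server = subprocess.Popen(
    [
        "vllm", "serve", "./model",  # path to the extracted model directory
        "--port", "8000",            # port for the OpenAI-compatible HTTP API
    ]
)

# The server keeps running until it is stopped, e.g. with server.terminate().
```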
Once the server is running, query the model using the OpenAI client library:
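A minimal sketch using the openai Python package (v1+), assuming the server from the previous step is listening on localhost:8000. vLLM does not validate the API key, and the model name defaults to the path passed to `vllm serve`. The chat endpoint assumes your model's tokenizer ships a chat template; if it does not, use the plain completions endpoint instead:

```python
# Minimal sketch: query the locally served model through the OpenAI client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                 # vLLM does not check the key
)

response = client.chat.completions.create(
    model="./model",  # matches the model path the server was started with
    messages=[
        {"role": "user", "content": "Summarize the key benefits of small language models."},
    ],
    max_tokens=128,
    temperature=0.7,
)

print(response.choices[0].message.content)
```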
Production Deployment Considerations
When deploying your model to production, consider:
- Resource Requirements: Even small models benefit from GPU acceleration, especially for high-throughput applications.
- Security: Apply appropriate access controls, especially if your model has access to sensitive information.
- Container Deployment: Consider packaging your model with Docker for consistent deployment across environments.