Hugging Face Inference Endpoints for Faster Model Deployment

Posted Nov 9, 2024

Hugging Face Inference Endpoints let you deploy models in a matter of minutes, not weeks or months.

You can even serve multiple models from a single endpoint, which makes your models easier to manage and helps keep latency and costs down.

This is particularly useful for applications that require real-time predictions, such as chatbots or recommendation systems.

In short, Inference Endpoints let you get started with model deployment quickly and easily.

Create an Endpoint

To create an Endpoint, you can start by selecting a model from the Hugging Face Hub. You can insert the identifier of any model on the Hub, such as the small generative LLM google/gemma-1.1-2b-it.

You can choose from a wide range of CPUs or GPUs from all major cloud platforms for your Instance Configuration. You can also adjust the region, for example if you need to host your Endpoint in the EU.

Valid Endpoint names must contain only lower-case characters, numbers, or hyphens ("-") and must be between 4 and 32 characters long. The Endpoint Name is generated automatically from the model identifier, but you are free to change it.

You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time, which can help save costs. However, restarting the Endpoint requires the model to be re-loaded into memory, which can take several minutes for large models.

You can also configure the Endpoint Security Level, which can be Protected, Public, or Private. Protected Endpoints require an authorized HF token for accessing the Endpoint.

Here are the key settings to consider when creating an Endpoint:

  • Model Repository: Insert the identifier of any model on the Hugging Face Hub.
  • Endpoint Name: Choose a valid name, which must contain only lower-case characters, numbers, or hyphens ("-") and be between 4 and 32 characters long.
  • Instance Configuration: Choose from a wide range of CPUs or GPUs from all major cloud platforms and adjust the region as needed.
  • Automatic Scale-to-Zero: Configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time.
  • Endpoint Security Level: Choose from Protected, Public, or Private.
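
If you prefer to script these settings rather than click through the UI, the huggingface_hub library exposes the same options. Below is a minimal sketch; the endpoint name, vendor, region, and instance labels are illustrative assumptions, and the exact instance types and sizes on offer depend on your account and region.

```python
# Minimal sketch: create an Endpoint programmatically with huggingface_hub.
# The name, vendor, region, and instance labels below are illustrative only.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "gemma-1-1-2b-it-demo",               # Endpoint Name: lower-case, digits, hyphens, 4-32 chars
    repository="google/gemma-1.1-2b-it",  # Model Repository on the Hub
    framework="pytorch",
    task="text-generation",
    vendor="aws",                         # cloud vendor
    region="us-east-1",                   # or an EU region if you need EU hosting
    accelerator="gpu",
    instance_type="nvidia-a10g",          # Instance Configuration (illustrative)
    instance_size="x1",
    type="protected",                     # Endpoint Security Level: protected, public, or private
)
print(endpoint.name, endpoint.status)
```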

Managing an Endpoint

You can manage your Inference Endpoint's lifecycle using various methods. You can pause it using pause(), which doesn't incur any costs, but you'll need to explicitly resume it using resume() when you're ready to use it again.

To reduce costs, you can also scale your Inference Endpoint to zero using scale_to_zero(). This stops the endpoint from consuming any resources, so it costs nothing while idle. Unlike a paused Endpoint, a scaled-to-zero Endpoint restarts automatically when the next request arrives, but that request pays a cold start delay while the model is re-loaded.

You can configure your Inference Endpoint to scale to zero automatically after a certain period of inactivity. This way, you can save resources and reduce costs without having to manually manage the endpoint.

Here are the methods you can use to manage your Inference Endpoint's lifecycle:

  • pause(): stop the Endpoint without incurring costs; it must be restarted explicitly with resume().
  • resume(): bring a paused Endpoint back up.
  • scale_to_zero(): release the compute resources until the next request arrives, at the cost of a cold start.

You can use these methods to efficiently control the state and performance of your Inference Endpoint as needed.
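
Here is a minimal sketch of those calls with huggingface_hub; the endpoint name is a hypothetical one carried over from the creation example above.

```python
# Minimal sketch of the lifecycle methods; "gemma-1-1-2b-it-demo" is a hypothetical name.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("gemma-1-1-2b-it-demo")

endpoint.pause()          # no cost while paused, but it will not restart on its own
endpoint.resume()         # explicitly bring it back up
endpoint.wait()           # block until the Endpoint reports "running"

endpoint.scale_to_zero()  # free the compute; the next request triggers a cold start
```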

Endpoint Configuration

To configure your Hugging Face Inference Endpoint, you can start by selecting a Model Repository, such as the google/gemma-1.1-2b-it model used in the initial demonstration. This model is a small generative LLM with 2.5B parameters.

You can also choose an Endpoint Name, which is generated automatically from the model identifier, but you're free to change it. Valid Endpoint names must contain only lower-case characters, numbers, or hyphens and be between 4 and 32 characters long. For example, you can keep the default name or rename it to something more descriptive.

You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time using Automatic Scale-to-Zero. This feature is useful when you don't need your Endpoint to be active all the time.
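
If an Endpoint already exists, the same configuration can also be adjusted programmatically. The sketch below is only illustrative: the endpoint name and instance labels are placeholders, and the exact parameters available (for instance a scale-to-zero timeout) depend on your huggingface_hub version.

```python
# Minimal sketch: reconfigure an existing Endpoint. Names and instance labels are placeholders.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("gemma-1-1-2b-it-demo")
endpoint.update(
    instance_type="nvidia-a10g",  # switch hardware (illustrative label)
    instance_size="x1",
    min_replica=0,                # a minimum of 0 replicas lets the Endpoint scale down to zero
    max_replica=1,
)
# Depending on your library version, a scale_to_zero_timeout argument may also be available.
```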

Dedicated

You can create a Dedicated Inference Endpoint to customize the deployment of your model and hardware. This is different from Serverless Inference APIs, which are limited to a pre-configured selection of popular models and are rate limited.

You can create a Dedicated Inference Endpoint manually through the web interface, which is convenient and lets you choose the hardware requirements: vendor, region, accelerator, instance type, and size.

You can also create an Inference Endpoint programmatically with the huggingface_hub library. This is useful for managing different Inference Endpoints.
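
For example, a short sketch like the following lists the Endpoints in your namespace so you can keep track of several deployments at once (the fields printed are standard huggingface_hub attributes).

```python
# Minimal sketch: enumerate your Inference Endpoints to manage several deployments.
from huggingface_hub import list_inference_endpoints

for ep in list_inference_endpoints():
    print(ep.name, ep.repository, ep.status)
```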

Here are the key benefits of Dedicated Inference Endpoints:

  • Deploy any model from the Hub, not just a pre-configured selection of popular models.
  • No shared rate limits, since the hardware is dedicated to your deployment.
  • Full control over the hardware: vendor, region, accelerator, instance type, and size.
  • Create and manage Endpoints either from the web interface or programmatically with huggingface_hub.

This makes Dedicated Inference Endpoints ideal for use cases like text generation with an LLM, image generation with Stable Diffusion, and reasoning over images with Idefics2.

Generating Embeddings

Generating embeddings is a crucial step in deploying your model, and it's surprisingly straightforward. You can generate embeddings by sending text data to the inference endpoint.

To do this, you'll typically use a language like Python to send your text to the deployed Hugging Face model.

Once the model is deployed, you send a request to the Endpoint and retrieve the corresponding embeddings from the response. This is a simple yet effective way to get started with generating embeddings.

The short Python sketch below shows how little code this takes.
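
Here is a minimal sketch, assuming you have deployed an embedding model (for example a sentence-transformers model) to an Inference Endpoint; the endpoint URL and token are placeholders.

```python
# Minimal sketch: request embeddings from a deployed Endpoint. URL and token are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://your-endpoint-name.endpoints.huggingface.cloud",  # your Endpoint URL
    token="hf_xxx",                                                  # an authorized HF token
)

embedding = client.feature_extraction("Inference Endpoints make deployment fast.")
print(embedding.shape)  # the exact shape depends on the model and serving container
```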

Endpoint Deployment

You can create an Inference Endpoint using the Hugging Face GUI or a RESTful API. This is a convenient way to deploy your model and get started with real-time predictions.

To create an Inference Endpoint from the GUI, you'll need to provide some basic information such as the model repository, endpoint name, instance configuration, and security level. The model repository is where you can insert the identifier of any model on the Hugging Face Hub.

You can also use a command line tool called hugie to launch Inference Endpoints in one line of code. This is a simple and efficient way to deploy your model, especially if you're working with GitHub Actions.

The create_inference_endpoint() function is another way to create an Inference Endpoint programmatically. This function returns an InferenceEndpoint object that holds information about the endpoint, such as its name, repository, status, and task.

To check the deployment status of your Inference Endpoint, you can use the status attribute of the InferenceEndpoint object. The status will typically go through an "initializing" or "pending" phase before reaching a "running" state.
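
In code, that looks roughly like the following; the endpoint name is the hypothetical one used in the earlier examples.

```python
# Minimal sketch: check deployment status and wait for the Endpoint to come up.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("gemma-1-1-2b-it-demo")  # hypothetical name
print(endpoint.status)  # typically "pending" or "initializing" right after creation

endpoint.wait()         # poll until the Endpoint reaches the "running" state
print(endpoint.status)  # "running"
print(endpoint.url)     # the URL you will send inference requests to
```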

Here are the different deployment options available for Inference Endpoints:

  • GUI: Create an Inference Endpoint using the Hugging Face GUI.
  • RESTful API: Create an Inference Endpoint using a RESTful API.
  • hugie: Use the command line tool hugie to launch Inference Endpoints in one line of code.
  • create_inference_endpoint(): Create an Inference Endpoint programmatically using the create_inference_endpoint() function.

Once your Inference Endpoint is deployed, you can use it to make real-time predictions with your model. The Endpoint exposes a simple HTTP API, so it's easy to call from your application once it's running.
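
As a rough sketch, a deployed text-generation Endpoint such as the gemma example can be called like this (the endpoint name is hypothetical):

```python
# Minimal sketch: real-time prediction against a deployed text-generation Endpoint.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("gemma-1-1-2b-it-demo")  # hypothetical name
client = endpoint.client  # an InferenceClient already pointed at the Endpoint URL

reply = client.text_generation(
    "Explain Hugging Face Inference Endpoints in one sentence.",
    max_new_tokens=64,
)
print(reply)
```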

Model and Cost Considerations

To keep your Hugging Face Inference Endpoint running smoothly, consider both the model and the cost implications, and make sure the model is compatible with the deployment environment (task, framework, and hardware).

When deploying your model, it's essential to monitor its performance to make necessary adjustments. This will help you fine-tune your model for optimal results.

Deploying a model behind an Inference Endpoint can be a cost-effective option, especially when you process a batch of jobs at once, which limits infrastructure costs.

You can automate this process through the Endpoint's API, making it a powerful tool for running machine learning models in production.

The same Endpoint also serves real-time predictions, so a single deployment covers both batch and interactive use.

Currently, you can deploy an Inference Endpoint from the GUI or using a RESTful API, giving you flexibility in how you set up your model deployment.

The command line tool hugie also provides an option to launch Inference Endpoints in one line of code by passing a configuration, making it a convenient option for automating deployments.
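
To illustrate the batch idea, a single request can carry several inputs at once. The sketch below is only indicative: the URL and token are placeholders, and whether batched inputs are accepted depends on the task and the container serving your model (shown here for a text-classification Endpoint).

```python
# Minimal sketch: send a batch of inputs in one request. URL and token are placeholders.
import requests

API_URL = "https://your-endpoint-name.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

payload = {"inputs": ["first document to classify", "second document to classify"]}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
```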

Latency and Stability

To ensure seamless integration and scalability, it's crucial to consider latency and stability when deploying your model.

The first step is to test different CPU endpoint types to determine their performance. For comparison, a large container on ECS had a latency of roughly 200 ms when called from an instance in the same region.

We also tested Inference Endpoints with a text classification model fine-tuned on RoBERTa, using the following test parameters:

  • Requester region: eu-east-1
  • Requester instance size: t3-medium
  • Inference Endpoint region: eu-east-1
  • Endpoint replicas: 1
  • Concurrent connections: 1
  • Requests: 1,000

The results showed that the vanilla Hugging Face container was more than twice as fast as our bespoke container run on ECS, with the slowest response from the large Inference Endpoint being just 108ms.

Here's a summary of the test results:

  • Bespoke container on ECS: roughly 200 ms per request from an instance in the same region.
  • Large Inference Endpoint (vanilla Hugging Face container): slowest response of 108 ms.

These results indicate that Inference Endpoints can provide low latency, making them suitable for applications that require real-time responses.

What About the Cost?

Inference Endpoints are more expensive than what we were doing before, with an increased cost of between 24% and 50%.

This additional cost may seem significant, but at our current scale, it's a difference of ~$60 a month for a large CPU instance, which is nothing compared to the time and cognitive load we're saving by not having to worry about APIs and containers.

Hosting multiple models on a single endpoint can save us money, potentially quite a bit, as long as we don't exceed the GPU memory limit.

However, if we were deploying hundreds of ML microservices, we might want to reconsider our approach, since the extra cost would add up quickly at that scale.

Frequently Asked Questions

How do I use the Hugging Face Inference API?

To use the Hugging Face Inference API, instantiate a client such as InferenceClient from the huggingface_hub library, or copy the generated snippet from the model page into your code editor. This gives you access to the API's capabilities so you can start making predictions.
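
As a minimal sketch (assuming the model is available on the serverless Inference API and "hf_xxx" stands in for your token):

```python
# Minimal sketch: call the serverless Inference API with huggingface_hub.
from huggingface_hub import InferenceClient

client = InferenceClient(model="google/gemma-1.1-2b-it", token="hf_xxx")
print(client.text_generation("What are Hugging Face Inference Endpoints?", max_new_tokens=64))
```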

Landon Fanetti

Writer

Landon Fanetti is a prolific author with many years of experience writing blog posts. He has a keen interest in technology, finance, and politics, which are reflected in his writings. Landon's unique perspective on current events and his ability to communicate complex ideas in a simple manner make him a favorite among readers.
