To install ONNX Runtime GenAI, you'll need a Python environment with Python 3.8 or later, since ONNX Runtime GenAI is distributed as a Python package.
You install it with pip, the Python package manager, by running `pip install onnxruntime-genai` in your terminal or command prompt.
The minimum requirements for running ONNX Runtime GenAI are a CPU with at least 2 cores and 8 GB of RAM; meeting them helps models run smoothly and efficiently.
To take advantage of GPU acceleration in ONNX Runtime GenAI, make sure your system has a compatible GPU, such as an NVIDIA GPU with CUDA support.
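To confirm the install worked, you can load a model and run a short generation. The sketch below follows the general pattern from the onnxruntime-genai README; the model folder path and prompt are placeholders, and exact method names vary a bit between releases, so treat it as a starting point rather than a definitive recipe.

```python
import onnxruntime_genai as og

# Load an ONNX Runtime GenAI model folder (placeholder path).
model = og.Model("phi-3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)

# Encode a prompt and set basic search options.
prompt = "Explain what ONNX Runtime GenAI does in one sentence."
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = tokenizer.encode(prompt)  # newer releases use a generator-based API instead

# Generate and decode the completion.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```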
Install
If you're looking to get started with ONNX Runtime, pre-built binaries are available for most language bindings, so you're likely to find one that matches your needs.
You can install ONNX Runtime with the CUDA execution provider (EP) from these pre-built binaries, which is the most convenient way to get started.
Head over to the Install ORT page for instructions for your language and platform.
For those who prefer a more hands-on approach, you can also build ONNX Runtime from source, though this is mainly needed for older CUDA versions.
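Once the GPU package is installed, selecting the CUDA EP is done through the providers list when creating an inference session. Here's a minimal sketch using the standard onnxruntime Python API; the model filename is a placeholder.

```python
import onnxruntime as ort

# Request the CUDA EP first and fall back to CPU if it is unavailable.
session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Confirm which providers the session actually ended up using.
print(session.get_providers())
```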
Requirements
ONNX Runtime has specific requirements to ensure compatibility with your runtime environment. ONNX Runtime built with CUDA 11.8 is compatible with any CUDA 11.x version.
You'll want to choose the package based on CUDA and cuDNN major versions that match your environment. ONNX Runtime built with cuDNN 8.x is not compatible with cuDNN 9.x, and vice versa.
Starting with version 1.19, CUDA 12.x is the default version for the ONNX Runtime GPU packages distributed on PyPI, so check which CUDA toolkit your environment provides before choosing a package.
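A quick way to check what your installed wheel supports is to inspect the package version and the execution providers it was built with; a minimal sketch:

```python
import onnxruntime as ort

# Report the installed ONNX Runtime version and the execution providers
# this build offers (e.g. CUDAExecutionProvider for a GPU wheel).
print(ort.__version__)
print(ort.get_available_providers())
```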
Performance Optimization
Performance optimization is crucial for achieving top-notch results with ONNX Runtime GenAI. The I/O Binding feature should be used to avoid the overhead of copying inputs and outputs on every run.
By using asynchronous copies while running inference, you can hide input uploads and output downloads behind the inference itself, significantly improving performance. This approach is demonstrated in a pull request that showcases its benefits.
If you disable synchronization on the inference, you have to take care of synchronizing the compute stream yourself after execution, which can be a challenge.
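Here's a minimal sketch of I/O binding with the onnxruntime Python API: the input is bound once and the output is left on the GPU, so the run avoids extra host round trips. The model path and the tensor names "input" and "output" are placeholders for your model's actual values.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

io_binding = session.io_binding()

# Bind a CPU-resident input; ONNX Runtime copies it to the device for us.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
io_binding.bind_cpu_input("input", x)  # placeholder tensor name

# Ask ONNX Runtime to allocate the output on the GPU so it never
# travels back through host memory unless we explicitly copy it.
io_binding.bind_output("output", device_type="cuda", device_id=0)

session.run_with_iobinding(io_binding)

# Copy results back to the CPU only when we actually need them.
outputs = io_binding.copy_outputs_to_cpu()
print(outputs[0].shape)
```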
CUDA 12.x
To get the most out of CUDA 12.x, you'll want to use ONNX Runtime 1.20.x or 1.19.x, which are compatible with PyTorch >= 2.4.0.
These versions require CUDA 12.x and cuDNN 9.x. cuDNN 9 is also required by ONNX Runtime 1.18.1, and note that there's no Java package available for that release.
If you're using Java, ONNX Runtime 1.18.0 adds a Java package; it's compatible with CUDA 12.x and cuDNN 8.x.
Here's a quick rundown of the compatible versions:
- ONNX Runtime 1.20.x and 1.19.x: CUDA 12.x, cuDNN 9.x
- ONNX Runtime 1.18.1: CUDA 12.x, cuDNN 9.x (no Java package)
- ONNX Runtime 1.18.0: CUDA 12.x, cuDNN 8.x (Java package added)
CUDA 10.x
CUDA 10.x is a popular version of the NVIDIA CUDA platform, and understanding its compatibility requirements is crucial for performance optimization.
If you're using CUDA 10.x with ONNX Runtime, you'll need ONNX Runtime 1.5-1.6; for those releases, CUDA 11 can only be used by building from source.
If you're using version 1.2-1.4 of ONNX Runtime, you'll need CUDA 10.1 and cuDNN 7.6.5, but be aware that cublas 10.1.x won't work with those releases.
Performance Tuning
Performance tuning is crucial for getting optimal throughput and latency out of your applications.
The I/O Binding feature can be used to avoid overhead caused by copies on inputs and outputs. This is particularly useful for hiding uploads and downloads for inputs behind the inference process.
By utilizing asynchronous copies while running inference, you can significantly reduce the overhead associated with data transfers. This approach is demonstrated in a specific pull request.
If you disable synchronization on the inference, you must synchronize the compute stream yourself after execution, otherwise you risk reading results before they are ready.
This feature should only be used together with device-local memory or an ORT Value allocated in pinned memory; with other kinds of memory, the download back to the host will block.
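To keep tensors in device-local memory, you can wrap them in OrtValue objects and bind those directly, so nothing has to travel through pageable host memory during the run. A minimal sketch, with the model path, tensor names, and shapes as placeholders:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider"],
)

# Allocate the input directly in CUDA device memory.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

# Pre-allocate a device-resident output of a known shape (placeholder shape).
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input", x_gpu)    # placeholder tensor names
io_binding.bind_ortvalue_output("output", y_gpu)

session.run_with_iobinding(io_binding)

# The result stays on the GPU; pull it back only when needed.
print(y_gpu.numpy().shape)
```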
Using CUDA Graphs
CUDA graphs let you define a sequence of GPU operations once and launch it as a single unit, which can significantly reduce the overhead of issuing kernels one at a time.
By creating a CUDA graph, you can record multiple kernel launches and the memory transfers between them and then execute them together, reducing per-launch CPU overhead.
This approach is particularly useful for applications with complex workflows, such as scientific simulations or data processing pipelines.
CUDA graphs can also be used to overlap computation and memory transfer, further improving performance.
For example, applying CUDA graphs to a deep learning model's inference pipeline has been reported to reduce latency by around 20%.
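In ONNX Runtime, CUDA graph capture is enabled through a CUDA execution provider option and requires I/O binding so that inputs and outputs keep fixed device addresses across runs. Below is a minimal sketch under those assumptions; the model path, tensor names, and shapes are placeholders, and it assumes the model's input shapes never change.

```python
import numpy as np
import onnxruntime as ort

# Enable CUDA graph capture via the CUDA EP provider options.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

# CUDA graphs need fixed device addresses, so bind OrtValues that live on the GPU.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input", x_gpu)    # placeholder tensor names
io_binding.bind_ortvalue_output("output", y_gpu)

# The first run captures the CUDA graph; later runs replay it.
session.run_with_iobinding(io_binding)

# Update the input in place (keeping its device address) and replay the graph.
x_gpu.update_inplace(np.random.rand(1, 3, 224, 224).astype(np.float32))
session.run_with_iobinding(io_binding)
```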
Onnx Runtime Web with Webgpu
ONNX Runtime Web with WebGPU is a powerful combination for high-performance computations in web browsers. It leverages the underlying system's GPU to carry out complex machine learning workloads.
WebGPU is a modern web API that enables web developers to harness the power of the GPU for parallel computation tasks. It's capable of handling more complex workloads than WebGL and reduces GPU memory usage and bandwidth requirements.
The WebGPU backend in ONNX Runtime Web is particularly useful for large and complex generative models that demand greater computational and memory resources. It has been adopted by various web applications, including Transformers.js.
ONNX Runtime Web now enables the WebGPU backend, making it easier for developers to deploy machine learning models directly in web browsers. Microsoft and Intel are collaborating closely to bolster the WebGPU backend further.
WebGPU is enabled by default in Chrome 113 and Edge 113 on macOS, Windows, and ChromeOS, and in Chrome 121 on Android. You can also monitor support for other browsers.
To use the WebGPU backend in ONNX Runtime Web, simply import the relevant package and create an ONNX Runtime Web inference session with the required backend through the Execution Provider setting. This process is designed to be straightforward and easy to use.
To take advantage of WebGPU's FP16 support for improved performance and memory efficiency, note that FP16 support was introduced in recent Chrome and Edge releases (version 121).
Frequently Asked Questions
What is the use of onnxruntime?
ONNX Runtime is used to power machine learning models in various Microsoft products and services, improving inference performance for a wide range of models. It enables fast and efficient execution of ML models in key applications and services.
What is the difference between ONNX and ONNX runtime?
ONNX is an open model format for machine learning models, while ONNX Runtime is a high-performance engine that deploys and runs ONNX models in production. Think of ONNX as the model's blueprint and ONNX Runtime as the engine that brings it to life.
Sources
- https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
- https://github.com/microsoft/onnxruntime-genai
- https://pypi.org/project/onnxruntime-genai/
- https://opensource.microsoft.com/blog/2024/02/29/onnx-runtime-web-unleashes-generative-ai-in-the-browser-using-webgpu/
- https://nietras.com/2024/04/28/phi-3-mini-csharp-ortgenai/