To install ONNX Runtime GenAI, you'll need a Python environment with Python 3.8 or later, since ONNX Runtime GenAI is distributed as a Python package.
You install it with pip, the Python package manager, by running `pip install onnxruntime-genai` in your terminal or command prompt.
The minimum requirements for running ONNX Runtime GenAI are a CPU with at least 2 cores and 8 GB of RAM; meeting them helps models run smoothly and efficiently.
To take advantage of GPU acceleration in ONNX Runtime GenAI, make sure your system has a compatible GPU, such as an NVIDIA GPU with CUDA support.
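To confirm the install worked, you can load a model and run a short generation. The sketch below follows the general pattern from the onnxruntime-genai README; the model folder path and prompt are placeholders, and exact method names vary a bit between releases, so treat it as a starting point rather than a definitive recipe.

```python
import onnxruntime_genai as og

# Load an ONNX Runtime GenAI model folder (placeholder path).
model = og.Model("phi-3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)

# Encode a prompt and set basic search options.
prompt = "Explain what ONNX Runtime GenAI does in one sentence."
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = tokenizer.encode(prompt)  # newer releases use a generator-based API instead

# Generate and decode the completion.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```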
Install
If you're looking to get started with ONNX Runtime, pre-built binaries are available for most language bindings, so you're likely to find one that matches your needs.
You can install ONNX Runtime with the CUDA execution provider (EP) from these pre-built binaries, which is the most convenient way to get started.
Head over to the Install ORT page for instructions for your language and platform.
For those who prefer a more hands-on approach, you can also build ONNX Runtime from source, though this is mainly needed for older CUDA versions.
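Once the GPU package is installed, selecting the CUDA EP is done through the providers list when creating an inference session. Here's a minimal sketch using the standard onnxruntime Python API; the model filename is a placeholder.

```python
import onnxruntime as ort

# Request the CUDA EP first and fall back to CPU if it is unavailable.
session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Confirm which providers the session actually ended up using.
print(session.get_providers())
```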
Requirements
ONNX Runtime has specific requirements to ensure compatibility with your runtime environment. ONNX Runtime built with CUDA 11.8 is compatible with any CUDA 11.x version.
You'll want to choose the package based on CUDA and cuDNN major versions that match your environment. ONNX Runtime built with cuDNN 8.x is not compatible with cuDNN 9.x, and vice versa.
Starting with version 1.19, CUDA 12.x is the default version for the ONNX Runtime GPU packages distributed on PyPI, so check which CUDA toolkit your environment provides before choosing a package.
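A quick way to check what your installed wheel supports is to inspect the package version and the execution providers it was built with; a minimal sketch:

```python
import onnxruntime as ort

# Report the installed ONNX Runtime version and the execution providers
# this build offers (e.g. CUDAExecutionProvider for a GPU wheel).
print(ort.__version__)
print(ort.get_available_providers())
```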
Performance Optimization
Performance optimization is crucial for achieving top-notch results with ONNX Runtime GenAI. The I/O Binding feature should be used to avoid the overhead of copying inputs and outputs on every run.
By using asynchronous copies while running inference, you can hide input uploads and output downloads behind the inference itself, significantly improving performance. This approach is demonstrated in a pull request that showcases its benefits.
If you disable synchronization on the inference, you have to take care of synchronizing the compute stream yourself after execution, which can be a challenge.
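Here's a minimal sketch of I/O binding with the onnxruntime Python API: the input is bound once and the output is left on the GPU, so the run avoids extra host round trips. The model path and the tensor names "input" and "output" are placeholders for your model's actual values.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

io_binding = session.io_binding()

# Bind a CPU-resident input; ONNX Runtime copies it to the device for us.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
io_binding.bind_cpu_input("input", x)  # placeholder tensor name

# Ask ONNX Runtime to allocate the output on the GPU so it never
# travels back through host memory unless we explicitly copy it.
io_binding.bind_output("output", device_type="cuda", device_id=0)

session.run_with_iobinding(io_binding)

# Copy results back to the CPU only when we actually need them.
outputs = io_binding.copy_outputs_to_cpu()
print(outputs[0].shape)
```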
CUDA 12.x
To get the most out of CUDA 12.x, you'll want to use ONNX Runtime 1.20.x or 1.19.x, which are compatible with PyTorch >= 2.4.0.
These versions require CUDA 12.x and cuDNN 9.x. cuDNN 9 is also required by ONNX Runtime 1.18.1, and note that there's no Java package available for that release.
If you're using Java, ONNX Runtime 1.18.0 adds a Java package; it's compatible with CUDA 12.x and cuDNN 8.x.
Here's a quick rundown of the compatible versions:
- ONNX Runtime 1.20.x and 1.19.x: CUDA 12.x, cuDNN 9.x
- ONNX Runtime 1.18.1: CUDA 12.x, cuDNN 9.x (no Java package)
- ONNX Runtime 1.18.0: CUDA 12.x, cuDNN 8.x (Java package added)
CUDA 10.x
CUDA 10.x is a popular version of the NVIDIA CUDA platform, and understanding its compatibility requirements is crucial for performance optimization.
If you're using CUDA 10.x with ONNX Runtime, you'll need ONNX Runtime 1.5-1.6; for those releases, CUDA 11 can only be used by building from source.
If you're using version 1.2-1.4 of ONNX Runtime, you'll need CUDA 10.1 and cuDNN 7.6.5, but be aware that cublas 10.1.x won't work with those releases.
Performance Tuning
Performance tuning is crucial for getting optimal throughput and latency out of your applications.
The I/O Binding feature can be used to avoid overhead caused by copies on inputs and outputs. This is particularly useful for hiding uploads and downloads for inputs behind the inference process.
By utilizing asynchronous copies while running inference, you can significantly reduce the overhead associated with data transfers. This approach is demonstrated in a specific pull request.
If you disable synchronization on the inference, you must synchronize the compute stream yourself after execution, otherwise you risk reading results before they are ready.
This feature should only be used together with device-local memory or an ORT Value allocated in pinned memory; with other kinds of memory, the download back to the host will block.
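To keep tensors in device-local memory, you can wrap them in OrtValue objects and bind those directly, so nothing has to travel through pageable host memory during the run. A minimal sketch, with the model path, tensor names, and shapes as placeholders:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider"],
)

# Allocate the input directly in CUDA device memory.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

# Pre-allocate a device-resident output of a known shape (placeholder shape).
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input", x_gpu)    # placeholder tensor names
io_binding.bind_ortvalue_output("output", y_gpu)

session.run_with_iobinding(io_binding)

# The result stays on the GPU; pull it back only when needed.
print(y_gpu.numpy().shape)
```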
Using CUDA Graphs
CUDA graphs let you define a sequence of GPU operations once and launch it as a single unit, which can significantly reduce the overhead of issuing kernels one at a time.
By creating a CUDA graph, you can record multiple kernel launches and the memory transfers between them and then execute them together, reducing per-launch CPU overhead.
This approach is particularly useful for applications with complex workflows, such as scientific simulations or data processing pipelines.
CUDA graphs can also be used to overlap computation and memory transfer, further improving performance.
For example, applying CUDA graphs to a deep learning model's inference pipeline has been reported to reduce latency by around 20%.
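In ONNX Runtime, CUDA graph capture is enabled through a CUDA execution provider option and requires I/O binding so that inputs and outputs keep fixed device addresses across runs. Below is a minimal sketch under those assumptions; the model path, tensor names, and shapes are placeholders, and it assumes the model's input shapes never change.

```python
import numpy as np
import onnxruntime as ort

# Enable CUDA graph capture via the CUDA EP provider options.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

# CUDA graphs need fixed device addresses, so bind OrtValues that live on the GPU.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input", x_gpu)    # placeholder tensor names
io_binding.bind_ortvalue_output("output", y_gpu)

# The first run captures the CUDA graph; later runs replay it.
session.run_with_iobinding(io_binding)

# Update the input in place (keeping its device address) and replay the graph.
x_gpu.update_inplace(np.random.rand(1, 3, 224, 224).astype(np.float32))
session.run_with_iobinding(io_binding)
```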
Onnx Runtime Web with Webgpu
ONNX Runtime Web with WebGPU is a powerful combination for high-performance computations in web browsers. It leverages the underlying system's GPU to carry out complex machine learning workloads.
WebGPU is a modern web API that enables web developers to harness the power of the GPU for parallel computation tasks. It's capable of handling more complex workloads than WebGL and reduces GPU memory usage and bandwidth requirements.
The WebGPU backend in ONNX Runtime Web is particularly useful for large and complex generative models that demand greater computational and memory resources. It has been adopted by various web applications, including Transformers.js.
ONNX Runtime Web now enables the WebGPU backend, making it easier for developers to deploy machine learning models directly in web browsers. Microsoft and Intel are collaborating closely to bolster the WebGPU backend further.
WebGPU is enabled by default in Chrome 113 and Edge 113 on macOS, Windows, and ChromeOS, and in Chrome 121 on Android. You can also monitor support for other browsers.
To use the WebGPU backend in ONNX Runtime Web, simply import the relevant package and create an ONNX Runtime Web inference session with the required backend through the Execution Provider setting. This process is designed to be straightforward and easy to use.
To take advantage of WebGPU's FP16 support for improved performance and memory efficiency, note that FP16 support was introduced in recent Chrome and Edge releases (version 121).
Frequently Asked Questions
What is the use of onnxruntime?
ONNX Runtime is used to power machine learning models in various Microsoft products and services, improving inference performance for a wide range of models. It enables fast and efficient execution of ML models in key applications and services.
What is the difference between ONNX and ONNX runtime?
ONNX is an open model format for machine learning models, while ONNX Runtime is a high-performance engine that deploys and runs ONNX models in production. Think of ONNX as the model's blueprint and ONNX Runtime as the engine that brings it to life.
Sources
- https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
- https://github.com/microsoft/onnxruntime-genai
- https://pypi.org/project/onnxruntime-genai/
- https://opensource.microsoft.com/blog/2024/02/29/onnx-runtime-web-unleashes-generative-ai-in-the-browser-using-webgpu/
- https://nietras.com/2024/04/28/phi-3-mini-csharp-ortgenai/