Red Hat AI Inference Server
Overview
The true value of AI lies in rapid, accurate responses at scale. Red Hat® AI Inference Server directly enables this by optimizing the inference process, the step in which an AI application sends a request to a large language model (LLM) and receives a generated response. By doing so across the hybrid cloud, it makes model deployments faster and more cost effective.
Fast and cost-effective inference anywhere
As part of the Red Hat AI platform, Red Hat AI Inference Server provides consistent, fast, and cost-effective inference at scale. AI Inference Server allows you to run any generative AI (gen AI) model on any hardware accelerator in datacenter, cloud, and edge environments, providing the flexibility and choice you need to meet your business requirements. It enables efficient inference in two ways: by compressing both foundational and trained models with LLM Compressor, and by providing access to a collection of validated and optimized gen AI models that are ready for inference deployment in less time.
Red Hat AI Inference Server works with a wide array of hardware accelerators and models and can run on your choice of infrastructure and operating system (OS), including Red Hat AI platforms, Red Hat Enterprise Linux®, Red Hat OpenShift®, and third-party Linux or Kubernetes distributions, giving customers flexibility to align with any architecture.
Table 1. Features and benefits
| Benefit | Description |
| --- | --- |
| Efficient approach to model inferencing with virtual large language model (vLLM) | AI Inference Server provides an efficient approach to model inference by optimizing graphics processing unit (GPU) memory usage and reducing inference latency with vLLM. |
| Reduced operational complexity | AI Inference Server provides a consistent platform for deploying and optimizing models across the hybrid cloud. It offers a user-friendly approach to managing advanced machine learning (ML) techniques, including quantization, and integrates with observability tools such as Prometheus and Grafana (see the sketch after this table). |
| Hybrid cloud flexibility | With vLLM at its core, AI Inference Server gives organizations the freedom to run AI models wherever they need them: in datacenters, in cloud environments, and at the edge. |
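As an illustration of the Prometheus integration noted in the table, the short Python sketch below reads the Prometheus-format metrics that a vLLM-based server exposes. The server address, the /metrics path, and the vllm: metric prefix are assumptions for illustration; adjust them to match your deployment.

```python
# Minimal sketch: read Prometheus-format metrics from a running inference server.
# Assumes the server listens on localhost:8000 and exposes /metrics (adjust as needed).
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed endpoint

with urllib.request.urlopen(METRICS_URL) as response:
    text = response.read().decode("utf-8")

# Print only the vLLM-related series (the "vllm:" metric prefix is an assumption).
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```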
Technical specifications
- Inference runtime for the hybrid cloud. With the sophisticated and powerful vLLM inference runtime at its core, AI Inference Server provides businesses with a unified, high-performance platform to run their choice of models across various accelerators, Kubernetes, and Linux environments. It also integrates with observability tools for enhanced monitoring and supports standard LLM APIs, such as OpenAI-compatible endpoints, for flexible deployment (see the client sketch after this list).
- LLM Compressor. AI teams can compress both foundational and trained models of any size to reduce compute use and its related costs while maintaining high model response accuracy, and they can work with Red Hat for support with their model optimization initiatives (see the compression sketch after this list).
- Optimized model repository. Hosted on the Red Hat AI page on Hugging Face, this repository gives AI Inference Server users instant access to a validated and optimized collection of leading AI models that are ready for inference deployment, helping to improve inference efficiency by 2-4x without compromising model accuracy (see the serving sketch after this list).
- Certified for all Red Hat products. AI Inference Server is included as part of Red Hat OpenShift AI and Red Hat Enterprise Linux AI and is also supported on Red Hat OpenShift and Red Hat Enterprise Linux.
- Third-party platform deployments. AI Inference Server can be deployed across third-party Linux and Kubernetes platforms and is covered under Red Hat’s third-party support policy. In these cases, Red Hat supports only the AI Inference Server component, and the customer is responsible for issues related to their underlying platform that cannot be reproduced on Red Hat Enterprise Linux or Red Hat OpenShift.
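Because the runtime exposes OpenAI-compatible endpoints, any standard OpenAI client can talk to a running server. The sketch below is a minimal example; the server address, the placeholder API key, and the model name are assumptions for illustration, not product defaults.

```python
# Minimal client sketch: query an OpenAI-compatible chat completions endpoint.
# The server address and model ID below are assumed placeholders; adjust to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # local servers typically do not validate keys
)

response = client.chat.completions.create(
    model="example-org/example-model",  # hypothetical model ID
    messages=[{"role": "user", "content": "Summarize what inference serving is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```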
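The LLM Compressor item above describes compressing models before serving. The sketch below shows what a one-shot FP8 quantization pass could look like with the open source llm-compressor library; the import paths, the FP8_DYNAMIC scheme, and the example model ID are assumptions drawn from that library's published examples, so verify them against current documentation.

```python
# Minimal compression sketch: one-shot FP8 quantization with llm-compressor.
# Import paths, scheme name, and model ID are assumptions; check current docs before use.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # example source model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize Linear layers to FP8 with dynamic activation scales; keep the LM head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save the compressed checkpoint so it can be served later.
model.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-dynamic")
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-dynamic")
```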
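Models from the validated collection load like any other Hugging Face checkpoint. The sketch below uses vLLM's offline Python API to generate text; the repository ID shown is a hypothetical example of a pre-quantized model, not a guaranteed listing in the collection.

```python
# Minimal serving sketch: offline generation with vLLM from a pre-optimized checkpoint.
# The model ID below is a hypothetical example, not a guaranteed entry in the collection.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/example-quantized-model")  # hypothetical repository ID
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain model quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```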
Explore a path to fully optimized automation
To discover how AI Inference Server helps deliver fast, cost-effective, and scalable inference, visit the Red Hat AI Inference Server product page.