
AMD Instinct MI250 vs NVIDIA A100: The LLM Inference Showdown

Vitalii Duk
July 16, 2024

In the rapidly evolving field of AI, choosing the right hardware for specific tasks like large language model (LLM) inference is crucial. This article provides a detailed comparison of two leading GPUs, the AMD MI250 and the NVIDIA A100, focusing on their performance when used with the industry-standard tool for LLM serving, vLLM.

AMD MI250

The AMD MI250 is a high-end GPU in AMD's Instinct series, tailored for high-performance computing (HPC) and deep learning. It features a multi-die architecture, which enhances processing power and memory bandwidth. This design is particularly beneficial for complex tasks such as LLM inference, where handling large volumes of data quickly is crucial. The MI250 offers a substantial 128GB of HBM2e memory, allowing larger models or datasets to be loaded directly onto the GPU, reducing data transfer delays.

NVIDIA A100

Meanwhile, the NVIDIA A100 GPU, built on NVIDIA's Ampere architecture, is a powerhouse in AI and machine learning and is known for its versatility across computing tasks, including LLM inference. With up to 80GB of HBM2e memory in its highest configuration, the A100 provides robust support for AI workloads, backed by NVIDIA's comprehensive software ecosystem: CUDA, cuDNN, and an extensive set of libraries that facilitate AI development and deployment.
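Because PyTorch's ROCm build reuses the torch.cuda namespace, the same snippet can report the device name and available memory on either card. Below is a minimal sketch (note that the dual-die MI250 typically presents as two devices, one per compute die, each carrying half of the 128GB):

import torch

# PyTorch's ROCm build maps AMD GPUs onto the same torch.cuda API,
# so this check runs unchanged on both machines.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"Device {i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GiB total memory")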

Here’s a more detailed comparison:

  • Architecture: multi-die AMD CDNA 2 (MI250) vs. NVIDIA Ampere (A100)
  • Memory: 128GB HBM2e (MI250) vs. up to 80GB HBM2e (A100)
  • Software ecosystem: ROCm (MI250) vs. CUDA, cuDNN, and related libraries (A100)

Given the similarities between the two accelerators, we decided to benchmark them head to head and determine whether AMD could be a better choice for LLM inference, either right now or in the near future.

Benchmarking setup

To accurately gauge the performance of these GPUs under the same conditions, we standardized the testing environment and configurations as follows:

Configuration

  • Batch Size: 8 (to measure how the GPUs handle multiple prompts simultaneously)
  • GPU Memory Utilization: 90% (to maximize the memory used during tests)
  • Tensor Parallel Size: 1 (to focus on single-GPU capabilities; see the sketch below)
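In vLLM's Python API, this configuration corresponds roughly to the sketch below (the model name is a placeholder for whichever checkpoint is being benchmarked; the actual runs used vLLM's benchmark scripts rather than this snippet):

from vllm import LLM, SamplingParams

# Engine configured as in our benchmarks: 90% of GPU memory, single GPU.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder for any benchmarked model
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
)

# A batch of 8 prompts submitted together; vLLM schedules them concurrently.
prompts = [f"Question {i}: what is HBM memory?" for i in range(8)]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, max_tokens=128))
for out in outputs:
    print(out.outputs[0].text[:80])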

AMD MI250 machine specifications

  • OS: Ubuntu 20.04
  • GPU drivers: ROCm 6.1.2
  • GPU communication: RCCL 2.18.6
  • Inference stack: vLLM v0.5.0

NVIDIA A100 machine specifications

  • OS: Ubuntu 20.04
  • GPU drivers: CUDA 12.1
  • GPU communication: NCCL 2.18.3
  • Inference stack: vLLM v0.5.0
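To confirm which stack a given machine is running before benchmarking, a quick version check helps. This is a sketch that assumes a PyTorch build matching the stack: torch.version.cuda is populated on CUDA builds, torch.version.hip on ROCm builds.

import torch
import vllm

print(f"vLLM: {vllm.__version__}")
print(f"CUDA: {torch.version.cuda}")                   # e.g. '12.1' on the A100 machine
print(f"HIP:  {getattr(torch.version, 'hip', None)}")  # set on the MI250 machine
print(f"Device: {torch.cuda.get_device_name(0)}")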

Performance Metrics

Methodology

We conducted our tests using the benchmark_throughput.py and benchmark_latency.py scripts from the vLLM GitHub repository, focusing on offline throughput (the number of prompts processed per second) and latency (the response time for each prompt).

  • For the throughput benchmarking, we used prompts from the ShareGPT dataset and fed them into the model in batches of 8.
  • For the latency benchmarking, we randomly generated prompts of a fixed size, so that every request in a batch has the same input size and the same output size; a sketch approximating this setup follows below.
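A rough approximation of the latency setup is shown below. The real benchmark_latency.py feeds dummy token IDs so that every request has exactly the same input length; here a repeated word stands in for that, and ignore_eos forces each output to run to exactly max_tokens:

import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf",  # placeholder model
          gpu_memory_utilization=0.9, tensor_parallel_size=1)

input_len, output_len, batch_size = 128, 128, 8

# ignore_eos makes every request generate exactly output_len tokens,
# so all outputs in the batch have equal length.
params = SamplingParams(temperature=1.0, ignore_eos=True, max_tokens=output_len)

# Roughly fixed-length inputs (the real script samples random token IDs).
prompts = [" hello" * input_len for _ in range(batch_size)]

start = time.perf_counter()
llm.generate(prompts, params)
print(f"Batch latency: {time.perf_counter() - start:.2f} s")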

Results and Analysis

Initial results indicate that the MI250 delivers about 80% of the performance of the NVIDIA A100 40GB in both the throughput and latency tests. Compared to the 80GB variant of the A100, the MI250 retains approximately 73% of its performance. These results are promising, especially considering the MI250's favorable price-to-performance ratio.

Here are the results we achieved, visualized on the graphs below:

  • For the popular 7B-class models, which are well suited to additional fine-tuning (such as Llama 3 8B, Mistral 7B, or Gemma 7B)

  • For the bigger models (11B to 14B parameters), such as Llama 2 13B, Phi-3 Medium, and Falcon 11B

Conclusion

The ongoing development of vLLM and other LLM inference tools for ROCm-based AMD GPUs is poised to further enhance their capabilities for LLM inference and serving, potentially narrowing the performance gap with NVIDIA's CUDA environment. Even though performance is currently lower, there is ample room for optimization, and we anticipate it will not take long for AMD to catch up. As these technologies evolve, the AMD MI250 may become a compelling alternative to the NVIDIA A100, particularly for cost-sensitive applications requiring large amounts of VRAM.
