Enhancing LLM Inference Speed: The Power of PagedAttention

Chapter 1: Introduction to vLLM and PagedAttention

Large Language Models (LLMs) predominantly rely on the Transformer architecture, celebrated for its effectiveness. The architecture nonetheless faces computational challenges, particularly during the decoding phase: for every new token, the attention computation needs the key-value tensors of all previously processed tokens, which must therefore be kept in GPU memory and can consume a substantial amount of it.

For a comprehensive understanding of these key-value pairs and their significance within the Transformer framework, I highly recommend exploring "The Illustrated Transformer" by Jay Alammar.

As LLMs increasingly accommodate longer input sequences—such as Claude, which can process inputs of up to 100,000 tokens—the memory consumption for these tensors can escalate dramatically. This excessive memory use can lead to over-allocation and fragmentation, hampering memory access efficiency, particularly with extended token sequences.
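
To get a rough sense of scale, here is a back-of-the-envelope estimate of the key-value cache size for a single sequence. The model dimensions below (40 layers, 40 attention heads, head size 128, fp16 values) are illustrative figures in the range of a ~13B-parameter model, not numbers taken from this article:

# Rough KV cache estimate per sequence (illustrative dimensions, not from the article)
num_layers = 40
num_heads = 40
head_dim = 128
bytes_per_value = 2   # fp16
seq_len = 8192        # tokens kept in the cache

# Factor of 2 because both a key and a value tensor are stored for every token
kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
kv_cache_bytes = kv_bytes_per_token * seq_len

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")            # 800 KiB
print(f"{kv_cache_bytes / 1024**3:.2f} GiB for {seq_len} tokens")  # 6.25 GiB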

To address these challenges, UC Berkeley has introduced PagedAttention, a feature implemented in vLLM (Apache 2.0 license), developed by LMSYS, a research initiative established by UC Berkeley students and faculty, with support from UCSD and CMU.

Section 1.1: Understanding PagedAttention

PagedAttention aims to enhance the storage of key-value tensors within the GPU's VRAM by utilizing non-contiguous spaces more effectively. The essence of this approach involves creating virtual contiguous blocks that correspond to physical memory allocations.

Each block is tailored to hold key-value tensors for a specific number of tokens. Although these blocks appear contiguous in a virtual sense, they may occupy non-contiguous locations in GPU memory, allocated on an as-needed basis during inference. An index table is maintained to map these virtual blocks to their physical counterparts.

During inference, the PagedAttention kernel fetches these blocks on demand. Because each block is small, only a limited number of key-value tensors is loaded at a time, which keeps memory accesses efficient.

To illustrate, consider the prompt: "the cat is sleeping in the kitchen and the dog is." If we define a block size of 4, each block will contain 4 key-value tensors, except the final block, which will hold just 3.

Because the kernel fetches key-value tensors in manageable blocks rather than as one long sequence, attention computation becomes significantly faster.
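
As a minimal sketch of how such a block table could look (illustrative code, not vLLM's actual implementation), the prompt above is split into logical blocks of 4 tokens, and each logical block is mapped to an arbitrary, possibly non-contiguous physical block index:

# Sketch of a block table (illustrative only)
BLOCK_SIZE = 4

tokens = "the cat is sleeping in the kitchen and the dog is".split()

# Split the sequence into logical blocks of BLOCK_SIZE tokens
logical_blocks = [tokens[i:i + BLOCK_SIZE] for i in range(0, len(tokens), BLOCK_SIZE)]

# Physical blocks are allocated on demand and need not be contiguous
block_table = {0: 7, 1: 1, 2: 4}  # logical block index -> physical block index

for logical_idx, block in enumerate(logical_blocks):
    print(f"logical block {logical_idx} -> physical block {block_table[logical_idx]}: {block}")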

Section 1.2: Advantages of Parallel Sampling

Another notable benefit of PagedAttention is its ability to share virtual blocks during inference sampling. This functionality allows multiple sequences generated through sampling or beam search to utilize the same virtual blocks, effectively eliminating redundancy.

LMSYS's experiments revealed a remarkable 55% decrease in memory usage during beam search decoding.
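
To illustrate the idea of sharing (again a sketch, not vLLM's code), sequences sampled from the same prompt can point to the same physical blocks for the prompt's key-value tensors, while a per-block reference count tells the allocator when a block can be freed:

# Sketch of block sharing between sampled sequences (illustrative only)
prompt_blocks = [7, 1, 4]           # physical blocks holding the prompt's KV tensors

# Two sequences sampled from the same prompt reuse the prompt's blocks and only
# allocate new physical blocks for the tokens they generate themselves
seq_a_blocks = prompt_blocks + [9]
seq_b_blocks = prompt_blocks + [2]

# Reference counts per physical block
ref_counts = {}
for seq in (seq_a_blocks, seq_b_blocks):
    for block in seq:
        ref_counts[block] = ref_counts.get(block, 0) + 1

print(ref_counts)  # shared prompt blocks have a count of 2, private blocks a count of 1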

Chapter 2: Performance Insights

The first video titled "E07 | Fast LLM Serving with vLLM and PagedAttention" delves into the efficiency gains provided by PagedAttention, showcasing its impact on LLM performance.

The second video, "Fast LLM Serving with vLLM and PagedAttention," further explores practical applications and benchmarks, illustrating the speed improvements over traditional methods.

Before we delve into practical implementation, let’s examine the performance metrics presented by the authors (UC Berkeley/LMSYS) concerning PagedAttention in vLLM compared to the Hugging Face text generation library.

The performance of LLaMa models in output completion tasks indicates that vLLM substantially outperforms the Hugging Face library, particularly with larger models, which are more susceptible to memory fragmentation. Overall, vLLM demonstrates up to 24 times the speed of Hugging Face Transformers.

Note: The jump from Hugging Face Transformers to Hugging Face's Text Generation Inference (TGI) library is also impressive, and TGI is already used in production environments. While TGI may appear slower than vLLM, it supports a wider range of models and offers additional features.

Chapter 3: Setting Up vLLM on Your System

Note: As of now, vLLM does not support CUDA 12. Consider using an earlier version, like 11.8.

In this section, I will guide you through the fundamental steps to set up and operate vLLM on your system. For more detailed information, please refer to the vLLM documentation.

At present, vLLM supports a limited selection of models, including:

  • GPT-2
  • GPT-NeoX and Pythia based models
  • LLaMa based models
  • OPT based models

You can add support for other models by following the instructions in the vLLM documentation.

For demonstration, I will utilize Dolly V2 (MIT license), a chat model based on Pythia, developed by DataBricks. I selected the smallest version, which comprises 3 billion parameters and runs on consumer GPUs with 24 GB of VRAM, such as the NVIDIA RTX 3090.

The easiest method to install vLLM is through pip:

pip install vllm

Note: This installation process should take around 10 minutes.

In my experience, both my local machine and Google Colab encountered issues with pip while attempting to install the vllm library. The vLLM authors acknowledge potential conflicts with certain nvcc versions and environments. Nevertheless, pip should generally facilitate a smooth installation for most setups.

If you find yourself facing similar challenges, a viable alternative is to utilize a Docker image:

docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/pytorch:22.12-py3

Note: After entering the Docker environment, it's advisable to remove PyTorch prior to installing vLLM:

pip uninstall torch

pip install vllm

Once installed, you can commence coding in Python. Begin by importing vllm and loading your model. Inference is initiated via llm.generate():

from vllm import LLM

prompts = ["Tell me about gravity"]  # Insert multiple prompts if needed

llm = LLM(model="databricks/dolly-v2-3b")  # Load the model

outputs = llm.generate(prompts)  # Execute inference

for output in outputs:  # Print the generated text for each prompt
    print(output.outputs[0].text)
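
To control decoding (temperature, nucleus sampling, output length, and so on), you can pass a SamplingParams object to llm.generate(); the values below are only illustrative:

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=200)  # illustrative values

outputs = llm.generate(prompts, sampling_params)  # reuses the llm and prompts defined above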

vLLM can also be used to serve LLMs. It functions similarly to TGI, while being simpler to set up than the NVIDIA Triton Inference Server discussed previously.

To initiate the server, use:

python -m vllm.entrypoints.openai.api_server --model databricks/dolly-v2-3b

Note: The server listens on port 8000 by default; make sure this port is free on your machine, or change it with the --port argument.

You can then send queries to the server using prompts as follows:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "databricks/dolly-v2-3b",
    "prompt": "Tell me about gravity",
    "max_tokens": 200
  }'
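
Since the endpoint follows the OpenAI completions format, you can also query it from Python. Here is a minimal sketch using the requests library, assuming the server above is running locally:

import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "databricks/dolly-v2-3b",
        "prompt": "Tell me about gravity",
        "max_tokens": 200,
    },
)
print(response.json()["choices"][0]["text"])  # generated completion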

And that’s it! You now have an efficient LLM server set up on your machine.

Conclusion

PagedAttention significantly accelerates inference processes, representing a leap towards more accessible AI with LLMs. My further experiments have shown that vLLM excels particularly with batches of prompts, and optimizing your batching strategy can maximize its efficiency.

Beam search, which used to be impractical because of the memory cost of standard attention computation, becomes both faster and more memory-efficient with PagedAttention.

I plan to explore the combination of PagedAttention with QLoRa to further decrease memory requirements. This integration should be straightforward and could enhance the performance of LLMs on consumer-grade hardware even more.

If you're already a supporter and wish to contribute to this work, feel free to follow me on Medium.
