Fit Your LLM on a single GPU with Gradient Checkpointing, LoRA, and Quantization: a deep dive

Jeremy Arancio
Published in Towards AI
14 min read · Aug 3, 2023

Anyone who has tried to fine-tune a Large Language Model knows how hard it is to manage GPU memory.

“RuntimeError: CUDA error: out of memory”.

This error message has been haunting my nights.

Models with 3B, 7B, or even 13B parameters are large, and fine-tuning them is long and tedious. Running out of memory mid-training can be both frustrating and costly.
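To see why even a 7B-parameter model overwhelms a single GPU, a back-of-the-envelope estimate helps. This is a rough sketch assuming full fp32 training with an Adam-style optimizer; the 4x multiplier (weights, gradients, and two optimizer moments) and the function name are illustrative assumptions, and activations would add even more on top:

```python
def training_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Rough lower bound on training memory, ignoring activations.

    Assumes fp32 and an Adam-style optimizer: weights (1x) + gradients (1x)
    + two optimizer states (2x) = 4 copies of every parameter.
    """
    return 4 * n_params * bytes_per_param / 1e9

# A 7B-parameter model needs on the order of 112 GB before activations,
# far beyond any single consumer GPU.
print(f"{training_memory_gb(7e9):.0f} GB")
```

Numbers like this are exactly why the techniques in the title exist: gradient checkpointing trades compute for activation memory, LoRA shrinks the trainable parameter count, and quantization shrinks the bytes per parameter.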


NLP Engineer & AI-ndependant - I help companies leverage text using Machine Learning! - Website: https://linktr.ee/jeremyarancio