Fit Your LLM on a single GPU with Gradient Checkpointing, LoRA, and Quantization: a deep dive

Jeremy Arancio
Published in Towards AI
14 min read · Aug 3, 2023

Anyone who has tried to fine-tune a Large Language Model knows how hard it is to manage GPU memory.

“RuntimeError: CUDA error: out of memory”.

This error message has been haunting my nights.

Models with 3B, 7B, or even 13B parameters are large, and fine-tuning them is long and tedious. Running out of memory mid-training can be both frustrating and costly.
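To see why even a 7B-parameter model overwhelms a single GPU, a back-of-the-envelope estimate helps. This is a rough sketch assuming full fp32 training with an Adam-style optimizer; the 4x multiplier (weights, gradients, and two optimizer moments) and the function name are illustrative assumptions, and activations would add even more on top:

```python
def training_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Rough lower bound on training memory, ignoring activations.

    Assumes fp32 and an Adam-style optimizer: weights (1x) + gradients (1x)
    + two optimizer states (2x) = 4 copies of every parameter.
    """
    return 4 * n_params * bytes_per_param / 1e9

# A 7B-parameter model needs on the order of 112 GB before activations,
# far beyond any single consumer GPU.
print(f"{training_memory_gb(7e9):.0f} GB")
```

Numbers like this are exactly why the techniques in the title exist: gradient checkpointing trades compute for activation memory, LoRA shrinks the trainable parameter count, and quantization shrinks the bytes per parameter.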


NLP Engineer & AI-ndependant - I help companies leverage text using Machine Learning! - Website: https://linktr.ee/jeremyarancio