Semantic Caching in Generative AI Chatbots

Reduce LLM latency and cost by over 95% using OpenAI, LiteLLM, Qdrant, and Sentence Transformers!

Marie Stephen Leo · Published in Towards AI · 7 min read · Mar 11, 2024


Image generated by Author using DALL·E 3

Latency and cost are significant challenges with LLM-based chatbots today. The problem is even more pronounced in Retrieval Augmented Generation (RAG) agents, where we must make multiple calls to the LLM before returning an answer to the user.
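Semantic caching attacks both problems at once: if a new question is semantically close enough to one we've already answered, we return the stored answer instead of calling the LLM again. As a rough orientation before we dive into the details, here is a minimal sketch of the idea using Sentence Transformers for embeddings and Qdrant as the vector store. The collection name, the 0.9 similarity threshold, and the model choice are illustrative assumptions, not the article's exact settings:

```python
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
encoder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(":memory:")  # in-memory Qdrant, fine for a sketch
qdrant.create_collection(
    collection_name="semantic_cache",  # illustrative name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
llm = OpenAI()


def ask(question: str, threshold: float = 0.9) -> str:
    vector = encoder.encode(question).tolist()

    # Cache lookup: find the most similar previously asked question
    hits = qdrant.search(
        collection_name="semantic_cache", query_vector=vector, limit=1
    )
    if hits and hits[0].score >= threshold:
        return hits[0].payload["answer"]  # cache hit: no LLM call at all

    # Cache miss: call the LLM, then store the question/answer pair
    answer = llm.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, swap for your own
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    qdrant.upsert(
        collection_name="semantic_cache",
        points=[
            PointStruct(
                id=str(uuid.uuid4()), vector=vector, payload={"answer": answer}
            )
        ],
    )
    return answer
```

A cache hit costs one embedding computation and one vector search, typically a few milliseconds, versus a full LLM round trip on a miss; that gap is where the latency and cost savings come from.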
