What’s In/Out for LLMs 2024: A new year’s guide for app developers

Team Octo · Published in Towards AI · Jan 8, 2024

If the pace of AI innovation is anything like last year, we’re all in for a wild ride in 2024. Here is our guide for AI app developers to toss out the “old” way of doing things (wayyyyy back, like, four months ago) and go boldly into the future.

Out: Defaulting to ChatGPT

Last year, ChatGPT took the world by storm and quickly became the first choice for developers building text-gen applications. But open source models like Llama2 and Mixtral have emerged as real alternatives, and more are in store for 2024. Locking into OpenAI (or competitors like Anthropic and Cohere) leaves you on the sidelines of the most exciting advances in open source AI.

In: Open Source Experimentation

Until December 2023, open-source mixture-of-experts (MoE) models didn’t exist. In the two weeks since Mistral launched the first, Mixtral 8x7B, new MoE models have been popping up everywhere, occupying the top spots on the HuggingFace open LLM leaderboard as of Jan 3, 2024 (probably more by the time you read this).

What this really highlights is the rate of change, and how vital it is to stay open and flexible to new, emerging models. The MoE approach, which entered mainstream open LLM efforts only weeks ago, is already the dominant approach driving progress. Just a few months back, it was Llama2, and six months ago, it was Falcon. In a matter of months, we have seen smaller models, new architectures, and new models leap-frog incumbent approaches and fundamentally raise the bar.

AI developers who want to stay ahead of the curve must position themselves to evaluate new models as they emerge, so they can tap into the momentum and progress in this space.

Out: Mega-Model Wrappers

Building a demo against one proprietary model can be an easy starting point, but you can quickly run into friction as you add production features. Customers often realize that the “walled garden” nature of adjacent tools (e.g., moderation models), prompt engineering work (e.g., pre-packaged completion templates and automation), and programmatic interfaces (e.g., SDKs and APIs) can limit the extensibility and scalability of these early projects.

Plus, you may not need the biggest, baddest (most expensive) model to get certain jobs done. In many cases, a smaller, fine-tuned, open-source LLM will work great and cost much less.

An example of CodeLlama generating a similar response to GPT 3.5 faster, for less money
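
To illustrate how little code this kind of swap can take, here is a minimal sketch that points the standard OpenAI Python client at an OpenAI-compatible endpoint serving an open-source model. The base URL and model name below are placeholders, not a specific product.

```python
# Minimal sketch: reuse the OpenAI client against an OpenAI-compatible
# endpoint that serves an open-source model. The base_url and model name
# are placeholders -- substitute whatever provider or self-hosted server you use.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-inference-endpoint.example/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="codellama-13b-instruct",  # placeholder open-source model name
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```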

In: Adaptable Model Pipelines

The alternative is to create pipelines that mix models with different capabilities and strengths. These integrated architectures can deliver value that is far greater than the sum of the parts.

The key is to build in flexibility early in the adoption process. While this may not be a day 0 priority for many projects, the earlier you consciously consider and prioritize flexibility, the easier and more agile the journey will be as you and your team build on generative AI.

Technologies like LangChain, LlamaIndex, Pinecone, and Unstructured are making it easier to construct these flexible pipelines. Building on best-of-breed components in the ecosystem can accelerate development and reduce the internal time spent maintaining and upgrading the plumbing your application needs.
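
As a rough illustration of building in that flexibility, here is a framework-agnostic sketch (plain Python, not tied to any of the libraries above) that hides each model behind a small interface so stages can be swapped as better models emerge. The class and function names are illustrative.

```python
# Sketch of a swappable-model pipeline: each stage depends on a small
# interface rather than a specific vendor SDK, so models can be replaced
# without rewriting the pipeline. Names here are illustrative.
from typing import Protocol


class TextModel(Protocol):
    def generate(self, prompt: str) -> str: ...


class SummarizerStage:
    def __init__(self, model: TextModel):
        self.model = model

    def run(self, document: str) -> str:
        return self.model.generate(f"Summarize the following document:\n\n{document}")


class ClassifierStage:
    def __init__(self, model: TextModel):
        self.model = model

    def run(self, summary: str) -> str:
        return self.model.generate(
            f"Label this summary as 'support', 'sales', or 'other':\n\n{summary}"
        )


def run_pipeline(document: str, summarizer: SummarizerStage, classifier: ClassifierStage) -> str:
    # A small model can handle classification while a larger one summarizes;
    # either can be swapped independently when a new model raises the bar.
    return classifier.run(summarizer.run(document))
```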

Equally important are the AI/ML systems, platforms, and orchestration frameworks that serve these models. These choices determine several important attributes — like the ability to support traffic at scale, the ability to port models to new hardware or clouds, and the SLAs you can deliver to your customers. Understanding these components also lets you choose which parts you want to control, and save time and effort on pieces you need but that aren't your differentiation.

Out: Vanilla LLMs

Even the largest, most complex LLMs are trained for generalized tasks — they don’t have expertise in a given domain, and they most certainly don’t have access to your company’s data, which makes them of limited value for most business use cases. Likewise, enforcing a specific, consistent writing style in a vanilla (read: not fine-tuned) LLM application is difficult and requires heavy-handed prompt engineering that jams up the context window and slows performance.

In: LLMs with Context

Retrieval Augmented Generation (RAG) is the easiest way to “augment” an LLM’s knowledge with external data (think: product docs), making it much more useful for most applications outside of general-purpose chat or summarization. By supplying relevant context and factual information to the LLM, RAG makes for more accurate responses (even allowing it to cite sources), improves auditability and transparency, and enables end users to access the same source documents used by the LLM during answer creation.

Basic RAG architecture (credit: LangChain)
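
To make the idea concrete (this is a simplified sketch, not the LangChain architecture in the diagram), a minimal RAG loop looks something like this; embed() and generate() are placeholders for whichever embedding model and LLM you plug in.

```python
# Minimal RAG sketch: retrieve the most relevant document chunks for a
# question, then pass them to the LLM as context. embed() and generate()
# are placeholders for whatever embedding model and LLM you actually use.
import numpy as np


def retrieve(question: str, chunks: list[str], embed, top_k: int = 3) -> list[str]:
    q_vec = embed(question)
    scored = sorted(
        chunks,
        key=lambda c: float(np.dot(embed(c), q_vec)),  # dot-product similarity on normalized vectors
        reverse=True,
    )
    return scored[:top_k]


def answer(question: str, chunks: list[str], embed, generate) -> str:
    context = "\n\n".join(retrieve(question, chunks, embed))
    prompt = (
        "Answer the question using only the context below, and cite the passage you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```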

In: LLMs with Style

Where RAG excels at grounding LLMs in facts, fine-tuning excels at applying a specific style (e.g., making an LLM write like a lawyer). Fine-tuning your own LLM might not be high on your January bucket list, but leveraging existing fine-tunes from the community is a practical entry point. Model hubs are loaded with OSS fine-tunes you can build on to mimic real-world experiences (like chatting with a real, live customer service rep), as shown in the sketch below.
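
If you want to try a community fine-tune rather than training your own, pulling one off a model hub can be as simple as the following sketch. The repo ID below is a placeholder, not a specific recommendation.

```python
# Sketch: load a community fine-tune from the Hugging Face Hub and generate
# with it. The repo ID is a hypothetical placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "someorg/mistral-7b-customer-support"  # hypothetical fine-tuned repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("My order arrived damaged. What are my options?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```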

Recent studies have also shown that fine-tuning smaller models (like Mistral 7B) can even provide superior quality to larger models when applied to the right use case and fine-tuned with the right data sets. This can result in order-of-magnitude improvements in performance and cost, both crucial as applications scale to address broader needs and higher volumes. Even applications that are fundamentally augmenting data using RAG benefit from fine-tuning, with approaches like question decomposition improving the overall quality and effectiveness of using context data in a generation.

NVIDIA’s Fine-Tuned Llama2 Decomposition Q&A Bot
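
Question decomposition itself is straightforward to sketch: ask a (possibly fine-tuned) model to break a complex question into sub-questions, retrieve context for each, and then answer. The generate() and retrieve() functions below are placeholders for your LLM and retriever.

```python
# Sketch of question decomposition for RAG: split a complex question into
# simpler sub-questions, retrieve context per sub-question, then answer.
# generate() and retrieve() stand in for your LLM and retriever.
def decompose(question: str, generate) -> list[str]:
    raw = generate(
        "Break the following question into 2-4 simpler sub-questions, one per line:\n" + question
    )
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]


def answer_with_decomposition(question: str, generate, retrieve) -> str:
    context_parts = []
    for sub in decompose(question, generate):
        context_parts.extend(retrieve(sub))  # retrieve() returns a list of relevant chunks
    context = "\n\n".join(context_parts)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```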

In: Multimodal Model Cocktails

If individual models are becoming more powerful by the day (not to mention more capable thanks to RAG and fine-tuning), you can make multimodal magic when you combine them to mix the perfect “model cocktail.”

One of our favorite use cases is an image-to-image pipeline that sneakily leverages LLMs to enhance image output: the LLM generates detailed prompts that automatically produce unique SDXL images. LLMs like Llama2 and Mixtral are great for image prompting at scale, because they think fast on their feet (relative to us humans).

Typically, these are prompts a user would have to think up on their own and then manually enter into SDXL. By relying instead on an LLM, you can quickly explore a range of generation ideas based on a single keyword. The whole process of creating a gallery is drastically accelerated, with minimal user intervention.

A product-image gallery generated on the fly with a multimodal model cocktail of Llama2, CLIP, and SDXL
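
A simplified text-to-image version of that flow, assuming a Hugging Face diffusers SDXL pipeline and a placeholder expand_prompts() call for whatever LLM you use, looks something like this.

```python
# Sketch of a "model cocktail": an LLM expands a single keyword into several
# detailed image prompts, and SDXL renders each one. expand_prompts() is a
# placeholder for your LLM call (Llama2, Mixtral, etc.).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")


def expand_prompts(keyword: str) -> list[str]:
    # Placeholder: ask your LLM for several distinct, detailed SDXL prompts
    # built around the keyword, and return them as a list of strings.
    raise NotImplementedError


def generate_gallery(keyword: str) -> list:
    images = []
    for prompt in expand_prompts(keyword):
        images.append(pipe(prompt=prompt).images[0])
    return images
```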

Since every January tech blog is legally obligated to include predictions for the coming year, here are some things to be excited about in 2024:

  • The growing utility of smaller, fine-tuned models
  • More open-source mixture-of-experts models
  • Function-calling enhancing real-world LLM applications
  • The potential for local LLMs with projects like MLC-LLM
  • Indemnified models hitting the market
  • Private LLMs — bring your GenAI secret sauce in-house for privacy, security and speed


Thoughts on machine learning, app dev, and the future of AI from the engineers at octo.ai