OpenAI Releases GPT-3 Embeddings Model: text-embedding-ada-002

It is powerful, cheaper, and more flexible!

Mandar Karhade, MD. PhD.
Published in Towards AI · 5 min read · Dec 16, 2022


What you must know (TLDR):

OpenAI just announced text-embedding-ada-002. This model replaces the five previous best-performing embedding models and is available today through the embeddings API. The endpoint is /embeddings:

curl https://api.openai.com/v1/embeddings \
-X POST \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": "The food was delicious and the waiter...",
"model": "text-embedding-ada-002"}'

The response looks like this:

{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [
0.0023064255,
-0.009327292,
.... (1536 floats total for ada-002)
-0.0028842222,
],
"index": 0
}
],
"model": "text-embedding-ada-002",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
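
If you prefer Python over curl, the same request can be made with the requests library. This is just a minimal sketch; it assumes your secret key is stored in the OPENAI_API_KEY environment variable.

# Minimal Python equivalent of the curl call above.
# Assumes the `requests` package is installed and OPENAI_API_KEY is set.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "input": "The food was delicious and the waiter...",
        "model": "text-embedding-ada-002",
    },
)
resp.raise_for_status()
payload = resp.json()

embedding = payload["data"][0]["embedding"]  # list of 1536 floats
print(len(embedding), payload["usage"]["total_tokens"])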

What are embeddings?

An embedding is a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded, making it robust for many industry applications. Read more about it here.

What is the use case for OpenAI’s embeddings?

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are most commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)
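
To make the search use case concrete, here is a minimal sketch that ranks documents by cosine similarity between their embeddings and a query embedding. The function names are my own; the vectors themselves would come from the /embeddings endpoint shown above.

# Rank documents by cosine similarity of their embeddings to a query embedding.
# The vectors are plain lists of floats returned by the /embeddings endpoint.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_embedding, doc_embeddings):
    # Return document indices sorted from most to least relevant.
    scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

The same similarity score also drives clustering, recommendations, and anomaly detection; only what you do with the scores changes.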


New Model Outperforms, Is Cheaper, Is Smaller!!

text-embedding-ada-002 outperforms all the old embedding models on text search, code search, and sentence similarity tasks, and achieves comparable performance on text classification. For each task category, OpenAI evaluated the models on the same datasets used to benchmark the old embeddings.

OpenAI has significantly simplified the interface of the /embeddings endpoint by merging five separate model families into one:

  1. text-similarity
  2. text-search-query
  3. text-search-doc
  4. code-search-text
  5. code-search-code

These models are merged into a single new model. This single representation performs better than the previous embedding models across a diverse set of text search, sentence similarity, and code search benchmarks.

What are the changes in the newer models?

Longer context: The context length of the new model is increased by a factor of four, from 2048 to 8192, making it more convenient to work with long documents.

Smaller embedding size. The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost-effective in working with vector databases.

Reduced price. OpenAI has reduced the price of the new embedding model by 90% compared to old models of the same size. The new model achieves better or similar performance to the old Davinci models at a 99.8% lower price.
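
As a quick sanity check on these numbers, here is a small sketch that verifies an input fits in the new 8192-token context before embedding it. It assumes the tiktoken package and its cl100k_base encoding, which is the tokenizer used by text-embedding-ada-002.

# Check an input against the new 8192-token context limit before embedding it.
# Assumes `pip install tiktoken`; cl100k_base is the ada-002 tokenizer.
import tiktoken

MAX_CONTEXT = 8192     # new context length (was 2048)
EMBEDDING_DIM = 1536   # new embedding size (davinci-001 was 12288)

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(text: str) -> bool:
    return len(enc.encode(text)) <= MAX_CONTEXT

print(fits_in_context("A short document."))  # True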


Overall, the new embedding model is a much more powerful tool for natural language processing and code tasks at a lower price. It will be exciting to see how customers use it to build even more capable applications in their respective fields.

But how is the performance?

Simply put: better. But I will let the numbers talk.

[Benchmark charts comparing the old and new models on text search, code search, sentence similarity, and text classification. Source: OpenAI website]

What are the limitations?

  1. The new text-embedding-ada-002 model does not outperform text-similarity-davinci-001 on the SentEval linear-probing classification benchmark.
  2. For tasks that require training a lightweight linear layer on top of embedding vectors for classification, OpenAI suggests comparing the new model to text-similarity-davinci-001 and choosing whichever gives optimal performance (a minimal linear-probing sketch follows this list).
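
This is not OpenAI's official evaluation code, just a sketch of the comparison described above using scikit-learn: fit a simple logistic-regression probe on precomputed embeddings from each model (X_ada and X_davinci are hypothetical arrays of embeddings for the same labeled texts) and keep whichever scores higher.

# Linear probing: train a lightweight linear classifier on top of embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_score(X, y):
    # Mean cross-validated accuracy of a linear classifier on embeddings X.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()

# score_ada = linear_probe_score(X_ada, y)          # embeddings from ada-002
# score_davinci = linear_probe_score(X_davinci, y)  # embeddings from davinci-001
# Choose the model with the higher probe score.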

Check the Limitations & Risks section in the embeddings documentation for general limitations of OpenAI's embedding models.

Can I test-run it?

Of course, you can. Here is the code for running a sample pipeline against the embeddings API.

First, install the OpenAI library in your Python environment.

pip install --upgrade openai

The library needs to be configured with your account’s secret key, which is available on the OpenAI website. In the code below, the secret key (a string starting with “sk-…”) is assigned to openai.api_key.
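
A minimal version of that pipeline, using the openai Python library's Embedding endpoint (replace the placeholder key with your own), looks like this:

# Sample pipeline: configure the secret key, request an embedding,
# and pull the vector out of the response.
import openai

openai.api_key = "sk-..."  # your secret key from the OpenAI website

response = openai.Embedding.create(
    input="The food was delicious and the waiter...",
    model="text-embedding-ada-002",
)

embedding = response["data"][0]["embedding"]
print(len(embedding), "dimensions")  # 1536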

How Much Does It Cost?

The cost for this model is the lowest of all the previous models, a mere $0.0004/1K tokens. This price reflects the Ada-sized model; at last year’s prices, a Davinci-sized model would cost more than $0.1/1K tokens. Also, the performance-to-price ratio is wonky, to say the least. Please check out Neil Reimer’s blog article on OpenAI’s embeddings for more.
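
To put that price in perspective, here is a rough back-of-the-envelope estimate (my arithmetic, not OpenAI's):

# Back-of-the-envelope cost estimate at $0.0004 per 1K tokens.
PRICE_PER_1K_TOKENS = 0.0004  # USD, ada-002 pricing at launch

def embedding_cost(total_tokens: int) -> float:
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(embedding_cost(1_000_000))   # $0.40 for one million tokens
print(embedding_cost(10_000_000))  # $4.00 for ten million tokens

At these rates, embedding even a fairly large corpus costs only a few dollars.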

