OpenAI Releases GPT-3 Embeddings Model: text-embedding-ada-002

It is powerful, cheaper, and more flexible!

Mandar Karhade, MD. PhD.
Published in Towards AI · 5 min read · Dec 16, 2022


What you must know (TLDR):

OpenAI just announced text-embedding-ada-002. This model replaces the five previous best-performing embedding models and is available today through the embeddings API. The endpoint is /embeddings:

curl https://api.openai.com/v1/embeddings \
-X POST \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": "The food was delicious and the waiter...",
"model": "text-embedding-ada-002"}'

The response looks like this:

{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [
0.0023064255,
-0.009327292,
.... (1536 floats total for ada-002)
-0.0028842222,
],
"index": 0
}
],
"model": "text-embedding-ada-002",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
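
If you prefer Python over curl, the same request can be made with the requests library. This is just a minimal sketch; it assumes your secret key is stored in the OPENAI_API_KEY environment variable.

# Minimal Python equivalent of the curl call above.
# Assumes the `requests` package is installed and OPENAI_API_KEY is set.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "input": "The food was delicious and the waiter...",
        "model": "text-embedding-ada-002",
    },
)
resp.raise_for_status()
payload = resp.json()

embedding = payload["data"][0]["embedding"]  # list of 1536 floats
print(len(embedding), payload["usage"]["total_tokens"])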

What are embeddings?

An embedding is a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded, making it robust for many industry applications. Read more about it here.

What is the use case for OpenAI’s embeddings?

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are most commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)
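
To make the search use case concrete, here is a minimal sketch that ranks documents by cosine similarity between their embeddings and a query embedding. The function names are my own; the vectors themselves would come from the /embeddings endpoint shown above.

# Rank documents by cosine similarity of their embeddings to a query embedding.
# The vectors are plain lists of floats returned by the /embeddings endpoint.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_embedding, doc_embeddings):
    # Return document indices sorted from most to least relevant.
    scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

The same similarity score also drives clustering, recommendations, and anomaly detection; only what you do with the scores changes.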


New Model Outperforms, Is Cheaper, Is Smaller!!

text-embedding-ada-002 outperforms all the old embedding models on text search, code search, and sentence similarity tasks, and achieves comparable performance on text classification. For each task category, OpenAI evaluated the models on the same datasets used to benchmark the old embeddings.

OpenAI has significantly simplified the interface of the /embeddings endpoint by merging five separate model families into one:

  1. text-similarity
  2. text-search-query
  3. text-search-doc
  4. code-search-text
  5. code-search-code

These models are merged into a single new model. This single representation performs better than the previous embedding models across a diverse set of text search, sentence similarity, and code search benchmarks.

What are the changes in the newer models?

Longer context: The context length of the new model is increased by a factor of four, from 2048 to 8192, making it more convenient to work with long documents.

Smaller embedding size. The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost-effective in working with vector databases.

Reduced price. OpenAI has reduced the price of the new embedding model by 90% compared to old models of the same size. The new model achieves better or similar performance to the old Davinci models at a 99.8% lower price.
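
As a quick sanity check on these numbers, here is a small sketch that verifies an input fits in the new 8192-token context before embedding it. It assumes the tiktoken package and its cl100k_base encoding, which is the tokenizer used by text-embedding-ada-002.

# Check an input against the new 8192-token context limit before embedding it.
# Assumes `pip install tiktoken`; cl100k_base is the ada-002 tokenizer.
import tiktoken

MAX_CONTEXT = 8192     # new context length (was 2048)
EMBEDDING_DIM = 1536   # new embedding size (davinci-001 was 12288)

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(text: str) -> bool:
    return len(enc.encode(text)) <= MAX_CONTEXT

print(fits_in_context("A short document."))  # True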


Overall, the new embedding model is a much more powerful tool for natural language processing and code tasks at a lower price. It will be exciting to see how customers use it to build even more capable applications in their respective fields.

But how is the performance?

Simply put: better. But I will let the numbers talk.

[Benchmark charts comparing the old and new models on text search, code search, sentence similarity, and text classification. Source: OpenAI website]

What are the limitations?

  1. The new text-embedding-ada-002 model does not outperform text-similarity-davinci-001 on the SentEval linear-probing classification benchmark.
  2. For tasks that require training a lightweight linear layer on top of embedding vectors for classification, OpenAI suggests comparing the new model to text-similarity-davinci-001 and choosing whichever gives optimal performance (a minimal linear-probing sketch follows this list).
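
This is not OpenAI's official evaluation code, just a sketch of the comparison described above using scikit-learn: fit a simple logistic-regression probe on precomputed embeddings from each model (X_ada and X_davinci are hypothetical arrays of embeddings for the same labeled texts) and keep whichever scores higher.

# Linear probing: train a lightweight linear classifier on top of embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_score(X, y):
    # Mean cross-validated accuracy of a linear classifier on embeddings X.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()

# score_ada = linear_probe_score(X_ada, y)          # embeddings from ada-002
# score_davinci = linear_probe_score(X_davinci, y)  # embeddings from davinci-001
# Choose the model with the higher probe score.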

Check the Limitations & Risks section in the embeddings documentation for general limitations of OpenAI's embedding models.

Can I test-run it?

Of course, you can. Here is the code for running a sample pipeline against the embeddings API.

First, install the OpenAI library in your Python environment.

pip install --upgrade openai

The library needs to be configured with your account’s secret key, which is available on the OpenAI website. In the code below, the secret key (a string starting with “sk-…”) is assigned to openai.api_key.
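
A minimal version of that pipeline, using the openai Python library's Embedding endpoint (replace the placeholder key with your own), looks like this:

# Sample pipeline: configure the secret key, request an embedding,
# and pull the vector out of the response.
import openai

openai.api_key = "sk-..."  # your secret key from the OpenAI website

response = openai.Embedding.create(
    input="The food was delicious and the waiter...",
    model="text-embedding-ada-002",
)

embedding = response["data"][0]["embedding"]
print(len(embedding), "dimensions")  # 1536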

How Much Does It Cost?

The cost for this model is the lowest of all the previous models, a mere $0.0004/1K tokens. This price reflects the Ada-sized model; at last year’s prices, a Davinci-sized model would cost more than $0.1/1K tokens. Also, the performance-to-price ratio is wonky, to say the least. Please check out Neil Reimer’s blog article on OpenAI’s embeddings for more.
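
To put that price in perspective, here is a rough back-of-the-envelope estimate (my arithmetic, not OpenAI's):

# Back-of-the-envelope cost estimate at $0.0004 per 1K tokens.
PRICE_PER_1K_TOKENS = 0.0004  # USD, ada-002 pricing at launch

def embedding_cost(total_tokens: int) -> float:
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(embedding_cost(1_000_000))   # $0.40 for one million tokens
print(embedding_cost(10_000_000))  # $4.00 for ten million tokens

At these rates, embedding even a fairly large corpus costs only a few dollars.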

