GAIA: Redefining AI Assistant Evaluation

Justin Trugman
Published in Towards AI
Apr 30, 2024

We all appreciate the wonders of artificial intelligence, and AI agents as well as Multi-Agent Systems promise even greater capabilities, right? But how can we be sure of their effectiveness? Benchmarking plays a critical role in this context — it’s essential for establishing measurable standards and criteria to reliably evaluate these technologies.

However, not all benchmarks are created equal. Many can be limited in scope, overly simplistic, or fail to capture the nuances of real-world AI applications. This is where the GAIA benchmark stands out. GAIA is different because it’s designed to test AI systems through a series of tedious, multi-modal tasks that mimic the challenges encountered in everyday scenarios. Unlike conventional benchmarks that often test for narrow abilities, GAIA assesses a broader range of skills including the use of tools, complex reasoning, and the ability to browse the web effectively.

Why is this important? In our rapidly advancing age of AI, it’s crucial to have a benchmark that not only tests AI systems under laboratory conditions but also evaluates their readiness for practical deployment. Let’s dive into how GAIA works, why it represents a significant evolution in AI benchmarking, and why understanding this is crucial for anyone keeping track of AI advancements.

So, what is GAIA?

General AI Assistants (GAIA) is a novel benchmark for evaluating an AI system’s proficiency in tool use, complex reasoning, multi-modality, and web browsing, created through a research collaboration among Meta’s FAIR and GenAI teams, HuggingFace, and AutoGPT. Unlike traditional benchmarks, GAIA consists of 466 multi-step questions that are conceptually straightforward but tedious for humans and genuinely challenging for AI systems. On average, humans can complete 92% of the benchmark, while the current best-performing AI system (Microsoft AI Frontiers’ AutoGen multi-agent framework) has achieved only 32.33%.

Why is GAIA Important?

As the complexity of tasks assigned to AI increases, human-evaluated benchmarks become increasingly impractical. This is particularly evident when outputs, such as lengthy texts or advanced mathematical solutions, surpass the understanding of most human evaluators. To address this challenge, benchmarks like GAIA are redefining how we evaluate AI systems. GAIA focuses on tasks that, while conceptually straightforward, require the AI to navigate extensive multi-modal actions to produce a verifiable output. This ensures that the tasks are rooted in practical scenarios where the complexity lies in the execution rather than in the verification of the result. By focusing on such tasks, GAIA establishes a robust framework for evaluating AI capabilities that mirrors real-world applications of AI assistants and tests their readiness for practical deployment. It is also scalable, allowing new questions to be added and existing tasks to be adapted in the future.
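To make the idea of verifiable outputs concrete, here is a minimal sketch of what a GAIA-style evaluation loop can look like. It assumes access to the gated gaia-benchmark/GAIA dataset on HuggingFace; the config, split, and field names are assumptions that may differ from the actual release, and `my_assistant` is a placeholder for whatever system is being tested.

```python
# A minimal sketch of a GAIA-style evaluation loop. The dataset config, split,
# and field names below are assumptions and may differ from the actual release;
# my_assistant() is a placeholder for whatever AI system is under test.
from datasets import load_dataset

def my_assistant(question: str) -> str:
    """Placeholder: run your agent or multi-agent system and return its final answer."""
    raise NotImplementedError

def normalize(answer: str) -> str:
    # GAIA is scored by quasi-exact match against a short, verifiable answer,
    # so predictions are lightly normalized before comparison.
    return str(answer).strip().lower()

dataset = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

correct = 0
for task in dataset:
    prediction = my_assistant(task["Question"])
    correct += normalize(prediction) == normalize(task["Final answer"])

print(f"Accuracy: {correct / len(dataset):.2%}")
```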

Additionally, the GAIA benchmark provides a pathway for evaluating AI systems through the lens of t-AGI, which compares how long an AI takes to complete a task with how long a human takes. Humans typically complete the simplest GAIA tasks in about 6 minutes and the most complex in about 17 minutes, which gives GAIA a tangible temporal standard for AI performance. Yet even with advanced tools, GPT-4 achieves a success rate of no more than 30% on the simplest tasks and fails entirely on the hardest ones, contrasting sharply with the average human success rate of 92%. This disparity makes GAIA a useful lens for measuring how current AI systems compare to humans in both understanding and operational speed.

This temporal metric allows for a direct comparison of efficiency between AI and human problem-solving abilities. A system capable of solving GAIA tasks within these time frames demonstrates not only proficiency in handling tasks of varying complexity but also efficiency akin to or surpassing human capabilities. Such a system would mark a significant milestone in the development of AI, showcasing its potential to operate within real-world time constraints and providing a clear measure of its readiness for practical applications.
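As a rough illustration of that temporal lens, the snippet below times a solver on a task and compares it to a human baseline. The 6 and 17 minute figures come from the averages quoted above; `solve_task` and the surrounding helpers are hypothetical stand-ins rather than anything defined by the GAIA paper.

```python
import time

# Human baselines quoted above: roughly 6 minutes for the simplest GAIA tasks
# and roughly 17 minutes for the most complex ones.
HUMAN_BASELINE_MINUTES = {"simplest": 6.0, "most_complex": 17.0}

def time_solver(solve_task, question: str) -> float:
    """Run a (hypothetical) solver on one question and return the minutes it took."""
    start = time.perf_counter()
    solve_task(question)
    return (time.perf_counter() - start) / 60.0

def speedup_vs_human(ai_minutes: float, human_minutes: float) -> float:
    """Values above 1.0 mean the AI finished faster than the average human."""
    return human_minutes / ai_minutes
```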

Levels of GAIA Questions

The questions within the GAIA benchmark are divided into three levels of increasing complexity. Level 1 questions usually require no tools or just one tool and involve no more than five steps. Level 2 questions involve roughly five to ten steps and necessitate the use of multiple tools. Level 3 questions are designed for assistants that can help with very laborious tasks, requiring any number of tools, lengthy sequences of actions, and comprehensive access to information. This tiered structure makes it easy to evaluate an AI system’s ability to solve problems at varying levels of sophistication.
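Because every GAIA task carries its level, results are usually reported per level as well as overall. The snippet below shows one way to do that; the `results` records are illustrative placeholders standing in for the output of an evaluation loop like the one sketched earlier.

```python
from collections import defaultdict

# Illustrative scored records: each entry holds the GAIA level of a task and
# whether the system's answer matched the reference answer.
results = [
    {"level": 1, "correct": True},
    {"level": 2, "correct": False},
    {"level": 3, "correct": False},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["level"]] += 1
    hits[r["level"]] += r["correct"]

for level in sorted(totals):
    print(f"Level {level}: {hits[level] / totals[level]:.0%} ({hits[level]}/{totals[level]})")
```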

Example GAIA Questions

A level 1 GAIA question is the following: “What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?”

How would you solve this question?

Most humans would take the following steps (a rough sketch of how an AI assistant might automate the same loop appears after the list):

  1. Search: Begin by entering a relevant keyword into a search engine like Google, which they believe will yield the correct results.
  2. Open Webpages: Open the web pages that appear in the search results list.
  3. Review Information: Carefully review the content of the webpages to locate the specific information needed.
  4. Use of Tools: Utilize a web browser throughout the process, leveraging its features to navigate and retrieve information effectively.
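Below is a rough sketch of that loop as an AI assistant might execute it with tools. The `search_web` helper is a hypothetical stand-in for whatever search tool the system exposes, and the keyword check is deliberately naive, so treat this as an illustration of the pattern rather than a working GAIA solver.

```python
import requests

def search_web(query: str) -> list[str]:
    """Hypothetical search tool: return candidate URLs for a query.
    A real agent would call an actual search API here."""
    raise NotImplementedError

def find_enrollment_page() -> str | None:
    # Step 1: search with a relevant keyword phrase.
    urls = search_web("H. pylori acne vulgaris clinical trial Jan-May 2018 clinicaltrials.gov")
    # Steps 2-4: open each result and check whether it mentions an enrollment figure.
    for url in urls:
        html = requests.get(url, timeout=30).text
        if "Actual Enrollment" in html:
            # A real assistant would now read the count off this page,
            # e.g. by passing the page text to an LLM or an HTML parser.
            return url
    return None
```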

This question involves several steps and iterations, utilizing both a web browser and a search engine to acquire specific information. Now, let’s up the difficulty and explore a level 2 question.

Image source: GAIA: a benchmark for General AI Assistants

“If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.”

This image is provided for the AI system to analyze. So how would you solve this question?

Most humans would follow these steps:

  1. Search for Standards: Use a search engine like Google to find the US federal standards for butterfat content. Specifically, look for the standards as reported on the Wikipedia page from the year 2020.
  2. Open and Review Wikipedia Page: Navigate to the Wikipedia page and review the documented standards for butterfat content.
  3. Analyze Provided Image: Examine the image provided (likely a nutrition label) to determine the actual butterfat content of the ice cream.
  4. Calculate the Difference: Compute the percentage difference between the ice cream’s butterfat content and the federal standard, using the formula (Actual Content − Standard Content) / Standard Content × 100% (a worked sketch of this step follows the list).
  5. Round and Format the Result: Round the resulting number to one decimal place and format the answer as either a positive or negative percentage, indicating whether the ice cream is above or below the standard.
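The arithmetic in step 4 is straightforward once both numbers are known. The inputs below are placeholders, not the figures from the actual GAIA image or the 2020 Wikipedia article, so the printed answer is illustrative only.

```python
def butterfat_deviation(actual_pct: float, standard_pct: float) -> str:
    """Percentage difference between the pint's butterfat content and the federal
    standard, formatted the way GAIA expects: signed and rounded to one decimal."""
    diff = (actual_pct - standard_pct) / standard_pct * 100
    return f"{diff:+.1f}"

# Placeholder inputs: the real values come from the provided nutrition label and
# from the US federal standard as reported by Wikipedia in 2020.
print(butterfat_deviation(actual_pct=12.0, standard_pct=10.0))  # prints "+20.0"
```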

Notice how the questions become increasingly tedious? A Level 2 question, while manageable for humans, requires a multi-modal approach that spans several steps and tools. To solve it, a person must effectively use a search engine and a web browser, visually inspect the nutrition facts on the packaging, and perform calculations, typically with a calculator, to determine compliance with the federal standards as reported by Wikipedia in 2020.

Now on to a level 3 question: “In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.”

You can see how tedious these questions are getting. To answer this one, most humans would need to perform the following steps (an outline of the same workflow as agent code follows the list):

  1. Locate the Image: Access NASA’s website to find the Astronomy Picture of the Day for January 21, 2006.
  2. Identify the Smaller Astronaut: Examine the image to determine which astronaut appears smaller than the other.
  3. Determine the Astronaut Group: Identify the specific NASA Astronaut Group that the smaller astronaut belonged to at the time of the photo.
  4. Compile a List of Astronauts: List all the members of that particular astronaut group.
  5. Retrieve Space Time Logs: Search for official records or databases detailing the space time logged by each astronaut in the group.
  6. Find Minimum Space Time: Determine which astronaut from the list spent the least amount of time in space, ensuring to exclude any astronauts who have not spent any time in space at all.
  7. Extract Relevant Data: Note down the last name of the astronaut with the least space time, along with the exact number of minutes, rounded to the nearest minute.
  8. Format the Answer: Format your response by writing the astronaut’s last name followed by a semicolon, and then the number of minutes in space, using commas as thousands separators.
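Stitched together as an agent pipeline, the same eight steps might look like the outline below. Every helper in it is a hypothetical placeholder for a browsing, vision, or lookup capability rather than a real API, so only the control flow and the final answer formatting are meant to be taken literally.

```python
def fetch_apod_image(date: str):
    """Fetch NASA's Astronomy Picture of the Day for the given date (hypothetical)."""
    raise NotImplementedError

def identify_smaller_astronaut(image) -> str:
    """Vision step: name the astronaut who appears smaller in the photo (hypothetical)."""
    raise NotImplementedError

def astronaut_group_members(astronaut: str) -> list[str]:
    """Members of the NASA Astronaut Group the astronaut belonged to (hypothetical)."""
    raise NotImplementedError

def minutes_in_space(astronaut: str) -> float:
    """Total minutes the astronaut logged in space, 0 if they never flew (hypothetical)."""
    raise NotImplementedError

def answer_apod_question() -> str:
    image = fetch_apod_image("2006-01-21")                # Step 1: locate the image
    smaller = identify_smaller_astronaut(image)           # Step 2: visual comparison
    members = astronaut_group_members(smaller)            # Steps 3-4: group and its members
    times = {m: minutes_in_space(m) for m in members}     # Step 5: space time logs
    flown = {m: t for m, t in times.items() if t > 0}     # Step 6: exclude non-flyers
    name = min(flown, key=flown.get)                      # Step 6: least time in space
    minutes = round(flown[name])                          # Step 7: round to nearest minute
    return f"{name.split()[-1]}; {minutes:,}"             # Step 8: "Lastname; 1,234"
```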

This question is highly tedious for a human and very complex for an AI system to complete because of the sheer number of steps, the variety of tools required, and the multi-modality of those steps.

Significance of GAIA in the AI Landscape

In an era where new AI benchmarks are appearing almost as frequently as JavaScript frameworks, it’s beneficial to concentrate on those like GAIA that offer a clear connection to the practical tasks AI systems are expected to help humans with. This focus ensures that the benchmarks remain relevant and directly applicable to real-world use cases.

We look forward to seeing more AI systems tested with the GAIA benchmark. It’s not only about tracking progress — these insights are crucial for shaping future benchmarks and understanding our path toward AGI and beyond. We are keen to see what these developments will reveal about the potential and future of AI.

To learn more about GAIA, you can review the original paper on arXiv, which we heavily referenced for this article, and the current GAIA Leaderboard hosted on HuggingFace.

