Bias in Vision-Language Models with CLIP

Biases in vision-language models are widening the digital divide. Understanding these biases is key to making artificial intelligence more equitable.

Eera Bhatt
Towards AI

--

Natural language processing (NLP) and computer vision (CV) are two of the most impactful areas of artificial intelligence. However, while current models perform well overall, performance reports rarely break down how the models do on particular groups within the data. In particular, data collected from low-income households tends to be overlooked during model evaluation, which lets biases against those households go unnoticed.


Income representation. While plenty of research examines the impact of artificial intelligence on specific races and genders, much less work covers the relationship between the performance of AI models and the world’s economic inequality. Ignoring how AI performs for each socio-economic group widens the digital divide, because it effectively excludes the poor from AI’s potential benefits.

To learn more about these inequities, a group of researchers at the University of Michigan evaluated a state-of-the-art vision-language model known as Contrastive Language-Image Pre-training, or CLIP. The study’s authors measured the model’s performance on an image dataset of common household items such as toothbrushes and clothes, drawn from households with widely varying levels of income.

To narrow the economic gap that technology can widen, it is important to understand how state-of-the-art models perform not just overall, but across ALL levels of income. Using CLIP’s performance results and a deeper analysis of the image data, the researchers propose ways to democratize vision-language models so that they can benefit everyone.

Let’s back up for a second. What is CLIP? It’s an extremely popular foundation model that combines computer vision and natural language processing by embedding images and text into a shared space. Because it is trained on such a wide variety of data, it can perform several tasks out of the box, such as zero-shot image classification and image-text retrieval, and it serves as a building block for many image-generation systems. However, the data used to train models like CLIP is normally scraped from the Western-dominated web, which explains why these models can turn out biased against non-Western contexts.
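For a concrete feel of how CLIP links images and text, here is a minimal zero-shot classification sketch using the publicly available openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers library. The image path and candidate labels below are made-up placeholders, not anything from the original study.

```python
# Minimal CLIP zero-shot classification sketch (assumes: pip install transformers torch pillow).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical household-item photo and candidate text labels.
image = Image.open("toothbrush.jpg")
labels = ["a photo of a toothbrush", "a photo of a light source", "a photo of a refrigerator"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means CLIP thinks the image matches that text better.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```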

Dataset. In this study, CLIP was evaluated on the Dollar Street dataset, which contains over 38,000 images of household items from 63 countries. Every image in the dataset is annotated with two pieces of information about the household it came from: its income and its location.

What does CLIP do? Given an image of a household item from the dataset, CLIP’s job is to match it to the correct textual description of that item (for example, “toothbrush”). The model’s performance is then broken down by the household’s income and location, and it is measured with CLIP scores, which capture how closely the model’s embedding of an image aligns with the embedding of its correct text label.
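As a rough illustration of the metric, here is one way to compute a CLIP score as the cosine similarity between an image embedding and the embedding of its ground-truth label. This is a sketch of the general idea rather than the authors’ exact evaluation code, and the file name and label are placeholders.

```python
# CLIP score sketch: cosine similarity between image and text embeddings.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("toothbrush.jpg")   # placeholder image path
label = "a photo of a toothbrush"      # ground-truth item label

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=[label], return_tensors="pt", padding=True)

with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)
    txt_emb = model.get_text_features(**text_inputs)

# Normalize, then take the dot product to get cosine similarity in [-1, 1].
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
clip_score = (img_emb * txt_emb).sum(dim=-1).item()
print(f"CLIP score: {clip_score:.3f}")
```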

Results. Immediately, the researchers noticed that CLIP scores trend upward with income level. In other words, the model performed consistently worse on images from poorer households. CLIP achieved a low score (under 0.25) for the lowest income category, a group that represents roughly 20% of the world’s population.
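If you wanted to check for a similar trend in your own experiments, you could aggregate per-image CLIP scores by income bracket. The sketch below assumes a hypothetical results table with income and clip_score columns; it is not the paper’s actual data format or code.

```python
# Aggregate per-image CLIP scores by income bracket (hypothetical results file).
import pandas as pd

# Assumed columns: image_path, income (monthly household income), clip_score.
df = pd.read_csv("clip_scores_dollar_street.csv")

# Bin households into four income quartiles and average the score per bin;
# an upward trend across bins would mirror the finding reported in the paper.
df["income_bin"] = pd.qcut(df["income"], q=4, labels=["poorest", "low", "middle", "richest"])
print(df.groupby("income_bin", observed=True)["clip_score"].mean())
```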

Here is a visual representation of the CLIP scores provided by the authors:

CLIP model performance scores at each income level. Note the scores’ overall upward trend as income increases. Credits to Nwatu et al., the original authors of the paper.

Factors affecting CLIP’s performance. CLIP’s performance is affected by how differently the exact same household items can appear at each income level. For example, the light source in the image below is an open fire in the poorest category, which looks vastly different from the ceiling lights in the richest category. Even though a well-trained model can learn complicated patterns in data, this variation in appearance and image background confuses the model and causes it to perform worse on lower-income data. At the same time, this is incredibly useful information to consider when we build new models and datasets in the future!

Visually diverse data for the same household items at each income level. Credits to Nwatu et al., the original authors of the paper.

How can we make vision-language models — like CLIP — more equitable in future work? Here are some key insights provided by the researchers based on their findings:

  • Documenting data. Model creators must document the data that their model is trained on along with any known biases that the data may have. These limitations should be included in the researchers’ published work so they don’t go unnoticed.
  • Geo-diverse datasets. There is no doubt that collecting geo-diverse datasets can be expensive, but models trained on them have generally performed much better, and with fewer biases, than models trained on data scraped from the web. Part of the reason is that web-scraped datasets miss households with limited or nonexistent access to the Internet, roughly 37% of the world’s population!
  • Crowd-sourcing. One good solution is to supplement datasets through crowd-sourcing, that is, gathering data directly from a large, distributed group of people. Crowd-sourcing gives the data a wider reach and offers fairer representation to underrepresented countries.

Conclusion. This research gives us a deeper understanding of the limitations and biases that vision-language models currently face. By paying closer attention to how we build datasets and representing all income levels in them as fairly as possible, we’ll be able to help everyone benefit from this technology!

Further Reading:
