Generate Data Insights in No Time using AI

Build an AI app using Langchain to Analyze Data for Your Company

muffaddal qutbuddin
Towards AI


Image by Pete Linforth from Pixabay

One of the key challenges businesses face is gaining actionable insights critical for success. Both business leaders and data practitioners struggle to extract valuable insights from vast data pools. Non-technical team members often depend heavily on data teams to provide necessary figures, while data professionals are overwhelmed by the sheer volume of data and constant requests, leaving little room for creative problem-solving that goes beyond routine data extraction.

The great news is that this challenge can be significantly mitigated using AI. Artificial intelligence can streamline the analysis of data and accelerate the extraction of insights. In this article, I will guide you through building an AI application that your internal teams can use to expedite the insights generation process. All they need to do is pose questions to the AI, and it will deliver relevant answers.

How does this custom AI analysis app differ from using something like ChatGPT for analysis? Good question. ChatGPT is excellent for analysis, but it has limitations. Firstly, you must provide it with data in CSV format each time. Secondly, you need to ensure the data contains the answers to your questions. Lastly, it requires specific examples to handle more advanced insight queries.

In contrast, our AI application directly accesses all company data and utilizes a repository of examples to learn and generate more sophisticated analysis. This repository will expand as more team members use the application, leading to increasingly mature AI applications for your company.

Note: For demonstration purposes, I will be using BigQuery, but the principle remains the same and can be applied to any data warehouse or database.

I have included examples of how the app responds to analysis questions at the end, so make sure to check that out.

AI Application Architecture

Before we dive into the details, let’s first outline how the app functions and how it leverages AI to utilize company data effectively to produce insights.

AI Application Architecture, by Muffaddal Qutbuddin

At a high level, the process starts when a user poses a question. The AI evaluates this question to determine the necessary data requirements. It then generates the appropriate SQL query which is passed to an API, fetching the required data from the database. The fetched data is subsequently analyzed by the AI agent to produce insights.
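To make the flow concrete, here it is as a short pseudocode sketch; the function names are placeholders for the components we build in the sections below, not real APIs:

def answer_question(user_question):
    sql_query = generate_sql_with_chat_model(user_question)  # AI chat model writes the SQL
    df = fetch_data_from_warehouse(sql_query)                # the query fetches data from the database
    return analyze_with_agent(df, user_question)             # AI agent turns the data into insights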

We incorporate both AI chat models and AI agents to achieve the desired insights. Why the need for both? I will delve into this in greater detail later, but to give you a preliminary idea: the AI chat model is employed to determine if new data is needed and to generate SQL queries, while the AI agent focuses on analyzing the dataset and generating insights.

Let’s discuss the individual pieces in greater detail in the next sections.

Google Gemini API Key

For us to interact with AI models, be it OpenAI or Google, we need an API key. In this article, I will be using Gemini Pro as it offers a free API, but the process is similar if you choose to use OpenAI. The underlying principle of integrating AI through an API remains consistent across platforms.

To get the API key, simply go to ai.google.dev and click “Get API Key in Google AI Studio”.

Image by Google, source

On the AI Studio page you will be prompted with a dialog. Click “Get API Key” and follow the steps to get the key.

Image by Google, source
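With the key in hand, store it where the rest of the code can read it. Below is a minimal setup sketch; the package list is my assumption of what this walkthrough depends on, so adjust it to your environment:

# Assumed dependencies (install whatever your setup is missing):
# pip install langchain langchain-community langchain-google-genai langchain-experimental faiss-cpu google-cloud-bigquery
import os

# The snippets that follow read the key from this variable
google_api_key = os.environ.get("GOOGLE_API_KEY", "YOUR_API_KEY_HERE")
PROJECT_ID = "your-gcp-project-id"  # hypothetical placeholder; use your own GCP project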

Metadata About the Data

For the AI model to generate SQL it needs to have a comprehensive understanding of the database structure and schema of the tables. The more detailed the information we provide about the metadata, the more accurately the AI can formulate SQL queries in response to user inquiries.

For this tutorial, I will be utilizing BigQuery, which offers a method to retrieve table details. So the first step is to query the data warehouse and get all the metadata required to produce the queries.

We will have to authenticate to get the required information from BigQuery. The code below authenticates and stores the credentials in the environment so we can connect with BigQuery whenever required.

def auth_google():
    from google.colab import auth as google_auth
    google_auth.authenticate_user()

Next, we need to fetch the metadata, which is crucial for creating precise SQL queries. We’ll use Langchain’s BigQuery DataLoader to retrieve and process this essential information. Here’s how we set up and use the DataLoader to obtain the metadata we need:

from langchain_community.document_loaders import BigQueryLoader
from langchain_core.prompts import PromptTemplate, format_document

def get_table_info():
    # Pull the DDL (schema) of every base table in the dataset
    query = """
    SELECT table_name, ddl
    FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.TABLES`
    WHERE table_type = 'BASE TABLE'
    ORDER BY table_name;
    """
    loader = BigQueryLoader(
        query, metadata_columns="table_name", page_content_columns="ddl", project=PROJECT_ID
    )
    data = loader.load()
    # Join all DDL statements into a single string the prompt can consume
    content = "\n\n".join(
        format_document(doc, PromptTemplate.from_template("{page_content}"))
        for doc in data
    )
    return content
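As a quick sanity check, you can print a slice of the returned content; it should contain one CREATE TABLE DDL statement per table (my assumption based on the INFORMATION_SCHEMA query above):

content = get_table_info()
print(content[:500])  # expect CREATE TABLE statements for the thelook_ecommerce tables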

For this tutorial, we will be using thelook_ecommerce public dataset available in BigQuery. This dataset serves as an example, but you are encouraged to replace it with your own internal datasets.

Few-Shot Learning

To ensure our AI application delivers robust answers that improve over time, it must have access to example cases from which it can learn and subsequently build answers. This process of providing example questions and answers to the AI is known as few-shot learning. LangChain simplifies the process significantly.

For our AI to produce high-quality SQL queries, we will populate this repository with several example scenarios along with their corresponding SQL queries. These examples provide the foundational knowledge the AI uses to learn and adapt to the business cases. Below are the SQL query examples I’ve prepared for our e-commerce dataset:

examples = [
    {
        "input": "user input: what is the distribution of traffic sources for purchase and cancelled orders in last 30 days?",
        "query": """select traffic_source, event_type, count(distinct session_id) as users
from `bigquery-public-data.thelook_ecommerce.events`
where event_type in ("cancel","purchase")
and DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
group by 1, 2""",
    },
    {
        "input": "user input: How many users cancel the orders from city sapporo?",
        "query": """select count(distinct session_id) as users
from `bigquery-public-data.thelook_ecommerce.events`
where city = "Sapporo"
and event_type = "cancel"
and DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)""",
    },
    {
        "input": "user input: What is the average profit business get for each of the category?",
        "query": """select category, sum(retail_price), sum(cost), sum(retail_price) - sum(cost) as profit
from `bigquery-public-data.thelook_ecommerce.products`
group by 1""",
    },
    {
        "input": "user input: What is the average retail price of each of the brands?",
        "query": """select brand, avg(retail_price) as average_retail_price
from `bigquery-public-data.thelook_ecommerce.products`
group by 1""",
    },
    {
        "input": "user input: What is the status of orders in last 30 days and its sales amount?",
        "query": """select status, count(order_id) as order_count, sum(sale_price) as sale_price
from `bigquery-public-data.thelook_ecommerce.order_items`
where DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
group by 1""",
    },
    {
        "input": "user input: Which items get cancelled the most in last 30 days?",
        "query": """select pd.name as product_name, os.order_count from (
    select product_id, count(order_id) as order_count
    from `bigquery-public-data.thelook_ecommerce.order_items`
    where DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    and status = "Cancelled"
    group by 1
) as os
join (
    select id, name from `bigquery-public-data.thelook_ecommerce.products`
) as pd
on pd.id = os.product_id
order by 2 desc""",
    },
]

As you can see, our approach involves pairing the right SQL query with the corresponding user input, providing clear examples for the AI to learn from. However, as the repository of examples expands over time, it becomes impractical to include all examples in a single AI prompt due to token limitations.

Fortunately, Langchain offers a solution to this challenge. It enables the selection of only those examples that are relevant to the current user input. This method results in more efficient prompting and reduces the token count, ensuring the AI operates within its constraints. Here’s how we implement this efficient selection mechanism:

The few-shot learning process, by Muffaddal

To streamline our few-shot learning approach and manage token limitations, we will convert our list of example SQL queries and user inputs into embeddings. These embeddings represent the content in a form that AI models can quickly process and compare. By comparing the embeddings of the user’s current input with those of our examples, the AI can identify and use only the most relevant examples for generating responses. This method ensures efficient and accurate AI performance without overwhelming the system. Below is the code that demonstrates how to achieve this:

from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # This is the list of examples available to select from.
    examples,
    # This is the embedding class used to produce embeddings which are used to measure semantic similarity.
    GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=google_api_key),
    # This is the VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # This is the number of examples to produce.
    k=2,
)

The code provided employs Google’s embeddings for transforming SQL queries and user inputs into a format suitable for rapid AI processing.

We use FAISS, a library for efficient similarity search, to store these embeddings. The parameter k=2 specifies that only the two most relevant examples are returned for any given user query.
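To see the selector in action, you can call it directly; select_examples is part of LangChain’s example-selector interface and returns the k examples closest to the input:

# Returns the two examples semantically closest to the question
selected = example_selector.select_examples(
    {"input": "How many orders were cancelled last month?"}
)
for example in selected:
    print(example["input"])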

By integrating these components, we construct a final prompt that is both efficient and precise, enabling the AI to generate SQL code based on user input accurately. Below is the code that brings all these elements together:

from langchain_core.prompts import FewShotPromptTemplate

prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template("{input}\n{query}"),
    prefix="""You are a BigQuery expert. Given an input question, create a syntactically correct BigQuery query to run. Make sure the column names used in the query exist in the tables.
Here are the relevant tables and their info: {content}
Below are a number of examples of questions and their corresponding BigQuery queries.
""",
    suffix="user input: {input} \n BigQuery query: ",
    input_variables=["input", "content"],
)

Here, content is the table metadata, input is the user’s question, and prefix is an additional instruction passed to the AI as part of the prompt.
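A quick way to verify the assembled prompt is to render it with a sample question and inspect what the model will actually receive:

# Render the final prompt for a hypothetical question and inspect it
print(prompt.format(input="What is the revenue in the last 15 days?", content=content))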

Generate BigQuery Queries using Gemini Pro

We have all the pieces ready to create the SQL that can be used to fetch the data.

from langchain_google_genai import ChatGoogleGenerativeAI

def get_llm(model="gemini-pro"):
    # Low temperature keeps the generated SQL deterministic
    return ChatGoogleGenerativeAI(model=model, google_api_key=google_api_key, temperature=0.1)

The above code creates an LLM model that can be prompted with user input to get us the SQL code. But how do we pass all the metadata and example values to ensure output quality? Below is the code for that:

result = llm.invoke(prompt.format(input=new_prompt, content=content))
# Strip the markdown code fence (```sql ... ```) the model wraps around the query
sql_query = result.content.strip().strip("`").removeprefix("sql").strip()
print("new sql_query is:", sql_query)

new_prompt stores the user’s question. The prompt object passes all the metadata and example values to the LLM along with the user input to get us the query we want.

Let’s see what the AI model produces when I ask “What is the revenue in the last 15 days?”:

SQL Query output of the AI, by Muffaddal Qutbuddin

Pretty neat, right? The SQL is perfect. Our AI model was able to identify the table and column itself to arrive at the correct SQL for the user’s question. Notice it also added a 15-day time interval? It learned that from the example corpus we provided.

Get Data from BigQuery using SQL

We got the SQL we needed. It’s time to pull the data. Here is the code.

from google.cloud import bigquery as bq

client = bq.Client(project=PROJECT_ID)
df = client.query(sql_query).result().to_dataframe()  # run the generated SQL and load it into pandas
print(df.head())

The sql_query variable contains the SQL we got from the AI. We store the result in a dataframe because we will be creating an agent that is an expert at working with dataframes to answer analysis questions. Yup, we have the data, but we still haven’t answered the user’s question. The agent will do that.

LangChain Dataframe Agent

LangChain provides a dataframe agent out of the box. All we have to do is pass the LLM model and the dataframe, and the agent will handle the rest itself.

from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,
    max_iterations=5,
    handle_parsing_errors=True,
)

I have set max_iterations to 5 so it doesn’t get stuck in loops, and have set verbose to true to be able to monitor the steps the agent takes.

We have the data, we have the user query, and we have the agent. The last piece of the puzzle is to instruct the agent on what to do with the user query. Here is the code for that:

response_prompt = (
    "Following is the user query against which you need to answer the user question. "
    "Make sure the final answer is in human-readable format and covers the user query, "
    "properly formatted. User query is: " + user_input
)
agent.invoke(response_prompt)

We are passing additional instructions along with the user input to guide the agent. Note that we didn’t include examples here as we did when creating SQL queries. Example SQLs were required for the AI model to generate proper SQL; the agent’s task is to run the analysis on the provided data, so there is no need to pass example SQL here.

And we are done. Our AI application is ready to answer any question you put in front of it. It will answer questions and learn over time to improve the output quality.

The above solution works, but the only problem is that it will fetch the data every time the user asks a question. That’s an extra cost. What if we add a layer where the AI model judges for itself whether new data is required? Let’s tackle that next.

I have built this helper function that checks if new data is required and returns “fetch” or “no fetch” depending on whether data needs to be pulled.

def is_new_data_required(user_query, data, llm):
    print("Checking if new data is required to pull from bigquery \n")
    df_info = data.to_markdown(index=False)
    prompt = f"""This is the user's query that I need to answer using the data in the dataframe:
{user_query}

Your task is to decide if the data I have contains the required information or if I need to fetch it from bigquery. Say "fetch" if I need new data and say "no fetch" if the data I have can be used to answer the user query.
The data I have is as follows:
{df_info}
"""
    is_fetch = llm.invoke(prompt)
    return is_fetch.content

Simple, right? AI to the rescue. With this in place, we will only generate new SQL and fetch new data when required. Otherwise, we will just utilize the existing data the agent has.
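Usage is straightforward. One caveat: the model’s reply may carry extra whitespace or casing, so it is safer to normalize it before branching (a small defensive sketch):

# Normalize the model's verdict before comparing it
decision = is_new_data_required(user_query, df, llm).strip().lower()
if decision == "fetch":
    print("New data is required; generating a fresh query.")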

All good? Well, what if the AI model produces the wrong SQL statement? Then what?

We can also add a layer to the fetch-data step that passes error information back to the AI model when an error occurs. This way, the AI is informed not to regenerate the last SQL query since it produced errors.

Here is the updated AI code that builds the BigQuery query and also handles any error passed to it:

def build_bigquery_query(user_query, llm, prompt, content, error_query="", previous_query=""):
    print("Working to build a bigquery query for the user input \n")
    new_prompt = user_query
    if error_query != "":
        # Feed the previous failure back so the model avoids repeating it
        new_prompt = (
            "You built this query previously which resulted in an error. \n The error was: "
            + error_query
            + " And your previous query that resulted in the error was: "
            + previous_query
            + ".\n Work to rebuild the query based on the table details provided and the user query, which is: "
            + user_query
        )
    result = llm.invoke(prompt.format(input=new_prompt, content=content))
    # Strip the markdown code fence the model wraps around the query
    sql_query = result.content.strip().strip("`").removeprefix("sql").strip()
    print("new sql_query is:", sql_query)
    return sql_query

And here is the code that catches the error and passes it to the helper function above:

def get_data(user_query, llm, prompt, content):
    data = None
    sql_query = build_bigquery_query(user_query, llm, prompt, content)
    attempt_limit = 5
    attempts = 0
    while attempts < attempt_limit:
        try:
            data = fetch_data_from_bigquery(sql_query)
            break  # Exit the loop if the query ran successfully
        except Exception as e:
            print(f"An error occurred: {str(e)}")
            # Rebuild the query, passing the error and the failing SQL back to the model
            sql_query = build_bigquery_query(user_query, llm, prompt, content, str(e), sql_query)
            attempts += 1
            print("-->", attempts)
    else:
        # The while-else runs only if the loop ended without a break, i.e. every attempt failed
        print("## Max attempts reached. Exiting.")
        return pd.DataFrame()
    return data

The above function first tries to build the SQL and fetch the data. If an error is encountered, it forwards that error and the SQL to the build_bigquery_query function again to get a new SQL query. The process continues for up to 5 attempts; 5 is an arbitrary number, and you can change it to allow more attempts.
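One detail: fetch_data_from_bigquery is called above but was only shown inline earlier. A minimal standalone definition, consistent with the BigQuery client code we used before, would be:

from google.cloud import bigquery as bq

def fetch_data_from_bigquery(sql_query):
    # Run the generated SQL and return the result as a pandas dataframe
    client = bq.Client(project=PROJECT_ID)
    return client.query(sql_query).result().to_dataframe()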

And we are good to go!

Here is the final piece of code that puts it all together

import pandas as pd

df = pd.DataFrame()
agent = None

## authenticate
auth_google()

## get table info
content = get_table_info()

## ready example cases
prompt = ready_few_shot_prompts(content)

## initiate the llm model
llm = get_llm()

def ask_ai(user_query):
    global df, agent, llm
    ## check if data exists or not
    if df.empty:
        df = get_data(user_query, llm, prompt, content)
        agent = prepare_agent(llm, df)
        answer = get_answers_from_llm(agent, user_query)
    elif agent is not None:
        ## since data already exists, decide if new data is required or if the existing data can do the task
        new_data_requirements = is_new_data_required(user_query, df, llm)
        if new_data_requirements == "fetch":
            print("Yes new data is required")
            df = get_data(user_query, llm, prompt, content)
            agent = prepare_agent(llm, df)
        ## if the existing data can do the task, just run the user query against it
        answer = get_answers_from_llm(agent, user_query)
    return answer
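For completeness, ask_ai calls three helpers that are thin wrappers around code we have already written. They were not shown as standalone functions, so here is my reconstruction (a sketch; the names match the snippet above):

def ready_few_shot_prompts(content):
    # Rebuilds the FewShotPromptTemplate from the few-shot section; content is
    # accepted for symmetry with the call site, while the template itself formats
    # {content} in later, when get_data calls prompt.format(...)
    return FewShotPromptTemplate(
        example_selector=example_selector,
        example_prompt=PromptTemplate.from_template("{input}\n{query}"),
        prefix="""You are a BigQuery expert. Given an input question, create a syntactically correct BigQuery query to run. Make sure the column names used in the query exist in the tables.
Here are the relevant tables and their info: {content}
Below are a number of examples of questions and their corresponding BigQuery queries.
""",
        suffix="user input: {input} \n BigQuery query: ",
        input_variables=["input", "content"],
    )

def prepare_agent(llm, df):
    # Wraps the pandas dataframe agent setup shown earlier
    return create_pandas_dataframe_agent(
        llm, df, verbose=True, max_iterations=5, handle_parsing_errors=True
    )

def get_answers_from_llm(agent, user_input):
    # Wraps the response prompt and agent invocation shown earlier
    response_prompt = (
        "Following is the user query against which you need to answer the user question. "
        "Make sure the final answer is in human-readable format and covers the user query, "
        "properly formatted. User query is: " + user_input
    )
    return agent.invoke(response_prompt)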

AI App to Answer Analysis Questions

Let’s see how the app behaves when we ask it questions.

For the question “What is the revenue of the top 10 selling products in the last 3 months?” the below image shows the entire end-to-end steps of our custom AI app.

AI App performance, by Muffaddal Qutbuddin

Let’s see if it fetches new data or uses the same dataset when I ask “What’s the name of the top-selling product?”

AI App performance, by Muffaddal Qutbuddin

It provided accurate results without the need to pull new data. Amazing, right?

Final Thoughts

Langchain also supports SQL agents that work directly with databases. These could also be employed here; however, the problem with that approach, in my opinion, is that the agent runs SQL whenever it deems it necessary. This results in arbitrary calls and uncontrollable costs if your data resides in a pay-as-you-go warehouse such as BigQuery.

Agent performance can also be further improved by providing example analyses to it, as we did for SQL generation; see the sketch below.
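One way to do that is to prepend worked analysis examples to the instruction we already send the agent; the example itself is hypothetical, so treat this as a sketch rather than a tested recipe:

# A hypothetical worked example to steer the agent's analysis style
analysis_examples = """Example question: Which category is most profitable?
Good approach: group by category, sum profit, sort descending, report the top rows."""

response_prompt = (
    "Learn from these analysis examples:\n" + analysis_examples +
    "\nNow answer the user query in a human-readable format. User query is: " + user_input
)
agent.invoke(response_prompt)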

I hope this gave you the idea that you can take up and build such utilities for your company to provide rich insights quickly. For example, here is another use case where I built a utility to extract survey insights with a few clicks, because of course, we do lots of surveys.
