
Abdul Jamjoom
Created on April 27 at 3:28 pm · Edited on April 27 at 4:07 pm

Building a LangChain.js SQL agent for Presto
The progression of Large Language Models (LLMs) in the past few months has been inspiring, to say the least. Their impact on communication platforms can be significant if reliable and simple workflows are implemented. We shipped Tailor, our AI assistant, a few months ago and have seen amazing results thus far. We have been actively iterating on LLM-based tools ever since.
At the moment, we are exploring how agentic LLMs can improve Threads by automating complex tasks, providing feedback, and more. We wanted to better understand the benefits and limitations of these LLMs in a controlled production environment to help us gauge how such systems can be utilized in a scalable/reliable way to provide value to our users.
Our first experiment was to create an internal tool agent that you can ask questions in natural language; it queries a segment of our MySQL DB using Presto (ex. Which orgs had the most posted threads in the past 7 days? List the orgIDs...etc).
This experiment allows us to think about how users are expected to interface with these agents, what it takes to run an agent in prod, and how reliably such a seemingly simple agent works.
Below is the step-by-step process of how we implemented this SQL agent and what we've found to be its advantages and limitations.
Below is a simplified version of our InternalPrestoIntegration class, which uses the presto-client library. In prod, we add logging and error tracking around this class. runQuery is the most dangerous function here as it runs arbitrary SQL queries. We could add validation to ensure that it only runs SELECT, DESCRIBE, and SHOW TABLES queries on ALLOW_LISTED_TABLE_NAMES, but that didn't feel worth implementing right now as this is an experimental internal feature on a read-only, non-private-data DB.
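If we ever did add that validation, it could be a small pre-check in front of runQuery. A minimal sketch, assuming a naive regex scan of FROM/JOIN clauses is good enough for an internal tool (isQueryAllowed and the example allow-list entries are hypothetical, not from our codebase):

```typescript
// Hypothetical guard for runQuery: only allow read-only statement types,
// and only allow SELECTs whose tables are all allow-listed.
const ALLOW_LISTED_TABLE_NAMES: Record<string, boolean> = {
  threads: true,
  orgs: true, // illustrative entries; the real allow-list lives elsewhere
};

function isQueryAllowed(query: string): boolean {
  const normalized = query.trim().toLowerCase();
  // DESCRIBE and SHOW TABLES are always safe, read-only statements
  if (normalized.startsWith('describe ') || normalized.startsWith('show tables')) {
    return true;
  }
  // Only SELECT remains; reject everything else (INSERT, DROP, ...)
  if (!normalized.startsWith('select ')) {
    return false;
  }
  // Require every referenced table (naive FROM/JOIN scan) to be allow-listed
  const tables = [...normalized.matchAll(/\b(?:from|join)\s+([a-z0-9_]+)/g)].map(m => m[1]);
  return tables.length > 0 && tables.every(t => ALLOW_LISTED_TABLE_NAMES[t]);
}
```

This would obviously not survive adversarial SQL (subqueries, comments, quoted identifiers), but for an internal read-only DB it raises the bar cheaply.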
import presto from 'presto-client';

// FILL THIS OUT
const catalog = '<CATALOG>';
const schema = '<SCHEMA>';
const host = '<HOST_IP>';
const port = '<PORT>';

class InternalPrestoIntegration {
  __client: presto.Client;

  constructor() {
    this.getClient();
  }

  getClient() {
    if (this.__client) {
      return this.__client;
    }
    this.__client = new presto.Client({
      host,
      port,
      catalog,
      schema,
    });
    return this.__client;
  }

  async describeTable(tableName: string): Promise<string[][] | string> {
    if (!ALLOW_LISTED_TABLE_NAMES[tableName]) {
      return `The ${tableName} table is not a valid table or is not allow-listed to query. Please run the get_visible_tables tool to see a list of valid tables and add an entry to this table in TABLE_DETAILS_BY_NAME if there isn't one.`;
    }
    return (await this.runQuery(`DESCRIBE ${tableName}`)) as string[][] | string;
  }

  async showTables(): Promise<string[][] | string> {
    return (await this.runQuery(`SHOW TABLES`)) as string[][] | string;
  }

  async runQuery(query: string): Promise<unknown | string> {
    try {
      return await this.__executePrestoQuery(query);
    } catch (error) {
      return error?.message || 'error running sql query';
    }
  }

  async __executePrestoQuery(query: string): Promise<unknown> {
    const client = this.getClient();
    const queryResults: unknown[] = [];
    return new Promise((resolve, reject) => {
      client.execute({
        query,
        catalog,
        schema,
        data: (_, data) => queryResults.push(data),
        success: error => {
          if (error) {
            reject(error);
          } else {
            // results arrive in pages; flatten them into one row list
            resolve(queryResults.flat());
          }
        },
        error: error => reject(error),
      });
    });
  }
}

export { InternalPrestoIntegration };
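One subtlety worth noting: presto-client delivers results in pages through the data callback, so by the time success fires, the accumulated results are an array of row batches. The flatten at the end is what turns them back into a single list of rows:

```typescript
// Each invocation of the data callback appends one page of rows,
// so the accumulated results are an array of arrays.
const queryResults: string[][][] = [];
queryResults.push([['org_1', '42'], ['org_2', '17']]); // page 1
queryResults.push([['org_3', '9']]);                   // page 2

// Flattening one level yields the single row list the caller expects.
const rows = queryResults.flat();
// rows → [['org_1', '42'], ['org_2', '17'], ['org_3', '9']]
```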
LangChain.js setup
Unfortunately, we are on Node 16 and LangChain.js is officially only supported on Node 18. We will be upgrading our Node version in the next few weeks, but for this experiment we had to implement a temporary hack to make LangChain.js work for us: simply call the setupLangChainForNode16() function prior to importing or using any LangChain code path (more context).
import fetch, { Headers, Request, Response } from 'node-fetch';

function setupLangChainForNode16() {
  globalThis.fetch = fetch;
  globalThis.Headers = Headers;
  globalThis.Request = Request;
  globalThis.Response = Response;
}
Interface setup
We have an internal tool that allows you to chat with Tailor, our AI assistant, and trigger different agents with certain flags. For example, prefixing your message with sql triggers the sql-agent code path. We also allow you to ask follow-up questions by prefixing your message with follow-up. We are aware that this is not an ideal interface as it doesn't represent the memory the agent uses well, though it's sufficient for internally testing agentic LLMs.
Respond to this tweet if you are interested in building agents on Threads (q&a bots) or if you have use-cases you are excited to see.

Agent setup
The agent has access to three tools:
get_visible_tables → What tables do I have access to?
describe_table → How can I query this table?
run_sql_query → How do I get the data the user wants? This tool runs the SELECT query.
Additionally, the agent has memory, and you can pass it a list of older messages that it keeps in context. This is what allows us to support follow-up questions.
The code below is the skeleton of the agent; we'll explore improvements later on. If you run this, you'll see that the agent rarely runs a correctly formatted SQL query. It hallucinates tables and columns and quickly gets stuck in error states.
// import paths for the LangChain.js version current at the time of writing
import { DynamicTool } from 'langchain/tools';
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { initializeAgentExecutor } from 'langchain/agents';
import { AIChatMessage, HumanChatMessage } from 'langchain/schema';
import { BufferMemory, ChatMessageHistory } from 'langchain/memory';

async function runSqlAgent({
  input,
  messages,
}: {
  input: string;
  messages?: { type: 'human' | 'bot'; text: string }[];
}): Promise<string> {
  const __postUpdate = async (text: string) => {
    // TODO: send update to user
  };
  // remind the user how the internal tool works
  __postUpdate(
    `Running SQL queries...
* Prefix "follow-up" for follow-up questions (max 20 msg)
* Prefix "sql" for a new query (clear history)`,
  );
  const prestoIntegration = new InternalPrestoIntegration();
  const yesterday = new Date();
  yesterday.setDate(yesterday.getDate() - 1);
  const tools = [
    new DynamicTool({
      name: 'run_sql_query',
      description: `call this to run a valid SELECT presto sql query on the DB...etc`,
      func: async (query: string) => {
        __postUpdate(`[run_sql_query]: ${query}`);
        const result = JSON.stringify(await prestoIntegration.runQuery(query));
        __postUpdate(result);
        return result;
      },
    }),
    new DynamicTool({
      name: 'describe_table',
      description: 'call this to run a DESCRIBE table presto sql query...etc',
      func: async (tableName: string) => {
        __postUpdate(`[describe_table]: DESCRIBE ${tableName}`);
        const result = await prestoIntegration.describeTable(tableName);
        const response = JSON.stringify(result);
        __postUpdate(response);
        return response;
      },
    }),
    new DynamicTool({
      name: 'get_visible_tables',
      description: 'call this to get a list of all the presto sql tables you can see...etc',
      func: async () => {
        __postUpdate(`[get_visible_tables]: SHOW TABLES;`);
        const result = await prestoIntegration.showTables();
        const response = JSON.stringify(result);
        __postUpdate(response);
        return response;
      },
    }),
  ];
  const model = new ChatOpenAI({
    temperature: 0.9,
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'gpt-4',
  });
  const executor = await initializeAgentExecutor(
    tools,
    model,
    'chat-conversational-react-description',
  );
  const pastMessages =
    messages?.map(x => {
      if (x.type === 'human') {
        return new HumanChatMessage(x.text);
      }
      return new AIChatMessage(x.text);
    }) || [];
  const memory = new BufferMemory({
    chatHistory: new ChatMessageHistory(pastMessages),
    returnMessages: true,
    memoryKey: 'chat_history',
    inputKey: 'input',
  });
  executor.memory = memory;
  try {
    const result = await executor.call({
      input,
    });
    return result?.output;
  } catch (e) {
    return e.message;
  }
}
We need the agent to follow this logical flow:
1. Call get_visible_tables to understand what tables exist
2. Call describe_table on all of them to understand how to query them
3. Use the run_sql_query tool to run a select query on a table that you have already described
We told the agent to do exactly that by updating the tool descriptions to the following:
get_visible_tables → Call this to get a list of all the presto sql tables you can see. You should always use this tool prior to using any other tool.
describe_table → Call this to run a sql presto describe table query. You must pass in a tableName as input. Always call this prior to using the run_sql_query tool if you haven't done so already.
run_sql_query → Call this to run a valid sql presto select query. Do not use on arbitrary tables or columns; always validate what tables/columns you have access to prior to querying them by calling the get_visible_tables and describe_table tools if you haven't already.
If you run the agent again, you'll see that it's much better at running valid queries and returning reasonable results.
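That said, tool descriptions are soft constraints and the model can still skip steps. If stricter ordering ever matters, one option is to enforce it in the tool functions themselves, feeding an error string back to the agent instead of running the query. A minimal sketch (hypothetical, not what we shipped):

```typescript
// Track which prerequisite tools have run, and refuse run_sql_query
// until the table in question has been described.
const describedTables = new Set<string>();
let listedTables = false;

function onGetVisibleTables(): void {
  listedTables = true;
}

function onDescribeTable(tableName: string): void {
  describedTables.add(tableName);
}

// Returns an error string (fed back to the agent as the tool result)
// when the prerequisites haven't been met, or null when the query may run.
function checkRunSqlQuery(tableName: string): string | null {
  if (!listedTables) {
    return 'Call get_visible_tables before running a query.';
  }
  if (!describedTables.has(tableName)) {
    return `Call describe_table on ${tableName} before querying it.`;
  }
  return null; // ok to run
}
```

Because agents treat tool output as feedback, returning an instruction like this usually steers the model back onto the intended path.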
Tool hydration
The issue now seems to be that the agent doesn't understand what data each table contains and what some of the columns mean. It doesn't always pick the correct tables to describe, and it misuses some of the columns. This is logical, as it only has access to names/types without any description of what the tables/columns actually store. Some tables/columns are easy for the model to understand, but sometimes the names are ambiguous even to a human who isn't familiar with our stack.
Our fix was to treat the agent as a new junior engineer who has no idea what these tables/columns do. We wrote a short_description and a long_description for each table that explain ambiguous columns and, generally, what the table stores. Moreover, we added a blurb for any concepts/connections in the data we use (ex. blocks are the atomic unit of async content...etc).
We then updated the get_visible_tables tool such that it injects the short_description for each table, which contains a general blurb of the type of data the table has. We also updated the describe_table tool to inject the long_description for the queried table such that any ambiguous fields/connections are clarified. Doing this significantly improved the relevant table selection step for the agent (which tables should it call describe_table on?), thus making the final select queries more accurate.
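The hydration described above amounts to joining hand-written metadata into the tool outputs. A sketch of the shape, with illustrative descriptions and hypothetical helper names (our real TABLE_DETAILS_BY_NAME entries are more thorough):

```typescript
// Hand-written metadata, keyed by table name.
const TABLE_DETAILS_BY_NAME: Record<
  string,
  { short_description: string; long_description: string }
> = {
  threads: {
    short_description: 'One row per posted thread.',
    long_description:
      'One row per posted thread. orgID is the posting org; blockCount counts blocks, the atomic unit of async content.',
  },
};

// get_visible_tables: annotate each table name with its short description.
function hydrateTableList(tableNames: string[]): string[] {
  return tableNames.map(name => {
    const details = TABLE_DETAILS_BY_NAME[name];
    return details ? `${name}: ${details.short_description}` : name;
  });
}

// describe_table: prepend the long description to the raw DESCRIBE output.
function hydrateDescribeOutput(tableName: string, describeRows: string[][]): string {
  const details = TABLE_DETAILS_BY_NAME[tableName];
  const blurb = details ? details.long_description + '\n' : '';
  return blurb + JSON.stringify(describeRows);
}
```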
Reflection
The sql-agent proved to be useful for helping structure complex SQL queries, especially after we added the manually written descriptions. The agent still occasionally generates buggy queries and requires a bunch of follow-ups to get you what you are asking for.
Additionally, we don't do any compression right now, so after a bunch of back-and-forth, the agent fatals if the input text is too long. There are certainly some flaws in this system, and we believe that agentic LLMs will require a lot more work to be reliable enough for public production features.
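Until we add real compression, even a crude cap on history length would avoid that fatal. A sketch of the simplest version, matching the 20-message cap our interface already advertises (truncateMessages is a hypothetical helper):

```typescript
type StoredMessage = { type: 'human' | 'bot'; text: string };

// Keep only the most recent maxMessages entries so the prompt stays
// under the model's context limit. Loses old context, but never fatals.
function truncateMessages(messages: StoredMessage[], maxMessages = 20): StoredMessage[] {
  if (messages.length <= maxMessages) {
    return messages;
  }
  return messages.slice(messages.length - maxMessages);
}
```

A smarter variant would summarize the dropped prefix into a single message rather than discarding it, which is what we mean by compression.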
We plan on investing more time in agentic LLMs, exploring their benefits and limitations. We will be sharing more on this soon. In the meantime, check out our blog and follow us on Twitter for more updates.
©2022 Threads, Inc.