Learnings from the lab: Querying S&P Global’s tabular data using LLMs
Understanding complex data structures is a major challenge for generative AI use cases. Kensho is designing specialized LLM-ready APIs to tackle this problem.
Authors: Taylor Richardson, Michael Hoffmann
Imagine the time you could save if you could simply ask a dataset a natural language question and get an answer. This is now possible — it just takes a bit of work. Querying tabular data using large language models (LLMs) is a major area of focus for Kensho’s engineering team in our mission to help S&P Global and its clients accelerate insight discovery and decision making.
Situation
Although querying textual data using off-the-shelf LLMs (such as OpenAI’s GPT-3.5 and GPT-4) is feasible using vector databases, querying tabular data, the format preferred for data warehouses, remains a challenge — requiring an altogether new technical approach. Early attempts to query tabular datasets centered on using code-generating LLMs to produce structured query language (SQL), which could be used to retrieve data from tabular data stores. While this approach has shown promise for relatively small data schemas with few relations, it’s demonstrably difficult to scale to large, enterprise-grade tabular datasets with 15 or more tables.
Complication
High-value SQL datasets are often normalized or hyper-normalized (consisting of many narrowly focused, highly interrelated tables), thus requiring complex queries spanning multiple tables to answer questions. Preliminary experiments with S&P Global Market Intelligence’s datasets available via Xpressfeed™ indicated that off-the-shelf LLMs were unable to answer questions reliably when the required queries spanned more than three tables. The SQL needed to answer these questions proved too complex for the LLMs’ limited “attentional” bandwidth, resulting in incorrect data retrieval and further negative knock-on effects, such as diminished natural language understanding of analytical questions. These observations demonstrated the need for an alternative approach to LLM-based tabular data querying.
Solution hypothesis
Kensho is actively exploring several solutions to this problem. Fortunately, one of our earliest attempts found immediate success. First, we developed an “LLM-Ready” RESTful API that offers the LLM a simplified abstraction of the underlying relational tabular data. Second, we iterated on an LLM prompt that carefully balanced the understanding of the API structure with the code-generating characteristics of the LLM.
Kensho created such an “LLM-Ready” API for S&P Global Market Intelligence’s Transactions dataset. This LLM-Ready API allows users to ask for and receive sophisticated insights on primary and secondary market business transactions using natural language queries like, “What were the biggest M&A deals in the technology industry that took place in Germany in 2021?” With the appropriate filtering, not only were we able to return a specific data point and transform it into a natural language answer, but also to provide a snapshot of the underlying data table and the generated code for auditing purposes.
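To make this concrete, below is a minimal sketch of the kind of code such a system might generate against a filter-based wrapper for a question like the one above. The client class, method, parameter, and column names are hypothetical illustrations, not the actual Kensho API.

    # Hypothetical sketch of LLM-generated code calling a filter-based,
    # "LLM-Ready" client for the Transactions dataset. TransactionsClient,
    # get_transactions, and all parameter/column names are illustrative
    # assumptions, not the real interface.
    import pandas as pd
    from transactions_api import TransactionsClient  # hypothetical wrapper

    client = TransactionsClient()

    # "What were the biggest M&A deals in the technology industry
    #  that took place in Germany in 2021?"
    deals: pd.DataFrame = client.get_transactions(
        transaction_type="M&A",
        industry="Information Technology",
        country="Germany",
        year=2021,
    )

    # Rank by deal size; the top rows feed the natural language answer, and
    # the full frame is kept as a data snapshot for auditing.
    top_deals = deals.sort_values("transaction_value", ascending=False).head(5)
    print(top_deals[["target_company", "buyer_company", "transaction_value"]])

The generated code and the retrieved rows can then be surfaced alongside the natural language answer, which is what enables the auditing described above.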
LLMs are well suited to using APIs to enhance their functionality. This retrieval-augmented generation (RAG) approach to tabular data provides precision and transparency where LLMs’ pre-training data corpus alone cannot. It also provides ease of use for business users for whom traditional querying methods are inaccessible. Although there are broadly accepted best practices for API design, these practices don’t always align with the needs and potential of LLMs. Kensho is working to develop a new set of design and prompting practices specifically for LLM-Ready APIs.
Results and implementation
Prior to building out our LLM-Ready APIs, Kensho engineers needed to understand the underlying business logic of the dataset in question — which required a thorough analysis of the data documentation, specifications, and user needs. In the case of the Transactions dataset, the Kensho engineers asked subject matter experts to fill out a template with examples of relevant, high-value, and high-frequency user SQL queries, along with the following information (a sketch of one such template entry appears after this list):
Natural Language Question Format (e.g., “How many M&A transactions have there been in [industry x] this year?”)
Natural Language Expected Answer (e.g., “There have been 2,868 M&A transactions within the Information Technology industry in 2023 so far.”)
Relevant Data Items or Example SQL Query (e.g., SELECT t.companyid, s.subTyperValue Industry FROM…)
Business Logic (e.g., “Users want to see whether companies within an industry of interest have an increasing or decreasing appetite for transactions, and what kind of transactions…”)
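As a rough illustration, one filled-out template entry might be captured in a structure like the following; the field names are shorthand for this sketch, not the actual template schema.

    # Hypothetical representation of a single SME template entry.
    # Field names are illustrative shorthand, not the real template schema.
    example_entry = {
        "question_format": "How many M&A transactions have there been in [industry x] this year?",
        "expected_answer": (
            "There have been 2,868 M&A transactions within the "
            "Information Technology industry in 2023 so far."
        ),
        "example_sql": "SELECT t.companyid, s.subTyperValue Industry FROM ...",
        "business_logic": (
            "Users want to see whether companies within an industry of "
            "interest have an increasing or decreasing appetite for transactions."
        ),
    }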
This templated information allowed Kensho engineers to start thinking about the appropriate data structure for the underlying dataset — a high-precision, high-stakes exercise that requires a deep understanding of business logic. This structure would in turn serve as the basis for the API design needed to answer end user questions.
At this stage of the development process, Kensho software engineers and machine learning engineers worked closely with one another in an iterative process to design API pathways that (a) effectively catered to the underlying data structure and (b) could be consistently produced by the code-generating function of an LLM. In the Transactions dataset experiment, OpenAI’s GPT-3.5 was used to generate Python that interacts with a Python API wrapper.
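A simplified sketch of what that interaction might look like follows; the prompt wording, the wrapper description, and the OpenAI client usage are assumptions made for illustration rather than Kensho’s production setup.

    # Hypothetical sketch: prompt a code-generating LLM with a compact
    # description of the wrapper's interface and ask it to emit Python.
    # The prompt text and interface description are illustrative assumptions.
    from openai import OpenAI

    API_DESCRIPTION = """
    You have access to client.get_transactions(transaction_type=..., industry=...,
    country=..., year=...), which returns a pandas DataFrame of transactions.
    Write Python that answers the user's question using this client and pandas.
    Return only code.
    """

    def generate_query_code(question: str) -> str:
        llm = OpenAI()
        response = llm.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": API_DESCRIPTION},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    generated_code = generate_query_code(
        "What were the biggest M&A deals in the technology industry "
        "that took place in Germany in 2021?"
    )

The generated snippet is then executed against the wrapper, and its output (a DataFrame plus the code itself) feeds the natural language answer and the audit trail.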
Engineering findings
While designing the Transactions dataset API (the original dataset User Guide & Schema is available to registered users), the Kensho engineering team detailed some of their technical findings:
The OpenAI models can leverage a filter-based Python API — early experiments validated this hypothesis, and it has continued to scale as we’ve expanded our API.
The OpenAI models can leverage pandas (a popular Python data analysis library) to perform analytical tasks in response to user questions. (i) Instead of having the model add a date filter to an API call, we prompted it to retrieve data for all dates and then filter to the desired dates using pandas (see the sketch after this list). (ii) This introduces some uncertainty: which filters are best left to the model, and which are best handled by the API? The tradeoffs to be considered involve every part of the system, from prompt length to data retrieval efficiency. We have just begun to explore these.
The API design must optimize for recall. If we don’t return all the data the model needs, the model cannot generate the correct answer. If 10 rows are relevant to a question, we’d rather return 1,000 rows of which only 1% are relevant than return 9 relevant rows and miss the tenth.
Handling 10% of questions with 100% accuracy is much better than answering 80% of questions with 80% accuracy. Often, there is a limited subset of key, workflow-specific questions posed by business users of the Transactions dataset. Ensuring the API has 100% performance on those key questions rather than 80% performance is non-negotiable. For non-target, general purpose questions end-users are less likely to ask, 80% performance is acceptable. For example: (i) “What were the 3 biggest transactions in the Energy Sector last year?” (More Likely and/or Valuable) (ii) “What deals were bigger than $200M, but smaller than $300M in the Energy Sector last February and that didn’t include Exxon Mobil as a buyer or seller?” (Less Likely and/or Valuable)
There exists a manageable set of primitive filters that strike a balance between the complexity budget of the model and the ability to answer enough transaction questions. Roughly, we assume 20% of the possible queries can answer 80% of the questions.
We optimized for time-to-prototype, meaning we did not allow ourselves to explore all plausible API structures. Our current API structure is more a result of intuition than of scientific experimentation. (i) One alternative was a graph-based API (rather than a filter-based Python API); however, the LLMs we were leveraging couldn’t construct GraphQL as well as they could Python. (ii) We are actively developing tools to make the process of designing LLM-Ready APIs more efficient, more repeatable, and grounded in empirical evidence.
Early indicators suggest there’s an intense S-curve where the x-axis is the API’s usefulness to end users and the y-axis is engineering complexity (i.e., 20% of effort will support 80% of value).
The API was not designed to be a nicer way of querying the Transactions data, though that may be a side effect. It’s an opinionated system that abstracts complexity over the Transactions data so that the model can use it. We have not spent time exploring the Transactions data outside of the questions we aim to support, and the focus has only been on returning the relevant data for a chatbot-like interface.
The optimal data structure/API for the current LLMs may not be the optimal data structure/API for humans to use and understand. The industry has experience with the latter but not the former.
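To illustrate the recall-first, pandas-heavy pattern described in the findings above, here is a minimal hypothetical sketch; the client, method, and column names are assumptions, and the question mirrors the Energy-sector example given earlier.

    # Hypothetical illustration of retrieving broadly (favoring recall) and
    # letting model-generated pandas code do the narrowing. All names here
    # (TransactionsClient, column names) are assumptions for this sketch.
    import pandas as pd
    from transactions_api import TransactionsClient  # hypothetical wrapper

    client = TransactionsClient()

    # "What were the 3 biggest transactions in the Energy Sector last year?"
    # Retrieve all Energy-sector rows without a date filter, then narrow the
    # dates in pandas rather than in the API call.
    deals: pd.DataFrame = client.get_transactions(industry="Energy")
    deals["announced_date"] = pd.to_datetime(deals["announced_date"])

    last_year = deals[deals["announced_date"].dt.year == 2022]
    top_three = last_year.nlargest(3, "transaction_value")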
Next steps
Although the Transactions dataset API is Kensho’s first completed internal-facing LLM-Ready API (in beta), its development was largely a result of trial and error. Kensho is actively researching several alternative methods to support LLM-based querying of tabular datasets and plans to develop a broader toolkit (e.g., prompt libraries, SDKs, etc.) to support client-side integration of S&P Global data in generative AI applications using retrieval-augmented generation (RAG) techniques. Under the newly christened Project DART (Data Access Retrieval & Tooling), Kensho aims to provide developers with the necessary tools to design and develop LLM-Ready APIs and data, with the broader mission of simplifying the querying of complex tabular financial data.