Learnings from the lab: Querying S&P Global’s tabular data using LLMs

Understanding complex data structures is a major challenge for generative AI use cases. Kensho is designing specialized LLM-ready APIs to tackle this problem.

Authors: Taylor Richardson, Michael Hoffmann

Situation

Complication

Solution hypothesis

Results and implementation

  1. Natural Language Question Format (e.g., “How many M&A transactions have there been in [industry x] this year?”)

  2. Natural Language Expected Answer (e.g., “There have been 2,868 M&A transactions within the Information Technology industry in 2023 so far.”)

  3. Relevant Data Items or Example SQL Query (e.g., SELECT t.companyid, s.subTyperValue Industry FROM…)

  4. Business Logic (e.g., “Users want to see whether companies within an industry of interest have an increasing or decreasing appetite for transactions, and what kind of transactions…”)
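For concreteness, here is a minimal sketch of how one evaluation example in this four-part format might be represented in code. The `EvaluationExample` class and its field names are illustrative assumptions, not Kensho's actual schema; the values are taken from the examples above.

```python
from dataclasses import dataclass

# Illustrative sketch only: the class and field names are assumptions,
# not Kensho's actual evaluation schema.
@dataclass
class EvaluationExample:
    question: str          # 1. natural language question
    expected_answer: str   # 2. natural language expected answer
    reference_query: str   # 3. relevant data items or example SQL query
    business_logic: str    # 4. the workflow context behind the question

example = EvaluationExample(
    question="How many M&A transactions have there been in Information Technology this year?",
    expected_answer=(
        "There have been 2,868 M&A transactions within the Information Technology "
        "industry in 2023 so far."
    ),
    reference_query="SELECT t.companyid, s.subTyperValue Industry FROM ...",
    business_logic=(
        "Users want to see whether companies within an industry of interest have an "
        "increasing or decreasing appetite for transactions."
    ),
)
```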

Engineering findings

  • The OpenAI models can leverage a filter-based Python API — early experiments validated this hypothesis, and it has continued to scale as we’ve expanded our API.
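As a rough illustration of what “filter-based” means here, the sketch below shows the kind of call the model might assemble. The function name `get_transactions` and its parameters are assumptions for illustration, not Kensho's actual API.

```python
from datetime import date
from typing import Optional

# Hypothetical filter-based API surface; names and parameters are
# illustrative assumptions, not Kensho's actual API.
def get_transactions(
    transaction_type: Optional[str] = None,
    industry: Optional[str] = None,
    announced_on_or_after: Optional[date] = None,
    announced_on_or_before: Optional[date] = None,
) -> list[dict]:
    """Return transaction rows matching every filter that was provided."""
    ...  # retrieval against the Transactions dataset would happen here

# A model answering "How many M&A transactions have there been in
# Information Technology this year?" might emit a call like this:
it_ma_deals = get_transactions(
    transaction_type="M&A",
    industry="Information Technology",
    announced_on_or_after=date(2023, 1, 1),
)
```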

  • The OpenAI models can leverage pandas (a popular Python data analysis library) to perform analytical tasks in response to user questions. (i) Instead of having the model add a date filter to an API call, we prompted it to retrieve data for all dates and then filter to the desired dates using pandas. (ii) This introduces some uncertainty: which filters are best left to the model, and which are best handled by the API? The tradeoffs involve every part of the system, from prompt length to data retrieval efficiency, and we have just begun to explore them.
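The pattern in (i) might look roughly like the following, with toy data standing in for the API response; the column names and values are illustrative only, not the real Transactions schema.

```python
import pandas as pd

# Toy stand-in for an API response that deliberately includes all dates;
# column names and values are illustrative, not the real Transactions schema.
deals = pd.DataFrame(
    {
        "target": ["Acme Corp", "Globex", "Initech"],
        "announced_date": pd.to_datetime(["2022-11-02", "2023-03-15", "2023-06-30"]),
        "size_usd_mm": [120.0, 450.0, 75.5],
    }
)

# Model-generated pandas step: narrow to the dates the user asked about,
# rather than pushing the date filter into the API call itself.
deals_2023 = deals[deals["announced_date"].dt.year == 2023]
print(len(deals_2023))  # -> 2
```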

  • The API design must optimize for recall. If we don’t return all the data the model needs, the model cannot generate the correct answer. Given a question with 10 relevant rows, we’d rather return 1,000 rows of which only 1% are relevant than return 9 relevant rows and miss the last one.

  • Handling 10% of questions with 100% accuracy is much better than answering 80% of questions with 80% accuracy. Often, there is a limited subset of key, workflow-specific questions posed by business users of the Transactions dataset. Ensuring the API has 100% performance on those key questions, rather than 80%, is non-negotiable. For non-target, general-purpose questions that end users are less likely to ask, 80% performance is acceptable. For example: (i) “What were the 3 biggest transactions in the Energy Sector last year?” (More Likely and/or Valuable) (ii) “What deals were bigger than $200M but smaller than $300M in the Energy Sector last February that didn’t include Exxon Mobil as a buyer or seller?” (Less Likely and/or Valuable)

  • There exists a manageable set of primitive filters that strike a balance between the complexity budget of the model and the ability to answer enough transaction questions. Roughly, we assume 20% of the possible queries can answer 80% of the questions.
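As a rough illustration of what such a primitive-filter vocabulary could look like (the specific filters below are assumptions, not Kensho's actual set):

```python
from enum import Enum

# Hypothetical primitive-filter vocabulary; the specific members are
# illustrative assumptions, not Kensho's actual filter set.
class PrimitiveFilter(str, Enum):
    TRANSACTION_TYPE = "transaction_type"          # e.g., M&A, buyback, private placement
    INDUSTRY = "industry"                          # e.g., Information Technology, Energy
    ANNOUNCED_DATE_RANGE = "announced_date_range"
    TRANSACTION_SIZE_RANGE = "transaction_size_range"
    PARTICIPANT_ROLE = "participant_role"          # buyer, seller, or target

# Working assumption from above: combinations of a small set like this
# (roughly 20% of expressible queries) cover about 80% of user questions.
```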

  • We optimized for time-to-prototype, meaning we did not allow ourselves to explore all plausible API structures. Our current API structure is more a result of intuition than of scientific experimentation. (i) One alternative was a graph-based API (rather than a filter-based Python API); however, the LLMs we were leveraging couldn’t construct GraphQL as reliably as they could Python. (ii) We are actively developing tools to make the process of designing LLM-Ready APIs more efficient, more repeatable, and grounded in empirical evidence.

  • Early indicators suggest a steep S-curve where the x-axis is the API’s usefulness to end users and the y-axis is engineering complexity (i.e., 20% of the effort will support 80% of the value).

  • The API was not designed to be a nicer way of querying the Transaction data, though that may be a side effect. It’s an opinionated system that abstracts away the complexity of the Transaction data so that the model can use it. We have not spent time exploring the Transaction data outside of the questions we aim to support, and the focus has been solely on returning the relevant data for a chatbot-like interface.

  • The optimal data structure/API for the current LLMs may not be the optimal data structure/API for humans to use and understand. The industry has experience with the latter but not the former.

Next steps
