Launched: S&P AI Benchmarks, a solution that evaluates LLMs for finance and business
Kensho launches S&P AI Benchmarks — a publicly accessible leaderboard that rigorously evaluates leading LLMs on real-world finance and business tasks.
Today we launched S&P AI Benchmarks by Kensho (Beta), a solution that assesses the abilities of Large Language Models (LLMs) to understand and leverage text that concerns finance and business. This exciting project, which combines S&P Global’s finance expertise with Kensho’s cutting-edge research and engineering, has witnessed substantial growth in the past year. It originated as an R&D initiative to fairly evaluate models, culminated in an academic research paper, and has now blossomed into a full-fledged resource for the industry at large.
Background on today’s LLMs for finance
Today’s state-of-the-art LLMs generally demonstrate strong performance on question-answering and code generation tasks, but these models often struggle to reason about quantities and numbers, especially when multiple calculations are needed. This prevents broad LLM use in real-world applications, as the finance industry often requires transparent and precise reasoning capabilities — along with a wide breadth of applied technical knowledge.
Further, existing benchmarks for finance and business mostly address either straightforward use cases like sentiment analysis and identifying entities, or the lofty goal of predicting the stock market. At Kensho, we needed something that was both more aligned with practical use cases and comprehensive enough to capture the types of questions a financial analyst would rely on an LLM to answer.
Our goal with S&P AI Benchmarks is to create a set of rigorous tasks that are rooted in realistic use-cases for professionals. Ultimately, we want to establish a trustworthy, objective evaluation to facilitate the development of better models for business and finance.
A look at S&P AI Benchmarks
An LLM’s performance on S&P AI Benchmarks captures how well it can understand and answer a wide range of questions concerning finance and business. We developed our evaluation set in collaboration with teams across S&P Global. It comprises 600 questions spanning three categories:
Quantitative Reasoning: Given a question and lengthy documents, can the model reason correctly and perform the complex calculations needed to produce an accurate answer?
Quantity Extraction: Given financial reports, can the model extract the pertinent numerical information?
Domain Knowledge: Can the model answer multiple-choice questions that demonstrate strong, fundamental financial knowledge?
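To make the three categories concrete, here is a minimal sketch of how answers in each might be scored. The field names, tolerance, and scoring rules below are illustrative assumptions, not the actual S&P AI Benchmarks implementation: numeric quantitative-reasoning answers are compared with a small relative tolerance, while extraction and multiple-choice answers use normalized exact match.

```python
def score_answer(category: str, predicted: str, gold: str) -> bool:
    """Return True if the predicted answer counts as correct.

    Hypothetical scoring sketch; category names and the 1% relative
    tolerance are assumptions for illustration only.
    """
    if category == "quantitative_reasoning":
        # Numeric answers: allow a small relative tolerance, since models
        # may round differently (e.g. "12.5" vs "12.49").
        try:
            pred_val = float(predicted.strip().rstrip("%"))
            gold_val = float(gold.strip().rstrip("%"))
        except ValueError:
            return False
        return abs(pred_val - gold_val) <= 0.01 * max(abs(gold_val), 1e-9)
    # Quantity extraction and multiple-choice domain knowledge:
    # normalized exact match.
    return predicted.strip().lower() == gold.strip().lower()

print(score_answer("quantitative_reasoning", "12.49", "12.5"))  # True
print(score_answer("domain_knowledge", "B", "b"))               # True
```

The split matters in practice: exact match would unfairly penalize a model that computes the right number but rounds it differently, while a tolerance would be too lenient for extraction tasks where the reported figure must match the filing verbatim.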
For a more detailed and technical look at this project, the research paper can be found here.
Submission process
The submission process is secure: financial professionals can benchmark their LLMs without sharing any part of their model, only its outputs. We use the submitted outputs to compute the user's score, which we then showcase on our public leaderboard. Any user can request to have their scores removed at any time.
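The outputs-only design above can be sketched as follows. This is a hypothetical illustration of what a submission file might contain, one record per benchmark question with only the model's output, never the model's weights or prompts; the JSON Lines format and field names (`question_id`, `model_output`) are assumptions for illustration, not the actual submission schema.

```python
import json

# Hypothetical submission: each record pairs a benchmark question ID with
# the model's raw output. Nothing about the model itself is shared.
submission = [
    {"question_id": "qr-001", "model_output": "12.5%"},
    {"question_id": "qe-042", "model_output": "$1,234 million"},
    {"question_id": "dk-107", "model_output": "B"},
]

# Write one JSON object per line (JSON Lines), a common format for
# evaluation submissions.
with open("submission.jsonl", "w") as f:
    for record in submission:
        f.write(json.dumps(record) + "\n")
```

Because the evaluator only ever sees these output strings, proprietary fine-tuned models stay private while their scores remain comparable on the public leaderboard.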
The strength of the leaderboard depends on the submissions and community involvement. We encourage you to check it out or submit your own LLMs by clicking here.
If you encounter any hurdles in the submission process, or have questions or suggestions, please reach out to benchmarks@kensho.com.