The Future of S&P AI Benchmarks
As LLM performance on finance benchmarks reaches saturation, Kensho is sunsetting S&P AI Benchmarks and charting what's next for AI evaluation in financial services.
Benchmarks have a natural lifecycle. As generative AI evolves and large language models (LLMs) improve ever more rapidly, model performance has progressed to the point where our benchmarks have served their purpose. As our research enters a new phase, we are sunsetting S&P AI Benchmarks by Kensho and will no longer be updating our public leaderboards. We want to explain why we’re making this change, and what it means for our customers and the industry.
Within natural language processing (NLP), evaluations have a long history of leading the way for model progress. From SQuAD to HumanEval to SWE-bench, benchmarks have defined not only how but, more importantly, what we measure.
At Kensho and S&P Global, we have a particular interest in understanding how models perform on business and finance tasks. This often requires developing novel benchmarks. At the Association for Computational Linguistics (ACL) conference in 2024, our research team presented two new benchmarks: BizBench, for quantitative reasoning and domain knowledge, and DocFinQA, for reasoning over long documents.
These benchmarks drew considerable interest from professionals across the financial industry who wanted to know which language models they should use for their business and finance use cases. In response, we curated two further benchmarks from these datasets, Finance Fundamentals and Long-document QA, and turned them into S&P AI Benchmarks by Kensho, a live, publicly accessible leaderboard.
A new era in evaluation
Evaluations have a natural lifecycle. Normally, they begin with a healthy gap in performance between humans and models. This gap erodes over time as model developers begin to iterate and experiment with these tasks. Eventually, model performance reaches or exceeds that of humans and hits a natural ceiling. At this point, we consider the task to be “saturated,” meaning it no longer provides a discriminative signal between models.
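To make "no discriminative signal" concrete, here is a minimal sketch of one way to check whether the gaps between top scores are larger than sampling noise. It is an illustration under stated assumptions, not the method used for S&P AI Benchmarks: the model names, scores, and question count are hypothetical, and the two scores are treated as independent even though models answer the same questions, which is a simplification.

```python
import math


def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def distinguishable(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> bool:
    """True if the gap between two accuracies exceeds z standard errors.

    Simplification: the two scores are treated as independent samples.
    """
    se_diff = math.sqrt(
        accuracy_std_error(acc_a, n_questions) ** 2
        + accuracy_std_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


def count_tied_with_top(scores: dict[str, float], n_questions: int, z: float = 1.96) -> int:
    """Number of models (including the leader) not distinguishable from the top score."""
    top = max(scores.values())
    return sum(
        1 for score in scores.values()
        if not distinguishable(top, score, n_questions, z)
    )


# Hypothetical leaderboard and question count, for illustration only.
hypothetical_scores = {
    "model_a": 0.925,
    "model_b": 0.910,
    "model_c": 0.900,
    "model_d": 0.700,
}
HYPOTHETICAL_N_QUESTIONS = 500

tied = count_tied_with_top(hypothetical_scores, HYPOTHETICAL_N_QUESTIONS)
print(f"{tied} of {len(hypothetical_scores)} models are statistically tied with the leader")
```

When most of the leaderboard falls into that tied group, ranking models by score says little about which one is actually better. If per-question results are available, a paired comparison would be a more appropriate test than the independence assumption made here.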
We are now entering a new era in evaluations at large. Finding tasks that humans can perform better than frontier models is increasingly difficult. Just a few years ago, determining whether a movie review contained “positive” or “negative” sentiment represented a worthwhile benchmark. Now, frontier models can write convincing poetry, solve graduate-level physics problems, and run autonomously for hours. The gaps are shrinking.
Since releasing S&P AI Benchmarks last year, we’ve used this leaderboard to track how performance on these tasks has developed. At our initial release in April 2024, the best closed-source model on the Finance Fundamentals task, OpenAI’s former flagship model GPT-4 Turbo, received an overall score of 88%.
Within a year, we’ve seen that score surpassed by an open-source model capable of running on a high-end laptop.
The key shift occurred during the rapid development of reasoning models, a new category of specialized language model that solves problems by generating explicit natural language reasoning before attempting to compose a final answer. In late 2024, OpenAI’s o1 became the first model on our benchmark to surpass 91%, followed by a wave of open-source models sparked by DeepSeek R1. In March 2025, Qwen’s 32B-parameter reasoning model, QwQ, became the sixth model to score between 90% and 92.5%. S&P AI Benchmarks had become saturated, and it became clear that we’d reached the end of the lifecycle.
Our goal was to create a rigorous set of tasks rooted in realistic use cases, establishing a trustworthy and objective evaluation to facilitate a better understanding of model capabilities. Over the past year, we’ve not only achieved this goal but also watched our public-facing leaderboards encourage innovation and advance a shared understanding of GenAI models for business and finance.
As of July 2025, we are sunsetting the current iteration of S&P AI Benchmarks by Kensho, so you won’t see further updates to our public leaderboards. As we close this chapter, we are excited to take all we’ve learned from this benchmarking effort forward into new research. We’d also like to share what we’ve learned with the AI community, so as part of this effort we are open-sourcing our benchmarking data and will continue to publish our research in this domain. We hope this dataset will further the industry’s collective research into developing more efficient and powerful models.
As we think critically about evaluations at large, we are shifting our focus toward the role benchmarks will play as AI continues to evolve. This means rethinking not only how we build the next generation of benchmarks that incorporate more complex reasoning and agentic workflows, but also how we can effectively quantify the utility of benchmarks and measure the value of model performance. Ultimately, we hope this research and understanding will inform how we develop benchmarks that better address the performance gap between LLMs and humans. We look forward to sharing more on this front down the road.