■ Research and innovation
Shaping the future through cutting-edge research and development
Research and experimentation are core to Kensho. We push the boundaries of what’s possible, and our teams are often first to explore, publish, and patent new approaches, advancing research and shaping how AI is built and deployed.
■ Who we are
Research runs through everything we build.
A dedicated R&D team leads our pure research, contributing to the field through peer-reviewed publications and collaborations with academic and industry partners. Applied research extends across Kensho's ML and engineering teams, where we build novel, practical solutions for our customers through rigorous research, feedback, and development.
■ Key focus areas
Current research
- We adapt and develop state-of-the-art methodologies to optimize how agents interact with data retrieval tools and how contextual queries are routed between multiple agents. Our methods, including dynamic prompting, query expansion, self-reflection, and intelligent routing across any number of datasets, enable LLMs to effectively leverage S&P Global's AI-ready data. We also develop agentic architectures built to orchestrate complex workflows such as conducting research and generating long-form content.
- We explore and develop cutting-edge strategies for unstructured and structured data retrieval, enabling agentic workflows atop complex financial data. We push unstructured data retrieval far beyond naive vector-search RAG, leveraging structural information and relationships from our document processing toolkit to enrich data for complex data discovery and question answering at fast inference speeds. For structured and tabular data, we combine LLM-optimized tools with a proprietary Text-to-SQL framework. Across both, we apply strategies such as reinforcement learning, self-consistency, and self-reflection to optimize our agents for financial datasets, delivering sharper customer insights across complex data sources.
- We research and develop multi-tiered evaluation frameworks and benchmarks for finance and business. Our evaluations assess models across a broad spectrum of tasks, from simple data retrieval and long-context QA to advanced multi-step reasoning and program synthesis. These frameworks support complex agentic products spanning tool usage, code and SQL generation, computer use, and report generation. To measure both end-to-end solution quality and individual component performance, we employ diverse strategies, ranging from AST parsing to LLM juries calibrated by domain experts, while rigorously auditing data relevance and output consistency. Beyond applied metrics, we actively research and challenge the fundamentals of evaluation itself. We explore critical questions such as the reliability of LLM-as-a-judge, the impact of high-quality labels, methods for efficiently predicting model performance on unseen evaluations, and the real-world properties that current benchmarks fail to address. Ultimately, our comprehensive approach enables data-driven decisions and continuous model improvement.
- We analyze and improve tokenization methods, which determine how all input data should be parsed and represented to a model. Tokenization naturally impacts a model's ability to process and understand data such as human language text, and remains an under-explored LLM research frontier. We develop fundamental tokenization capabilities we can use to better present data to LLMs, including our rich, domain-specific business and finance data. This work includes drastically improving the speed and efficiency of training and using tokenizers, discovering LLM limitations with processing numeric data in various languages and scripts, moving beyond the restrictions of the ubiquitous BPE tokenizer, and building our own enhanced tokenizer.
- We explore how to best harness the structural and relational information within and across documents, enabling models to move beyond surface-level text and achieve richer document understanding. Our research across document components and their relationships informs ingestion strategies that ensure both structural and semantic coherence and provide better-structured, more usable content downstream. This coherence and richer representation unlocks greater performance and even new capabilities compared with traditional extraction. Models need this context and structure to answer targeted questions, surface broader insights across wider knowledge bases, and ground validated assertions against a document corpus.
- We are on a mission to extract and structure all information contained within documents, from plain text to the most complex visual elements. Our proprietary models extract figure data from document images and graphs, unlocking highly accurate quantitative question-answering capabilities. In our pursuit of comprehensive document AI, we are also building the next generation of table extraction models, including Vision Language Models (VLMs) that parse tables in their natural context within page images.
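The intelligent routing described in the first focus area can be illustrated with a deliberately simplified sketch. Production routers typically score candidate datasets with an LLM or embedding model; here a keyword-overlap score stands in for that step, and the dataset names and profiles are hypothetical, not Kensho's actual catalog.

```python
import re

def route_query(query: str, datasets: dict[str, set[str]]) -> str:
    """Send a query to the dataset whose keyword profile best
    overlaps it. The overlap score is a stand-in for an LLM- or
    embedding-based scorer used in a real router."""
    terms = set(re.findall(r"[a-z0-9-]+", query.lower()))
    return max(datasets, key=lambda name: len(terms & datasets[name]))

# Hypothetical dataset profiles for illustration only:
datasets = {
    "fundamentals": {"revenue", "earnings", "balance", "income"},
    "transcripts": {"call", "ceo", "said", "guidance"},
    "filings": {"10-k", "risk", "filing", "disclosure"},
}
dest = route_query("What revenue guidance did the CEO give on the call?", datasets)
# The transcripts profile matches three query terms, so it wins.
```

A real system would also handle ties and multi-dataset queries, which is where the multi-agent orchestration mentioned above comes in.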
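Self-consistency, mentioned in the retrieval focus area, is straightforward to sketch for Text-to-SQL: sample several candidate queries from the model at nonzero temperature, normalize away superficial differences, and keep the majority answer. This is a generic illustration of the technique, not Kensho's proprietary framework; the sampled SQL strings below are hypothetical.

```python
from collections import Counter
import re

def normalize_sql(sql: str) -> str:
    """Collapse whitespace and lowercase so trivially different
    candidates compare equal."""
    return re.sub(r"\s+", " ", sql.strip()).lower()

def self_consistency_vote(candidates: list[str]) -> str:
    """Majority vote over sampled generations: return the first
    candidate whose normalized form is most frequent."""
    if not candidates:
        raise ValueError("no candidates")
    counts = Counter(normalize_sql(c) for c in candidates)
    winner, _ = counts.most_common(1)[0]
    return next(c for c in candidates if normalize_sql(c) == winner)

# Hypothetical samples from an LLM at nonzero temperature:
samples = [
    "SELECT ticker, revenue FROM financials WHERE year = 2024",
    "select ticker, revenue\nfrom financials where year = 2024",
    "SELECT ticker, net_income FROM financials WHERE year = 2024",
]
best = self_consistency_vote(samples)
# The first two samples agree after normalization, so they win 2-1.
```

Stronger variants vote on execution results rather than query text, which tolerates genuinely different SQL that computes the same answer.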
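Calibrating LLM juries against domain experts, as the evaluation focus area describes, requires a chance-corrected agreement statistic; Cohen's kappa is a standard choice for binary verdicts. The sketch below implements it in plain Python on hypothetical human and judge labels, purely to illustrate the calibration step.

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two binary raters,
    e.g. an LLM judge vs. a human domain expert."""
    assert a and len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the two raters were independent.
    pa1, pb1 = sum(a) / n, sum(b) / n
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts (1 = answer judged correct) on ten items:
human = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
judge = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
kappa = cohens_kappa(human, judge)
# Raw agreement is 0.8, but kappa discounts chance agreement.
```

Low kappa on a labeled calibration set is exactly the kind of signal behind the "No Free Labels" finding listed in the publications below: raw judge accuracy can look fine while chance-corrected agreement with humans does not.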
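As background for the tokenization focus area, the ubiquitous BPE algorithm it aims to move beyond fits in a few lines: repeatedly merge the most frequent adjacent symbol pair in the corpus. This is a toy version of classic BPE training on a tiny made-up corpus, not Kensho's enhanced tokenizer.

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules by repeatedly merging the most
    frequent adjacent symbol pair (toy classic BPE)."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train(["lower", "lowest", "low", "low"], num_merges=3)
# On this corpus the first merge joins 'l' and 'o'.
```

One limitation visible even here: merges never cross word boundaries set by pre-tokenization, a restriction that work such as the "Boundless Byte Encoding" paper below targets directly.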
■ Publications
Advancing industry knowledge
Our research regularly appears in top conferences, academic journals, and leading NLP textbooks. These publications reflect a commitment to advancing industry knowledge while grounding innovation in practical applications. Explore our full list of publications.
| Venue / Date | Area | Title | Authors |
| ▪ Apr. 2026 | Tokenization | Faster Superword Tokenization | Craig W. Schmidt, Chris Tanner, Yuval Pinter |
| ▪ Apr. 2026 | Evaluation | Cost-Efficient Estimation of General Abilities Across Benchmarks | Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner |
| ▪ Apr. 2026 | Evaluation | FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks | Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi, Muhammad Ahsen Fahim, Hanzallah Amjad, Ahmad Orakzai, Aqsa Gul, Chris Tanner |
| ▪ Apr. 2026 | Evaluation | No Free Labels: Limitations of LLM-as-a-Judge without Human Grounding | Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner |
| ▪ ACL 2026 | Evaluation | The Effect of Scripts and Formats on LLM Numeracy | Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner |
| ▪ ACL 2026 | Document Understanding | On Finding Inconsistencies in Documents | Charles J. Lovering, Seth Ebner, Brandon Smock, Michael Krumdick, Saad Rabbani, Ahmed Muhammad, Varshini Reddy, Chris Tanner |
| ▪ Dec. 2025 | Extraction | PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction | Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland |
| ▪ NeurIPS 2025 | Evaluation | Complexity Scaling Laws for Neural Models using Combinatorial Optimization | Lowell Weissman, Michael Krumdick, A. Lynn Abbott |
| ▪ NeurIPS 2025 | Evaluation | BLEUBERI: BLEU is a surprisingly effective reward for instruction following | Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Lie, Chris Tanner, Mohit Iyyer |
| ▪ COLM 2025 | Tokenization | Boundless Byte Encoding: Breaking the Pre-Tokenization Barrier | Craig Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter |
| ▪ UIST 2025 | Evaluation | When Context Grows, So Does the Challenge: Human Oversight in LLM Evaluation of Financial Tables | Arijit Sehanobish, Shirley Anderson, Guillaume Michel, Mike Arov |
| ▪ Interspeech 2025 | NLP | SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription | Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg |
| ▪ ICML 2025 | Tokenization | Entropy-Driven Pre-tokenization for Byte Pair Encoding | Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W Schmidt, Chris Tanner |
| ▪ ICML 2025 | Tokenization | How Much is Enough? The Diminishing Returns of Tokenization Training Data | Varshini Reddy, Craig Schmidt, Yuval Pinter, Chris Tanner |
| ▪ ICLR 2025 | GenAI | On-Device Watermarking: A Socio-Technical Imperative For Authenticity In The Age of Generative AI | Houssam Kherraz |
| ▪ IUI/HCI 2025 | GenAI | Generative AI Interface Design Considerations for Private Equity | Shirley Anderson, Yuanfei Zhao |
| ▪ ACL 2025 | Evaluation | Language Probability Models are Not Calibrated in Numerical Contexts | Charles Lovering, Michael Krumdick, Viet Dac Lai, Seth Ebner, Nilesh Kumar, Varshini Reddy, Rik Koncel-Kedziorski, Chris Tanner |
| ▪ EMNLP 2025 | Evaluation | SEC-QA: A Systematic Evaluation Corpus for Financial QA | Viet Dac Lai, Michael Krumdick, Charles Lovering, Varshini Reddy, Craig Schmidt, Chris Tanner |
| ▪ EMNLP 2024 | Tokenization | Tokenization is More Than Compression | Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner |
| ▪ EMNLP 2024 | Document Understanding | An Analysis of Multilingual FActScore | Vu Trong Kim, Michael Krumdick, Varshini Reddy, Franck Dernoncourt, Viet Dac Lai |
| ▪ ACL 2024 | Tokenization | Greed is All You Need: An Evaluation of Tokenizer Inference Methods | Omri Uzan, Craig W Schmidt, Chris Tanner, Yuval Pinter |
| ▪ ACL 2024 | Evaluation | DocFinQA: A Long-Context Financial Reasoning Dataset | Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner |
| ▪ ACL 2024 | Evaluation | BizBench: A Quantitative Reasoning Benchmark for Business and Finance | Michael Krumdick, Rik Koncel-Kedziorski, Viet Dac Lai, Varshini Reddy, Charles Lovering, Chris Tanner |
| ▪ NAACL 2024 | NLP | MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution | Amir Pouran Ben Veyseh, Viet Dac Lai, Chien Nguyen, Franck Dernoncourt, Thien Nguyen |
| ▪ LREC-Coling 2024 | NLP | CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages | Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen |
| ▪ LREC-Coling 2024 | NLP | CAMAL: A Novel Dataset for Multi-label Conversational Argument Move Analysis | Viet Dac Lai, Duy Ngoc Pham, Jonathan Steinberg, Jamie Mikeska, Thien Huu Nguyen |
| ▪ NCME 2024 | ML | Using Machine Learning to Detect Student Learning Levels along a Learning Progression | Duy Pham, Viet Dac Lai |
| ▪ ICLR 2024 | ML | Scalable Neural Network Kernels | Arijit Sehanobish, Krzysztof Choromanski, Yunfan Zhao, Avinava Dubey, Valerii Likhosherstov |
| ▪ ACL 2023 | ML | The economic trade-offs of large language models: A case study | Kristen Howell, Gwen Christian, Pavel Fomitchov, Gitit Kehat, Julianne Marzulla, Leanne Rolston, Jadin Tredup, Ilana Zimmerman, Ethan Selfridge, Joseph Bradley |
| ▪ ACL 2023 | NLP | Learning Answer Generation using Supervision from Automatic Question Answering Evaluators | Matteo Gabburo, Siddhant Garg, Rik Koncel-Kedziorski, Alessandro Moschitti |
| ▪ ICDAR 2023 | Extraction | GriTS: Grid Table Similarity Metric for Table Structure Recognition | Brandon Smock, Rohith Pesala, Robin Abraham |
| ▪ ICDAR 2023 | Extraction | Aligning benchmark datasets for table structure recognition | Brandon Smock, Rohith Pesala, Robin Abraham |
| ▪ ICDAR 2023 | Document Understanding | A Graphical Approach to Document Layout Analysis | Jilin Wang, Michael Krumdick, Baojia Tong, Hamima Halim, Maxim Sokolov, Vadym Barda, Delphine Vendryes, Chris Tanner |
| ▪ EACL 2023 | NLP | What happens before and after: Multi-Event Commonsense in Event Coreference Resolution | Sahithya Ravi, Chris Tanner, Raymond Ng, Vered Shwartz |
| ▪ ICML 2023 | ML | Efficient Graph Field Integrators Meet Point Clouds | Krzysztof Choromanski, Arijit Sehanobish, Han Lin, Yunfan Zhao, Eli Berger, Tetiana Parshakova, Alvin Pan, David Watkins, Tianyi Zhang, Valerii Likhosherstov, Somnath Basu Roy Chowdhury, Avinava Dubey, Deepali Jain, Tamas Sarlos, Snigdha Chaturvedi, Adrian Weller |
| ▪ Interspeech 2023 | ML | Boosting Punctuation Restoration with Data Generation and Reinforcement Learning | Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, Thien Huu Nguyen |
| ▪ ICLR 2023 | ML | Mask Conditional Synthetic Satellite Imagery | Van Anh Le, Varshini Reddy, Zixi Chen, Mengyuan Li, Xinran Tang, Anthony Ortiz, Simone Fobi Nsutezo, Caleb Robinson |
■ Datasets
Better data means better AI
We build high-quality datasets to advance AI and ML. Some are open-sourced to support the research community. Others underpin Kensho's own models and products.
SPGISpeech 2.0
A large-scale dataset containing thousands of hours of professionally transcribed and formatted financial audio for transcription, acoustic modeling, and ASR.
PubTables-v2
A first-of-its-kind dataset developed to advance document AI by enabling end-to-end table extraction, including full-page and multi-page tables.
Finance Fundamentals
A collection of datasets featuring quantitative reasoning benchmarks, financial domain knowledge, and document-based questions to evaluate and train LLMs.
FIND
A document benchmarking dataset enabling models to detect, describe, and provide evidence of inconsistencies in long, technical, and complex documents.
■ Research to product
From research to real-world solutions
Kensho turns AI research into the products S&P Global and its customers rely on every day. What starts in our labs ends up powering decisions across global markets.