■  Research and innovation

Shaping the future through cutting-edge research and development

Research and experimentation are core to Kensho. We push the boundaries of what’s possible, and our teams are often first to explore, publish, and patent new approaches, advancing research and shaping how AI is built and deployed.

■  Who we are

Research runs through everything we build.

A dedicated R&D team leads our pure research, contributing to the field through peer-reviewed publications and collaborations with academic and industry partners. Applied research extends across Kensho's ML and engineering teams, where we build novel, practical solutions for our customers through rigorous research, feedback, and development.

■  Key focus areas

Current research

  • We adapt and develop state-of-the-art methodologies to optimize how agents interact with data retrieval tools and how contextual queries are routed between multiple agents. Methods include dynamic prompting, query expansion, self-reflection, and intelligent routing across any number of datasets, which enable LLMs to effectively leverage S&P Global's AI-ready data. We also develop agentic architectures built to orchestrate complex workflows such as conducting research and generating long-form content.

  • We explore and develop cutting-edge strategies for unstructured and structured data retrieval, enabling agentic workflows atop complex financial data. We push unstructured data retrieval far beyond naive vector search RAG, leveraging structural information and relationships from our document processing toolkit to enrich data for complex data discovery and question answering at performant inference speeds. For structured and tabular data, we combine LLM-optimized tools with a proprietary Text-to-SQL framework. Across both, we apply strategies such as reinforcement learning, self-consistency, and self-reflection to optimize our agents for financial datasets to deliver sharper customer insights across complex data sources.

  • We research and develop multi-tiered evaluation frameworks and benchmarks for finance and business. Our evaluations assess models across a broad spectrum of tasks, from simple data retrieval and long-context QA to advanced multi-step reasoning and program synthesis. These frameworks support complex agentic products spanning tool usage, code and SQL generation, computer use, and report generation. To measure both end-to-end solution quality and individual component performance, we employ diverse strategies—ranging from AST parsing to LLM-juries calibrated by domain experts—while rigorously auditing data relevance and output consistency. Beyond applied metrics, we actively research and challenge the fundamentals of evaluation itself. We explore critical questions such as the reliability of LLM-as-a-judge, the impact of high-quality labels, methods for efficiently predicting model performance on unseen evaluations, and the real-world properties that current benchmarks fail to address. Ultimately, our comprehensive approach enables data-driven decisions and continuous model improvement.

  • We analyze and improve tokenization methods, which determine how all input data is parsed and represented to a model. Tokenization directly affects a model's ability to process and understand data such as human language text, and it remains an under-explored frontier of LLM research. We develop fundamental tokenization capabilities that let us better present data to LLMs, including our rich, domain-specific business and finance data. This work includes drastically improving the speed and efficiency of training and using tokenizers, discovering LLM limitations in processing numeric data across languages and scripts, moving beyond the restrictions of the ubiquitous BPE tokenizer, and building our own enhanced tokenizer.

  • We explore how to best harness the structural and relational information within and across documents, enabling models to move beyond surface-level text and achieve richer document understanding. Our research across document components and their relationships informs ingestion strategies that ensure both structural and semantic coherence while delivering better-structured, more usable content downstream. This coherence and richer representation unlock greater performance and even new capabilities compared with traditional extraction. Models need this context and structure to answer targeted questions, surface broader insights across wider knowledge bases, and ground validated assertions against a document corpus.

  • We are on a mission to extract and structure all information contained within documents, from plain text to the most complex visual elements. Our proprietary models extract figure data from document images and graphs, unlocking highly accurate quantitative question-answering capabilities. In our pursuit of comprehensive document AI, we are also building the next generation of table extraction models, including Vision Language Models (VLMs) that parse tables in their natural context within page images.
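To make the retrieval work above concrete, here is a minimal sketch of the "naive vector search" baseline that our structure-aware retrieval improves upon. Everything here is a hypothetical stand-in: the toy bag-of-words vectors, the sample corpus, and the function names are illustrative only, since production systems use learned dense embeddings and document-structure enrichment.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real systems use learned dense vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    """Rank documents by similarity to the query: the naive vector-search
    baseline that structure-aware retrieval pipelines build beyond."""
    q = embed(query)
    scored = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]

docs = [
    "quarterly revenue grew 12 percent year over year",
    "the board approved a new dividend policy",
    "revenue guidance for the next quarter was raised",
]
top = retrieve("what was the revenue growth", docs, k=1)
```

The baseline ranks purely on surface term overlap, which is exactly why enriching retrieval with document structure and relationships yields sharper results on complex financial questions.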
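The self-consistency strategy mentioned above can be sketched in a few lines: sample several answers to the same question and majority-vote the result. This is a minimal illustration, not our production agent logic; the canned sampler below is a hypothetical stand-in for a stochastic LLM call at temperature > 0.

```python
from collections import Counter
from itertools import cycle

def self_consistent_answer(sample_fn, question, n_samples=5):
    """Self-consistency: draw several independent samples for one question
    and return the majority answer with its vote share."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

# Hypothetical sampler: canned outputs keep the example deterministic,
# standing in for repeated stochastic LLM calls.
_samples = cycle(["42", "42", "41", "42", "42"])
def noisy_model(question):
    return next(_samples)

answer, confidence = self_consistent_answer(noisy_model, "What is 6 * 7?")
```

The vote share doubles as a cheap confidence signal, which is one reason sampling-based strategies pair well with the evaluation frameworks described above.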
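For readers unfamiliar with the BPE tokenizer referenced in the tokenization work above, here is a minimal sketch of the classic byte-pair-encoding training loop: repeatedly merge the most frequent adjacent symbol pair. This is a textbook toy, assuming a tiny word-level corpus; real tokenizers add pre-tokenization, byte fallback, and far faster pair-counting structures.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a toy corpus of whitespace-free words."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing the best pair with its merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
```

Even this toy makes the restriction visible: merges can only grow within pre-tokenized word boundaries, one of the limitations motivating work beyond standard BPE.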

■  Publications

Advancing industry knowledge

Our research regularly appears in top conferences, academic journals, and leading NLP textbooks. These publications reflect a commitment to advancing industry knowledge while grounding innovation in practical applications. Explore our full list of publications.

Apr. 2026 | Tokenization | Faster Superword Tokenization | Craig W. Schmidt, Chris Tanner, Yuval Pinter
Apr. 2026 | Evaluation | Cost-Efficient Estimation of General Abilities Across Benchmarks | Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner
Apr. 2026 | Evaluation | FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks | Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi, Muhammad Ahsen Fahim, Hanzallah Amjad, Ahmad Orakzai, Aqsa Gul, Chris Tanner
Apr. 2026 | Evaluation | No Free Labels: Limitations of LLM-as-a-Judge without Human Grounding | Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner
ACL 2026 | Evaluation | The Effect of Scripts and Formats on LLM Numeracy | Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner
ACL 2026 | Document Understanding | On Finding Inconsistencies in Documents | Charles J. Lovering, Seth Ebner, Brandon Smock, Michael Krumdick, Saad Rabbani, Ahmed Muhammad, Varshini Reddy, Chris Tanner
Dec. 2025 | Extraction | PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction | Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland
NeurIPS 2025 | Evaluation | Complexity Scaling Laws for Neural Models using Combinatorial Optimization | Lowell Weissman, Michael Krumdick, A. Lynn Abbott
NeurIPS 2025 | Evaluation | BLEUBERI: BLEU is a surprisingly effective reward for instruction following | Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Lie, Chris Tanner, Mohit Iyyer
COLM 2025 | Tokenization | Boundless Byte Encoding: Breaking the Pre-Tokenization Barrier | Craig Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter
UIST 2025 | Evaluation | When Context Grows, So Does the Challenge: Human Oversight in LLM Evaluation of Financial Tables | Arijit Sehanobish, Shirley Anderson, Guillaume Michel, Mike Arov
Interspeech 2025 | NLP | SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription | Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
ICML 2025 | Tokenization | Entropy-Driven Pre-tokenization for Byte Pair Encoding | Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W Schmidt, Chris Tanner
ICML 2025 | Tokenization | How Much is Enough? The Diminishing Returns of Tokenization Training Data | Varshini Reddy, Craig Schmidt, Yuval Pinter, Chris Tanner
ICLR 2025 | GenAI | On-Device Watermarking: A Socio-Technical Imperative For Authenticity In The Age of Generative AI | Houssam Kherraz
IUI/HCI 2025 | GenAI | Generative AI Interface Design Considerations for Private Equity | Shirley Anderson, Yuanfei Zhao
ACL 2025 | Evaluation | Language Probability Models are Not Calibrated in Numerical Contexts | Charles Lovering, Michael Krumdick, Viet Dac Lai, Seth Ebner, Nilesh Kumar, Varshini Reddy, Rik Koncel-Kedziorski, Chris Tanner
EMNLP 2025 | Evaluation | SEC-QA: A Systematic Evaluation Corpus for Financial QA | Viet Dac Lai, Michael Krumdick, Charles Lovering, Varshini Reddy, Craig Schmidt, Chris Tanner
EMNLP 2024 | Tokenization | Tokenization is More Than Compression | Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
EMNLP 2024 | Document Understanding | An Analysis of Multilingual FActScore | Vu Trong Kim, Michael Krumdick, Varshini Reddy, Franck Dernoncourt, Viet Dac Lai
ACL 2024 | Tokenization | Greed is All You Need: An Evaluation of Tokenizer Inference Methods | Omri Uzan, Craig W Schmidt, Chris Tanner, Yuval Pinter
ACL 2024 | Evaluation | DocFinQA: A Long-Context Financial Reasoning Dataset | Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner
ACL 2024 | Evaluation | BizBench: A Quantitative Reasoning Benchmark for Business and Finance | Michael Krumdick, Rik Koncel-Kedziorski, Viet Dac Lai, Varshini Reddy, Charles Lovering, Chris Tanner
NAACL 2024 | NLP | MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution | Amir Pouran Ben Veyseh, Viet Dac Lai, Chien Nguyen, Franck Dernoncourt, Thien Nguyen
LREC-Coling 2024 | NLP | CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages | Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
LREC-Coling 2024 | NLP | CAMAL: A Novel Dataset for Multi-label Conversational Argument Move Analysis | Viet Dac Lai, Duy Ngoc Pham, Jonathan Steinberg, Jamie Mikeska, Thien Huu Nguyen
NCME 2024 | ML | Using Machine Learning to Detect Student Learning Levels along a Learning Progression | Duy Pham, Viet Dac Lai
ICLR 2024 | ML | Scalable Neural Network Kernels | Arijit Sehanobish, Krzysztof Choromanski, Yunfan Zhao, Avinava Dubey, Valerii Likhosherstov
ACL 2023 | ML | The economic trade-offs of large language models: A case study | Kristen Howell, Gwen Christian, Pavel Fomitchov, Gitit Kehat, Julianne Marzulla, Leanne Rolston, Jadin Tredup, Ilana Zimmerman, Ethan Selfridge, Joseph Bradley
ACL 2023 | NLP | Learning Answer Generation using Supervision from Automatic Question Answering Evaluators | Matteo Gabburo, Siddhant Garg, Rik Koncel-Kedziorski, Alessandro Moschitti
ICDAR 2023 | Extraction | GriTS: Grid Table Similarity Metric for Table Structure Recognition | Brandon Smock, Rohith Pesala, Robin Abraham
ICDAR 2023 | Extraction | Aligning benchmark datasets for table structure recognition | Brandon Smock, Rohith Pesala, Robin Abraham
ICDAR 2023 | Document Understanding | A Graphical Approach to Document Layout Analysis | Jilin Wang, Michael Krumdick, Baojia Tong, Hamima Halim, Maxim Sokolov, Vadym Barda, Delphine Vendryes, Chris Tanner
EACL 2023 | NLP | What happens before and after: Multi-Event Commonsense in Event Coreference Resolution | Sahithya Ravi, Chris Tanner, Raymond Ng, Vered Shwartz
ICML 2023 | ML | Efficient Graph Field Integrators Meet Point Clouds | Krzysztof Choromanski, Arijit Sehanobish, Han Lin, Yunfan Zhao, Eli Berger, Tetiana Parshakova, Alvin Pan, David Watkins, Tianyi Zhang, Valerii Likhosherstov, Somnath Basu Roy Chowdhury, Avinava Dubey, Deepali Jain, Tamas Sarlos, Snigdha Chaturvedi, Adrian Weller
Interspeech 2023 | ML | Boosting Punctuation Restoration with Data Generation and Reinforcement Learning | Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, Thien Huu Nguyen
ICLR 2023 | ML | Mask Conditional Synthetic Satellite Imagery | Van Anh Le, Varshini Reddy, Zixi Chen, Mengyuan Li, Xinran Tang, Anthony Ortiz, Simone Fobi Nsutezo, Caleb Robinson
■  Datasets

Better data means better AI

We build high-quality datasets to advance AI and ML. Some are open-sourced to support the research community. Others underpin Kensho's own models and products.

SPGISpeech 2.0

A large-scale dataset containing thousands of hours of professionally transcribed and formatted financial audio for transcription, acoustic modeling, and ASR.

PubTables-v2

A first-of-its-kind dataset developed to advance document AI by empowering end-to-end table extraction tasks.

Finance Fundamentals

A collection of datasets featuring quantitative reasoning benchmarks, financial domain knowledge, and document-based questions to evaluate and train LLMs.

FIND

A benchmark dataset for evaluating whether models can detect, describe, and provide evidence of inconsistencies in long, technical, and complex documents.

■  Research to product

From research to real-world solutions

Kensho turns AI research into the products S&P Global and its customers rely on every day. What starts in our labs ends up powering decisions across global markets.