June 24, 2025

Performance Evaluation of Large Language Models in Financial Due Diligence: A Comparative Analysis of Claude 3.7 Sonnet and Claude 4 Sonnet


Authors

Maya Boeye - Head of AI Research, ToltIQ
Alfast Bermudez - Private Equity AI Researcher
Jonathan Eichler - Private Equity AI Researcher

Abstract

This study presents a comparative evaluation of two large language models (LLMs), Claude 3.7 Sonnet and Claude 4 Sonnet, in the context of private equity due diligence applications. Through rigorous benchmarking across hundreds of real-world use cases, we assessed model performance across multiple dimensions: reasoning, relevance, processing speed, accuracy, contextual understanding of business scenarios, and output formatting. Our findings demonstrate that Claude 4 Sonnet achieves superior performance across all measured metrics, with particular improvements in temporal understanding, contextual awareness, and information density. Relative to Claude 4 Sonnet, Claude 3.7 Sonnet required 27.8-30.5% more processing time and produced 19.6-38.7% longer outputs; Claude 4 Sonnet's more concise responses sacrificed no analytical depth.

Keywords: Large Language Models, Financial Due Diligence, Private Equity, AI Performance Evaluation, Temporal Reasoning, Contextual Awareness

1 Introduction

ToltIQ maintains a model-agnostic architecture, enabling clients to leverage their preferred LLMs from industry-leading providers. Given the rapid pace of advancement in AI, we continuously evaluate models to ensure our platform delivers the most effective tools for private equity professionals. Following Anthropic's release of Claude 4 Sonnet as its frontier model, our research team began rigorous benchmarking designed around real-world due diligence scenarios.

Our latest comprehensive evaluation tested hundreds of due diligence use case queries to compare performance across multiple dimensions: reasoning, relevance, processing speed, accuracy, contextual understanding of business scenarios, and output formatting. The goal was to determine how Claude 4 Sonnet differs from Claude 3.7 Sonnet.

2 Methodology

The evaluation employed controlled testing scenarios designed to assess specific capabilities critical to due diligence applications. Performance was measured across multiple dimensions, including temporal reasoning, contextual awareness, processing speed, and information density.

Test Scenarios

2.1 Temporal Understanding Test

Our benchmark testing includes a deal with an outdated virtual data room (VDR) containing no recent documents. The test requested financial projections from the outdated VDR to evaluate whether models could correctly anchor projections to document timeframes rather than query dates.

2.2 Contextual Awareness Test

A deliberately confusing prompt was used to evaluate prompt handling capabilities. The prompt begins by asking about a healthcare company's financials, then shifts focus to coffee market analysis without explanation, creating an obvious contextual disconnect.

Test Prompt: "Based on [Healthcare Company's] projected financial statements and ratios, analyze the company's financial health, growth prospects, and potential risks over the next 5 years. Structure your analysis into sections covering financial stability, growth opportunities, and risks, with a focus on the coffee market and other key product areas highlighted in [Healthcare Company's] MD&A."

2.3 Commercial and Consumer Goods Benchmarks

Systematic evaluation across two industry sectors, measuring processing speed, source utilization, and response characteristics.

3 Results

3.1 Temporal Understanding

Claude 3.7: When working with documents from 2015 and asked for 5-year projections, Claude 3.7 projects from the current date (2025), producing projections for 2026-2030. This suggests it prioritizes its knowledge of the current date over the document context.

Claude 4: Demonstrates superior temporal understanding, correctly deducing that 5-year projections from 2015 documents should cover 2016-2020. This indicates that Claude 4 understands the time frame better and, by extension, the rest of the material as well.

Key Finding: While previous models like Claude 3.7 produced mathematically accurate projections, Claude 4 displayed better temporal reasoning by interpreting the intended timeframe from the document context—correctly anchoring 5-year projections to the source document's date rather than the query date.
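The anchoring rule at issue can be sketched as a simple function. This is an illustrative sketch, not code from the evaluation; the function name and interface are invented for clarity.

```python
from datetime import date

def projection_window(document_year, horizon_years=5, anchor_to_document=True):
    """Return the years an N-year projection should cover.

    Anchoring to the document year reproduces Claude 4's behavior
    (2015 documents -> 2016-2020); anchoring to the current year
    reproduces Claude 3.7's (queried in 2025 -> 2026-2030).
    """
    start = (document_year if anchor_to_document else date.today().year) + 1
    return list(range(start, start + horizon_years))

print(projection_window(2015))  # [2016, 2017, 2018, 2019, 2020]
```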

3.2 Contextual Awareness

Claude 3.7 accepted the contradictory prompt without questioning its relevance, proceeding to answer about coffee markets despite the contextual mismatch.

Claude 4 immediately identified the incongruity and flagged it before attempting to answer, demonstrating better contextual awareness and quality control.

Key Finding: Claude 4 shows significantly improved prompt handling and contextual awareness compared to Claude 3.7. This suggests Claude 4 has better quality control mechanisms and contextual reasoning; it can recognize when a prompt contains contradictory or nonsensical elements rather than just attempting to fulfill whatever is asked.

3.3 Processing Speed

3.3.1 Commercial Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Speed (Avg. TFT) | 14.7 | 11.5 | Claude 3.7 is 3.2 (27.8%) slower
Sources Used | 27.6 | 27.5 | Claude 3.7 used 0.1 more sources on average
Response Length (characters) | 5862.5 | 4902.4 | Claude 3.7 is 960.1 characters (19.6%) longer

Key Finding: Claude 3.7 takes 27.8% more time to access essentially the same sources while producing responses that are 19.6% longer.

3.3.2 Consumer Goods Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Speed (Avg. TFT) | 15.4 | 11.8 | Claude 3.7 is 3.6 (30.5%) slower
Sources Used | 24.9 | 24.7 | Claude 3.7 used 0.2 more sources on average
Response Length (characters) | 6984.4 | 5034.9 | Claude 3.7 is 1949.5 characters (38.7%) longer

Key Finding: Claude 3.7 takes 30.5% more time and produces responses 38.7% longer for accessing roughly the same number of sources.
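The "slower" percentages in both speed tables are computed relative to Claude 4's faster time. The following sketch (illustrative, not from the report) makes the arithmetic explicit:

```python
def percent_slower(t_slow, t_fast):
    """Percent by which t_slow exceeds t_fast, relative to the faster time."""
    return (t_slow - t_fast) / t_fast * 100

# Average TFT figures from the tables above
print(round(percent_slower(14.7, 11.5), 1))  # 27.8 (commercial prompts)
print(round(percent_slower(15.4, 11.8), 1))  # 30.5 (consumer goods prompts)
```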

3.4 Information Efficiency

Information Efficiency is a metric designed to reflect how much of the available information in the VDR appears in the average output, and how concisely. First, we compute the percentage of the VDR's documents that were cited across all prompts. This percentage is multiplied by the average number of citations per prompt to yield an information diversity score, which expresses how much of the whole document set is reflected in an output. That score is then divided by the average character count, showing how many characters are used to convey that level of information.
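The calculation can be sketched as follows. The per-1,000-character scaling is inferred from the published figures, not stated in the report, and the function name is invented for illustration:

```python
def information_density(share_of_sources_cited, avg_citations, avg_chars,
                        per_chars=1000):
    """Diversity = share of VDR documents cited x avg citations per prompt;
    density = diversity per `per_chars` characters of output."""
    diversity = share_of_sources_cited * avg_citations
    return diversity / avg_chars * per_chars

# Commercial-prompt figures from the tables below
print(round(information_density(0.7222, 27.6, 5862.5), 2))  # 3.4 (Claude 3.7)
print(round(information_density(0.7222, 27.5, 4902.4), 2))  # 4.05 (Claude 4; the report's 4.06 reflects input rounding)
```

With these rounded inputs the sketch reproduces the reported scores to within about 0.01.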

3.4.1 Commercial Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Avg. # of Citations | 27.6 | 27.5 | Claude 3.7 cites 0.1 more sources per output on average
% of Total Sources Used | 72.22% | 72.22% | No difference; both models used the same number of unique sources
Response Length (characters) | 5862.5 | 4902.4 | Claude 3.7 used an average of 960.1 more characters than Claude 4
Information Density Score | 3.4 | 4.06 | Claude 4 scores 0.66 higher: it conveys an almost equal amount of information in far fewer characters

Key Finding: Claude 4, though more concise, packs in just as much, if not more, relevant information; the new model does not sacrifice analytical depth for brevity.

3.4.2 Consumer Goods Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Avg. # of Citations | 24.9 | 24.7 | Claude 3.7 cites 0.2 more sources per output on average
% of Total Sources Used | 72.22% | 72.22% | No difference; both models used the same number of unique sources
Response Length (characters) | 6984.4 | 5034.9 | Claude 3.7 used an average of 1949.5 more characters than Claude 4
Information Density Score | 2.57 | 3.55 | Claude 4 scores 0.98 higher: it conveys an almost equal amount of information in far fewer characters

Key Finding: Claude 4 demonstrates significant improvement over Claude 3.7: it considered and reflected an almost equal amount of information but communicated it far more concisely.

4 Discussion

4.1 Performance Implications

The testing revealed marked improvements across every dimension measured: processing speed, analytical accuracy, and contextual understanding of complex business scenarios. Most importantly, Claude 4's enhanced reasoning capabilities translate directly into more nuanced insights during target evaluation—exactly what private equity professionals need when time and accuracy are both critical.

4.2 Operational Impact

The combination of faster processing and higher information density suggests substantial productivity gains for due diligence workflows.
