June 24, 2025

Performance Evaluation of Large Language Models in Financial Due Diligence: A Comparative Analysis of Claude 3.7 Sonnet and Claude 4 Sonnet


Authors

Maya Boeye - Head of AI Research, ToltIQ
Alfast Bermudez - Private Equity AI Researcher
Jonathan Eichler - Private Equity AI Researcher

Abstract

This study presents a comparative evaluation of two large language models (LLMs), Claude 3.7 Sonnet and Claude 4 Sonnet, in the context of private equity due diligence applications. Through rigorous benchmarking across hundreds of real-world use cases, we assessed model performance across multiple dimensions: reasoning, relevance, processing speed, accuracy, contextual understanding of business scenarios, and output formatting. Our findings demonstrate that Claude 4 Sonnet achieves superior performance across all measured metrics, with particular improvements in temporal understanding, contextual awareness, and information density. Relative to Claude 4 Sonnet, Claude 3.7 Sonnet required 27.8-30.5% more processing time and produced 19.6-38.7% longer outputs; Claude 4 Sonnet's more concise responses sacrificed no analytical depth.

Keywords: Large Language Models, Financial Due Diligence, Private Equity, AI Performance Evaluation, Temporal Reasoning, Contextual Awareness

1 Introduction

ToltIQ maintains a model-agnostic architecture, enabling clients to leverage their preferred LLMs from industry-leading providers. Given the rapid pace of advancement in AI, we continuously evaluate models to ensure our platform delivers the most effective tools for private equity professionals. Following Anthropic's release of Claude 4 Sonnet as its frontier model, our research team began rigorous benchmarking designed around real-world due diligence scenarios.

Our latest comprehensive evaluation tested hundreds of due diligence use case queries to compare performance across multiple dimensions: reasoning, relevance, processing speed, accuracy, contextual understanding of business scenarios, and output formatting. The goal was to determine how Claude 4 Sonnet differs from Claude 3.7 Sonnet.

2 Methodology

The evaluation employed controlled testing scenarios designed to assess specific capabilities critical to due diligence applications. Performance was measured across multiple dimensions, including temporal reasoning, contextual awareness, processing speed, and information density.

Test Scenarios

2.1 Temporal Understanding Test

Our benchmark testing includes a deal with an outdated virtual data room (VDR) containing no recent documents. The test requested financial projections from the outdated VDR to evaluate whether models could correctly anchor projections to document timeframes rather than query dates.

2.2 Contextual Awareness Test

A deliberately confusing prompt was used to evaluate prompt handling capabilities. The prompt begins by asking about a healthcare company's financials, then shifts focus to coffee market analysis without explanation, creating an obvious contextual disconnect.

Test Prompt: "Based on [Healthcare Company's] projected financial statements and ratios, analyze the company's financial health, growth prospects, and potential risks over the next 5 years. Structure your analysis into sections covering financial stability, growth opportunities, and risks, with a focus on the coffee market and other key product areas highlighted in [Healthcare Company's] MD&A."

2.3 Commercial and Consumer Goods Benchmarks

Systematic evaluation across two industry sectors, measuring processing speed, source utilization, and response characteristics.

3 Results

3.1 Temporal Understanding

Claude 3.7: When working with documents from 2015 and asked for 5-year projections, Claude 3.7 projects from the current date (2025), producing projections for 2026-2030. This suggests it prioritizes its knowledge of the current date over the document context.

Claude 4: Demonstrates superior temporal understanding, correctly deducing that 5-year projections from 2015 documents should cover 2016-2020. This indicates that Claude 4 understands the time frame better and, by extension, the rest of the material as well.

Key Finding: While previous models like Claude 3.7 produced mathematically accurate projections, Claude 4 displayed better temporal reasoning by interpreting the intended timeframe from the document context—correctly anchoring 5-year projections to the source document's date rather than the query date.
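The anchoring rule at issue can be sketched as a simple function. This is an illustrative sketch, not code from the evaluation; the function name and interface are invented for clarity.

```python
from datetime import date

def projection_window(document_year, horizon_years=5, anchor_to_document=True):
    """Return the years an N-year projection should cover.

    Anchoring to the document year reproduces Claude 4's behavior
    (2015 documents -> 2016-2020); anchoring to the current year
    reproduces Claude 3.7's (queried in 2025 -> 2026-2030).
    """
    start = (document_year if anchor_to_document else date.today().year) + 1
    return list(range(start, start + horizon_years))

print(projection_window(2015))  # [2016, 2017, 2018, 2019, 2020]
```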

3.2 Contextual Awareness

Claude 3.7 accepted the contradictory prompt without questioning its relevance, proceeding to answer about coffee markets despite the contextual mismatch.

Claude 4 immediately identified the incongruity and flagged it before attempting to answer, demonstrating better contextual awareness and quality control.

Key Finding: Claude 4 shows significantly improved prompt handling and contextual awareness compared to Claude 3.7. This suggests Claude 4 has better quality control mechanisms and contextual reasoning; it can recognize when a prompt contains contradictory or nonsensical elements rather than just attempting to fulfill whatever is asked.

3.3 Processing Speed

3.3.1 Commercial Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Speed (Avg. TFT) | 14.7 | 11.5 | Claude 3.7 is 3.2 (27.8%) slower
Sources Used | 27.6 | 27.5 | Claude 3.7 used 0.1 more sources on average
Response Length (characters) | 5862.5 | 4902.4 | Claude 3.7 is 960.1 characters (19.6%) longer

Key Finding: Claude 3.7 takes 27.8% more time to access essentially the same sources while producing responses that are 19.6% longer.

3.3.2 Consumer Goods Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Speed (Avg. TFT) | 15.4 | 11.8 | Claude 3.7 is 3.6 (30.5%) slower
Sources Used | 24.9 | 24.7 | Claude 3.7 used 0.2 more sources on average
Response Length (characters) | 6984.4 | 5034.9 | Claude 3.7 is 1949.5 characters (38.7%) longer

Key Finding: Claude 3.7 takes 30.5% more time and produces responses 38.7% longer for accessing roughly the same number of sources.
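The "slower" percentages in both speed tables are computed relative to Claude 4's faster time. The following sketch (illustrative, not from the report) makes the arithmetic explicit:

```python
def percent_slower(t_slow, t_fast):
    """Percent by which t_slow exceeds t_fast, relative to the faster time."""
    return (t_slow - t_fast) / t_fast * 100

# Average TFT figures from the tables above
print(round(percent_slower(14.7, 11.5), 1))  # 27.8 (commercial prompts)
print(round(percent_slower(15.4, 11.8), 1))  # 30.5 (consumer goods prompts)
```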

3.4 Information Efficiency

Information Efficiency is a metric designed to reflect how much of the available information in the VDR appears in the average output, and how concisely. First, we compute the percentage of the VDR's documents that were cited across all prompts. This percentage is multiplied by the average number of citations per prompt to yield an information diversity score, which expresses how much of the whole document set is reflected in an output. That score is then divided by the average character count, showing how many characters are used to convey that level of information.
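The calculation can be sketched as follows. The per-1,000-character scaling is inferred from the published figures, not stated in the report, and the function name is invented for illustration:

```python
def information_density(share_of_sources_cited, avg_citations, avg_chars,
                        per_chars=1000):
    """Diversity = share of VDR documents cited x avg citations per prompt;
    density = diversity per `per_chars` characters of output."""
    diversity = share_of_sources_cited * avg_citations
    return diversity / avg_chars * per_chars

# Commercial-prompt figures from the tables below
print(round(information_density(0.7222, 27.6, 5862.5), 2))  # 3.4 (Claude 3.7)
print(round(information_density(0.7222, 27.5, 4902.4), 2))  # 4.05 (Claude 4; the report's 4.06 reflects input rounding)
```

With these rounded inputs the sketch reproduces the reported scores to within about 0.01.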

3.4.1 Commercial Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Avg. # of Citations | 27.6 | 27.5 | Claude 3.7 cites 0.1 more sources per output on average
% of Total Sources Used | 72.22% | 72.22% | No difference; both models used the same number of unique sources
Response Length (characters) | 5862.5 | 4902.4 | Claude 3.7 used an average of 960.1 more characters than Claude 4
Information Density Score | 3.4 | 4.06 | Claude 4 scores 0.66 higher: it conveys an almost equal amount of information in far fewer characters

Key Finding: Claude 4, though more concise, packs in just as much, if not more, relevant information; the new model does not sacrifice analytical depth for brevity.

3.4.2 Consumer Goods Prompts
Metric | Claude 3.7 | Claude 4 | Difference
Avg. # of Citations | 24.9 | 24.7 | Claude 3.7 cites 0.2 more sources per output on average
% of Total Sources Used | 72.22% | 72.22% | No difference; both models used the same number of unique sources
Response Length (characters) | 6984.4 | 5034.9 | Claude 3.7 used an average of 1949.5 more characters than Claude 4
Information Density Score | 2.57 | 3.55 | Claude 4 scores 0.98 higher: it conveys an almost equal amount of information in far fewer characters

Key Finding: Claude 4 demonstrates significant improvement over Claude 3.7: it considered and reflected an almost equal amount of information but communicated it far more concisely.

4 Discussion

4.1 Performance Implications

The testing revealed marked improvements across every dimension measured: processing speed, analytical accuracy, and contextual understanding of complex business scenarios. Most importantly, Claude 4's enhanced reasoning capabilities translate directly into more nuanced insights during target evaluation—exactly what private equity professionals need when time and accuracy are both critical.

4.2 Operational Impact

The combination of faster processing and higher information density suggests substantial productivity gains for due diligence workflows.
