March 11, 2026

A Comprehensive Analysis of Platform-Enhanced LLM Accuracy Across 20 Critical Due Diligence Use Cases

DOWNLOAD FULL REPORT (PDF)

Private equity firms have increasingly adopted AI tools for due diligence, but rigorous, use-case-specific benchmarks have been largely absent from the literature. This study addresses that gap.

Authors

Maya Boeye, Steiner Williams, Aubrey Inmon, Jeondo Lee, and Jennifer Venus

Introduction

We evaluated four leading large language models (GPT 5.1, GPT 5.2, Claude Sonnet 4.5, and Claude Opus 4.5) across 20 PE due diligence use cases, comparing standalone API performance against ToltIQ's platform. The evaluation used the Vals AI CorpFin V2 validation dataset, an independent benchmark comprising 360 prompts derived from 18 financial documents; the ToltIQ research team mapped these prompts to the 20 use case categories. Each test was replicated five times per model, yielding 14,400 scored responses.
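The evaluation design multiplies out exactly as reported. A quick sketch, using only figures stated above:

```python
# Evaluation design from the study: 360 prompts, 4 models,
# 2 conditions (standalone API vs. ToltIQ platform), 5 replications.
prompts = 360
models = 4
conditions = 2
replications = 5

total_scored = prompts * models * conditions * replications
print(total_scored)  # 14400 scored responses
```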

ToltIQ's platform combines purpose-built document ingestion with a retrieval-augmented generation (RAG) architecture. Source materials are processed, indexed, and made retrievable at query time, grounding responses in actual deal documents. A standalone model can only reason over what is directly supplied in a single session, with no ability to retrieve across a broader corpus. That architectural difference is what the results below are measuring.
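The retrieve-then-generate pattern described above can be illustrated with a minimal sketch. This is not ToltIQ's implementation; the keyword-overlap scorer, function names, and sample corpus are all illustrative stand-ins for a production retriever over indexed deal documents:

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank indexed document chunks by keyword
    overlap with the query (a stand-in for real RAG retrieval)."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda chunk: len(q_terms & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(query, corpus):
    """Assemble a prompt whose context comes from the corpus,
    grounding the model's answer in retrieved source material."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Illustrative deal-document chunks.
corpus = [
    "The credit agreement sets a leverage covenant of 4.5x EBITDA.",
    "Management bios appear in section 3 of the CIM.",
    "EBITDA is defined net of one-time restructuring items.",
]
query = "What is the leverage covenant in the credit agreement?"
prompt = build_grounded_prompt(query, corpus)
```

A standalone session, by contrast, would have to fit every document into the prompt up front; the retriever is what lets the grounded prompt stay small while the corpus grows.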

Findings

Three of four models showed statistically significant accuracy improvements through ToltIQ's custom architecture, with gains ranging from 6.63 to 9.63 percentage points. Opus 4.5 achieved the highest aggregate accuracy at 85.11% on ToltIQ versus 76.38% standalone. GPT 5.1 improved from 70.82% to 80.45%. Sonnet 4.5 improved from 76.91% to 83.54%. GPT 5.2 was the exception, with no significant improvement.
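The percentage-point gains quoted above follow directly from the aggregate accuracies. A small check, using only the figures reported in this section:

```python
# (standalone %, ToltIQ %) aggregate accuracy per model, from the text.
results = {
    "GPT 5.1": (70.82, 80.45),
    "Sonnet 4.5": (76.91, 83.54),
    "Opus 4.5": (76.38, 85.11),
}

# Percentage-point gain from the platform, per model.
gains = {m: round(toltiq - standalone, 2)
         for m, (standalone, toltiq) in results.items()}
print(gains)  # {'GPT 5.1': 9.63, 'Sonnet 4.5': 6.63, 'Opus 4.5': 8.73}
```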

These figures warrant a methodological note. The standalone condition in this study used single-document sessions: one document per evaluation, with no multi-document context. This represents the most favorable possible conditions for standalone models. In practice, a PE team works across a full virtual data room that may span hundreds of documents and thousands of pages, a document volume that standalone LLMs cannot ingest. The accuracy gains reported here should therefore be understood as conservative estimates of the real-world advantage a purpose-built platform provides.

ToltIQ vs. Standalone LLM - Accuracy Across Models and Use Cases

| Use Case | GPT 5.1 Standalone | GPT 5.1 ToltIQ | GPT 5.2 Standalone | GPT 5.2 ToltIQ | Opus 4.5 Standalone | Opus 4.5 ToltIQ | Sonnet 4.5 Standalone | Sonnet 4.5 ToltIQ |
|---|---|---|---|---|---|---|---|---|
| DIP / Restructuring Analysis | 80.80% | 98.10% | 82.40% | 97.14% | 91.60% | 100.00% | 96.00% | 100.00% |
| CIM Analysis | 72.79% | 89.84% | 75.81% | 70.82% | 79.07% | 92.13% | 76.05% | 93.77% |
| VDR Analysis | 71.19% | 89.49% | 74.05% | 68.14% | 77.38% | 92.86% | 75.24% | 92.88% |
| Events of Default Analysis | 84.29% | 92.86% | 81.43% | 93.57% | 85.00% | 96.38% | 95.00% | 92.86% |
| Interest Rate & Benchmark Analysis | 76.80% | 89.60% | 79.20% | 89.60% | 94.40% | 93.55% | 91.20% | 90.40% |
| Financial Model | 78.70% | 86.96% | 77.39% | 83.91% | 84.78% | 90.35% | 85.65% | 88.26% |
| Covenant Analysis | 71.58% | 77.54% | 71.93% | 71.23% | 78.60% | 87.94% | 82.11% | 86.67% |
| Credit Agreement Term Extraction | 74.56% | 82.75% | 76.80% | 78.30% | 78.74% | 88.87% | 78.83% | 86.41% |
| M&A / Change of Control Analysis | 73.10% | 81.86% | 74.48% | 72.56% | 77.41% | 89.25% | 75.34% | 84.19% |
| IC Memo Draft | 70.47% | 80.84% | 71.96% | 72.24% | 76.51% | 85.56% | 76.82% | 84.13% |
| LOI Draft | 73.74% | 81.74% | 75.64% | 75.13% | 77.54% | 84.13% | 77.09% | 83.83% |
| Contract Extraction | 70.62% | 80.62% | 71.80% | 72.08% | 77.19% | 85.24% | 77.58% | 83.65% |
| Guarantor & Collateral Analysis | 69.81% | 76.23% | 70.57% | 66.79% | 77.36% | 82.64% | 79.25% | 81.13% |
| Portfolio Monitoring | 65.48% | 74.19% | 62.90% | 69.68% | 64.84% | 81.61% | 74.19% | 80.65% |
| Document Checklist | 65.16% | 77.42% | 76.13% | 59.35% | 72.26% | 79.94% | 63.87% | 79.68% |
| Investment Criteria | 63.57% | 72.95% | 62.50% | 68.76% | 69.82% | 80.23% | 73.93% | 79.43% |
| EBITDA Definition Analysis | 63.10% | 74.90% | 59.66% | 48.24% | 67.93% | 77.65% | 65.17% | 78.43% |
| Debt Structure Analysis | 72.62% | 74.46% | 72.31% | 67.69% | 75.38% | 81.07% | 76.00% | 76.92% |
| Management Team Questions | 62.17% | 73.91% | 61.30% | 68.70% | 68.70% | 78.07% | 73.48% | 73.91% |
| Market Research | 63.78% | 79.46% | 58.92% | 52.97% | 63.78% | 80.33% | 57.30% | 71.89% |

Performance Varied Substantially by Use Case

The accuracy improvements were not uniformly distributed across workflows. The largest platform-driven gains were concentrated in document-navigation tasks: CIM Analysis and VDR Analysis each showed improvements exceeding 13 percentage points across three models. These workflows require models to locate, synthesize, and reason across information distributed throughout long, structurally heterogeneous documents: precisely the condition where context-window limitations most constrain standalone models and where system architecture delivers the greatest marginal value.

At the other end, use cases involving standardized, uniformly formatted provisions, including DIP/Restructuring Analysis and Events of Default Analysis, left less headroom for platform gains: the strongest standalone models already performed at or near ceiling on these tasks, with several scores exceeding 90% in both conditions.

Conclusion

The models will keep improving. So will the benchmark. What these results establish is a baseline: in PE due diligence, a purpose-built platform delivers measurable accuracy gains over standalone AI, and the conditions tested here represent the most conservative version of that advantage.

These results reflect model performance as of February 2026. As new models are released, we will apply the same methodology and publish updated findings. 

Note: Only the licensed Vals AI private validation set was used in these results. The held-out test set was not used, nor did Vals independently verify these results.
