Maya Boeye, Steiner Williams, Aubrey Inmon, Jeondo Lee, and Jennifer Venus

Private equity firms have increasingly adopted AI tools for due diligence, but rigorous, use-case-specific benchmarks have been largely absent from the literature. This study addresses that gap.
We evaluated four leading large language models (GPT 5.1, GPT 5.2, Claude Sonnet 4.5, and Claude Opus 4.5) across 20 PE due diligence use cases, comparing standalone API performance against ToltIQ's platform. The evaluation used the Vals AI CorpFin V2 validation dataset, an independent benchmark comprising 360 prompts derived from 18 financial documents; the ToltIQ research team mapped these prompts to the 20 use case categories. Each prompt was run five times per model in each condition, yielding 14,400 scored responses.
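The 14,400 figure follows directly from the study design. As a sanity check, assuming five runs per prompt per model in each of the two conditions (standalone and platform):

```python
# Sanity check of the evaluation arithmetic (design parameters from the study).
prompts = 360       # Vals AI CorpFin V2 prompts
models = 4          # GPT 5.1, GPT 5.2, Sonnet 4.5, Opus 4.5
conditions = 2      # standalone API vs. ToltIQ platform
replications = 5    # runs per prompt per model per condition
total_responses = prompts * models * conditions * replications
print(total_responses)  # 14400
```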
ToltIQ's platform combines purpose-built document ingestion with a retrieval-augmented generation (RAG) architecture. Source materials are processed, indexed, and made retrievable at query time, grounding responses in actual deal documents. A standalone model can only reason over what is directly supplied in a single session, with no ability to retrieve across a broader corpus. That architectural difference is what the results below are measuring.
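The ingest-index-retrieve pattern described above can be sketched in a few lines. This is a toy illustration of retrieval-augmented generation, not ToltIQ's actual implementation: the `DocumentIndex` class, its keyword-overlap scoring, and all names are hypothetical, and a production system would use semantic embeddings rather than word counts.

```python
from collections import Counter

class DocumentIndex:
    """Toy index over document chunks (illustrative only)."""

    def __init__(self):
        self.chunks = []  # list of (doc_id, chunk_text)

    def ingest(self, doc_id, text, chunk_size=40):
        # Split each document into fixed-size word chunks at ingestion time.
        words = text.split()
        for i in range(0, len(words), chunk_size):
            self.chunks.append((doc_id, " ".join(words[i:i + chunk_size])))

    def retrieve(self, query, k=3):
        # Score every chunk by word overlap with the query; a real system
        # would use embedding similarity. The top-k chunks are what gets
        # supplied to the model to ground its answer in the deal documents.
        q = Counter(query.lower().split())
        scored = []
        for doc_id, chunk in self.chunks:
            overlap = sum((q & Counter(chunk.lower().split())).values())
            scored.append((overlap, doc_id, chunk))
        scored.sort(reverse=True)
        return scored[:k]
```

The key contrast with a standalone session: retrieval happens at query time over the whole indexed corpus, so the model is not limited to whatever fits in a single context window.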
Three of four models showed statistically significant accuracy improvements through ToltIQ's custom architecture, with gains ranging from 6.63 to 9.63 percentage points. Opus 4.5 achieved the highest aggregate accuracy at 85.11% on ToltIQ versus 76.38% standalone. GPT 5.1 improved from 70.82% to 80.45%. Sonnet 4.5 improved from 76.91% to 83.54%. GPT 5.2 was the exception, with no significant improvement.
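The report does not state which significance test was used. As an illustration only, a two-proportion z-test on the reported aggregate accuracies, assuming 1,800 independent scored responses per model per condition (360 prompts x 5 runs) and ignoring any correlation between replicate runs, shows gains of this size are far outside sampling noise:

```python
from math import sqrt

N = 1800  # responses per model per condition (assumption, not stated in the report)

def z_stat(p_standalone, p_platform, n=N):
    # Pooled two-proportion z-test with equal sample sizes.
    pooled = (p_standalone + p_platform) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return (p_platform - p_standalone) / se

for name, standalone, platform in [("GPT 5.1", 0.7082, 0.8045),
                                   ("Sonnet 4.5", 0.7691, 0.8354),
                                   ("Opus 4.5", 0.7638, 0.8511)]:
    print(f"{name}: z = {z_stat(standalone, platform):.2f}")  # all well above 1.96
```

Under these assumptions, every z-statistic clears the conventional 1.96 threshold by a wide margin; the study's own test may differ.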
These figures warrant a methodological note. The standalone condition in this study used single-document sessions: one document per evaluation, with no multi-document context. This represents the most favorable possible conditions for standalone models. In practice, a PE team works across a full virtual data room that may span hundreds of documents and thousands of pages, a document volume that standalone LLMs cannot ingest. The accuracy gains reported here should therefore be understood as conservative estimates of the real-world advantage a purpose-built platform provides.
The accuracy improvements were not uniformly distributed across workflows. The largest platform-driven gains were concentrated in document-navigation tasks: CIM Analysis and VDR Analysis each showed improvements exceeding 13 percentage points across three models. These workflows require models to locate, synthesize, and reason across information distributed throughout long, structurally heterogeneous documents, precisely the condition where context window limitations most constrain standalone models and where system architecture delivers the greatest marginal value.
At the other end, use cases involving standardized, uniformly formatted provisions, including DIP/Restructuring Analysis and Events of Default Analysis, showed minimal platform lift, as standalone models already performed at or near ceiling on these tasks, with several scores exceeding 90% across both conditions.
The models will keep improving. So will the benchmark. What these results establish is a baseline: in PE due diligence, a purpose-built platform delivers measurable accuracy gains over standalone AI, and the conditions tested here represent the most conservative version of that advantage.
These results reflect model performance as of February 2026. As new models are released, we will apply the same methodology and publish updated findings.
Note: Only the licensed Vals AI private validation set was used in these results. The held-out test set was not used, nor did Vals independently verify these results.
© 2026 ToltIQ