Maya Boeye, Steiner Williams, Aubrey Inmon, Jeondo Lee, and Jennifer Venus

Private equity firms have increasingly adopted AI tools for due diligence, but rigorous, use-case-specific benchmarks have been largely absent from the literature. This study addresses that gap.
We evaluated four leading large language models (GPT 5.1, GPT 5.2, Claude Sonnet 4.5, and Claude Opus 4.5) across 20 PE due diligence use cases, comparing standalone API performance against ToltIQ's platform. The evaluation used the Vals AI CorpFin V2 validation dataset, an independent benchmark comprising 360 prompts derived from 18 financial documents; the ToltIQ research team mapped these prompts to the 20 use case categories. Each prompt was run five times per model in each condition, yielding 14,400 scored responses.
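The 14,400 figure follows directly from the study design. As a sanity check, assuming five runs per prompt per model in each of the two conditions (standalone and platform):

```python
# Sanity check of the evaluation arithmetic (design parameters from the study).
prompts = 360       # Vals AI CorpFin V2 prompts
models = 4          # GPT 5.1, GPT 5.2, Sonnet 4.5, Opus 4.5
conditions = 2      # standalone API vs. ToltIQ platform
replications = 5    # runs per prompt per model per condition
total_responses = prompts * models * conditions * replications
print(total_responses)  # 14400
```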
ToltIQ's platform combines purpose-built document ingestion with a retrieval-augmented generation (RAG) architecture. Source materials are processed, indexed, and made retrievable at query time, grounding responses in actual deal documents. A standalone model can only reason over what is directly supplied in a single session, with no ability to retrieve across a broader corpus. That architectural difference is what the results below are measuring.
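The ingest-index-retrieve pattern described above can be sketched in a few lines. This is a toy illustration of retrieval-augmented generation, not ToltIQ's actual implementation: the `DocumentIndex` class, its keyword-overlap scoring, and all names are hypothetical, and a production system would use semantic embeddings rather than word counts.

```python
from collections import Counter

class DocumentIndex:
    """Toy index over document chunks (illustrative only)."""

    def __init__(self):
        self.chunks = []  # list of (doc_id, chunk_text)

    def ingest(self, doc_id, text, chunk_size=40):
        # Split each document into fixed-size word chunks at ingestion time.
        words = text.split()
        for i in range(0, len(words), chunk_size):
            self.chunks.append((doc_id, " ".join(words[i:i + chunk_size])))

    def retrieve(self, query, k=3):
        # Score every chunk by word overlap with the query; a real system
        # would use embedding similarity. The top-k chunks are what gets
        # supplied to the model to ground its answer in the deal documents.
        q = Counter(query.lower().split())
        scored = []
        for doc_id, chunk in self.chunks:
            overlap = sum((q & Counter(chunk.lower().split())).values())
            scored.append((overlap, doc_id, chunk))
        scored.sort(reverse=True)
        return scored[:k]
```

The key contrast with a standalone session: retrieval happens at query time over the whole indexed corpus, so the model is not limited to whatever fits in a single context window.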
Three of four models showed statistically significant accuracy improvements through ToltIQ's custom architecture, with gains ranging from 6.63 to 9.63 percentage points. Opus 4.5 achieved the highest aggregate accuracy at 85.11% on ToltIQ versus 76.38% standalone. GPT 5.1 improved from 70.82% to 80.45%. Sonnet 4.5 improved from 76.91% to 83.54%. GPT 5.2 was the exception, with no significant improvement.
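The report does not state which significance test was used. As an illustration only, a two-proportion z-test on the reported aggregate accuracies, assuming 1,800 independent scored responses per model per condition (360 prompts x 5 runs) and ignoring any correlation between replicate runs, shows gains of this size are far outside sampling noise:

```python
from math import sqrt

N = 1800  # responses per model per condition (assumption, not stated in the report)

def z_stat(p_standalone, p_platform, n=N):
    # Pooled two-proportion z-test with equal sample sizes.
    pooled = (p_standalone + p_platform) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return (p_platform - p_standalone) / se

for name, standalone, platform in [("GPT 5.1", 0.7082, 0.8045),
                                   ("Sonnet 4.5", 0.7691, 0.8354),
                                   ("Opus 4.5", 0.7638, 0.8511)]:
    print(f"{name}: z = {z_stat(standalone, platform):.2f}")  # all well above 1.96
```

Under these assumptions, every z-statistic clears the conventional 1.96 threshold by a wide margin; the study's own test may differ.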
These figures warrant a methodological note. The standalone condition in this study used single-document sessions: one document per evaluation, with no multi-document context. This represents the most favorable possible conditions for standalone models. In practice, a PE team works across a full virtual data room that may span hundreds of documents and thousands of pages, a document volume that standalone LLMs cannot ingest. The accuracy gains reported here should therefore be understood as conservative estimates of the real-world advantage a purpose-built platform provides.
The accuracy improvements were not uniformly distributed across workflows. The largest platform-driven gains were concentrated in document-navigation tasks: CIM Analysis and VDR Analysis each showed improvements exceeding 13 percentage points across three models. These workflows require models to locate, synthesize, and reason across information distributed throughout long, structurally heterogeneous documents, precisely the condition where context window limitations most constrain standalone models and where system architecture delivers the greatest marginal value.
At the other end, use cases involving standardized, uniformly formatted provisions, including DIP/Restructuring Analysis and Events of Default Analysis, showed minimal platform lift, as standalone models already performed at or near ceiling on these tasks, with several scores exceeding 90% across both conditions.
The models will keep improving. So will the benchmark. What these results establish is a baseline: in PE due diligence, a purpose-built platform delivers measurable accuracy gains over standalone AI, and the conditions tested here represent the most conservative version of that advantage.
These results reflect model performance as of February 2026. As new models are released, we will apply the same methodology and publish updated findings.
Note: Only the licensed Vals AI private validation set was used in these results. The held-out test set was not used, nor did Vals independently verify these results.
© 2026 ToltIQ