Why Building AI for Vendor Risk Assessment Is Harder Than It Looks

We tried to teach AI to assess vendor risk. Here is why the scoring problem almost broke us.

Published on

Jun 10, 2026

Updated on

Jun 18, 2026

Authored by

Tanvi Anand

AI Product Leader

reviewed by

Joel Noronha

Director of Product Marketing

Vendor risk assessment should be straightforward for AI. Read the vendor's security documentation. Compare it against a framework. Score the gaps. Generate a report.

We built that version first. It worked well on clean inputs and produced garbage on everything else. This post covers the hardest problems we encountered building a vendor risk assessment agent, and the design choices we made to solve them.

The document problem: vendor responses are not what you expect

Our first assumption was that vendor risk assessment starts with structured data. The vendor fills out a questionnaire, you get a spreadsheet of answers, and you score them. Clean in, clean out.

In reality, vendors send whatever they have. A SOC 2 Type II report as a 90-page PDF. A "security overview" slide deck. A FAQ page that does not actually answer the questions you asked. Sometimes they send all three and expect you to reconcile the information yourself.

Extraction is hard for structural reasons. A SOC 2 report has useful information scattered across the auditor’s opinion, the system description, the control objectives, and the test results. The structure varies by audit firm. Some reports are text-native PDFs. Others are scanned images with OCR artifacts. A few are protected PDFs that resist programmatic extraction entirely.

We built a document extraction pipeline that handles text PDFs, OCR documents, and structured questionnaire responses. The hardest part was truncation. When a vendor sends a 90-page SOC 2 report, and your extraction context window can handle maybe 30 pages of dense content, what do you cut? The answer matters. Cut the wrong section and your risk assessment misses a material control gap.
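To make the truncation tradeoff concrete, here is a minimal sketch of one possible approach: rank sections by how likely they are to contain material findings, then keep the highest-ranked ones that fit a page budget. The section names, page counts, and priority values are hypothetical illustrations, not the rules our pipeline actually uses.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    pages: int
    priority: int  # hypothetical ranking: lower number = more important


def select_sections(sections, page_budget):
    """Greedily keep the highest-priority sections that fit the budget.

    Returns the kept sections and the names of everything dropped, so the
    truncation is visible to the reviewer instead of silent.
    """
    kept, used = [], 0
    for section in sorted(sections, key=lambda s: s.priority):
        if used + section.pages <= page_budget:
            kept.append(section)
            used += section.pages
    dropped = [s.name for s in sections if s not in kept]
    return kept, dropped


# A made-up 90-page report, listed in document order.
report = [
    Section("auditor_opinion", 4, 1),
    Section("system_description", 32, 4),
    Section("control_objectives", 14, 3),
    Section("test_exceptions", 10, 2),
    Section("tests_without_exceptions", 30, 5),
]

kept, dropped = select_sections(report, page_budget=30)
print([s.name for s in kept])
print(dropped)
```

Returning the dropped list matters as much as the selection itself: if a section never reached the model, the reviewer should know that before trusting a clean result.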

Why risk scoring is harder than it looks

Here is the problem that almost broke us. Given a vendor's response to a security question, score the risk. High, medium, low. Sounds straightforward.

Except that risk is contextual. A vendor that lacks MFA on their admin panel is a critical risk if they process your customer data, and a low risk if they provide your office coffee. The same control gap has varying severity depending on what the vendor has access to, what data they handle, and your regulatory obligations.

Our first scoring model treated every vendor the same. It looked at the response, compared it to best practices, and assigned a score. The results were technically defensible but practically useless. A CISO reviewing 50 vendor assessments does not want to see every vendor flagged as high risk because each has at least one imperfect control.

We iterated toward a risk extraction approach that maps vendor responses to specific framework controls, identifies where the vendor’s controls meet, partially meet, or fail to meet expectations, and surfaces the gaps that actually matter given the vendor’s data access and your compliance requirements. The reasoning chain is visible. The reviewer can see why a particular gap was flagged and decide whether they agree.
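The idea that the same gap lands at a different severity depending on vendor context can be sketched as follows. This is an illustration under assumptions, not our production scoring logic: the three context flags, the four-level scale, and the adjustment rule are all hypothetical.

```python
from dataclasses import dataclass

SEVERITY = ["low", "medium", "high", "critical"]


@dataclass
class VendorContext:
    # Hypothetical exposure flags; a real program would track more detail.
    handles_customer_data: bool
    has_production_access: bool
    in_regulated_scope: bool


def contextual_severity(base: str, ctx: VendorContext) -> str:
    """Shift a gap's base severity according to the vendor's exposure."""
    level = SEVERITY.index(base)
    exposure = sum(
        [ctx.handles_customer_data, ctx.has_production_access, ctx.in_regulated_scope]
    )
    if exposure == 0:
        level = 0  # no sensitive exposure: floor the finding to low
    elif exposure >= 2:
        level = min(level + 1, len(SEVERITY) - 1)  # escalate one step, capped
    return SEVERITY[level]


coffee_supplier = VendorContext(False, False, False)
data_processor = VendorContext(True, False, True)

# Same gap (no MFA on the admin panel), two very different vendors.
print(contextual_severity("high", coffee_supplier))
print(contextual_severity("high", data_processor))
```

Keeping the adjustment as an explicit rule, rather than burying it in a prompt, is one way to keep the reasoning chain inspectable by the reviewer.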

The false positive trap

Early versions of our risk extractor flagged everything. Every hedged answer, every partial response, every "we are in the process of implementing" got marked as a risk. The output was a wall of findings. Technically accurate, practically unusable.

False positives are the fastest way to kill trust in an AI system. If a reviewer has to dismiss 40 out of 50 findings as irrelevant noise, they stop trusting the other 10, even when those 10 are real. The ratio matters more than the total count.

Reducing false positives without increasing false negatives is a precision-recall tradeoff that we tuned through hundreds of real vendor assessments. The lever that made the biggest difference was not the model. It was the rubric. When we gave the AI a structured framework for distinguishing between material risks, documentation gaps, and stylistic concerns, the signal-to-noise ratio improved dramatically.
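For readers who want the tradeoff in numbers, here is a small worked example. The 40-of-50 figure is the illustration from above; the count of missed risks is a made-up assumption, included only so that recall can be computed.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of flagged findings that were real."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of real risks that were actually flagged."""
    real = true_pos + false_neg
    return true_pos / real if real else 0.0


# 50 findings, 40 dismissed as noise, 10 real.
print(precision(true_pos=10, false_pos=40))

# Hypothetical: suppose 2 real risks were never flagged at all.
print(recall(true_pos=10, false_neg=2))
```

Precision is the number a reviewer feels on every assessment. Recall is the number an auditor discovers later, which is why tuning one without watching the other is dangerous.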

This is where our infosec domain experts became essential. An engineer can build the extraction pipeline. Only a compliance expert can define what "material risk" means for SOC 2 versus ISO 27001 versus a custom enterprise questionnaire. We now maintain per-framework rubrics that the AI uses to calibrate severity.
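A per-framework rubric can be as simple as a lookup from finding category to severity. The sketch below uses the three categories named above; the severity values and the shape of a finding are hypothetical placeholders, not the contents of our actual rubrics.

```python
from enum import Enum


class Category(Enum):
    MATERIAL_RISK = "material_risk"
    DOCUMENTATION_GAP = "documentation_gap"
    STYLISTIC_CONCERN = "stylistic_concern"


# Placeholder values: None means "do not surface to the reviewer".
RUBRICS = {
    "SOC 2": {
        Category.MATERIAL_RISK: "high",
        Category.DOCUMENTATION_GAP: "medium",
        Category.STYLISTIC_CONCERN: None,
    },
    "ISO 27001": {
        Category.MATERIAL_RISK: "high",
        Category.DOCUMENTATION_GAP: "low",
        Category.STYLISTIC_CONCERN: None,
    },
}


def triage(findings, framework):
    """Attach a rubric severity to each finding and drop suppressed ones."""
    rubric = RUBRICS[framework]
    surfaced = []
    for finding in findings:
        severity = rubric[finding["category"]]
        if severity is not None:
            surfaced.append({**finding, "severity": severity})
    return surfaced


findings = [
    {"text": "No MFA on admin panel", "category": Category.MATERIAL_RISK},
    {"text": "Policy lacks a review date", "category": Category.DOCUMENTATION_GAP},
    {"text": "Answer is vaguely worded", "category": Category.STYLISTIC_CONCERN},
]

for item in triage(findings, "SOC 2"):
    print(item["severity"], "-", item["text"])
```

The point of the structure is that a compliance expert can edit the table without touching the extraction code.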

Reassessment: the problem nobody budgets time for

Initial vendor assessments get all the attention. The first time you onboard a vendor, you run a thorough review. The problem is that vendor risk does not freeze after the initial assessment. SOC 2 reports expire. Certifications lapse. Vendors change their infrastructure.

Reassessments are where most vendor risk programs quietly fail. The initial assessment happened 14 months ago. Nobody scheduled the follow-up. The vendor’s SOC 2 report expired three months ago, and nobody noticed. Your audit comes around, and the auditor flags 15 vendors with stale assessments.

We built automated reassessment scheduling into the vendor risk workflow. The system tracks when each vendor's documentation expires, when the last assessment was completed, and when the next one is due. When a reassessment is triggered, it pulls the vendor’s latest available documentation and runs the risk extraction pipeline again, giving the reviewer a comparison of what changed since the last assessment.

This is not glamorous AI work. It is calendar logic and notification triggers. But it solves a real operational gap that causes more audit findings than any fancy risk scoring model.
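The calendar logic really is this plain. Below is a minimal sketch, assuming an annual reassessment cadence and a record with two dates per vendor; the field names and the 365-day interval are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REASSESS_EVERY = timedelta(days=365)  # assumed annual cadence


@dataclass
class VendorRecord:
    name: str
    last_assessed: date
    doc_expires: date


def reassessment_reasons(vendor: VendorRecord, today: date) -> list:
    """Return the reasons a reassessment is due, or an empty list."""
    reasons = []
    if vendor.doc_expires <= today:
        reasons.append("documentation expired")
    if vendor.last_assessed + REASSESS_EVERY <= today:
        reasons.append("assessment overdue")
    return reasons


today = date(2026, 6, 10)
vendors = [
    # Assessed 14 months ago, report expired 3 months ago.
    VendorRecord("stale-vendor", date(2025, 4, 10), date(2026, 3, 10)),
    VendorRecord("fresh-vendor", date(2026, 1, 15), date(2026, 12, 31)),
]

for v in vendors:
    print(v.name, reassessment_reasons(v, today))
```

Running a check like this on a schedule, and notifying an owner whenever the list is non-empty, is enough to catch the stale assessments before the auditor does.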

What evaluating AI output taught us about our own system

Building the vendor risk extractor was one challenge. Knowing whether it was actually working was a separate, equally hard challenge.

When we started evaluating the risk extractor’s output, we discovered that our infosec domain experts could not even agree on what a "correct" risk assessment looks like. One reviewer would flag a vendor response as a gap. Another would call it acceptable. The inter-annotator agreement was embarrassingly low in early rounds.

This forced us to build rubrics, which forced us to have explicit conversations about what we actually mean by "risk completeness" and "control mapping accuracy." The eval process improved the product, not because it caught bugs (though it did), but because it made us define our own quality standard with enough specificity that both humans and AI could be measured against it.
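One common way to quantify reviewer disagreement is Cohen's kappa, which corrects raw agreement for the agreement two reviewers would reach by chance. The sketch below is an illustration with made-up labels; it is an assumption that kappa is a suitable statistic here, not a statement of which metric we used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight vendor responses.
reviewer_a = ["gap", "gap", "ok", "ok", "gap", "ok", "gap", "ok"]
reviewer_b = ["gap", "ok", "ok", "gap", "gap", "ok", "ok", "ok"]

print(cohens_kappa(reviewer_a, reviewer_b))
```

In this example the reviewers agree on five of eight items, which sounds acceptable until chance agreement is subtracted. Tracking the statistic across rubric revisions is one way to tell whether a rubric change actually made the standard clearer.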

We will have much more to say about our eval process in an upcoming post. It turned out to be one of the most consequential things we built.

The bottom line

Vendor risk assessment AI is not a solved problem. It is a collection of hard problems: document extraction at scale, contextual risk scoring, false positive reduction, and the unglamorous but critical work of reassessment scheduling. Each one required its own engineering investment and its own domain expertise.

The part we are most proud of is not the AI. It is the feedback loop between our domain experts and the system. Every vendor assessment that gets reviewed teaches us where the risk extractor breaks. That is the engine that makes it better over time.

Tanvi Anand

AI Product Leader

Tanvi Anand is an AI Product Leader with 7+ years of experience across AI and ML. She leads Scrut’s AI product strategy and built Scrut Teammates, the agentic AI suite from 0 to 1, bringing autonomous automation to governance, risk, and compliance at enterprise scale. She has worked on product and AI with P&G, Nielsen, and Activision Blizzard. A builder, inventor, and researcher, she holds a patent, has published peer-reviewed AI research cited 100+ times, and has earned awards for her work in artificial intelligence. She holds an MS from the University of Texas at Austin, where she conducted applied machine learning and NLP research.

Joel Noronha

Director of Product Marketing

Joel Noronha is the Director of Product Marketing at Scrut Automation, where he focuses on GTM strategy for the Scrut platform. With over a decade in B2B SaaS product marketing across enterprise software, consumer tech, and media, Joel specialises in bringing products to market in emergent technology categories and translating complex capabilities into clear, market-ready narratives.

About Scrut Automation

Scrut Automation is a modern GRC platform designed to help fast-growing organizations simplify security, compliance, and risk management.

By combining continuous automation with expert guidance, Scrut reduces manual workloads, accelerates audit readiness, and empowers teams to scale their security posture confidently.

From HIPAA and SOC 2 to ISO 27001, GDPR, PCI, and beyond, Scrut helps teams achieve multi-framework compliance with ease.
