Blog
/
Product Updates
/
We tried to teach AI to assess vendor risk. Here is why the scoring problem almost broke us.

We tried to teach AI to assess vendor risk. Here is why the scoring problem almost broke us.

7
min read
Published on
Jun 10, 2026
Updated on
Jun 10, 2026
Authored by
Tanvi Anand
AI Product Leader
reviewed by
Joel Noronha
Director of Product Marketing
Table of contents

Vendor risk assessment should be straightforward for AI. Read the vendor's security documentation. Compare it against a framework. Score the gaps. Generate a report.

We built that version first. It worked well on clean inputs and produced garbage on everything else. This post is about the three hardest problems we encountered building a vendor risk assessment agent, and the design choices we made to solve them.

The document problem: vendor responses are not what you expect

Our first assumption was that vendor risk assessment starts with structured data. The vendor fills out a questionnaire, you get a spreadsheet of answers, and you score them. Clean in, clean out.

In reality, vendors send whatever they have. A SOC 2 Type II report as a 90-page PDF. A "security overview" slide deck. A FAQ page that does not actually answer the questions you asked. Sometimes they send all three and expect you to reconcile the information yourself.

The extraction challenge is difficult. A SOC 2 report has useful information scattered across the auditor’s opinion, the system description, the control objectives, and the test results. The structure varies by audit firm. Some reports are text-native PDFs. Others are scanned images with OCR artifacts. A few are protected PDFs that resist programmatic extraction entirely.

We built a document extraction pipeline that handles text PDFs, OCR documents, and structured questionnaire responses. The hardest part was truncation. When a vendor sends a 90-page SOC 2 report, and your extraction context window can handle maybe 30 pages of dense content, what do you cut? The answer matters. Cut the wrong section and your risk assessment misses a material control gap.

Why risk scoring is harder than it looks

Here is the problem that almost broke us. Given a vendor's response to a security question, score the risk. High, medium, low. Sounds straightforward.

Except that risk is contextual. A vendor that lacks MFA on their admin panel is a critical risk if they process your customer data, and a low risk if they provide your office coffee. The same control gap has varying severity depending on what the vendor has access to, what data they handle, and your regulatory obligations.

Our first scoring model treated every vendor the same. It looked at the response, compared it to best practices, and assigned a score. The results were technically defensible but practically useless. A CISO reviewing 50 vendor assessments does not want to see every vendor flagged as high risk because each has at least one imperfect control.

We iterated toward a risk extraction approach that maps vendor responses to specific framework controls, identifies where the vendor’s controls meet, partially meet, or fail to meet expectations, and surfaces the gaps that actually matter given the vendor’s data access and your compliance requirements. The reasoning chain is visible. The reviewer can see why a particular gap was flagged and decide whether they agree.

The false positive trap

Early versions of our risk extractor flagged everything. Every hedged answer, every partial response, every "we are in the process of implementing" got marked as a risk. The output was a wall of findings. Technically accurate, practically unusable.

False positives are the fastest way to kill trust in an AI system. If a reviewer has to dismiss 40 out of 50 findings as irrelevant noise, they stop trusting the other 10, even when those 10 are real. The ratio matters more than the total count.

Reducing false positives without increasing false negatives is a precision-recall tradeoff that we tuned through hundreds of real vendor assessments. The lever that made the biggest difference was not the model. It was the rubric. When we gave the AI a structured framework for distinguishing between material risks, documentation gaps, and stylistic concerns, the signal-to-noise ratio improved dramatically.

This is where our infosec domain experts became essential. An engineer can build the extraction pipeline. Only a compliance expert can define what "material risk" means for SOC 2 versus ISO 27001 versus a custom enterprise questionnaire. We now maintain per-framework rubrics that the AI uses to calibrate severity.

Reassessment: the problem nobody budgets time for

Initial vendor assessments get all the attention. The first time you onboard a vendor, you run a thorough review. The problem is that vendor risk does not freeze after the initial assessment. SOC 2 reports expire. Certifications lapse. Vendors change their infrastructure.

Reassessments are where most vendor risk programs quietly fail. The initial assessment happened 14 months ago. Nobody scheduled the follow-up. The vendor’s SOC 2 report expired three months ago, and nobody noticed. Your audit comes around, and the auditor flags 15 vendors with stale assessments.

We built automated reassessment scheduling into the vendor risk workflow. The system tracks when each vendor's documentation expires, when the last assessment was completed, and when the next one is due. When a reassessment is triggered, it pulls the vendor’s latest available documentation and runs the risk extraction pipeline again, giving the reviewer a comparison of what changed since the last assessment.

This is not glamorous AI work. It is calendar logic and notification triggers. But it solves a real operational gap that causes more audit findings than any fancy risk scoring model.

What evaluating AI output taught us about our own system

Building the vendor risk extractor was one challenge. Knowing whether it was actually working was a separate, equally hard challenge.

When we started evaluating the risk extractor’s output, we discovered that our infosec domain experts could not even agree on what a "correct" risk assessment looks like. One reviewer would flag a vendor response as a gap. Another would call it acceptable. The inter-annotator agreement was embarrassingly low in early rounds.

This forced us to build rubrics, which forced us to have explicit conversations about what we actually mean by "risk completeness" and "control mapping accuracy." The eval process improved the product, not because it caught bugs (though it did), but because it made us define our own quality standard with enough specificity that both humans and AI could be measured against it.

We will have much more to say about our eval process in an upcoming post. It turned out to be one of the most consequential things we built.

The bottom line

Vendor risk assessment AI is not a solved problem. It is a collection of hard problems: document extraction at scale, contextual risk scoring, false positive reduction, and the unglamorous but critical work of reassessment scheduling. Each one required its own engineering investment and its own set of domain expertise.

The part we are most proud of is not the AI. It is the feedback loop between our domain experts and the system. Every vendor assessment that gets reviewed teaches us where the risk extractor breaks. That is the engine that makes it better over time.

Liked the post? Share on:
Choose risk-first compliance that’s always on, built for you.
Book a Demo
Book a Demo
Enjoyed this post? Let us know!
About Scrut Automation

Scrut Automation is a modern GRC platform designed to help fast-growing organizations simplify security, compliance, and risk management.

By combining continuous automation with expert guidance, Scrut reduces manual workloads, accelerates audit readiness, and empowers teams to scale their security posture confidently.

From HIPAA and SOC 2 to ISO 27001, GDPR, PCI, and beyond; Scrut helps teams achieve multi-framework compliance with ease.

Join our community and be the first to know about updates!

Subscribe
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Choose risk-first compliance that’s always on, built for you, and never in your way.

The Scrut Platform helps you move fast, stay compliant, and build securely from the start.

Book a Demo
Book a Demo