Vendor risk assessment should be straightforward for AI. Read the vendor's security documentation. Compare it against a framework. Score the gaps. Generate a report.
We built that version first. It worked well on clean inputs and produced garbage on everything else. This post is about the three hardest problems we encountered building a vendor risk assessment agent, and the design choices we made to solve them.
The document problem: vendor responses are not what you expect
Our first assumption was that vendor risk assessment starts with structured data. The vendor fills out a questionnaire, you get a spreadsheet of answers, and you score them. Clean in, clean out.
In reality, vendors send whatever they have. A SOC 2 Type II report as a 90-page PDF. A "security overview" slide deck. A FAQ page that does not actually answer the questions you asked. Sometimes they send all three and expect you to reconcile the information yourself.
The extraction challenge is difficult. A SOC 2 report has useful information scattered across the auditor’s opinion, the system description, the control objectives, and the test results. The structure varies by audit firm. Some reports are text-native PDFs. Others are scanned images with OCR artifacts. A few are protected PDFs that resist programmatic extraction entirely.
We built a document extraction pipeline that handles text PDFs, OCR documents, and structured questionnaire responses. The hardest part was truncation. When a vendor sends a 90-page SOC 2 report, and your extraction context window can handle maybe 30 pages of dense content, what do you cut? The answer matters. Cut the wrong section and your risk assessment misses a material control gap.
Why risk scoring is harder than it looks
Here is the problem that almost broke us. Given a vendor's response to a security question, score the risk. High, medium, low. Sounds straightforward.

Except that risk is contextual. A vendor that lacks MFA on their admin panel is a critical risk if they process your customer data, and a low risk if they provide your office coffee. The same control gap has varying severity depending on what the vendor has access to, what data they handle, and your regulatory obligations.
Our first scoring model treated every vendor the same. It looked at the response, compared it to best practices, and assigned a score. The results were technically defensible but practically useless. A CISO reviewing 50 vendor assessments does not want to see every vendor flagged as high risk because each has at least one imperfect control.
We iterated toward a risk extraction approach that maps vendor responses to specific framework controls, identifies where the vendor’s controls meet, partially meet, or fail to meet expectations, and surfaces the gaps that actually matter given the vendor’s data access and your compliance requirements. The reasoning chain is visible. The reviewer can see why a particular gap was flagged and decide whether they agree.

The false positive trap
Early versions of our risk extractor flagged everything. Every hedged answer, every partial response, every "we are in the process of implementing" got marked as a risk. The output was a wall of findings. Technically accurate, practically unusable.
False positives are the fastest way to kill trust in an AI system. If a reviewer has to dismiss 40 out of 50 findings as irrelevant noise, they stop trusting the other 10, even when those 10 are real. The ratio matters more than the total count.
Reducing false positives without increasing false negatives is a precision-recall tradeoff that we tuned through hundreds of real vendor assessments. The lever that made the biggest difference was not the model. It was the rubric. When we gave the AI a structured framework for distinguishing between material risks, documentation gaps, and stylistic concerns, the signal-to-noise ratio improved dramatically.
This is where our infosec domain experts became essential. An engineer can build the extraction pipeline. Only a compliance expert can define what "material risk" means for SOC 2 versus ISO 27001 versus a custom enterprise questionnaire. We now maintain per-framework rubrics that the AI uses to calibrate severity.
Reassessment: the problem nobody budgets time for
Initial vendor assessments get all the attention. The first time you onboard a vendor, you run a thorough review. The problem is that vendor risk does not freeze after the initial assessment. SOC 2 reports expire. Certifications lapse. Vendors change their infrastructure.
Reassessments are where most vendor risk programs quietly fail. The initial assessment happened 14 months ago. Nobody scheduled the follow-up. The vendor’s SOC 2 report expired three months ago, and nobody noticed. Your audit comes around, and the auditor flags 15 vendors with stale assessments.
We built automated reassessment scheduling into the vendor risk workflow. The system tracks when each vendor's documentation expires, when the last assessment was completed, and when the next one is due. When a reassessment is triggered, it pulls the vendor’s latest available documentation and runs the risk extraction pipeline again, giving the reviewer a comparison of what changed since the last assessment.
This is not glamorous AI work. It is calendar logic and notification triggers. But it solves a real operational gap that causes more audit findings than any fancy risk scoring model.

What evaluating AI output taught us about our own system
Building the vendor risk extractor was one challenge. Knowing whether it was actually working was a separate, equally hard challenge.
When we started evaluating the risk extractor’s output, we discovered that our infosec domain experts could not even agree on what a "correct" risk assessment looks like. One reviewer would flag a vendor response as a gap. Another would call it acceptable. The inter-annotator agreement was embarrassingly low in early rounds.
This forced us to build rubrics, which forced us to have explicit conversations about what we actually mean by "risk completeness" and "control mapping accuracy." The eval process improved the product, not because it caught bugs (though it did), but because it made us define our own quality standard with enough specificity that both humans and AI could be measured against it.
We will have much more to say about our eval process in an upcoming post. It turned out to be one of the most consequential things we built.
The bottom line
Vendor risk assessment AI is not a solved problem. It is a collection of hard problems: document extraction at scale, contextual risk scoring, false positive reduction, and the unglamorous but critical work of reassessment scheduling. Each one required its own engineering investment and its own set of domain expertise.
The part we are most proud of is not the AI. It is the feedback loop between our domain experts and the system. Every vendor assessment that gets reviewed teaches us where the risk extractor breaks. That is the engine that makes it better over time.


Tanvi Anand is an AI Product Leader with 7+ years of experience across AI and ML. She leads Scrut’s AI product strategy and built Scrut Teammates, the agentic AI suite from 0 to 1, bringing autonomous automation to governance, risk, and compliance at enterprise scale. She has worked on product and AI with P&G, Nielsen, and Activision Blizzard. A builder, inventor, and researcher, she holds a patent, has published peer-reviewed AI research cited 100+ times, and has earned awards for her work in artificial intelligence. She holds an MS from the University of Texas at Austin, where she conducted applied machine learning and NLP research.
.png)
Joel Noronha is the Director of Product Marketing at Scrut Automation, where he focuses on GTM strategy for the Scrut platform. With over a decade in B2B SaaS product marketing across enterprise software, consumer tech, and media, Joel specialises in bringing products to market in emergent technology categories and translating complex capabilities into clear, market-ready narratives.
























