In one of our projects in the Utilization Management space in US Healthcare, we were working on optimizing the clinical appeal writing process for clinical denials.
Though AI-driven, the backbone of the solution was a neatly scraped, efficiently stored set of clinical policies that the model could pick up and compare against a patient's records to analyse deviations.
So while the product and architecture teams evangelized the shinier parts (how the analysis/comparison would happen, crafting the right prompts, choosing the right AI model, and automating the flow of data), a lot of homework was being done on:
- consolidating insurance payers across the US
- understanding their insurance policy structures
- scraping the policies
- massaging them into a pattern best understood and used by our system
This was the non-sexy work! It was repetitive, messy, and had to be done continuously as policies changed.
Phase 1: Getting the Project Off the Ground
The initial implementation was manual scraping. The traditional web scrapers available at the time (2023–24) weren’t enough. Insurance websites are dynamic, protected by Cloudflare, and require intelligent navigation.
The Pain Areas
What seems like a simple task — downloading medical policy PDFs — became an endless cycle of:
- searching through complex website hierarchies
- clicking through pagination across dozens of pages
- copying URLs into spreadsheets
- downloading files one by one
- repeating this for every payer
For our team, this process was
- time-consuming and mind-numbing
- error-prone (missing policies, duplicates)
- unscalable
Simply automating each payer website was useless: each had its own structure, navigation, and document layout. We would have had to write as many branches of code as there were payers. This was unscalable!
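To see why, here is a minimal sketch of the per-payer branching we would have had to maintain (payer names, URLs, and site structures here are purely hypothetical):

```python
# Hypothetical illustration: one hand-written branch per payer.
# Every new payer, and every site redesign, means another branch of code.

def scrape_policies(payer: str) -> list[str]:
    """Return candidate policy URLs for a given payer (illustrative stubs)."""
    if payer == "payer_a":
        # Payer A paginates its policy list under /medical-policies?page=N
        return [f"https://payer-a.example.com/medical-policies?page={n}"
                for n in range(1, 3)]
    elif payer == "payer_b":
        # Payer B hides PDFs behind a category tree
        return ["https://payer-b.example.com/docs/cardiology/policy.pdf"]
    else:
        raise ValueError(f"No scraper written for {payer} yet!")
```

Ten payers means ten branches; a hundred payers means a hundred, each one breaking independently whenever its site changes.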
Phase 2: Enter the AI Scraper
We moved to an AI-powered scraping approach using Google’s ADK with Gemini.
But we didn’t make everything AI-driven.
By this time, we had learned something important:
Not every step needs intelligence. We designed a hybrid, token-optimized architecture where not every step was driven by AI.
Where AI Helped (Decision Making)
- understanding page structure
- identifying relevant documents
- navigating dynamic content
- validating whether something is actually a policy
Where AI Was Overkill (Execution)
- downloading files
- managing HTTP requests
- handling browser automation
- converting to PDF
That part just needed to be reliable.
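The split above can be sketched in a few lines. This is only an illustration, not our production code: `ai_is_policy` stands in for a real Gemini/ADK call (here it is a trivial keyword heuristic), while everything downstream is plain, deterministic code.

```python
# Sketch of the hybrid split: the "intelligent" step is isolated behind one
# function, so the expensive model call happens only where judgement is
# needed; the rest of the pipeline stays deterministic and cheap.

def ai_is_policy(link_text: str) -> bool:
    """Stand-in for a Gemini call that judges whether a link points to a
    clinical policy. Replaced here by a keyword check for illustration."""
    return "policy" in link_text.lower()

def plan_downloads(links: dict[str, str]) -> list[str]:
    """AI decides WHICH links matter; plain code decides HOW to fetch them."""
    return [url for text, url in links.items() if ai_is_policy(text)]

links = {
    "Medical Policy: Cardiac Imaging": "https://example.com/a.pdf",
    "Careers": "https://example.com/jobs",
    "Pharmacy Policy Update": "https://example.com/b.pdf",
}
to_download = plan_downloads(links)  # only the two policy links survive
```

The key design choice: every token spent on `ai_is_policy` buys a decision, never a download.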
The Next Bottleneck
Things improved significantly. We were scraping ~5x faster, but then we hit our next bottleneck. Many insurance payer websites are protected by Cloudflare, which is excellent at detecting and blocking automation tools like Selenium. Even with our AI agent working perfectly, we kept hitting:
- endless verification loops
- CAPTCHA challenges
- IP blocks and rate limiting
- “Access Denied” pages
No amount of prompting was going to get us through this. We limped along with a workaround until we entered Phase 3!
Phase 3: Stealth Mode
We switched to a more advanced browser automation setup using Scrapling (Playwright-based). Because of our interface-based architecture, the switch required minimal changes.
This enabled:
- better fingerprint handling
- more realistic browser behavior
- built-in handling of verification challenges
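The "minimal changes" claim rests on the interface-based design, which can be sketched as follows. The fetcher classes here are stubs (the real Scrapling and Selenium APIs are not shown); the point is that the rest of the pipeline depends only on one `fetch` interface, so swapping engines is a one-line change.

```python
from typing import Protocol

class Fetcher(Protocol):
    """The single interface the rest of the pipeline depends on."""
    def fetch(self, url: str) -> str: ...

class SeleniumFetcher:
    # Phase 2 engine (stubbed; the real one drove a Selenium browser).
    def fetch(self, url: str) -> str:
        return f"<html>selenium:{url}</html>"

class StealthFetcher:
    # Phase 3 engine (stubbed; the real one used Scrapling/Playwright with
    # fingerprint handling). Same interface, so nothing upstream changes.
    def fetch(self, url: str) -> str:
        return f"<html>stealth:{url}</html>"

def scrape(fetcher: Fetcher, url: str) -> str:
    return fetcher.fetch(url)

# Swapping engines is a one-line change at the call site:
page = scrape(StealthFetcher(), "https://example.com/policies")
```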
And finally we had a fully automated system that worked end to end, where AI was just one part of the full puzzle.
The Final Piece
At the center of everything was a simple orchestrator with superpowers, responsible for:
- coordinating search
- navigating intelligently
- deduplicating results
- aggregating data
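The orchestrator's job can be sketched in a single loop (names are illustrative; the search/navigation step is represented by a plain callable standing in for the AI agent):

```python
# Minimal orchestrator sketch: coordinate search across payers,
# deduplicate by URL, and aggregate the results into one list.

def orchestrate(payers, search_fn):
    """search_fn(payer) -> list of (title, url) candidate documents."""
    seen, results = set(), []
    for payer in payers:
        for title, url in search_fn(payer):
            if url in seen:          # skip duplicates across pages/payers
                continue
            seen.add(url)
            results.append({"payer": payer, "title": title, "url": url})
    return results

def fake_search(payer):
    # Each payer yields one unique document plus one shared duplicate.
    return [(f"{payer} Policy", f"https://x.com/{payer}.pdf"),
            ("Shared Policy", "https://x.com/shared.pdf")]

docs = orchestrate(["payer_a", "payer_b"], fake_search)
# -> 3 documents: payer_a.pdf, shared.pdf (kept once), payer_b.pdf
```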
The Real Lesson
The takeaway for the team was NOT "how to automate scraping". It was how and where to use AI, and understanding what it cannot do for us!
It was a system design problem.
The value of AI is not in how much you use it —
it’s in where you use it.
And more importantly:
The fundamentals of building scalable, performant, and reliable systems haven’t changed.
Team Cennest!