In one of our projects in the Utilization Management space in US Healthcare, we were working on optimizing the clinical appeal writing process for clinical denials.
Though AI-driven, the backbone of the solution was a neatly scraped, efficiently stored set of clinical policies that the model could pick up and compare against a patient's records to analyse deviations.
So while the product and architecture teams evangelized the shinier parts (how the analysis/comparison would happen, crafting the right prompts, choosing the right AI model, and automating the flow of data), a lot of homework was being done on:
- consolidating insurance payers across the US
- understanding their insurance policy structures
- scraping the policies
- massaging them into a pattern best understood and used by our system
This was the non-sexy work! It was repetitive, messy, and had to be done continuously as policies changed.
Phase 1: Getting the Project Off the Ground
The initial implementation was manual scraping. The traditional web scrapers available at the time (2023–24) weren’t enough. Insurance websites are dynamic, protected by Cloudflare, and require intelligent navigation.
The Pain Areas
What seems like a simple task — downloading medical policy PDFs — became an endless cycle of:
- searching through complex website hierarchies
- clicking through pagination across dozens of pages
- copying URLs into spreadsheets
- downloading files one by one
- repeating this for every payer
For our team, this process was
- time-consuming and mind-numbing
- error-prone (missing policies, duplicates)
- unscalable
Simply automating each payer website was useless: each had its own structure, navigation, and document layout. We would have had to write as many branches of code as there were payers. This was unscalable!
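To see why, here is a minimal sketch of the per-payer branching we would have had to maintain (payer names, URLs, and site structures here are purely hypothetical):

```python
# Hypothetical illustration: one hand-written branch per payer.
# Every new payer, and every site redesign, means another branch of code.

def scrape_policies(payer: str) -> list[str]:
    """Return candidate policy URLs for a given payer (illustrative stubs)."""
    if payer == "payer_a":
        # Payer A paginates its policy list under /medical-policies?page=N
        return [f"https://payer-a.example.com/medical-policies?page={n}"
                for n in range(1, 3)]
    elif payer == "payer_b":
        # Payer B hides PDFs behind a category tree
        return ["https://payer-b.example.com/docs/cardiology/policy.pdf"]
    else:
        raise ValueError(f"No scraper written for {payer} yet!")
```

Ten payers means ten branches; a hundred payers means a hundred, each one breaking independently whenever its site changes.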
Phase 2: Enter the AI Scraper
We moved to an AI-powered scraping approach using Google’s ADK with Gemini.
But we didn’t make everything AI-driven.
By this time, we had learned something important:
Not every step needs intelligence. We designed a hybrid, token-optimized architecture where not every step was driven by AI.
Where AI Helped (Decision Making)
- understanding page structure
- identifying relevant documents
- navigating dynamic content
- validating whether something is actually a policy
Where AI Was Overkill (Execution)
- downloading files
- managing HTTP requests
- handling browser automation
- converting to PDF
That part just needed to be reliable.
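The split above can be sketched in a few lines. This is only an illustration, not our production code: `ai_is_policy` stands in for a real Gemini/ADK call (here it is a trivial keyword heuristic), while everything downstream is plain, deterministic code.

```python
# Sketch of the hybrid split: the "intelligent" step is isolated behind one
# function, so the expensive model call happens only where judgement is
# needed; the rest of the pipeline stays deterministic and cheap.

def ai_is_policy(link_text: str) -> bool:
    """Stand-in for a Gemini call that judges whether a link points to a
    clinical policy. Replaced here by a keyword check for illustration."""
    return "policy" in link_text.lower()

def plan_downloads(links: dict[str, str]) -> list[str]:
    """AI decides WHICH links matter; plain code decides HOW to fetch them."""
    return [url for text, url in links.items() if ai_is_policy(text)]

links = {
    "Medical Policy: Cardiac Imaging": "https://example.com/a.pdf",
    "Careers": "https://example.com/jobs",
    "Pharmacy Policy Update": "https://example.com/b.pdf",
}
to_download = plan_downloads(links)  # only the two policy links survive
```

The key design choice: every token spent on `ai_is_policy` buys a decision, never a download.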
The Next Bottleneck
Things improved significantly. We were scraping ~5x faster, but then we hit our next bottleneck. Many insurance payer websites are protected by Cloudflare, which is excellent at detecting and blocking automation tools like Selenium. Even with our AI agent working perfectly, we kept hitting:
- endless verification loops
- CAPTCHA challenges
- IP blocks and rate limiting
- “Access Denied” pages
No amount of prompting was going to get us through this. We limped along with a workaround until we entered Phase 3!
Phase 3: Stealth Mode
We switched to a more advanced browser automation setup using Scrapling (Playwright-based). Because of our interface-based architecture, the switch required minimal changes.
This enabled:
- better fingerprint handling
- more realistic browser behavior
- built-in handling of verification challenges
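The "minimal changes" claim rests on the interface-based design, which can be sketched as follows. The fetcher classes here are stubs (the real Scrapling and Selenium APIs are not shown); the point is that the rest of the pipeline depends only on one `fetch` interface, so swapping engines is a one-line change.

```python
from typing import Protocol

class Fetcher(Protocol):
    """The single interface the rest of the pipeline depends on."""
    def fetch(self, url: str) -> str: ...

class SeleniumFetcher:
    # Phase 2 engine (stubbed; the real one drove a Selenium browser).
    def fetch(self, url: str) -> str:
        return f"<html>selenium:{url}</html>"

class StealthFetcher:
    # Phase 3 engine (stubbed; the real one used Scrapling/Playwright with
    # fingerprint handling). Same interface, so nothing upstream changes.
    def fetch(self, url: str) -> str:
        return f"<html>stealth:{url}</html>"

def scrape(fetcher: Fetcher, url: str) -> str:
    return fetcher.fetch(url)

# Swapping engines is a one-line change at the call site:
page = scrape(StealthFetcher(), "https://example.com/policies")
```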
And finally we had a fully automated system that worked end to end, where AI was just one part of the full puzzle.
The Final Piece
At the center of everything was a simple orchestrator with superpowers, responsible for:
- coordinating search
- navigating intelligently
- deduplicating results
- aggregating data
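The orchestrator's job can be sketched in a single loop (names are illustrative; the search/navigation step is represented by a plain callable standing in for the AI agent):

```python
# Minimal orchestrator sketch: coordinate search across payers,
# deduplicate by URL, and aggregate the results into one list.

def orchestrate(payers, search_fn):
    """search_fn(payer) -> list of (title, url) candidate documents."""
    seen, results = set(), []
    for payer in payers:
        for title, url in search_fn(payer):
            if url in seen:          # skip duplicates across pages/payers
                continue
            seen.add(url)
            results.append({"payer": payer, "title": title, "url": url})
    return results

def fake_search(payer):
    # Each payer yields one unique document plus one shared duplicate.
    return [(f"{payer} Policy", f"https://x.com/{payer}.pdf"),
            ("Shared Policy", "https://x.com/shared.pdf")]

docs = orchestrate(["payer_a", "payer_b"], fake_search)
# -> 3 documents: payer_a.pdf, shared.pdf (kept once), payer_b.pdf
```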
The Real Lesson
The takeaway for the team was NOT "how to automate scraping". It was how and where to use AI, and understanding what it cannot do for us!
It was a system design problem.
The value of AI is not in how much you use it —
it’s in where you use it.
And more importantly:
The fundamentals of building scalable, performant, and reliable systems haven’t changed.
Team Cennest!