(and what every developer can learn about grounding, citations & really long links)
We recently built a tool to retrieve information about any company using AI. It wasn’t just about generating text—we needed structured insights, along with the exact URLs showing where the facts came from.
When we saw the problem statement, the solution seemed fairly simple: identify and select the right AI model, engineer the right prompt, and get the data! After playing with a LOT of AI models and their browsing tools, and weighing cost considerations, we settled on Gemini 2.5 Flash with Google Search grounding!
Sounds simple? It wasn’t.
Because Gemini had other plans.
The Plan: AI + Real Sources
We used Gemini AI, which supports built-in Google Search grounding—meaning it doesn’t just answer questions, it tries to show you where the answer came from.
Our idea was straightforward:
• Ask Gemini about a company
• Return a structured JSON with sections like overview, key events, etc.
• Include source URLs in each section
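As a rough illustration of the target shape, here is a minimal sketch. The section and field names (`overview`, `key_events`, `sources`) and the example.com URLs are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Section:
    """One section of the report plus the URLs that back it up."""
    text: str
    sources: List[str] = field(default_factory=list)


@dataclass
class CompanyReport:
    """Illustrative top-level shape; the section names are assumptions."""
    company: str
    overview: Section
    key_events: List[Section] = field(default_factory=list)


report = CompanyReport(
    company="Example Corp",
    overview=Section(
        text="Example Corp makes widgets.",
        sources=["https://example.com/about"],
    ),
    key_events=[
        Section(
            text="Opened a second office.",
            sources=["https://example.com/news/1"],
        ),
    ],
)

# asdict() recursively converts nested dataclasses into plain dicts.
report_json = json.dumps(asdict(report), indent=2)
print(report_json)
```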
We were thrilled to see citation URLs showing up in the response.
Until we clicked them.
The 404 Wall
Half the URLs Gemini gave us returned 404 errors.
Our first assumption?
“The model must be hallucinating citations.”
So we dove deeper.
Discovery: Grounding Metadata
After a bit of research, we found the truth: Gemini returns citation data in a special field called grounding_metadata. It is part of the response, alongside the generated text, and contains start and end character offsets for each span of text backed by a source.
This metadata is incredibly useful—it maps parts of the output to exact URLs.
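To make the mapping concrete, here is a minimal sketch that walks a grounding metadata payload and pairs each cited span with its source URLs. It assumes the payload has been converted to a plain dict using the REST-style field names (`groundingChunks`, `groundingSupports`, `segment`, `groundingChunkIndices`). An SDK may expose the same data under different attribute names, and the sample values are made up.

```python
from typing import Dict, List, Tuple


def spans_with_sources(metadata: Dict) -> List[Tuple[str, List[str]]]:
    """Pair each supported text segment with the URLs of its sources."""
    chunks = metadata.get("groundingChunks", [])
    pairs = []
    for support in metadata.get("groundingSupports", []):
        segment = support.get("segment", {})
        # Each index points into the list of chunks (the actual sources).
        urls = [
            chunks[i]["web"]["uri"]
            for i in support.get("groundingChunkIndices", [])
            if i < len(chunks) and "web" in chunks[i]
        ]
        pairs.append((segment.get("text", ""), urls))
    return pairs


# Made-up payload for illustration only.
sample_metadata = {
    "groundingChunks": [
        {"web": {"uri": "https://example.com/a", "title": "A"}},
        {"web": {"uri": "https://example.com/b", "title": "B"}},
    ],
    "groundingSupports": [
        {
            "segment": {
                "startIndex": 0,
                "endIndex": 27,
                "text": "Example Corp makes widgets.",
            },
            "groundingChunkIndices": [0, 1],
        },
    ],
}

for text, urls in spans_with_sources(sample_metadata):
    print(text, "->", urls)
```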
But there was a catch…
JSON Breaks Everything
Our application was designed to generate JSON directly from Gemini. But every time we asked for structured JSON, the grounding_metadata came back empty.
Why?
Because citations are tied to text positions. If you change the output format (like converting to JSON), those offsets don’t make sense anymore—so Gemini just drops them.
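A tiny illustration of why the offsets stop making sense: the same start and end positions select different characters once the text is wrapped in JSON. The sentence, the offsets, and the `overview` key are made up for the example.

```python
import json

plain = "Acme was founded in 1999. It sells anvils."
start, end = 0, 25  # offsets of the first sentence in the plain text

cited_plain = plain[start:end]

# The same text, now wrapped in a JSON object.
as_json = json.dumps({"overview": plain})
cited_json = as_json[start:end]

print(repr(cited_plain))  # the sentence the citation was meant to cover
print(repr(cited_json))   # a different slice: it starts with the JSON key
```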
The Two-Call Fix
To work around this, we used two calls:
1. First, ask Gemini for plain text (Markdown) with proper citations
2. Then, send that response back and ask Gemini to convert it to structured JSON
Success! The citations were preserved and attached correctly.
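The two steps can be sketched roughly like this. `generate` is a hypothetical stand-in for whatever function sends a prompt to Gemini and returns its text; the prompts are illustrative, and the stub exists only so the sketch runs without network access.

```python
import json
from typing import Callable


def two_call_report(company: str, generate: Callable[[str, bool], str]) -> dict:
    """Call 1: grounded Markdown. Call 2: convert that Markdown to JSON.

    `generate(prompt, grounded)` is a hypothetical wrapper around the model call.
    """
    markdown = generate(
        f"Write a short, cited company profile of {company} in Markdown.",
        True,   # grounding on: this is the call that produces citations
    )
    raw_json = generate(
        "Convert the following Markdown to JSON with keys "
        "'overview' and 'sources'. Keep every URL unchanged.\n\n" + markdown,
        False,  # no grounding needed: this is a pure reformatting step
    )
    return json.loads(raw_json)


def fake_generate(prompt: str, grounded: bool) -> str:
    """Offline stub that imitates the two responses."""
    if grounded:
        return "Example Corp makes widgets. [1](https://example.com/about)"
    return json.dumps({
        "overview": "Example Corp makes widgets.",
        "sources": ["https://example.com/about"],
    })


result = two_call_report("Example Corp", fake_generate)
print(result)
```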
Well… almost.
The Return of the 404s
Even after this two-step process, we hit the same 404 issue again.
So we looked closer.
What Was Really Happening
Gemini doesn’t copy strings—even inside code fences or JSON. It re-generates each character token-by-token. That’s fine for text. But for long, cryptic signed URLs, it’s a disaster.
✅ Example:
Original: ixZIyDXuAD79…
Returned: ixZIyD0uAD79…
The model flipped a character, or dropped it entirely, because signed URLs are made of random-looking tokens. When it regenerates them, the model may substitute or omit characters that are hard to reproduce exactly. Even a single character mismatch renders the link useless.
This isn’t hallucination.
This is Gemini being too creative with your URLs.
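One way to catch this kind of mutation is to compare every URL in the model's output against the set of URLs that went in. The sketch below is an assumption about how such a check might look. The regex is deliberately simple, and the sample strings are made up.

```python
import re
from typing import Iterable, List

# Simple URL matcher: stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r'https?://[^\s"<>)\]]+')


def find_mutated_urls(output_text: str, known_urls: Iterable[str]) -> List[str]:
    """Return URLs in the output that do not exactly match any known URL."""
    known = set(known_urls)
    return [url for url in URL_PATTERN.findall(output_text) if url not in known]


original = "https://example.com/redirect/aB3xYzQ9"
# The model's copy differs from the original by a single character.
model_output = '{"source": "https://example.com/redirect/aB3xY2Q9"}'

mutated = find_mutated_urls(model_output, [original])
print(mutated)
```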
The Placeholder Trick
To fix this, we came up with a simple (and effective) solution:
• Replace each long URL with a placeholder like <<URL_1>>
• Send that version to Gemini to convert to JSON
• After receiving the response, swap the placeholders back to the original URLs
This kept the links safe—no more mutations, no more broken redirects.
from typing import Dict, List, Tuple

# Both functions are methods on the service class. Citation is the project's
# own type; its .link attribute holds the URL as it appears in the text.
def mask_urls(self, text: str, citations: List["Citation"]) -> Tuple[str, Dict[str, str]]:
    """
    For every Citation.link that occurs in *text*,
    replace it with <<URL_n>> and keep a mapping.
    """
    url_map = {}
    masked_text = text
    for idx, cit in enumerate(citations, start=1):
        ph = f"<<URL_{idx}>>"
        cit.placeholder = ph
        # Store the resolved destination so unmasking restores a stable URL.
        resolved_url = self.resolve_redirect(cit.link) or cit.link
        url_map[ph] = resolved_url
        masked_text = masked_text.replace(cit.link, ph)
    return masked_text, url_map

def unmask_urls(self, text: str, url_map: Dict[str, str]) -> str:
    """Reverse mask_urls: swap placeholders back to real URLs."""
    for ph, url in url_map.items():
        text = text.replace(ph, url)
    return text
But Wait… There’s an Expiry Problem
Even with clean URLs, we noticed that Gemini was returning links like:
https://vertexaisearch.cloud.google.com/grounding-api-redirect/…
These are temporary redirects, and they expire after a few days.
So we added one final step: a small script that resolves the redirect, fetches the true destination, and caches it. Now we always serve clean, permanent URLs.
import requests

def resolve_redirect(self, redirect_url: str) -> str:
    """Follow the redirect chain and return the final destination URL."""
    try:
        response = requests.head(redirect_url, allow_redirects=True, timeout=5)
        return response.url
    except requests.RequestException:
        return redirect_url  # fall back to the original if resolving fails
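The caching half of that step is not shown above, so here is a minimal sketch of one way it might be done. The `RedirectCache` class is hypothetical, and the resolver is passed in as a plain function so the example runs offline. In practice you would pass in something like the resolve_redirect method above.

```python
from typing import Callable, Dict, List


class RedirectCache:
    """Remember the final destination of each redirect URL (in memory)."""

    def __init__(self, resolver: Callable[[str], str]) -> None:
        self._resolver = resolver
        self._cache: Dict[str, str] = {}

    def resolve(self, redirect_url: str) -> str:
        # Only call the resolver the first time a URL is seen.
        if redirect_url not in self._cache:
            self._cache[redirect_url] = self._resolver(redirect_url)
        return self._cache[redirect_url]


calls: List[str] = []


def fake_resolver(url: str) -> str:
    """Offline stand-in for a real HTTP redirect lookup."""
    calls.append(url)
    return "https://example.com/final-page"


cache = RedirectCache(fake_resolver)
first = cache.resolve("https://example.com/redirect/abc")
second = cache.resolve("https://example.com/redirect/abc")
print(first, len(calls))
```

An in-memory dict disappears on restart. Since the redirect links expire, a persistent store is probably the safer choice for anything long-lived.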
Final Architecture
1️⃣ Gemini LLM Call → Markdown with citations
↓ mask URLs
2️⃣ Gemini LLM Call → JSON structure
↓ unmask + resolve redirect
✅ Final Output → Structured JSON + stable source URLs
Wrap-up
This started as a small AI project. It turned into a deep dive on how LLMs handle grounding, structured output, and byte-sensitive URLs.
We learned a lot—and now our service reliably returns structured company info with fully traceable, working URLs.
If you’re building something similar, we hope this helps you avoid the same pitfalls—and if Gemini ever breaks your links, you’ll know exactly where to look.
PS: Making sense of the citations was just one of the hurdles we crossed!
Google Gemini had many more surprises in store for us. Watch this space as we write up our experiences with this seemingly simple AI project. #AINeedsDev!
Team Cennest!