
Making sense of Gemini 2.5 Flash (with Google Grounding) source URLs!

(and what every developer can learn about grounding, citations & really long links)

We recently built a tool to retrieve information about any company using AI. It wasn’t just about generating text—we needed structured insights, along with the exact URLs showing where the facts came from.

When we saw the problem statement, the solution seemed fairly simple: identify and select the right AI model, engineer the right prompt, and get the data! After playing with a LOT of AI models and their browsing tools, and weighing cost considerations, we settled on Gemini 2.5 Flash with Google grounding!

Sounds simple? It wasn’t.
Because Gemini had other plans.

The Plan: AI + Real Sources

We used Gemini AI, which supports built-in Google Search grounding—meaning it doesn’t just answer questions, it tries to show you where the answer came from.

Our idea was straightforward:

• Ask Gemini about a company

• Return a structured JSON with sections like overview, key events, etc.

• Include source URLs in each section

We were thrilled to see citation URLs showing up in the response.
Until we clicked them.

The 404 Wall

Half the URLs Gemini gave us returned 404 errors.
Our first assumption?
“The model must be hallucinating citations.”

So we dove deeper.

Discovery: Grounding Metadata

After a bit of research, we found the truth: Gemini returns citation data in a special field called grounding_metadata. It arrives alongside the generated text and contains start and end offsets for every span of text that a source supports.

This metadata is incredibly useful—it maps parts of the output to exact URLs.
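
To make the mapping concrete, here is a minimal sketch of walking the metadata to pair text spans with their source URLs. It assumes the metadata has been converted to a plain dict with `grounding_chunks` (each holding a `web.uri`) and `grounding_supports` (each holding a `segment` with `start_index` and `end_index`, plus a list of `grounding_chunk_indices`). The function name `spans_with_sources` and the sample data are made up for illustration. One caveat: as far as we can tell from the API reference, segment offsets are measured in bytes, so the sketch slices the UTF-8 bytes rather than the Python string.

```python
from typing import Dict, List, Tuple


def spans_with_sources(text: str, metadata: Dict) -> List[Tuple[str, List[str]]]:
    """Return (span_text, [source_urls]) for each grounding support."""
    raw = text.encode("utf-8")  # offsets are assumed to be byte offsets
    chunks = metadata.get("grounding_chunks", [])
    results = []
    for support in metadata.get("grounding_supports", []):
        seg = support["segment"]
        start = seg.get("start_index", 0)
        span = raw[start:seg["end_index"]].decode("utf-8", errors="replace")
        urls = [chunks[i]["web"]["uri"] for i in support.get("grounding_chunk_indices", [])]
        results.append((span, urls))
    return results


# Made-up sample data in the assumed shape
sample_text = "Acme was founded in 1999. It makes anvils."
sample_metadata = {
    "grounding_chunks": [
        {"web": {"uri": "https://example.com/a", "title": "A"}},
        {"web": {"uri": "https://example.com/b", "title": "B"}},
    ],
    "grounding_supports": [
        {"segment": {"start_index": 0, "end_index": 25}, "grounding_chunk_indices": [0]},
        {"segment": {"start_index": 26, "end_index": 42}, "grounding_chunk_indices": [0, 1]},
    ],
}

for span, urls in spans_with_sources(sample_text, sample_metadata):
    print(span, "->", urls)
```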

But there was a catch…

JSON Breaks Everything

Our application was designed to generate JSON directly from Gemini. But every time we asked for structured JSON, the grounding_metadata came back empty.

Why?
Because citations are tied to text positions. If you change the output format (like converting to JSON), those offsets don’t make sense anymore—so Gemini just drops them.

The Two-Call Fix

To work around this, we used two calls:

1. First, ask Gemini for plain text (Markdown) with proper citations
2. Then, send that response back and ask Gemini to convert it to structured JSON
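
The two calls can be sketched roughly as follows. The `generate` parameter is a hypothetical stand-in for whatever function wraps your Gemini client (it is not an SDK function), and it is assumed to return Markdown with the citation links already merged in from grounding_metadata. The prompts and JSON keys are illustrative, not the ones we used.

```python
from typing import Callable

# Hypothetical wrapper signature:
#   generate(prompt, grounded=..., json_mode=...) -> str
Generate = Callable[..., str]


def company_report(company: str, generate: Generate) -> str:
    # Call 1: grounded search with plain Markdown output,
    # so the citation offsets still line up with the text.
    markdown = generate(
        f"Write a sourced Markdown briefing about {company}.",
        grounded=True,
        json_mode=False,
    )
    # Call 2: no search tool; only restructure the text from call 1.
    return generate(
        "Convert this briefing to JSON with keys 'overview' and 'key_events'. "
        "Do not add facts.\n\n" + markdown,
        grounded=False,
        json_mode=True,
    )
```

Passing the client in as a callable keeps the sketch independent of any particular SDK version and makes the flow easy to test with a fake.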

Success! The citations were preserved and attached correctly.

Well… almost.

The Return of the 404s

Even after this two-step process, we hit the same 404 issue again.
So we looked closer.

What Was Really Happening

Gemini doesn’t copy strings—even inside code fences or JSON. It re-generates each character token-by-token. That’s fine for text. But for long, cryptic signed URLs, it’s a disaster.

✅ Example:

Original: ixZIyDXuAD79…

Returned: ixZIyD0uAD79…

The model flipped a character, or dropped it entirely, because signed URLs are made of random-looking tokens. When Gemini regenerates them, it may substitute or omit characters that are hard to reproduce exactly. Even a single-character mismatch renders the link useless.
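
A quick way to confirm this failure mode is to compare every URL in the converted output against the set of URLs you sent in. The sketch below flags URLs that do not match any original exactly and uses `difflib` from the standard library to guess which original each one was derived from. The function name, the regex, and the 0.8 similarity cutoff are illustrative choices, not something from our production code.

```python
import difflib
import re
from typing import Dict, Iterable

# Rough URL matcher: stops at whitespace, quotes, and closing brackets
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+")


def find_mutated_urls(output: str, originals: Iterable[str]) -> Dict[str, str]:
    """Map each URL in output that is not an exact original to its closest original.

    The value is an empty string when no original is similar enough.
    """
    known = list(originals)
    mutated = {}
    for url in URL_RE.findall(output):
        if url in known:
            continue
        close = difflib.get_close_matches(url, known, n=1, cutoff=0.8)
        mutated[url] = close[0] if close else ""
    return mutated
```

An empty result means every URL survived the round trip byte for byte.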

This isn’t hallucination.
This is Gemini being too creative with your URLs.

The Placeholder Trick

To fix this, we came up with a simple (and effective) solution:
• Replace each long URL with a placeholder like <<URL_1>>
• Send that version to Gemini to convert to JSON
• After receiving the response, swap the placeholders back to the original URLs

This kept the links safe—no more mutations, no more broken redirects.

    from typing import List

    # Methods excerpted from our service class. Citation is the project's
    # own type (not shown) with a .link attribute holding the source URL.

    def mask_urls(self, text: str, citations: List["Citation"]):
        """
        For every Citation.link that occurs in *text*,
        replace it with <<URL_n>> and keep a mapping.
        """
        url_map = {}
        masked_text = text
        for idx, cit in enumerate(citations, start=1):
            ph = f"<<URL_{idx}>>"
            cit.placeholder = ph
            # Store the resolved destination so unmasking yields the final URL
            resolved_url = self.resolve_redirect(cit.link) or cit.link
            url_map[ph] = resolved_url
            masked_text = masked_text.replace(cit.link, ph)
        return masked_text, url_map

    def unmask_urls(self, text: str, url_map: dict[str, str]) -> str:
        """Reverse mask_urls: swap placeholders back to real URLs"""
        for ph, url in url_map.items():
            text = text.replace(ph, url)
        return text
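
One extra guard worth considering: the model can also drop or invent a placeholder during the JSON conversion. A minimal check, assuming the `<<URL_n>>` format above (the helper name `check_placeholders` is ours, for illustration only), compares the placeholders found in the model output against the map before unmasking:

```python
import re
from typing import Dict, Set, Tuple

PLACEHOLDER_RE = re.compile(r"<<URL_\d+>>")


def check_placeholders(output: str, url_map: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unknown) placeholders in the model output.

    missing: placeholders we sent that did not come back
    unknown: placeholders in the output that we never sent
    """
    found = set(PLACEHOLDER_RE.findall(output))
    expected = set(url_map)
    return expected - found, found - expected
```

If either set is non-empty, it is probably safer to retry the conversion than to ship a response with a dangling placeholder.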

 

But Wait… There’s an Expiry Problem

Even with clean URLs, we noticed that Gemini was returning links like:
https://vertexaisearch.cloud.google.com/grounding-api-redirect/…

These are temporary redirects, and they expire after a few days.

So we added one final step: a small script that resolves the redirect, fetches the true destination, and caches it. Now we always serve clean, permanent URLs.

    import requests

    def resolve_redirect(self, redirect_url: str) -> str:
        """Follow the grounding redirect and return the final destination URL."""
        try:
            response = requests.head(redirect_url, allow_redirects=True, timeout=5)
            return response.url
        except requests.RequestException:
            return redirect_url  # fallback to original if resolving fails
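
The caching part is not shown above. One minimal way to do it is an in-memory dict keyed by the redirect URL. In the sketch below, `RedirectCache` and its `resolver` parameter are illustrative names; the resolver is injected so that a function like `resolve_redirect` can be plugged in, and a persistent store would likely replace the dict in a real service.

```python
from typing import Callable, Dict


class RedirectCache:
    """Remember the final destination of each redirect URL."""

    def __init__(self, resolver: Callable[[str], str]):
        self._resolver = resolver
        self._cache: Dict[str, str] = {}

    def get(self, redirect_url: str) -> str:
        if redirect_url not in self._cache:
            final_url = self._resolver(redirect_url)
            # A resolver that falls back to its input on failure returns the
            # same URL; skip caching so the next call can try again.
            if final_url == redirect_url:
                return redirect_url
            self._cache[redirect_url] = final_url
        return self._cache[redirect_url]
```

Because the redirects expire, resolving and caching them soon after the first Gemini call matters more than the exact storage used.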

Final Architecture

1️⃣ Gemini LLM Call → Markdown with citations
↓ mask URLs
2️⃣ Gemini LLM Call → JSON structure
↓ unmask + resolve redirect
✅ Final Output → Structured JSON + stable source URLs

Wrap-up

This started as a small AI project. It turned into a deep dive on how LLMs handle grounding, structured output, and byte-sensitive URLs.

We learned a lot—and now our service reliably returns structured company info with fully traceable, working URLs.

If you’re building something similar, we hope this helps you avoid the same pitfalls—and if Gemini ever breaks your links, you’ll know exactly where to look.

P.S. Making sense of the citations was just one of the hurdles we crossed!
Google Gemini had many more surprises in store for us. Watch this space as we write up our experiences with this seemingly simple AI project. #AINeedsDev

Team Cennest!

 
