
Our Easter Bug: A Debugging Journey

Every tech team has that one baffling issue that keeps them up at night. For us, it was a problem that haunted our system every Easter, year after year.

The Beginning of the Mystery

One of our customers, a highly technical person, managed his entire infrastructure in a U.S.-based data center. Our system, backed by a robust SQL Server database on a high-core-count machine, ran smoothly, except for one strange anomaly.

Every Easter (and sometimes Christmas), our CPU usage would suddenly skyrocket. Not a gradual increase, but an alarming jump from 1% to 60%, then 90%, and eventually 100%, bringing the entire system down. Given that our software is widely used by churches, our customer assumed this was simply due to a surge in user traffic during these holidays.

The Failed Fixes

To handle the expected traffic, he upgraded the server to a massive 40-core system, far more than was needed for normal operations. Yet the following Easter, the same issue occurred.

Determined to find the culprit, we analyzed SQL queries using diagnostic tools, optimized inefficient queries, and fine-tuned our database. But despite all efforts, Easter after Easter, the CPU spike returned—leading to frustrated customers and lost business.

We used the following queries to retrieve and analyze CPU usage by statement, both for a single database and across the entire engine:

-- The @...Resultset targets are table variables declared earlier in the script
-- (not shown here). Each INSERT captures the top 20 statements by a different
-- measure: per-database vs. engine-wide, and average vs. total worker (CPU) time.

INSERT INTO @DBAverageTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
            THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text),
                           CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 0, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS 'TotalWorkerTime',
    qs.total_worker_time / qs.execution_count AS 'AvgWorkerTime',
    qs.execution_count AS 'Execution Count',
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS 'AvgElapsedTime',
    qs.max_logical_reads, qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS 'Age in Cache'
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qt.dbid = DB_ID() -- Filter by current database
ORDER BY qs.total_worker_time / qs.execution_count DESC;

INSERT INTO @DBTotalTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
            THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text),
                           CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 0, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS 'TotalWorkerTime',
    qs.total_worker_time / qs.execution_count AS 'AvgWorkerTime',
    qs.execution_count AS 'Execution Count',
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS 'AvgElapsedTime',
    qs.max_logical_reads, qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS 'Age in Cache'
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qs.total_worker_time / qs.execution_count > 100000
  AND qt.dbid = DB_ID() -- Filter by current database
ORDER BY qs.total_worker_time DESC;

INSERT INTO @EngineAverageTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
            THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text),
                           CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 0, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS 'TotalWorkerTime',
    qs.total_worker_time / qs.execution_count AS 'AvgWorkerTime',
    qs.execution_count AS 'Execution Count',
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS 'AvgElapsedTime',
    qs.max_logical_reads, qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS 'Age in Cache'
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qt.dbid IS NOT NULL
  -- AND qt.dbid = DB_ID() -- Filter by current database
ORDER BY qs.total_worker_time / qs.execution_count DESC;

INSERT INTO @EngineTotalTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
            THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text),
                           CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 0, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS 'TotalWorkerTime',
    qs.total_worker_time / qs.execution_count AS 'AvgWorkerTime',
    qs.execution_count AS 'Execution Count',
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS 'AvgElapsedTime',
    qs.max_logical_reads, qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS 'Age in Cache'
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qs.total_worker_time / qs.execution_count > 100000
  -- AND qt.dbid = DB_ID() -- Filter by current database
ORDER BY qs.total_worker_time DESC;



A Shift to Azure—Yet the Problem Persists

Eventually, the customer sold his company, and we took over the infrastructure, migrating everything from the data center to Azure. We started with an 8-core setup, and surprisingly, we faced no issues. However, this was during the COVID-19 pandemic when usage was lower than usual.

The next year, expecting a traffic surge, we increased the CPU to 16 cores. Once again, Easter arrived, and so did the same unexplained CPU spike.

The Clue That Changed Everything

While analyzing usage patterns, we noticed something peculiar:
1. Christmas had significantly lower usage, yet the CPU spike still occurred.
2. On Easter, usage was high, but it wasn’t drastically different from a regular Sunday.
3. Every time, the first complaint came from the same customer before other reports followed.

This led us to a bold experiment. To buy ourselves time for debugging, we temporarily scaled up the CPU to a staggering 96 cores—far beyond normal requirements.

The Eureka Moment

As expected, the spike hit again. But this time, with the extra computing power keeping the system responsive, we could investigate in real time. Running SQL diagnostic queries, we found the problem:

SELECT
    r.session_id,
    r.status,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.last_wait_type,
    r.cpu_time,
    r.total_elapsed_time,
    r.reads,
    r.writes,
    r.logical_reads,
    t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
ORDER BY r.total_elapsed_time DESC;


🔍 A single function was stuck in an infinite loop, consuming massive CPU resources!

It turned out this function was responsible for generating unique security codes for check-ins. Normally, these codes were either numeric or alphanumeric. However, this particular customer had chosen a 3-digit numeric code, which allows only 10³ = 1,000 unique combinations.

On high-traffic days like Easter, when that single customer exceeded 1,000 check-ins, the function entered a never-ending loop, continuously searching for an unused code it could never find. The result? A system-wide meltdown.
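
In simplified form, the buggy pattern looked roughly like this (a minimal sketch; dbo.CheckIn, SecurityCode, and CheckInDate are illustrative names, not the actual schema):

-- Simplified sketch of the buggy pattern (illustrative names, not the real schema).
-- With a 3-digit numeric format there are only 1,000 possible codes; once all of
-- them are in use for the day, no iteration can ever succeed.
DECLARE @SecurityCode CHAR(3);
DECLARE @Found BIT = 0;

WHILE @Found = 0 -- no retry limit: spins forever once the code space is exhausted
BEGIN
    -- Pick a random number in 0-999 and left-pad it to three digits
    SET @SecurityCode = RIGHT('000' + CAST(ABS(CHECKSUM(NEWID())) % 1000 AS VARCHAR(3)), 3);

    IF NOT EXISTS (SELECT 1
                   FROM dbo.CheckIn
                   WHERE SecurityCode = @SecurityCode
                     AND CheckInDate = CAST(GETDATE() AS DATE))
        SET @Found = 1;
END;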

The Fix and the Lesson

To prevent further crashes, we immediately:
– Killed the problematic sessions again and again on one system, using KILL <session_id>, to keep the service alive (a scripted version of this is sketched after the list).
– Debugged the root issue on another system simultaneously.
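
Killing sessions by hand gets tedious when they keep reappearing, so this can be scripted. A rough sketch, assuming the offending routine can be recognized by name in the request text (GenerateSecurityCode is an illustrative name, not the actual object):

-- Hedged sketch: kill every session currently running the offending routine.
-- 'GenerateSecurityCode' is an illustrative name, not the real one.
DECLARE @kill NVARCHAR(MAX) = N'';

SELECT @kill += N'KILL ' + CAST(r.session_id AS NVARCHAR(10)) + N'; '
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE t.text LIKE N'%GenerateSecurityCode%'
  AND r.session_id <> @@SPID; -- never kill our own session

IF @kill <> N''
    EXEC sys.sp_executesql @kill;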

As a quick fix, we changed the security code format to alphanumeric, dramatically increasing the number of available codes (with three case-insensitive alphanumeric characters, that is 36³ = 46,656 combinations instead of 1,000). Instantly, the CPU dropped back to normal. Easter was finally saved!

Later, we implemented a proper fix to prevent infinite loops and ensure better error handling.
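
A minimal sketch of what the hardened logic looks like, assuming the same illustrative schema as above: a larger alphanumeric code space plus a bounded retry loop that raises an error instead of spinning forever:

-- Sketch of the hardened version (illustrative schema, not the actual code).
DECLARE @Chars CHAR(36) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
DECLARE @SecurityCode CHAR(3);
DECLARE @Attempts INT = 0;
DECLARE @Found BIT = 0;

WHILE @Found = 0 AND @Attempts < 100 -- bounded: fail loudly instead of looping forever
BEGIN
    -- Three random picks from a 36-character alphabet: 36^3 = 46,656 combinations
    SET @SecurityCode =
          SUBSTRING(@Chars, ABS(CHECKSUM(NEWID())) % 36 + 1, 1)
        + SUBSTRING(@Chars, ABS(CHECKSUM(NEWID())) % 36 + 1, 1)
        + SUBSTRING(@Chars, ABS(CHECKSUM(NEWID())) % 36 + 1, 1);

    IF NOT EXISTS (SELECT 1
                   FROM dbo.CheckIn
                   WHERE SecurityCode = @SecurityCode
                     AND CheckInDate = CAST(GETDATE() AS DATE))
        SET @Found = 1;

    SET @Attempts += 1;
END;

IF @Found = 0
    THROW 50001, 'Unable to allocate a unique security code; code space exhausted.', 1;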

Key Takeaways

1. Stay calm in a crisis – Panicking doesn’t help, but data analysis does.
2. Be aware of legacy code – This issue was hidden deep within an old SQL function no one had reviewed.
3. Scalability is not always the answer – Simply throwing more CPU at a problem won’t fix bad code.

What seemed like a capacity issue turned out to be a simple coding flaw that went unnoticed for years. In the end, the solution wasn’t more power—it was smarter debugging.

Do you have a similar story of a mysterious bug? Share it in the comments!
