Every tech team has that one baffling issue that keeps them up at night. For us, it was a problem that haunted our system every Easter, year after year.
The Beginning of the Mystery
One of our customers, a highly technical person, managed his entire infrastructure in a U.S.-based data center. Our system, backed by a robust SQL Server database with a high-core CPU, ran smoothly—except for one strange anomaly.
Every Easter (and sometimes Christmas), our CPU usage would suddenly skyrocket. Not a gradual increase, but an alarming jump from 1% to 60%, then 90%, and eventually 100%, bringing the entire system down. Given that our software is widely used by churches, our customer assumed this was simply due to a surge in user traffic during these holidays.
The Failed Fixes
To handle the expected traffic, he upgraded the CPU to a massive 40-core system—far more than what was needed for normal operations. Yet, the following Easter, the same issue occurred.
Determined to find the culprit, we analyzed SQL queries using diagnostic tools, optimized inefficient queries, and fine-tuned our database. But despite all efforts, Easter after Easter, the CPU spike returned—leading to frustrated customers and lost business.
We used the following queries to capture the heaviest CPU consumers, ranked by both average and total worker time, first per database and then engine-wide:
INSERT INTO @DBAverageTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
                THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text), CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 1, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS [TotalWorkerTime],
    qs.total_worker_time / qs.execution_count AS [AvgWorkerTime],
    qs.execution_count AS [Execution Count],
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS [AvgElapsedTime],
    qs.max_logical_reads,
    qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS [Age in Cache]
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qt.dbid = DB_ID() -- filter by current database
ORDER BY qs.total_worker_time / qs.execution_count DESC;
INSERT INTO @DBTotalTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
                THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text), CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 1, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS [TotalWorkerTime],
    qs.total_worker_time / qs.execution_count AS [AvgWorkerTime],
    qs.execution_count AS [Execution Count],
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS [AvgElapsedTime],
    qs.max_logical_reads,
    qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS [Age in Cache]
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qs.total_worker_time / qs.execution_count > 100000
  AND qt.dbid = DB_ID() -- filter by current database
ORDER BY qs.total_worker_time DESC;
INSERT INTO @EngineAverageTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
                THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text), CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 1, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS [TotalWorkerTime],
    qs.total_worker_time / qs.execution_count AS [AvgWorkerTime],
    qs.execution_count AS [Execution Count],
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS [AvgElapsedTime],
    qs.max_logical_reads,
    qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS [Age in Cache]
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qt.dbid IS NOT NULL
  -- AND qt.dbid = DB_ID() -- filter by current database
ORDER BY qs.total_worker_time / qs.execution_count DESC;
INSERT INTO @EngineTotalTimeResultset
SELECT TOP 20
    SUBSTRING(
        CASE
            WHEN CHARINDEX('[', qt.text) > 0
                 AND (CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text)) > 0
                THEN SUBSTRING(qt.text, CHARINDEX('[', qt.text), CHARINDEX(']', qt.text, CHARINDEX('[', qt.text) + 10) - CHARINDEX('[', qt.text) + 1)
            ELSE qt.text
        END, 1, 255) AS [SP Name],
    DB_NAME(qt.dbid) AS [Database],
    qs.total_worker_time AS [TotalWorkerTime],
    qs.total_worker_time / qs.execution_count AS [AvgWorkerTime],
    qs.execution_count AS [Execution Count],
    ISNULL(qs.total_elapsed_time / qs.execution_count, 0) AS [AvgElapsedTime],
    qs.max_logical_reads,
    qs.max_logical_writes,
    DATEDIFF(MINUTE, qs.creation_time, GETDATE()) AS [Age in Cache]
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qs.execution_count > 0
  AND qs.total_worker_time / qs.execution_count > 100000
  -- AND qt.dbid = DB_ID() -- filter by current database
ORDER BY qs.total_worker_time DESC;
A Shift to Azure—Yet the Problem Persists
Eventually, the customer sold his company, and we took over the infrastructure, migrating everything from the data center to Azure. We started with an 8-core setup, and surprisingly, we faced no issues. However, this was during the COVID-19 pandemic when usage was lower than usual.
The next year, expecting a traffic surge, we increased the CPU to 16 cores. Once again, Easter arrived, and so did the same unexplained CPU spike.
The Clue That Changed Everything
While analyzing usage patterns, we noticed something peculiar:
1. Christmas had significantly lower usage, yet the CPU spike still occurred.
2. On Easter, usage was high, but it wasn’t drastically different from a regular Sunday.
3. Every time, the first complaint came from the same customer before other reports followed.
This led us to a bold experiment. To buy ourselves time for debugging, we temporarily scaled up the CPU to a staggering 96 cores—far beyond normal requirements.
The Eureka Moment
As expected, the spike hit again. But this time, with the extra computing power keeping the system responsive, we had the opportunity to investigate in real time. Running SQL diagnostic queries, we found the problem:
SELECT
    r.session_id,
    r.status,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.last_wait_type,
    r.cpu_time,
    r.total_elapsed_time,
    r.reads,
    r.writes,
    r.logical_reads,
    t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
ORDER BY r.total_elapsed_time DESC;
🔍 A single function was stuck in an infinite loop, consuming massive CPU resources!
It turned out this function was responsible for generating unique security codes for check-ins. Normally these codes were numeric or alphanumeric, but this particular customer had chosen a 3-digit numeric format, leaving only 1,000 possible codes (000–999).
On high-traffic days like Easter, when they exceeded 1,000 check-ins for that single customer, the function would enter a never-ending loop, continuously searching for an unused code but failing to find one. The result? A system-wide meltdown.
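The failure mode is easy to reproduce outside SQL. The sketch below (Python, with hypothetical names; the real function was a T-SQL routine) mimics the original logic: draw a random 3-digit code and retry until it is unused. The `max_attempts` cap is added here only so the demonstration can terminate; the original code had no cap, so check-in 1,001 spun forever:

```python
import random

def generate_code_naive(used, max_attempts=None):
    """Mimics the original logic: retry a random 3-digit code until unused.

    Without a cap this spins forever once all 1,000 codes are taken;
    max_attempts exists here only so the demo can stop.
    """
    attempts = 0
    while True:
        code = f"{random.randrange(1000):03d}"   # "000".."999" -> only 1,000 values
        if code not in used:
            used.add(code)
            return code
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError("code space exhausted")

# The first 1,000 check-ins succeed...
used = set()
for _ in range(1000):
    generate_code_naive(used)

# ...but the next check-in can never find a free code. Without the cap,
# this call would spin at 100% CPU indefinitely.
try:
    generate_code_naive(used, max_attempts=10_000)
except RuntimeError as e:
    print(e)  # code space exhausted
```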
The Fix and the Lesson
To prevent further crashes, we immediately:
– Repeatedly killed the runaway session on one system with KILL <session_id>; to keep the service alive.
– Debugged the root cause on a second system in parallel.
As a quick fix, we changed the security code format to alphanumeric, dramatically increasing the number of available codes. Instantly, the CPU dropped back to normal. Easter was finally saved!
Later, we implemented a proper fix to prevent infinite loops and ensure better error handling.
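That proper fix combined both lessons, sketched here in Python under the same hypothetical names: widen the code space (3 case-insensitive alphanumeric characters give 36³ = 46,656 combinations versus 1,000 numeric ones) and bound the retry loop so exhaustion surfaces as an error instead of a CPU-burning spin:

```python
import random
import string

ALPHABET = string.digits + string.ascii_uppercase  # 36 symbols -> 36**3 = 46,656 codes

def generate_code_safe(used, length=3, max_attempts=1000):
    """Return an unused code, or fail loudly instead of looping forever."""
    for _ in range(max_attempts):
        code = "".join(random.choice(ALPHABET) for _ in range(length))
        if code not in used:
            used.add(code)
            return code
    raise RuntimeError(
        f"no free {length}-char code after {max_attempts} attempts; "
        "code space is (nearly) exhausted"
    )

# Well past the old 1,000-code ceiling, with no spin:
used = set()
codes = [generate_code_safe(used) for _ in range(2000)]
print(len(set(codes)))  # 2000
```

The key design point is the bounded loop: even if the wider code space ever fills up, the function now fails fast with a clear error rather than taking the whole server down with it.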
Key Takeaways
1. Stay calm in a crisis – Panicking doesn’t help, but data analysis does.
2. Be aware of legacy code – This issue was hidden deep within an old SQL function no one had reviewed.
3. Scalability is not always the answer – Simply throwing more CPU at a problem won’t fix bad code.
What seemed like a capacity issue turned out to be a simple coding flaw that went unnoticed for years. In the end, the solution wasn’t more power—it was smarter debugging.
Do you have a similar story of a mysterious bug? Share it in the comments!