Real-Time Transcription Fallback and Recovery — Postmortem and Debugging Notes
We operate a real-time transcription service with the following architecture:
A custom WebSocket speech server handles traffic, Azure STT acts as fallback, and Flutter clients connect through Swift/Kotlin plugins for WebSocket and VAD.
Problem Overview
In production, we encountered a critical issue where Azure STT would stop receiving audio and fail to recover properly, even though fallback mechanisms were triggered. This led to lost audio segments, especially when attempting to switch back to our custom STT server after receiving final results from Azure.
System Behavior (Expected vs Actual)
Expected Flow
The client connects to our STT server over WebSocket. If that fails, it falls back to Azure, then reconnects to our server after Azure returns a final result. If reconnection fails again, it uses Azure once more.
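The intended switching logic amounts to a small state machine. A minimal sketch in Python (class and method names here are illustrative, not the actual client code):

```python
from enum import Enum, auto

class SttBackend(Enum):
    PRIMARY = auto()   # our custom WebSocket STT server
    AZURE = auto()     # Azure STT fallback

class FallbackController:
    """Sketch of the intended fallback flow; names are hypothetical."""
    def __init__(self):
        self.backend = SttBackend.PRIMARY

    def on_primary_connect_failed(self):
        # Primary server unreachable: switch to Azure.
        self.backend = SttBackend.AZURE

    def on_azure_final_result(self, primary_reachable: bool):
        # After Azure returns a final result, try to return to the
        # primary server; if that reconnection also fails, stay on Azure.
        self.backend = (SttBackend.PRIMARY if primary_reachable
                        else SttBackend.AZURE)

# Walk through the expected flow:
ctl = FallbackController()
ctl.on_primary_connect_failed()                     # initial connect fails
assert ctl.backend is SttBackend.AZURE
ctl.on_azure_final_result(primary_reachable=True)   # reconnect succeeds
assert ctl.backend is SttBackend.PRIMARY
```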
Actual Behavior
During reconnection attempts, audio input went silent, Azure stopped processing, and recovery took a long time or never happened on weak networks.
Root Causes (Multiple Factors Involved)
Diagnosing this issue was especially difficult due to several intertwined root causes:
1. TCP Hang During WebSocket Server Recovery
WebSocket Server Lifecycle Observations
+---------------------+--------------------------------+------------------------------+
| Phase 1             | Phase 2                        | Phase 3                      |
| Restarting Process  | Starting Server                | Server Fully Started         |
+---------------------+--------------------------------+------------------------------+
| - WebSocket server  | - FastAPI starts               | - All internal services      |
|   is stopped        | - WebSocket accepts            |   are ready                  |
| - WebSocket fails   |   connections but hangs        | - Connections stay open      |
|   immediately       |   (~60s timeout)               |   and functional             |
+---------------------+--------------------------------+------------------------------+
In Phase 2, the server looked reachable—handshakes succeeded—but the internal STT pipeline was not ready. Connections hung without audio, stalling the client before fallback.
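One way to make Phase 2 fail fast is a readiness gate: refuse the WebSocket handshake until the pipeline has actually loaded, so clients fall back immediately instead of hanging. The sketch below shows the idea with a stand-in WebSocket object; the real endpoint would be a FastAPI handler, and the flag name is hypothetical:

```python
import asyncio

class FakeWebSocket:
    """Minimal stand-in for a server-side WebSocket (illustrative only)."""
    def __init__(self):
        self.accepted = False
        self.close_code = None
    async def accept(self):
        self.accepted = True
    async def close(self, code):
        self.close_code = code

pipeline_ready = False  # flipped only once the STT pipeline is fully loaded

async def stt_endpoint(ws):
    # Refuse the handshake during Phase 2 instead of accepting and
    # hanging for ~60s; the client can then fall back to Azure at once.
    if not pipeline_ready:
        await ws.close(code=1013)  # 1013 = "try again later"
        return
    await ws.accept()
    # ... stream audio into the STT pipeline ...

ws = FakeWebSocket()
asyncio.run(stt_endpoint(ws))
assert ws.close_code == 1013 and not ws.accepted
```

Closing with 1013 ("try again later") gives clients an unambiguous signal, as opposed to an accepted-but-silent connection.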
2. Azure STT Thread Lifecycle Issues
After Azure returned final results, the client reconnected to our server. In edge cases, multiple Azure recognizer threads stayed alive, causing contention and occasional deadlocks, which dropped or delayed audio.
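A simple guard can enforce the at-most-one-recognizer invariant: starting a new session first tears down any existing one. A sketch with a hypothetical guard class (`stop()` stands in for the real Azure SDK teardown call):

```python
import threading

class RecognizerGuard:
    """Ensures at most one recognizer session exists at a time (sketch)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._active = None

    def start(self, recognizer):
        with self._lock:
            if self._active is not None:
                self._active.stop()   # tear down the old session first
            self._active = recognizer

    def finish(self):
        # Called right after a final result, per the cleanup fix above.
        with self._lock:
            if self._active is not None:
                self._active.stop()
                self._active = None

class DummyRecognizer:
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

guard = RecognizerGuard()
first, second = DummyRecognizer(), DummyRecognizer()
guard.start(first)
guard.start(second)        # starting a new session stops the old one
assert first.stopped and not second.stopped
```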
3. Network Instability and Fallback Timing
Slow or unstable networks made timeouts likely during recovery, triggered aggressive fallbacks, and amplified the thread-lifecycle issues above.
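One common way to keep aggressive fallbacks in check on unstable networks is jittered exponential backoff between reconnection attempts. A sketch with illustrative parameter values (not our production settings):

```python
import random

def backoff_delays(base=0.5, cap=8.0, attempts=5, seed=0):
    """Jittered exponential backoff schedule for reconnection attempts.

    Each attempt waits a random amount up to min(cap, base * 2**n),
    which spreads retries out and avoids hammering a recovering server.
    """
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

delays = backoff_delays()
assert len(delays) == 5 and all(0 <= d <= 8.0 for d in delays)
```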
Actions Taken for Debugging and Mitigation
To understand and fix the issue, we took the following steps:
On the Server Side
Added detailed logging during WebSocket startup to separate process restarts from FastAPI boot completion.
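Logging one explicit line per lifecycle phase makes a stuck Phase 2 visible at a glance. A sketch of the pattern (logger name, phase labels, and message format are illustrative):

```python
import logging

records = []

class Collector(logging.Handler):
    """Captures log messages so the sketch is self-checking."""
    def emit(self, record):
        records.append(record.getMessage())

log = logging.getLogger("stt.lifecycle")
log.setLevel(logging.INFO)
log.addHandler(Collector())

def mark_phase(phase):
    # One log line per phase lets us tell a process restart apart
    # from FastAPI boot completion when reading the logs later.
    log.info("lifecycle phase=%s", phase)

for phase in ("restarting_process", "fastapi_boot", "pipeline_ready"):
    mark_phase(phase)

assert records == [
    "lifecycle phase=restarting_process",
    "lifecycle phase=fastapi_boot",
    "lifecycle phase=pipeline_ready",
]
```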
On the Mobile Client
Refactored STT thread lifecycles so only one Azure recognizer runs, reconnection attempts do not overlap, and cleanup occurs right after final results. Added guards in Swift/Kotlin to avoid sending audio during server Phase 2 and improved logging for connection state transitions. Implemented manual read timeouts to detect silent hangs the library missed.
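The manual read timeout treats a silent-but-open connection as a failure even when the WebSocket library reports it healthy. The actual guards live in Swift/Kotlin; a Python sketch of the same idea, where `recv` is any awaitable that yields the next message:

```python
import asyncio

async def read_with_timeout(recv, timeout=5.0):
    """Manual read timeout for silent hangs the library misses (sketch).

    Returns the next message, or None if nothing arrives in time,
    at which point the caller triggers fallback/reconnect.
    """
    try:
        return await asyncio.wait_for(recv(), timeout)
    except asyncio.TimeoutError:
        return None

async def silent():
    # Simulates a Phase 2 hang: the socket is open but no data arrives.
    await asyncio.sleep(60)

result = asyncio.run(read_with_timeout(silent, timeout=0.05))
assert result is None
```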
On the Debugging Environment
Created staging builds that log to native system logs and expose switches to simulate slow networks and delayed server readiness. Used them to reproduce and verify the behavior.
Outcome
After these fixes, restarts recover reliably, fallback and reconnection stabilized, audio loss shrank even during long reconnections, and staging logs now reveal server and client lifecycles clearly.