Real-Time Transcription Fallback and Recovery — Postmortem and Debugging Notes
We operate a real-time transcription service with the following architecture:
A custom WebSocket speech server handles traffic, Azure STT acts as fallback, and Flutter clients connect through Swift/Kotlin plugins for WebSocket and VAD.
Problem Overview
In production, we encountered a critical issue where Azure STT would stop receiving audio and fail to recover properly, even though fallback mechanisms were triggered. This led to lost audio segments, especially when attempting to switch back to our custom STT server after receiving final results from Azure.
System Behavior (Expected vs Actual)
Expected Flow
The client connects to our STT server over WebSocket. If that fails, it falls back to Azure, then reconnects to our server after Azure returns a final result. If reconnection fails again, it uses Azure once more.
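The intended switching logic amounts to a small state machine. A minimal sketch in Python (class and method names here are illustrative, not the actual client code):

```python
from enum import Enum, auto

class SttBackend(Enum):
    PRIMARY = auto()   # our custom WebSocket STT server
    AZURE = auto()     # Azure STT fallback

class FallbackController:
    """Sketch of the intended fallback flow; names are hypothetical."""
    def __init__(self):
        self.backend = SttBackend.PRIMARY

    def on_primary_connect_failed(self):
        # Primary server unreachable: switch to Azure.
        self.backend = SttBackend.AZURE

    def on_azure_final_result(self, primary_reachable: bool):
        # After Azure returns a final result, try to return to the
        # primary server; if that reconnection also fails, stay on Azure.
        self.backend = (SttBackend.PRIMARY if primary_reachable
                        else SttBackend.AZURE)

# Walk through the expected flow:
ctl = FallbackController()
ctl.on_primary_connect_failed()                     # initial connect fails
assert ctl.backend is SttBackend.AZURE
ctl.on_azure_final_result(primary_reachable=True)   # reconnect succeeds
assert ctl.backend is SttBackend.PRIMARY
```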
Actual Behavior
During reconnection attempts, audio input went silent, Azure stopped processing, and recovery took a long time or never happened on weak networks.
Root Causes (Multiple Factors Involved)
Diagnosing this issue was especially difficult due to several intertwined root causes:
1. TCP Hang During WebSocket Server Recovery
WebSocket Server Lifecycle Observations
+---------------------+--------------------------------+------------------------------+
| Phase 1             | Phase 2                        | Phase 3                      |
| Restarting Process  | Starting Server                | Server Fully Started         |
+---------------------+--------------------------------+------------------------------+
| - WebSocket server  | - FastAPI starts               | - All internal services      |
|   is stopped        | - WebSocket accepts            |   are ready                  |
| - WebSocket fails   |   connections but hangs        | - Connections stay open      |
|   immediately       |   (~60s timeout)               |   and functional             |
+---------------------+--------------------------------+------------------------------+
In Phase 2, the server looked reachable—handshakes succeeded—but the internal STT pipeline was not ready. Connections hung without audio, stalling the client before fallback.
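One way to make Phase 2 fail fast is a readiness gate: refuse the WebSocket handshake until the pipeline has actually loaded, so clients fall back immediately instead of hanging. The sketch below shows the idea with a stand-in WebSocket object; the real endpoint would be a FastAPI handler, and the flag name is hypothetical:

```python
import asyncio

class FakeWebSocket:
    """Minimal stand-in for a server-side WebSocket (illustrative only)."""
    def __init__(self):
        self.accepted = False
        self.close_code = None
    async def accept(self):
        self.accepted = True
    async def close(self, code):
        self.close_code = code

pipeline_ready = False  # flipped only once the STT pipeline is fully loaded

async def stt_endpoint(ws):
    # Refuse the handshake during Phase 2 instead of accepting and
    # hanging for ~60s; the client can then fall back to Azure at once.
    if not pipeline_ready:
        await ws.close(code=1013)  # 1013 = "try again later"
        return
    await ws.accept()
    # ... stream audio into the STT pipeline ...

ws = FakeWebSocket()
asyncio.run(stt_endpoint(ws))
assert ws.close_code == 1013 and not ws.accepted
```

Closing with 1013 ("try again later") gives clients an unambiguous signal, as opposed to an accepted-but-silent connection.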
2. Azure STT Thread Lifecycle Issues
After Azure returned final results, the client reconnected to our server. In edge cases, multiple Azure recognizer threads stayed alive, causing contention and occasional deadlocks, which dropped or delayed audio.
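A simple guard can enforce the at-most-one-recognizer invariant: starting a new session first tears down any existing one. A sketch with a hypothetical guard class (`stop()` stands in for the real Azure SDK teardown call):

```python
import threading

class RecognizerGuard:
    """Ensures at most one recognizer session exists at a time (sketch)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._active = None

    def start(self, recognizer):
        with self._lock:
            if self._active is not None:
                self._active.stop()   # tear down the old session first
            self._active = recognizer

    def finish(self):
        # Called right after a final result, per the cleanup fix above.
        with self._lock:
            if self._active is not None:
                self._active.stop()
                self._active = None

class DummyRecognizer:
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

guard = RecognizerGuard()
first, second = DummyRecognizer(), DummyRecognizer()
guard.start(first)
guard.start(second)        # starting a new session stops the old one
assert first.stopped and not second.stopped
```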
3. Network Instability and Fallback Timing
Slow or unstable networks made timeouts likely during recovery, triggered aggressive fallbacks, and amplified the thread-lifecycle issues above.
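One common way to keep aggressive fallbacks in check on unstable networks is jittered exponential backoff between reconnection attempts. A sketch with illustrative parameter values (not our production settings):

```python
import random

def backoff_delays(base=0.5, cap=8.0, attempts=5, seed=0):
    """Jittered exponential backoff schedule for reconnection attempts.

    Each attempt waits a random amount up to min(cap, base * 2**n),
    which spreads retries out and avoids hammering a recovering server.
    """
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

delays = backoff_delays()
assert len(delays) == 5 and all(0 <= d <= 8.0 for d in delays)
```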
Actions Taken for Debugging and Mitigation
To understand and fix the issue, we took the following steps:
On the Server Side
Added detailed logging during WebSocket startup to separate process restarts from FastAPI boot completion.
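Logging one explicit line per lifecycle phase makes a stuck Phase 2 visible at a glance. A sketch of the pattern (logger name, phase labels, and message format are illustrative):

```python
import logging

records = []

class Collector(logging.Handler):
    """Captures log messages so the sketch is self-checking."""
    def emit(self, record):
        records.append(record.getMessage())

log = logging.getLogger("stt.lifecycle")
log.setLevel(logging.INFO)
log.addHandler(Collector())

def mark_phase(phase):
    # One log line per phase lets us tell a process restart apart
    # from FastAPI boot completion when reading the logs later.
    log.info("lifecycle phase=%s", phase)

for phase in ("restarting_process", "fastapi_boot", "pipeline_ready"):
    mark_phase(phase)

assert records == [
    "lifecycle phase=restarting_process",
    "lifecycle phase=fastapi_boot",
    "lifecycle phase=pipeline_ready",
]
```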
On the Mobile Client
Refactored STT thread lifecycles so only one Azure recognizer runs, reconnection attempts do not overlap, and cleanup occurs right after final results. Added guards in Swift/Kotlin to avoid sending audio during server Phase 2 and improved logging for connection state transitions. Implemented manual read timeouts to detect silent hangs the library missed.
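The manual read timeout treats a silent-but-open connection as a failure even when the WebSocket library reports it healthy. The actual guards live in Swift/Kotlin; a Python sketch of the same idea, where `recv` is any awaitable that yields the next message:

```python
import asyncio

async def read_with_timeout(recv, timeout=5.0):
    """Manual read timeout for silent hangs the library misses (sketch).

    Returns the next message, or None if nothing arrives in time,
    at which point the caller triggers fallback/reconnect.
    """
    try:
        return await asyncio.wait_for(recv(), timeout)
    except asyncio.TimeoutError:
        return None

async def silent():
    # Simulates a Phase 2 hang: the socket is open but no data arrives.
    await asyncio.sleep(60)

result = asyncio.run(read_with_timeout(silent, timeout=0.05))
assert result is None
```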
On the Debugging Environment
Created staging builds that log to native system logs and expose switches to simulate slow networks and delayed server readiness. Used them to reproduce and verify the behavior.
Outcome
After these fixes, restarts recover reliably, fallback and reconnection stabilized, audio loss shrank even during long reconnections, and staging logs now reveal server and client lifecycles clearly.