Real-time speech recognition systems with automatic language detection

Purpose

Many real-time speech systems require the language to be set before you start speaking, which breaks the flow of a conversation. This design detects the language automatically and transcribes without any configuration, so a user can simply start talking.

Exclusions

Model specifics and infrastructure are out of scope. The design targets an MVP with one active speaker at a time; overlapping speech is not covered.

Design

System Requirements

The system must detect the speaker’s language and transcribe the speech in real time. Audio arrives over a WebSocket connection in 256-byte chunks.
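
As a rough illustration, a minimal receive loop in Python could look like the sketch below. It assumes the third-party websockets package (whose handler signature varies slightly between versions), an arbitrary port, and a handle_chunk hand-off that is not part of the original design.

import asyncio
import websockets  # third-party package; API details differ slightly across versions

CHUNK_SIZE = 256  # bytes per chunk, as stated in the requirements

async def handle_chunk(chunk: bytes) -> None:
    """Placeholder for the buffering/diarization pipeline described below."""
    ...

async def audio_handler(websocket):
    # Each WebSocket message is assumed to carry one 256-byte chunk of raw audio.
    async for chunk in websocket:
        await handle_chunk(chunk)

async def main():
    async with websockets.serve(audio_handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())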

High-Level Design

The system workflow is roughly as follows:

Client (Audio Input)
        │
        ▼
┌───────────────┐
│  WebSocket    │  Receive audio from client
└─────┬─────────┘
      ▼
┌───────────────────────────┐
│ Speaker Diarization Model │  Identify who is speaking
└─────┬─────────────────────┘
      ▼
┌───────────────────────────────┐
│ Language Identification Model │  Determine the language per speaker
└─────┬─────────────────────────┘
      ▼
┌───────────────────────┐
│      STT Model        │  Perform transcription
└───────────────────────┘

Key Points

Speaker diarization comes first so that even tiny chunks can be grouped by “who spoke when.” Language identification and transcription then run per speaker, which improves accuracy and keeps processing simple. The flow is: detect who is speaking (the same person switching to another language is treated as a separate segment), send that speaker’s audio to language identification, and transcribe in the detected language.
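
As a minimal sketch of this flow, with all three models behind assumed stub interfaces (diarize, identify_language, and transcribe are illustrative names, not the actual models):

def diarize(audio: bytes) -> list[tuple[str, bytes]]:
    """Return (speaker_id, segment_audio) pairs; stub for illustration."""
    return [("speaker_1", audio)]

def identify_language(audio: bytes) -> str:
    """Return a language code such as "ja" or "en"; stub for illustration."""
    return "ja"

def transcribe(audio: bytes, language: str) -> str:
    """Return the transcript in the given language; stub for illustration."""
    return "..."

def process_buffered_audio(audio: bytes) -> list[str]:
    """Diarize first, then run language ID and STT per speaker segment."""
    results = []
    for speaker_id, segment in diarize(audio):
        language = identify_language(segment)          # per-speaker language
        text = transcribe(segment, language=language)  # STT in that language
        results.append(f"[{speaker_id}/{language}] {text}")
    return results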

Detailed Design

Audio Buffering

The 256-byte chunks are too small to process one by one: tiny chunks hurt accuracy and inflate the number of inference calls, so a short buffer sits in front of diarization. The trade-off is a small added delay in exchange for more stable results.
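
A minimal sketch of such a buffer, assuming roughly one second of 16 kHz 16-bit mono audio per window (the audio format and window size are tuning assumptions, not part of the original design); at that rate each 256-byte chunk covers only about 8 ms:

class AudioBuffer:
    """Aggregate 256-byte WebSocket chunks into windows large enough to diarize."""

    def __init__(self, window_bytes: int = 16_000 * 2):  # ~1 s of 16 kHz 16-bit mono (assumed)
        self._window_bytes = window_bytes
        self._pending = bytearray()

    def add(self, chunk: bytes) -> bytes | None:
        """Append a chunk; return a full window once enough audio has accumulated."""
        self._pending.extend(chunk)
        if len(self._pending) >= self._window_bytes:
            window = bytes(self._pending[:self._window_bytes])
            del self._pending[:self._window_bytes]
            return window
        return None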

Minimize Inference to Reduce Latency

To keep latency down, stop re-running language identification once confidence is high enough and run STT only on new audio. A context object per speaker segment tracks buffered audio, language ID, and STT results:

Context Object
├─ AudioChunker           # Manage audio chunks per speaker
├─ LanguageIdentification # Language identification result
└─ STTResults             # Transcription result
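
Translated into code, the context object could be a plain dataclass; the field types, the confidence score, and the STT cursor are assumptions layered on top of the structure above:

from dataclasses import dataclass, field

@dataclass
class SpeakerContext:
    """Per-speaker-segment state: buffered audio, language ID, and STT output."""
    speaker_id: str
    audio_chunker: bytearray = field(default_factory=bytearray)  # AudioChunker: audio buffered for this speaker
    language: str | None = None        # LanguageIdentification result (None until detected)
    language_confidence: float = 0.0   # confidence of the last language-ID run (assumed field)
    stt_results: list[str] = field(default_factory=list)  # STTResults: accumulated transcripts
    stt_cursor: int = 0                # bytes already sent to STT, so only new audio is transcribed (assumed)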

When a new speaker is detected, create a context, buffer their audio, and run language ID and STT against it; close the context when that speaker’s segment ends.
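
A sketch of that lifecycle, reusing the SpeakerContext and transcribe stub from the earlier sketches; the confidence threshold and the language-ID-with-confidence interface are assumptions:

LANGUAGE_CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for skipping further language ID

contexts: dict[str, SpeakerContext] = {}

def identify_language_with_confidence(audio: bytes) -> tuple[str, float]:
    """Stub for a language-ID model that also returns a confidence score."""
    return "ja", 0.95

def on_speaker_audio(speaker_id: str, segment: bytes) -> str:
    """Route one diarized segment into its speaker's context and transcribe only the new audio."""
    ctx = contexts.setdefault(speaker_id, SpeakerContext(speaker_id=speaker_id))
    ctx.audio_chunker.extend(segment)

    # Re-run language ID only while confidence is still low.
    if ctx.language_confidence < LANGUAGE_CONFIDENCE_THRESHOLD:
        ctx.language, ctx.language_confidence = identify_language_with_confidence(bytes(ctx.audio_chunker))

    # Transcribe only the audio that arrived since the last STT call.
    new_audio = bytes(ctx.audio_chunker[ctx.stt_cursor:])
    ctx.stt_cursor = len(ctx.audio_chunker)
    text = transcribe(new_audio, language=ctx.language)
    ctx.stt_results.append(text)
    return text

def on_speaker_finished(speaker_id: str) -> list[str]:
    """Close the context when the speaker's turn ends and return their transcript."""
    ctx = contexts.pop(speaker_id)
    return ctx.stt_results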

Final Design

Client (Audio Input)
        │
        ▼
┌───────────────┐
│  WebSocket    │  Receive audio chunks (256 bytes each)
└─────┬─────────┘
      ▼
┌─────────────────┐
│ Audio Buffer    │  Aggregate small chunks for stable processing
└─────┬───────────┘
      ▼
┌───────────────────────────┐
│ Real-time Diarization     │  Segment who is speaking
└─────┬─────────────────────┘
      ▼
      ┌──────────────────────────────────────────────┐
      │ Speaker-wise Context Creation & AudioChunker │
      └─────────────┬─────────────┬──────────────────┘
                    │             │
        ┌───────────┘             └─────────────┐
        ▼                                       ▼
┌─────────────────────┐             ┌─────────────────────┐
│ Context: Speaker 1  │             │ Context: Speaker 2  │
│ ├─ AudioChunker     │             │ ├─ AudioChunker     │
│ ├─ LangID (ja)      │             │ ├─ LangID (?)       │
│ └─ STTResults       │             │ └─ STTResults       │
└─────────┬───────────┘             └─────────┬───────────┘
          │                                   │
          ▼                                   ▼
┌─────────────────────┐             ┌─────────────────────┐
│ Language Detection: │             │ Language Detection: │
│  Speaker1: skip     │             │  Speaker2: en       │
└─────────┬───────────┘             └─────────┬───────────┘
          │                                   │
          ▼                                   ▼
┌─────────────────────┐             ┌─────────────────────┐
│ STT: Speaker 1 (ja) │             │ STT: Speaker 2 (en) │
└─────────┬───────────┘             └─────────┬───────────┘
          │                                   │
          ▼                                   ▼
     ┌────────────────┐                ┌────────────────┐
     │  Transcription │                │  Transcription │
     │   (Speaker1)   │                │   (Speaker2)   │
     └────────────────┘                └────────────────┘
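
Tying the sketches above together, the handling of one incoming chunk under the final design might look like this; it fills in the handle_chunk placeholder from the first sketch and reuses the assumed AudioBuffer, diarize, and on_speaker_audio pieces:

buffer = AudioBuffer()

async def handle_chunk(chunk: bytes) -> None:
    """Full path for one 256-byte chunk: buffer -> diarize -> per-speaker context -> STT."""
    window = buffer.add(chunk)
    if window is None:
        return  # not enough audio yet for stable diarization

    for speaker_id, segment in diarize(window):
        text = on_speaker_audio(speaker_id, segment)  # language ID is skipped once confidence is high
        if text:
            print(f"[{speaker_id}] {text}")  # in practice, push the result back to the client over the WebSocket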

Evaluation

Automatic language detection plus transcription is workable with this layout, but several limits remain: accuracy on short or accented speech is still a challenge, the synchronous pipeline stages constrain latency, and broad multilingual coverage is hard to achieve, so targeting a narrow set of language pairs is more realistic.

Conclusion

Layering translation or TTS on top of this pipeline yields a real-time multilingual system that needs no explicit language setting. The design is MVP-level, but it is technically feasible.