Purpose
Many real-time speech recognition systems require a language setting before starting recognition.
However, switching languages every time can be cumbersome.
In this article, I introduce a system design that automatically detects the spoken language without any language setting and performs speech recognition.
The biggest advantage is that users can converse naturally without thinking about language settings.
No language switching is required; just speaking is enough for automatic transcription and translation.
Exclusions
I will not cover machine learning model details or infrastructure setup; these are confidential and out of scope for this article.
This design assumes a simple MVP-level use case targeting one speaker at a time. Simultaneous (overlapping) speech is not discussed.
Design
System Requirements
- Detect the language of the speaker
- Transcribe the speaker’s audio
- Process the above in real time
Conditions
- Audio data is transmitted via WebSocket in small chunks of 256 bytes per message.
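For concreteness, here is a minimal sketch of the receiving side, assuming the Python websockets library and binary messages of 256 bytes each; the handler and queue names are illustrative and not part of the actual system.

import asyncio

import websockets  # assumed transport library; any WebSocket server would do

CHUNK_SIZE = 256               # bytes per message, per the condition above
audio_queue = asyncio.Queue()  # hands chunks to the buffering stage

async def handle_audio(websocket):
    # Each incoming binary message is one small raw-audio chunk; push it
    # downstream for buffering before diarization.
    async for message in websocket:
        if len(message) == CHUNK_SIZE:
            await audio_queue.put(message)

async def main():
    # Recent versions of websockets pass only the connection to the handler.
    async with websockets.serve(handle_audio, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())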
High-Level Design
The system workflow is roughly as follows:
Client (Audio Input)
│
▼
┌───────────────┐
│ WebSocket │ Receive audio from client
└─────┬─────────┘
▼
┌───────────────────────────┐
│ Speaker Diarization Model │ Identify who is speaking
└─────┬─────────────────────┘
▼
┌───────────────────────────────┐
│ Language Identification Model │ Determine the language per speaker
└─────┬─────────────────────────┘
▼
┌───────────────────────┐
│ STT Model │ Perform transcription
└───────────────────────┘
Key Points
- Perform speaker diarization first
  → Even with small audio chunks, we can accurately segment “who spoke when.”
- Subsequent processes (language identification and transcription) are performed per speaker
  → This improves accuracy and simplifies processing.
Workflow steps (a sketch of this flow follows the list):
- Use the speaker diarization model to identify who is speaking now (treat the same person speaking in a different language as a separate speaker).
- Send the segmented audio per speaker to the language identification model.
- Once the language is detected, perform transcription in the corresponding language.
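The sketch below traces these three steps for one buffered audio window. The model objects (diarization_model, language_id_model, stt_model) are placeholders; the real models and their APIs are out of scope here.

def process_window(audio: bytes) -> list[dict]:
    """One pass over a buffered audio window: diarize, identify, transcribe."""
    results = []
    # Step 1: diarization splits the window into (speaker, segment) pairs.
    for speaker_id, segment in diarization_model.diarize(audio):
        # Step 2: language identification on the per-speaker segment.
        language, _confidence = language_id_model.identify(segment)
        # Step 3: transcription in the language detected for that speaker.
        text = stt_model.transcribe(segment, language=language)
        results.append({"speaker": speaker_id, "language": language, "text": text})
    return results

The confidence score returned alongside the language becomes useful in the detailed design below, where it is used to skip repeated identification.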
Detailed Design
Audio Buffering
Small audio chunks can reduce model inference accuracy and increase the number of inference calls, which can lead to higher STT latency.
Therefore, it is necessary to buffer some audio data before performing speaker diarization (a buffering sketch follows the list below).
- Benefit: Improved accuracy
- Trade-off: Slight delay in real-time processing
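A minimal buffering sketch follows; the 16 kHz, 16-bit mono PCM format and the 0.5-second window are assumptions and would need tuning against the actual latency budget.

SAMPLE_RATE = 16_000      # Hz (assumed input format)
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM (assumed)
BUFFER_SECONDS = 0.5      # accepted delay before diarization (assumed)
BUFFER_TARGET = int(SAMPLE_RATE * BYTES_PER_SAMPLE * BUFFER_SECONDS)

class AudioBuffer:
    """Aggregates 256-byte WebSocket chunks until enough audio has
    accumulated for a stable diarization call."""

    def __init__(self) -> None:
        self._data = bytearray()

    def add(self, chunk: bytes) -> bytes | None:
        self._data.extend(chunk)
        if len(self._data) >= BUFFER_TARGET:
            window = bytes(self._data)
            self._data.clear()
            return window   # ready to hand to the diarization model
        return None         # keep buffering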
Minimize Inference to Reduce Latency
To avoid unnecessary inference, we apply the following optimizations:
- Skip re-inference once the language identification reaches a certain confidence level
- Perform STT only on unprocessed audio segments
To manage this, we introduce a Context class per speaker segment:
Context Object
├─ AudioChunker # Manage audio chunks per speaker
├─ LanguageIdentification # Language identification result
└─ STTResults # Transcription result
- When a new speaker is detected, create a Context
- Buffer audio per speaker, perform language identification and STT
- End the Context once processing is complete (see the sketch below)
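A minimal sketch of such a Context follows, assuming a confidence threshold for skipping language re-identification and a byte offset for tracking already-transcribed audio; field names and the threshold value are illustrative.

from dataclasses import dataclass, field

LANG_CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off for skipping re-identification

@dataclass
class Context:
    speaker_id: str
    audio_chunks: list[bytes] = field(default_factory=list)  # AudioChunker role
    language: str | None = None                              # LanguageIdentification result
    language_confidence: float = 0.0
    stt_results: list[str] = field(default_factory=list)     # STTResults
    processed_bytes: int = 0                                  # audio already sent to STT

    def needs_language_id(self) -> bool:
        # Skip re-inference once the language is known with enough confidence.
        return self.language is None or self.language_confidence < LANG_CONFIDENCE_THRESHOLD

    def unprocessed_audio(self) -> bytes:
        # Only audio that has not been transcribed yet goes to the STT model.
        data = b"".join(self.audio_chunks)
        return data[self.processed_bytes:]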
Final Design
Client (Audio Input)
│
▼
┌───────────────┐
│ WebSocket │ Receive audio chunks (256 bytes each)
└─────┬─────────┘
▼
┌─────────────────┐
│ Audio Buffer │ Aggregate small chunks for stable processing
└─────┬───────────┘
▼
┌───────────────────────────┐
│ Real-time Diarization │ Segment who is speaking
└─────┬─────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ Speaker-wise Context Creation & AudioChunker │
└─────────────┬─────────────┬──────────────────┘
│ │
┌───────────┘ └─────────────┐
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Context: Speaker 1 │ │ Context: Speaker 2 │
│ ├─ AudioChunker │ │ ├─ AudioChunker │
│ ├─ LangID (ja) │ │ ├─ LangID (?) │
│ └─ STTResults │ │ └─ STTResults │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Language Detection: │ │ Language Detection: │
│ Speaker1: skip │ │ Speaker2: en │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ STT: Speaker 1 (ja) │ │ STT: Speaker 2 (en) │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Transcription │ │ Transcription │
│ (Speaker1) │ │ (Speaker2) │
└────────────────┘ └────────────────┘
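To show how the pieces could fit together, the loop below combines the AudioBuffer, the per-speaker Context, and the placeholder model calls from the earlier sketches. It is a rough sketch of one processing step per buffered window, not a definitive implementation.

contexts: dict[str, Context] = {}

def handle_window(window: bytes) -> None:
    # One step per window returned by AudioBuffer.add().
    for speaker_id, segment in diarization_model.diarize(window):
        # The same person speaking a different language is treated as a new
        # speaker, so each speaker_id maps to exactly one Context.
        ctx = contexts.setdefault(speaker_id, Context(speaker_id=speaker_id))
        ctx.audio_chunks.append(segment)

        # Language identification, skipped once confidence is high enough.
        if ctx.needs_language_id():
            ctx.language, ctx.language_confidence = language_id_model.identify(segment)

        # STT only on audio that has not been transcribed yet.
        pending = ctx.unprocessed_audio()
        if ctx.language and pending:
            ctx.stt_results.append(stt_model.transcribe(pending, language=ctx.language))
            ctx.processed_bytes += len(pending)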
Evaluation
With this design, automatic language detection + transcription is feasible to a certain extent.
However, there are still challenges:
- Limited model accuracy (short speech segments or accented speech can cause errors)
- The many synchronous processing steps limit real-time performance
- Multilingual support is still challenging (focusing on specific language pairs is more practical)
Conclusion
By adding translation or TTS on top of this system, a real-time multilingual translation system without language settings can be built.
While this is still an MVP-level design, it is technically feasible, and a practical implementation can be expected in the near future.