Purpose
Many real-time speech recognition systems require a language setting before starting recognition.
However, switching languages every time can be cumbersome.
In this article, I introduce a system design that automatically detects the spoken language without any language setting and performs speech recognition.
The biggest advantage is that users can converse naturally without thinking about language settings.
No language switching is required; just speaking is enough for automatic transcription and translation.
Exclusions
I will not cover machine learning model details or infrastructure setup; these are confidential and out of scope for this article.
This design assumes a simple MVP-level use case targeting one speaker at a time. Simultaneous (overlapping) speech is not discussed.
Design
System Requirements
- Detect the language of the speaker
- Transcribe the speaker’s audio
- Process the above in real time
Conditions
- Audio data is transmitted via WebSocket in small chunks of 256 bytes per message.
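For concreteness, here is a minimal sketch of the receiving side, assuming the Python websockets library and binary messages of 256 bytes each; the handler and queue names are illustrative and not part of the actual system.

import asyncio

import websockets  # assumed transport library; any WebSocket server would do

CHUNK_SIZE = 256               # bytes per message, per the condition above
audio_queue = asyncio.Queue()  # hands chunks to the buffering stage

async def handle_audio(websocket):
    # Each incoming binary message is one small raw-audio chunk; push it
    # downstream for buffering before diarization.
    async for message in websocket:
        if len(message) == CHUNK_SIZE:
            await audio_queue.put(message)

async def main():
    # Recent versions of websockets pass only the connection to the handler.
    async with websockets.serve(handle_audio, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())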
High-Level Design
The system workflow is roughly as follows:
Client (Audio Input)
│
▼
┌───────────────┐
│ WebSocket │ Receive audio from client
└─────┬─────────┘
▼
┌───────────────────────────┐
│ Speaker Diarization Model │ Identify who is speaking
└─────┬─────────────────────┘
▼
┌───────────────────────────────┐
│ Language Identification Model │ Determine the language per speaker
└─────┬─────────────────────────┘
▼
┌───────────────────────┐
│ STT Model │ Perform transcription
└───────────────────────┘
Key Points
- Perform speaker diarization first
  → Even with small audio chunks, we can accurately segment “who spoke when.”
- Subsequent processes (language identification and transcription) are performed per speaker
  → This improves accuracy and simplifies processing.
Workflow steps (a sketch of this flow follows the list):
- Use the speaker diarization model to identify who is speaking now (treat the same person speaking in a different language as a separate speaker).
- Send the segmented audio per speaker to the language identification model.
- Once the language is detected, perform transcription in the corresponding language.
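The sketch below traces these three steps for one buffered audio window. The model objects (diarization_model, language_id_model, stt_model) are placeholders; the real models and their APIs are out of scope here.

def process_window(audio: bytes) -> list[dict]:
    """One pass over a buffered audio window: diarize, identify, transcribe."""
    results = []
    # Step 1: diarization splits the window into (speaker, segment) pairs.
    for speaker_id, segment in diarization_model.diarize(audio):
        # Step 2: language identification on the per-speaker segment.
        language, _confidence = language_id_model.identify(segment)
        # Step 3: transcription in the language detected for that speaker.
        text = stt_model.transcribe(segment, language=language)
        results.append({"speaker": speaker_id, "language": language, "text": text})
    return results

The confidence score returned alongside the language becomes useful in the detailed design below, where it is used to skip repeated identification.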
Detailed Design
Audio Buffering
Small audio chunks can reduce model inference accuracy and increase the number of inference calls, which can lead to higher STT latency.
Therefore, it is necessary to buffer some audio data before performing speaker diarization (a buffering sketch follows the list below).
- Benefit: Improved accuracy
- Trade-off: Slight delay in real-time processing
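A minimal buffering sketch follows; the 16 kHz, 16-bit mono PCM format and the 0.5-second window are assumptions and would need tuning against the actual latency budget.

SAMPLE_RATE = 16_000      # Hz (assumed input format)
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM (assumed)
BUFFER_SECONDS = 0.5      # accepted delay before diarization (assumed)
BUFFER_TARGET = int(SAMPLE_RATE * BYTES_PER_SAMPLE * BUFFER_SECONDS)

class AudioBuffer:
    """Aggregates 256-byte WebSocket chunks until enough audio has
    accumulated for a stable diarization call."""

    def __init__(self) -> None:
        self._data = bytearray()

    def add(self, chunk: bytes) -> bytes | None:
        self._data.extend(chunk)
        if len(self._data) >= BUFFER_TARGET:
            window = bytes(self._data)
            self._data.clear()
            return window   # ready to hand to the diarization model
        return None         # keep buffering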
Minimize Inference to Reduce Latency
To avoid unnecessary inference, we apply the following optimizations:
- Skip re-inference once the language identification reaches a certain confidence level
- Perform STT only on unprocessed audio segments
To manage this, we introduce a Context class per speaker segment:
Context Object
├─ AudioChunker # Manage audio chunks per speaker
├─ LanguageIdentification # Language identification result
└─ STTResults # Transcription result
- When a new speaker is detected, create a Context
- Buffer audio per speaker, perform language identification and STT
- End the Context once processing is complete (see the sketch below)
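A minimal sketch of such a Context follows, assuming a confidence threshold for skipping language re-identification and a byte offset for tracking already-transcribed audio; field names and the threshold value are illustrative.

from dataclasses import dataclass, field

LANG_CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off for skipping re-identification

@dataclass
class Context:
    speaker_id: str
    audio_chunks: list[bytes] = field(default_factory=list)  # AudioChunker role
    language: str | None = None                              # LanguageIdentification result
    language_confidence: float = 0.0
    stt_results: list[str] = field(default_factory=list)     # STTResults
    processed_bytes: int = 0                                  # audio already sent to STT

    def needs_language_id(self) -> bool:
        # Skip re-inference once the language is known with enough confidence.
        return self.language is None or self.language_confidence < LANG_CONFIDENCE_THRESHOLD

    def unprocessed_audio(self) -> bytes:
        # Only audio that has not been transcribed yet goes to the STT model.
        data = b"".join(self.audio_chunks)
        return data[self.processed_bytes:]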
Final Design
Client (Audio Input)
│
▼
┌───────────────┐
│ WebSocket │ Receive audio chunks (256 bytes each)
└─────┬─────────┘
▼
┌─────────────────┐
│ Audio Buffer │ Aggregate small chunks for stable processing
└─────┬───────────┘
▼
┌───────────────────────────┐
│ Real-time Diarization │ Segment who is speaking
└─────┬─────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ Speaker-wise Context Creation & AudioChunker │
└─────────────┬─────────────┬──────────────────┘
│ │
┌───────────┘ └─────────────┐
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Context: Speaker 1 │ │ Context: Speaker 2 │
│ ├─ AudioChunker │ │ ├─ AudioChunker │
│ ├─ LangID (ja) │ │ ├─ LangID (?) │
│ └─ STTResults │ │ └─ STTResults │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Language Detection: │ │ Language Detection: │
│ Speaker1: skip │ │ Speaker2: en │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ STT: Speaker 1 (ja) │ │ STT: Speaker 2 (en) │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Transcription │ │ Transcription │
│ (Speaker1) │ │ (Speaker2) │
└────────────────┘ └────────────────┘
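To show how the pieces could fit together, the loop below combines the AudioBuffer, the per-speaker Context, and the placeholder model calls from the earlier sketches. It is a rough sketch of one processing step per buffered window, not a definitive implementation.

contexts: dict[str, Context] = {}

def handle_window(window: bytes) -> None:
    # One step per window returned by AudioBuffer.add().
    for speaker_id, segment in diarization_model.diarize(window):
        # The same person speaking a different language is treated as a new
        # speaker, so each speaker_id maps to exactly one Context.
        ctx = contexts.setdefault(speaker_id, Context(speaker_id=speaker_id))
        ctx.audio_chunks.append(segment)

        # Language identification, skipped once confidence is high enough.
        if ctx.needs_language_id():
            ctx.language, ctx.language_confidence = language_id_model.identify(segment)

        # STT only on audio that has not been transcribed yet.
        pending = ctx.unprocessed_audio()
        if ctx.language and pending:
            ctx.stt_results.append(stt_model.transcribe(pending, language=ctx.language))
            ctx.processed_bytes += len(pending)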
Evaluation
With this design, automatic language detection + transcription is feasible to a certain extent.
However, there are still challenges:
- Limited model accuracy (short speech segments or accented speech can cause errors)
- The many synchronous processing steps limit real-time performance
- Multilingual support is still challenging (focusing on specific language pairs is more practical)
Conclusion
By adding translation or TTS on top of this system, a real-time multilingual translation system without language settings can be built.
While this is still an MVP-level design, it is technically feasible, and a practical implementation can be expected in the near future.