What happened?
My FastAPI server froze and every incoming request hung indefinitely. The periodic logging loop kept running because it lives on another thread.
Investigation
Reproduce code
import itertools
import logging
import threading
import time
from multiprocessing import JoinableQueue

import uvicorn
from fastapi import FastAPI, Query

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("stuck-server")

app = FastAPI()
task_queue = JoinableQueue(maxsize=1)
worker_should_run = threading.Event()
worker_should_run.set()
threads_started = threading.Event()
task_counter = itertools.count()


def log_heartbeat() -> None:
    while True:
        logger.info("heartbeat: FastAPI process is still logging")
        time.sleep(2)


def drain_queue() -> None:
    while True:
        worker_should_run.wait()
        task_id = task_queue.get()
        logger.info("worker consumed %s", task_id)
        time.sleep(0.2)
        task_queue.task_done()


@app.on_event("startup")
def start_threads() -> None:
    if threads_started.is_set():
        return
    threading.Thread(target=drain_queue, daemon=True).start()
    threading.Thread(target=log_heartbeat, daemon=True).start()
    threads_started.set()
    logger.info("background threads started")


@app.get("/transcribe")
async def transcribe(
    block: bool = Query(
        False,
        description=(
            "When true, pause the worker and enqueue tasks until JoinableQueue.put() "
            "blocks forever, freezing the server."
        ),
    ),
) -> dict[str, str]:
    task_id = f"task-{next(task_counter)}"
    if block:
        worker_should_run.clear()
        logger.info("pausing worker and filling queue before blocking")
        task_queue.put(f"{task_id}-warmup")
        logger.info("queue filled; the next put call will block forever")
        task_queue.put(f"{task_id}-blocking")
        return {"status": "unreachable"}  # pragma: no cover
    worker_should_run.set()
    task_queue.put(task_id)
    logger.info("task submitted normally")
    return {"status": "enqueued", "task_id": task_id}


@app.get("/healthz")
async def healthz() -> dict[str, str]:
    return {"status": "ok"}


if __name__ == "__main__":
    uvicorn.run(
        "stuck_server:app",
        host="127.0.0.1",
        port=8001,
        log_level="info",
    )
Guiding question
Do you understand how this happens?
How did the deadlock form?
The main thread runs the asyncio event loop.
We emit events to a worker thread via a multiprocessing.JoinableQueue.
When that worker stops pulling items (paused in the repro, dead in production), the bounded queue fills up and the next synchronous put() call blocks forever.
Because that put() is called from an async handler running on the event-loop thread, the entire FastAPI server freezes at that point.
The logging heartbeat runs on another thread, so it continues to print even while the main thread is wedged.
Background
How OS manages many I/Os efficiently
Operating systems provide mechanisms for waiting on many I/O sources efficiently. Without them, the CPU would sit idle while a single task blocks.
Inefficient case:
Task A
|
|---- IO wait (3 seconds) ----|
|
rest of Task A
In this case the CPU cannot do useful work until the I/O finishes.
More efficient case:
Task A
|
| STOP (await)
|
Task B runs
Task C runs
Task D runs
|
IO completes
|
Task A resumes
While Task A waits on I/O, the scheduler can run other tasks, making much better use of the CPU. Having a mechanism to run other work while a task waits on I/O is therefore crucial.
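A minimal asyncio sketch (illustrative, not from the incident) makes the payoff concrete: three overlapping "I/O waits" finish in roughly the time of one, instead of three times as long when run back to back.

```python
import asyncio
import time

async def io_task(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for an I/O wait (e.g. a network read)
    return name

async def main() -> float:
    start = time.monotonic()
    # Three "I/O waits" of 0.3s each overlap instead of running sequentially.
    await asyncio.gather(io_task("A", 0.3), io_task("B", 0.3), io_task("C", 0.3))
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"total: {elapsed:.2f}s (sequential would be ~0.90s)")
```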
Linux I/O model
Linux provides an API named epoll that can wait on a large number of file descriptors efficiently.
epoll is a kernel-level I/O event notification API; it lets you ask the kernel to watch many descriptors and tell you when they become ready.
Linux also follows the well-known design idea:
Everything is a file
Most I/O resources are managed as file descriptors (fd), which are just integers. Examples include:
- files
- sockets (network connections)
- pipes
- terminals
- timers
- event notification handles
When you open a socket, the OS returns an fd, e.g.
socket fd = 42
You then use that fd to issue I/O calls:
read(fd)
write(fd)
Role of epoll
epoll watches the state of each registered fd.
A program typically:
- registers the fds it wants to monitor
- calls epoll_wait() to wait for those fds to become ready
epoll_wait()
When the program calls epoll_wait(), the OS watches for readiness:
fd ready ?
and puts the process to sleep until something becomes ready.
If a network packet arrives, the kernel marks the socket as readable:
socket fd → READABLE
At that point epoll_wait() returns with the list of ready fds, for example:
ready fds:
- socket_1
- socket_7
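This register-then-wait cycle can be sketched with Python's selectors module, whose DefaultSelector is backed by epoll on Linux (the socketpair here is a stand-in for a real network connection; names are illustrative):

```python
import selectors
import socket

# On Linux, DefaultSelector is implemented on top of epoll.
sel = selectors.DefaultSelector()

# A connected socket pair stands in for a real network connection.
r, w = socket.socketpair()
sel.register(r, selectors.EVENT_READ)  # register the fd we want to monitor

w.send(b"ping")  # a "packet" arrives: the kernel marks r as readable

# select() wraps epoll_wait(): it sleeps until a registered fd is ready.
events = sel.select(timeout=1)
ready_fds = [key.fileobj for key, _mask in events]
assert ready_fds == [r]
print(r.recv(4))  # b'ping'
```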
Event loop
When epoll_wait() returns, user-space code resumes.
In most asynchronous programs an event loop performs this work.
Typical flow:
while True:
    run_ready_tasks()
    events = epoll_wait()
    resume_tasks_for(events)
This enables the efficient cycle of
I/O wait
↓
run other tasks
↓
I/O complete
↓
resume original task
In Linux:
- most I/O surfaces are exposed as file descriptors (fd)
- epoll lets the kernel watch many fds efficiently
- epoll_wait() reports which fd is ready
- the event loop restarts the task that was waiting on that fd
All of this allows the CPU to keep working on other tasks while one task waits for I/O.
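The recap above can be turned into a tiny runnable loop: a ready queue of callbacks plus a selector (epoll on Linux). This is a toy sketch, not asyncio's real implementation; the socketpair and callback names are illustrative.

```python
import selectors
import socket
from collections import deque

sel = selectors.DefaultSelector()  # epoll-backed on Linux
ready = deque()                    # "Ready Tasks" queue
results = []

r, w = socket.socketpair()
r.setblocking(False)

def on_readable() -> None:
    results.append(r.recv(16))

# Map the fd to the callback to resume, mimicking resume_tasks_for(events).
sel.register(r, selectors.EVENT_READ, on_readable)
ready.append(lambda: w.send(b"hello"))  # a ready task that triggers the I/O

for _ in range(2):                           # two iterations are enough here
    while ready:                             # run_ready_tasks()
        ready.popleft()()
    for key, _mask in sel.select(timeout=0.5):  # epoll_wait()
        ready.append(key.data)               # resume_tasks_for(events)

print(results)  # [b'hello']
sel.close()
r.close()
w.close()
```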
How the event loop works
The event loop is implemented in Python, not in the OS. Conceptually it looks like this:
+----------------------+
| Event Loop |
+----------------------+
| |
| |
Ready Tasks epoll
(Queue) (waiting)
- Ready Tasks → tasks that can run immediately
- epoll → file descriptors that are currently waiting for I/O
Consider this example:
Task A
|
| STOP (await)
|
Task B runs
Task C runs
Task D runs
|
IO completes
|
Task A resumes
When Task A executes an await, the following happens:
- Task A enters the ready queue.
- The event loop schedules Task A.
- Task A hits await and suspends.
- The socket fd (say 42) backing that await is registered with epoll.
- Control returns to the event loop.
- The ready queue now contains [Task B, Task C, Task D].
- The loop runs B, C, and D.
- More network data arrives and marks fd 42 READABLE.
- The event loop knows fd 42 belongs to Task A, so it re-queues Task A.
- Task A eventually runs again and continues.
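The schedule above can be observed directly with a small asyncio sketch (task names are illustrative; asyncio.sleep stands in for the socket wait on fd 42):

```python
import asyncio

order = []

async def task_a() -> None:
    order.append("A starts")
    await asyncio.sleep(0.05)  # suspends A, as the socket await would
    order.append("A resumes")

async def other(name: str) -> None:
    order.append(f"{name} runs")

async def main() -> None:
    # gather schedules all four tasks on the same event loop, in order.
    await asyncio.gather(task_a(), other("B"), other("C"), other("D"))

asyncio.run(main())
print(order)
# → ['A starts', 'B runs', 'C runs', 'D runs', 'A resumes']
```

While A is parked on its await, the loop drains the ready queue (B, C, D) before A's wait completes.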
What is a coroutine?
A coroutine is a function that can pause and later resume. A regular function runs straight through once called; a coroutine can yield control mid-way.
Python's await keyword pauses the currently running coroutine and hands control back to the event loop.
The event loop runs other tasks and resumes the suspended coroutine when its awaited I/O (or other awaited operation) completes.
async def task_a():
    data = await sock.recv()
    print(data)
The await pauses task_a:
task_a()
|
await sock.recv()
|
pause
Putting it all together:
Event Loop
|
run(Task A)
|
Coroutine A
|
await future
|
pause
|
return control
|
Event Loop
socket readable
|
epoll_wait returns
|
Future done
|
Task A re-enters ReadyQueue
|
Event Loop
|
resume coroutine
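A runnable sketch of this future-based handoff (names are illustrative; calling set_result stands in for "fd 42 became READABLE"):

```python
import asyncio

async def task_a(fut: asyncio.Future) -> str:
    # Awaiting the Future suspends this coroutine until someone resolves it.
    value = await fut
    return value

async def main() -> str:
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    task = asyncio.create_task(task_a(fut))
    await asyncio.sleep(0)          # let task_a run up to its await and pause
    fut.set_result("socket data")   # stands in for the fd becoming readable
    return await task               # task_a re-enters the ready queue and finishes

print(asyncio.run(main()))  # socket data
```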
How JoinableQueue works
multiprocessing.JoinableQueue provides a task queue plus synchronization for tracking task completion. Unlike a normal Queue, it lets a producer wait until every enqueued task has been processed.
This structure shows up frequently in parallel programs:
Main Process
|
v
Producer → Queue → Worker1
Worker2
Worker3
Common examples include map-reduce jobs, batch processing, crawlers, and generic task executors.
JoinableQueue acts as the synchronization primitive that lets the producer wait until all workers finish their tasks.
If a task in the queue never gets processed (so no one calls task_done()), join() blocks forever.
In practice the program appears stuck — a deadlock.
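A minimal sketch of the task_done()/join() contract, using the same multiprocessing.JoinableQueue as the repro (the worker logic is illustrative):

```python
import threading
from multiprocessing import JoinableQueue

q = JoinableQueue()
results = []

def worker() -> None:
    while True:
        item = q.get()
        results.append(item * 2)
        q.task_done()  # without this call, q.join() below would block forever

threading.Thread(target=worker, daemon=True).start()

for i in range(3):
    q.put(i)

q.join()  # returns only after task_done() has been called once per put()
print(sorted(results))  # [0, 2, 4]
```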
Failure analysis
Why did the event loop hang completely?
- The worker thread consuming the JoinableQueue dies.
- The async handler still tries to synchronously enqueue work.
- That put() call blocks once the queue fills up, so the handler never returns.
- Because the handler runs on the main event-loop thread, the thread stops processing events.
Why does put() on the queue hang?
Because the worker that should be draining the JoinableQueue is dead, nothing consumes the items, the queue reaches its capacity, and the next put() call blocks.
Summary
The asyncio event loop is blocked by a synchronous hang inside an async handler. To prevent a repeat:
- Give queue.put() a timeout and handle queue.Full (log, drop, or retry) so the event loop never blocks forever.
- Add a health check/restart for the worker thread so the queue keeps draining; if the worker dies, the main loop can fail fast instead of wedging.
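A sketch of the first fix (the queue mirrors the repro's maxsize=1 setup; the drop-and-log policy is an assumed choice, and in a real handler you would return an error response instead of printing):

```python
import queue
from multiprocessing import JoinableQueue

task_queue = JoinableQueue(maxsize=1)
task_queue.put("warmup")  # the queue is now full, as in the incident

try:
    # A timed put keeps the event-loop thread responsive instead of
    # blocking forever when the worker has stopped draining the queue.
    task_queue.put("next-task", timeout=0.5)
except queue.Full:
    # Assumed policy: drop the task and surface the error to the caller.
    print("queue full: rejecting request instead of hanging the server")
```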