Document Processing Timeouts and a Real-Time Progress API
Long-running document jobs sometimes hung; the UI had no way to show progress. We added timeout safeguards and a progress API so users could see status without polling blindly.
The Problem
The Curia document processing pipeline runs a series of tasks (parsing, extraction, analysis) on uploaded contracts. Users reported that some jobs appeared to run forever with no feedback. Support had no visibility into whether a job was still working or stuck. The frontend either showed a generic “processing” state or polled for completion with no intermediate progress, so users did not know if they should wait or refresh.
Investigation
I looked at the task runner and the individual processing steps. Some steps could block on external services or heavy computation, and because none of them enforced a timeout, a stuck dependency could hang the entire job. On the frontend, the only signals were “job started” and “job finished”; the backend had no way to push or expose partial progress (e.g. “step 2 of 5 complete”). So we had two issues: (1) jobs could run indefinitely if a step hung, and (2) the UI had no progress information to show.
Root Cause
- No timeouts: Processing steps did not enforce a maximum duration, so a slow or stuck call could block the pipeline indefinitely.
- No progress model: The system only stored a final “completed” or “failed” state. Intermediate step completion was not persisted or exposed, so the API had nothing to return for “how far along is this job?”
The Fix
We did two things:
- Timeout safeguards per step. Each processing task was wrapped with a configurable timeout (e.g. 60–120 seconds per step depending on the step type). If a step exceeded the timeout, we marked the job as failed with a clear error (e.g. “Step X timed out”) and avoided leaving the pipeline in an ambiguous state. We also added timeouts around external HTTP calls used inside steps.
- Real-time task progress API. We extended the job model to store the current step index and step status (pending, running, completed, failed). As each step started and finished, we updated this state. We then added a lightweight API (e.g. GET /jobs/:id/progress) that returned the current step and overall progress (e.g. “Step 3 of 5”). The frontend could poll this endpoint or use it in an existing polling loop to show a progress bar or “Step 3 of 5” instead of a generic spinner.
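The two changes above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual Curia code: the names `Job`, `Step`, `StepTimeout`, `run_with_timeout`, and `run_job`, the field names returned by `progress()`, and the thread-based timeout are all hypothetical. The step durations are shortened so the example finishes quickly.

```python
import threading
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class StepStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class StepTimeout(Exception):
    """Raised when a step overruns its time limit (hypothetical name)."""


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    timeout_s: float  # assumed per-step limit, configurable per step type


@dataclass
class Job:
    steps: List[Step]
    current_step: int = 0  # zero-based index of the step being worked on
    step_status: StepStatus = StepStatus.PENDING
    error: Optional[str] = None

    def progress(self) -> dict:
        """Assumed payload for an endpoint such as GET /jobs/:id/progress."""
        return {
            "current_step": self.current_step + 1,  # one-based for display
            "total_steps": len(self.steps),
            "step_name": self.steps[self.current_step].name,
            "step_status": self.step_status.value,
            "error": self.error,
        }


def run_with_timeout(fn: Callable[[], None], timeout_s: float) -> None:
    """Run fn in a daemon thread and stop waiting after timeout_s seconds.

    The worker thread is abandoned, not killed; a real pipeline may need
    a subprocess or cooperative cancellation to release resources.
    """
    outcome = {}

    def target() -> None:
        try:
            fn()
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise StepTimeout()
    if "error" in outcome:
        raise outcome["error"]


def run_job(job: Job) -> None:
    """Run steps in order, recording progress before and after each one."""
    for index, step in enumerate(job.steps):
        job.current_step = index
        job.step_status = StepStatus.RUNNING
        try:
            run_with_timeout(step.run, step.timeout_s)
        except StepTimeout:
            job.step_status = StepStatus.FAILED
            job.error = f"Step {step.name} timed out"
            return
        except Exception as exc:
            job.step_status = StepStatus.FAILED
            job.error = f"Step {step.name} failed: {exc}"
            return
        job.step_status = StepStatus.COMPLETED


# Usage: the second step simulates a hung dependency.
job = Job(steps=[
    Step("parsing", lambda: None, timeout_s=1.0),
    Step("extraction", lambda: time.sleep(0.5), timeout_s=0.05),
    Step("analysis", lambda: None, timeout_s=1.0),
])
run_job(job)
print(job.progress())
```

In this sketch the job stops at the second step and reports it as failed with a timeout error, so a progress request never sees an ambiguous state. In a real service the `Job` fields would be persisted (for example in the jobs table) after each update so that the progress endpoint can read them from another process.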
Lessons Learned
- Timeouts are essential for any pipeline that touches external services or heavy work. Without them, one stuck step can make the whole system look broken and leave resources tied up.
- Exposing progress improves perceived performance and supportability. Even a simple “step N of M” or “current step name” gives users and support a clear picture and reduces “is it stuck?” tickets.
- Storing progress in the job model keeps the API simple. The frontend doesn’t need to know the internal steps; it just reads a progress field and displays it.
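To illustrate the last point, here is a rough client-side sketch of a polling loop that only reads a progress payload and turns it into a label. It is written in Python for brevity rather than in frontend code, and everything in it is an assumption: the payload fields (`current_step`, `total_steps`, `step_status`, `error`), the function names, and the injected `fetch` callable that stands in for a request to the progress endpoint.

```python
import time
from typing import Callable


def format_progress(payload: dict) -> str:
    """Turn a progress payload into a short label for the UI."""
    label = f"Step {payload['current_step']} of {payload['total_steps']}"
    if payload["step_status"] == "failed":
        return f"{label} failed: {payload['error']}"
    return label


def is_finished(payload: dict) -> bool:
    """A job is done when a step failed or the last step completed."""
    if payload["step_status"] == "failed":
        return True
    on_last_step = payload["current_step"] == payload["total_steps"]
    return on_last_step and payload["step_status"] == "completed"


def poll_progress(
    fetch: Callable[[], dict],
    show: Callable[[str], None],
    interval_s: float = 1.0,
    max_polls: int = 300,
) -> dict:
    """Poll until the job finishes, showing a label after every response.

    The client also caps the number of polls so it does not wait forever
    if the backend stops answering.
    """
    for _ in range(max_polls):
        payload = fetch()
        show(format_progress(payload))
        if is_finished(payload):
            return payload
        time.sleep(interval_s)
    raise TimeoutError("gave up waiting for the job to finish")
```

The client never needs the list of internal steps; it only needs the counts and the status, so steps can be added or renamed on the backend without a frontend change.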