When a sub-agent fails midway through an orchestrated pipeline, the orchestrator should pause, surface the failure with context, and wait for human review rather than pushing bad output to the next step. The worst outcome is silent failure — the pipeline completes but produces wrong output that you don’t catch until it’s already published.
The Three Ways a Pipeline Can Fail
Pipeline failures fall into three categories. First: the sub-agent produces no output — it gets stuck, times out, or returns an error. Second: the sub-agent produces output that’s technically present but wrong — off-topic, wrong format, or misinterpreting the input. Third: the sub-agent produces output that looks right but contains factual errors or poor quality that downstream agents then build on.
The first type is the easiest to catch — no output means no trigger for the next step, so the pipeline stalls visibly. The second type is caught by output format verification — if the output doesn’t match the expected structure, flag it. The third type is the most dangerous because it can pass through undetected. This is why review checkpoints exist.
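The first two checks are mechanical, so they can be sketched in a few lines of code. The example below is only an illustration: it assumes a sub-agent hands back either nothing or a dictionary of named fields, and the names `Verdict`, `classify_output`, and `required_fields` are hypothetical, not part of any particular tool.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    NO_OUTPUT = "no output"        # type 1: stuck, timed out, or errored
    BAD_FORMAT = "bad format"      # type 2: present but structurally wrong
    NEEDS_REVIEW = "needs review"  # type 3 can only be caught by a reviewer


def classify_output(output: Optional[dict], required_fields: set) -> Verdict:
    """Sort a sub-agent's output into one of the three categories."""
    if not output:
        return Verdict.NO_OUTPUT
    # Assumption: "expected structure" means a fixed set of field names.
    missing = required_fields - set(output)
    if missing:
        return Verdict.BAD_FORMAT
    # Passing the format check says nothing about accuracy or quality,
    # so well-formed output still goes to a review checkpoint.
    return Verdict.NEEDS_REVIEW
```

Note that there is no "all clear" verdict here. Well-formed output is only a candidate for approval, which is the point of the third category.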
Building Fail-Safe Behavior Into Your Orchestrator
The most practical fail-safe for orchestrators built by educators is an explicit “pause and show” instruction at key handoffs: “After each major step, present the output to the user for approval before continuing to the next step.” This slows the pipeline but catches errors before they propagate. For high-volume, well-tested pipelines you trust, you can remove the checkpoints. For new pipelines, keep them in until you’ve run the workflow enough times to trust each step.
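If your orchestrator is driven by a script rather than a prompt, the same “pause and show” behavior might look like the sketch below. It assumes each step is a plain function that takes text and returns text. `run_with_checkpoints`, `approve`, and `ask_in_terminal` are hypothetical names chosen for illustration.

```python
def run_with_checkpoints(steps, first_input, approve, checkpoints=True):
    """Run steps in order, pausing for approval after each one.

    steps   -- list of (name, function) pairs; each function maps text to text
    approve -- callable (name, output) -> bool, standing in for the human reviewer
    Returns (approved_outputs, stopped_at); stopped_at is None if nothing was rejected.
    """
    approved = []
    current = first_input
    for name, step in steps:
        current = step(current)
        if checkpoints and not approve(name, current):
            # Stop here: rejected output never reaches the next step.
            return approved, name
        approved.append(current)
    return approved, None


def ask_in_terminal(name, output):
    """One possible approver: show the output and wait for a yes/no answer."""
    print(f"--- Output of '{name}' ---\n{output}")
    return input("Continue to the next step? [y/n] ").strip().lower() == "y"
```

Passing `checkpoints=False` corresponds to removing the checkpoints once a pipeline has earned your trust. The steps themselves do not change.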
You can also add a step-level retry instruction: “If the expected output is not received from the sub-agent, retry once with the same input before flagging for human review.” Transient failures, such as a timeout or a one-off malformed response, often resolve on a retry. A failure that repeats on the same input usually points to an ambiguous input or instruction, which is exactly what the human review is there to catch.
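In script form, “retry once, then flag” could be sketched as follows. This assumes a step signals failure either by raising an error or by returning nothing. `run_step_with_retry` and `NeedsHumanReview` are hypothetical names, not part of any library.

```python
class NeedsHumanReview(Exception):
    """Raised when a step still has no usable output after its retry."""


def run_step_with_retry(step, step_input, max_retries=1):
    """Call step(step_input); on failure, retry with the same input."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            output = step(step_input)
        except Exception as exc:  # broad on purpose: any error counts as "no output"
            last_error = exc
            continue
        if output:  # None or empty output is treated as a failure
            return output
    raise NeedsHumanReview(
        f"no usable output after {max_retries + 1} attempts "
        f"(last error: {last_error!r})"
    )
```

The retry limit is deliberately low. A second failure on the same input is treated as a signal for a person to look, not as a reason to keep trying.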
What This Means for Educators
Orchestrated pipelines are more efficient than manual workflows, but they require upfront design work on error handling. A pipeline that fails silently is worse than a manual workflow — at least with manual work, you know what happened. Build explicit failure behavior into every orchestrator: pause, surface, wait. You can always make it more autonomous as trust builds.
The Simple Rule
Design for failure before you design for success. Ask yourself: “What happens if step 3 produces garbage?” before you build step 3. The answer to that question is your fallback instruction. Write it into the orchestrator before you go live.
