Martin Oehlert

Posted on Jun 19

Intro to Durable Functions: Orchestrations and the Chaining Pattern

#serverless #azure #azurefunctions #dotnet

An order comes in, and you need to validate it, create it, then send a confirmation: three steps that have to run in order and survive a crash halfway through. A single Azure Function can't do that, because the moment it returns it forgets everything. The usual fix is a chain of queue-triggered functions wired together with a correlation ID, a status table, and your own retry logic. That hand-rolled state machine is exactly what Durable Functions replaces, and the price of admission is learning to write an orchestrator: normal-looking C# that the runtime is allowed to run more than once.

Why a stateless function can't run a workflow

A plain Azure Function is a single invocation. It receives a trigger, runs, returns, and the worker that ran it can be recycled the instant it finishes. Nothing in the function body survives to the next invocation: no local variables, no "where was I" pointer, no record that step two of a three-step process already succeeded. That design is what makes Functions cheap to scale, and it's the right model for the bulk of event handlers you write.

It stops being enough the moment one logical unit of work spans more than one step. Take the order example: validate the request, create the order record, send a confirmation. You want those to run in sequence, you want the second step to use the output of the first, and you want the whole thing to pick up where it left off if the host restarts between steps. None of that is possible inside one function, so the standard pattern is to split each step into its own queue-triggered function and pass a message down the chain.

That works, but look at what you end up owning. You need a correlation ID so you can tell which messages belong to the same order. You need a status table so you can answer "is order 4815 done yet" and so a retry doesn't redo a step that already completed. You need poison-queue handling, timeout logic, and some way to fan results back together if any step branches. You have hand-built a state machine, and state machines spread across five queues are where the 2 a.m. pages come from.

Durable Functions takes over the state. It records every step your workflow completes in durable storage, and it reconstructs your workflow's position from that record after any interruption. You write the sequence as ordinary C# with await between the steps; the runtime makes the sequence survive crashes, scale-ins, and host upgrades. The correlation ID, the status table, and the retry bookkeeping all move from your code into the framework.

The three roles: orchestrator, activity, client

Durable Functions splits a workflow into three kinds of function, each identified by a trigger or binding type. Keeping them straight is most of the battle when you're starting out, because the rules about what code is legal where depend entirely on which role you're in.

The orchestrator is the workflow itself. It's a function marked with the [OrchestrationTrigger] binding, and its job is to coordinate: call this step, wait for the result, decide what to call next. It contains the control flow (if, loops, sequencing) but does no real work of its own. The orchestrator is the one role with a hard constraint attached. Its body can be re-executed many times over the life of a single workflow instance, so the code in it must be deterministic. That single fact (the orchestrator replays) is what the rest of this series keeps coming back to; the replay mechanics and the exact list of rules are covered later in this article.

An activity is where the actual work happens. It's a function marked with [ActivityTrigger], and it's the role in which a workflow is allowed to touch the outside world: database writes, HTTP calls, sending email, reading a blob. (The client is an ordinary function and can do I/O too, but it sits outside the workflow.) Activities can bind directly to their input type, so an activity that validates an order can take an OrderRequest parameter and nothing else. The guarantee Durable Functions gives you on activities is at-least-once execution, and that has a real consequence. An activity can run more than once for the same logical step (after a transient failure and retry, for instance), so activity logic should be idempotent. Sending a confirmation email twice because the worker died after sending but before recording success is the kind of bug this guarantee invites if you're not careful.

The client (often called the starter) is the entry point that kicks a workflow off and lets you query it. In the isolated worker model the client is the injected DurableTaskClient, supplied through the [DurableClient] binding on a normal trigger such as an HTTP function. It is not something you call from inside an orchestrator; it lives in regular functions that start instances with ScheduleNewOrchestrationInstanceAsync and hand back a status response the caller can poll.

Behind all three sits the task hub: the set of Azure Storage resources (queues, tables, and a couple of blob containers) that the default storage provider creates in your function app's storage account to hold the workflow's messages and history. You don't provision it or write to it directly. It's enough to know it exists and that it's where your workflow's durable state actually lives; the history table inside it is the star of the replay section below.

The chaining pattern

Function chaining is the simplest orchestration and the one you'll reach for most. You run a sequence of activities in order, where each step's output feeds the next. Here is the order-processing workflow as an orchestrator.

using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public record OrderRequest(string CustomerId, string Sku, int Quantity);

public static class OrderOrchestrator
{
    [Function(nameof(OrderOrchestrator))]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var order = context.GetInput<OrderRequest>()!;

        var validated = await context.CallActivityAsync<bool>(
            nameof(ValidateOrderActivity), order);
        if (!validated)
            return "Order validation failed";

        var orderId = await context.CallActivityAsync<string>(
            nameof(CreateOrderActivity), order);

        await context.CallActivityAsync(
            nameof(SendConfirmationActivity), orderId);

        return orderId;
    }
}

Read it top to bottom and it's the sequence from the opening, written as plain C#. context.GetInput<OrderRequest>() pulls the input the client passed when it started the instance; the ! is there because the input is typed as nullable and you know this orchestrator always gets one. Each await context.CallActivityAsync(...) schedules an activity and waits for its result before moving on, which is what gives you the chain: validated gates whether the order is created, and orderId (the output of CreateOrderActivity) becomes the input to SendConfirmationActivity.

The call comes in two shapes. When an activity returns a value you use the generic CallActivityAsync<TResult>, which gives you back a Task<TResult>: CallActivityAsync<bool> for the validation result, CallActivityAsync<string> for the new order ID. When an activity does its work and has nothing to return, you use the non-generic CallActivityAsync, which returns a plain Task; that's the confirmation send. The first argument is the activity name as a TaskName, and since there's an implicit conversion from string, nameof(ValidateOrderActivity) works directly and keeps the name refactor-safe.
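Both shapes also accept an optional third argument, a TaskOptions, which is where a retry policy can go. The following is a minimal sketch rather than a recommended configuration: it reuses the OrderRequest record and CreateOrderActivity name from this article, OrderOrchestratorWithRetry is a made-up name, and the attempt count and intervals are illustrative values.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class OrderOrchestratorWithRetry
{
    // Illustrative values: at most 3 attempts, 5 seconds before the first
    // retry, and a backoff coefficient of 2.0 so each wait doubles.
    private static readonly TaskOptions RetryOptions = TaskOptions.FromRetryPolicy(
        new RetryPolicy(3, TimeSpan.FromSeconds(5), 2.0));

    [Function(nameof(OrderOrchestratorWithRetry))]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var order = context.GetInput<OrderRequest>()!;

        // The retries are scheduled by the runtime, not by a loop in this code.
        // The orchestrator sees either the final result or, once the attempts
        // are used up, an exception from the await.
        return await context.CallActivityAsync<string>(
            nameof(CreateOrderActivity), order, RetryOptions);
    }
}
```

Because the policy lives in the call rather than in the activity, the same activity can be retried aggressively in one workflow and not at all in another.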

The activity is an ordinary function that does the real work. This is where I/O is allowed, so it's where validation against your database or rules engine actually happens.

public static class ValidateOrderActivity
{
    [Function(nameof(ValidateOrderActivity))]
    public static bool Run([ActivityTrigger] OrderRequest order)
    {
        // Real work and I/O belong here, never in the orchestrator:
        // check inventory, validate the customer, hit the database.
        return order.Quantity > 0 && !string.IsNullOrWhiteSpace(order.Sku);
    }
}

The activity binds straight to OrderRequest, the same type the orchestrator passed as input, so there's no manual deserialization. Keep in mind the at-least-once guarantee from earlier. If ValidateOrderActivity did something with side effects, you'd want running it twice to be safe. Pure validation like this is naturally idempotent, which is one reason it's a good first step in the chain.
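The orchestrator also names CreateOrderActivity and SendConfirmationActivity, which are not shown above. Here is one hedged sketch of what they might look like, mainly to illustrate an idempotent step. The bodies are placeholders: the in-memory dictionary is a hypothetical stand-in for a durable "already sent" record, and Console.WriteLine stands in for a real email call. The OrderRequest record is the one from the orchestrator listing.

```csharp
using System;
using System.Collections.Concurrent;
using Microsoft.Azure.Functions.Worker;

public static class CreateOrderActivity
{
    [Function(nameof(CreateOrderActivity))]
    public static string Run([ActivityTrigger] OrderRequest order)
    {
        // Placeholder: a real implementation would insert the order into a
        // database and return its key. Guid.NewGuid() is fine inside an
        // activity, because the returned value is what gets recorded.
        return Guid.NewGuid().ToString("N");
    }
}

public static class SendConfirmationActivity
{
    // Hypothetical stand-in for a durable store. In-memory state does not
    // survive a restart, so a real app would use a table or database here.
    private static readonly ConcurrentDictionary<string, bool> Sent = new();

    [Function(nameof(SendConfirmationActivity))]
    public static void Run([ActivityTrigger] string orderId)
    {
        // TryAdd returns false when this order ID was already recorded, so a
        // duplicate execution becomes a no-op instead of a second email.
        if (!Sent.TryAdd(orderId, true))
            return;

        // Placeholder for the real email or notification call.
        Console.WriteLine($"Confirmation sent for order {orderId}");
    }
}
```

Recording the marker before sending trades a possible duplicate for a possible missed email if the worker dies between the two lines; which side to err on depends on the step.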

Something has to start the workflow. That's the client, here an HTTP-triggered function that schedules a new instance and returns a status response the caller can poll.

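using System.Net;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public static class OrderClient
{
    [Function(nameof(StartOrder))]
    public static async Task<HttpResponseData> StartOrder(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client)
    {
        // ReadFromJsonAsync returns null for a JSON "null" body. Reject it here
        // so the orchestrator's GetInput<OrderRequest>()! never sees a null input.
        var order = await req.ReadFromJsonAsync<OrderRequest>();
        if (order is null)
            return req.CreateResponse(HttpStatusCode.BadRequest);

        // Schedule the orchestration; this returns as soon as the start
        // message is queued, not when the workflow finishes.
        string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
            nameof(OrderOrchestrator), order);

        return client.CreateCheckStatusResponse(req, instanceId);
    }
}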

ScheduleNewOrchestrationInstanceAsync starts the orchestrator with the deserialized order as input and returns the new instance's ID. CreateCheckStatusResponse then builds an HttpResponseData that's an HTTP 202 (Accepted) carrying a set of management URLs (status, terminate, and so on) keyed to that instance ID. The workflow runs asynchronously; the HTTP caller gets an immediate 202 and uses the status URL to find out when the order is done. The orchestrator never starts itself, and the client never contains workflow logic; each role stays in its lane.
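The built-in status URL is usually enough, but the same client can back a status endpoint of your own. The sketch below assumes the GetInstanceAsync method and OrchestrationMetadata type exposed by Microsoft.DurableTask.Client. The OrderStatus class, the orders/{instanceId} route, and the response shape are all made up for illustration.

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public static class OrderStatus
{
    [Function(nameof(GetOrderStatus))]
    public static async Task<HttpResponseData> GetOrderStatus(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "orders/{instanceId}")]
        HttpRequestData req,
        [DurableClient] DurableTaskClient client,
        string instanceId)
    {
        // Returns null when no instance with this ID exists in the task hub.
        OrchestrationMetadata? instance =
            await client.GetInstanceAsync(instanceId, getInputsAndOutputs: true);

        if (instance is null)
            return req.CreateResponse(HttpStatusCode.NotFound);

        var response = req.CreateResponse(HttpStatusCode.OK);
        await response.WriteAsJsonAsync(new
        {
            status = instance.RuntimeStatus.ToString(),
            // Raw JSON of the orchestrator's return value, or null while
            // the workflow is still running.
            output = instance.SerializedOutput
        });
        return response;
    }
}
```

This is still a client function: it reads workflow state but contains no workflow logic.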

One pin before you copy this into a project: the Durable extension is Microsoft.Azure.Functions.Worker.Extensions.DurableTask version 1.16.5 (the 1.x line, even on .NET 10), and TaskOrchestrationContext lives in Microsoft.DurableTask while DurableTaskClient lives in Microsoft.DurableTask.Client. These are the isolated-worker types; the in-process model used different names (IDurableOrchestrationContext, a different client), and mixing the two is the most common reason a copied snippet won't compile.

Replay mechanics

The orchestrator code reads like it runs once, top to bottom. It doesn't. To understand every rule that follows, you have to start from the one mechanic that drives them all: the orchestrator function body runs many times over the life of a single workflow instance.

Durable Functions doesn't snapshot the orchestrator's current state and resume it. It uses event sourcing. Every action the orchestrator takes (an activity scheduled, an activity completed, a timer created, a result returned) is appended to an append-only log in the History table, which lives in the task hub from the previous section. That log, not the in-memory state of the function, is the source of truth for where the workflow is.

Here is what actually happens when the order orchestrator runs. The first time, it executes from the top, hits await context.CallActivityAsync<bool>(nameof(ValidateOrderActivity), order), and yields. The dispatcher commits "ValidateOrderActivity scheduled" to the History table and unloads the orchestrator from memory entirely. There is now no thread, no stack, nothing in RAM waiting; the workflow exists only as rows in storage. When the validation activity finishes, its result is written to history and the orchestrator is woken back up.

On that wake-up, the orchestrator runs again from the very first line. It reaches the same CallActivityAsync call, but this time the framework checks the History table, sees that ValidateOrderActivity already completed, and replays the result from history instead of re-running the activity. The activity does not execute a second time; the recorded true (or false) is read straight out of storage and handed back, the validated local gets the value it had on the first run, and execution fast-forwards to the first step that hasn't completed yet, CreateOrderActivity. That step is now scheduled, the orchestrator yields again, and the cycle repeats until SendConfirmationActivity returns and the orchestrator runs to completion.

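This is precisely what lets a workflow survive a crash. If the host dies after creating the order but before sending the confirmation, the History table still holds "order created" with its result. When a new worker picks the instance up, it replays from the top, fast-forwards past validation and creation using the recorded results, and resumes at exactly the confirmation step. No step whose result is already in the history runs again as part of recovery, because recovery is just replay against the same history. The one gap is an activity that finished its work but whose completion had not yet been recorded when the host died: that activity is scheduled again, which is the at-least-once guarantee from earlier and the reason activities should be idempotent.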
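To make the pass-by-pass behaviour concrete, here is a toy model of replay in plain C# with no Durable Functions packages. It is a sketch, not how the runtime is implemented: ToyContext, YieldException, and ToyReplay are invented names, the "history" is an in-memory list, and an exception stands in for the orchestrator being unloaded at an await.

```csharp
using System;
using System.Collections.Generic;

// Thrown to end a pass once a new activity result has been recorded.
public sealed class YieldException : Exception { }

// Toy stand-in for the orchestration context: read from history if the
// call already completed, otherwise run the activity, record it, and yield.
public sealed class ToyContext
{
    private readonly List<string> _history; // recorded results, in call order
    private int _next;                      // next history entry this pass reads

    public ToyContext(List<string> history) => _history = history;

    public string CallActivity(Func<string> activity)
    {
        if (_next < _history.Count)
            return _history[_next++];       // already completed: replay the result

        _history.Add(activity());           // first time: run it and record the result
        throw new YieldException();         // unload; the next pass starts from the top
    }
}

public static class ToyReplay
{
    public static int ActivityRuns;         // times activity code actually executed

    private static string Activity(string result)
    {
        ActivityRuns++;
        return result;
    }

    // Same shape as the order chain: validate, create, confirm.
    private static string Orchestrator(ToyContext ctx)
    {
        var validated = ctx.CallActivity(() => Activity("true"));
        if (validated != "true") return "Order validation failed";
        var orderId = ctx.CallActivity(() => Activity("order-1"));
        ctx.CallActivity(() => Activity("sent"));
        return orderId;
    }

    public static (string Result, int Passes) Run()
    {
        ActivityRuns = 0;
        var history = new List<string>();   // outlives every pass, like storage
        for (var passes = 1; ; passes++)
        {
            try { return (Orchestrator(new ToyContext(history)), passes); }
            catch (YieldException) { /* unloaded; the loop replays from line one */ }
        }
    }

    public static void Main()
    {
        var (result, passes) = Run();
        Console.WriteLine($"{result} after {passes} passes, {ActivityRuns} activity runs");
        // prints "order-1 after 4 passes, 3 activity runs"
    }
}
```

The orchestrator body runs four times, but each activity runs once: every later pass gets the earlier results from the list instead of calling the activity again.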

One practical consequence shows up the first time you add a log line to an orchestrator. It fires on every replay, so a single workflow can emit the same log message several times. The context exposes context.IsReplaying so you can suppress noise from replayed execution.

[Function(nameof(OrderOrchestrator))]
public static async Task<string> RunOrchestrator(
    [OrchestrationTrigger] TaskOrchestrationContext context,
    ILogger logger)
{
    var order = context.GetInput<OrderRequest>()!;

    if (!context.IsReplaying)
        logger.LogInformation("Starting order for customer {CustomerId}", order.CustomerId);

    var validated = await context.CallActivityAsync<bool>(
        nameof(ValidateOrderActivity), order);
    // ... rest of the chain
}

The if (!context.IsReplaying) guard means the "Starting order" line is written once, on the genuine first pass, and skipped on every replay. Without it the line would appear once for every time the orchestrator is dispatched, roughly once per activity in the chain.
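If guarding every log call by hand gets tedious, the context can also hand out a logger with the guard built in. This sketch uses CreateReplaySafeLogger on TaskOrchestrationContext and reuses the OrderRequest and ValidateOrderActivity names from this article. OrderOrchestratorWithLogging and the second log message are made up for the example.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.Extensions.Logging;

public static class OrderOrchestratorWithLogging
{
    [Function(nameof(OrderOrchestratorWithLogging))]
    public static async Task<bool> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // This logger only writes when the orchestrator is not replaying,
        // so no explicit IsReplaying check is needed around each call.
        ILogger logger = context.CreateReplaySafeLogger(
            nameof(OrderOrchestratorWithLogging));

        var order = context.GetInput<OrderRequest>()!;
        logger.LogInformation(
            "Starting order for customer {CustomerId}", order.CustomerId);

        var validated = await context.CallActivityAsync<bool>(
            nameof(ValidateOrderActivity), order);

        // Written once, on the pass where validation first completes.
        logger.LogInformation("Validation result: {Validated}", validated);
        return validated;
    }
}
```

The explicit IsReplaying check is still useful for anything other than logging that should happen only on a live pass.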

Determinism rules

Replay is also why the orchestrator is the one role with code restrictions. If the body re-executes from the top every time, then any line that produces a different value on the second run than it did on the first will make the workflow take a different path during replay than it took originally, and the state reconstructed from history no longer matches the code's decisions. So the rule is blunt: orchestrator code must be deterministic. The same inputs and the same history must always produce the same sequence of calls.

These restrictions apply only to orchestrators. Activities can do anything; that's the point of them. It's the orchestrator, and only the orchestrator, that has to behave identically on every pass.

Here is the trap, written the way it usually gets written:

// BROKEN inside an orchestrator: re-evaluates on every replay.
[Function(nameof(OrderOrchestrator))]
public static async Task<string> RunOrchestrator(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var order = context.GetInput<OrderRequest>()!;

    var receivedAt = DateTime.UtcNow;       // different value every replay
    var traceId = Guid.NewGuid();           // a new GUID every replay

    var validated = await context.CallActivityAsync<bool>(
        nameof(ValidateOrderActivity), order);
    // ...
}

Both DateTime.UtcNow and Guid.NewGuid() look harmless. The problem is that the first run produces one timestamp and one GUID, then every replay computes fresh ones. If receivedAt or traceId ever feeds a branch, a comparison, or an activity input, the replayed run disagrees with what the original run did. The runtime may detect the mismatch and fail the instance with a non-determinism error, or the workflow may quietly take a different path or pass a different value than it did the first time. The failure is intermittent (replay only happens after a yield, and only some values flow into decisions), which is exactly what makes it hard to catch in testing and ugly in production.

The fix is to take time and identity from the context, which returns replay-stable values:

// CORRECT: context helpers return the same value on every replay.
var receivedAt = context.CurrentUtcDateTime;   // same instant every replay
var traceId = context.NewGuid();               // same GUID every replay

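context.CurrentUtcDateTime takes its value from the orchestration's recorded history rather than from the machine clock, so a given line sees the same instant on the first execution and on every replay. context.NewGuid() is replay-safe too: it generates the GUID deterministically instead of randomly, so the same call yields the same value on every pass. (One naming gotcha: the property is CurrentUtcDateTime. Some docs prose mis-spells it CurrentDateTimeUtc, which does not exist and will not compile.)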

The same reasoning rules out a few more things. Don't call new Random() in an orchestrator; if you need randomness, return it from an activity, where the result is saved to history and replayed like any other activity output. Don't read environment variables or configuration directly, since those can change between the first run and a replay hours later; pass config in as orchestrator input or fetch it from an activity. And don't do real I/O (database, file, HTTP) in the orchestrator: it would fire again on every replay, and a network call is never replay-stable anyway. Push all of it into activities.
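As a sketch of the "fetch it from an activity" advice: the activity below reads a setting, and the orchestrator only ever sees the recorded result. GetMaxQuantityActivity, QuantityCheckOrchestrator, the MAX_ORDER_QUANTITY setting name, and the fallback of 100 are all hypothetical. OrderRequest is the record from earlier in this article.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class GetMaxQuantityActivity
{
    [Function(nameof(GetMaxQuantityActivity))]
    public static int Run([ActivityTrigger] string customerId)
    {
        // Reading configuration is fine in an activity: the returned value is
        // saved to history, so replays see what the first execution saw.
        // "MAX_ORDER_QUANTITY" is a hypothetical setting name.
        var raw = Environment.GetEnvironmentVariable("MAX_ORDER_QUANTITY");
        return int.TryParse(raw, out var max) ? max : 100;
    }
}

public static class QuantityCheckOrchestrator
{
    [Function(nameof(QuantityCheckOrchestrator))]
    public static async Task<bool> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var order = context.GetInput<OrderRequest>()!;

        // The branch below depends only on the input and a recorded activity
        // result, so it makes the same decision on every replay even if the
        // setting changes while the workflow is in flight.
        var max = await context.CallActivityAsync<int>(
            nameof(GetMaxQuantityActivity), order.CustomerId);

        return order.Quantity <= max;
    }
}
```

The same shape works for randomness: generate the value in an activity, return it, and let the orchestrator branch on the recorded result.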

Delays have their own replay-safe form. Thread.Sleep blocks the orchestrator's thread, and Task.Delay awaits a task the runtime knows nothing about; either way the wait is not recorded in history, so it starts over on every replay. The durable equivalent is context.CreateTimer, which records the wake-up time in history and releases the worker entirely until then, so a workflow can wait minutes or days without holding any resources.
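A durable delay might look like the following sketch. DelayedConfirmationOrchestrator and the 24-hour wait are invented for the example, and SendConfirmationActivity is the activity name used by the chain earlier in this article.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class DelayedConfirmationOrchestrator
{
    [Function(nameof(DelayedConfirmationOrchestrator))]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var orderId = context.GetInput<string>()!;

        // Compute the deadline from the replay-stable clock, never from
        // DateTime.UtcNow, so every replay arrives at the same fire time.
        DateTime fireAt = context.CurrentUtcDateTime.AddHours(24);

        // Durable wait: the timer is recorded in history and the orchestrator
        // is unloaded until it fires. Nothing sits in memory in the meantime.
        await context.CreateTimer(fireAt, CancellationToken.None);

        await context.CallActivityAsync(
            nameof(SendConfirmationActivity), orderId);
    }
}
```

On the replay that follows the timer firing, the await completes immediately from history and execution moves straight on to the activity call.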

People trip on this rule for a fair reason: the broken code compiles, passes a quick local test, and looks like ordinary C#. The orchestrator only betrays you under replay, and replay only happens after a yield on a worker that may not be the one that started the run. The mental shortcut that keeps you safe is to read every line of an orchestrator and ask whether it would return the same value if the method ran again right now. If the answer is no, it belongs in an activity or behind a context helper.

When to use Durable Functions vs queues

Durable Functions is not the default answer to "my functions need to talk to each other." A plain storage queue with a table for state is cheaper, simpler to provision, and entirely enough for a large class of problems. If the work is a single hand-off (one function drops a message, another picks it up, does its job, and that's the end of it) a queue is the right tool. Keep those functions stateless and idempotent, carry whatever state you need on the message itself, and you never have to think about replay rules or determinism.

The line to watch for is stateful coordination across steps. The moment you need the output of one step to drive the next, retries that don't redo work that already succeeded, results from parallel branches aggregated back together, a workflow that waits on an external event or a human approval, durable delays measured in hours or days, or a status endpoint that answers "where is order 4815 right now," a bare queue stops being enough. You can build all of that on queues and tables, but you'll be hand-rolling correlation IDs, a status table, poison-message handling, and timeout plumbing across several queues. That hand-rolled state machine is exactly the thing an orchestrator replaces, and it replaces it with code that reads like the workflow it implements.

So the honest recommendation: reach for a queue first. Don't pull in Durable Functions for a single fire-and-forget hand-off; the orchestrator's constraints and the task hub's storage footprint are real overhead that buys you nothing there. The signal to switch is the second or third piece of coordination bookkeeping you find yourself writing by hand. When you're maintaining a correlation ID and a status table and retry logic just to keep a multi-step process straight, you've already built a worse version of what Durable Functions gives you, and that's when the orchestration earns its complexity.

What's next

Chaining is the first of several orchestration patterns, and it's deliberately the simplest: a straight line of activities. The same replay engine powers fan-out/fan-in (running activities in parallel and aggregating their results), waiting on external events for human-in-the-loop approval, and durable entities for stateful objects. Each is a later part of this series, and each rests on the single fact this article was built around: the orchestrator is C# that replays.

Which coordination problem pushed you past a plain queue first: a multi-step sequence that needed to survive restarts, fan-out with result aggregation, or waiting on an external event or approval?