Durable workflows with Step Functions and Lambda, The Cloud Ledger

I used to orchestrate multi-step jobs with a Lambda that called another Lambda that called another Lambda, with retry logic hand-rolled in each one. It worked until step three failed at 2 a.m., left a half-written record, and I had no idea which steps had already run. Rebuilding idempotency and resumability by hand is a losing game.

Step Functions exists to make that someone else's problem. It is a managed state machine that remembers exactly where your workflow is, retries with backoff, and survives the process dying mid-flight. Lambda does the work; Step Functions remembers the work.

Standard vs. Express: pick the right engine

The first decision is workflow type, and it changes everything about cost and behavior:

Feature	Standard	Express
Max duration	1 year	5 minutes
Pricing	per state transition (~$25 per million)	per request + duration
Execution semantics	exactly-once	at-least-once (async); at-most-once (sync)
Best for	long, auditable, human-in-loop	high-volume, short, streaming

For an order-fulfillment flow that waits on a warehouse callback, Standard is correct: it is billed per transition, so it can sit idle for hours at near-zero cost, and it is fully auditable. For a per-event enrichment pipeline doing millions of runs a day, Express is dramatically cheaper, but you must make your steps idempotent because they can run more than once.
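To put rough numbers on that trade-off, here is a small sketch. The Standard rate is the ~$25 per million transitions figure from the table; the Express rates (`express_per_million_requests`, `express_per_gb_second`) and the workload figures are illustrative assumptions, so check them against the current pricing page before relying on the result.

```python
def standard_cost(executions, transitions_per_execution, per_million=25.0):
    """Standard bills per state transition (~$25 per million, per the table)."""
    return executions * transitions_per_execution * per_million / 1_000_000


def express_cost(executions, avg_seconds, memory_mb,
                 express_per_million_requests=1.00,   # assumed rate
                 express_per_gb_second=0.00001667):   # assumed rate
    """Express bills per request plus duration weighted by memory used."""
    request_cost = executions * express_per_million_requests / 1_000_000
    gb_seconds = executions * avg_seconds * (memory_mb / 1024)
    return request_cost + gb_seconds * express_per_gb_second


if __name__ == "__main__":
    runs_per_day = 5_000_000  # hypothetical enrichment pipeline
    print(f"Standard: ${standard_cost(runs_per_day, 6):,.2f}/day")
    print(f"Express:  ${express_cost(runs_per_day, 0.2, 64):,.2f}/day")
```

Under these assumed inputs (six transitions per run, 200 ms and 64 MB per Express run), Standard comes to $750 a day and Express to about $6. The gap is the point, not the exact figures.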

Let the state machine own retries and errors

The biggest win is deleting your hand-rolled retry code. Define Retry and Catch on each task and let the engine handle backoff and routing to a failure state.

{
  "Comment": "Order processing",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "NotifyFailure"
      }],
      "Next": "ChargePayment"
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge",
      "Next": "Fulfill"
    },
    "Fulfill":       { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill", "End": true },
    "NotifyFailure": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify", "End": true }
  }
}
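To see what the Retry block on ValidateOrder actually does, this sketch computes the wait before each retry from IntervalSeconds, MaxAttempts, and BackoffRate. It assumes plain exponential backoff with no jitter or delay cap configured; `retry_delays` is a hypothetical helper, not part of any SDK.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: interval * backoff_rate ** n,
    where n is 0 for the first retry, 1 for the second, and so on."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


if __name__ == "__main__":
    delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
    print(delays)       # [2.0, 4.0, 8.0]
    print(sum(delays))  # 14.0
```

So with the values above, a persistently failing ValidateOrder spends about 14 seconds in backoff before the Catch routes the execution to NotifyFailure.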

Waiting without burning compute

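The pattern that changed how I build these is waitForTaskToken. A task can pause, for as long as the one-year Standard limit allows, until an external system hands the token back by calling SendTaskSuccess or SendTaskFailure, all without a Lambda sitting there billing you for idle time. Set TimeoutSeconds (or a heartbeat) on the state so a lost callback fails the task instead of stalling the execution. Use it for human approvals, third-party webhooks, or any "kick off work, come back later" step.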

The mental shift: stop writing code that waits. Emit a token, let the workflow sleep for free, and resume when the callback arrives. You pay for state transitions, not for waiting.
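A minimal sketch of both halves of the callback pattern follows. The state is written as a Python dict so it can be dumped to Amazon States Language JSON. The function name `request-approval`, the one-day timeout, and the `handle_callback` helper are hypothetical. The client is passed in, so the code runs without AWS credentials; in a real handler it would typically be `boto3.client("stepfunctions")`.

```python
import json

# Task state that pauses until the token is returned (hypothetical names).
WAIT_FOR_APPROVAL = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "request-approval",
        "Payload": {
            "token.$": "$$.Task.Token",  # token the callback must echo back
            "order.$": "$",              # the state's input, passed through
        },
    },
    "TimeoutSeconds": 86400,  # fail the task if nobody answers within a day
    "Next": "ChargePayment",
}


def handle_callback(sfn, token, approved, detail):
    """Resume the paused execution.

    `sfn` is a Step Functions client, injected so this is easy to test.
    """
    if approved:
        sfn.send_task_success(taskToken=token, output=json.dumps(detail))
    else:
        sfn.send_task_failure(taskToken=token, error="Rejected",
                              cause=json.dumps(detail))


if __name__ == "__main__":
    print(json.dumps(WAIT_FOR_APPROVAL, indent=2))
```

The Lambda named in the state only has to deliver the token somewhere (an email link, a queue message) and return. Whatever receives the approval later calls `handle_callback` with that token.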

Combine this with the .sync integration pattern when you orchestrate other AWS services (a Glue job, an ECS task, a Batch job): Step Functions waits until that job genuinely completes, so you do not have to write a polling loop yourself. Neither waitForTaskToken nor .sync is available in Express workflows. The same idea applies to parallel work: a Map state fans out over a list of items and runs an iteration per item concurrently, with a configurable concurrency limit (MaxConcurrency) so you do not stampede a downstream API. I lean on this for batch jobs that used to be a hand-written loop with manual error aggregation.
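As a sketch of how those two pieces might fit together, the function below builds a state machine that runs a Glue job with .sync and then fans out over `$.items` with a concurrency cap. The job name `nightly-export`, the `process-item` function, and the `$.items` input shape are all assumptions for illustration.

```python
import json


def build_batch_workflow(max_concurrency=5):
    """Return an ASL definition as a dict: Glue job (.sync), then a Map fan-out."""
    return {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # .sync: the state completes only when the job run finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "nightly-export"},
                # Keep the original input (including $.items) alongside the result.
                "ResultPath": "$.glue",
                "Next": "ProcessItems",
            },
            "ProcessItems": {
                "Type": "Map",
                "ItemsPath": "$.items",
                "MaxConcurrency": max_concurrency,  # cap on parallel iterations
                "ItemProcessor": {
                    "StartAt": "ProcessOne",
                    "States": {
                        "ProcessOne": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_batch_workflow(max_concurrency=3), indent=2))
```

Keeping MaxConcurrency as a parameter makes it easy to tune the fan-out to whatever the downstream API tolerates.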

Make idempotency a first-class concern

Retries and at-least-once Express semantics both mean a step can run twice. Design every side-effecting task to be safe to repeat: pass an idempotency key (the execution ID works well), use conditional writes in DynamoDB, and check whether an external charge already exists before creating it. A payment step that double-charges on retry is worse than one that crashes. The good news is that Step Functions gives you the execution history for free (built in for Standard; Express relies on CloudWatch Logs), so when something does go wrong you can see exactly which state failed, with what input, and how many times it retried. No more guessing which of four Lambdas ate the request.
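Here is a minimal sketch of the idempotency-key idea. The in-memory `InMemoryLedger` is a stand-in for a DynamoDB table written with a "only if the key does not exist" condition. Deriving the key from the execution ID plus the state name is one possible choice, and `charge_once` and `charge_fn` are hypothetical names.

```python
import hashlib


class InMemoryLedger:
    """Stand-in for a table that supports 'write only if the key is absent'."""

    def __init__(self):
        self._rows = {}

    def put_if_absent(self, key, value):
        if key in self._rows:
            return False  # a conditional write would be rejected here
        self._rows[key] = value
        return True

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]


def idempotency_key(execution_id, state_name):
    """Same execution and state always produce the same key."""
    return hashlib.sha256(f"{execution_id}#{state_name}".encode()).hexdigest()


def charge_once(ledger, execution_id, amount_cents, charge_fn):
    """Charge at most once per execution, however often the task is retried."""
    key = idempotency_key(execution_id, "ChargePayment")
    if not ledger.put_if_absent(key, {"status": "IN_PROGRESS"}):
        return ledger.get(key)  # an earlier attempt already claimed this charge
    receipt = charge_fn(amount_cents, key)  # pass the key on to the provider too
    record = {"status": "DONE", "receipt": receipt}
    ledger.put(key, record)
    return record


if __name__ == "__main__":
    ledger = InMemoryLedger()
    charges = []
    fake_charge = lambda cents, key: charges.append(cents) or f"rcpt-{len(charges)}"
    charge_once(ledger, "exec-123", 4999, fake_charge)
    charge_once(ledger, "exec-123", 4999, fake_charge)  # simulated retry
    print(len(charges))  # 1
```

A retry that arrives while the first attempt is still IN_PROGRESS gets that marker back rather than a receipt. A real implementation would need to decide how to handle that case, for example by failing the task so the Retry policy tries again later.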

Takeaways

Choose Standard for long, auditable, exactly-once workflows; Express for high-volume short runs you can make idempotent.
Move retry/backoff and error routing into Retry/Catch and delete the hand-rolled versions.
Use waitForTaskToken and .sync integrations to wait on external systems without paying for idle compute.
Make every side-effecting step idempotent with an idempotency key and conditional writes.
