Pausable AWS Step Functions

Roman Marakulin
6 min read · Mar 10, 2024


Some time ago, while working with the AWS Step Functions service, I learned about a lovely feature: pausing a step function execution when it depends on another service or process.

This article is for people who already have a high-level knowledge of AWS Step Functions but want to be aware of its lesser-known features.

It’s my second article about AWS Step Functions. Take a look at AWS Step Functions: Input and Output Processing, which turned out to be quite popular.

TLDR: The full code of the implementation can be found in the aws-pausable-step-function git repo.

Common Step Function workflow

The AWS Step Functions service organizes a workflow: tasks to execute, with dependencies between them. It integrates smoothly with a myriad of other AWS services and acts as the glue that connects them to process data. Most often, Step Functions is used to orchestrate offline data processing. If you are familiar with Airflow, Step Functions can be treated as an alternative.

AWS Step Functions has a low barrier to entry, so you can set up an initial graph in a couple of hours without any prior knowledge. The service provides a friendly UI (in the AWS console) where you can simply drag and drop tasks (steps) and draw connections between them.

A typical execution graph (AWS Step Functions)

A common Step Function workflow consists of:

  1. AWS Glue (Spark) jobs
  2. AWS Lambda functions
  3. Sending messages through Amazon SQS
  4. Additional service steps that organize the data flow (such as conditional branching, lambda function parallelization, failure handling, and step function nesting, i.e. executing another step function as a task of the main one)

One typical example is shown in the image above. Every step is a call to an AWS service that performs computations to process an incoming order.

Now picture (and it happens more often than you might think) that part of the data processing is external: we call an external service that executes a long-running job, and we want to continue the workflow once the job finishes. A real-world example: imagine we work at a company that advertises items on Meta, and we want to update a product catalog to reflect changes on product cards in advertising. To trigger the update we send a ProductCatalogBatch request, and then we should wait until the execution for the given request_id finishes (the products are updated), which we check via the CheckBatchRequestStatus API.

The first thought that may come to mind is to have 2 step functions: one to preprocess products and initiate the update, and a second that contains the postprocessing after we receive the product catalog completion signal. But from a workflow execution point of view, this splits the whole execution into 2 parts, which complicates maintenance and observability: the process lives in 2 step functions, and we have to navigate to both of them to see the full execution.

There is a better alternative that is supported natively by AWS Step Functions: a feature to pause a step function execution and resume it based on a completion event.

Note: I mentioned that the external job is long-running on purpose. As of now, AWS Lambda functions support up to 15 minutes of execution time. Consequently, if we are confident that the external process is short, we can get by with a lambda function alone, without using this feature.

Pausing a Step Function execution

There is official documentation (Wait for a Callback with the Task Token) that explains the concept with an example of implementing it through the UI. Still, it lacks a specific example of implementing it with code.

AWS Step Functions solves the stated problem of interacting with an external service by supporting several service integration patterns. We are interested in the waitForTaskToken pattern.

In a nutshell, the pause/resume actions happen through a token that is passed back and forth:

  1. A task that suspends a step function execution (for example, a lambda function invocation) specifies integrationPattern as WAIT_FOR_TASK_TOKEN and includes a taskToken field in its payload (AWS Step Functions has to know which step function execution to resume, in case there are several executions)
  2. A lambda function that resumes the step function should call send_task_success or send_task_failure with the token to return control to the step function and resume the execution

In code, it would look like:

Pausing execution:

// AWS CDK, TypeScript
new tasks.LambdaInvoke(this, stepName, {
  lambdaFunction: callAsyncServiceLambda,
  integrationPattern: sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
  payload: sfn.TaskInput.fromObject({
    taskToken: sfn.JsonPath.taskToken,
  }),
});
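For reference, the CDK construct above compiles into Amazon States Language roughly like the following sketch (the state name, function name, and payload field are placeholders; the resource ARN and the $$.Task.Token reference follow the Step Functions service integration documentation):

```json
{
  "Call async service": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "call-async-service",
      "Payload": {
        "taskToken.$": "$$.Task.Token"
      }
    },
    "End": true
  }
}
```

Note the `.waitForTaskToken` suffix on the resource ARN: that is what tells the service to pause the execution at this state until a callback arrives.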

Callback:

# Assets, Python
import boto3

def handler(event: dict, context: dict):
    # is_succeeded, token, message are defined elsewhere
    sfn_client = boto3.client("stepfunctions")
    if is_succeeded:
        sfn_client.send_task_success(taskToken=token, output=message)
    else:
        sfn_client.send_task_failure(taskToken=token, error="Error message")

A minimal example (with code)

I favor building infrastructure as code with AWS CDK, for maintainability and reproducibility over time. The example is written with:

  1. TypeScript, for the infrastructure
  2. Python, to implement the lambda functions

A diagram is worth a thousand words, so here is the schema of interactions between the services:

Step function pausing diagram

The Step Function itself in AWS Console:

A pausable Step Function minimal example

The full code can be found in the aws-pausable-step-function git repo.

  1. We use dummy Pass steps to emulate work (transformations) before/after calling an external job
  2. Then we trigger a task that pauses the step function execution (WAIT_FOR_TASK_TOKEN). From this point our step function is suspended, and the AWS Step Functions service expects a send_task_{success/failure} call to resume the execution
  3. To simplify the example, the long-running job simulation happens inside the lambda function (_execute_long_running_job), whereas in reality _execute_long_running_job should, for example, make a POST request to an external service to start a job, and the lambda function should finish right away
  4. To continue the step function execution when the long-running job is done, an SNS message with the token triggers the resume_sfn.py lambda function, which sends the signal (send_task_success/send_task_failure) back to AWS Step Functions
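The pausing side of steps 2–3 can be sketched as follows. This is a hypothetical sketch, not the repo's actual code: the names start_external_job and the publish hook are illustrative, and the publisher is injected so the logic can run without AWS credentials (in AWS it would be an SNS publish call).

```python
import json

def start_external_job(payload: dict) -> str:
    # Stand-in for the real call (e.g. a POST request) that starts
    # the external job and returns its id.
    return "job-42"

def handler(event: dict, context=None, publish=None) -> dict:
    # The token injected by Step Functions (sfn.JsonPath.taskToken)
    # arrives as a regular field of the lambda event payload.
    token = event["taskToken"]
    job_id = start_external_job(event)
    # Hand the token over so the callback side can later resume the
    # execution; here we publish it together with the job id.
    if publish is not None:
        publish(json.dumps({"taskToken": token, "jobId": job_id}))
    # Return immediately: the step stays paused until
    # send_task_success/send_task_failure is called with this token.
    return {"jobId": job_id}
```

The key point is that the handler does not wait for the job: it records where the token went and exits, while the step function state remains paused.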

To let you play with the code, I implemented 2 ways of resuming the SFN, configurable through the step function payload (README):

  1. The external service automatically sends a notification to resume the execution with the token. Step Function input payload: {"IsManualCallback": "False"}
  2. We play the role of the external service and resume the step function execution manually, by sending an SNS message or by calling the ResumeStepFunctionLambda function with the token. Step Function input payload in this case: {"IsManualCallback": "True"}. This use case has a real application when manual intervention is required: for example, when the dataset we are preparing needs a manual evaluation as part of the process.
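For both paths, the resuming lambda has to unpack the SNS envelope before it can call back into Step Functions. A hypothetical sketch of what resume_sfn.py might look like (the message fields taskToken and isSucceeded are illustrative, and the client is injectable for local testing; the SNS Records envelope itself is the standard shape of SNS-triggered lambda events):

```python
import json

def handler(event: dict, context=None, sfn_client=None):
    if sfn_client is None:
        import boto3
        sfn_client = boto3.client("stepfunctions")
    # SNS delivers the published message wrapped in a Records envelope.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    token = message["taskToken"]
    if message.get("isSucceeded", True):
        # Resumes the paused step; `output` becomes the step's result.
        sfn_client.send_task_success(taskToken=token, output=json.dumps(message))
    else:
        sfn_client.send_task_failure(taskToken=token, error="ExternalJobFailed")
```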

Productionalization

The minimal example introduced here only lifts the veil a little on how to use this integration pattern. Applying the pattern to a specific use case requires additional effort:

  1. In a real-world scenario, we most likely don't have the luxury of propagating the token through the external service and have to rely on a job_id that it returns. As a result, we have to store the mapping between job_id and token somewhere, for example in Amazon DynamoDB. Then, when the external job finishes and our resume_sfn lambda function receives a callback with the job_id, it translates the job_id back into the token.
  2. Additional effort hides in sending the SNS message that triggers the resume_sfn function. Suppose we cannot set up the external service to send an SNS message (for example, through an email event, or by giving it an s3 path to write to, which would create a triggering event). In that case, we will have to set up a lambda function that monitors (polls from time to time) the status of the external job_id.
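The job_id-to-token mapping from point 1 can be sketched like this, assuming a DynamoDB table with job_id as the partition key (the table layout and attribute names are hypothetical; the table handle is injected so the logic can be tested without AWS):

```python
import time

def save_token(table, job_id: str, token: str, ttl_hours: int = 24) -> None:
    # Stored when the external job is started; the ttl attribute lets
    # DynamoDB garbage-collect mappings for jobs that never call back.
    table.put_item(Item={
        "job_id": job_id,
        "task_token": token,
        "ttl": int(time.time()) + ttl_hours * 3600,
    })

def token_for_job(table, job_id: str) -> str:
    # Called by resume_sfn when the external callback arrives with a job_id.
    item = table.get_item(Key={"job_id": job_id}).get("Item")
    if item is None:
        raise KeyError(f"no token stored for job {job_id}")
    return item["task_token"]
```

With a real boto3 DynamoDB Table resource the same put_item/get_item calls apply unchanged.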

Summary

Despite the mentioned difficulties and its seeming simplicity, the WAIT_FOR_TASK_TOKEN integration pattern is mighty. It smoothly glues an external job execution into the workflow and represents it as a single step of a step function. With this representation, the AWS Step Functions UI shows not only the overall execution time but also the per-step breakdown, including exactly how long the external task takes.

It wouldn’t be amiss to mention that the shape of the payload carrying the token is up to us: the token can live under any field name, not necessarily taskToken. The value itself, however, must be the token generated by Step Functions (sfn.JsonPath.taskToken), since send_task_success/send_task_failure accept only that exact value.

Happy coding!
