How to set up an alarm for an AWS Glue job failure

Roman Marakulin
5 min read · Nov 29, 2023

I was inspired to write this article by the large number of questions and discussions on this topic at my company and on the Internet.

There are a few approaches to adding such an alarm (some of which you can also find on Medium), but I believe this one is the simplest.

(Follow me to read similar articles about IT, AWS, and other services and technologies.)

The problem statement

Let’s suppose that we are building native AWS infrastructure (using services provided purely by AWS). Therefore, for computations on big data, we use AWS Glue.

A typical AWS Glue job is a script (written in Apache Spark) that runs an offline ETL (extract, transform, load) pipeline with a certain frequency, for example, daily.

When we schedule the AWS Glue job (for example, as a part of an AWS Step Functions workflow with several steps and branches), we would like to know when the job fails, receive a notification, and act proactively.

In other words, we have to set up an alarm for an AWS Glue job execution failure.

The AWS Glue service sends a certain set of metrics to AWS CloudWatch that can be used to set up an alarm. Just to name a few (AWS Glue CloudWatch metrics):

glue.driver.aggregate.bytesRead
glue.driver.aggregate.numCompletedTasks
glue.driver.aggregate.numFailedTasks
glue.driver.aggregate.numKilledTasks
...

Someone who is not very familiar with Spark could be tempted to use the numFailedTasks metric and set up an alarm on it for this purpose (I’ve seen quite a few discussions around this topic). This is wrong, and in the next section I will describe why.

In reality, AWS Glue as of now doesn’t send the desired numFailedJobs metric to AWS CloudWatch (a request for this feature is already several years old). Thus, many blog posts have been written to overcome this limitation.

Why numFailedTasks metric doesn’t fit

A single Spark script consists of jobs, which are spawned by every Spark action it encounters (save(), collect(), count(), etc.).

Each job, in turn, is divided into smaller sets of tasks called stages, and finally the ‘atom’ of script execution is a task: a unit of execution. Each task maps to a single core and works on a single partition of data.

From Learning Spark

Tasks are executed on workers (worker nodes), and the driver manages the execution.

Having this in mind, you can clearly see that the numFailedTasks metric, which represents the number of failed Spark tasks, cannot be used to monitor job failures:

  1. Failed tasks can be retried, and the number of retries is configurable. Thus, even if numFailedTasks > 0, the job itself can finish successfully
  2. If there is an error on the driver side, the job fails, but numFailedTasks will be 0 (as there are 0 running tasks, so none of them can fail)

Implementing numFailedJobs alarm

Diagram of the solution

Each AWS service that generates events sends them to AWS EventBridge. In the case of AWS Glue, among other events, the service sends Glue Job State Change events that reflect a status change: SUCCEEDED, FAILED, TIMEOUT, etc. (reference).
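
For orientation, a failure event roughly looks like the following (shown here as a TypeScript object literal; the values are purely illustrative, and I’ve omitted fields that don’t matter for our purposes):

// Rough shape of a Glue Job State Change event for a failed run (illustrative values)
const exampleFailedJobEvent = {
  'detail-type': 'Glue Job State Change',
  source: 'aws.glue',
  region: 'eu-west-1',
  detail: {
    jobName: 'my-glue-job',   // name of the Glue job
    state: 'FAILED',          // e.g. SUCCEEDED | FAILED | TIMEOUT | STOPPED
    jobRunId: 'jr_...',       // id of the particular job run
    severity: 'ERROR',
    message: 'Something went wrong ...',
  },
};

The rule we create below simply matches on the source, detail-type, and detail.state / detail.jobName fields of such events.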

It is possible to create an AWS EventBridge Rule to capture those events.

A nice thing about EventBridge Rules is that it’s possible to create AWS CloudWatch metrics based on them, without involving AWS Lambda functions. And once we have the CloudWatch metric, it’s straightforward to create a CloudWatch Alarm, for instance to notify a person on duty to take a look.

To sum up, what we should do is:

  1. Create an AWS EventBridge Rule to catch Glue Job State Change events (the detailType field)
  2. Define a CloudWatch metric based on the created Rule
  3. Set up a CloudWatch Alarm for the metric (as usual)

As the whole world is moving towards the infrastructure-as-code paradigm, I would like to give snippets based on the AWS CDK framework (using TypeScript, but it’s easy to adapt them to any other supported programming language).

Assuming that we already have a stack (AWS CDK hello world), let's add the additional imports:

import { aws_cloudwatch as cloudwatch, aws_events as events } from 'aws-cdk-lib';

(Every code snippet below is inside of the Stack).
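
To make it concrete, here is a minimal sketch of the stack these snippets are assumed to live in; the stack name and job name are placeholders of my own, and the imports from above are repeated for completeness:

import { Stack, StackProps, aws_cloudwatch as cloudwatch, aws_events as events } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// A hypothetical stack; the Rule, Metric, and Alarm snippets below go into its constructor
export class GlueJobAlarmStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Hypothetical name of the AWS Glue job we want to monitor
    const jobName = 'my-glue-job';

    // ... the snippets from the following sections go here ...
  }
}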

First, we should create an EventBridge Rule:

// jobName is defined somewhere above. This is our AWS Glue script to monitor
const jobFailedRule = new events.Rule(this, `${jobName} JobFailureRule`, {
  eventPattern: {
    source: ['aws.glue'],
    detailType: ['Glue Job State Change'],
    detail: {
      state: ['FAILED', 'TIMEOUT', 'ERROR'],
      jobName: [jobName],
    },
  },
});

Then, having the Rule, we have to set up a CloudWatch metric:

const numFailedJobsMetric = new cloudwatch.Metric({
  namespace: 'AWS/Events',
  metricName: 'TriggeredRules',
  statistic: 'Sum',
  dimensionsMap: {
    RuleName: jobFailedRule.ruleName,
  },
});

As the last step, we will set up a regular CloudWatch Alarm. Using the returned object, we can attach a desired action to this alarm via the addAlarmAction method, for example, sending an email to a person on duty (a minimal sketch follows the snippet):

const numFailedJobsAlarm = new cloudwatch.Alarm(this, `${jobName} numFailedJobsAlarm`, {
  metric: numFailedJobsMetric,
  threshold: 1,
  evaluationPeriods: 1,
});
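
For example, here is one way to wire the alarm to an SNS topic with an email subscription; the topic name and email address are placeholders of mine, not part of the original setup:

import { aws_cloudwatch_actions as cloudwatch_actions, aws_sns as sns, aws_sns_subscriptions as subscriptions } from 'aws-cdk-lib';

// Hypothetical topic that the person on duty is subscribed to
const onDutyTopic = new sns.Topic(this, 'GlueJobAlarmTopic');
onDutyTopic.addSubscription(new subscriptions.EmailSubscription('on-duty@example.com'));

// Notify the topic whenever the alarm goes into the ALARM state
numFailedJobsAlarm.addAlarmAction(new cloudwatch_actions.SnsAction(onDutyTopic));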

There are a couple of things that we should take into consideration:

  1. With this setup, we had to provide a jobName when configuring the rule; thus, we have to create as many rules as we have Glue jobs (see the sketch after this list for one way to factor this out)
  2. For the detail.state field, I put the array ['FAILED', 'TIMEOUT', 'ERROR'], although you might want to monitor the STOPPED status as well. Setting up an alarm for that status may not be a good idea, though: the STOPPED status is sent when someone manually stops the Glue job (maybe it was done on purpose within maintenance activities)
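
Regarding the first point, one option is to simply loop over the job names and create a rule, metric, and alarm per job. A rough sketch, where jobNames is a hypothetical list defined elsewhere:

// Hypothetical list of the Glue jobs we want to monitor
const jobNames = ['daily-etl-job', 'hourly-ingest-job'];

for (const jobName of jobNames) {
  // One EventBridge Rule per job, matching its failure events
  const rule = new events.Rule(this, `${jobName} JobFailureRule`, {
    eventPattern: {
      source: ['aws.glue'],
      detailType: ['Glue Job State Change'],
      detail: {
        state: ['FAILED', 'TIMEOUT', 'ERROR'],
        jobName: [jobName],
      },
    },
  });

  // Metric counting how many times the rule was triggered
  const metric = new cloudwatch.Metric({
    namespace: 'AWS/Events',
    metricName: 'TriggeredRules',
    statistic: 'Sum',
    dimensionsMap: { RuleName: rule.ruleName },
  });

  // Alarm that fires on the first failure
  new cloudwatch.Alarm(this, `${jobName} numFailedJobsAlarm`, {
    metric,
    threshold: 1,
    evaluationPeriods: 1,
  });
}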

Creating a Rule from AWS console

When trying to create a Rule from the AWS console (EventBridge tab), the UI doesn’t let us create a Rule without specifying a target:

A rule, created from UI has to have a target

One workaround is just to provide one. For example, we can create a dummy SNS topic (Amazon SNS tab) with the Standard type:

Creating a dummy TestSNSTopic

and specify it as our target:

Dummy TestSNSTopic is set up as a target for a Rule

Summary

Let's recap what we learned today 😉:

  1. Identified the problem of setting up an alarm for failed AWS Glue jobs
  2. Understood why the numFailedTasks metric, exposed to CloudWatch, cannot be used as a replacement for a ‘number of failed jobs’ metric
  3. Walked through a simple ‘3 steps’ CDK-based solution
  4. Proposed a workaround for EventBridge Rules created from the UI, by pointing them to a dummy SNS topic
