One of the most important features of Azure Monitor is its ability to send alerts when something interesting happens – in other words, when our telemetry meets some criteria we have told Azure Monitor that we’re interested in. We might have alerts that indicate when our application is down, or when it’s getting an unusually high amount of traffic, or when the response time or other performance metrics aren’t within the normal range. We can also have alerts based on the contents of log messages, and on the health status of Azure resources as reported by Azure itself. In this post, we’ll look at how alerts work within Azure Monitor and will see how these can be automated using ARM templates. This post will focus on the general workings of the alerts system, including action groups, and on metric alerts; part 5 (coming soon) will look at log alerts and resource health alerts.

This post is part of a series:

    • Part 1 provides an introduction to the series by describing why we should instrument our systems, outlines some of the major tools that Azure provides such as Azure Monitor, and argues why we should be adopting an ‘infrastructure as code’ mindset for our instrumentation and monitoring components.

    • Part 2 describes Azure Application Insights, including its proactive detection and alert features. It also outlines a pattern for deploying instrumentation components based on the requirements we might typically have for different environments, from short-lived development and test environments through to production.

    • Part 3 discusses how to publish custom metrics, both through Application Insights and to Azure Monitor. Custom metrics let us enrich the data that is available to our instrumentation components.

    • Part 4 (this post) covers the basics of alerts and metric alerts. Azure Monitor’s powerful alerting system is a big topic, and in this part we’ll discuss how it works overall, as well as how to get alerts for built-in and custom metrics.

    • Part 5 (coming soon) covers log alerts and resource health alerts, two other major types of alerts that Azure Monitor provides. Log alerts let us alert on information coming into Application Insights logs and Log Analytics workspaces, while resource health alerts us when Azure itself is having an issue that may result in downtime or degraded performance.

    • Part 6 (coming soon) describes dashboards. The Azure Portal has a great dashboard UI, and our instrumentation data can be made available as charts. Dashboards are also possible to automate, and I’ll show a few tips and tricks I’ve learned when doing this.

    • Part 7 (coming soon) covers availability tests, which let us proactively monitor our web applications for potential outages. We’ll discuss deploying and automating both single-step (ping) and multi-step availability tests.

    • Part 8 (coming soon) describes autoscale. While this isn’t exactly instrumentation in and of itself, autoscale is built on much of the same data used to drive alerts and dashboards, and autoscale rules can be automated as well.

    • Finally, part 9 (coming soon) covers exporting data to other systems. Azure Monitor metrics and log data can be automatically exported, as can Application Insights data, and the export rules can be exported and used from automation scripts.

What Are Alerts?

Alerts are described in detail on the Azure Monitor documentation, and I won’t re-hash the entire page here. Here is a quick summary, though.

An alert rule defines the situations under which an alert should fire. For example, an alert rule might be something like when the average CPU utilisation goes above 80% over the last hour, or when the number of requests that get responses with an HTTP 5xx error code goes above 3 in the last 15 minutes. An alert is a single instance in which the alert rule fired. We tell Azure Monitor what alert rules we want to create, and Azure Monitor creates alerts and sends them out.

Alert rules have three logical components:

    • Target resource: the Azure resource that should be monitored for this alert. For example, this might be an app service, a Cosmos DB account, or an Application Insights instance.
    • Rule: the rule that should be applied when determining whether to fire an alert for the resource. For example, this might be a rule like when average CPU usage is greater than 50% within the last 5 minutes, or when a log message is written with a level of Warning. Rules include a number of sub-properties, and often include a time window or schedule that should be used to evaluate the alert rule.
    • Action: the actions that should be performed when the alert has fired. For example, this might be email admin@example.com or invoke a webhook at https://example.com/alert. Azure Monitor provides a number of action types that can be invoked, which we’ll discuss below.

There are also other pieces of metadata that we can set when we create alert rules, including the alert rule name, description, and severity. Severity is a useful piece of metadata that will be propagated to any alerts that fire from this alert rule, and allows for whoever is responding to understand how important the alert is likely to be, and to prioritise their list of alerts so that they deal with the most important alerts first.

Classic Alerts

Azure Monitor currently has two types of alerts. Classic alerts are the original alert type supported by Azure Monitor since its inception, and can be contrasted with the newer alerts – which, confusingly, don’t seem to have a name, but which I’ll refer to as newer alerts for the sake of this post.

There are many differences between classic and newer alerts. One such difference is that in classic alerts, actions and rules are mixed into a single ‘alert’ resource, while in newer alerts, actions and rules are separate resources (as described below in more detail). A second difference is that as Azure migrates from classic to newer alerts, some Azure resource types only support classic alerts, although these are all being migrated across to newer alerts.

Microsoft recently announced that classic alerts will be retired in June 2019, so I won’t spend a lot of time discussing them here, although if you need to create a classic alert with an ARM template before June 2019, you can use this documentation page as a reference.

All of the rest of this discussion will focus on newer alerts.

Alert Action Groups

A key component of Azure Monitor’s alert system is action groups, which define how an alert should be handled. Importantly, action groups are independent of the alert rule that triggered them. An alert rule defines when and why an alert should be fired, while an action group defines how the alert should be sent out to interested parties. For example, an action group can send an email to a specified email address, send an SMS notification, invoke a webhook, trigger a Logic App, or perform a number of other actions. A single action group can perform one or several of these actions.

Action groups are Azure Resource Manager resources in their own right, and alert rules then refer to them. This means we can have shared action groups that work across multiple alerts, potentially spread across multiple applications or multiple teams. We can also create specific action groups for defined purposes. For example, in an enterprise application you might have a set of action groups like this:

Action Group Name Resource Group Actions Notes
CreateEnterpriseIssue Shared-OpsTeam Invoke a webhook to create issue in enterprise issue tracking system. This might be used for high priority issues that need immediate, 24×7 attention. It will notify your organisation’s central operations team.
SendSmsToTeamLead MyApplication Send an SMS to the development team lead. This might be used for high priority issues that also need 24×7 attention. It will notify the dev team lead.
EmailDevelopmentTeam MyApplication Send an email to the development team’s shared email alias. This might be used to ensure the development team is aware of all production issues, including lower-priority issues that only need attention during business hours.

Of course, these are just examples; you can set up any action groups that make sense for your application, team, or company.

Automating Action Group Creation

Action groups can be created and updated using ARM templates, using the Microsoft.Insights/actionGroups resource type. The schema is fairly straightforward, but one point to consider is the groupShortName property. The short name is used in several places throughout Azure Monitor, but importantly it is used to identify the action group on email and SMS message alerts that Azure Monitor sends. If you have multiple teams, multiple applications, or even just multiple alert groups, it’s important to choose a meaningful short name that will make sense to someone reading the alert. I find it helpful to put myself in the mind of the person (likely me!) who will be woken at 3am to a terse SMS informing them that something has happened; they will be half asleep while trying to make sense of the alert that they have received. Choosing an appropriate action group short name may help save them several minutes of troubleshooting time, reducing the time to diagnosis (and the time before they can return to bed). Unfortunately these short names must be 12 characters or fewer, so it’s not always easy to find a good name to use.

With this in mind, here is an example ARM template that creates the three action groups listed above:

Note that this will create all three action groups in the same resource group, rather than using separate resource groups for the shared and application-specific action groups.

Once the action groups have been created, any SMS and email recipients will receive a confirmation message to let them know they are now in the action group. They can also unsubscribe from the action group if they choose. If you use a group email alias, it’s important to remember that if one recipient unsubscribes then the whole email address action will be disabled for that alert, and nobody on the email distribution list will get those alerts anymore.

Metric Alerts

Now that we know how to create action groups that are ready to receive alerts and route them to the relevant people and places, let’s look at how we create an alert based on the metrics that Azure Monitor has recorded for our system.

Important: Metric alerts are not free of charge, although there is a small free quota you get. Make sure you remove any test alert rules once you’re done, and take a look at the pricing information for more detail.

A metric alert rule has a number of important properties:

    • Scope is the resource that has the metrics that we want to monitor and alert on.
    • Evaluation frequency is how often Azure Monitor should check the resource to see if it meets the criteria. This is specified as an ISO 8601 period – for example, PT5M means check this alert every 5 minutes.
    • Window size is how far back in time Azure Monitor should look when it checks the criteria. This is also specified as an ISO 8601 period – for example, PT1H means when running this alert, look at the metric history for the last 1 hour. This can be between 5 minutes and 24 hours.
    • Criteria are the specific rules that should be evaluated. There is a sophisticated set of functionality available when specifying criteria, but commonly this will be something like (for an App Service) look at the number of requests that resulted in a 5xx status code response, and alert me if the count is greater than 3 or (for a Cosmos DB database) look at the number of requests where the StatusCode dimension was set to the value 429 (representing a throttled request), and alert me if the count is greater than 1.
    • Actions are references to the action group (or groups) that should be invoked when an alert is fired.

Each of these properties can be set within an ARM template using the resource type Microsoft.Insights/metricAlerts. Let’s discuss a few of these in more detail.

Scope

As we know from earlier in this series, there are three main ways that metrics get into Azure Monitor:

    • Built-in metrics, which are published by Azure itself.
    • Custom resource metrics, which are published by our applications and are attached to Azure resources.
    • Application Insights allows for custom metrics that are also published by our applications, but are maintained within Application Insights rather than tied to a specific Azure resource.

All three of these metric types can have alerts triggered from them. In the case of built-in and custom resource metrics, we will use the Azure resource itself as the scope of the metric alert. For Application Insights, we use the Application Insights resource (i.e. the resource of type Microsoft.Insights/components) as the scope.

Note that Microsoft has recently announced a preview capability of monitoring multiple resources in a single metric alert rule. This currently only works with virtual machines, and as it’s such a narrow use case, I won’t discuss it here. However, keep in mind that the scopes property is specified as an array because of this feature.

Criteria

A criterion is a specification of the conditions under which the alert should fire. Criteria have the following sub-properties:

    • Name: a criterion can have a friendly name specified to help understand what caused an alert to fire.
    • Metric name and namespace: the name of the metric that was published, and if it’s a custom metric, the namespace. For more information on metric namespaces see part 3 of this seriesA list of built-in metrics published by Azure services is available here.
    • Dimensions: if the metric has dimensions associated with it, we can filter the metrics to only consider certain dimension values. Dimension values can be included or excluded.
    • Time aggregation: the way in which the metric should be aggregated – e.g. counted, summed, or have the maximum/minimum values considered.
    • Operator: the comparison operator (e.g. greater than, less than) that should be used when comparing the aggregated metric value to the threshold.
    • Threshold: the critical value at which the aggregated metric should trigger the alert to fire.

These properties can be quite abstract, so let’s consider a couple of examples.

First, let’s consider an example for Cosmos DB. We might have a business rule that says whenever we see more than one throttled request, fire an alert. In this example:

    • Metric name would be TotalRequests, since that is the name of the metric published by Cosmos DB. There is no namespace since this is a built-in alert. Note that, by default, TotalRequests is the count of all requests and not just throttled requests, so…
    • Dimension would be set to filter the StatusCode dimension to only include the value 429, since 429 represents a throttled request.
    • Operator would be GreaterThan, since we are interested in knowing when we see more than a single throttled request.
    • Threshold would be 1, since we want to know whether we received more than one throttled request.
    • Time aggregation would be Maximum. The TotalRequests metric is a count-based metric (i.e. each metric raw value represents the total number of requests for a given period of time), and so we want to look at the maximum value of the metric within the time window that we are considering.

Second, let’s consider an example for App Services. We might have a business rule that says whenever our application returns more than three responses with a 5xx response code, fire an alert. In this example:

    • Metric name would be Http5xx, since that is the name of the metric published by App Services. Once again, there is no namespace.
    • Dimension would be omitted. App Services publishes the Http5xx metric as a separate metric rather than having a TotalRequests metric with dimensions for status codes like Cosmos DB. (Yes, this is inconsistent!)
    • Operator would again be GreaterThan.
    • Threshold would be 3.
    • Time aggregation would again be Maximum.

Note that a single metric alert can have one or more criteria. The odata.type property of the criteria property can be set to different values depending on whether we have a single criterion (in which case use Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria) or multiple (Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria). At the time of writing, if we use multiple criteria then all of the criteria must be met for the alert rule to fire.

Static and Dynamic Thresholds

Azure Monitor recently added a new preview feature called dynamic thresholds. When we use dynamic thresholds then rather than specifying the metric thresholds ourselves, we instead let Azure Monitor watch the metric and learn its normal values, and then alert us if it notices a change. The feature is currently in preview, so I won’t discuss it in a lot of detail here, but there are example ARM templates available if you want to explore this.

Example ARM Templates

Let’s look at a couple of ARM templates to create the metric alert rules we discussed above. Each template also creates an action group with an email action, but of course you can have whatever action groups you want; you can also refer to shared action groups in other resource groups.

First, here is the ARM template for the Cosmos DB alert rule (lines 54-99), which uses a dimension to filter the metrics (lines 74-81) like we discussed above:

Second, here is the ARM template for the App Services alert rule (lines 77 to 112):

Note: when I tried to execute the second ARM template, I sometimes found it would fail the first time around, but re-executing it worked. This seems to just be one of those weird things with ARM templates, unfortunately.

Summary

Azure’s built-in metrics provide a huge amount of visibility into the operation of our system components, and of course we can enrich these with our own custom metrics (see part 3 of this series). Once the data is available to Azure Monitor, Azure Monitor can alert us based on whatever criteria we want to establish. The definitions of these metric alert rules is highly automatable using ARM templates, as is the definition of action groups to specify what should happen when an alert is fired.

In the next part of this series we will look at alerts based on log data.

Category:
Application Development and Integration, Azure Infrastructure, Azure Platform
Tags:
, , , , ,