Plugging the Gaps in Azure Policy

Introduction

Welcome to the first part of a two-part blog on Azure Policy. Multi-part blogs are not my usual style, but the nature of blogging whilst also being a full-time Consultant is that you slip some words in when you find the time, and I was starting to feel if I wrote this in a single part, it would just never see the light of day. Part one of this blog deals with the high-level overview of what the problem is, and how we solved it at a high level, part two will include the icky sticky granular detail, including some scripts which you can shamelessly plagiarise.

Azure Policy is a feature-complete solution which performs granular analysis on all your Azure resources, allowing your IT department to take swift and decisive action on resources which attempt to skirt infrastructure policies you define. Right, the sales guys now have their quotable line, let’s get stuck into how you’re going to deliver on that.

Azure Policy Overview

First, a quick overview of what Azure Policy actually is. Azure Policy is a service which allows you to create rules (policies) which allow you to take action on an attempt to create or modify an Azure resource. For example, I might have a policy which says “only allow VM SKU’s of Standard_D2s_v3” with the effect of denying the creation of said VM if it’s anything other than that SKU. Now, if a user attempts to create a VM other than the sizing I specify, they get denied – the same story if they attempt to modify an existing VM to use that SKU. Deny is just one example of an “effect” we can take via Azure Policy, but we can also use Audit, Append, AuditIfNotExists, DeployIfNotExists, and Disabled.

Taking the actions described above obviously requires that you evaluate the resource to take the action. We do this using some chunks of JSON with fairly basic operators to determine what action we take. The properties you plug into a policy you create via Azure Policy, are not actually direct properties of the resource you are attempting to evaluate, rather we have “Aliases”, which map to those properties. So, for example, the alias for the image SKU we used as an example is “Microsoft.Compute/virtualMachines/imageSku”, which maps to the path “properties.storageProfile.imageReference.sku” on the actual resource. This leads me to….

The Gap

If your organisation has decided Azure Policy is the way forward (because of the snazzy dashboard you get for resource compliance, or because you’re going down the path of using baked in Azure stuff, or whatever), you’re going to find fairly quickly that there is currently not a one to one mapping between the aliases on offer, and the properties on a resource. Using a virtual machine as an example, we can use Azure Policy to take an effect on a resource depending on its SKU (lovely!) but up until very recently, we didn’t have the ability to say if you do spin up a VM with that SKU, that it should only ever have a single NIC attached. The existing officially supported path to getting such aliases added to Policy is via the Azure Policy GitHub (oh, by the way, if you’re working with policy and not frequenting that GitHub, you’re doing it wrong). The example I used about the multiple NIC’s, you can see was a requested as an alias by my colleague Ken on October 22nd 2018, and marked as “deployed” into the product on February 14th 2019. Perhaps this is not bad for the turnaround from request to implementation into the product speaking in general terms, but not quick enough when you’re working on a project which relies on that alias for a delivery deadline which arrives months before February 14th 2019. A quick review of both the open and closed issues on the Azure Policy GitHub gives you a feel for the sporadic nature of issues being addressed, and in some cases due to complexity or security, the inability to address the issues at all. That’s OK, we can fix this.

Plugging the Gap

Something we can use across all Azure resources in Policy is fields. One of the fields we can use is the tag on a resource. So, what we can do here is report compliance status to the Azure Policy dashboard not based on the actual compliance status of the resource, but based on whether or not it has a certain tag applied to it – that is to say, a resource can be deemed compliant or non-compliant based on whether or not it has a tag of a certain value – then, we can use something out of band to evaluate the resources compliance and apply the compliance tag. Pretty cunning huh?

So I’m going to show you how we built out this idea for a customer. In this first part, you’re going to get the high-level view of how it hangs together, and in the second part I will share with you the actual scripts, policies, and other delicious little nuggets so you can build out a demo yourself should it be something you want to have a play with. Bear in mind the following things when using all this danger I am placing in your hands:

This was not built to scale, more as a POC, however,
- This idea would be fine for handling a mid-sized Azure environment
This concept is now being built out using Azure Functions (as it should be)
Roll-your-own error handling and logging, the examples I will provide will contain none
Don't rely on 100% event-based compliance evaluation (I'll explain why in part 2)
I'm giving you just enough IP to be dangerous, be a Good Consultant

Here’s a breakdown of how the solution hangs together. The example below will more or less translate to the future Functions based version, we’ll just scribble out a couple bits, add a couple bits in.

So, from the diagram above, here’s the high-level view what’s going on:

1. Event Grid forwards events to a webhook hanging off a PowerShell Runbook

2. The PowerShell Runbook executes a script which evaluates the resource forwarded in the webhook data, and applies, removes, or modifies a tag accordingly. Separately, a similar PowerShell runbook fires on a schedule. The schedule-based script contains the same evaluations as the event drive one, but rather than evaluate an individual resource, it will evaluate all of them.

3. Azure Policy evaluates resources for compliance, and reports on it. In our case, compliance is simply the presence of a tag of a particular value.

Now, that might already be enough for many of you guys to build out something like this on your own, which is great! If you are that person, you’re probably going to come up with a bunch of extra bits I wouldn’t have thought about because you’re working from a (more-or-less) blank idea. For others, you’re going to want some gnarly config and scripts so you can plug that stuff into your own environment, tweak it up, and customise it to fit your own lab – for you guys, see you soon for part two!

Telstra Purple has been building out a bunch of stuff recently in Azure Policy, using both complex native policies, and ideas such as the one I’ve detailed here. If your organisation is looking at Azure Policy and you think it might be a good fit for your business, by all means, reach out for a chat. We love talking about this stuff.