Plugging the Gaps in Azure Policy – Part Two

Introduction

Welcome to the second and final part of my blog series on how to plug some gaps in Azure Policy. If you missed part one, this second part isn’t going to be a lot of use without that context, so maybe head on back and read part one before you continue.

In part one, I gave an overview of Azure Policy, a basic idea of how it works, what the gap in the product is in terms of resource evaluation, and a high-level view of how we plug that gap. In this second part, I’m going to show you how I built that idea out and provide you with some scripts and a policy so you can spin up the same idea in your own Azure environments.

Just a quick note: this is not a “next, next, finish” tutorial. If you do need something like that, there are plenty to be found online for each component I describe. My assumption is that you have a general familiarity with Azure as a whole, and the detail provided here should be enough for you to muddle your way through.

I’ll use the same image I used in part one to show you which bits we’re building, and where each bit fits into the grand scheme of things.

We’re going to create a single Azure Automation account, and we’re going to have two PowerShell scripts under it. One of those scripts will be triggered by a webhook, which will receive a POST from Event Grid, and the other will be fired by a good old-fashioned scheduler. I’m not going to include all the evaluations performed in my production version of this (hey, gotta hold back some IP, right?) but I will include enough for you to build on for your own environment.

The Automation Account

When creating the automation account, you don’t need to put a lot of thought into it. By default, when you create an automation account, it is going to automatically create an Azure Run As account on your behalf. If you’re doing this in your own lab, or an environment you have full control over, you’ll be able to do this step without issue. In a typical Azure environment, though, you may have access to build resources within a subscription but not be able to create Azure AD objects. If that level of control applies to your environment, you will likely need to get someone to manually create an Azure AD Service Principal on your behalf. For this example, we’ll just let Azure Automation create the Run As account, which, by default, will have contributor access on the subscription you are creating the account under (which is plenty for what we are doing). You will notice a “Classic” Run As account is also created. We’re not going to be using that, so you can scrap it. Good consultants like you will of course figure out the least permissions required for the production account and implement that accordingly rather than relying on these defaults.

The Event-Based Runbook

The Event-Based Runbook grabs parameters from the POSTed JSON which we get from Event Grid. That JSON contains enough information about an individual resource which has been created or modified that we are able to perform an evaluation on that resource alone. In the next section, I will give you a sample of what that JSON looks like.

When we create this event-based Runbook, obviously we need somewhere to receive the POSTed JSON, so we need to create a Webhook. If you’ve never done this before, it’s a fairly straightforward exercise, but you need to be aware of the following things:

  • When creating the Webhook, you are shown the tokenised URL only at the point of creation. Take note of it: you won’t be seeing it again, and you’ll have to re-create the Webhook if you didn’t save your notepad.
  • This URL is open out to the big bad internet. Although the damage you can cause in this instance is limited, you need to be aware that anyone with the right URL can hit that Webhook and start poking.
  • The security of the Webhook is contained solely in that tokenised URL (you can do some trickery around this, but it’s out of scope for this conversation) so in case the previous two points weren’t illustrative enough, the point is that you should be careful with Webhook security.

Below is the script we will use for the event-driven Runbook.

So, what are the interesting bits in there we need to know about? Well firstly, the webhook data. You can see we ingest the data initially into the $WebhookData variable, then store it in a more useful format in the $InputJSON variable, and then break it up into a bunch of other more useful variables: $resourceUri, $status and $subject. The purpose of each of those variables is described below.

 

Variable        Purpose
$resourceUri    The resource URI of the resource we want to evaluate
$status         The status of the Azure operation we received from Event Grid. If the operation failed to make a change, for example, we don’t need to re-evaluate it.
$subject        The subject contains the resource type, which helps us to narrow down the scope of our evaluation
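To make the variable extraction a bit more concrete, here is a rough sketch of the same parsing step. The runbook itself is PowerShell; this is Python purely to illustrate the shape of the data. It assumes the Event Grid schema, where `subject` sits at the top level of each event and `resourceUri` and `status` sit under `data`. The function name `parse_events` and every value in the sample body are placeholders of mine, not taken from the original script.

```python
import json


def parse_events(request_body):
    """Pull out the three fields the runbook cares about from a POSTed body."""
    payload = json.loads(request_body)
    # Event Grid normally delivers a JSON array of events; tolerate a single object too.
    events = payload if isinstance(payload, list) else [payload]
    parsed = []
    for event in events:
        data = event.get("data") or {}
        parsed.append({
            "resource_uri": data.get("resourceUri"),  # what to evaluate
            "status": data.get("status"),             # did the operation succeed?
            "subject": event.get("subject"),          # carries the resource type
        })
    return parsed


# Placeholder payload: the IDs and names below are invented for illustration.
VM_ID = ("/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
         "/providers/Microsoft.Compute/virtualMachines/vm-demo")
SAMPLE_BODY = json.dumps([{
    "subject": VM_ID,
    "eventType": "Microsoft.Resources.ResourceWriteSuccess",
    "data": {"resourceUri": VM_ID, "status": "Succeeded"},
}])

for item in parse_events(SAMPLE_BODY):
    print(item)
```

Missing fields come back as `None` rather than raising, which leaves the decision about what to do with a malformed event to the caller.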

 

Aside from dealing with inputs at the top, the script essentially has two parts to it: the tagging function, and the evaluation. In the evaluation (lines 78-88) we scope down the input to make sure we only ever bother evaluating a resource if it’s one we care about. The evaluation itself is really just saying “hey, does this resource have more than one NIC? If so, tag the resource using the tagging function. If it doesn’t, remove the tag using the tagging function”. Easy.
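As a rough illustration of that scope-then-evaluate pattern, here is a small Python sketch of the decision logic only. It is not the PowerShell runbook, and it does not call Azure. It assumes the subject is a standard resource ID containing `/providers/`, and the names `resource_type` and `evaluate` are hypothetical.

```python
VM_TYPE = "Microsoft.Compute/virtualMachines"


def resource_type(subject):
    """Return 'Namespace/type' for a top-level resource ID, otherwise None."""
    _, sep, tail = subject.partition("/providers/")
    parts = tail.split("/")
    if not sep or len(parts) != 3:
        return None  # not a top-level resource (for example, a VM extension)
    return "/".join(parts[:2])


def evaluate(subject, nic_count):
    """Decide what to do with the tag: 'tag', 'untag', or None if out of scope."""
    rtype = resource_type(subject)
    if rtype is None or rtype.lower() != VM_TYPE.lower():
        return None  # not a resource we care about, so skip it entirely
    return "tag" if nic_count > 1 else "untag"


# Placeholder subjects, invented for illustration.
vm_subject = "/subscriptions/0000/resourceGroups/rg-demo/providers/Microsoft.Compute/virtualMachines/vm-demo"
ext_subject = vm_subject + "/extensions/monitoring"
sa_subject = "/subscriptions/0000/resourceGroups/rg-demo/providers/Microsoft.Storage/storageAccounts/sademo"

print(evaluate(vm_subject, 2))
print(evaluate(vm_subject, 1))
print(evaluate(sa_subject, 0))
```

Keeping the decision separate from the tagging call means the same function can be reused by anything that can supply a subject and a NIC count.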

The Schedule-Based Runbook

The evaluations (and the function) we have in the schedule-based Runbook are essentially the same as what we have in the event-based one. Why do we even have the schedule-based Runbook then? Well, imagine for a second that Azure Automation has fallen over for a few minutes, or someone publishes dud code, or one of many other things happens which means the automation account is temporarily unavailable. The fleeting event, which may occur one time only as a resource is being created, is then essentially lost to the ether. Having the schedule-based Runbook means we can come back every 24 hours (or whatever your organisation decides) and pick up things which may have been missed.

The schedule-based runbook obviously does not have the ability to target individual resources, so instead it must perform an evaluation on all resources. The larger your Azure environment, the longer the processing time, and potentially the higher the cost. Be wary of this and make sensible decisions.
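One way to keep the cost of a full sweep down is to only write tags where something actually needs to change. The sketch below shows that idea in Python against an in-memory inventory. It is an assumption about how you might structure the sweep, not a copy of the scheduled runbook, and the tag name, tag value, `nic_ids` field and function names are all hypothetical.

```python
VM_TYPE = "Microsoft.Compute/virtualMachines"
TAG_NAME = "compliance"    # hypothetical tag name
TAG_VALUE = "multi-nic"    # hypothetical value marking a non-compliant VM


def desired_tags(resource):
    """Return the tags the resource should carry after evaluation."""
    tags = dict(resource.get("tags") or {})
    if len(resource.get("nic_ids", [])) > 1:
        tags[TAG_NAME] = TAG_VALUE
    else:
        tags.pop(TAG_NAME, None)
    return tags


def sweep(resources):
    """Yield (resource id, new tags) only where the tags need to change."""
    for resource in resources:
        if resource["type"].lower() != VM_TYPE.lower():
            continue  # out of scope
        new_tags = desired_tags(resource)
        if new_tags != (resource.get("tags") or {}):
            yield resource["id"], new_tags


# Placeholder inventory standing in for "all resources in the subscription".
inventory = [
    {"id": "vm-a", "type": VM_TYPE, "tags": {}, "nic_ids": ["nic-1", "nic-2"]},
    {"id": "vm-b", "type": VM_TYPE, "tags": {TAG_NAME: TAG_VALUE}, "nic_ids": ["nic-3"]},
    {"id": "vm-c", "type": VM_TYPE, "tags": {"owner": "ops"}, "nic_ids": ["nic-4"]},
    {"id": "sa-a", "type": "Microsoft.Storage/storageAccounts", "tags": {}, "nic_ids": []},
]

changes = list(sweep(inventory))
print(changes)
```

In this sample only `vm-a` (needs the tag) and `vm-b` (needs it removed) produce a write; everything else is left alone.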

The schedule-based runbook PowerShell is pasted below.

Event Grid

Event Grid is the bit which is going to take events from our Azure Subscription and POST them to our Azure Automation Webhook so we can perform our evaluation. Create your Event Grid Subscription with the “Event Grid Schema”, the “Subscription” topic type (using your target subscription) and listening only for “success” event types. The final field we care about on the Event Subscription create form is the Webhook endpoint. This is the URL we saved earlier when creating the Webhook on our Azure Automation Runbook, and now is the time to paste that value in.
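For a feel of what “success event types only” means in practice, here is a hedged Python sketch. The three event type names are the success types Event Grid publishes for an Azure subscription topic, and `includedEventTypes` is the filter property on an event subscription. The `should_evaluate` function is my own hypothetical belt-and-braces check that a receiver could repeat on its side; it is not part of Event Grid.

```python
# Success event types published for an Azure subscription topic.
INCLUDED_EVENT_TYPES = [
    "Microsoft.Resources.ResourceWriteSuccess",
    "Microsoft.Resources.ResourceDeleteSuccess",
    "Microsoft.Resources.ResourceActionSuccess",
]

# Shape of the filter section of an event subscription (fragment only).
subscription_filter = {"filter": {"includedEventTypes": INCLUDED_EVENT_TYPES}}


def should_evaluate(event):
    """Hypothetical receiver-side check mirroring the subscription filter."""
    data = event.get("data") or {}
    return (event.get("eventType") in INCLUDED_EVENT_TYPES
            and data.get("status") == "Succeeded")


# Placeholder events, invented for illustration.
ok_event = {"eventType": "Microsoft.Resources.ResourceWriteSuccess",
            "data": {"status": "Succeeded"}}
failed_event = {"eventType": "Microsoft.Resources.ResourceWriteFailure",
                "data": {"status": "Failed"}}

print(should_evaluate(ok_event), should_evaluate(failed_event))
```

Which success types you actually subscribe to is a judgment call. A delete, for example, leaves nothing behind to tag.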

Below is an example of the JSON we end up getting POSTed to our Webhook.

Azure Policy

And finally, we arrive at Azure Policy itself. So once again to remind you, all we are doing at this point is performing a compliance evaluation on a resource based solely on the tag applied to it, and accordingly, the policy itself is very simple. Because this is a policy based only on the tag, it means the only effect we can really use is “Audit” – we cannot deny creation of resources based on these evaluations.
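To give a sense of how simple a tag-only rule can be, here is a minimal sketch that builds one as a Python dictionary and prints it as JSON. The `if`/`then` structure, the `type` field, the `tags['name']` field syntax and the `audit` effect are standard Azure Policy. The tag name and value are hypothetical, and the sketch assumes the tag marks a VM as non-compliant, in line with the multiple-NIC evaluation described earlier.

```python
import json

TAG_NAME = "compliance"            # hypothetical tag name
NON_COMPLIANT_VALUE = "multi-nic"  # hypothetical value written by the runbook

# Audit any virtual machine that carries the non-compliance tag.
policy_rule = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
            {"field": f"tags['{TAG_NAME}']", "equals": NON_COMPLIANT_VALUE},
        ]
    },
    "then": {"effect": "audit"},
}

print(json.dumps(policy_rule, indent=2))
```

The rule knows nothing about NICs. All of that knowledge lives in whatever writes the tag, which is why the only sensible effect here is an audit.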

The JSON for this policy is pasted below.

And that’s it, folks – I hope these last two blog posts have given you enough ideas or artifacts to start building out this idea in your own environments, or building out something much bigger and better using Azure Functions in place of our Azure Automation examples!

If you want to have a chat about how Azure Policy might be useful for your organisation, by all means, please do reach out. As a business we’ve done a bunch of this stuff now, and I’m sure we can help you to plug whatever gaps you might have.

 

Plugging the Gaps in Azure Policy – Part One

Introduction

Welcome to the first part of a two-part blog on Azure Policy. Multi-part blogs are not my usual style, but the nature of blogging whilst also being a full-time Consultant is that you slip some words in when you find time, and I was starting to feel that if I wrote this in a single part, it would just never see the light of day. Part one deals with the high-level overview of what the problem is and how we solved it. Part two will include the icky sticky granular detail, including some scripts which you can shamelessly plagiarise.

Azure Policy is a feature-complete solution which performs granular analysis on all your Azure resources, allowing your IT department to take swift and decisive action on resources which attempt to skirt infrastructure policies you define. Right, the sales guys now have their quotable line, so let’s get stuck in to how you’re going to deliver on that.

Azure Policy Overview

First, a quick overview of what Azure Policy actually is. Azure Policy is a service which allows you to create rules (policies) which take an action on an attempt to create or modify an Azure resource. For example, I might have a policy which says “only allow VM SKUs of Standard_D2s_v3” with the effect of denying the creation of said VM if it’s anything other than that SKU. Now, if a user attempts to create a VM other than the sizing I specify, they get denied. It’s the same story if they attempt to modify an existing VM to use any other SKU. Deny is just one example of an “effect” we can take via Azure Policy, but we can also use Audit, Append, AuditIfNotExists, DeployIfNotExists, and Disabled.
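As a sketch of what such a rule looks like, here is the “only allow Standard_D2s_v3” example built as a Python dictionary. The rule shape follows the same pattern as the built-in allowed VM size policy and uses the `Microsoft.Compute/virtualMachines/sku.name` alias. The `effect_for` function is a toy stand-in I have added to show the outcome. It is hypothetical and is not how the policy engine is implemented.

```python
import json

VM_TYPE = "Microsoft.Compute/virtualMachines"
ALLOWED_SKUS = ["Standard_D2s_v3"]

# Deny any virtual machine whose size is not in the allowed list.
deny_rule = {
    "if": {
        "allOf": [
            {"field": "type", "equals": VM_TYPE},
            {"not": {"field": "Microsoft.Compute/virtualMachines/sku.name",
                     "in": ALLOWED_SKUS}},
        ]
    },
    "then": {"effect": "deny"},
}


def effect_for(resource_type, sku_name):
    """Toy evaluation of the rule above: returns 'deny' or None."""
    if resource_type == VM_TYPE and sku_name not in ALLOWED_SKUS:
        return deny_rule["then"]["effect"]
    return None


print(json.dumps(deny_rule, indent=2))
print(effect_for(VM_TYPE, "Standard_D4s_v3"))
print(effect_for(VM_TYPE, "Standard_D2s_v3"))
```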

Taking the actions described above obviously requires that you evaluate the resource to take the action. We do this using some chunks of JSON with fairly basic operators to determine what action we take. The properties you plug into a policy you create via Azure Policy are not actually direct properties of the resource you are attempting to evaluate. Rather, we have “Aliases”, which map to those properties. So, for example, the alias for the image SKU is “Microsoft.Compute/virtualMachines/imageSku”, which maps to the path “properties.storageProfile.imageReference.sku” on the actual resource. This leads me to….

The Gap

If your organisation has decided Azure Policy is the way forward (because of the snazzy dashboard you get for resource compliance, or because you’re going down the path of using baked-in Azure stuff, or whatever), you’re going to find fairly quickly that there is currently not a one-to-one mapping between the aliases on offer and the properties on a resource. Using a virtual machine as an example, we can use Azure Policy to take an effect on a resource depending on its SKU (lovely!) but up until very recently, we didn’t have the ability to say that if you do spin up a VM with that SKU, it should only ever have a single NIC attached. The existing officially supported path to getting such aliases added to Policy is via the Azure Policy GitHub (oh, by the way, if you’re working with policy and not frequenting that GitHub, you’re doing it wrong). The example I used about the multiple NICs was requested as an alias by my colleague Ken on October 22nd 2018, and marked as “deployed” into the product on February 14th 2019. Speaking in general terms, perhaps this is not bad for the turnaround from request to implementation, but it is not quick enough when you’re working on a project which relies on that alias for a delivery deadline which arrives months before February 14th 2019. A quick review of both the open and closed issues on the Azure Policy GitHub gives you a feel for the sporadic nature of issues being addressed, and in some cases, due to complexity or security, the inability to address the issues at all. That’s OK, we can fix this.
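The alias idea, and the gap it leaves, can be sketched in a few lines of Python. The alias table below contains only the image SKU mapping mentioned above. The VM document is a made-up, heavily trimmed example, and `read_path` and `alias_for` are hypothetical helpers. The point is simply that the NIC list is right there on the resource, but without an alias a policy has no way to refer to it.

```python
# Alias name -> property path on the resource (only the mapping quoted in the text).
ALIASES = {
    "Microsoft.Compute/virtualMachines/imageSku":
        "properties.storageProfile.imageReference.sku",
}


def read_path(resource, path):
    """Walk a dotted property path through nested dictionaries."""
    value = resource
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def alias_for(path):
    """Return the alias that maps to a property path, or None if there isn't one."""
    for alias, mapped_path in ALIASES.items():
        if mapped_path == path:
            return alias
    return None


# Trimmed, invented VM document for illustration.
vm = {
    "properties": {
        "storageProfile": {"imageReference": {"sku": "2019-Datacenter"}},
        "networkProfile": {"networkInterfaces": [{"id": "nic-1"}, {"id": "nic-2"}]},
    }
}

NIC_PATH = "properties.networkProfile.networkInterfaces"
print(read_path(vm, ALIASES["Microsoft.Compute/virtualMachines/imageSku"]))
print(len(read_path(vm, NIC_PATH)), alias_for(NIC_PATH))
```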

Plugging the Gap

Something we can use across all Azure resources in Policy is fields. One of the fields we can use is the tag on a resource. So, what we can do here is report compliance status to the Azure Policy dashboard not based on the actual compliance status of the resource, but based on whether or not it has a certain tag applied to it. That is to say, a resource can be deemed compliant or non-compliant based on whether or not it has a tag of a certain value. Then, we can use something out of band to evaluate the resource’s compliance and apply the compliance tag. Pretty cunning, huh?

So I’m going to show you how we built out this idea for a customer. In this first part, you’re going to get the high-level view of how it hangs together, and in the second part I will share with you the actual scripts, policies, and other delicious little nuggets so you can build out a demo yourself should it be something you want to have a play with. Bear in mind the following things when using all this danger I am placing in your hands:

  • This was not built to scale, more as a POC, however;
    • This idea would be fine for handling a mid-sized Azure environment
  • This concept is now being built out using Azure Functions (as it should be)
  • Roll-your-own error handling and logging, the examples I will provide will contain none
  • Don’t rely on 100% event-based compliance evaluation (I’ll explain why in part 2)
  • I’m giving you just enough IP to be dangerous, be a good Consultant

Here’s a breakdown of how the solution hangs together. The example below will more or less translate to the future Functions based version, we’ll just scribble out a couple bits, add a couple bits in.

So, from the diagram above, here’s the high-level view of what’s going on:

  1. Event Grid forwards events to a webhook hanging off a PowerShell Runbook.
  2. The PowerShell Runbook executes a script which evaluates the resource forwarded in the webhook data, and applies, removes, or modifies a tag accordingly. Separately, a similar PowerShell runbook fires on a schedule. The schedule-based script contains the same evaluations as the event-driven one, but rather than evaluate an individual resource, it will evaluate all of them.
  3. Azure Policy evaluates resources for compliance, and reports on it. In our case, compliance is simply the presence of a tag of a particular value.

Now, that might already be enough for many of you guys to build out something like this on your own, which is great! If you are that person, you’re probably going to come up with a bunch of extra bits I wouldn’t have thought about, because you’re working from a (more-or-less) blank idea. For others, you’re going to want some gnarly config and scripts so you can plug that stuff into your own environment, tweak it up, and customise it to fit your own lab – for you guys, see you soon for part two!

Kloud has been building out a bunch of stuff recently in Azure Policy, using both complex native policies, and ideas such as the one I’ve detailed here. If your organisation is looking at Azure Policy and you think it might be a good fit for your business, by all means, reach out for a chat. We love talking about this stuff.
