Replacing the service desk with bots using Amazon Lex and Amazon Connect (Part 4)

Welcome back to the final blog post in this series! In parts 1, 2 and 3, we set up an Amazon Lex bot to converse with users, receive and validate verification input, and perform a password reset. While we’ve successfully tested this functionality in the AWS console, we want to provide our users with the ability to call and talk with the bot over the phone. In this blog post, we’ll wire up Amazon Connect with our bot to provide this capability.

What is Amazon Connect?

Amazon Connect is a cloud-based contact centre service that can be set up in minutes to take phone calls and route them to the correct agents. Additionally, Connect is able to integrate with Amazon Lex to create a self-service experience, providing a cost-effective way of resolving customer queries without having to wait in a queue for a human agent. In our case, Lex will be integrated with Amazon Connect to field password reset requests.

Provisioning Amazon Connect

The following steps provision the base Amazon Connect tenant:

  1. Begin by heading to the AWS Console, then navigate to Amazon Connect and select Add an instance.
  2. Specify a sub-domain for the access URL which will be used to log into Amazon Connect. Select Next step.
  3. For now, skip creating an administrator by selecting Skip this, then select Next step.
  4. For Telephony Options ensure Incoming Calls is selected and Outbound Calls is unselected, then click Next step.
  5. Accept the default data storage configuration and select Next step.
  6. Finally, review the information provided and select Create instance.

That’s all that’s required to provision the Amazon Connect service. Pretty simple stuff. It takes a few minutes to provision, then you’ll be ready to begin configuring your Amazon Connect tenant.

Configuring Amazon Connect

Next, we need to claim a phone number to be used with our service:

  1. Once Amazon Connect has been provisioned, click Get started to log into your Amazon Connect instance.
  2. On the Welcome to Amazon Connect page, select Let’s Go.
  3. To claim a phone number, select your preferred country code and type, then select Next. You may find that there are no available numbers for your country of choice (as in the screenshot below). If that’s the case and it’s necessary that you have a local number, you can raise a support case with Amazon. For testing purposes, I’m happy to provision a US number and use Google Hangouts to dial the number for free.
  4. When prompted to make a call, choose Skip for now.

You should now be at the Amazon Connect dashboard where there are several options, but before we can continue, we first need to add the Lex bot to Amazon Connect to allow it to be used within a contact flow.

Adding Lex to Amazon Connect

  1. Switch back to the AWS Console and navigate to Amazon Connect.
  2. Select the Amazon Connect instance alias created in the previous step.
  3. On the left-hand side, select Contact Flows.
  4. Under the Amazon Lex section, click Add Lex Bot, and select the user administration bot we created.
  5. Select Save Lex Bots.

Now that our bot has been added to Amazon Connect, we should be able to create an appropriate Contact Flow that leverages our bot.

Creating the Contact Flow

  1. Switch back to the Amazon Connect dashboard, then navigate to Contact Flows under Routing on the left sidebar.
  2. Select Create contact flow and enter a name (e.g. User administration) for the contact flow.
  3. Expand the Interact menu item then click and drag Get customer input to the grid.
  4. Click the Get customer input item, and set the following properties:
    • Enable Text to speech then add a greeting text (e.g. Welcome to the cloud call center. What would you like assistance with?).
    • Ensure that Interpret as is set to Text
    • Choose the Amazon Lex option, then add the Lex Bot name (e.g. UserAdministration) and set the alias to $LATEST to ensure it uses the latest build of the bot.
  5. Under Intents, select Add a parameter then enter the password reset intent for the Lex Bot (e.g. ResetPW)
  6. Select Save to save the configuration. It’s worth noting that if you wanted to send the user’s mobile number through to your Lex bot for verification purposes, this can be done by passing a session attribute as shown below; the phone number will then be available to the Lambda function in the sessionAttributes object (a short snippet showing how the function can read this attribute appears after these steps).
  7. On the left sidebar, expand Terminate/Transfer then drag and drop Disconnect/Hang up onto the grid.
  8. Connect the Start entry point to the Get Customer Input box and connect all the branches of the Get Customer Input Box to the Disconnect/Hang up box as shown below.
    We could have added more complex flows to deal with unrecognised intents or handle additional requests that our Lex bot isn’t configured for (both of which would be forwarded to a human agent), however this is outside the scope of this blog post.
  9. In the top right-hand corner above the grid, select the down arrow, then Save & Publish.
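As a rough illustration of the session attribute mentioned in step 6, the snippet below shows how the Lambda validation function might pick the caller’s number out of the incoming Lex event. The attribute name (IncomingNumber) is only an example and must match whatever key you configure in the Get customer input block:

    def lambda_handler(event, context):
        # Session attributes set in the Amazon Connect contact flow arrive with every Lex event
        session_attributes = event.get('sessionAttributes') or {}

        # 'IncomingNumber' is a hypothetical key; it must match the attribute name set in Connect
        caller_number = session_attributes.get('IncomingNumber')
        if caller_number:
            print('Call received from {}'.format(caller_number))

        # ... continue with the usual validation logic, remembering to pass
        # session_attributes back in the response so they persist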

Setting the Contact Flow

Now that we have a contact flow created, we need to attach it to the phone number we provisioned earlier.

  1. On the left sidebar in the Amazon Connect console, select Phone Numbers under the Routing menu then select the phone number listed.
  2. Under the Contact Flow/IVR dropdown menu, select the Contact flow you created, then select Save.

Testing the Contact Flow

Now that we’ve associated the contact flow with the phone number, you’re ready for testing! Remember, if you’ve provisioned a number in the US (and you’re overseas), you can dial for free using Google Hangouts.

That’s it! You now have a fully functioning chatbot that can be called and spoken to. From here, you can add more intents to build a bot that can handle several simple user administration tasks.

A few things worth noting:

  • You may notice that Lex repeats the user ID as a number, rather than as individual digits. Unfortunately, Amazon Connect doesn’t support SSML content from Lex at this time; however, it’s on the product roadmap.
  • You can view missed utterances on the Monitoring tab on your Lex bot and potentially add them to existing intents. This is a great way to monitor and expand on the capability of your bot.

Replacing the service desk with bots using Amazon Lex and Amazon Connect (Part 3)

Hopefully you’ve had the chance to follow along in parts 1 and 2 where we set up our Lex chatbot to take and validate input. In this blog, we’ll interface with our Active Directory environment to perform the password reset function. To do this, we need to create a Lambda function that will be used as the logic to fulfil the user’s intent. The Lambda function will be packaged with the python LDAP library to modify the AD password attribute for the user. Below are the components that need to be configured.

Active Directory Service Account

To begin, we need a service account in Active Directory that has permission to modify the password attribute for all users. Our Lambda function will use this service account to perform password resets. Create a service account in your Active Directory domain, then perform the following to delegate the required permissions:

  1. Open Active Directory Users and Computers.
  2. Right-click the OU or container that contains organisational users and click Delegate Control.
  3. Click Next on the Welcome Wizard.
  4. Click Add and enter the service account that will be granted the reset password permission.
  5. Click OK once you’ve made your selection, followed by Next.
  6. Ensure that Delegate the following common tasks is enabled, and select Reset user passwords and force password change at next logon.
  7. Click Next, and Finish.

KMS Key for the AD Service Account

Our Lambda function will need to use the credentials for the service account to perform password resets. We want to avoid storing credentials within our Lambda function, so we’ll store the password as an encrypted Lambda variable and allow our Lambda function to decrypt it using Amazon’s Key Management Service (KMS). To create the KMS encryption key, perform the following steps:

  1. In the Amazon Console, navigate to IAM
  2. Select Encryption Keys, then Create key
  3. Provide an Alias (e.g. resetpw) then select Next
  4. Select Next Step for all subsequent steps then Finish on step 5 to create the key.
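At runtime, the Lambda function turns the encrypted environment variable back into plaintext with a KMS decrypt call. A minimal sketch of that standard pattern (using the Pw variable name we’ll configure later) looks like this:

    import os
    import base64
    import boto3

    def get_service_account_password():
        # The console encryption helper stores the variable as base64-encoded KMS ciphertext
        encrypted = os.environ['Pw']
        kms = boto3.client('kms')
        return kms.decrypt(CiphertextBlob=base64.b64decode(encrypted))['Plaintext']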

IAM Role for the Lambda Function

Because our Lambda function will need access to several AWS services such as SNS to notify the user of their new password and KMS to decrypt the service account password, we need to provide our function with an IAM role that has the relevant permissions. To do this, perform the following steps:

  1. In the Amazon Console, navigate to IAM
  2. Select Policies then Create policy
  3. Switch to the JSON editor, then copy and paste the following policy, replacing the KMS resource with the KMS ARN created above
    {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Action": "sns:Publish",
           "Resource": "*"
         },
         {
           "Effect": "Allow",
           "Action": "kms:Decrypt",
           "Resource": "<KMS ARN>"
         }
       ]
    }
  4. Provide a name for the policy (e.g. resetpw), then select Create policy
  5. After the policy has been created, select Roles, then Create Role
  6. For the AWS Service, select Lambda, then click Next: Permissions
  7. Search for and select the policy you created in step 4, as well as the AWSLambdaVPCAccessExecutionRole and AWSLambdaBasicExecutionRole policies, then click Next: Review
  8. Provide a name for the role (e.g. resetpw) then click Create role

Network Configuration for the Lambda Function

To access our Active Directory environment, the Lambda function will need to run within a VPC, or a peered VPC, that hosts an Active Directory domain controller. Additionally, the function needs internet access to reach the KMS and SNS services. Lambda functions provisioned in a VPC are assigned an elastic network interface (ENI) with a private IP address from the subnet they’re deployed in. Because the ENI doesn’t have a public IP address, it can’t simply leverage an internet gateway attached to the VPC for internet connectivity; instead, traffic must be routed through a NAT gateway, similar to the diagram below.

Password Reset Lambda Function

Now that we’ve performed all the preliminary steps, we can start building the Lambda function. Because the Lambda execution environment uses Amazon Linux, I prefer to build my Lambda deployment package on a local Amazon Linux docker container to ensure library compatibility. Alternatively, you could deploy a small Amazon Linux EC2 instance for this purpose, which can assist with troubleshooting if it’s deployed in the same VPC as your AD environment.

Okay, so let’s get started on building the lambda function. Log in to your Amazon Linux instance/container, and perform the following:

  • Create a project directory and install python-ldap dependencies, gcc, python-devel and openldap-devel
    mkdir ~/resetpw
    sudo yum install python-devel openldap-devel gcc
  • Next, we’re going to download the python-ldap library to the directory we created
    pip install python-ldap -t ~/resetpw
  • In the resetpw directory, create a file called reset_function.py containing the password reset script (a sketch of one possible implementation is shown after the logic summary at the end of this section)

  • Now, we need to create the Lambda deployment package. As the package size is correlated with the speed of Lambda function cold starts, we need to filter out anything that’s not necessary to reduce the package size. The following zips up the script and LDAP library:
    cd ~/resetpw
    zip -r ~/resetpw.zip . -x "setuptools*/*" "pkg_resources/*" "easy_install*"
  • We need to deploy this package as a Lambda function. I’ve got AWSCLI installed in my Amazon Linux container, so I’m using the following CLI to create the Lambda function. Alternatively, you can download the zip file and manually create the Lambda function in the AWS console using the same parameters specified in the CLI below.
    aws lambda create-function --function-name reset_function --region us-east-1 --zip-file fileb://root/resetpw.zip --role <ARN of the resetpw IAM role> --handler reset_function.lambda_handler --runtime python2.7 --timeout 30 --vpc-config SubnetIds=subnet-a12b3cde,SecurityGroupIds=sg-0ab12c30 --memory-size 128
  • For the script to run in your environment, a number of Lambda variables need to be set which will be used at runtime. In the AWS Console, navigate to Lambda then click on your newly created Lambda function. In the environment variables section, create the following variables:
    • Url – This is the LDAPS URL for your domain controller. Note that it must be LDAP over SSL.
    • Domain_base_dn – The base distinguished name used to search for the user
    • User – The service account that has permissions to change the user password
    • Pw – The password of the service account
  • Finally, we need to encrypt the Pw variable in the Lambda console. Navigate to the Encryption configuration and select Enable helpers for encryption in transit. Select your KMS key for both encryption in transit and at rest, then select the Encrypt button next to the Pw variable. This will encrypt and mask the value.

  • Hit Save in the top right-hand corner to save the environment variables.

That’s it! The Lambda function is now configured. A summary of the Lambda function’s logic is as follows:

  1. Collect the Lambda environment variables and decrypt the service account password
  2. Perform a secure AD bind but don’t verify the certificate (I’m using a Self-Signed Cert in my lab)
  3. Collect the user’s birthday, start month, telephone number, and DN from AD
  4. Check the user’s verification input
  5. If the input is correct, reset the password and send it to the user’s telephone number, otherwise exit the function.
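Putting that logic together, a minimal sketch of what reset_function.py can look like is shown below. The custom attribute names (birthday, monthStarted), the slot names from Part 1, and the simple random password generation are assumptions that will need tailoring to your environment:

    import os
    import base64
    import random
    import string

    import boto3
    import ldap

    def decrypt_env(name):
        # Environment variables encrypted with the KMS key are stored as base64 ciphertext
        kms = boto3.client('kms')
        return kms.decrypt(CiphertextBlob=base64.b64decode(os.environ[name]))['Plaintext']

    def close(message):
        # Minimal Lex "Close" response
        return {
            'dialogAction': {
                'type': 'Close',
                'fulfillmentState': 'Fulfilled',
                'message': {'contentType': 'PlainText', 'content': message}
            }
        }

    def lambda_handler(event, context):
        slots = event['currentIntent']['slots']
        user_id = slots['UserID']
        dob = slots['DOB']
        month_started = slots['MonthStarted']

        # Secure bind to AD over LDAPS, skipping certificate verification (self-signed cert in the lab)
        ldap.set_option(ldap.OPT_X_TLS_REQUIRE_CERT, ldap.OPT_X_TLS_NEVER)
        conn = ldap.initialize(os.environ['Url'])
        conn.simple_bind_s(os.environ['User'], decrypt_env('Pw'))

        # Look the user up by employee number and pull back the verification attributes.
        # 'birthday' and 'monthStarted' are custom attributes in this lab; yours will differ.
        result = conn.search_s(os.environ['Domain_base_dn'], ldap.SCOPE_SUBTREE,
                               '(sAMAccountName={})'.format(user_id),
                               ['birthday', 'monthStarted', 'telephoneNumber'])
        if not result:
            return close('Sorry, I could not find your account.')

        dn, attrs = result[0]
        if attrs.get('birthday', [''])[0] != dob or \
           attrs.get('monthStarted', [''])[0].lower() != month_started.lower():
            return close('Sorry, the details provided could not be verified.')

        # Reset the password. AD expects unicodePwd as a quoted, UTF-16LE encoded string.
        new_password = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(10))
        conn.modify_s(dn, [(ldap.MOD_REPLACE, 'unicodePwd', '"{}"'.format(new_password).encode('utf-16-le'))])

        # Text the new password to the mobile number stored against the user in AD
        boto3.client('sns').publish(PhoneNumber=attrs['telephoneNumber'][0],
                                    Message='Your new password is: {}'.format(new_password))

        return close('Your password has been reset and sent to your mobile number.')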

Update and test Lex bot fulfillment

The final step is to add the newly created Lambda function to the Lex bot so it’s invoked after input is validated. To do this, perform the following:

  1. In the Amazon Console, navigate to Amazon Lex and select your bot
  2. Select Fulfillment for your password reset intent, then select AWS Lambda function
  3. Choose the recently created Lambda function from the drop down box, then select Save Intent

That should be it! You’ll need to build your bot again before testing…

My phone buzzes with a new SMS…

Success! A few final things worth noting:

  • All Lambda execution logs will be written to CloudWatch logs, so this is the best place to begin troubleshooting if required
  • Modifying the AD password attribute requires a secure bind; attempting the change over an insecure connection will result in an error.
  • The interrogated fields (month started and date of birth) are custom AD attributes in my lab.
  • Password complexity in my lab is relaxed. If you’re using the default password complexity in AD, the randomly generated password in the lambda function may not meet complexity requirements every time.

Hope to see you in part 4, where we bolt on Amazon Connect to field phone calls!

Replacing the service desk with bots using Amazon Lex and Amazon Connect (Part 2)

Welcome back! Hopefully you had the chance to follow along in part 1 where we started creating our Lex chatbot. In part 2, we attempt to make the conversation more human-like and begin integrating data validation on our slots to ensure we’re getting the correct input.

Creating the Lambda initialisation and validation function

As data validation requires compute, we’ll need to start by creating an AWS Lambda function. Head over to the AWS console, then navigate to the AWS Lambda page. Once you’re there, select Create Function and choose to Author from Scratch then specify the following:

Name: ResetPWCheck

Runtime: Python 2.7 (it’s really a matter of preference)

Role: I use an existing Out of the Box role, “Lambda_basic_execution”, as I only need access to CloudWatch logs for debugging.

Once you’ve populated all the fields, go ahead and select Create Function. The script we’ll be using is provided (further down) in this blog; however, before we go through the script in detail, there are two items worth mentioning.

Input Events and Response Formats

It’s well worth familiarising yourself with the page on Lambda Function Input Event and Response Formats in the Lex Developer guide. Every time input is provided to Lex, it invokes the Lambda initialisation and validation function. For example, when I tell my chatbot “I need to reset my password”, the Lambda function is invoked and an event like the following is passed:
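Trimmed for readability, the event looks something like this; the userId and slot values vary depending on the client and what has been collected so far:

    {
      "messageVersion": "1.0",
      "invocationSource": "DialogCodeHook",
      "userId": "some-client-generated-id",
      "sessionAttributes": {},
      "bot": {
        "name": "UserAdministration",
        "alias": "$LATEST",
        "version": "$LATEST"
      },
      "outputDialogMode": "Text",
      "currentIntent": {
        "name": "ResetPW",
        "slots": {
          "UserID": null,
          "DOB": null,
          "MonthStarted": null
        },
        "confirmationStatus": "None"
      }
    }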

Amazon Lex expects a response from the Lambda function in JSON format that provides it with the next dialog action.

Persisting Variables with Session Attributes

There are many ways to determine within your Lambda function where you’re up to in your chat dialog; however, my preferred method is to pass state information within the sessionAttributes object of the input event and response as a key/value pair. The sessionAttributes persist between invocations of the Lambda function (every time input is provided to the chatbot), however you must remember to collect and pass the attributes between input and responses to ensure they persist.

Input Validation Code

With that out of the way, let’s begin looking at the code. The script assumes you’re using the same slot and intent names in your Lex bot that were used in Part 1.
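A condensed sketch of the structure is shown below. It covers the key pieces (the session-state check, the chkUserId length test, and the responses handed back to Lex) rather than every branch of the dialog:

    def elicit_slot(session_attributes, intent, slots, slot_to_elicit, message):
        # Ask Lex to prompt the user for a specific slot value
        return {
            'sessionAttributes': session_attributes,
            'dialogAction': {
                'type': 'ElicitSlot',
                'intentName': intent,
                'slots': slots,
                'slotToElicit': slot_to_elicit,
                'message': {'contentType': 'PlainText', 'content': message}
            }
        }

    def delegate(session_attributes, slots):
        # Hand control back to Lex to continue the configured dialog
        return {
            'sessionAttributes': session_attributes,
            'dialogAction': {'type': 'Delegate', 'slots': slots}
        }

    def chkUserId(user_id):
        # The slot is empty if non-numeric input was rejected; a valid ID is exactly six digits
        return user_id is not None and len(user_id) == 6

    def lambda_handler(event, context):
        session_attributes = event.get('sessionAttributes') or {}
        intent = event['currentIntent']['name']
        slots = event['currentIntent']['slots']

        # First invocation: welcome the user and ask for their user ID
        if not session_attributes:
            session_attributes['Completed'] = 'None'
            return elicit_slot(session_attributes, intent, slots, 'UserID',
                               'Thanks for getting in touch. To verify your identity, '
                               'can I please have your six digit user ID?')

        # Validate the user ID before moving on
        if session_attributes.get('Completed') == 'None':
            if not chkUserId(slots['UserID']):
                slots['UserID'] = None
                return elicit_slot(session_attributes, intent, slots, 'UserID',
                                   'That does not look like a valid user ID. It should be six digits.')
            session_attributes['Completed'] = 'UserID'

        # Similar checks follow for DOB and MonthStarted, then the confirmation prompt
        # (using SSML to read the user ID as digits when voice is used); once everything
        # checks out, hand back to Lex.
        return delegate(session_attributes, slots)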

Let’s break it down.

When the Lambda function is first invoked, we check to see if any state is set in the sessionAttributes. If not, we can assume this is the first time the Lambda function has been invoked and, as a result, provide a welcoming response while requesting the user’s ID. To ensure the user isn’t welcomed again, we set a session state so the Lambda function knows to move to User ID validation when next invoked. This is done by setting the “Completed” : “None” key/value pair in the response sessionAttributes.

Next, we check the User ID. You’ll notice the chkUserId function checks for two things: that the slot is populated, and if it is, the length of the field. Because the slot type is AMAZON.NUMBER, any non-numeric characters that are entered will be rejected by the slot. If this occurs, the slot will be left empty, hence this is something we’re looking out for. We also want to ensure the User ID is 6 digits, otherwise it is considered invalid. If the input is correct, we set the session state key/value pair to indicate User ID validation is complete and allow the dialog to continue; otherwise we ask the user to re-enter their User ID.

Next, we check the Date of Birth. Because the slot type is strict regarding input, we don’t do much validation here. An utterance for this slot type generally maps to a complete date: YYYY-MM-DD. For validation purposes, we’re just looking for an empty slot. Like the User ID check, we set the session state and allow the dialog to continue if all looks good.

Finally, we check the last slot which is the Month Started. Assuming the input for the month started is correct, we then confirm the intent by reading all the slot values back to the user and asking if it’s correct. You’ll notice here that there’s a bit of logic to determine if the user is using voice or text to interact with Lex. If voice is used, we use Speech Synthesis Markup Language (SSML) to ensure the UserID value is read as digits, rather than as the full number.

If the user is happy with the slot values, the validation completes and Lex then moves to the next Lambda function to fulfil the intent (next blog). If the user isn’t happy with the slot values, the lambda function exits telling the user to call back and try again.

Okay, now that our Lambda function is finished, we need to enable it as a code hook for initialisation and validation. Head over to your Lex bot, select the “ResetPW” intent, then tick the box under Lambda initialisation and validation and select your Lambda function. A prompt will be given to provide permissions to allow your Lex bot to invoke the lambda function. Select OK.

Let’s hit Build on the chatbot, and test it out.

So, we’ve managed to make the conversation a bit more human like and we can now detect invalid input. If you use the microphone to chat with your bot, you’ll notice the User ID value is read as digits. That’s it for this blog. Next blog, we’ll integrate Active Directory and actually get a password reset and sent via SNS to a mobile phone.

Replacing the service desk with bots using Amazon Lex and Amazon Connect (Part 1)

“What! Is this guy for real? Does he really think he can replace the front line of IT with pre-canned conversations?” I must admit, it’s a bold statement. The IT Service Desk has been around for years and has been the foot in the door for most of us in the IT industry. It’s the face of IT operations and plays an important role in ensuring an organisation’s staff can perform to the best of their abilities. But what if we could take some of the repetitive tasks the service desk performs and automate them? Not only would we be saving on head count costs, we would be able to ensure correct policies and procedures are followed to uphold security and compliance. The aim of this blog is to provide a working example of the automation of one service desk scenario to show how easy and accessible the technology is, and how it can be applied to various use cases.
To make it easier to follow along, I’ve broken this blog up into a number of parts. Part 1 will focus on the high-level architecture for the scenario and begin creating the Lex chatbot.

Scenario

Arguably, the most common service desk request is the password reset. While this is a pretty simple issue for the service desk to resolve, many service desk staff seem to skip over, or not realise, the importance of user verification. Both the simple resolution and the strict verification requirement make this a prime scenario to automate.

Architecture

So what does the architecture look like? The diagram below depicts the expected process flow. Let’s step through each item in the diagram.

 

Amazon Connect

The process begins when the user calls the service desk and requests to have their password reset. In our architecture, the service desk uses Amazon Connect which is a cloud based customer contact centre service, allowing you to create contact flows, manage agents, and track performance metrics. We’re also able to plug in an Amazon Lex chatbot to handle user requests and offload the call to a human if the chatbot is unable to understand the user’s intent.

Amazon Lex

After the user has stated their request to change their password, we need to begin the user verification process. Their intent is recognised by our Amazon Lex chatbot, which initiates the dialog for the user verification process to ensure they are who they really say they are.

AWS Lambda

After the user has provided verification information, AWS Lambda, which is a compute on demand service, is used to validate the user’s input and verify it against internal records. To do this, Lambda interrogates Active Directory to validate the user.

Amazon SNS

Once the user has been validated, their password is reset to a random string in Active Directory and the password is messaged to the user’s phone using Amazon’s Simple Notification Service. This completes the interaction.

Building our Chatbot

Before we get into the details, it’s worth mentioning that the aim of this blog is to convey the technology capability. There are many ways of enhancing the solution or improving validation of user input that I’ve skipped over, so while this isn’t a finished, production-ready product, it’s certainly a good foundation to begin building an automated call centre.

To begin, let’s start with building our chatbot in Amazon Lex. In the Amazon Console, navigate to Amazon Lex. You’ll notice it’s only available in Ireland and US East. As Amazon Connect and my Active Directory environment are also in US East, that’s the region I’ve chosen.

Go ahead and select Create Bot, then choose to create your own Custom Bot. I’ve named mine “UserAdministration”. Choose an Output voice and set the session timeout to 5 minutes. An IAM Role will automatically be created on your behalf to allow your bot to use Amazon Polly for speech. For COPPA, select No, then select Create.

Once the bot has been created, we need to identify the user action expected to be performed, which is known as an intent. A bot can have multiple intents, but for our scenario, we’re only creating one, which is the password reset intent. Go ahead and select Create Intent, then in the Add Intent window, select Create new intent. My intent name is “ResetPW”. Select Add, which should add the intent to your bot. We now need to specify some expected sample utterances, which are phrases the user can use to trigger the Reset Password intent. There’s quite a few that could be listed here, but I’m going to limit mine to the following:

  • I forgot my password
  • I need to reset my password
  • Can you please reset my password

The next section is the configuration for the Lambda validation function. Let’s skip past this for the time being and move onto the slots. Slots are used to collect information from the user. In our case, we need to collect verification information to ensure the user is who they say they are. The verification information collected is going to vary between environments. I’m looking to collect the following to verify against Active Directory:

  • User ID – In my case, this is a 6-digit employee number that is also the sAMAccountName in Active Directory
  • User’s birthday – This is a custom attribute in my Active Directory
  • Month started – This is a custom attribute in my Active Directory

In addition to this, it’s also worth collecting and verifying the user’s mobile number. This can be done by passing the caller ID information from Amazon Connect, however we’ll skip this, as the bulk of our testing will be text chat and we need to ensure we have a consistent experience.

To define a slot, we need to specify three items:

  • Name of the slot – Think of this as the variable name.
  • Slot type – The data type expected. This is used to train the machine learning model to recognise the value for the slot.
  • Prompt – How the user is prompted to provide the value sought.

Many slot types are provided by Amazon, two of which have been used in this scenario. For “MonthStarted”, I’ve decided to create my own custom slot type, as the in-built “AMAZON.Month” slot type wasn’t strictly enforcing recognisable months. To create your own slot type, press the plus symbol on the left-hand side of the page next to Slot Types, then provide a name and description for your slot type. Select Restrict to Slot values and Synonyms, then enter each month and its abbreviation. Once completed, click Add slot to intent.

Once the custom slot type has been configured, it’s time to set up the slots for the intent. The screenshot below shows the slots that have been configured and the expected order to prompt the user.

The last step (for this blog post) is to have the bot verify that the information collected is correct. Tick the Confirmation Prompt box and in the Confirm text box provided, enter the following:

Just to confirm, your user ID is {UserID}, your Date of Birth is {DOB} and the month you started is {MonthStarted}. Is this correct?

For the Cancel text box, enter the following:

Sorry about that. Please call back and try again.

Be sure to leave Fulfillment set to Return parameters to client, and hit Save Intent.

Great! We’ve built the bare basics of our bot. It doesn’t really do much yet, but let’s take it for a spin anyway and get a feel for what to expect. In the top right-hand corner, there’s a Build button. Go ahead and click the button. This takes some time, as building a bot triggers machine learning and creates the models for your bot. Once completed, the bot should be available to text or voice chat on the right side of the page. As you move through the prompts, you can see at the bottom the slots getting populated with the expected format, e.g. 14th April 1983 is converted to 1983-04-14.

So at the moment, our bot doesn’t do much but collect the information we need. Admittedly, the conversation is a bit robotic as well. In the next few blogs, we’ll give the bot a bit more of a personality, we’ll do some input validation, and we’ll begin to integrate with Active Directory. Once we’ve got our bot working as expected, we’ll bolt on Amazon Connect to allow users to dial in and converse with our new bot.

Dynamically rename an AWS Windows host deployed via a sysprepped AMI

One of my customers presented me with a unique problem last week. They needed to rename a Windows Server 2016 host deployed using a custom AMI without rebooting during the bootstrap process. This lack of a reboot rules out the simple option of using the PowerShell Rename-Computer Cmdlet. While there are a number of methods to do this, one option we came up with is dynamically updating the sysprep unattended answer file using a PowerShell script prior to the unattended install running during first boot of a sysprepped instance.

To begin, we looked at the Windows Server 2016 sysprep process on an EC2 instance to get an understanding of the process required. It’s important to note that this process is slightly different to Server 2012 on AWS, as EC2Launch has replaced EC2Config. The high level process we need to perform is the following:

  1. Deploy a Windows Server 2016 AMI
  2. Connect to the Windows instance and customise it as required
  3. Create a startup PowerShell script to set the hostname that will be invoked on boot before the unattended installation begins
  4. Run InitializeInstance.ps1 -Schedule to register a scheduled task that initializes the instance on next boot
  5. Modify SysprepInstance.ps1 to replace the cmdline registry key after sysprep is complete and prior to shutdown
  6. Run SysprepInstance.ps1 to sysprep the machine.

Steps 1 and 2 are fairly self-explanatory, so let’s start by taking a look at step 3.

After a host is sysprepped, the value for the cmdline registry key located in HKLM:\System\Setup is populated with the windeploy.exe path, indicating that it should be invoked after reboot to initiate the unattended installation. Our aim is to ensure that the answer file (unattend.xml) is modified to include the computer name we want the host to be named prior to windeploy.exe executing. To do this, I’ve created the following PowerShell script named startup.ps1 and placed it in C:\Scripts to ensure my hostname is based on the internal IP address of my instance.
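The script only needs a handful of lines. A sketch along these lines works, assuming the EC2Launch answer file is in its default location and already contains a ComputerName element; adjust the naming convention to suit, keeping within the 15-character NetBIOS limit:

    # startup.ps1 - invoked on first boot in place of windeploy.exe
    # Build a hostname from the primary private IPv4 address, e.g. 10.1.2.3 becomes ip-10-1-2-3
    $ip = (Get-NetIPConfiguration | Where-Object { $_.IPv4DefaultGateway } | Select-Object -First 1).IPv4Address.IPAddress
    $hostname = 'ip-' + ($ip -replace '\.', '-')

    # Inject the computer name into the EC2Launch unattended answer file
    $unattend = 'C:\ProgramData\Amazon\EC2-Windows\Launch\Sysprep\Unattend.xml'
    (Get-Content $unattend -Raw) -replace '<ComputerName>.*?</ComputerName>', "<ComputerName>$hostname</ComputerName>" |
        Set-Content $unattend

    # Hand control back to Windows setup to run the unattended installation
    Start-Process -FilePath "$env:SystemRoot\System32\oobe\windeploy.exe"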

Once this script is in place, we can move to step 4, where we schedule the InitializeInstance.ps1 script to run on first boot. This can be done by running InitializeInstance.ps1 -Schedule, located in C:\ProgramData\Amazon\EC2-Windows\Launch\Scripts.

Step 5 requires a modification of the SysprepInstance.ps1 file to ensure the shutdown after Sysprep is delayed to allow modification of the cmdline registry entry mentioned in step 3. The aim here is to replace the windeploy.exe path with our startup script, then shutdown the host. The modifications to the PowerShell script to accommodate this are made after the “#Finally, perform Sysprep” comment.
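The exact edit depends on the version of SysprepInstance.ps1 shipped with EC2Launch, but the shape of the change is roughly as follows: run sysprep without the automatic shutdown, repoint the CmdLine registry value at our startup script, then shut down explicitly. $answerFilePath is the variable already defined earlier in the script:

    # Finally, perform Sysprep -- modified to quit rather than shut down immediately
    Start-Process -FilePath "$env:SystemRoot\System32\Sysprep\Sysprep.exe" `
        -ArgumentList '/oobe', '/generalize', '/quit', "/unattend:$answerFilePath" -Wait

    # Point the post-sysprep setup command at our startup script instead of windeploy.exe
    Set-ItemProperty -Path 'HKLM:\System\Setup' -Name 'CmdLine' `
        -Value 'powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\startup.ps1'

    # Shut down so an AMI can be taken from the stopped instance
    Stop-Computer -Force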

Finally, run the SysprepInstance.ps1 script.

Once the host changes to the stopped state, you can now create an AMI and use this as your image.

Viewing AWS CloudFormation and bootstrap logs in CloudWatch

Mature cloud platforms such as AWS and Azure have simplified infrastructure provisioning with toolsets such as CloudFormation and Azure Resource Manager (ARM) to provide an easy way to create and manage a collection of related infrastructure resources. Both tool sets allow developers and system administrators to use JavaScript Object Notation (JSON) to specify resources to provision, as well as provide the means to bootstrap systems, effectively allowing for single click fully configured environment deployments.

While these toolsets are an excellent means to prevent RSI from performing repetitive monotonous tasks, the initial writing and testing of templates and scripts can be incredibly time consuming. Troubleshooting and debugging bootstrap scripts usually involves logging into hosts and checking log files. These hosts are often behind firewalls, resulting in the need to use jump hosts which may be MFA integrated, all resulting in a reduced appetite for infrastructure as code.

One of my favourite things about Azure is the ability to watch the ARM provisioning and host bootstrapping process through the console. Unless there’s a need to rerun a script on a host and watch it in real time, troubleshooting the deployment failure can be performed by viewing the ARM deployment history or viewing the relevant Virtual Machine Extension. Examples can be seen below:

This screenshot shows the ARM resources have been deployed successfully.

This screenshot shows the DSC extension status, with more deployment details on the right pane.

While this seems simple enough in Azure, I found it a little less straightforward in AWS. Like Azure, bootstrap logs for the instance reside on the host itself, however the logs aren’t shown in the console by default. Although there’s a blog post on AWS to view CloudFormation logs in CloudWatch, it was tailored to Linux instances. Keen for a similar experience to Azure, I decided to put together the following instructions to have bootstrap logs appear in CloudWatch.

To enable CloudWatch for instances dynamically, the first step is to create an IAM role that can be attached to EC2 instances when they’re launched, providing them with access to CloudWatch. The following JSON code shows a sample policy I’ve used to define my IAM role.
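Something along these lines covers the CloudWatch Logs permissions the plugin needs; tighten the resource if your environment requires it:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:DescribeLogStreams"
          ],
          "Resource": "arn:aws:logs:*:*:*"
        }
      ]
    }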

The next task is to create a script that can be used at the start of the bootstrap process to dynamically enable the CloudWatch plugin on the EC2 instance. The plugin is disabled by default and when enabled, requires a restart of the EC2Config service. I used the following script:
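A script along these lines does the job; the plugin name and config path are those used by EC2Config, so double-check them against your AMI:

    # set-cloudwatch.ps1 - enable the CloudWatch plugin in the EC2Config settings file
    $configPath = 'C:\Program Files\Amazon\Ec2ConfigService\Settings\Config.xml'

    [xml]$config = Get-Content $configPath
    $plugin = $config.Ec2ConfigurationSettings.Plugins.Plugin |
        Where-Object { $_.Name -eq 'AWS.EC2.Windows.CloudWatch.PlugIn' }
    $plugin.State = 'Enabled'
    $config.Save($configPath)

    # EC2Config is set to recover automatically, so killing the process is enough to restart it
    Stop-Process -Name 'Ec2Config' -Force -ErrorAction SilentlyContinue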

It’s worth noting that the EC2Config service is set to recover by default and therefore starts by itself after the process is killed.

Now that we’ve got a script to enable the CloudWatch plugin, we need to change the default CloudWatch config file on the host prior to enabling the plugin. The default CloudWatch config file is AWS.EC2.Windows.CloudWatch.json and contains details of all the logs that should be monitored as well as defining CloudWatch log groups and log streams. Because there are a considerable number of changes made to the default file to achieve the desired result, I prefer to create and store a customised version of the file in S3. As part of the bootstrap process, I download it to the host and place it in the default location. My customised CloudWatch config file looks something like the example below.
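The example below is a trimmed-down illustration rather than my exact file. The log group names, region, and in particular the timestamp formats are placeholders that need to be checked against your environment and the actual format of your log files:

    {
      "IsEnabled": true,
      "EngineConfiguration": {
        "PollInterval": "00:00:15",
        "Components": [
          {
            "Id": "ApplicationEventLog",
            "FullName": "AWS.EC2.Windows.CloudWatch.EventLog.EventLogInputComponent,AWS.EC2.Windows.CloudWatch",
            "Parameters": { "LogName": "Application", "Levels": "7" }
          },
          {
            "Id": "SystemEventLog",
            "FullName": "AWS.EC2.Windows.CloudWatch.EventLog.EventLogInputComponent,AWS.EC2.Windows.CloudWatch",
            "Parameters": { "LogName": "System", "Levels": "7" }
          },
          {
            "Id": "DSCEventLog",
            "FullName": "AWS.EC2.Windows.CloudWatch.EventLog.EventLogInputComponent,AWS.EC2.Windows.CloudWatch",
            "Parameters": { "LogName": "Microsoft-Windows-DSC/Operational", "Levels": "7" }
          },
          {
            "Id": "CfnInitLog",
            "FullName": "AWS.EC2.Windows.CloudWatch.CustomLog.CustomLogInputComponent,AWS.EC2.Windows.CloudWatch",
            "Parameters": {
              "LogDirectoryPath": "C:\\cfn\\log",
              "TimestampFormat": "yyyy-MM-dd HH:mm:ss,fff",
              "Encoding": "ASCII",
              "Filter": "cfn-init.log",
              "CultureName": "en-US",
              "TimeZoneKind": "Local"
            }
          },
          {
            "Id": "EC2ConfigLog",
            "FullName": "AWS.EC2.Windows.CloudWatch.CustomLog.CustomLogInputComponent,AWS.EC2.Windows.CloudWatch",
            "Parameters": {
              "LogDirectoryPath": "C:\\Program Files\\Amazon\\Ec2ConfigService\\Logs",
              "TimestampFormat": "yyyy-MM-ddTHH:mm:ss.fffZ:",
              "Encoding": "ASCII",
              "Filter": "EC2ConfigLog.txt",
              "CultureName": "en-US",
              "TimeZoneKind": "UTC"
            }
          },
          {
            "Id": "CloudWatchEventLogs",
            "FullName": "AWS.EC2.Windows.CloudWatch.CloudWatchLogsOutput,AWS.EC2.Windows.CloudWatch",
            "Parameters": { "AccessKey": "", "SecretKey": "", "Region": "ap-southeast-2", "LogGroup": "WindowsEventLogs-Log-Group", "LogStream": "{instance_id}" }
          },
          {
            "Id": "CloudWatchCfnInit",
            "FullName": "AWS.EC2.Windows.CloudWatch.CloudWatchLogsOutput,AWS.EC2.Windows.CloudWatch",
            "Parameters": { "AccessKey": "", "SecretKey": "", "Region": "ap-southeast-2", "LogGroup": "cfninit-Log-Group", "LogStream": "{instance_id}" }
          },
          {
            "Id": "CloudWatchEC2ConfigLog",
            "FullName": "AWS.EC2.Windows.CloudWatch.CloudWatchLogsOutput,AWS.EC2.Windows.CloudWatch",
            "Parameters": { "AccessKey": "", "SecretKey": "", "Region": "ap-southeast-2", "LogGroup": "EC2ConfigLog-Log-Group", "LogStream": "{instance_id}" }
          }
        ],
        "Flows": {
          "Flows": [
            "(ApplicationEventLog,SystemEventLog,DSCEventLog),CloudWatchEventLogs",
            "CfnInitLog,CloudWatchCfnInit",
            "EC2ConfigLog,CloudWatchEC2ConfigLog"
          ]
        }
      }
    }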

Let’s take a closer look at what’s happening here. The first three components are windows event logs I’m choosing to monitor:

You’ll notice I’ve included the Desired State Configuration (DSC) event logs, as DSC is my configuration management tool of choice when it comes to Windows. When defining a Windows event log, a level needs to be specified, indicating the verbosity of the output. The values are as follows:

1 – Only error messages uploaded.
2 – Only warning messages uploaded.
4 – Only information messages uploaded.

You can add values together to include more than one type of message. For example, 3 means that error messages (1) and warning messages (2) get uploaded. A value of 7 means that error messages (1), warning messages (2), and information messages (4) get uploaded. For those familiar with Linux permissions, this probably looks very familiar! 🙂

To monitor other windows event logs, you can create additional components in the JSON template. The value of “LogName” can be found by viewing the properties of the event log file, as shown below:


The next two components monitor the two logs that are relevant to the bootstrap process:

Once again, a lot of this is self-explanatory. The “LogDirectoryPath” specifies the absolute directory path to the relevant log file, and the filter specifies the log filename to be monitored. The tricky thing here was getting the “TimestampFormat” parameter correct. I used this article on MSDN plus trial and error to work this out. Additionally, it’s worth noting that cfn-init.log’s timestamp is the local time of the EC2 instance, while EC2ConfigLog.txt takes on UTC time. Getting this right ensures you have the correct timestamps in CloudWatch.

Next, we need to define the log groups in CloudWatch that will hold the log streams. I’ve got three separate Log Groups defined:

You’ll also notice that the Log Streams are named after the instance ID. Each instance that is launched will create a separate log stream in each log group that can be identified by its instance ID.

Finally, the flows are defined:

This section specifies which logs are assigned to which Log Group. I’ve put all the WindowsEventLogs in a single Log Group, as it’s easy to search based on the event log name. It’s not as easy to differentiate between cfn-init.log and EC2ConfigLog.txt entries, so I’ve split them out into their own Log Groups.

So how do we get this customised CloudWatch config file into place? My preferred method is to upload the file with the set-cloudwatch.ps1 script to a bucket in S3, then pull them down and run the PowerShell script as part of the bootstrap process. I’ve included a subset of my standard cloudformation template below, showing the monitoring config key that’s part of the ConfigSet.
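A sketch of what that metadata can look like is below; it sits on the EC2 instance resource, the bucket name and paths are hypothetical, and a private bucket would also need an AWS::CloudFormation::Authentication section:

    "Metadata": {
      "AWS::CloudFormation::Init": {
        "configSets": {
          "config": [ "monitoring", "app" ]
        },
        "monitoring": {
          "files": {
            "C:\\cfn\\scripts\\set-cloudwatch.ps1": {
              "source": "https://s3.amazonaws.com/my-bootstrap-bucket/set-cloudwatch.ps1"
            },
            "C:\\Program Files\\Amazon\\Ec2ConfigService\\Settings\\AWS.EC2.Windows.CloudWatch.json": {
              "source": "https://s3.amazonaws.com/my-bootstrap-bucket/AWS.EC2.Windows.CloudWatch.json"
            }
          },
          "commands": {
            "01-enable-cloudwatch": {
              "command": "powershell.exe -ExecutionPolicy Bypass -File C:\\cfn\\scripts\\set-cloudwatch.ps1",
              "waitAfterCompletion": "0"
            }
          }
        }
      }
    }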

What does this look like in the end? Here we can see the log groups specified have been created:


If we drill further down into the cfninit-Log-Group, we can see the instance ID of the recently provisioned host. Finally, if we click on the Instance ID, we can see the cfn-init.log file in all its glory. Yippie!


Hummm, looks like my provisioning failed because a file is non-existent. Bootstrap monitoring has clearly served its purpose! Now all that’s left to do is to teardown the infrastructure, remediate the issue, and reprovision!

The next step to reducing the amount of repetitive tasks in the infrastructure as code development process is a custom pipeline to orchestrate the provisioning and teardown workflow… More on that in another blog!

AWS CloudFormation AWS::RDS::OptionGroup Unknown option: Mirroring

Amazon recently announced Multi-AZ support for SQL Server in Sydney, which provides high availability for SQL RDS instances using SQL Server mirroring technology. In an effort to make life simpler for myself, I figured I’d write a CloudFormation template for future provisioning requests; however, it wasn’t as straightforward as I’d expected.

I began by trying to guess my way through the JSON resources, based on what I already knew from MySQL deployments. I figured the MultiAZ property was still relevant, so I hacked together a template and attempted to provision the stack, which failed, indicating the following error:

CREATE_FAILED        |  Invalid Parameter Combination: MultiAZ property cannot be used with SQL Server DB instances, use the Mirroring option in an option group associated with the DB instance instead.

The CloudFormation output clearly provides some guidance on the correct parameters required to enable mirroring in SQL Server. I had a bit of trouble tracking down documentation for the mirroring option, but after crawling the web for sample templates, I managed to put together the correct CloudFormation template, which can be seen below.
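A cut-down sketch of the relevant resources is below. The resource names, engine details, instance class and parameter references are placeholders; the important parts are the Mirroring option in the option group and the DB instance referencing that option group instead of using the MultiAZ property:

    "SQLOptionGroup": {
      "Type": "AWS::RDS::OptionGroup",
      "Properties": {
        "EngineName": "sqlserver-se",
        "MajorEngineVersion": "12.00",
        "OptionGroupDescription": "SQL Server mirroring for Multi-AZ",
        "OptionConfigurations": [
          { "OptionName": "Mirroring" }
        ]
      }
    },
    "SQLInstance": {
      "Type": "AWS::RDS::DBInstance",
      "Properties": {
        "Engine": "sqlserver-se",
        "LicenseModel": "license-included",
        "DBInstanceClass": "db.m4.large",
        "AllocatedStorage": "200",
        "MasterUsername": { "Ref": "DBUser" },
        "MasterUserPassword": { "Ref": "DBPassword" },
        "OptionGroupName": { "Ref": "SQLOptionGroup" }
      }
    }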

Excited and thinking I had finalised my template, I attempted to create the stack in ap-southeast-2 (Sydney, Australia), only for it to fail with a different error this time…

CREATE_FAILED        | Unknown option: Mirroring

Finding this output strange, I attempted to run the CloudFormation template in eu-west-1, which completed successfully.

With no options left, I decided to contact AWS Support who indicated that this is a known issue in the ap-southeast-2 region, which is also evident when attempting to create an option group in the GUI and the dropdown box is greyed out, as shown below.

What it should look like:

What it currently looks like:

The suggested workaround is to manually create the SQL RDS instance in the GUI which provides the mirroring option in the deployment wizard. Although the limitation is being treated with priority, there’s no ETA for resolution at the moment.

Hopefully this assists someone out there banging their head against the wall trying to get this to work!

Leveraging Cloud Storage for the Enterprise: Microsoft StorSimple – Part 1

Originally posted on Bobbie’s blog @ www.thecloudguy.info

It’s no secret that one of the biggest pain points for enterprises today is the rapid growth of unstructured data. The ability to manage, protect and archive an organisation’s most valuable assets is arguably one of the biggest strains on IT department budgets.

The advent of cloud technology has many organisations looking for a way to leverage pay-as-you-go cloud storage offerings to assist in the data life-cycle process. The difficulty with these offerings is that data is stored as objects rather than on file systems such as NFS and CIFS, meaning integration with existing business processes and solutions isn’t straightforward.

Cloud Storage Gateways

Cloud Storage Gateways resolve integration issues by bridging the gap between commonly used storage protocols such as iSCSI/CIFS/NFS and cloud-based object storage. Storage Gateways take the form of a network appliance, server or virtual machine that resides on-premises and typically provide storage capacity for frequently used data.

As data ages and is accessed less frequently, it is moved into cloud storage which can cost considerably less than traditional on-premises storage offerings. Additional features are integrated into cloud Storage gateways such as backup technology to protect large volumes that can no longer be protected using traditional means.

Microsoft StorSimple

Microsoft have a competitive hybrid cloud storage offering called Microsoft StorSimple that takes into consideration a wide range of existing business pain points such as backup and Disaster Recovery.

Microsoft StorSimple is a physical on-premises storage system that uses three tiers of storage: SSD, SAS, and cloud storage. A number of models are offered based on storage and performance requirements, however StorSimple’s ability to leverage cloud storage as a cold tier significantly reduces its on-premises footprint compared to other storage offerings.

Some of the main features of StorSimple include:

  • Storage tiering – StorSimple dynamically moves data between the tiers of disk based on how active data is, providing efficient data storage and maximum performance for heavily used data.
  • iSCSI protocol support – StorSimple volumes are provisioned as block storage, allowing them to be used for file, database, or hypervisor attached storage.
  • Encryption in-flight and at rest in the cloud.
  • High capacity – The largest StorSimple appliance can currently store up to 500TB of deduplicated data (including cloud storage) in eight rack units.
  • Snapshot backups – Traditionally, snapshots were not considered a reliable form of backup due to their reliance on the source physical storage appliance, however StorSimple snapshots are stored in geographically redundant storage accounts in Microsoft Azure, meaning six copies of data are stored across two geographically separate regions.
  • Single pane management – All StorSimple devices in an organisation, regardless of location, can be managed from the same interface in the Azure portal.
  • Near instant DR – In the event of a disaster, a backup can be cloned and mounted on a virtual or physical StorSimple device and brought online. Only a fraction of the volume needs to reside on the target StorSimple for the volume to come online.
  • Virtual StorSimple – Virtual StorSimple devices can be provisioned in Azure to provide DR and test capabilities for volumes that were previously, or are currently, hosted on-premises.
  • Deduplication and compression – Microsoft StorSimple is able to minimise disk footprint by using global deduplication and compression across on-premises and cloud storage.
  • Highly available physical storage architecture with dual components, to prevent single point of failure.

StorSimple in Action

Azure Portal Dashboard and Device Management

All StorSimple devices are managed from the familiar Azure Portal. The sample below shows five StorSimple devices with four being virtual and residing in Azure, and one a physical StorSimple 8100 running on-premises.

StorSimple Manager in Azure Portal

Each device can be selected and configured via the Portal, as shown below where I’m managing the network interface configuration for a physical StorSimple 8100 device.

Firstly we have the configuration of the WAN interface which is used to communicate with the cloud storage in Azure:

Managing WAN interface for StorSimple

Secondly, I can manage the iSCSI interface used to connect storage to local servers (note that the StorSimple 8000 series offers multiple redundant 10GbE iSCSI interfaces; however, the lab used for this blog post only has 1GbE switching infrastructure).

Managing iSCSI interface for StorSimple

Storage Provisioning

Compared to traditional storage systems, provisioning storage is incredibly simple (no pun intended!). Once a device is selected, the administrative user navigates to the Volume Containers menu and selects to add a Container (shown below).

Provisioning Storage on StorSimple

A Volume Container is used to group Volumes that share a storage account, bandwidth, and encryption key.

Best practice suggests a geographically redundant storage account is used in Azure to ensure data is highly available in the event of a regional disaster. Bandwidth can be throttled to ensure the WAN link is not saturated when tiering to cloud storage. If cloud storage encryption is used, an encryption key must be specified.

Once confirmed, the Volume Container is created and appears in the list of Containers for the device as shown below.

Creating a Volume Container on a StorSimple

A Volume can then be added to the Container:

Adding a Volume to a Container on a StorSimple

Notice that a usage type of “Tiered Volume” or “Archive Volume” can be selected which allows the StorSimple appliance to better judge where to initially place the data that is being moved to the Volume.

This can be handy for organisations that are looking to migrate stale data they are required to keep for compliance purposes to the cloud. Also note the capacity specified is always thin provisioned.

After confirming the basic settings, iSCSI initiators (servers) are specified that are allowed to access the volume. Once this is completed, the volume appears in the Volume Container page.

New Volume in Container on StorSimple

The Volume can now be attached to a host via the iSCSI protocol. Assuming that iSCSI connectivity is already configured, we log onto the server and perform a scan for new disks, which discovers the Volume recently provisioned as highlighted below.

StorSimple Volume available to Windows Host

This Volume can now be brought online, initialised, and assigned a drive letter, and then function as a drive on the server as shown below. Pretty simple stuff!

StorSimple Volume mounted as drive on server

One of the biggest benefits of StorSimple is its ability to provision traditional block storage which can be leveraged by familiar operating systems. Many other platforms offer storage in NAS form, requiring administrators to learn and manage another platform.

Data Protection and Backup

Now that we’ve provisioned a file system, how do we protect that data? Snapshots are used to protect data on a StorSimple in an efficient and reliable manner. Due to global deduplication, snapshots consume minimal storage in the cloud while still providing reliable protection due to Azure’s geographically redundant storage.

StorSimple backup policies and data protection is configured in the Azure Portal. Administrators navigate to the backup policies section of the device and add a new policy. Multiple Volumes can be grouped together within a policy to take crash-consistent snapshots across the multiple volumes.

Defining a backup policy on a StorSimple

A schedule is then defined. A policy can only have one schedule, however multiple policies can be defined for a Volume.

For example: a daily backup policy can be used to perform daily snapshots of a Volume and retain them for short periods of time, while a monthly backup policy can take a snapshot of the same Volume once a month and retain snapshots long term for compliance purposes.

Additionally, a snapshot can be stored on either local storage for rapid restores or cloud storage for resiliency.

Defining backup schedule on a StorSimple

Once the schedule is defined, it appears in the backup policies tab.

Backup policies tab for a StorSimple

Although the first cloud snapshot can take some time, as all Volume data that resides on-premises needs to be copied to the cloud, all subsequent snapshots are quick, as only the changed data is deduped and copied to cloud storage.

Below is a view from the backup catalog, after a backup is complete.

StorSimple backup catalog view

Well, that’s it from me for now! Stay tuned for part two, where I will dive deeper into disaster recovery scenarios, data mobility and performance monitoring.

Originally posted on Bobbie’s blog @ www.thecloudguy.info