Highly available deployment pattern for heritage Windows applications on AWS

Quite often I am presented with the challenge of deploying Windows COTS applications onto the AWS platform with a requirement to take advantage of cloud-native patterns like auto-scaling and auto-healing. In this blog post I'm going to describe how I've used Auto Scaling groups, load balancers, CloudWatch alarms and Route 53 to provide a self-healing implementation for a heritage COTS Windows application. This pattern was also extended to use lifecycle hooks to support a Blue/Green deployment with zero downtime.

This pattern works quite nicely for heritage applications which call for an Active/Passive configuration and have a stateless web tier, i.e. the Passive node does not have full application capability or is write-only. When the primary node is unavailable during a failure or a Blue/Green upgrade, clients are transparently redirected to the passive node. I like to use the term "heritage" as it seems to have a softer ring than "legacy". During an actual failure, the outage is less than two minutes while automatic failover completes.

The diagram below summarises a number of the key components used in the design. In essence we have two Auto Scaling groups, each with a minimum and maximum of 1, across two Availability Zones. We have a private Route 53 hosted zone (int.aws) to host custom CNAME records which typically point to load balancers. We also have a cross-zone load balancer; in the example below I'm using a Classic Load Balancer as I'm not doing SSL offload, however it could just as easily be an Application Load Balancer. Route 53 and custom CloudWatch alarms have been utilised to reduce the time required to fail over between nodes and to support separate configuration of the primary and secondary nodes.
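As a rough sketch (the group, launch configuration, subnet and ELB names below are placeholders rather than values from the original design), each per-AZ Auto Scaling group looks something like the following, with the ELB health check driving instance replacement:

aws autoscaling create-auto-scaling-group --auto-scaling-group-name app-primary-asg --launch-configuration-name app-primary-lc --min-size 1 --max-size 1 --desired-capacity 1 --vpc-zone-identifier subnet-0aaa1111 --load-balancer-names app-primary-elb --health-check-type ELB --health-check-grace-period 300

The secondary group is identical apart from its name, subnet and ELB.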

A number of other assumptions:

  • A CloudWatch alarm is set to detect when the number of healthy nodes behind the Auto Scaling group (ASG) ELB is less than 1 (a sketch follows this list). The current minimum polling interval is 60 seconds.
  • Independent server components, which can support different configurations, i.e. primary/secondary config
  • Route 53 record (TTL 30 seconds), with a CNAME created in internal DNS (app.corp.com) pointing to the Route 53 CNAME (dns.master.int.aws).
  • ASG health checks on TCP port 443 (5-second interval, healthy and unhealthy thresholds of 2). There is no point in setting these any more aggressively, as failover is dependent on the CloudWatch alarm interval.
  • Single ASG deployed within each availability zone
  • Web tier is stateless
  • ELB still deployed over two availability zones.
  • TCP port monitors configured without SSL offload
  • No session stickiness configured, as there is only a single server behind each ASG/ELB. In a failover scenario clients will need to re-authenticate.
  • Use pre-baked AMIs to support shortest possible healing times.
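A minimal sketch of that CloudWatch alarm (the alarm name, ELB name and SNS topic ARN are placeholders):

aws cloudwatch put-metric-alarm --alarm-name app-primary-no-healthy-hosts --namespace AWS/ELB --metric-name HealthyHostCount --dimensions Name=LoadBalancerName,Value=app-primary-elb --statistic Minimum --period 60 --evaluation-periods 1 --threshold 1 --comparison-operator LessThanThreshold --alarm-actions arn:aws:sns:ap-southeast-2:123456789012:failover-topic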

Under normal behaviour, client traffic is directed to the Active node in AZ A.


The instance fails, and within 60 seconds the CloudWatch alarm is triggered.


The Route 53 health check changes state and Route 53 updates the DNS record to point to the passive node. Clients now access the secondary/passive server, and may need to re-authenticate if the application requires a stateful session.
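One way to wire this up (a sketch only, not necessarily the original implementation; the hosted zone ID, health check ID and ELB DNS names are placeholders) is a pair of Route 53 failover records, with the PRIMARY record associated with a health check driven by the CloudWatch alarm:

aws route53 change-resource-record-sets --hosted-zone-id Z111INTAWS --change-batch file://failover-records.json

where failover-records.json contains:

{
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "dns.master.int.aws", "Type": "CNAME", "TTL": 30,
        "SetIdentifier": "primary", "Failover": "PRIMARY",
        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
        "ResourceRecords": [ { "Value": "app-primary-elb-123456.ap-southeast-2.elb.amazonaws.com" } ] } },
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "dns.master.int.aws", "Type": "CNAME", "TTL": 30,
        "SetIdentifier": "secondary", "Failover": "SECONDARY",
        "ResourceRecords": [ { "Value": "app-secondary-elb-123456.ap-southeast-2.elb.amazonaws.com" } ] } }
  ]
}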


Auto-healing rebuilds the failed server within AZ A.


The rebuilt primary node now passes the Route 53 health check, so Route 53 updates the DNS record back to the primary node. Clients may need to re-authenticate if the application requires a stateful session.

Secondary Node Failure

If the secondary instance fails, there is no service disruption, as traffic is only ever sent to this node during a primary node failure.

Availability Zone Failure

Availability Zone failures behave in a similar manner to an instance failure and depend on the CloudWatch alarm firing.

Blue/Green deployments

Blue/Green deployments can be achieved using the same mechanisms described above.

On the left we see the existing release/build of the application stack, whilst on the right is the environment to be built. These are all within the same account and the same Availability Zones, just different CloudFormation stacks. Two stages are described: a deploy stage, where the new environment is built, and a release stage, where DNS is cut over. No additional build activities are conducted during the release stage.

DEPLOY Stage

1. Servers are built as independent components and then baked as AMIs.

2. The server 2 component from the previous build is scaled down.

3. Server 2 is scaled up as part of the deploy stage. The team can now test and validate this release via the ELB in front of the second instance, prior to release. I like to include custom headers with the server name and specific build number in order to easily identify which server I am hitting, which can be checked through the Chrome debugger or Fiddler; a curl example follows below.
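A quick way to confirm which server and build you are hitting from the command line (the host name and header names here are hypothetical examples, not from the original configuration):

curl -k -I https://server2.int.aws/

The custom headers, e.g. X-Server-Name and X-Build-Number in this sketch, then appear alongside the standard response headers.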

RELEASE Stage

4. Route 53 DNS is automatically updated to point to the server 2 ELB (see the sketch after these steps). There is no service outage.

5. The previous build's primary instance is terminated, and the primary server is now built within the new stack.


6. The server 1 bootstrap is initiated within the new CloudFormation stack.

7. Route 53 DNS is updated to the CNAME of the ELB in front of the primary node, and normal service resumes in the newly deployed/released environment.
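The DNS cutovers in steps 4 and 7 can be scripted as a simple UPSERT against the private hosted zone (a sketch only; the zone ID, record name and ELB DNS name are placeholders):

aws route53 change-resource-record-sets --hosted-zone-id Z111INTAWS --change-batch file://cutover.json

where cutover.json contains:

{
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "dns.master.int.aws", "Type": "CNAME", "TTL": 30,
        "ResourceRecords": [ { "Value": "app-green-elb-123456.ap-southeast-2.elb.amazonaws.com" } ] } }
  ]
}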


Originally published at http://cloudconsultancy.info

Experiences with the new AWS Application Load Balancer

Originally posted on Andrew’s blog @ cloudconsultancy.info

Summary

Recently I had an opportunity to test drive the AWS Application Load Balancer, as my client had a requirement to make their WebSocket application fault tolerant. The implementation was a complete Windows stack and utilised ADFS 2.0 for SAML authentication; however, this should not affect other implementations.

The AWS Application Load Balancer is a fairly new feature which provides Layer 7 load balancing and support for HTTP/2 as well as WebSockets. In this blog post I will include examples of the configuration that I used, as well as some of the troubleshooting steps I needed to take to resolve issues.

The Application Load Balancer is an independent AWS resource from the Classic ELB and is managed via the aws elbv2 CLI, with a number of different properties.

Benefits of Application Load Balancer include:

  • Content-based routing, i.e. route /store to a different set of instances from /apiv2
  • Support for WebSockets
  • Support for HTTP/2 (over HTTPS only); a single multiplexed stream gives much larger throughput, which makes it great for mobile and other high-latency apps
  • Roughly 10% cheaper than the Classic Load Balancer
  • Cross-zone load balancing is always enabled for the ALB.

Some changes that I’ve noticed:

  • The load balancing algorithm used by the Application Load Balancer is currently round robin.
  • Cross-zone load balancing is always enabled for an Application Load Balancer, whereas it is disabled by default for a Classic Load Balancer.
  • With an Application Load Balancer, the idle timeout value applies only to front-end connections and not to the LB-to-server connections, which prevents the load balancer cycling the back-end connection.
  • The Application Load Balancer is exactly that and operates at Layer 7, so if you want to perform SSL bridging, use a Classic Load Balancer with TCP listeners and configure SSL certificates on your server endpoint.
  • A cookie-expiration-period value of 0 (deferring session timeout to the application) is not supported. I ended up having to configure the stickiness.lb_cookie.duration_seconds attribute instead; I'd suggest making this one minute longer than the application session timeout, in my example a value of 1860 (a sketch follows this list).
  • The X-Forwarded-For header is still supported and should be utilised if you need to track client IP addresses, which is particularly useful when traffic passes through a proxy server.
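A sketch of configuring those stickiness attributes on the target group (the target group ARN is a placeholder):

aws elbv2 modify-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:ap-southeast-2:123456789012:targetgroup/service-targets/1234567890abcdef --attributes Key=stickiness.enabled,Value=true Key=stickiness.type,Value=lb_cookie Key=stickiness.lb_cookie.duration_seconds,Value=1860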

For more detailed information from AWS see http://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html.

Importing SSL Certificate into AWS – Windows

(http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_server-certs.html )

  1. Convert the existing PKCS#12 (.pfx) key into .pem format for AWS

You'll need openssl for this, along with the .pfx file and the password for the SSL certificate.

I like to use Chocolatey as my Windows package manager, similar to yum or apt-get for Windows, which is a saviour for downloading packages and managing dependencies in order to support automation, but enough of that, check it out @ https://chocolatey.org/

Once choco is installed I simply execute the following from an elevated command prompt.

“choco install openssl.light”

Thereafter I run the following two commands, which break out the private and public keys (during which you'll be prompted for the password):

openssl pkcs12 -in keyStore.pfx -out SomePrivateKey.key -nodes -nocerts

openssl pkcs12 -in keyStore.pfx -out SomePublic.cert -nodes -nokeys

NB: I’ve found that sometimes copy and paste doesn’t work when trying to convert keys.

  2. Next you'll need to break out the trust chain into one contiguous file, like the following:
-----BEGIN CERTIFICATE-----

Intermediate certificate 2

-----END CERTIFICATE-----

-----BEGIN CERTIFICATE-----

Intermediate certificate 1

-----END CERTIFICATE-----

-----BEGIN CERTIFICATE-----

Optional: Root certificate

-----END CERTIFICATE-----

Save the file for future use, e.g. thawte_trust_chain.txt

The example above is for a Thawte trust chain with the following properties:

“thawte Primary Root CA” Thumbprint 91 c6 d6 ee 3e 8a c8 63 84 e5 48 c2 99 29 5c 75 6c 81 7b 81

With intermediate

“thawte SSL CA - G2” Thumbprint 2e a7 1c 36 7d 17 8c 84 3f d2 1d b4 fd b6 30 ba 54 a2 0d c5

Ordinarily you'll only have a root and an intermediate CA, although sometimes there will be a second intermediate CA.

Ensure that your certificates are Base64 encoded when you export them.

  3. Finally, authenticate to the AWS CLI (v1.11.14+ to support the aws elbv2 commands) by running “aws configure” and supplying your access and secret keys, region and output format, then execute the certificate upload. Please note that this includes some of the elements above, including the trust chain and the public and private keys.
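The exact command isn't shown here; as a sketch using the file names from the steps above (the certificate name is a placeholder), the upload via IAM looks roughly like:

aws iam upload-server-certificate --server-certificate-name serviceSSL --certificate-body file://SomePublic.cert --private-key file://SomePrivateKey.key --certificate-chain file://thawte_trust_chain.txt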

If you get an error like the one below:

A client error (MalformedCertificate) occurred when calling the UploadServerCertificate operation: Unable to validate certificate chain. The certificate chain must start with the immediate signing certificate, followed by any intermediaries in order. The index within the chain of the invalid certificate is: 2

Please check the contents of the original root and intermediate certificate files, as they probably still contain the Bag Attributes headers and other metadata, e.g.

Bag Attributes

localKeyID: 01 00 00 00

friendlyName: serviceSSL

subject=/C=AU/ST=New South Wales/L=Sydney/O=Some Company/OU=IT/CN=service.example.com

issuer=/C=US/O=thawte, Inc./CN=thawte SSL CA - G2

Bag Attributes

friendlyName: thawte

subject=/C=US/O=thawte, Inc./OU=Certification Services Division/OU=(c) 2006 thawte, Inc. - For authorized use only/CN=thawte Primary Root CA

issuer=/C=US/O=thawte, Inc./OU=Certification Services Division/OU=(c) 2006 thawte, Inc. - For authorized use only/CN=thawte Primary Root CA

AWS Application LB Configuration

The gist embedded in the original post walks through the configuration, with comments based on some gotchas encountered along the way.
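As a rough outline only (not the contents of the gist; the subnet, security group, VPC, instance IDs and ARNs are placeholders), the elbv2 configuration runs along these lines:

aws elbv2 create-load-balancer --name service-alb --subnets subnet-0aaa1111 subnet-0bbb2222 --security-groups sg-0ccc3333

aws elbv2 create-target-group --name service-targets --protocol HTTPS --port 443 --vpc-id vpc-0ddd4444

aws elbv2 register-targets --target-group-arn <target-group-arn> --targets Id=i-0eee5555 Id=i-0fff6666

aws elbv2 create-listener --load-balancer-arn <load-balancer-arn> --protocol HTTPS --port 443 --certificates CertificateArn=<certificate-arn> --default-actions Type=forward,TargetGroupArn=<target-group-arn>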

You should now be good to go; the load balancer takes a little while to warm up, but will then be available across multiple Availability Zones.

If you have issues connecting to the ALB, validate connectivity directly to the server using curl. Again Chocolatey comes in handy: “choco install curl”.
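For example, hitting an instance directly while presenting the load-balanced host name (the IP address and host name are placeholders):

curl -vk https://10.0.1.20/ -H "Host: service.example.com"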

Also double-check the security group registered against the ALB and confirm the NACLs.

WebServer configuration

You'll need to import the SSL certificate into the local computer certificate store. Some of the third-party issuing CAs (Ensign, Thawte, etc.) may not have their intermediate CA within the computer's trusted Root CA store, especially if the server is built in a network isolated from the internet, so after installing the SSL certificate on the server make sure that the trust chain is correct.

You won't need to update the local hosts file on the servers to point to the load-balanced address.

Implementation using CNAMEs

In the large enterprises where I've worked there have been long lead times associated with fairly simple DNS changes, which defeats some of the agility provided by cloud computing. A pattern I've often seen adopted is to use multiple CNAMEs to work around such lead times. Generally you'll have a subdomain somewhere that the Ops team has more control over, or shorter lead times for. Within the target domain (example.com) create a CNAME pointing to an address within the ops-managed domain (aws.corp.internal), and have a CNAME created within that zone to point to the ALB address, i.e.

service.example.com -> service.aws.corp.internal -> elbarn.region.elb.amazonaws.com

With this approach I can update service.aws.corp.internal to point at a new service which I've built behind a new ELB, and avoid the enterprise change lead times associated with a change in example.com.
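If the ops-managed zone happens to be a Route 53 hosted zone, that swing is a one-line change (the zone ID and ALB DNS name are placeholders):

aws route53 change-resource-record-sets --hosted-zone-id Z222CORPINT --change-batch file://service-swing.json

where service-swing.json is a single UPSERT of the service.aws.corp.internal CNAME pointing at the new ALB's DNS name, in the same form as the change batches shown earlier.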