Recent growth in our Managed Services business (driven in part by our acquisition by Telstra) has meant that a number of tools and processes we previously took for granted have had to be re-assessed and re-architected so that we can scale while maintaining the same level of service at low cost.
One particular area we’ve recently reworked is how we remotely access and administer workloads within our customers’ AWS environments. Previous methods of access leveraged either static bastion hosts or VPN endpoints, and they worked well up to a point, but after analysing the overall footprint of resources used and the costs incurred, it became clear that we needed to find a better way.
The traditional approach of a single, common ‘shared’ management zone was discarded after analysing the various security and regulatory requirements of our customers, so we had to come up with something else. We needed a solution that was:
- Secure, preventing access by unwanted parties and encrypting our communications to and from the customer networks;
- Auditable, capturing when a DevOps engineer connected to and disconnected from a customer account;
- Resilient, able to operate in the event of a VM, host or AZ failure;
- Cost effective, aligning to AWS principles of making the most of the resources used; and of course
- Scalable, allowing us to have several users connected at once as well as being able to have the solution deployed across tens to hundreds of customer environments.
Traditional approaches using redundant, highly available VPN or virtual desktop capabilities seemed expensive and inefficient: they were always running (in case someone needed to connect), and that meant ongoing costs even when not in use. There had to be a better way. Looking at Auto Scaling Groups and other approaches where systems are treated as ephemeral rather than permanent, we started to toy with the idea of a remote access service created on demand: using AWS’s APIs to generate a temporary, nano-sized VPN server only when needed, then tearing it down when finished with. Basically, we would use AWS APIs (1) to create the VPN server, generate a temporary access key (kind of like a one-time access token) and then use this key to establish a tunnel into the VPC.
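To make the “created on demand” idea concrete, here is a minimal sketch of how the ephemeral server’s bootstrap could be assembled. This is not our production script: the instance bootstrap commands, file paths and OpenVPN options shown are illustrative assumptions, and the one-time certificates are simply baked into the user data so the server boots ready for a single client.

```python
# Sketch: build the user-data script for a throwaway OpenVPN server.
# Paths, package manager and config values are illustrative assumptions.

def build_user_data(server_cert: str, server_key: str, ca_cert: str,
                    vpn_port: int = 1194) -> str:
    """Return a bash user-data script that installs and starts OpenVPN.

    The one-time certificates generated on the engineer's workstation
    are embedded directly, so the server comes up ready to accept
    exactly one client and is never reused.
    """
    return f"""#!/bin/bash
yum install -y openvpn
cat > /etc/openvpn/server.crt <<'EOF'
{server_cert}
EOF
cat > /etc/openvpn/server.key <<'EOF'
{server_key}
EOF
cat > /etc/openvpn/ca.crt <<'EOF'
{ca_cert}
EOF
cat > /etc/openvpn/server.conf <<'EOF'
port {vpn_port}
proto udp
dev tun
ca /etc/openvpn/ca.crt
cert /etc/openvpn/server.crt
key /etc/openvpn/server.key
topology subnet
server 172.16.0.0 255.255.255.0
EOF
systemctl start openvpn@server
"""
```

The returned string would be passed as the `UserData` parameter when the instance is created, so no configuration needs to live on the server ahead of time.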
After a bit of tinkering, we managed to pull together a proof of concept that validated our objectives. I wanted to share the proof of concept environment we developed so that others can use it to reduce their own remote access costs; although we have evolved it beyond what is described below, the core concepts of how it operates remain the same. The proof of concept design consists of a few components:
- A workload AWS account and VPC – containing the systems that we manage and need network-level access to.
- A management AWS account and VPC – this is our entry point into the workload AWS account and VPC. It is peered with the workload VPC. For the PoC, the public-facing subnet, routing, peering and the role to be used by our OpenVPN instances (with CloudWatch Logs and CloudWatch permissions) are expected to be pre-created.
- (Optional) a SAML IdP – we use Azure AD as our IdP. This gives us a central location to store our engineers’ identities, so we don’t have to manage multiple sets of credentials as we horizontally scale our management accounts. For the PoC this is not required, but it is nice to have.
- Management role to assume – this AWS role needs enough permissions to create and configure EC2 instances, and must be able to be assumed by our engineers.
- The last piece required to pull the whole PoC together is a PowerShell script which coordinates everything; it is described in the execution flow further on.
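For the management role in the list above, a permissions policy along the following lines would be a reasonable starting point. This is a hedged sketch, not our production policy: the action list, the wildcard resources and the `openvpn-instance-role` name are all assumptions you would tighten for your own environment.

```python
import json

# Illustrative permissions policy for the management role described above.
# Actions and resource scoping are assumptions; restrict them to taste.
MANAGEMENT_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # create, inspect, tag and clean up the temporary OpenVPN instance
            "Effect": "Allow",
            "Action": [
                "ec2:RunInstances",
                "ec2:DescribeInstances",
                "ec2:CreateTags",
                "ec2:TerminateInstances",
            ],
            "Resource": "*",
        },
        {   # hand the pre-created CloudWatch Logs role to the instance
            "Effect": "Allow",
            "Action": ["iam:PassRole"],
            "Resource": "arn:aws:iam::*:role/openvpn-instance-role",
        },
    ],
}

print(json.dumps(MANAGEMENT_ROLE_POLICY, indent=2))
```

The `iam:PassRole` statement is what lets the script attach the instance role mentioned in the management-account bullet without granting any broader IAM rights.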
The script performs a number of actions once it is executed by the user (1). It uses the locally-installed OpenSSL binaries (https://wiki.openssl.org/index.php/Binaries) to generate a self-signed certificate pair (2) for the requested connection; a new set of certificates is generated every time the script is run, kind of like a single-use set of credentials.

From there it leverages an Azure AD login script (https://www.npmjs.com/package/aws-azure-login) to let the user authenticate against Azure AD via the command line (3). Username, password and MFA token are checked by Azure AD, and a SAML token is provided back to the script (4). This token is then presented to the AWS management account using the AssumeRoleWithSAML API, which authorises the user and returns a Security Token Service token for the assumed role (5, 6 and 7). The role has permissions to create an EC2 instance, plus some basic IAM permissions (to assign a role to the EC2 instance).

Once the role has been assumed, the script calls the EC2 APIs (8) to create the temporary OpenVPN server, with the setup and configuration passed in as user data; this includes the installation and configuration of OpenVPN, as well as the certificates generated at step 2. The script waits for the EC2 instance to be created successfully (9) and obtains the ephemeral public IP address of the system, plus the network routes within the VPC for local destinations as well as peered VPCs (10). It then creates the configuration file using the information gathered in steps 2 and 10, executes the OpenVPN client on the local system to establish a tunnel into the newly-created OpenVPN server, and updates the local route table to allow connectivity into the AWS networks. Once connected, the user is free to connect via SSH/RDP etc. to endpoints within the management or peered workload account (11).
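The final client-configuration step above can be sketched as a small pure function: given the ephemeral public IP from step 10 and the VPC CIDRs to route, it renders an OpenVPN client config. The directives used are standard OpenVPN client options, but the file names and the choice of port are illustrative assumptions.

```python
# Sketch: render the client-side OpenVPN config from the ephemeral public
# IP (step 10) and the local/peered VPC routes. File names are assumptions.
import ipaddress


def render_client_config(public_ip: str, vpc_cidrs: list,
                         port: int = 1194) -> str:
    lines = [
        "client",
        "dev tun",
        "proto udp",
        f"remote {public_ip} {port}",
        "ca ca.crt",
        "cert client.crt",   # the one-time certificate from step 2
        "key client.key",
        "nobind",
    ]
    # add a route for each local and peered VPC CIDR, so that only
    # AWS-bound traffic traverses the tunnel
    for cidr in vpc_cidrs:
        net = ipaddress.ip_network(cidr)
        lines.append(f"route {net.network_address} {net.netmask}")
    return "\n".join(lines) + "\n"
```

Because the routes are scoped to the VPC CIDRs rather than `0.0.0.0/0`, the engineer’s ordinary internet traffic never transits the temporary server.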
We’ve found that this whole process takes somewhere around 1 to 2 minutes to complete, all the way from certificate creation to tunnel establishment.
All in all, the whole solution is quite simple and makes use of a number of well-established AWS features to provide a very cost-effective and scalable way for our DevOps engineers to access a remote environment. To top it all off, Amazon’s (not so) recent move to per-second billing for Linux workloads makes this approach even more attractive, ensuring that we only pay for the resources we actually use.