A Way to Keep Logs Safe on Disposable Servers

Automatic replacement of failed cloud configuration items is a life-saver. Having items recover themselves with no ops team intervention is a sleep-saver too. With the responsibility of restoring service taken off our hands, the only outstanding task is often to explain what happened.
But what if the thing that failed was an EC2 application server running Red Hat, and the logs were on the server’s now-replaced volumes? The contents of /var/log are gone, and while we might be capturing them in a log aggregator like Splunk or a syslog system of some sort, those aren’t always simple to compile into a report or send to an application vendor for a post mortem. I had this problem recently, and the solution was to move logs off the instance using systemd path units (see man systemd.path). This replaces the older approach of using inotify-tools, which aren’t available in the repositories of Red Hat instances on EC2.
Using path unit configuration, we configure three things:

  1. A path watcher that sets out what directory we’re going to watch for changes
  2. A service file that describes a script to run when that path has any changes
  3. A script that conditionally copies or moves contents off to a file share or somewhere safe in case our instance is retired, whether that was deliberate or a nasty surprise

The path watcher

Create a file in /etc/systemd/system called something descriptive like logsaver.path, with the following contents.

Description= Triggers a service that keeps historical log files out of the blast radius of an instance replacement.
Documentation= man:systemd.path

The file holds some metadata about what we’re trying to achieve, then the path to watch, and some information about how the unit is targeted.
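
For reference, here is a minimal sketch of what a complete path unit could look like, wrapped in a small shell function that writes it out. The function name write_path_unit is hypothetical, and the PathChanged=, Unit= and WantedBy= values are assumptions (the watched directory is borrowed from the script later in this post), so adjust them to your own setup.

```shell
# Sketch only: writes a complete logsaver.path unit into the directory given as $1.
# The [Path] and [Install] values are assumptions, not a definitive configuration.
write_path_unit() {
  cat > "$1/logsaver.path" <<'EOF'
[Unit]
Description=Triggers a service that keeps historical log files out of the blast radius of an instance replacement.
Documentation=man:systemd.path

[Path]
# Assumed: fire whenever something in this directory changes
PathChanged=/var/log/fragileapp
# Optional: a path unit activates the service with the same base name by default
Unit=logsaver.service

[Install]
# Assumed target, so the watcher can be enabled at boot
WantedBy=multi-user.target
EOF
}
```

Run as root, write_path_unit /etc/systemd/system would put the file in place.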

The service

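The next step is to create the file /etc/systemd/system/logsaver.service, with the following contents. Giving it the same base name as the path unit matters, because a path unit activates the service of the same name by default.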

Description= Starts our log rescue script
Documentation= man:systemd.service

Again there is some metadata saying what we’re doing, followed by a service of type oneshot that runs the script specified in ExecStart once each time it is triggered.
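
As a rough sketch, a complete service unit along those lines might look like the following, again wrapped in a hypothetical helper function (write_service_unit). The script location /usr/local/bin/rescueRotatedLogs.sh is an assumption used for illustration; point ExecStart at wherever your script actually lives.

```shell
# Sketch only: writes a complete logsaver.service unit into the directory given as $1.
# $2 is the script path; the default below is an assumed location.
write_service_unit() {
  cat > "$1/logsaver.service" <<EOF
[Unit]
Description=Starts our log rescue script
Documentation=man:systemd.service

[Service]
# oneshot: run the script to completion, then consider the service finished
Type=oneshot
ExecStart=${2:-/usr/local/bin/rescueRotatedLogs.sh}
EOF
}
```

Run as root, write_service_unit /etc/systemd/system would create the unit. The script itself needs to be executable and start with a shebang line, or systemd will fail to run it.
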
Now that we’ve got a path to watch and the location of a script, let’s write the actual script.

The script

In our rescueRotatedLogs.sh script, we can do the following:

#!/bin/bash
# Move files with a datestamp like 2015-04-03 in them somewhere else
for file in $(ls /var/log/fragileapp | grep -E "20[0-9]{2}-(0[1-9]|1[0-2])-([0-2][0-9]|3[0-1])"); do
  mv /var/log/fragileapp/"$file" /mnt/aMountpointToAShareOrDFSNS
done

I know what you’re thinking: regex! But it doesn’t need to be complicated. Application logs on Linux are usually straightforward: there is an application.log, and then a bunch of files named 2015-03-22-application.log.gz or similar, which means those logs have been rotated out and the app is currently writing to the main log. The rotated files are the ones the vendor is going to need for diagnosis, or the post-incident review committee is going to want. The regex just finds anything with an even vaguely date-like string in the name (it will match implausible dates too, like ones in 2099, but hopefully the instance didn’t crash because the app writes weird dates).
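
If your log file names might contain spaces, looping over the output of ls will split them into pieces. A slightly more defensive variant, sketched here under the assumption that a plain glob over the directory is acceptable, keeps the same date-matching idea but checks each file name individually. The function name rescue_rotated_logs and its two arguments are made up for illustration.

```shell
# Sketch: move files whose names contain a yyyy-mm-dd stamp from $1 to $2.
# The body runs in a subshell, so its variables don't leak into the caller.
rescue_rotated_logs() (
  src="$1"
  dest="$2"
  date_re='20[0-9]{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])'
  for path in "$src"/*; do
    # Skip directories, and the literal pattern left behind by an empty directory
    [ -f "$path" ] || continue
    name="${path##*/}"
    if printf '%s\n' "$name" | grep -Eq "$date_re"; then
      mv -- "$path" "$dest"/
    fi
  done
)
```

Calling rescue_rotated_logs /var/log/fragileapp /mnt/aMountpointToAShareOrDFSNS at the end of the script would then do the same job as the loop above.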
Once we have this all set up, we kick it off by starting our path watcher:

systemctl start logsaver.path
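
On a fresh install, systemd may not have noticed the new unit files yet, and a plain start won’t survive a reboot. A small sketch of the fuller sequence follows; the function name activate_logsaver is hypothetical, and the enable step assumes the path unit has an [Install] section.

```shell
# Sketch: reload systemd, then enable and start the path watcher.
activate_logsaver() {
  # Make systemd re-read its unit directories so it sees the new files
  systemctl daemon-reload || return 1
  # Start the watcher on every boot (needs an [Install] section in the path unit)
  systemctl enable logsaver.path || return 1
  # Start watching right now
  systemctl start logsaver.path || return 1
  # Show whether the watcher is active
  systemctl --no-pager status logsaver.path
}
```

Run activate_logsaver as root once both unit files and the script are in place.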

Now if you did this in a test system you’ll notice that all the rotated logs with an ISO 8601 ‘yyyy-mm-dd’ formatted date in them just got sent to the share. You could replace:

mv /var/log/fragileapp/"$file" /mnt/aMountpointToAShareOrDFSNS

with

aws s3 cp /var/log/fragileapp/"$file" s3://AnS3Bucket/ --sse AES256

if you have the AWS CLI tools installed on your instances and you want to be fully cloud (and who doesn’t).
Or maybe you’re multi-cloud:

az storage blob upload --container-name "$aContainerYouMadeBefore" --file /var/log/fragileapp/"$file" --name "$blob_name"

You could also replace the regex grep with a rule that moves any compressed files with the extensions zip or tar.gz, or anything with an extension of .1 to .12 to handle files rotated by logrotate. With this in place, we always have the historical logs of our disposable instance in a non-disposable place, and we are better positioned to understand what happened to our servers without being overly attached to them.
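
As a footnote to the alternative matching rules mentioned above, here is one way they could be sketched with a shell case statement instead of a regex. The function name is_rotated_log is hypothetical, and the patterns only cover the extensions named in this post.

```shell
# Sketch: succeed if the file name looks like an archived or logrotate-rotated log.
is_rotated_log() {
  case "$1" in
    *.zip|*.tar.gz)    return 0 ;;  # compressed archives
    *.[1-9]|*.1[0-2])  return 0 ;;  # logrotate suffixes .1 to .12
    *)                 return 1 ;;  # anything else, including the live log
  esac
}
```

Inside the loop, something like is_rotated_log "$file" && mv /var/log/fragileapp/"$file" /mnt/aMountpointToAShareOrDFSNS would take the place of the grep filter.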
