DTA #DevOps: What to do when you have a BOSH outage

BOSH is an open source project that is used to package, deploy and manage cloud software. We recently had an outage within one of our BOSH environments which taught us a few things about BOSH we wanted to share.

[Image: the error generated by BOSH]

Background

> To make error is human. To propagate error to all server in automatic way is #devops.

An essential part of a BOSH environment is the director, typically a virtual machine that manages the lifecycle of the other virtual machines in the environment. BOSH uses a number of certificates, automatically generated at installation time, so its components can communicate securely. These are kept in a vars-store file generally called creds.yml.

One of our BOSH directors was deployed around one year ago, and its automatically generated certificates have a one year expiry. We first noticed the expiry when our CI system ran a `bosh deploy`, which gave this error:

```
Get https://10.x.x.x:25555/info: x509: certificate has expired or is not yet valid
```
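For anyone wanting to check ahead of time, the expiry of the director certificate can be read straight out of the vars store. This is a minimal sketch, assuming a bosh-deployment style creds.yml with the certificate at the default `/director_ssl/certificate` path:

```sh
# Read the director certificate out of the vars store and print its expiry.
# Assumes a bosh-deployment style creds.yml with the certificate stored at
# the default /director_ssl/certificate path.
bosh interpolate creds.yml --path /director_ssl/certificate \
  | openssl x509 -noout -enddate
```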

We decided to jump into a test environment and get new certificates by simply deleting all the certificates in creds.yml and redeploying the director. This mimics a new installation, where BOSH automatically generates the missing certificates it needs. After this finished, the director came up successfully and responded to bosh-cli commands like `bosh instances`. However, all running instances were reported as “unresponsive agent”. This was expected: the new certificates had not yet been copied onto the running instances, so the director could not talk to them. The jobs on the instances themselves were still running, and our applications were still servicing user requests.
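The redeploy itself is the usual `bosh create-env` against the same state and vars-store files. A rough sketch, where the manifest, state file and ops files are placeholders for whatever the environment actually uses:

```sh
# Redeploy the director; bosh create-env regenerates any credentials that are
# missing from the vars store. The manifest, state file and ops file names
# here are placeholders for this environment's real ones.
bosh create-env bosh.yml \
  --state state.json \
  --vars-store creds.yml \
  -o operations/example.yml \
  -v director_name=test-director
```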

We then ran `bosh cloud-check` on each deployment to recreate each instance. For deployments with five or fewer instances, the resurrector will also automatically recreate these instances. This meant that each existing instance was deleted and then recreated with the new certificates. After that point, the director could communicate with each instance happily.
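A loop along these lines covers every deployment. It assumes the CLI’s `--column` flag is available to list just the deployment names; dropping `--auto` lets you choose the “Recreate VM” resolution interactively instead:

```sh
# Run cloud-check against every deployment so unresponsive instances are
# recreated with the new certificates. Assumes --column is available to list
# just the deployment names; drop --auto to review each resolution by hand.
for deployment in $(bosh deployments --column=name); do
  bosh -d "$deployment" cloud-check --auto
done
```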

We decided we had a viable way to rotate our certificates, albeit with an outage for instances that did not have redundancy. In this particular environment that was acceptable, as long as the outage was scheduled outside business hours.

All too easy… until it wasn’t

We began our maintenance by deleting the certificates from creds.yml using `vi`. We later realised that some other lines had been accidentally deleted as well, in particular the line containing `credhub_encryption_password`.

As planned, we then ran `bosh create-env` to recreate the BOSH director. It went ahead and generated new entries for everything it needed that was missing from creds.yml, including our certificates and a new `credhub_encryption_password`.

Then `bosh create-env` gave this error:

```
Deploying:
 Running the post-start script:
   Sending 'get_task' to the agent:
     Agent responded with error: Action Failed get_task: Task 1a06760d-a750-4d94-59ee-d65ec8793c56 result: 1 of 3 post-start scripts failed. Failed Jobs: credhub. Successful Jobs: director, uaa.
```

Initially we thought this was a problem with one of the certificates being rotated, so we re-ran `bosh create-env` but got the same error.

Meanwhile, as we looked at why credhub was failing, the resurrector was still running. We found out later that the resurrector had seen all our instances as “unresponsive agent” and kicked off jobs to recreate them. It just so happens that all of our deployments in this environment have a maximum of five instances, so the resurrector happily requested a fix for every deployment. The director went from seeing all instances as unhealthy (“unresponsive agent”) to slowly seeing more and more instances in the healthy “running” process state.

Around this time, alerts started going off as our applications stopped responding. A `bosh instances --ps` showed that there were no jobs running on any of these supposedly healthy “running” instances.

We tried to run `bosh recreate` on a deployment, which was when we got errors like this:

```
Task 147 | 06:39:48 | Preparing deployment: Preparing deployment (00:00:00)
Task 147 | 06:39:49 | Error: Unable to render instance groups for deployment. Errors are:
 - Unable to render jobs for instance group 'web'. Errors are:
   - Unable to render templates for job 'atc'. Errors are:
     Failed to open TCP connection to 10.x.x.x:8844 (Connection refused - connect(2) for "10.x.x.x" port 8844)
```

Port 8844 is CredHub, the job that had failed to start on the director, so the director could not look up credentials when rendering job templates. It appears the resurrector ignores these errors when requesting a fix for a deployment: the director kept happily recreating all our instances, and while it could communicate with each new instance, it then failed to install and configure the jobs on them, across all our deployments. The failures showed in `bosh events`, but did not halt the automated behaviour. The result was that the director was effectively wiping all our virtual machines.
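With hindsight, pausing the resurrector for the duration of this kind of maintenance would have stopped the automated recreates. A minimal sketch, using the director-wide toggle:

```sh
# Turn the resurrector off for the maintenance window and back on afterwards.
# update-resurrection is a director-wide toggle.
bosh update-resurrection off
# ... rotate certificates and recreate instances ...
bosh update-resurrection on
```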

Once we realised the `credhub_encryption_password` had been changed, we reverted it to the original value, and `bosh create-env` then finished successfully and fixed the director. However, the damage from the resurrector was already done, and we had a much longer outage on each instance than we planned. We then went through and ran `bosh recreate` on each deployment to be sure we had fixed them all.

Wrap-up

We were quite unlucky to accidentally change our `credhub_encryption_password` at the same time as rotating our certificates, but it reminded us of the risk of a single engineer manually editing such files. Given the risk, we will pair on changes like this in the future, and script it the next time we need to rotate our creds.yml certificates.
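As a starting point for that script, something along these lines removes only the certificate entries and refuses to continue if `credhub_encryption_password` has changed. It assumes `yq` (v4) is installed, and the list of certificate keys is illustrative; it would need to match the real creds.yml:

```bash
#!/usr/bin/env bash
# Sketch of a scripted certificate rotation: back up the vars store, delete
# only the certificate entries, and bail out if credhub_encryption_password
# was touched. Requires yq v4; the key names below are illustrative examples
# and must be adjusted to match the actual creds.yml.
set -euo pipefail

vars_store=creds.yml
cp "$vars_store" "${vars_store}.bak.$(date +%Y%m%d%H%M%S)"

before=$(bosh interpolate "$vars_store" --path /credhub_encryption_password)

# Illustrative certificate entries; list every certificate variable here.
for key in director_ssl nats_server_tls nats_clients_director_tls credhub_tls uaa_ssl; do
  yq -i "del(.${key})" "$vars_store"
done

after=$(bosh interpolate "$vars_store" --path /credhub_encryption_password)
if [ "$before" != "$after" ]; then
  echo "credhub_encryption_password changed unexpectedly; restore the backup" >&2
  exit 1
fi
```

Running `bosh create-env` afterwards regenerates only the entries that were deleted.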

We hope that sharing our experiences helps other BOSH operators in troubleshooting similar issues.