Pacemaker and the recent GitHub service interruption

It never fails. Someone manages to break their Pacemaker cluster, and Henrik starts preaching his usual sermon about why Pacemaker is terrible and why you should never, ever use it. And when that someone is GitHub, which we all know, use and love, that sermon gets a bit of extra attention. Let's take a quick look at the facts.

The week of September 10, GitHub suffered a couple of outages that caused a total downtime of 1 hour and 46 minutes, as Jesse precisely pointed out in a blog post. Exhibiting the excellent transparency that GitHub always offers whenever its infrastructure is affected by issues (remember their role-model behavior in an SSH security incident a few months back), Jesse explains, in great detail, what happened on one of their Pacemaker clusters.

Now, all of what follows is based exclusively on the information in that blog post of Jesse's. I have no inside knowledge of the incident, so my picture may be incomplete or skewed. But here's my take on it anyway. I do encourage you to read Jesse's post in full, as the rest of this post otherwise won't make much sense. I'll just quote certain pieces of it and comment on them here.

Please note: nothing in this post should be construed as a put-down of GitHub's excellent staff. They run a fantastic service and do an awesome job. It's just that their post-mortem seems to have created some misconceptions in the MySQL community about the Pacemaker stack as a whole, and those I'd like to help rectify. Also, I'm posting this in the hope that it provides useful insight to both the GitHub folks, and to anyone else facing similar issues.

Enable Maintenance Mode when you should

From the original post:

Monday's migration caused higher load on the database than our operations team has previously seen during these sorts of migrations. So high, in fact, that they caused Percona Replication Manager's health checks to fail on the master. In response to the failed master health check, Percona Replication manager moved the 'active' role and the master database to another server in the cluster and stopped MySQL on the node it perceived as failed.

At the time of this failover, the new database selected for the 'active' role had a cold InnoDB buffer pool and performed rather poorly. The system load generated by the site's query load on a cold cache soon caused Percona Replication Manager's health checks to fail again, and the 'active' role failed back to the server it was on originally.

At this point, I decided to disable all health checks by enabling Pacemaker's maintenance-mode; an operating mode in which no health checks or automatic failover actions are performed. Performance on the site slowly recovered as the buffer pool slowly reached normal levels.

Now there are actually several issues in there, even at this early stage. Maintenance mode is generally the right thing to do here, but you enable it before making large changes to the configuration, and you disable it when done. If you're uncomfortable with the cluster manager taking its hands off the entire cluster, and you know what you're doing, you could also just disable cluster management and monitoring on a specific resource. Both approaches are explained here.
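
For illustration, here's roughly what both approaches look like with the crm shell; the resource name ms_mysql is a placeholder, not GitHub's actual configuration:

    # Put the whole cluster into maintenance mode before the migration,
    # and take it out again once the migration is done:
    crm configure property maintenance-mode=true
    # ... perform the schema migration ...
    crm configure property maintenance-mode=false

    # Or, leave the cluster managed and only take your hands off a
    # single resource (a hypothetical master/slave set named ms_mysql):
    crm resource unmanage ms_mysql
    # ... perform the schema migration ...
    crm resource manage ms_mysql
    # (to also silence its health checks, the monitor op can
    # additionally be disabled with enabled="false")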

Also, as far as "health checks failing" on the master is concerned, pretty much the only thing that is likely to cause such a failure in this instance is a timeout, and you can adjust those even on a per-operation basis in Pacemaker. But even that is unnecessary if you enable maintenance mode at the right time.
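
A sketch of what per-operation timeouts look like in crm configuration syntax; the agent, intervals and timeouts below are illustrative, not GitHub's actual values:

    # Illustrative only: a master/slave MySQL resource with explicit,
    # per-operation monitor timeouts (master/slave resources need
    # distinct monitor intervals per role).
    primitive p_mysql ocf:heartbeat:mysql \
        op monitor interval="20s" role="Master" timeout="60s" \
        op monitor interval="30s" role="Slave" timeout="60s" \
        op start timeout="120s" interval="0" \
        op stop timeout="120s" interval="0"
    ms ms_mysql p_mysql \
        meta master-max="1" notify="true"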

"Maintenance mode" really means maintenance mode

The following morning, our operations team was notified by a developer of incorrect query results returning from the node providing the 'standby' role. I investigated the situation and determined that when the cluster was placed into maintenance-mode the day before, actions that should have caused the node elected to serve the 'standby' role to change its replication master and start replicating were prevented from occurring.

Well, of course. In maintenance mode, Pacemaker takes its hands off your resources. If you're enabling maintenance mode right in the middle of a failover, then that's not exactly a stellar idea. If you do, then it's your job to complete those actions manually.
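
Completing them manually, in this case, would boil down to something like the following on the designated standby; the host name, credentials and replication coordinates below are placeholders, and the actual coordinates depend on the state of the old master:

    # Point the standby at the new master by hand and resume replication
    # (all values are placeholders):
    mysql -e "STOP SLAVE;
              CHANGE MASTER TO
                MASTER_HOST='new-master.example.com',
                MASTER_USER='repl',
                MASTER_PASSWORD='replpass',
                MASTER_LOG_FILE='mysql-bin.000123',
                MASTER_LOG_POS=4;
              START SLAVE;"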

I determined that the best course of action was to disable maintenance-mode to allow Pacemaker and the Percona Replication Manager to rectify the situation.

"Best" might be an exaggeration, if I may say so.

A segfault and rejected cluster messages

Upon attempting to disable maintenance-mode, a Pacemaker segfault occurred that resulted in a cluster state partition.

OK, that's bad, but what exactly segfaulted? crmd? attrd? pengine? Or the master Heartbeat process? But the next piece of information would have me believe that the segfault really isn't the root cause of the cluster partition:

After this update, two nodes (I'll call them 'a' and 'b') rejected most messages from the third node ('c'), while the third node rejected most messages from the other two.

Now it's a pity that we don't have any version information or logs, but this looks very much like the "not in our membership" issue present up to Pacemaker 1.1.6. This is a known issue; the fix is to update to a more recent version (here's the commit, on GitHub of course), and the workaround is to just restart the Pacemaker services on the affected node(s) while in maintenance mode.
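
In practice, that workaround amounts to something like this (assuming the Heartbeat-based stack GitHub was running; the init script name varies by distribution):

    # Make sure Pacemaker keeps its hands off the resources first:
    crm configure property maintenance-mode=true

    # On the node with the stale membership view, restart the cluster stack:
    /etc/init.d/heartbeat restart

    # Once the node has rejoined with a consistent membership view,
    # hand control back to Pacemaker:
    crm configure property maintenance-mode=false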

A non-quorate partition running MySQL?

Despite having configured the cluster to require a majority of machines to agree on the state of the cluster before taking action, two simultaneous master election decisions were attempted without proper coordination. In the first cluster, master election was interrupted by messages from the second cluster and MySQL was stopped.

Now this is an example of me being tempted to say, "logs or it didn't happen." If you've got the default no-quorum-policy of "stop", and you're getting a non-quorate partition, and you don't have any resources with operations explicitly configured to ignore quorum, then "two simultaneous master election decisions" can only refer to the Designated Coordinator (DC) election, which has no bearing whatsoever on MySQL master status. Luckily, Pacemaker allows us to take a meaningful snapshot of all cluster logs and status after the fact with crm_report. It would be quite interesting to see a tarball from that.
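
Collecting such a snapshot is a one-liner; the timestamp below is merely an example covering the incident window:

    # Gather logs, the CIB and cluster status from all nodes,
    # starting at the given time, into a single tarball:
    crm_report -f "2012-09-10 00:00:00" /tmp/github-incident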

In the second, single-node cluster, node 'c' was elected at 8:19 AM, and any subsequent messages from the other two-node cluster were discarded. As luck would have it, the 'c' node was the node that our operations team previously determined to be out of date. We detected this fact and powered off this out-of-date node at 8:26 AM to end the partition and prevent further data drift, taking down all production database access and thus all access to github.com.

That's obviously a bummer, but really, if that partition is non-quorate, and Pacemaker hasn't explicitly been configured to ignore that, no cluster resources would start there. Needless to say, a working fencing configuration would have helped oodles, too.
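
For the record, a minimal fencing setup, assuming IPMI-capable servers, might look roughly like this in crm configuration syntax (node names, addresses and credentials are obviously placeholders):

    # One fencing device per node, kept off the node it is meant to
    # fence, plus STONITH enabled cluster-wide.
    primitive st-db1 stonith:external/ipmi \
        params hostname="db1" ipaddr="10.0.0.11" userid="admin" \
               passwd="secret" interface="lanplus"
    location l-st-db1 st-db1 -inf: db1
    property stonith-enabled="true"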

Your cluster has no crystal ball, but it does have a command line

I'll skip over most of the rest of the GitHub post, because it's an explanation of how these backend issues affected GitHub users. I'll just hop on down to this piece:

The automated failover of our main production database could be described as the root cause of both of these downtime events. In each situation in which that occurred, if any member of our operations team had been asked if the failover should have been performed, the answer would have been a resounding no.

Well, you could have told your Pacemaker of that fact beforehand. Enable maintenance mode and you're good to go.

There are many situations in which automated failover is an excellent strategy for ensuring the availability of a service. After careful consideration, we've determined that ensuring the availability of our primary production database is not one of these situations. To this end, we've made changes to our Pacemaker configuration to ensure failover of the 'active' database role will only occur when initiated by a member of our operations team.

That splash you just heard was the bath water. The scream was the baby being tossed out with it.

Automated failover is a pretty poor strategy in the middle of a large configuration change. And Pacemaker gives you a simple and easy interface to disable it, by changing a single cluster property. Failure to do so may result in problems, and in this case it did.

When you put a baby seat on the passenger side of your car, you disable the air bag to prevent major injury. But if you take that baby seat out and an adult passenger rides with you, are you seriously saying you're going to manually initiate the air bag in case of a crash? I hope you're not.

Finally, our operations team is performing a full audit of our Pacemaker and Heartbeat stack focusing on the code path that triggered the segfault on Tuesday.

That's probably a really good idea. For anyone planning to do the same, we can help.

Comments

this is an artificial intelligence problem

Hi Florian,

There is indeed a point: switch off automation when you mess around. The thing is that GitHub staff were not messing around, in the sense that they were not managing the cluster. They were executing some SQL on the server, like a user would/could do in a regular operation. And executing this SQL "caused Percona Replication Manager's health checks to fail on the master". And this situation - something going wrong - may happen at any time, even when nobody is around. That's why you want automated cluster management in the first place.

Now the problem is in evaluating the whole situation. And unless you claim that Pacemaker is an artificial intelligence system capable of making the right decisions in unpredictable situations, it can't be trusted to do a failover when a human is away.

Consider a break-in alarm. You also gotta switch it off when you're around, and switch it back on when you're away. But when something goes wrong, the alarm does not start shooting, it just goes off and leaves it to the cops. It looks to me that Pacemaker (with its limited "health checks" and no AI) is semantically in the same category. The best it can do is go off and wait for an admin.

Not so sure on the alarm analogy

If you're planning to replace/break your own window, you really want to turn the alarm off first - not after the cops arrive.

On the face of it, it looks like that didn't happen here.

No question Pacemaker bugs screwed things up once maintenance-mode was re-enabled, but it appears that there were a couple of mis-steps leading up to that point that were not the software's fault.

Nope. It's not.

They were executing some SQL on the server. Like a user would/could do in a regular operation.

Based on Jesse's article, they were doing a two-pass schema migration. If that is a regular operation by their definition, and if (this part is conjecture, unfortunately I haven't found anyone to confirm this yet) the "failing health check" was indeed a timeout, then the configured timeouts were simply too short. If it is not a regular operation, they could have temporarily upped the timeouts, or (my recommendation) enabled maintenance mode.

And unless you claim that Pacemaker is an artificial intelligence system capable of making right decisions in unpredictable situations, it can't be trusted to do a failover when a human is away.

Nice straw man you're setting up there, but sorry, the assumption that an AI system is required to effect proper failover in a high-availability cluster is simply false.

The best it can do - go off and wait for admin.

Even if you insist on that being the case, Pacemaker has supported that behavior for ages. It can very easily be combined with $preferredmonitoringsystem to check for the existence of the control file and alert an operator. Of course, if you don't set up fencing (and again, that is conjecture for the GitHub issue), then that approach doesn't work.
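
One way to get exactly that "go off and wait" behavior is the on-fail operation attribute; a sketch (the resource and values are illustrative):

    # With on-fail="block", a failed monitor leaves the resource where
    # it is: no recovery, no failover. Pacemaker just records the
    # failure and waits for an administrator to intervene.
    primitive p_mysql ocf:heartbeat:mysql \
        op monitor interval="30s" timeout="60s" on-fail="block"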

And before Henrik starts getting any ideas again, no, I'm not "defending my baby." It's not my baby. I just fail to follow the progression from "we made some mistakes that culminated in a bigger problem" to "Pacemaker is broken" to "failover is evil".

Pacemaker is not broken

Its advertised usage as an automatic cluster manager is. And I think that what Henrik and others try to show is that it is not actually a Pacemaker-specific limitation; Pacemaker is used simply as an example of a supposedly top-notch automation tool. The whole case is just an example of why automatic cluster management is a bad idea.

The GitHub case just shows that when something unexpected happens (and according to the GitHub team it was unexpected), even such an advanced system as Pacemaker can go bonkers and, instead of managing resources, requires management itself - as you pointed out above: enabling maintenance mode, increasing timeouts, etc. And that pretty much defeats its purpose as an automatic manager: it is precisely when something unexpected happens that you're normally not around and want the automation to work and do the Right Thing.

And that is impossible without full-blown AI. I guess it is pretty clear that any non-trivial cluster has so many possible scenarios that it is simply impractical (if not impossible) to correctly a) identify and b) script them all. E.g. the aforementioned situation "mysqld is stuck but it's OK" is quite hard to formalize in its entirety. So yes, you need some intelligence there. And that's why you still have human pilots in airplanes and human dispatchers in ground control - in case something unexpected happens.

You're still not making sense to me.

So you're saying that X is unsuitable for Y because it doesn't also do Z? I don't follow. Specifically when that Z is the psychic ability to discern what administrators want, rather than what they have configured.

In a Google+ thread related to this post, Lars Marowsky-Brée voiced an idea that, in my humble opinion, is the only useful and constructive suggestion for a Pacemaker improvement that has come out of this discussion so far: to have Pacemaker distinguish between confirmed failures (a process having died, or returning an error condition for example) and timeouts, and the ability to define separate recovery policies for them. Thus, after a timeout you might be able to just do nothing except retry, rather than initiate the recovery actions you would on a confirmed failure.

What's the difference between

What's the difference between that and "Don't tell us until you know for sure"?

Just set the timeout to be longer and do whatever you gotta do to be sure.

A few thoughts on that note

It isn't necessarily always possible for Pacemaker (rather, the RA) to be sure. Consider a monitor check not failing, but hanging because some process is stuck in the middle of I/O (== process in the D state). In that case, checks will time out, and there is no way to recover if recovery involves killing the process. So rather than have the node fenced outright, some users may prefer to just be notified of the problem, without Pacemaker initiating recovery actions.

I do think that having a separate "on-timeout" action is worth discussing, as is perhaps an on-fail="notify", because the generic crm_mon notification capability usually produces too many notifications to be useful, at least for many users.
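
For reference, that capability is crm_mon's external agent mode, roughly like so (the script path is a placeholder; the script itself has to do the filtering):

    # Run crm_mon as a daemon and call an external script on every
    # resource event; the script decides what is actually worth alerting on.
    crm_mon --daemonize --external-agent /usr/local/bin/cluster-notify.sh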

If we could have a monitor op with on-fail="stop" on-timeout="notify", then Pacemaker would be able to tell an admin, "hey, there's a service that appears to be slow here, it hasn't failed outright so I'm not doing anything about it just yet, but you may want to take a look." Whereas if the service has demonstrably failed, it could just recover automatically.

But no, Alexey, this still wouldn't fit the "artificial intelligence" category. :)

It sure wouldn't!

But then it wouldn't be automatic failover either :) Because you won't be able to determine whether the service has demonstrably failed without AI in all but a few cases (if any at all). And that will make it (almost) a fully human-triggered system. So even if it wouldn't be "artificial", it still would be an intelligence ;)

And it sure would be a great improvement.

And it still would have a great potential to fail, because while you have a human to confirm the failure, there is absolutely no certainty (and without AI - not even hope) that Pacemaker will choose a proper failover strategy. So ideally, along with the notification of a suspected failure, it should offer a choice of possible actions including "do nothing" - so that the operator can not merely confirm a failure, but also suggest what to do ("do nothing" would then be equivalent to ignoring the failure).

Now that would be really sweet.