Here’s something I was working on this week. It’s a fairly mundane task, but an important one, both for the client I was doing it for and for most people who run this type of configuration.
The databases I was working with were part of a true 24×7 configuration where any downtime whatsoever could lose data. This is email marketing, where every remote click event is tracked, so you can’t even take planned downtime: you have no control over when recipients open emails and click on them.
The systems in question run a fairly standard database mirroring configuration: 2 physical servers (the principal and mirror partners) mirroring multiple databases in high safety mode, with automatic failover decided by quorum with a third server acting as witness. The task was to run Windows Update on the 2 partner servers and then apply SP3 for SQL Server 2008 to bring them up to build 5500.
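If you want to confirm that your own servers match this configuration before you start, the sys.database_mirroring catalog view shows each mirrored database’s role, state, safety level and witness. The query below is a minimal sketch of that check, not a script from the client’s environment; high safety with automatic failover should show up as a FULL safety level with a witness name present.

```sql
-- List every mirrored database on this instance with its mirroring role,
-- state, safety level and witness. Databases that are not mirrored have a
-- NULL mirroring_guid, so they are filtered out.
SELECT
    DB_NAME(m.database_id)          AS database_name,
    m.mirroring_role_desc           AS mirroring_role,   -- PRINCIPAL or MIRROR
    m.mirroring_state_desc          AS mirroring_state,  -- e.g. SYNCHRONIZED
    m.mirroring_safety_level_desc   AS safety_level,     -- FULL = high safety
    m.mirroring_partner_name        AS partner,
    m.mirroring_witness_name        AS witness,          -- no witness => NULL/empty
    m.mirroring_witness_state_desc  AS witness_state     -- CONNECTED, DISCONNECTED, UNKNOWN
FROM sys.database_mirroring AS m
WHERE m.mirroring_guid IS NOT NULL
ORDER BY database_name;
```

Running it on both partners should show the same databases, with the roles swapped.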
The guys who had been running these servers previously told me that they normally just patched the mirror partner, failed over, and then patched the new mirror (which was previously the principal). This is the standard rolling upgrade methodology for a mirroring configuration, but it misses one important step. I’m incredibly risk averse in all situations, and in this scenario it’s essential to remove the witness before starting, because if you don’t, there is a small risk that halfway through the upgrade and patching process your mirroring partnership suddenly fails over.
In all fairness this is quite an unlikely scenario, as it would require a failure at the principal at the exact moment the patching process was running. It would also require a problem with how the servers manage their quorum, since in theory they ought to handle such a failure properly. However, after many years in the business, and particularly after many years within Microsoft support, I’ve run into a wide range of very obscure failures across the SQL Server product set, and a mirroring split brain is one of them.
A split brain can be described very simply as a scenario where both partners believe they hold the same role (both think they are the principal, or both think they are the mirror), which invalidates the partnership. If you ever end up in this situation it’s extremely unpleasant, and sometimes (again speaking from experience) you have to do some rather dirty troubleshooting to recover.
Sometimes my experiences at Microsoft support can scare people and skew their view of the product, as all we ever dealt with in the escalation team were obscure problems or bugs that didn’t normally occur and couldn’t easily be solved. This means that whenever someone asks me about a particular procedure, I’ve usually seen it break in a really horrible way! 99.9% of the time in production this doesn’t happen, of course, but the moral of the story is that it has made me very risk averse.
So, back to the point at hand: if you want to be risk averse when patching mirroring partnerships, the first thing to do is remove the witness and thereby drop back to high safety WITHOUT automatic failover. That way, if something stupid happens whilst patching, mirroring won’t try to fail over and mess things up further.
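If you mirror several databases, removing the witness from all of them can be scripted rather than typed out per database. The sketch below, run on the current principal, loops over sys.database_mirroring and issues SET WITNESS OFF for every database this server is principal for and that still has a witness. The cursor and variable names are illustrative choices of mine, and a cursor is used rather than STRING_AGG because that function doesn’t exist in SQL Server 2008.

```sql
-- Run on the current PRINCIPAL: remove the witness from every database this
-- server is principal for, dropping each one to high safety WITHOUT
-- automatic failover.
DECLARE @db sysname, @sql nvarchar(max);

DECLARE witness_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DB_NAME(database_id)
    FROM sys.database_mirroring
    WHERE mirroring_role_desc = 'PRINCIPAL'
      AND ISNULL(mirroring_witness_name, N'') <> N'';   -- only DBs that still have a witness

OPEN witness_cursor;
FETCH NEXT FROM witness_cursor INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- QUOTENAME protects against database names that need bracketing
    SET @sql = N'ALTER DATABASE ' + QUOTENAME(@db) + N' SET WITNESS OFF;';
    PRINT @sql;                      -- log what is about to run
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM witness_cursor INTO @db;
END

CLOSE witness_cursor;
DEALLOCATE witness_cursor;
```

Afterwards, the witness column in sys.database_mirroring should be empty for those databases, while the safety level stays FULL.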
To do this in a controlled fashion, here are my step-by-step instructions (remember that if you mirror multiple databases, you need to run the scripts for each database):
1. Disable the witness
-- Run this on the current principal
ALTER DATABASE [your database] SET WITNESS OFF
GO
2. Patch the current mirror
3. Reboot the current mirror if necessary
4. Fail over from the principal to the newly patched mirror
-- Run this on the current principal to move the database to the other server
ALTER DATABASE [your database] SET PARTNER FAILOVER
GO
5. There will be a short outage whilst the new principal comes online
6. Patch the current mirror (the original principal)
7. Reboot the current mirror
8. Fail back to the original principal server if required (this is optional; it is the same SET PARTNER FAILOVER command, run on whichever server is currently the principal)
9. Add back the witness
-- Run this on the current principal
ALTER DATABASE [your database] SET WITNESS = 'TCP://[your FQDN]:[your port]'
GO
10. You’re good to go
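One thing worth adding to steps 4 and 8: a manual failover only succeeds once the partners are SYNCHRONIZED, and a mirror that has just been patched and rebooted needs some time to catch up on the log it missed. A quick check along these lines, run on the current principal, tells you whether it’s time to issue SET PARTNER FAILOVER. It’s a sketch that assumes you want to check every database this server is principal for.

```sql
-- Run on the current PRINCIPAL before failing over (step 4, and again before
-- failing back in step 8). SET PARTNER FAILOVER requires a SYNCHRONIZED
-- database, and a freshly rebooted mirror may still be SYNCHRONIZING.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_state_desc AS mirroring_state
FROM sys.database_mirroring
WHERE mirroring_role_desc = 'PRINCIPAL'
  AND mirroring_state_desc <> 'SYNCHRONIZED';

-- @@ROWCOUNT holds the number of rows the SELECT above returned
IF @@ROWCOUNT > 0
    RAISERROR(N'Some mirrored databases are not SYNCHRONIZED yet; wait before failing over.', 16, 1);
ELSE
    PRINT N'All principal databases on this server are SYNCHRONIZED; ready for SET PARTNER FAILOVER.';
```

If the check fails, wait a little and run it again rather than forcing anything.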
Good luck with your patching!