SQL Server 2008 R2 cluster setup fails on mount points

(Update from Monday morning: the workaround shown below worked for me.)

Another day, another setup problem (sigh). I was installing a second SQL Server 2008 R2 instance on a new cluster today. The first install had gone through without incident, but the second had a crucial difference: it used mount points. It was a fairly complex install, with 8 mount points under the root volume, and it took a while to complete all the setup configuration. Once the install got going, it failed with the following selection of errors:

1. A dialog box with the error

Wait on database engine recovery handle failed. Check the SQL Server error log for potential causes.

2. The setup finished and pointed me to the usual error log files in the setup directories. In the summary log file I saw an equivalent error

Detailed results:
Feature: Database Engine Services
Status: Failed: see logs for details
MSI status: Passed
Configuration status: Failed: see details below
Configuration error code: 0x4BDAF9BA@1306@24
Configuration error description: Wait on the Database Engine recovery handle failed. Check the SQL Server error log for potential causes.
Configuration log: C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\20120420_114647\Detail.txt

3. When I opened the referenced Detail.txt file I found pretty much the same again

2012-04-20 12:17:28 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.
2012-04-20 12:17:28 Slp: Wait on the Database Engine recovery handle failed. Check the SQL Server error log for potential causes.
2012-04-20 12:17:28 Slp: The configuration failure category of current exception is ConfigurationFailure
2012-04-20 12:17:28 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.

4. The text file pointed me to the SQL Server error log of the partially installed instance. When I opened this, I found the “real” cause

2012-04-20 16:00:51.51 spid8s Clearing tempdb database.
2012-04-20 16:00:51.52 spid8s Error: 5123, Severity: 16, State: 1.
2012-04-20 16:00:51.52 spid8s CREATE FILE encountered operating system error 5(failed to retrieve text for this error. Reason: 15100) while attempting to open or create the physical file 'O:\data\tempdb.mdf'.
2012-04-20 16:00:51.52 spid8s Error: 5123, Severity: 16, State: 1.
2012-04-20 16:00:51.52 spid8s CREATE FILE encountered operating system error 5(failed to retrieve text for this error. Reason: 15105) while attempting to open or create the physical file 'O:\data\tempdb.mdf'.

SQL Server error 5123 is why the instance couldn’t start: it was unable to create the tempdb data file, and no SQL Server instance can start without tempdb being present. The “real” error, though, is the underlying cause of that failure: error code 5 returned by the operating system when SQL Server calls the CreateFile API. This is one that everyone probably recognises: “access denied”. So the bottom line was that I was getting access denied on the root of my mount point for tempdb (which is where I had placed the data files as part of the setup configuration).

I checked through the other parts of the setup and found that the rest of the system databases had managed to write to the roots of the other mount point drives (which seemed strange). More relevantly, I noted that the SQL Server service accounts had been granted full access to the mount point root directories as part of setup, so in theory there ought not to be a permission error at all!
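If you want to confirm exactly where setup placed each system database file, a quick query against sys.master_files shows the physical paths (you can only run this once an instance is actually up, of course, so in my case it was more useful on the first, working instance):

-- List the physical location of every system database file,
-- so you can see which files landed on which mount points.
SELECT DB_NAME(database_id) AS database_name,
       name AS logical_name,
       physical_name
FROM sys.master_files
WHERE database_id IN (DB_ID('master'), DB_ID('model'), DB_ID('msdb'), DB_ID('tempdb'))
ORDER BY database_name, file_id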

I spent a few hours digging around and tried the install again with new mount points from a clean starting position, but hit exactly the same problem. Eventually I worked out that it was down to “issues” with the permissions on the roots of mount points. This is distinctly not well documented in official Microsoft sources (or at least none that I could find), and certainly not in the official pre-requisites.

http://msdn.microsoft.com/en-us/library/ms189910(v=sql.105).aspx

The best I could find was a connect item here

http://connect.microsoft.com/SQLServer/feedback/details/569895/installation-of-sql-server-2008-r2-on-mount-points-fails

a blog post here

http://getyouriton.blogspot.co.uk/2009/08/serious-gotchas-with-mounted-drives-or.html

and a forum post here

http://social.msdn.microsoft.com/Forums/en/sqlsetupandupgrade/thread/da05e467-6852-4e2d-8b8f-d253b479acfc

However, the accepted workaround is to create a subdirectory under the root of the mount point and point the data directories at that, and then you’ll be fine. I’ll be trying this method next week, as it’s 8pm on a Friday now and time for the weekend. If it works I’ll probably raise a bug requesting that the documentation be changed to point this out, as it would have saved me a whole lot of time.
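For what it’s worth, the same principle applies after an install has succeeded: if tempdb ends up sitting on the root of a mount point, you can repoint its files at a subdirectory and restart the service. This is just a generic sketch with placeholder paths and the default tempdb logical file names, not part of the setup workaround itself:

-- Repoint the tempdb files at a subdirectory under the mount point root.
-- O:\data\SQLData is a placeholder path: create it and grant the SQL Server
-- service account full control first. Takes effect at the next service restart.
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'O:\data\SQLData\tempdb.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'O:\data\SQLData\templog.ldf')
go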

As always, good luck with your setup……

Step by Step procedure for rolling upgrades to database mirroring for SQL Server 2008

Here’s something I was up to during the week. It’s a fairly mundane task, but despite that it’s a very important one, both to the client I was doing it for and to most people generally who run this type of configuration.

The databases I was working with were part of a true 24×7 configuration where any downtime whatsoever has the potential to lose data (email marketing where every remote click event is tracked, meaning that you can’t even take planned downtime, as you can’t control when recipients open and click the emails).

The systems in question run a fairly standard database mirroring configuration: 2 physical servers (the principal and mirror partners), mirroring multiple databases in high-safety mode with automatic failover determined by the quorum of a third witness server. The task in question was to run Windows Update on the 2 partner servers and then apply SP3 for SQL Server 2008 to bring them up to build 5500.
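As a quick sanity check (nothing official, just what I tend to do), you can confirm the build on each partner before and after patching by querying the server properties; SQL Server 2008 SP3 reports 10.00.5500:

-- Confirm the current build and patch level on this server.
SELECT SERVERPROPERTY('ProductVersion') AS product_version,
       SERVERPROPERTY('ProductLevel') AS product_level,
       SERVERPROPERTY('Edition') AS edition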

The guys who had been running these servers previously told me that they normally just patched the mirror partner, failed over, and then patched the new mirror (which was previously the principal). This is the standard methodology for a rolling upgrade within a mirroring configuration, but it misses one important step. I’m incredibly risk averse in all situations, and in this scenario it’s essential to remove the mirroring witness before starting the process, because if you don’t there is a small risk that halfway through the upgrade and patching you might suddenly find your mirroring partnership failing over.

In all fairness this is a fairly unlikely scenario, as it would require a failure at the principal at the exact point in time that the patch process was running. It would also require a theoretical problem with the servers managing their quorum, as they ought to handle such a failure properly. But after many years in the business, and particularly after many years within Microsoft support, I’ve had the misfortune of seeing a wide range of very obscure failures across the SQL Server product set, and a mirroring split brain is one of them.

A split brain can very simply be described as a scenario where both partners believe they hold the same role (both think they are the principal, or both think they are the mirror), thereby invalidating the partnership. If you ever get into this scenario it’s extremely horrible, and sometimes (again speaking from experience) you are obliged to do some rather dirty troubleshooting to recover the situation.

Sometimes my experiences at Microsoft support can scare people and skew their view of the product, as all we ever used to deal with in the escalation team was obscure problems or bugs that didn’t normally occur and couldn’t easily be solved. This means that whenever someone asks me about a certain procedure I’ve normally seen it break in a really horrible way! 99.9% of the time in production scenarios this doesn’t happen of course, but the moral of this story is that it makes me very risk averse.

So back to the point in hand: if you want to be risk averse when patching mirroring partnerships, the first thing to do is remove the witness and thereby drop back to high safety WITHOUT automatic failover, meaning that if something untoward happens whilst patching, the mirroring won’t try to fail over and make things worse.
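Before touching anything, it’s also worth capturing the current mirroring state on both partners so you know what role, safety level and witness configuration each database starts with. A simple check, run on each partner, might look like this:

-- Snapshot the mirroring configuration for every mirrored database on this partner.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_role_desc,
       mirroring_state_desc,
       mirroring_safety_level_desc,
       mirroring_witness_state_desc,
       mirroring_partner_name
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL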

To achieve this process in a controlled fashion, here are my step by step instructions (remember that if you mirror multiple databases you need to run the scripts for each one; there’s a sketch after the steps that generates the commands for every mirrored database)

1. Disable witness

ALTER DATABASE [your database] SET WITNESS OFF
go

2. Patch current mirror
3. Reboot current mirror if necessary
4. Fail over from the principal to the newly patched mirror

--RUN THE COMMANDS ON THE CURRENT PRINCIPAL TO MOVE TO THE OTHER SERVER
ALTER DATABASE [your database] SET PARTNER FAILOVER

5. There will be a short outage whilst the new principal comes online
6. Patch the current mirror (the original principal)
7. Reboot the current mirror
8. Fail back to the original principal server (if required – this is optional)
9. Add back the witness

ALTER DATABASE [your database] SET WITNESS = 'TCP://[your FQDN]:[your port]'
go

10. You’re good to go
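As promised, here’s a rough sketch that generates the per-database commands rather than typing them out by hand for every mirrored database. It only prints the statements; review the output and run each one on the correct partner at the correct step, and substitute your own witness address for the placeholder:

-- Generate (but do not execute) the witness-off, failover and witness-on
-- statements for every mirrored database on this server.
-- 'TCP://[your FQDN]:[your port]' is a placeholder to replace.
SELECT 'ALTER DATABASE ' + QUOTENAME(DB_NAME(database_id)) + ' SET WITNESS OFF' AS witness_off,
       'ALTER DATABASE ' + QUOTENAME(DB_NAME(database_id)) + ' SET PARTNER FAILOVER' AS partner_failover,
       'ALTER DATABASE ' + QUOTENAME(DB_NAME(database_id)) + ' SET WITNESS = ''TCP://[your FQDN]:[your port]''' AS witness_on
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL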

Good luck with your patching!