Windows Clustered MSMQ Service hangs on startup following failover

We were forced to do an unexpected failover on one of our Windows 2008 R2 clusters yesterday, which hosts a clustered MSMQ service. One of the nodes blue screened following a Windows update, so we failed over to the other node. So far so good, but the service didn’t start on the other node; it just sat in a status of “starting”.

As anyone who’s ever read any of my tech blogs will know, I absolutely hate random troubleshooting where you simply try restarting the service in question multiple times, or failing back, or manually starting the services outside of the cluster manager, or anything like that.

Therefore I wanted to do one thing first: find out why it was actually hanging, rather than make decisions without knowing this information. I know there are those around who would just go for a quick restart or reboot, but years of deep SQL Server disasters have taught me that this is often a bad first step. If you want to know what a process is doing, you simply ask it, or trace it, and then you may well get the information you require.

In this case, the first questions were:

1. Is the Windows process which represents this service alive? Answer: yes, so it’s doing something…

2. Is the process in question using CPU and memory, and are these fluctuating? Answer: yes, so it definitely appears to be doing “something”…

3. What is it doing?

It’s at this point that you pull out your diagnostic tool set of choice and monitor the actual process. There’s no hard and fast rule for how you do this, but in this case Procmon from Sysinternals has to be your first choice. You have a process which you know is alive and you want to know what it’s doing, so you need to trace it, as you have no logs. Obviously you could also take a memory dump, but that’s a bit advanced for this stage.

Next you need to decide what to monitor. Monitoring can be expensive, so you might want to form a hypothesis of what you’re looking for and try to focus your troubleshooting. This can be dangerous, as you might fail to collect some data you later need, so there’s no definitive answer here, but I definitely had a hypothesis already in my head just from experience, which was that maybe the process was scanning the messages that were in the queue when the failover occurred, to ensure their integrity. I don’t claim to be an MSMQ expert, but this seemed a reasonable way to go.

Therefore we attached Procmon to the process, monitored its file activity and filtered this down to the drive on which we knew the physical messages were stored. Lo and behold, we saw the process chewing through the messages, opening, reading and then closing one about every 150ms or so.
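For anyone wanting to reproduce this, the Procmon side needs only a couple of filters, along these lines (mqsvc.exe is the standard Message Queuing service executable, and M: is a stand-in for whichever drive holds your clustered message store):

Column          Relation       Value        Action
Process Name    is             mqsvc.exe    Include
Path            begins with    M:\          Include

With those in place you should see the CreateFile/ReadFile/CloseFile cycle for each .mq file as the service works through the store.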

From this point we then wanted to know how long it was going to take, as it seemed to be a reasonable action for a starting service (reviewing its current queue). Again, this was just a hypothesis at this stage, as maybe this isn’t how an MSMQ service ought to behave, but this was the train of thought.

So using cmd.exe (as opposed to Explorer, which is expensive and dangerous when viewing directories with potentially huge numbers of files) we counted the number of .mq files in the directory in question and dumped the listing to a text file. We then correlated the Procmon output with the cmd output to see which files the process was reading. It was simply reading the files in date order, so again this seemed to bear out our hypothesis.
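Something along these lines is all it takes (the paths are illustrative; substitute the directory your clustered message store actually lives in):

rem Dump a bare listing of the .mq message files, oldest first, to a text file
dir /b /od M:\MSMQ\storage\*.mq > C:\temp\mqfiles.txt

rem Count the lines in that file, i.e. the number of messages
find /c /v "" C:\temp\mqfiles.txt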

At this point it’s a simple maths calculation: multiply the number of messages by the average per-message time observed in Procmon and you can approximate how long the operation will take. We decided to wait for this amount of time (about 20 minutes) before doing anything else. If at the end of this period the process had still been hung, or had been doing something else entirely, then we might have needed a new hypothesis, but as it was, after roughly this amount of time the service finished starting and came online.
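To put some numbers on it: at roughly 150ms per message, a backlog of 8,000 messages (an illustrative figure, not our exact count) works out at 8,000 × 0.15 seconds = 1,200 seconds, or about 20 minutes.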

So in the end everything was OK (apart from the backlog this caused), but it’s worth noting that in this scenario, if you have thousands of messages outstanding as we did, you’ll have a delayed start.

Upon reading up afterwards, we found out that our hypothesis was actually slightly incorrect, in that the process needs to read all the messages into memory, and that it’s not checking their integrity as I had guessed, but it just goes to show that structured troubleshooting is always the way to go.

Remember – don’t guess, don’t reboot, investigate. That’s my mantra (certainly in most production enterprise systems anyway!)

SQL Server 2008 R2 cluster setup fails on mount points

(Update from Monday morning: the workaround shown below works for me.)

Another day, another setup problem (sigh). I was installing a second SQL 2008 R2 instance on a new cluster today. The first one had gone in without incident, but the second one had a crucial difference in that it used mount points. It was a fairly complex install with 8 mount points under the root volume, and it took a while to complete all the setup configuration. When it got going, it failed with the following selection of errors:

1. A dialog box with the error:

Wait on database engine recovery handle failed. Check the SQL Server error log for potential causes.

2. The setup finished and pointed me to the usual error log files in the setup directories. In the summary log file I saw an equivalent error:

Detailed results:
Feature: Database Engine Services
Status: Failed: see logs for details
MSI status: Passed
Configuration status: Failed: see details below
Configuration error code: 0x4BDAF9BA@1306@24
Configuration error description: Wait on the Database Engine recovery handle failed. Check the SQL Server error log for potential causes.
Configuration log: C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\20120420_114647\Detail.txt

3. When I opened the detail.txt file referred to, I found pretty much the same again:

2012-04-20 12:17:28 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.
2012-04-20 12:17:28 Slp: Wait on the Database Engine recovery handle failed. Check the SQL Server error log for potential causes.
2012-04-20 12:17:28 Slp: The configuration failure category of current exception is ConfigurationFailure
2012-04-20 12:17:28 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.

4. The text file pointed me to the SQL Server error log of the partially installed instance. When I opened this, I found the “real” cause:

2012-04-20 16:00:51.51 spid8s Clearing tempdb database.
2012-04-20 16:00:51.52 spid8s Error: 5123, Severity: 16, State: 1.
2012-04-20 16:00:51.52 spid8s CREATE FILE encountered operating system error 5(failed to retrieve text for this error. Reason: 15100) while attempting to open or create the physical file 'O:\data\tempdb.mdf'.
2012-04-20 16:00:51.52 spid8s Error: 5123, Severity: 16, State: 1.
2012-04-20 16:00:51.52 spid8s CREATE FILE encountered operating system error 5(failed to retrieve text for this error. Reason: 15105) while attempting to open or create the physical file 'O:\data\tempdb.mdf'.

The SQL Server error code 5123 is why the instance couldn’t start: it was unable to create a tempdb data file, and no SQL Server instance can start without tempdb being present. The “real” error, though, is the reason this occurred, which is error code 5 from the OS when SQL Server calls the CreateFile API. This is one that everyone probably recognises: “access denied”. So the bottom line here was that I was getting access denied on the root of my mount point for tempdb (which is where I had placed the data files as part of the setup configuration).

I checked through the other parts of setup, and the rest of the system databases had managed to write to the roots of the other mount point drives (which seemed strange). Of more relevance, I noted that the SQL Server service accounts had been granted full access to the mount point root directories as part of the setup, so theoretically there ought not to be a permission error!
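If you want to confirm exactly what ACLs setup has left on the directory in question, icacls will list them (the path here is just the tempdb location from my error log; substitute your own):

rem List the ACLs on the directory tempdb was being created in
icacls O:\data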

I spent a few hours digging around and tried the install again with new mount points from a clean starting position, but encountered the exact same problem. Eventually I figured out that it was due to the fact that there are “issues” with permissions on the roots of mount points. This is distinctly not well documented within official MS sources (or not that I could find), and certainly not within the official pre-reqs:

http://msdn.microsoft.com/en-us/library/ms189910(v=sql.105).aspx

The best I could find was a Connect item here:

http://connect.microsoft.com/SQLServer/feedback/details/569895/installation-of-sql-server-2008-r2-on-mount-points-fails

a blog post here:

http://getyouriton.blogspot.co.uk/2009/08/serious-gotchas-with-mounted-drives-or.html

and a forum post here:

http://social.msdn.microsoft.com/Forums/en/sqlsetupandupgrade/thread/da05e467-6852-4e2d-8b8f-d253b479acfc

However, the accepted workaround is that you should create a subdirectory under the root of the mount point and then you’ll be fine. I’ll be trying this method next week, as it’s 8pm on a Friday now and time for the weekend. If it works I’ll probably post a bug requesting that the documentation gets changed to point this out, as it would have saved me a whole lot of time.
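In practice the workaround is just a case of creating a folder below each mount point root and pointing the data directories at that during setup; something along these lines (the path and service account name are illustrative, and the explicit grant shouldn’t normally be needed as setup will ACL the directories you specify):

rem Create a folder below the mount point root and point the data directories at it during setup
mkdir O:\data\SQLData

rem Optionally grant the SQL Server service account full control on the new folder
icacls O:\data\SQLData /grant MYDOMAIN\sqlsvc:(OI)(CI)F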

As always, good luck with your setup……

Tracing select statements on specific objects in SQL Server without using Profiler

I published a blog post with Basefarm last week about tracing select statements in SQL Server without using Profiler. It demonstrates how to use SQL Audit (based on Extended Events) to record any select statements issued against particular tables over a prolonged time period in a scalable, lightweight way. Although SQL Audit was designed, as its name suggests, to provide an audit framework for the product, I find it can be quite useful in other ways as well. In this example we were helping developers effectively check their code base for unexplained access to a particular object.

You can read the full post here:

http://blog.basefarm.com/blog/2012/02/13/tracing-select-statements-on-specific-objects-in-sql-server-without-using-profiler/