We were forced to do an unexpected failover on one of our Windows 2008 R2 clusters yesterday, which hosts a clustered MSMQ service. One of the nodes blue screened following a Windows update, so we failed over to the other node. So far so good, but the service didn't start on the other node; it just sat in a status of "starting".
As anyone who's ever read any of my tech blogs would know, I absolutely hate random troubleshooting where you simply try restarting the service in question multiple times, failing back, manually starting the services outside of the cluster manager, or anything like that.
Therefore I wanted to find out why it was actually hanging, rather than make decisions without knowing this information. I know there are those around who would just go for a quick restart or reboot, but years in deep SQL Server disasters have taught me that this is often a bad first step. If you want to know what a process is doing, you simply ask it, or trace it, and then you may well get the information you require.
In this case, the first questions are:
1. Is the Windows process which represents this service alive? Answer: yes, so it's doing something…
2. Is the process in question using CPU and memory, and are these fluctuating? Answer: yes, so it definitely appears to be doing "something"… (see the sketch just after this list for one way to check both)
3. What is it doing?
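For the record, questions 1 and 2 can be answered from a cmd prompt without touching the service itself. A rough sketch, assuming the default MSMQ image name (mqsvc.exe) and service name (MSMQ); a clustered instance may well use different names, so verify against your own node:

rem Find the PID of the Message Queuing process (assumes the default image name mqsvc.exe)
tasklist /svc /fi "imagename eq mqsvc.exe"

rem Confirm the service state and its PID (assumes the default service name MSMQ)
sc queryex MSMQ

rem Check memory for that PID (1234 is a placeholder), and sample CPU every 5 seconds
tasklist /fi "pid eq 1234"
typeperf "\Process(mqsvc)\% Processor Time" -si 5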
It's at this point that you pull out your diagnostic tool set of choice and monitor the actual process. There's no hard and fast rule for how you do this, but in this case Procmon from Sysinternals has to be your first choice. You have a process which you know is alive, and you want to know what it's doing, so you need to trace it as you have no logs. Obviously you could also take a memory dump, but that's a bit advanced for this stage.
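As an aside, if you'd rather not drive the Procmon GUI on a production box, it can be started capturing to a backing file from the command line. This is just a sketch, not what we did here, so double-check the switches against your Procmon version:

rem Start capturing to a backing file rather than virtual memory (path is just an example)
procmon.exe /AcceptEula /Quiet /Minimized /BackingFile D:\Temp\msmq-start.pml

rem ...and stop the capture when you're done
procmon.exe /Terminate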
Next you need to decide what to monitor. Monitoring can be expensive, so you might want to form a hypothesis of what you're looking for and try to focus your troubleshooting. This can be risky, as you might fail to collect some data, so there's no definitive answer here, but I definitely had a hypothesis in my head already just from experience: maybe the process was scanning the messages that were in the queue when the failover occurred, to ensure their integrity. I don't claim to be an MSMQ expert, but this seemed a reasonable way to go.
Therefore we attached procmon to the process, monitored its file activity and filtered this down to the drive on which we knew the physical messages were stored. Lo and behold, we saw the process chewing through the messages, opening, reading and then closing one about every 150ms or so.
From this point we then wanted to know how long it was going to take, as it seemed a reasonable action for a starting service (reviewing its current queue). Again, this is just a hypothesis at this stage, as maybe this is the wrong behavior for an MSMQ service, but this was the train of thought.
So using cmd.exe (as opposed to Explorer, which is expensive and can be dangerous when viewing directories with potentially huge numbers of files), we counted the number of .mq files in the directory in question and dumped the listing to a text file. We then correlated the procmon output to the cmd output to see which files the process was reading. It was simply reading the files in date order, so again this seemed to bear out our hypothesis.
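The commands were nothing more exotic than something like this (the paths are illustrative; substitute your own clustered message storage location):

rem List the .mq files in date order (oldest first) and dump the listing to a text file
dir /b /od D:\MSMQ\Storage\*.mq > C:\Temp\mq-files.txt

rem Count the lines (i.e. the files)
find /c /v "" C:\Temp\mq-files.txt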
At this point it's a simple maths calculation: multiply the number of messages by the average per-message time observed in procmon, and you can approximate how long the operation will take. We decided to wait for this amount of time (about 20 minutes) before doing anything else. If at the end of this period the process was still hung, or doing something else entirely, then we might need a new hypothesis; but as it was, after this amount of time the service finished starting and came online.
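To put illustrative numbers on it (round figures for the sake of the example, not our exact counts): roughly 8,000 messages at about 150ms each works out to around 20 minutes, which you can even sanity-check at an interactive cmd prompt:

rem 8000 messages x 150 ms each, converted to minutes (integer arithmetic, so approximate)
set /a 8000 * 150 / 1000 / 60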
So in the end everything was OK (apart from the backlog this caused), but it's worth noting that in this scenario, if you have thousands of messages outstanding as we did, you'll have a delayed start.
Upon reading afterwards we found out that our hypothesis was actually slightly incorrect, in that the process needs to read all the messages into memory, and it's not checking their integrity as I had guessed, but it just goes to show that structured troubleshooting is always the way to go.
Remember – don’t guess, don’t reboot, investigate. That’s my mantra (certainly in most production enterprise systems anyway!)