Failing cluster service? – CheckQueryProcessorAlive

OK, here’s my first technical post. At the time of writing the issue is still not resolved, but I do have a hunch. Probably from spending too much time with a laptop perched on my knees :-) I also have a theory on what the issue might be, and that is exactly what I’m going to blog about. If you are reading this before I’ve posted an update and you have first-hand experience of this issue, please drop me a line to either confirm or dispute my theory; I will be more than glad to hear from you. Whatever happens, I am planning on applying the change on Saturday 4th October (yes, I work in a tightly controlled organisation) and I will post my update after that date.

The Background

We have a Windows cluster in an active/passive configuration containing two resource groups, one for the Windows cluster resources and one for the SQL Server cluster resources. Straightforward stuff. In the SQL group we have three disks, E:, F: and G:, which are LUNs (SAN storage) that have been allocated to this cluster. I’m not going to divulge too much detail on which data file lives where; it’s largely irrelevant, it works, and it would only detract from the real issue.

The service (I’m using this term in the business sense) started to outgrow its home, rapidly, so extra LUNs were allocated to the server to try and head off a potential disaster. The drive letters were assigned, the disks were added as physical disks in the SQL cluster group, and I added them as dependencies of the SQL Server cluster resource. Bingo, we were good to go on the new disks. I then set about adding some new data files to the filegroup, physically placed on the new disks, and we were off again, safe in the knowledge that I wasn’t going to get called out that night for a database that had hung because there was nowhere left for it to grow. But then the problems started.

I’d never seen this type of failure before. In very basic terms, the SQL Server service crashes but doesn’t actually fail over to the second (passive) node; it simply dies and restarts on the active node, all in the blink of an eye and without warning.

The Investigation

I trawled through the (Application) Event Viewer logs and sure enough there were a whole bunch of errors, all at the exact moment of failure and all having the value MSSQL$PC001 in the source column.

The Event ID is 17052, the category is (3) and the error text of the earliest error is as follows -

[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

The details of the failures are exactly the same every time: there are always 14 errors, always at exactly the same time, all for MSSQL$PC001, and they always start with the error message quoted above.

For completeness, these are the other error messages as they appear in chronological order. I say chronological order but in fact they all have the exact same timestamp -

[sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
[sqsrvres] printODBCError: sqlstate = 08S01; native error = b; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.
[sqsrvres] OnlineThread: QP is not online.
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
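To make the shape of that burst easier to see, here is a minimal Python sketch that tallies the 14 event texts by the sqsrvres routine that logged them and by ODBC sqlstate. The event texts are copied from the listing above; the names ALIVE, LINK, EVENTS and tally are mine, purely for illustration, and nothing here talks to the cluster or to the event log itself.

```python
import re
from collections import Counter

# Event texts copied from the post; the variable names are illustrative only.
ALIVE = "[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed"
LINK = ("[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; "
        "message = [Microsoft][ODBC SQL Server Driver]Communication link failure")

EVENTS = [
    ALIVE,
    "[sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; "
    "message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).",
    "[sqsrvres] printODBCError: sqlstate = 08S01; native error = b; "
    "message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. "
    "Check your network documentation.",
    "[sqsrvres] OnlineThread: QP is not online.",
] + [ALIVE, LINK] * 5  # the alternating pairs that close out the burst

ROUTINE = re.compile(r"^\[sqsrvres\] (\w+):")
SQLSTATE = re.compile(r"sqlstate = (\w+);")


def tally(events):
    """Count events per logging routine and per ODBC sqlstate."""
    by_routine = Counter()
    by_sqlstate = Counter()
    for text in events:
        routine = ROUTINE.match(text)
        if routine:
            by_routine[routine.group(1)] += 1
        state = SQLSTATE.search(text)
        if state:
            by_sqlstate[state.group(1)] += 1
    return by_routine, by_sqlstate


by_routine, by_sqlstate = tally(EVENTS)
print(len(EVENTS), dict(by_routine), dict(by_sqlstate))
```

Run against the texts above, this shows that every ODBC error except the first is a connection-level failure (08S01), which is what sent us looking at connectivity first.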

The Action

The very first thing we did was to look at what had changed on this service recently and the two things that stuck out were the new disks and the new datafiles.

We updated the drivers for the HBA cards (SAN fabric cards) on the two nodes to the very latest versions, as they were way out of support and the errors were leading us down the “it just loses its connection” path. Initially it looked as if this was the solution, but after a week of service stability we started to see a recurrence of the same issue.

As it happened, more by total accident than design, we had to undertake a very similar exercise for another service (new disks, and new data files on those new disks) and sure enough we started seeing the exact same bunch of error messages (again, 14 in total). Coincidence? I think not.

Now, we’ve added tons of disks in the past, and I’ve lost count of the number of times I’ve added new data files to filegroups and stuck them on new disks, so what was different?

Well, this is where my theory comes into play.

The disks that were presented to our server were not of the same type as the previous ones; they weren’t even on the same SAN fabric. We’ve recently implemented a brand new SAN along with new disk arrays (from a completely different vendor) and have started rolling this out to all new production systems. Where this new SAN alone is attached to our SQL Servers, we are seeing no issues. Where the old SAN is attached to our SQL Servers along with the new SAN, we are also seeing no issues. It’s only when you have multiple data files in a filegroup, and those files have been split over the two different SANs, that you get the problem.

For detail’s sake, I have listed the LUNs (as shown in Disk Management) below:

E: 150 GB RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 2
F: 30 GB RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 4
G: 25 GB RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 3
H: 25 GB RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 5
J: 150 GB IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 3
K: 30 GB IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 2
L: 300 GB IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 4
M: 25 GB IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 1

As you can plainly see, we have two types of disk, and we have data files (for the PRIMARY filegroup) spread over both disk types and therefore over two different SAN fabrics.
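The condition I’m describing (one filegroup with files on more than one type of disk) is easy to check mechanically. Below is a minimal Python sketch of that check. The drive-letter-to-device mapping is taken from the Disk Management listing above, but the file paths and the ARCHIVE filegroup are entirely hypothetical, invented only to give the function something to chew on; I’m not divulging the real placement here.

```python
from collections import defaultdict

# Drive letter -> device type, as reported by Disk Management in this post.
DISK_TYPES = {
    "E": "RDAC Virtual Disk",
    "F": "RDAC Virtual Disk",
    "G": "RDAC Virtual Disk",
    "H": "RDAC Virtual Disk",
    "J": "IBM 2145 Multi-path Disk Device",
    "K": "IBM 2145 Multi-path Disk Device",
    "L": "IBM 2145 Multi-path Disk Device",
    "M": "IBM 2145 Multi-path Disk Device",
}

# HYPOTHETICAL placement: (filegroup, physical path). Not the real layout.
HYPOTHETICAL_FILES = [
    ("PRIMARY", r"E:\data\example_1.mdf"),
    ("PRIMARY", r"J:\data\example_2.ndf"),
    ("ARCHIVE", r"K:\data\example_3.ndf"),
    ("ARCHIVE", r"L:\data\example_4.ndf"),
]


def filegroups_spanning_disk_types(files, disk_types):
    """Return the filegroups whose files sit on more than one device type."""
    seen = defaultdict(set)
    for filegroup, path in files:
        drive = path[0].upper()  # first character of the path is the drive letter
        seen[filegroup].add(disk_types[drive])
    return {fg: sorted(types) for fg, types in seen.items() if len(types) > 1}


mixed = filegroups_spanning_disk_types(HYPOTHETICAL_FILES, DISK_TYPES)
print(mixed)
```

With the made-up placement above, only PRIMARY is flagged, because it has one file on an RDAC disk and one on an IBM 2145 disk; ARCHIVE lives entirely on IBM 2145 disks and passes. In practice you would feed it the real filegroup and physical file names from the database rather than a hand-typed list.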

So my theory, and the whole reason for this post, is this – if you are sending I/O for a single filegroup over two different types of SAN then you are likely to encounter these issues. I can’t ignore what I consider to be hard evidence, so I’ve managed to convince myself that this is the root cause of the issue. Time will tell.

The Fix

I’ve got my change scheduled for Saturday 4th October, starting at 23:00, so whether or not this fixes the issue will only really become apparent after that date, and even then I’m going to have to hold my breath for a few days. It’s so intermittent and unpredictable that it’s just not possible to pop the champagne corks until, well, I don’t exactly know. I’d just better keep the bubbly on ice, I guess.

So anyway, the proposed fix. Well, it’s actually very simple.

I’m going to move all of the data files (for the PRIMARY filegroup) from the RDAC virtual disks to the IBM disks, thus forcing the I/O down one SAN only. It turns out that we have plenty of spare capacity on the IBM disks to cater for all of the data files (and log files and tempdb files); it just means that, me being me, I can’t split them over multiple LUNs as I normally like to do. Not that splitting them over multiple LUNs in any way guarantees that you’ve actually put your data files, log files, tempdb files and other filegroup files on separate physical disks, which would be very, very nice indeed, thank you very much.

I’ll be posting an update to this, post-change, to let you know how it’s going. I hope you got through it OK. If you have any comments please let me have them; I’d be glad of some feedback.
