Event Notifications using Powershell, WMI, WQL and SQL Server 2008

This is my first post in a very long time, mainly because I’ve been incredibly busy but also because I keep reading other peoples blogs and I was of the opinion that it had all been said before, usually by someone who knows a shed-load more than I do!

However, I have just been through an experience that I felt I needed to share because I really couldn’t find the definitive answer out there. The question was – how can I capture the SQL:Stmtrecompile event using something other than Profiler Trace.

One solution is by using a combination of Powershell, WMI and WQL.

I could leave it there and let you do your own digging but I just know I’m going to need to do this all over again at some point so even if nobody else reads this, it’s going to be a useful exercise for me.

So. Where to begin?

A client I am doing some work for is seeing a pattern whereby a certain piece of functionality performs differently at certain parts of the day.

“What’s so unusual about that?” I hear you say, after all, isn’t this completely expected.

Well, yes and no. It is expected because as the number of users (and volumes of data) ramp up this is obviously going to have an impact.

But what if the impact is actually a positive one? What if the performance of the component increases when it is under the most pressure?

Early in the morning, when the “system” is quiet, the time measurement is ~80ms for a particular piece of functionality.

As the number of users increases, and the volume of data being shifted around also increases, the time measurement drops to ~60ms.

How could this be? Well, the stored procedure involved uses table variables. Many people, and far better DBA’s than myself, have blogged about the use of Table Variables vs Temporary Tables so I’m not even going to embarras myself with an attempt. I understand the differences and I believe that the combination of the table variable and the amount of data in the transitory tables (to which the table variable is joining) is a contributory factor behind the rise in performance.

Why?

Well (and I have actually witnessed this change in behaviour first hand) the query optimiser makes a decision based upon the statistics (on the transitory tables in the database; there are no stats for table variables of course) and, naturally, they change as the data in the tables change.

I have seen two identical queries behave in different ways because of the data volumes in the tables. One query plan will perform an Index Seek on Table A followed by a Key Lookup on Table B (yes, yes I know all about covering indexes – go tell that to the developers who want to use index hints instead) and on another database (same stored proc) the query plan will perform an Index Seek on Table B followed by a Key Lookup on Table A.

So how can I prove that what is happening is as a direct result of shifting patterns in the statistics histogram? The statistics are set to Auto Update so over the course of the day the picture is going to change. What I want to capture is the moment the picture changes to such an extent that it causes the plan to change.

Well one thing I did think of was to capture the SQL:StmtRecompile event for the given SP to see if I can see the moment when the plan changes.

This may not be the answer to this particular question, time will tell when I finally come to deploy the script, but it did set me off on a technical challenge that I will now write about here.

Capturing SQL Events using Powershell, WMI and WQL.

Firstly, check out the link to the SQL:StmtRecompile Event Class on BOL. You will need to understand what it is that you are capturing – http://technet.microsoft.com/en-us/library/ms179294.aspx

Please note, before this task my Powershell and WQL skills were practically non-existent, so it may not come as any surprise to you if you see some glaring errors in my code. I apologise for these in advance; I had deadlines to meet and I just needed this to work as best I could. I’ll try and make it perfect over time.

My first task was obviously to trawl Google to see if anyone else had done this before and save me time and effort. There are a few examples out there but none that provided me with a complete working solution.

One of my first aknowledgments is to the Shellyourexperience team who set me off on the right foot with this post http://shellyourexperience.wordpress.com/2011/09/04/hey-powershell-eventing-can-you-notify-me-when-a-sp-recompile-happens/

Another nod of aknowledgment goes to Chad Miller whose Write-DataTable and Out-DataTable functions have saved me a huge amount of effort. Thanks Chad.

So whilst I have had to borrow the code of others, it was not simply a case of pasting it all together and pressing F5 (if you happen to be using PowerGUI as I am).

So let’s cut to the chase.

Using PowerGui, I registerd a WMI Event, which is for all intents and purposes a job that listens for a particular event.

Here is the line of code I used (which I have saved as recompile-wrapper.ps1):

$query = “select ApplicationName, ClientProcessID, DatabaseID, DatabaseName, EventSequence, EventSubClass, EventSubClassDesc, GroupID, HostName, IntegerData2, IsSystem, LineNumber, LoginName, NestLevel, NTDomainName, NTUserName, ObjectID, ObjectName, ObjectType, Offset, RequestID, ServerName, SessionLoginName, SPID, StartTime, TextData from SQL_STMTRECOMPILE”

Register-WMIEvent -Namespace ‘root\Microsoft\SqlServer\ServerEvents\TEST01′-Query $query -Action {ipmo -name C:\WMI\sprecompile.psm1 -Verbose;ipmo -name C:\WMI\functions.psm1 -Verbose;$dtr = Get-SPStmtRecompile $event.SourceEventArgs.NewEvent`

    | Out-DataTable ; write-datatable -ServerInstance MyServer\TEST01 -Database SBTest -TableName StmtRecompile -Data $dtr}

So let’s break that down.

$query is a variable that contains my WQL query. I am wanting to retrieve the rows from the SQL_StmtRecompile event that has fired. I have deliberately specified the columns as opposed to selecting everything because I am currently having issues with the Varbinary datatypes and BulkCopy.

The -Namespace is the WMI path to my SQL instance. If you want to explore all of your available Namespace paths and events I highly recommend you download a copy of PowerGUI. By using the WMI Browser in the Admin Console you will be able to see exactly what you are dealing with.

The -Action parameter is followed by a number of options inside the curly brackets. The ipmo command is “Import Module” and I am specifying the fully named path, just so there is no ambiguity. I am also specifying the -verbose option because I want to see the modules being loaded when the event fires.

The two modules I am loading, sprecompile.psm1 and functions.psm1, have been heavily borrowed from the ShellYourExperience team and Chad Miller (I’ll provide links and downloads somewhere in this blog). I did have to create a new function in the sprecompile module and also modify the Write-DataTable function in the functions module to allow the use of triggers with the BulkCopy command. Those modifications in themselves were not easy (well, they are now that I know exactly what it is I need to do) and I wanted to point them out just so you don’t think I’ve robbed someone elses’ code and just put it all together! I’ve literally spent the last two days with my head buried in this. I’ve pulled the legs off it, put it all back together, debugged every single step and I have even woken up in the middle of the night trying to figure out how I’m going to enable the FireTriggers BulkCopy option. I’m completely new to PowerShell so it just wasn’t obvious at the time and I feel all the better for having gone through all of that pain and learning.

Anyway, back to the code.

The line $dtr = Get-SPStmtRecompile $event.SourceEventArgs.NewEvent | Out-DataTable is assigining the output from the Out-DataTable function to a variable called $dtr.

The line write-datatable -ServerInstance GBEDUB-APS08\TEST01 -Database SBTest -TableName StmtRecompile -Data $dtr is calling the Write-DataTable function and passing in the SQL Server Instance Name, the Database Name, the Table Name and the data I want to Bulk Copy in to the SQL Table.

This is where my code differentiates to that from the ShellYourExperience guys: they are piping the output from the Out-DataTable function into the Write-DataTable function. Now I just could not get this to work so I assigned the output to a variable and used the -Data parameter of the Write-DataTable function. This worked a treat.

You will need to create a table (mine is called StmtRecompile in my SBTest database) and if you use the data types from the Event Class you’ll not go far wrong. Well, substitute the Image datatypes for Varbinary(MAX) and that will see you right. I’ll attach my table def somewhere in thist post.

Now, to actually register the WMI Event you need to execute the recompile-wrapper.ps1 PowerShell script, and I have done it from the command line like this:

C:\WMI>powershell -noexit -noninteractive “& “c:\wmi\wrapper.ps1″

It fires up Powershell, executes the command and leaves the job running in the background.

How do I know? Well, from the PS command line, just type get-job and you’ll see the job in the list.

It’ll almost certainly have a status of Not Started and that is absolutely fine. You don’t have to hunt around for clues on how to get it started, it will start itself when the event fires.

So how do you fire the event? Simple, run a query! You’ll need to run a query that will create a new plan and you can force this by creating\rebuilding indexes (causing a schema change).

As soon as you hit the F5 key and your SQL query executes, check out the PS window and you should see the modules being loaded.

Type, get-job and you should either see the status change to Running or Failed.

If it’s Running then all is well – enjoy.

Here’s a screenshot of the job being registered and then fired.

If it’s Failed and/or the modules didn’t load then don’t panic, all is not lost and there are a number of options we will need to explore.

The best way I have found to debug these scripts and functions is to open them in the PowerGUI Script Editor. It’s a really nice interface and gives you great feedback (and clues!) as to what’s occurring (a small nod to Gavin and Stacey there).

First things first.

Open up the recompile-wrapper.ps1 script in the Script Editor and press F5. What happens? Does it register the event.

If it comes up with something like a Namespace error this is usually a red herring, although it is worth confirming that the namespace you are using actually exists (use PowerGUI! :-) ).

Check that the modules you are trying to import (ipmo) exist in the folders you are specifying and that the names of the files are correct. Also check that you haven’t accidentally renamed a function by mashing the keys when trying to save it.

It’s also worth deconstructing the WMi event command line; try removing the Write-DataTable and adding a ;$dtr} instead so that when you type in Receive-Job -ID 1 (1 being your job ID) you will see the output of the Out-DataTable function (what with it being captured in the $dtr variable and all).

Like this:

PS C:\WMI> Register-WMIEvent -Namespace ‘root\Microsoft\SqlServer\ServerEvents\TEST01′-Query $query -Action {ipmo -name C:\WMI\sprecompile.psm1 -Verbose;ipmo -name C:\WMI\functions.psm1 -Verbose;$dtr = Get-SPStmtRecompile $event.SourceEventArgs.NewEvent | Out-DataTable ;$dtr}

PS C:\WMI> Receive-Job -ID 1

(The above Receive-Job command is assuming that the event is registering ok and has been fired – i.e. the job status is either Running or Failed).

If that appears to be working then check the object definition of your target table. The datatypes (all except for image, which should be substituted for varbinary(MAX)) and column names should match.

If that all seems ok then I’m afraid you have hit something I haven’t and will most likely endure a small amount of pain for what will ultimately be a huge gain. By constantly debugging, amending, pulling my hair out over the last two days I have accidentally got into PowerShell, WMI Events and WQL…..and I love it!

If you do actually see some output in the table then well done. But what if you want to include the plan definition?

Well, the only way I could think of this was very crude indeed – I’ve written a trigger on the StmtRecompile table to match the plan of the object and store it into the table. I will look to tighten this part of the process up and will re-post if I can find a better way of achieving this.

I hope you find all\some of the above useful. If you have any questions please feel free to drop me a line and I’ll do my best to answer; I’m far from an expert in PowerShell, WMI or WQL but having been through the pain of getting the above to work I may well know more than I think I do!

Thanks

Paul.

Useful links:

Shell Your Experience http://shellyourexperience.wordpress.com/2011/09/04/hey-powershell-eventing-can-you-notify-me-when-a-sp-recompile-happens/

Write-DataTable http://gallery.technet.microsoft.com/scriptcenter/2fdeaf8d-b164-411c-9483-99413d6053ae

Out-DataTable http://gallery.technet.microsoft.com/scriptcenter/4208a159-a52e-4b99-83d4-8048468d29dd

 System.Data.SqlClient.NameSpace http://msdn.microsoft.com/en-us/library/8t72t3k4(v=VS.90).aspx

SQL BulkCopy Constructor http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.sqlbulkcopy(v=VS.90).aspx

SQL BulkCopyOptions http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopyoptions(v=VS.90).aspx

Scripts

sp_recompile.psm1
functions.psm1
stmtrecompile-wrapper.ps1
stmtRecompile-SQLTable

Disclaimer: I don’t claim to have written the above scripts from scratch, I have merely added\amended functionality to make the whole process work for me.

Managing and analysing multiple trace files

Have you ever set off a profiler trace and then realised that you’ve amassed tons and tons of data and now you need to start searching for that needle in a haystack?

If the answer to this question is yes then you might be interested in a pretty cool SQL function that lets you read the trace files and, even better, import them into a table.

Consider the following SQL statement:

SELECT * INTO [db_name]..[table_name]

FROM ::fn_trace_gettable(‘D:\DBAScripts\output\sql_trace_file.trc’, default);

This will import your SQL trace files into a table, after which you can add indexes to optimise the numerous queries you will no doubt have to run against the data.

Let me just break down the syntax of that statement to try and give you a better understanding of how you can best utilise this feature.

The SELECT * INTO is straight-forward stuff so I won’t bother with that bit, let’s just concentrate on the function syntax.

::fn_trace_gettable is the actual name of the function we are using. The two parameters within the parenthesis basically specify the full path and file name and the number of trace files you want to import. By specifying default you are effectively saying that you want all of the trace files that have been created under that name. Multiple trace files of the same name will be suffixed with an underscore followed by a number, for example, a trace file called sql_trace.trc may have rolled over a number of times to create sql_trace_1.trc, sql_trace_2.trc etc. If you only want to import the first 5 trace files then just specify the number 5 instead of ‘default’. Simple.

So as you can see, you have a pretty handy little function that can import all of that trace data into a SQL table which you can then analyse the data until your heart is content.

 

My Windows “Share” cluster resource keeps failing

Have you ever had that annoying problem where you’ve created a cluster resource of type “share” (typically a shared folder on one of your SAN drives) that just keeps failing? If I had to pick a scenario I would chose log shipping from one cluster (Production) to another (Disaster recovery). So, you have your shared folder that contains your log backups but the resource keeps failing. Even worse, you’ve forgotten to un-check that box that affects the cluster group so that when the share fails it brings your entire group down? Yup, been there, done that. And if it’s a share that SQL needs access to then the chances are that you’ve put this share in your SQL cluster group and you’ve got a yo-yo’ing SQL Server service all because of a failing share.

First off, clear that check-box that reads “Affect the cluster group”. If your cluster share fails it’s really not that big a deal. Certainly not big enough to fail-over your cluster.

Secondly, I have a fix for you. You basically need to add in a key to your registry (ON BOTH NODES!! ) and here it is.

The registry path is HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\lanmanserver\parameters and you basically need to create a key called IRPStackSize and set the value to HEX 12.

That’s it. Simple.

Reboot the node (at a convenient time) and hey presto, no more failing service.

Remember to repeat on both nodes! 

 

Failing cluster service? – CheckQueryProcessorAlive

OK, here’s my first technical post and at the time of writing this, the issue is still not resolved, but I do have a hunch. Probably from spending too much time with a laptop perched on my knees :-) I do have a theory on what the issue might be and that is exactly what I’m going to blog. If you are reading this and I haven’t posted an update and you have first hand experience of this issue, please drop me a line to either confirm or dispute my theory, I will be more than glad to hear from you. What ever happens, I am planning on applying the change on Saturday 4th October (yes I work in a tightly controlled organisation) and I will post my update after that date.

The Backgound

We have a windows cluster in an active/passive configuration containing two resource groups, one for the Windows cluster resources and one for the SQL Server cluster resources. Straight-forward stuff. In the SQL group we have three disks, E:, F: and G: – they are LUN’s (SAN storage) that have been allocated to this cluster. I’m not going to divulge too much detail on the placement of what data file lives where, it’s largely irrelevant, it works and it’ll likely detract from the real issue.

The service (I’m using this term in the business sense) started to outgrow it’s home, rapidly, so extra LUN’s were allocated to the server to try and head-off a potential disaster. The disk letters were assigned, they were added as physical disks in the SQL cluster group and I added them as dependencies of the SQL Server cluster resource. Bingo, we were good to go on the new disks. I then set about adding some new datafiles to the filegroup which were physically placed on the new disks and we were off again, safe in the knowledge that I wasn’t going to get called out that night for a database that had hung because there was nowhere left for it to grow. But then the problems started.

I’d never seen this type of failure before. In very basic terms, the SQL Server service crashes but doesn’t actually fail over to the second (passive) node, it simply dies and restarts on the active node and it all happens in the blink of any eye and without warning.

The Investigation

I trawled through the (Application) Event Viewer logs and sure enough there were a whole bunch of errors, all at the exact moment of failure and all having the value MSSQL$PC001 in the source column.

The Event ID is 17052, the category is (3) and the error text of the earliest error is as follows -

[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

The detail of the failures are exactly the same, there are always 14 errors, always at exactly the same time, all for MSSQL$PC001 and they all start with the error message quoted above.

For completeness, these are the other error messages as they appear in chronological order. I say chronological order but in fact they all have the exact same timestamp -

[sqsrvres] printODBCError: sqlstate = 01000; native error = 2746; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
[sqsrvres] printODBCError: sqlstate = 08S01; native error = b; message = [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.
[sqsrvres] OnlineThread: QP is not online.
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][ODBC SQL Server Driver]Communication link failure

The Action

The very first thing we did was to look at what had changed on this service recently and the two things that stuck out were the new disks and the new datafiles.

We updated the drivers for the HBA cards (SAN fabric cards) on the two nodes to the very latest versions, as they were way out of support and the errors were leading us down the “it just loses it’s connection” path. Initially it looked as if this was the solution but after a week of service stability we started to see a reoccurrence of the same issue.

As it happened, more by total accident than design,  we had to undertake a very similar exercise for another service, new disks and new data files on those new disks and sure enough we started seeing the exact same bunch (again, 14 in total) of error messages. Too much of a coincidence? I think not.

Now we’ve added tons of disks in the past and I’ve lost count of the number of times I’ve added new data files to filegroups and stuck them on new disks so what was different?

Well, this is where my theory comes in to play.

The disks that were presented to our server were not of the same type as the previous ones, they weren’t even on the same SAN fabric. We’ve recently implemented a brand new SAN along with new disk arrays (from a completely different vendor) and have started rolling this out to all new production systems. Where this new SAN is attached to our SQL Servers, and this type of SAN alone, we are seeing no issues. Where the old SAN is attached to our SQL Servers along with the new SAN we are seeing no issues, it’s only when you have multiple data files in a filegroup and those files have been split over the two different SAN’s that you get the problem.

For details sake, I have listed the (from Disk Management) the details of the LUN’s below -

E: 150gb RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 2
F: 30gb RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 4
G: 25gb RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 3
H: 25gb RDAC Virtual Disk 17421RU-23b6403-Winscape_blk_A Lun 5
J: 150gb IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 3
K: 30gb IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 2
L: 300gb IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 4
M: 25gb IBM 2145 Multi-path Disk Device Port(2,3) Bus 0, Target ID 1, LUN 1

As you can plainly see, we have two types of disk and we have got datafiles (for the PRIMARY filegroup) spread over those disks and different SAN fabric.

So my theory, and the whole reason for this post is this – if you are sending I/O over two different types of SAN then you are likely to encounter these issues. I can’t ignore what I consider to be hard evidence so I’ve managed to convince myself that this is the root cause of the issue. Time will tell.

The Fix

I’ve got my change scheduled for Saturday 4th October, starting at 23:00 so whether or not this fixes the issue or not will only really become apparent after that date and even then I’m going to have to hold my breath for a few days. It’s so intermittent and unpredicatable it’s just not possible to pop open the champagne corks until, well, I don’t exactly know. I’d just better keep the bubbly on ice I guess.

So anyway, the proposed fix, well it’s actually very simple.

I’m going to move all of the data files (for the PRIMARY filegroup) from the RDAC virtual disks to the IBM disks and thus forcing the I/O down one SAN only. It turns out that we have plenty of spare capacity on the IBM disks to cater for all of the data files (and log files and tempdb files), it just means that, me being me, I can’t split them over multiple LUN’s as I normally like to do. Not that splitting them over multiple LUN’s in anyway guarantees that you’ve just put your data files, log files, tempdb files, other filegroup files on separate disks which is very, very nice indeed thankyou very much.

I’ll be posting an update to this, post-change, to let you know how it’s going. Hope you got through it ok, if you have any comments please let me have them, I’d be glad of some feedback.

Follow

Get every new post delivered to your Inbox.