I hate that phrase, but in this case it pretty much sums it up best: when we think about production monitoring, what do we normally think about? Is a server online? Can we connect? Is the SQL Server Agent available? How’s that drive space looking? What about CPU usage? There are lots of great tools out there to monitor SQL Server and the underlying OSes we run it on. Typically, we tend to think of “infrastructure monitoring” as “production monitoring.”1
Honestly? That’s a pretty narrow focus, and a potentially dangerous one at that. As a DBA you might not be completely in tune with what’s going on over in developer land: they may be releasing code that isn’t working as they intended. Take a look at the following table as an example:
```sql
CREATE TABLE Customer.Usages (
    UsageID bigint identity(1,1) not null primary key clustered,
    BusinessEntityID int not null,
    InvoiceID int not null,
    UsageUnitID int not null,
    AccountID varchar(15) not null,
    AmountBilled numeric(8,2) not null,
    UsageCreateDate datetime not null,
    RemittanceID int not null
)
```
We have a table that is storing some customer usages of… whatever our fictional business uses. There is probably some code that interacts with this table: maybe we do bulk processing or an ETL to load data to this table, or we have people manually interacting with an application that updates these particular rows. Since this table is (probably) integral to billing our customers, it’s a big deal to keep accurate and up to date.
As long as the code the developers are writing isn’t throwing .NET exceptions or SSIS errors or <insert whatever errors you’d see in an application here>, everyone is pretty happy. But how would we identify a problem with the data that goes into this table? Say for instance we have a bug somewhere that isn’t putting the correct account number in the table, how would we know?
Tired of these manager-isms yet? Sooner or later someone is going to find this bug. Developers are going to point fingers at the QA team for not testing enough, QA is going to blame the developers and/or business partners for not telling them about a particular use case, and you can be sure you’ll be called in to help fix the problem. Depending on how your systems are used, you might also be looking at some downtime to correct all the bad data and fix the underlying code issue. Ouch.
Which is why when I say “production monitoring” I’m not just talking about hardware and software. We need to treat our data as part of that equation too. One way to do that is to write an audit stored procedure (or two) that looks at critical tables and makes sure “mistakes” aren’t being inserted, and if they are, start throwing alerts saying so.
Using the example above, let’s say the root cause of the data issue is that our account numbers are not passing validation because of a bad check digit on the end of each one. So, we write a stored procedure called “Monitoring.ValidateUsageAccountIDs” that looks at all the usages in our table and checks that each one’s AccountID successfully joins to the accounts table. For simplicity’s sake, let’s say the stored procedure looks like this:
```sql
CREATE PROCEDURE Monitoring.ValidateUsageAccountIDs
AS
BEGIN
    SELECT AccountID FROM Customer.Usages
    EXCEPT
    SELECT AccountID FROM Customer.Accounts
END
```
This is overly simplistic but you get the idea: we’re looking for all the AccountIDs from the Usages table that don’t exist in the Accounts table2. In practice, it’s probably more complex. The question is what do we do with it: how do we run it and how do we have it alert us if there is a problem? If only there was some technology that would let us do that…
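As a sketch of what “more complex” might look like (the query shape is an assumption; it uses only the columns defined in Customer.Usages above), an audit that returns enough detail to act on could group the orphaned AccountIDs and show when they first appeared:

```sql
-- Sketch of a more detailed audit: for each AccountID in Customer.Usages
-- with no match in Customer.Accounts, return how many usage rows carry it
-- and the earliest date it showed up. Column names come from the table
-- definition above; everything else is illustrative.
CREATE PROCEDURE Monitoring.ValidateUsageAccountIDs
AS
BEGIN
    SELECT u.AccountID,
           COUNT(*)               AS BadRowCount,
           MIN(u.UsageCreateDate) AS FirstSeen
    FROM Customer.Usages u
    WHERE NOT EXISTS (
        SELECT 1
        FROM Customer.Accounts a
        WHERE a.AccountID = u.AccountID
    )
    GROUP BY u.AccountID
END
```

A count and a first-seen date make it much easier to tell a brand-new bug from a long-standing one when the alert fires.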
Of course, I am going to use PowerShell to get this done. The tricky part is: how can we call a stored procedure from PowerShell? It’s not as hard as you think. We’re going to load up the sqlps module and run a simple little one-liner. Please note, you’ll need to have imported the module first; just run Import-Module sqlps.
```powershell
((Get-ChildItem -Path SQLSERVER:\SQL\LOCALHOST\DEFAULT\Databases | Where-Object {$_.Name -eq "DatabaseName"}).ExecuteWithResults("EXEC dbo.TestProc").Tables.Rows | Measure-Object).Count
```
Okay, so maybe it’s not a simple little one-liner. Let’s break it down:

- Get-ChildItem -Path SQLSERVER:\SQL\LOCALHOST\DEFAULT\Databases uses the sqlps provider to list every database on the instance.
- Where-Object {$_.Name -eq "DatabaseName"} filters that list down to the one database we care about.
- .ExecuteWithResults("EXEC dbo.TestProc") is an SMO method on the database object that runs our T-SQL and hands back the results as a DataSet.
- .Tables.Rows pulls the rows out of that result set.
- Piping those rows to Measure-Object and taking .Count tells us how many rows came back.
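If the SMO route feels opaque, the sqlps module also ships the Invoke-Sqlcmd cmdlet, which gets you to the same row count more directly. A sketch (the instance name, database name, and use of our audit procedure here are placeholders for your own):

```powershell
# Alternative sketch using Invoke-Sqlcmd from the sqlps module.
# Instance and database names are placeholders.
Import-Module sqlps

$rows = Invoke-Sqlcmd -ServerInstance "LOCALHOST" `
                      -Database "DatabaseName" `
                      -Query "EXEC Monitoring.ValidateUsageAccountIDs"

# Measure-Object handles the zero-row case cleanly, where .Count on
# a single returned object would not.
($rows | Measure-Object).Count
```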
Let’s say that in the audit above, if ANY rows are returned we need to start sounding alarm bells. You could do a lot of things: you could use built-in PowerShell cmdlets like Send-MailMessage and craft a subject and body to alert a distribution list. Or you could integrate the script with something like System Center and create a custom monitor script that leverages the rule-based alerting already set up for other things in your environment. Since we’re calling this from PowerShell, all we really need is access to the module, and we can run and/or schedule these monitoring scripts wherever we want!
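As a sketch of the Send-MailMessage route (every address, server name, and the SMTP host below are placeholders you’d swap for your own):

```powershell
# Sketch: alert a distribution list when the audit returns any rows.
# All names, addresses, and the SMTP host are placeholders.
Import-Module sqlps

$badRowCount = ((Get-ChildItem -Path SQLSERVER:\SQL\LOCALHOST\DEFAULT\Databases |
    Where-Object {$_.Name -eq "DatabaseName"}
    ).ExecuteWithResults("EXEC Monitoring.ValidateUsageAccountIDs").Tables.Rows |
    Measure-Object).Count

if ($badRowCount -gt 0) {
    Send-MailMessage -To "dba-alerts@example.com" `
                     -From "sqlmonitor@example.com" `
                     -SmtpServer "smtp.example.com" `
                     -Subject "ALERT: $badRowCount unmatched AccountIDs in Customer.Usages" `
                     -Body "Run Monitoring.ValidateUsageAccountIDs for the full list of bad AccountIDs."
}
```

Drop that in a SQL Agent job or a scheduled task and you have data monitoring riding alongside your infrastructure monitoring.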
The possibilities are endless. Just remember that, like any query you put into production, you’ll want to make sure you’re not overdoing it: if your monitors are resource-intensive or could be blocked by other processes, be really careful before you jump in headfirst. Work with your business partners and developers to find out what actually needs monitoring. It’s one thing to be a hero in a crisis, but it’s another thing altogether to prevent one!