Monitoring SQL Server Replication With Extended Events

“Hello there…”

“How do we know replication is actually delivering transactions? How long were things taking to be delivered to the distributor? Was there a lot of latency between the distributor and the subscriber?”  These were the questions our business partners were asking.

It was amazing how much I couldn’t answer. Or more accurately, how much I could quantify. It was time to get personal and just take a peek at what was going on. Traditionally I would just write some queries against the distribution database to see how many transactions were pending but it doesn’t offer much in terms of historical performance. And replication monitor just answers every question with “POTATO.”

So if I wanted monitoring solution, I’d have to make like a dad at a Rush concert and roll my own. It was time to actually start monitoring replication performance, and I was going to use… extended events. Extended events are wonderful diagnostic tools to have in your tool belt for any SQL Server occasion really. If you’ve never used them before and want to learn more about them, I’d highly suggest checking out Erin Stellato’s and Jess Borland’s blogs (or talks) on them. I learned this all from them, and they’re my go-to if I ever have questions.

I like to watch

Extended events are great; they have all the goodness of profiler except you don’t use profiler. Win/win! More to the point, extended events let you quickly and easily view, sort, and aggregate events that occur on your instances. They also have powerful filters (really, a “where” clause) to limit noise. You have way more control over what you monitor, how you store the data, and how you view and use it. This makes them perfect use to track replicated transactions, since we want to measure at both an individual level and the aggregate.

I fired up management studio and went to “New Session” looking for some replication event goodness and I found…

replicationevents

… nothing. I tried looking for events that had even parts of the name replication in it. No such thing, apparently.

After thinking about it though, it makes sense: replication isn’t part of the database engine. Replication works off of agents, which are running at the OS level and just run commands against your databases depending on what type of agents they are. The trick is knowing which events to use to capture the activity you’re interested in.

Monitoring subscription performance

The most common metric you’d probably be interested in is how fast transactions are being applied at your subscriber, or how long is each command taking? To do that, you’ll need to set up an RPC Completed event session.

Why RPC completed? Well, let’s find out by monitoring it. Let’s build a new extended event session on the sqlserver.rpc_completed event. My sample session code is below. To cut down on noise, I am also adding some filters: I am going to just capture events from the subscriber database and I am going to filter requests that come from my distribution agent account. The output will be stored in a file name (which defaults to wherever your SQL Server log files live).

All replication agents use rpc to issue their statements to your server, even if you change the delivery method of transactions from stored procedures to INSERT/UPDATE/DELETE statements. Only filtering on RPC completed leaves you lot of data to sift through, however. For example, let’s run a simple update statement against AdventureWorks2014 (which I am replicating to the subscriber database mentioned above).

We’re just going to update one row on our publisher. To view the replicated transaction, let’s view our extended event session data:

Buried in our event file output (along with a ton of replication barf) is our statement. The actual update that gets delivered to the subscriber is the row right above the highlighted blue row (you might want to click the picture to zoom in a bit, it’s pretty wide).

You might be wondering why it’s calling sp_executesql  and then calling a stored procedure inside of it. That’s because of the default way transactional replication delivers a transaction to a subscriber. Like I mentioned above, when you first set up a publication and add a table article, the default behavior is for replicaiton to automatically create 3 stored procedures for it: one that handles inserts, one that handles updates, and one that handles deletes. While there are other ways that replication delivers transactions to subscribers (and we’ll see some in another blog post that’s coming), this functionality gives us a very convenient way to capture just the actual transactions that come across. We’re going to modify our extended event to look like this:

Since we know the default naming convention of the stored procedures that get created (sp_MSins, sp_MSupd, and sp_MSins), we can use them as a filter! This change will let us filter on the statement being run against the replica database, by distribution agent, to capture all data changes being conducted by replication. Again, I’ll be going into a much more detailed blog post later about how and why transactions get delivered how they do, but for now this should help you see each transaction. If we make sure the event is started and make some more changes to our data, here’s what we’ll see now (clicking the picture will enlarge it for you):

Once we have our events filtered out, we can begin to look at how long each statement is taking. By slightly modifying our chosen columns to include duration, physical_reads, and writes, we can then group by the event name column (right-click the name column, click “group by”), then add aggregations to each of the numeric columns (click each of the new columns, right-click, add aggregation, AVG). We then can get something like this:

And there you have it! How long is each update taking, on average, in milliseconds. Is this useful or representative? Not really, for such a small sample size. However if you have something like this running for a period of time you can filter for date ranges and look for trends. If you see that number cropping up higher, you’re probably heading for some latency problems, or as I like to call it, “scotch o’clock.”

Distribution database performance

Transactions arriving at the subscribers are only part of the whole replication experience, though. Sometimes you can encounter some nasty bottlenecks in the distribution database, too. There’s three events that I like to set up to monitor the distribution database. First, we’ll use a modified version of the same extended event we used above, but we’re going to add two more: one that watches for blocking, and the other that looks at wait stats.

Here they are. We’ll create them and then talk about them.

The first session will find things that are hitting your distribution database. Nothing of a surprise here, it’ll let you see when commands hit your distribution database. The other two though are very useful. One of the biggest killers of distribution is blocking, which can get really really bad on busy distribution systems. The built-in SQL agent job, “Distribution Cleanup” can be notorious for this. The third event session will help us better understand what bottlenecks we’re seeing in distribution via wait statistics.

I should point out though, that in order to make your block process report show anything, you need to modify a server configuration setting: the blocked process threshold. This is number of seconds that has pass for something to show up on the report. What should you set this to? Whatever you think an “acceptable” level of blocking is, or, in this case, how often do you want something to trigger this event. If you find you’re getting too much noise, I’d up the time.

I’m gonna wreck it!

Now let’s wreak some havoc. We’re going to purposely cause some chaos on the distribution server. We’re going to flood the log reader with transactions, and then we’re going to run our distribution cleanup job. Then we’ll check out our events. After I make sure my extended events are running, here’s the code I’ll run on my published database:

Yeah, that’s just as nasty as it looks: we’re going to loop through this series of dynamic SQL statements 500 times, each time changing the values of odd rows and even rows to ‘Yes’ or ‘No’ (note: I added that column to the table myself). Each statement executes twice in a transaction, the commits. Why? It has to do with how replication will log the transactions. I want separate batches, and I want them to come in one right after the other.

This demo has the unfortunate side effect of greatly increasing the size of our distribution database (I filled up my VM’s hard drive while testing this), but it’ll be a good test because it’ll make all our replication processes run long.How do we know it’s working? Simple, you open replication monitor and check out pending transactions.

Good job replication monitor, you tried your best. Classic. It’s not showing us anything because the queries to retrieve those numbers are either really slow or blocked. Why? Here’s the output of our wait stats report:

I am again aggregating by wait type and averaging duration. We can see there’s a lot of long blocking waits in distribution, which could be bad in a lot of ways: either old replicated commands are taking too long to clean up and are blocking other commands, or maybe it’s the other way around. We can also use the blocked process report (not shown here because the output is pretty verbose, but I suggest you play with it regardless) to try and correlate the blocking/blocked process ID’s to see what was happening. PS: I hope you like XML!

We also see some dreaded CXPACKET waits in there, I wonder what that’s about (spoiler alert: it’s replication monitor trying to determine how many commands are left to deliver).

In the real world

So now that we’ve seen some extended event sessions that can help you understand and watch replication do it’s magic. Should you set up these sessions and run them all the time? Probably not; storage considerations of the log files aside, it may not do you a lot of good to collect too much info and have to sift through it all. Rather, I use events like these as sniff tests if I suspect replication is running behind, or if I think my distribution database is misbehaving. I like to start them up during busy periods to check on things, or if I know we’re doing a large data update and we want to measure throughput.

So that’s it! I’ll have some more replication articles soon that will build off of these sessions so we can talk more about the internals of transactional replication. Do you have a favorite series of extended events that you’re using today to monitor your replication performance that you want to share? Hit up the comments let me know.

 

2 thoughts on “Monitoring SQL Server Replication With Extended Events

  1. Pingback: Replication Extended Events – Curated SQL

Comments are closed.