Carrying on my series of posts on SQL Server and Disk IO it’s time to cover the old stalwart that is perfmon (referred to in Windows as Performance Monitor) which I know anyone who has delved into any Windows performance issue will have some familiarity with, so I’m not going to cover what it is or how to use it; one word: Google.
The principal reason for this post is to add my own experiences of using perfmon to track down IO issues.
I’m assuming Windows 7 here, so adapt these instructions according to your flavour of Windows:
Go to Start > Run and type perfmon and hit Enter to launch it. This will start perfmon (Performance Monitor) under the Performance > Monitoring Tools > > Performance Monitor branch, which is exactly where we need to be.
This is where we create the template trace to monitor SQL Server’s IO performance as far as Windows is concerned. The mechanics of creating and managing perfmon templates and their output is well documented all over the wwitblo (world-wide-inter-twitter-blogosphere), so straight to the important counters.
Press Ctrl+Shift+N to bring up the Add Counters dialog. Scroll down to and expand the Physical Disk section. The counters we need are:
- Current Disk Queue Length
- Disk Read Bytes/sec
- Disk Write Bytes/sec
- Disk Bytes/sec
- Disk Reads/sec
- Disk Transfers/sec
- Disk Writes/sec
- Avg. Disk sec/Read
- Avg. Disk sec/Write
- Split IO/Sec
An explanation of what these counters reveal is available if you click on Show Description, but I’m going to add my own 2c, otherwise there ain’t much point in me writing this post.
Current disk queue length
This will tell you how many pending IO requests are queued up against that disk and a sure sign of IO pressure on local/directly attached disks. If your disks are on a SAN or on flash memory, then these numbers cannot be relied on directly, but, if you see high queue length numbers when IO heavy queries/jobs are being run and high numbers being reported by some of the other disk IO related counters then this should pretty much confirm an IO subsystem issue (i.e. don’t be too quick to dismiss this counter). Furthermore, as this shows the actual queue length at that point, and not an averaged out figure, I find it more reliable provided the sample rate is low enough.
Disk Read Bytes/sec & Disk Write Bytes/sec & Disk Bytes/sec
This will show the disk read throughput. In the event of an IO stall, this counter (and usually the corresponding Disk Write Bytes/sec counter) will show a big drop off in read/write throughput. Disk Bytes/sec is merely a sum of Read/Write Bytes a sec, but I think seeing them independently is more useful as it can show up specific disks at e.g. having fantastic read speeds but poor write speeds.
Disk Reads/sec, Disk Writes/sec & Disk Transfers/sec
These show what they say. I’ve seen some documentation on the wwitblo (from a major SQL monitoring software vendor, no less) stating that these counters can be ignored as (and I’m now paraphrasing here) SANs can be quite complicated and do lots of funky things under the covers, so don’t worry your pretty little head about this counter too much.
I’m inclined to disagree.
These figures are damned useful. Firstly, they’ll show you if drives are being over-utilised by reads or writes; if you’ve got a db that’s e.g. part reporting part OLTP and you suspect a particular filegroup is being heavily utilised by some reports, this should show up as heavy read traffic on the partition that hosts the filegroup those reports are hitting, which allows you to look at ways of tweaking performance by e.g. adding files or moving the filegroup to faster disks etc.
But these counters are even more useful than that, as I’ve used them to prove a SAN was no longer up to the job of running the db it was hosting, because the Total for Disk Transfers/sec was regularly hitting the ceiling of the max IOPs the SAN in question could support, and this was manifesting itself in all the classic ways: slow reindexes, slow DBCCs, IO stalls in the SQL error log, slow backups, replication latencies, as well as general db sluggishness as all transactions have to be hardened to the transaction log (don’t forget confirmation from wait stats and DMVs as well).
So, get confirmation of how many IOPs your IO subsystem can support and then use this counter to confirm an IO bottleneck related to the IOP throughput, or use it to proactively monitor your storage subsystem so you get early warning of pending bottleneck. And yes, within a major enterprise there will be multiple SQL Server instances (or other DBMS flavours) plugged into the same SAN, but again, and with more effort, the total IO of these instances can be added up to get a rough idea of how many IOPs the SAN is having to contend with.
Avg. Disk sec/Read & Avg. Disk sec/Write
This is the average amount of time it took to read or write the IOs occurring at that time and this depends greatly on how frequently the data is being sampled and the number of IOs occurring at that time, so a slight increase in the average can mask a significant slow down. Keep an eye on the peaks as well as the average which you want as low as possible. Ideally less than 10ms if your disks are traditional spinning disks.
If your partition alignment is not correct, this will show additional IOs that were required to complete a read or write. This was prevalent on Windows Server 2003 and earlier and can be a significant drag on throughput
With these counters selected we need to select the disks we want to track those counters against under Instances of selected object:
If the database drives are identified by drive letter then that’s simple enough to do, but if mount points are in use then you’ll have to go by the drive number, and you’ll find the drive number (or LUN) via the Disk Administrator tool. Once you’ve identified the LUNs for the database drive(s)/mount point(s) of interest, select them and click OK.
Instead of incorporating the disk counters into an existing template I prefer to use a separate perfmon template just to track IO performance for reasons I’ll explain further down.
DIFF BETWEEN PHYSICAL COUNTERS AND LOGICAL COUNTERS
The difference between Physical Disk and Logical Disk counters is basically that Logical Disk will show the aggregated view of a drive in the case of e.g. drive D: being spread over multiple disks (e.g. 4 (D:), 7 (D:), 9 (D:)) or a drive hosted on a disk split into more than one partition like C:(0) and D: (0). Physical Disk will show the raw underlying drive or mount point, and I always use that, although MS would recommend looking at Logical Drive first and then drilling down.
A separate template to monitor disk counters
The reason I prefer to put the disk counters in their own perfmon template is down to the sample rate. The default perfmon sample rate is 15 seconds and if you stick with that sample rate all the disk counters with the Avg abbreviation will be averaged over that 15 second sample interval.
Why is this a big deal?
Well, it depends (of course!).
Are you trying to track down an IO stall that lasts for 5-6 seconds or are you trying to track down slow IO that lasts for a couple of hours?
If you’re tracking down a 5-6 second IO stall you’re not going to have much luck sampling every 15 seconds as any significant issues lasting 5 or 6 seconds may get almost completely smoothed out when they are averaged over 15 seconds.
For a longer lasting issue you may be able to go for a longer sample interval, but from personal experience I’ve found short sample intervals to be much more revealing, especially when tracking SAN performance as it can be so spikey.
By way of example, a few years back I spent a lot of time tracking down an infrequently occurring IO stall on a SAN for a critical low-latency trading platform.
The IO stalls would typically last @ 5 seconds and sometimes just a couple of seconds and I could never trap them until I got the perfmon sample interval down to 1 second. That’s how low I had to go; and we did find them and prove it was a SAN issue, although we had to enforce the co-operation of the SAN team at the proverbial gunpoint. Good luck with that when the time comes.
The performance impact of this low sample interval was negligible, although the perfmon trace files, inevitably got rather large rather quickly and a close eye had to be kept on them.
Finally, if this is a SAN that is being tracked then the SAN admins are going to have their own ‘monitoring’ which may not show a problem.
From personal experience this will be because the SAN ‘monitoring’ will be at a very different sample rate – usually in the region of minutes, not seconds. Therefore, to accurately pick up IO issues you’ll have to persuade your SAN admins to use the same sample interval perfmon is using otherwise we’re comparing apples to pears; your perfmon date will show huge spikes or dips whilst the SAN ‘tool’ will show a wonderful flat line as it averages everything out over a massive sample interval.
I’ve actually had the head of the SAN storage team argue with me that the 30 minute sample interval they used to monitor their SAN would pick up a 2 second IO stall. I kid you not.
Viewing the data
Some quick notes on reviewing the data gathered in the perfmon file(s). I’m assuming a perfmon file has been loaded into perfmon ready for viewing.
- Right click on the graph area in perfmon, go to Properties, click on the Graph tab and set a sensible upper limit for the Vertical scale. The default is 100, and if you’re looking at e.g Disk Read Bytes/sec you might want to up that into the thousands. Select the vertical grid tick box as well to make it easier to read off the numbers.
- When viewing the data, press Ctrl+H to highlight the selected counter in the display to make that counters data stand out. Makes it much easier when stepping through multiple counters to isolate their data values.
- When looking at a perfmon file which covers a wide time range it can be easier to narrow down that time range by right-clicking the graph area, selecting Properties, clicking on the source Tab and scrolling the Time Range scroll bar. Alternatively, whilst looking at the graph click on the start time you’re interested in and drag your mouse across until the time frame you’re interested in is highlighted, then right-click and select Zoom To.
- Watch out for scaling. If you’re comparing multiple values, select them and then right-click on any of the selected counters and select Scale Selected Counters as it can be very easy to misread a counter if the scaling chosen for it either exaggerates or understates its true value.
Working through an example
Recently I spotted the following slow IO warning on a production server:
2013-11-04 03:07:00.130 spid5s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [XXXXX] in database [YYYYY] (n). The OS file handle is 0x0000000000000D74. The offset of the latest long I/O is: 0x00002449860000
I loaded up a perfmon template from that period:
A 14 ms averages doesn’t look too bad does it? That 10.144 second max can be ignored (fortunately!) as it coincided with scheduled work on the SAN at that time. Zooming into the 03:07 time period in question paints a slightly different picture for the IO numbers:
We can see that in the couple of minutes leading up to the slow IO warning in the SQL Server error log that both reads and write latencies are spiking up.
I’ve left Avg. Disk sec/Read highlighted and we can see the average IO has jumped up from 14ms to 50ms which is well into poor performance territory. I’ve also added a third counter (Disk Transfers /sec) so we can see how much IO was going on (just against that partition) at that time, and for that period there were 3,016 disk transfers per second with a peak read latency of 0.8 seconds and write latency of 1.5 seconds.
Bearing in mind the Avg. Disk sec/Read and Avg. Disk sec/Writes are just averages (and will therefore be averaging out the true latencies of those 3 thousand plus IOs a second). it’s clear the disk subsystem was struggling at that time and, as the error message indicated, at least one of them did not return for at least 15 seconds.
Now, I know that on this system that there just happened to be a reindex running which swamped the IO path at that time, and I know from experience these slow IO warnings coincide with just these sort of disk intensive jobs.
Warnings coinciding with the times these jobs are running are usually the first sign that the infrastructure of the system in question is starting to creak and making plans for pre-emptive action is advisable, or this latency will start creeping into the working day of the system and start affecting users, and guess who they’ll come running to? I’ll give you a clue, it won’t be to whoever is responsible for that disk subsystem.
After many weeks of trying to track down intermittent IO stalls at a financial institution I used to work at the problem turned out to be a networking issue, but not the normal networking kind of issue.
The first clue was that the perfmon data and the SAN performance data (even when the SAN data was gathered at a sensible low interval rate) never matched. This indicated the problem was between the host and the switches/HBA cards, but these checked out OK as well.
This left only one alternative; this financial institution I worked for could afford stretched Windows clusters; e.g. Node 1 would be in the data centre in the basement, and Node 2 was @ 15 miles away at a DR centre. But Windows clusters need shared storage, don’t they? They do indeed, and the way to make Node 1 and Node 2 think they’re looking at the same SAN is to mirror the SAN over underground dark fibre links. Every write is replicated with microsecond latencies.
This added about 3-4ms to every write operation but that was more than acceptable, especially because of the DR protection it gave us (without resorting to database mirroring/log-shipping/SQL Server replication etc etc), but I digress, and, as you’ve probably guessed, this is where we tracked the issue down to; most of the unresolved IO stalls were down to glitches in these links which the networking team were able to confirm when we gave them precise times and dates from the perfmon data.
Although these glitches happen from time to time – usually when workmen dig up roads and punch through the cables – the networking team never thought to mention it outside their usual circle because, up until that point, none of the systems at that institution were that sensitive to
As you might expect, this has now changed, along with their notification procedures.
Incidentally, when the workmen cut through the cable everything automatically switched to a backup circuit within milliseconds but when that bubbles up to the Windows cluster and those lost IOs have to be re-sent, those milliseconds turn into 7-8 seconds.
If you’re looking after a system which has this kind of technology (the institution in question was using EMC GeoSpan clustering) this might be something worth checking out, if you haven’t already done so.
Don’t look at the perfmon numbers in isolation.
If you’re seeing latency spikes, check the number of disk transfers taking place at the same time, in case those IOPs are higher than your disk subsystem can sustain. At the same time, check the wait stats and DMVs as well for corroborating data any issues with those drives.
The Useful Links section below will help with getting this extra information.