This is hardly a new topic and whilst there are a lot of articles out there instructing us on how to stripe tempdb etc I feel there’s a bit of a gap on exactly how to identify a problem with tempdb in the first place and particularly on finding out how much of your overall IO tempdb is responsible for.

I cover that in this post but it is also covered more extensively in my SQL Server and Disk IO series of posts, as the troubleshooting and monitoring principles covered in them are just as valid for tempdb performance monitoring as they are for application databases.

Also, as this is the concluding post in that series I could hardly avoid writing about it!

If an application database is poorly configured in terms of disk layout then generally only that database will be affected by e.g. queries utilising a lot of disk IO performing poorly. If tempdb is poorly configured in terms of disk layout the whole SQL Server instance will suffer performance problems for the simple reason that tempdb is effectively a workhorse database that does a lot of work in the background.

Some of this background work tempdb is obvious, some of it is not so obvious:

  • temporary tables (explicitly created in queries/procs etc)
  • queries that sort (order) data (temporary tables will be created under the covers to sort the data)
  • worktable spools (when a query cannot be processed entirely in memory)
  • Service Broker
  • app databases utilising Snapshot Isolation (to hold versioned data)
  • linked server queries
  • SQL Server internal processes

This is by no means a complete list, and, in all likelihood, as features are added to SQL Server the list will grow rather than shrink.

One way of thinking about tempdb is that it’s like an OS swap file. When the OS runs out of memory the swap file becomes the default dumping ground and performance slows and in the same way a poorly sized and configured swap file can cripple a server a poorly configured tempdb will cripple SQL Server

The most common issue related to tempdb that every DBA will hit sooner or later is space usage when a runaway query maxes out tempdb and it runs out of space. Those kind of issues are pretty easy to solve and well documented all over the web; check out the links at the end of the post.

This post is going to focus more on the performance issues and they can usually be pinned down to the physical database design. If you’ve been reading the previous posts on this SQL Server and Disk IO series you’ll know how critical it is to size a database well, split it up into multiple files and place it on performant disks.

Unsurprisingly, the same principles apply to tempdb. But from my time working at Microsoft it never ceased to amaze me how many clients went to a lot of trouble with the physical design of their application databases but paid no regard to applying the same principles to tempdb, so it would frequently be left on default settings in C:Program Files Microsoft SQL Serveretc etc.

But I don’t know how much IO my tempdb is doing

Tempdb performance and its IO load can be easily checked via the Microsoft SQL Server 2012 Performance Dashboard Reports as well as my posts on monitoring SQL Server IO.

Once the reports are installed, open up the main dashboard report and from there click on the Historical IO link. This takes you to the Historical IO Report, et voila! You can now see exactly how much IO your tempdb does in terms of reads, writes, the time taken to read and write and, crucially, the percentage of all IO that each database on that SQL Server instance is responsible for:
If tempdb is taking a significant percentage of that IO it makes sense to make sure it’s optimised as well as it can be.

For older versions of SQL Server check out this Codeplex project for dashboard reports that run from SQL Server 2005.

Sizes and stripes and trace flags

We know how much IO tempdb is doing and we want to apply the same tuning principles we would apply to any other database, starting with striping it across multiple files. There’s a lot of material out there on why this is ‘a good thing’ so I’m not going to regurgitate why here.

The first thing to check is that tempdb is not on default settings and in the default location (unless that default location has super-low latency and can handle a lot of IOPs).

Out of the box tempdb (up until at least SQL Server 2012) will be sized with one 8 MB data file and one 512 KB log file. If your SQL Server installation consists of one application database containing one table with one row then you’re good to go. The chances are your SQL Server installation will be a little more substantial than that so tempdb needs to be sized up.

In SSMS right-click on tempdb and select Properties and click on Files under the ‘Select a page’ section of the ‘Database Properties – tempdb’ dialog box:

This will list the files, their location and initial sizes of tempdb. If there’s only one data and log file at the default sizes (under Initial Size (MB)) we have work to do. Note: the SSMS GUI will round-up a default log size of 512KB to 1MB. To prove this, run

sp_helpdb tempdb

in a new query window and review the output for the tempdb log file and it should show the size as 512MB.

Adding stripes is simply a question of clicking the Add button here and setting the appropriate size, autogrow, path and filename settings.

You’ll need to provision some fast storage for the additional tempdb stripes first. Ideally, use flash or SSDs or fast local disks. If you’re at the mercy of a SAN administrator, request something akin to RAID 0 as you don’t need to worry about mirroring/parity as tempdb is rebuilt on SQL Server startup. Obviously, if there’s a recurring issue it means there’s a disk problem so that will still need addressing, but from a DBA perspective there’s no recovery options for tempdb other than restarting SQL Server.

To work out how many stripes you need keep any eye on the guidance Microsoft publish as the advice changes over time. Anything I put down now will probably be out of date in a couple of years. I start at one quarter of the number of cores on the system if it has up to 64 cores and (personally) have never found the need to increase it beyond half the cores on systems with more than 32 cores.

On high throughput systems that are very tempdb heavy you may need to up this.

The SQLCAT team (or whatever they are called this month) recommend one stripe per core but these were extreme-scale systems so IMHO not practical for the vast majority of systems out there.

Establishing if you need more cores is not that difficult as this will manifest itself as something called PFS page contention.

Add stripes and enable trace flag 1118 as described in the above article and many others. There’s no downside to it and it’s a de facto step I take on all the SQL Server installations I’m involved with. Check out this Microsoft KB article if you’re still not sure.

The elephant in the room is the size of the stripes, and therefore the size of tempdb. There is no right answer for the size of tempdb. It depends on the load. The load is not necessarily how busy the system is right now, or even how busy it is at peak loads, it’s how big it needs to be when the most tempdb intensive loads are placed on it. E.g. overnight ETL jobs. If these jobs grow your tempdb to 500GB at 3am whereas for the previous 23 hours of every day your tempdb never has more than 50GB of data in it then your tempdb needs to be sized at 500GB in order to avoid autogrows which will cripple the performance of that overnight job as it incrementally autogrows to 500GB the next time it runs.

The same sizing principle applies to the transaction log file size; make it big enough to cope with the demands of the most intensive jobs to avoid autogrows.

There’s nothing to stop you looking at tasks that cause excessive tempdb (see the ‘Useful Links’ section at the end of this post) usage as there may be ways to optimise those tasks and reduce that tempdb usage, but sometimes we just have to accept that some of these procedures or jobs will just be heavy tempdb users so we divide our stripe sizes appropriately so they add up to the size of database that we need, plus a reasonable overhead both within the overall size of tempdb plus free space on the underlying volume the stripes reside on.

This leads inevitably onto the ‘how many volumes?’ question. Split the data files equally across multiple volumes is the simple answer, if it’s traditional spinning disks. If you’re fortunate enough to have flash storage which can handle anywhere from 300 to 500K IOPs one volume should suffice for most installations. If it’s a high throughput, latency sensitive system stripe across multiple flash devices. Besides, if tempdb is placed on a SAN (remember tempdb on local disks for a failover clustered instance of SQL Server is supported from SQL Server 2012) those multiple volumes you have requested might be placed on overlapping spindles anyway, or shared with other heavy usage systems, unless you happen to have a SAN admin who really knows their stuff (I think you’re more likely to stumble across a unicorn) or have the luxury of a dedicated SAN.

For failover cluster instances of SQL Server don’t forget to ensure that all the volumes and paths used for striping tempdb are present on all the nodes in the cluster, otherwise tempdb creation will fail and SQL Server will not come online on that node.

SQL Server will have to be restarted for any changes to the striping to take affect.

If SQL Server fails to start after these changes it’s almost always down to a typo in the path and that usually means having to force SQL Server to start with certain command line startup options as documented in this KB article. The article covers recovering a suspect tempdb but it shows what you need to do to restart SQL Server and bypass the creation of tempdb in order to get you into a position where tempdb paths can be corrected.

An aside: sometimes it really is worth finding out what’s using tempdb

On a system I used to manage we investigated tempdb usage as tempdb was generating a lot of slow IO warnings in the error log and the system was really struggling. The previously mentioned dashboard report revealed that 60% of all IO on the system was being generated by tempdb.
By checking the sys.dm_exec_requests DMV via some home-brewed stored procs (which I might one day blog about as they’ve proven very useful to me) and the use of Adam Machanic’s excellent sp_whoisactive proc we isolated the issue to one stored procedure that was doing some string manipulation and being called about 20 times a second. Not only that but it was explicitly creating and dropping a temp table whenever it was called.
The explicit tempdb creation and deletion was removed first and the string manipulation was eventually placed into a CLR function. Tempdb was then responsible for @ 10% of overall IO of the whole system. Not a difficult change and a big win on overall system performance and throughput.

So to summarise…

  • use the dasboard reports or the blog posts summarised in my SQL Server and IO series of posts to get an idea of how tempdb is performing
  • modify the tempdb configuration, if required, by carefully creating stripes with appropriate sizes and growth settings (don’t forget to test on every node if SQL Server is clustered)
  • check the dashboard reports or DMVs and perfmon stats to get objective metrics on any improvments
  • don’t forget to regularly review these metrics to ensure tempdb is coping

Useful links

Troubleshooting Insufficient Disk Space in tempdb

SQL Server (2005 and 2008) Trace Flag 1118 (-T1118) Usage

Correctly adding data files to tempdb

Finally got some breathing space for my first post of the year!

In order to get a major client through Christmas which is traditionally their busiest period of the year (where load normally goes up five-fold), we embarked on a major platform refresh where both the back end SQL Server DBMS was upgraded from SQL 2008R2 to SQL Server 2012 and the hardware was upgraded to 2 x Dell PowerEdge R910s with 4 x 10 (physical) cores and 512GB of RAM in each server.

In order to take the disk I/O bottleneck out of the equation we decided to take traditional spinning disk technology out of the equation so the SAN was kicked out sob and a shiny pair of 2.4TB Fusion-io Duos were added to the server spec sheet.

This would allow us to use AlwaysOn Availability Groups in SQL Server 2012 which have the major upside of allowing the secondaries to be readable instead of being completely offline, which, of course is why having Fusion-io drives in both severs really comes into its own.

But I digress, this post isn’t about boasting about new kit or Fusion-io’s hardware but rather it’s a warning about choosing the right PCIe slot to place the Fusion-io cards because although the R910s have plenty of PCIe slots they don’t all run at the same speed and our main concerns when adding an older 785 GB ioDrive2 Fusion-io card we were recycling from the old kit was that the slot had enough power and there was room for airflow between the shiny new 2.4TBs that were already installed and these cards we were adding in from the older kit.

When the servers were powered up there was nothing untoward showing up in event logs etc, no bells or sirens and the card was initialised and formatted, and could have been used without issue. It was only when the fio-status command was run as a final check (with the -a option) that any mention was made that all was not as it should be, as the the screenshot below shows:

And there you have it, a discrete little “The bandwidth of the PCIe slot is not optimal for the ioMemory.” warning for the newly inserted ioDrive2 card.

We quickly realised that the warning was due to the fact that the newly inserted card was in an x4 slot and as soon as we stuck it in an x8 slot the warning disappeared and we were good to go. The Fusion-io docs do mention in passing an x8 slot to be fair, but it’s easily missed as more focus is placed on the height of the slots, and if you do have a server with multiple speed slots make sure you pick the right slot as there’s no ‘loud’ indication that you’re not getting the best out of it, and the x4 and x8 slots look quite similar so check your server motherboard docs in advance.

I’m not sure what the performance penalty would have been, but with one of the cards using only half the potential bandwidth I didn’t want to find out.

So, if you’re migrating to meaty new servers be aware of the proliferation of multiple speed PCIe slots and make sure you pick the right speed slot, especially when dealing with multiple Fusion-io cards, and if you’re considering new hardware you might want to add a check to ensure there are enough correct speed slots for your needs.

Everything you need to know about PCI Express

Carrying on my series of posts on SQL Server and Disk IO it’s time to cover the old stalwart that is perfmon (referred to in Windows as Performance Monitor) which I know anyone who has delved into any Windows performance issue will have some familiarity with, so I’m not going to cover what it is or how to use it; one word: Google.

The principal reason for this post is to add my own experiences of using perfmon to track down IO issues.

Starting perfmon

I’m assuming Windows 7 here, so adapt these instructions according to your flavour of Windows:

Go to Start > Run and type perfmon and hit Enter to launch it. This will start perfmon (Performance Monitor) under the Performance > Monitoring Tools > > Performance Monitor branch, which is exactly where we need to be.

This is where we create the template trace to monitor SQL Server’s IO performance as far as Windows is concerned. The mechanics of creating and managing perfmon templates and their output is well documented all over the wwitblo (world-wide-inter-twitter-blogosphere), so straight to the important counters.

Press Ctrl+Shift+N to bring up the Add Counters dialog. Scroll down to and expand the Physical Disk section. The counters we need are:

  • Current Disk Queue Length
  • Disk Read Bytes/sec
  • Disk Write Bytes/sec
  • Disk Bytes/sec
  • Disk Reads/sec
  • Disk Transfers/sec
  • Disk Writes/sec
  • Avg. Disk sec/Read
  • Avg. Disk sec/Write
  • Split IO/Sec

An explanation of what these counters reveal is available if you click on Show Description, but I’m going to add my own 2c, otherwise there ain’t much point in me writing this post.

Current disk queue length

This will tell you how many pending IO requests are queued up against that disk and a sure sign of IO pressure on local/directly attached disks. If your disks are on a SAN or on flash memory, then these numbers cannot be relied on directly, but, if you see high queue length numbers when IO heavy queries/jobs are being run and high numbers being reported by some of the other disk IO related counters then this should pretty much confirm an IO subsystem issue (i.e. don’t be too quick to dismiss this counter). Furthermore, as this shows the actual queue length at that point, and not an averaged out figure, I find it more reliable provided the sample rate is low enough.

Disk Read Bytes/sec & Disk Write Bytes/sec & Disk Bytes/sec

This will show the disk read throughput. In the event of an IO stall, this counter (and usually the corresponding Disk Write Bytes/sec counter) will show a big drop off in read/write throughput. Disk Bytes/sec is merely a sum of Read/Write Bytes a sec, but I think seeing them independently is more useful as it can show up specific disks at e.g. having fantastic read speeds but poor write speeds.

Disk Reads/sec, Disk Writes/sec & Disk Transfers/sec

These show what they say. I’ve seen some documentation on the wwitblo (from a major SQL monitoring software vendor, no less) stating that these counters can be ignored as (and I’m now paraphrasing here) SANs can be quite complicated and do lots of funky things under the covers, so don’t worry your pretty little head about this counter too much.

I’m inclined to disagree.

These figures are damned useful. Firstly, they’ll show you if drives are being over-utilised by reads or writes; if you’ve got a db that’s e.g. part reporting part OLTP and you suspect a particular filegroup is being heavily utilised by some reports, this should show up as heavy read traffic on the partition that hosts the filegroup those reports are hitting, which allows you to look at ways of tweaking performance by e.g. adding files or moving the filegroup to faster disks etc.
But these counters are even more useful than that, as I’ve used them to prove a SAN was no longer up to the job of running the db it was hosting, because the Total for Disk Transfers/sec was regularly hitting the ceiling of the max IOPs the SAN in question could support, and this was manifesting itself in all the classic ways: slow reindexes, slow DBCCs, IO stalls in the SQL error log, slow backups, replication latencies, as well as general db sluggishness as all transactions have to be hardened to the transaction log (don’t forget confirmation from wait stats and DMVs as well).

So, get confirmation of how many IOPs your IO subsystem can support and then use this counter to confirm an IO bottleneck related to the IOP throughput, or use it to proactively monitor your storage subsystem so you get early warning of pending bottleneck. And yes, within a major enterprise there will be multiple SQL Server instances (or other DBMS flavours) plugged into the same SAN, but again, and with more effort, the total IO of these instances can be added up to get a rough idea of how many IOPs the SAN is having to contend with.

Avg. Disk sec/Read & Avg. Disk sec/Write

This is the average amount of time it took to read or write the IOs occurring at that time and this depends greatly on how frequently the data is being sampled and the number of IOs occurring at that time, so a slight increase in the average can mask a significant slow down. Keep an eye on the peaks as well as the average which you want as low as possible. Ideally less than 10ms if your disks are traditional spinning disks.

Split IO/Sec

If your partition alignment is not correct, this will show additional IOs that were required  to complete a read or write. This was prevalent on Windows Server 2003 and earlier and can be a significant drag on throughput

With these counters selected we need to select the disks we want to track those counters against under Instances of selected object:


Figure 1: Physical Disk counters


If the database drives are identified by drive letter then that’s simple enough to do, but if mount points are in use then you’ll have to go by the drive number, and you’ll find the drive number (or LUN) via the Disk Administrator tool. Once you’ve identified the LUNs for the database drive(s)/mount point(s) of interest, select them and click OK.

Instead of incorporating the disk counters into an existing template I prefer to use  a separate perfmon template just to track IO performance for reasons I’ll explain further down.


The difference between Physical Disk and Logical Disk counters is basically that Logical Disk will show the aggregated view of a drive in the case of e.g. drive D: being spread over multiple disks (e.g. 4 (D:), 7 (D:), 9 (D:)) or a drive hosted on a disk split into more than one partition like C:(0) and D: (0). Physical Disk will show the raw underlying drive or mount point, and I always use that, although MS would recommend looking at Logical Drive first and then drilling down.

A separate template to monitor disk counters

The reason I prefer to put the disk counters in their own perfmon template is down to the sample rate. The default perfmon sample rate is 15 seconds and if you stick with that sample rate all the disk counters with the Avg abbreviation will be averaged over that 15 second sample interval.

Why is this a big deal?

Well, it depends (of course!).

Are you trying to track down an IO stall that lasts for 5-6 seconds or are you trying to track down slow IO that lasts for a couple of hours?

If you’re tracking down a 5-6 second IO stall you’re not going to have much luck sampling every 15 seconds as any significant issues lasting 5 or 6 seconds may get almost completely smoothed out when they are averaged over 15 seconds.

For a longer lasting issue you may be able to go for a longer sample interval, but from personal experience I’ve found short sample intervals to be much more revealing, especially when tracking SAN performance as it can be so spikey.

By way of example, a few years back I spent a lot of time tracking down an infrequently occurring IO stall on a SAN for a critical low-latency trading platform.

The IO stalls would typically last @ 5 seconds and sometimes just a couple of seconds and I could never trap them until I got the perfmon sample interval down to 1 second. That’s how low I had to go; and we did find them and prove it was a SAN issue, although we had to enforce the co-operation of the SAN team at the proverbial gunpoint. Good luck with that when the time comes.

The performance impact of this low sample interval was negligible, although the perfmon trace files, inevitably got rather large rather quickly and a close eye had to be kept on them.

Finally, if this is a SAN that is being tracked then the SAN admins are going to have their own ‘monitoring’ which may not show a problem.

From personal experience this will be because the SAN ‘monitoring’ will be at a very different sample rate – usually in the region of minutes, not seconds. Therefore, to accurately pick up IO issues you’ll have to persuade your SAN admins to use the same sample interval perfmon is using otherwise we’re comparing apples to pears; your perfmon date will show huge spikes or dips whilst the SAN ‘tool’ will show a wonderful flat line as it averages everything out over a massive sample interval.

I’ve actually had the head of the SAN storage team argue with me that the 30 minute sample interval they used to monitor their SAN would pick up a 2 second IO stall. I kid you not.

Viewing the data

Some quick notes on reviewing the data gathered in the perfmon file(s). I’m assuming a perfmon file has been loaded into perfmon ready for viewing.

  • Right click on the graph area in perfmon, go to Properties, click on the Graph tab and set a sensible upper limit for the Vertical scale. The default is 100, and if you’re looking at e.g Disk Read Bytes/sec you might want to up that into the thousands. Select the vertical grid tick box as well to make it easier to read off the numbers.
  • When viewing the data, press Ctrl+H to highlight the selected counter in the display to make that counters data stand out. Makes it much easier when stepping through multiple counters to isolate their data values.
  • When looking at a perfmon file which covers a wide time range it can be easier to narrow down that time range by right-clicking the graph area, selecting Properties, clicking on the source Tab and scrolling the Time Range scroll bar. Alternatively, whilst looking at the graph click on the start time you’re interested in and drag your mouse across until the time frame you’re interested in is highlighted, then right-click and select Zoom To.
  • Watch out for scaling. If you’re comparing multiple values, select them and then right-click on any of the selected counters and select Scale Selected Counters as it can be very easy to misread a counter if the scaling chosen for it either exaggerates or understates its true value.

Working through an example

Recently I spotted the following slow IO warning on a production server:

2013-11-04 03:07:00.130 spid5s       SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [XXXXX] in database [YYYYY] (n).  The OS file handle is 0x0000000000000D74.  The offset of the latest long I/O is: 0x00002449860000

I loaded up a perfmon template from that period:


Figure 2: IO latency capture in a perfmon trace

A 14 ms averages doesn’t look too bad does it? That 10.144 second max can be ignored (fortunately!) as it coincided with scheduled work on the SAN at that time. Zooming into the 03:07 time period in question paints a slightly different picture for the IO numbers:

Figure 3: Perfmon IO latencies


We can see that in the couple of minutes leading up to the slow IO warning in the SQL Server error log that both reads and write latencies are spiking up.

I’ve left Avg. Disk sec/Read highlighted and we can see the average IO has jumped up from 14ms to 50ms which is well into poor performance territory. I’ve also added a third counter (Disk Transfers /sec) so we can see how much IO was going on (just against that partition) at that time, and for that period there were 3,016 disk transfers per second with a peak read latency of 0.8 seconds and write latency of 1.5 seconds.

Bearing in mind the Avg. Disk sec/Read and Avg. Disk sec/Writes are just averages (and will therefore be averaging out the true latencies of those 3 thousand plus IOs a second). it’s clear the disk subsystem was struggling at that time and, as the error message indicated, at least one of them did not return for at least 15 seconds.
Now, I know that on this system that there just happened to be a reindex running which swamped the IO path at that time, and I know from experience these slow IO warnings coincide with just these sort of disk intensive jobs.

Warnings coinciding with the times these jobs are running are usually the first sign that the infrastructure of the system in question is starting to creak and making plans for pre-emptive action is advisable, or this latency will start creeping into the working day of the system and start affecting users, and guess who they’ll come running to? I’ll give you a clue, it won’t be to whoever is responsible for that disk subsystem.

Other factors

After many weeks of trying to track down intermittent IO stalls at a financial institution I used to work at the problem turned out to be a networking issue, but not the normal networking kind of issue.

The first clue was that the perfmon data and the SAN performance data (even when the SAN data was gathered at a sensible low interval rate) never matched. This indicated the problem was between the host and the switches/HBA cards, but these checked out OK as well.

This left only one alternative; this financial institution I worked for could afford stretched Windows clusters; e.g. Node 1 would be in the data centre in the basement, and Node 2 was @ 15 miles away at a DR centre. But Windows clusters need shared storage, don’t they? They do indeed, and the way to make Node 1 and Node 2 think they’re looking at the same SAN is to mirror the SAN over underground dark fibre links. Every write is replicated with microsecond latencies.

This added about 3-4ms to every write operation but that was more than acceptable, especially because of the DR protection it gave us (without resorting to database mirroring/log-shipping/SQL Server replication etc etc), but I digress, and, as you’ve probably guessed, this is where we tracked the issue down to; most of the unresolved IO stalls were down to glitches in these links which the networking team were able to confirm when we gave them precise times and dates from the perfmon data.

Although these glitches happen from time to time – usually when workmen dig up roads and punch through the cables – the networking team never thought to mention it outside their usual circle because, up until that point, none of the systems at that institution were that sensitive to

IO latencies.

As you might expect, this has now changed, along with their notification procedures.

Incidentally, when the workmen cut through the cable everything automatically switched to a backup circuit within milliseconds but when that bubbles up to the Windows cluster and those lost IOs have to be re-sent, those milliseconds turn into 7-8 seconds.

If you’re looking after a system which has this kind of technology (the institution in question was using EMC GeoSpan clustering) this might be something worth checking out, if you haven’t already done so.


Don’t look at the perfmon numbers in isolation.

If you’re seeing latency spikes, check the number of disk transfers taking place at the same time, in case those IOPs are higher than your disk subsystem can sustain. At the same time, check the wait stats and DMVs as well for corroborating data any issues with those drives.

The Useful Links section below will help with getting this extra information.

How to examine IO subsystem latencies from within SQL Server

Tracking SQL Server IO Performance

IO Wait Stats

SQL Server and Disk IO

File fragmentation? What file fragmentation?

A few years back a SAN admin told me not to worry about physical file fragmentation because “The SAN takes care of it” and I took them at their word and, probably along with a lot of fellow DBAs who have their back-ends (in a manner of speaking) on  a SAN, allowed myself to wallow for too long in this cosey myth.

It ain’t so

Recently, I’ve found that not all SANs “take care of it” as well as you might think.

In fact, the documentation for this particular SAN (which shall remain nameless in order to protect the guilty) suggested that an OS level defragmentation should be carried out regularly (?!).

I’ve hunted around the world-wide-inter-twitter-blogosphere and there appears to be no clear consensus and even some contradictory posts on the matter.

I imagine this is because some SAN vendors handle fragmentation better than others and that, to an extent, files are by definition fragmented as they are striped across multiple disks in a SAN. Even so, fragmentation within these stripes must affect performance on traditional spinning disks.

I’m just going to recount my recent experiences and let everyone make up their own mind by showing how easy it is to ‘see’ file fragmentation occurring on a Windows Server 2008R2 system running SQL Server 2008R2 (although file fragmentation itself is a universal issue and not really governed by OS or SQL Server version), and what I noticed when I had dealt with the file fragmentation that I uncovered on an established system that had only been running for about a year-and-a-half at the time (but long before it came under my wing, I hasten to add!).

Tool up

First of all, we’ll need a tool which actually shows file fragmentation. If you use Windows Explorer to tell you how fragmented your database files are, expect it to tell you everything’s good, there’s no file fragmentation here, move along now.
That’s because (by default) Windows does not consider any file over 64 MB to be fragmented, and if you’re using the GUI interface accessed via Windows Explorer to check for it you’ll be none the wiser – only running defrag via the command line at the command prompt actually states

“Note: File fragments larger than 64MB are not included in the fragmentation statistics.” as part of the output of a file fragmentation analysis invoked via the /A parameter.

I have to admit, this did fool me for a while as I was investigating one particular database file that was showing significant IO latencies compared to other files in the same database located on the same partition, even after going through all the usual steps of looking at partition alignment, sector sizes, index defragging, high contention tables on that file etc etc.

Resolving file fragmentation

As soon as I uncovered the article outlining the changes to Windows disk defragmentation and a bit more digging around, I went to that mainstay of all Windows troubleshooting sites, SysInternals, and downloaded contig.exe. This great little tool will show file fragmentation of individual files or directories and allow them to be defragmented.

I pointed it at the 180GB problem file and I used the -v parameter to list all the fragments, expecting maybe a couple of dozen fragments. The output can be seen in fig.1 below:

it actually filled the screen and scrolled and scrolled and eventually reported a shocking 4,818 file fragments (the picture has been altered to show the relevant output and obfuscate the name of the file concerned).

The output lists the starting offset and and the number of allocation units it contains (multiply this by the allocation unit size that the partition concerned was formatted with to get the size of that fragment). Some of the fragments were only 64K in size. I’ve no idea how there could be such tiny fragments (even allowing for default autogrow settings), other than the possibility that at some point there was little free space on the partition concerned and the remaining free space was highly fragmented, so space was allocated wherever it could be found until the partition was resized (although this is purely speculation on my part).

Incidentally, contig.exe can be used to analyse free space fragmentation and report your biggest free fragment via the -f parameter.

Not only that, but it can also defragment files, can defragment one file at a time if required, and it can do that all that whilst the database is online, although I would personally never risk trying that on a live production system.

I had to deal with this particular issue in a slightly different way:

  • first, I added additional files to the filegroup
  • then I shrunk the original highly fragmented file which would redistribute the data onto the unfragmented files added to the filegroup and capped the original file to a low maximum size (4GB). I’ll explain why later
  • a reindex was then carried out, as shrinking creates internal index fragmentation

Ideally, I would have just used contig.exe to clear the fragmentation, but there were other factors involved here which forced my hand. Firstly, the file was on a partition that had been created with an incorrect allocation unit size, and this proved a convenient way of both resolving the fragmentation and moving the bulk of the filegroup to a partition created with a 64K allocation unit size, whilst keeping the database online. As the file concerned was the Primary mdf, I could not empty and and drop the file, which is what I would ordinarily have done under these circumstances, and a detach/attach operation to move it to the correctly formatted partition would have involved downtime. Secondly, (and this may affect others as well) contig.exe does not defragment files on mount points and, unfortunately, this database was utilising mount points.

There are ways around this, like using the subst command, but the system in question was clustered as well, so I wasn’t prepared to mess around with it too much.

What’s the big deal?

Backup performance went up 10% immediately after resolving the file fragmentation  and read latency went from 89ms to 13ms, or an improvement of @ 85%. Reindexing performance also improved because certain indexes were not internally fragmenting as frequently as they were before the physical defragmentation.
This kind of of improvement is usually achieved by throwing a lot of money and tin at a problem, but it cost us nothing, so I think it’s a big deal.

See for yourself

It’s not difficult to simulate how a database can become physically fragmented:

  1. Create a small, single data file database with a data file of at least 64MB (on a test system or even your desktop) on a partition that has been in use for a while (i.e. not a pristine empty partition).
  2. Run contig against the mdf file and you’ll see it as one contiguous file (unless the free space on the system concerned has become massively fragmented).
  3. Mimic several autogrows by manually altering the size of the data file and growing it by e.g. 1MB at a time.
  4. Re-run the contig.exe check against that data file. Note the increase in fragments (there probably won’t be a 1:1 ratio as some of the growths will be appended  to existing fragments).
  5. (Optionally) run the Windows fragmentation report which will report no fragmentation as, by default, it’s ignoring files bigger than 64MB.

OK, so this is a somewhat contrived simulation, but this will be what is happening to any databases that have not been appropriately pre-sized and have been autogrowing over time, and after countless autogrows (and probably a few shrinks thrown in for good measure) will be potentially suffering from performance impacting file fragmentation on systems using traditional spinning disks.

On media where the seek time is equal to all parts of the disk, like SSDs, the hit is probably not noticeable, but on spinning media, with enough file fragmentation the impact can be noticed but usually only with hindsight, after the fragmentation has been resolved.

What’s more, and here’s the kicker, if you’re not using third party file fragmentation reporting tools, you don’t know it’s there as the default Windows file fragmentation report will not return stats for files larger than 64MB.

Will it work for me?

The first thing I would do is check out the IO performance of the database(s) in question.

From personal experience, some of the things to look out for include files performing worse than others on the same partition or SAN back-end even though they are not responsible for most of the IO against that database and particular indexes regularly showing high internal fragmentation.

Check the output of contig.exe when aimed at those files. If you see hundreds of file fragments for an individual file on a partition utilising spinning disk technology, it’s definitely worth exploring.
Use the resolution technique most suitable to the SLAs and the situation; if you have the luxury of downtime, the files can probably be defragmented more quickly as there will be less IO contention. If contig.exe cannot be used to defragment the file you can try the approach I was forced to use.

Whatever you do, get a baseline of the latencies before you start so you can compare this to the corresponding figures after the file fragmentation has been removed. Don’t forget, things like reindex duration, DBCC CHECKDB duration and backup durations before and after resolving the file fragmentation can also provide a tangible indication of improvements (or the lack of).

I’ve finally got around to this, the third part of my SQL Server and Disk IO series of posts:

The sys.dm_io_virtual_file_stats DMV and fn_virtualfilestats function are great tools for extracting granular information about SQL Server IO performance, right down to the individual file in each database.

It gives details of the number of physical reads/writes, amount of data read/written and how much time was spent reading/writing that data. Traditionally, this has been important because IO access is the slowest part of fulfilling any database queries as RAM and CPU access takes nanoseconds or microseconds, whereas typically access to a spinning disk takes several milliseconds. In other words, disk access is thousands of times slower than CPU or RAM access, so improving disk access performance often brings about the biggest bang for the buck, in a manner of speaking.

Lately, with the uptake of SSDs increasing this is starting to become less of an issue, but if like me, some of the systems you administer are still using the ancient technology of spinning disks then getting familiar with this view will help you confirm any IO issues. Incidentally, the forerunner of this DMV was fn_virtualfilestats and goes back to (at least) SQL Server 2000 and still works in the latest SQL Server version (SQL Server 2012, as of writing this post) so this post concentrates on the fn_virtualfilestats function.

I won’t bore you with details of how to query the function as it is already well documented, and my own query which utilises it is below:

;WITH sum_table
SELECT [dbid]
, [fileid]
, [numberreads]
, [iostallreadms]
, [numberwrites]
, [iostallwritems]
, [bytesread]
, [byteswritten]
, [iostallms]
, [numberreads] + [numberwrites] AS [total_io] FROM :: fn_virtualfilestats(db_id(), null) vf
inner join sys.database_files df
ON vf.[fileid] = df.[file_id]
WHERE df.[type] <> 1
SELECT db_name([dbid]) AS db
, file_name([fileid]) AS [file]
, CASE [numberreads]
ELSE CAST(ROUND([iostallreadms]/([numberreads] * 1.0), 2) AS NUMERIC(8,2)) END AS [read_latency_ms]
, CASE [numberwrites]
ELSE CAST (ROUND([iostallwritems]/([numberwrites] * 1.0), 2) AS NUMERIC(8,2)) END AS [write_latency_ms]
, [numberreads] AS [#reads]
, [numberwrites] AS [#writes]
, [total_io]
, CAST (ROUND((([bytesread]/1073741824.0)/(SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver_start_time, GETUTCDATE()) / 60.0) from sys.dm_os_sys_info)),3) AS NUMERIC (8,2)) AS [gb_read_per_hr]
, CAST (ROUND((([byteswritten]/1073741824.0)/(SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver_start_time, GETUTCDATE()) / 60.0) from sys.dm_os_sys_info)),3) AS NUMERIC (8,2)) AS [gb_written_per_hr]
, [iostallms] AS [#stalled io]
, CAST (ROUND((([numberreads] * 1.0 ) /[total_io]) * 100,3) AS NUMERIC (6,3)) AS [file_read_pct]
, CAST (ROUND((([numberwrites] * 1.0 ) /[total_io]) * 100,3) AS NUMERIC (6,3)) AS [file_write_pct]
, CAST (ROUND((([numberreads] * 1.0) /(SELECT sum(total_io) FROM sum_table)) * 100,3) AS NUMERIC (6,3)) AS [tot_read_pct_ratio]
, CAST (ROUND(((numberwrites * 1.0) /(SELECT sum(total_io) FROM sum_table)) * 100,3) AS NUMERIC (6,3)) AS [tot_write_pct_ratio]
, CAST (ROUND((([total_io] * 1.0) /(SELECT sum(total_io) FROM sum_table)) * 100,3) AS NUMERIC (6,3)) AS [tot_io_pct_ratio]
, CAST (ROUND(([bytesread]/1073741824.0),2) AS NUMERIC (12,2)) AS [tot_gb read]
, CAST (ROUND(([byteswritten]/1073741824.0),2) AS NUMERIC (12,2)) AS [tot_gb_written]
, (SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver_start_time, GETUTCDATE()) / 1440.0) FROM sys.dm_os_sys_info) AS [uptime_days]
FROM sum_table
GROUP BY [dbid], [fileid], [total_io], [numberreads], [iostallreadms], [numberwrites], [iostallwritems], [bytesread], [byteswritten], [iostallms]
ORDER BY [total_io] DESC

I’ve chosen to use the fn_virtualfilestats function instead of the sys.dm_io_virtual_file_stats DMV as it is present all the way back to SQL Server 2000. [Edit: I’m working on a SQL Server 2000 friendly version of this query at the moment, once I’ve dusted of a copy of SQL Server 2000 and Windows Server 2003 (this is one of the few moments I’ve regretted getting rid of all trace of SQL Server 2000!)]

The main difference between this query and others I’ve found on the blogosphere is that it excludes the log file (because on an OLTP system it will get hammered and generate a lot of IO anyway), and once I’ve excluded the log file being an issue (usually from wait stats and System Monitor (perfmon) data) I’m usually more interested in how busy my data files are, and if there are any hotspots and it does this by aggregating the IO data and returning the percentage of IO that not only each file does, but the actual ratio of IO that each file does relative to the total IO against all data files within that database. Mike Lewis has tweaked the query so it scans all databases (see the first comment on this post below)

Automating the collection of this data provides a great way of baselining performance and spotting trends (hint hint). The uptime_days column shows how many days the server has been running so you can work out the average daily read/write IO. [Edit: as Richard has pointed out in the comments, this is based on a DMV introduced in SQL Server 2008.)

If you want information regarding log file performance, simply comment out the where df.[type] <> 1 line

It’s been tested in battle and I’ve found it invaluable for highlighting high contention filegroups and IO subsystem issues.

A word of caution, before looking at the output and pointing the finger at the disks/SAN it’s always worth digging behind the stats. If high latencies/reads are found, the issue can boil down to database design which can be uncovered by investigating the reason for the high number of reads/writes on the relevant filegroup(s), e.g. a lack of indexing on a highly read table will result in a lot of table scans which can account for a substantial amount of IO on the file/filegroup holding that table and adding a well chosen index or two can eliminate the scans, reducing the IO and (hopefully) the latency. If the IO goes down, but the latency does not, you have more ammunition for looking more closely at the IO subsystem provided that factors like, for example, anti-virus filter drivers etc have been investigated.

Recently, the output highlighted the slow performance of one particular file in an 18 file database which showed latencies of 89ms. After some digging around I found that this particular file was the primary file and (taking an educated guess) probably had default autogrowth settings when it was originally created, and has probably gone through some shrinkfile operations in a previous life, because this one 180GB file was spread over 4,818 physical fragments on the SAN.
Once this was addressed the latency went down to 25ms. Whilst 25ms is not brilliant in itself it’s still a 70% improvement on its previous performance (and I know there’s room for further improvement as the partition it’s on has an sub-optimal allocation unit size, for some strange reason). IMHO physical file fragmentation, even on SANs is a big issue and I’ll be blogging about that particular issue in an upcoming post.

As I was saying, it pays to do a little digging around behind the numbers before jumping to the most obvious conclusion.

I know there are dozens of variations of queries out there that interrogate fn_virtualfilestats or sys.dm_io_virtual_file_stats but the reason I came up with yet another variation was that, for me, the relativity was difficult to ascertain; on a system that’s been running for while the IO numbers can be vast and on a system with a lot of data files it can be difficult to spot the troublesome files/filegroups, hence the addition of percentages and percentage ratios. Another inspiration was the Performance Dashboard Reports (which don’t work on SQL Server 2008 R2 out of the box but I hear that they can be ported back to SQL Server 2008R2 from SQL Server 2012, or you can try this fix) which do offer percentages for IO, but also includes the IO on the log files which can make it difficult to get a clear picture of IO that was exclusively related to data files.

The main reason for this post was to highlight the presence and significance of this DMV and also to point out that it should not be used in isolation. Other methods for gathering performance related data, like perfmon and wait stats data are invaluable and all this information combined, together with corroborating evidence like e.g. slow IO warnings in the SQL Server error log will give a more rounded picture of IO performance.

Do you have tempdb on Fusion-io (or equivalent) technology? Is your reindexing job left on default settings (re-orgs at 5-15% and full rebuilds at @ 30%)? If so, then you might be interested in the findings in this post.

The benefits of flash memory as a replacement for spinning drives in order to boost SQL Server performance is well documented. However, I’m not sure if one particular benefit has been fully exploited. This post covers how we reduced the average duration of our overnight reindex job by 80% with the help of just one Fusion-io card (per node).

A bit of background: in order to give a creaking and very tempdb heavy SQL Server 2008 R2 system a bit of a lift we shifted tempdb onto Fusion-io cards. The system consists of a  2-node cluster running on a SAN. We did indeed see throughput increase, as well as an improvement in concurrency (as we switched on snapshot isolation as well) but some 2 months after the cards were installed another benefit has come to light which has turned perceived wisdom of reorganising versus rebuilding indexes on its head (disclaimer: for our system running on our hardware, at any rate).

I’ve been DBA-ing for so long now that the good old 5%/30% general advice for reorganising as opposed to rebuilding an index as documented in BOL is so deeply ingrained that I’ve hardly given it a second thought and just followed it automatically, only tweaking it when performance analysis showed an issue with these thresholds. For more background info on these thresholds I strongly advise reading Paul Randal’s post on where these guidelines came from.

With tempdb on a much faster IO subsystem than the application database the SORT_IN_TEMPDB option for the ALTER INDEX command suddenly comes into play. After installing the Fusions I tweaked our reindex job to turn the SORT_IN_TEMPDB  option on, made a mental note to follow this up in a few days time, and promptly forgot about it until several weeks later when I wanted to track down indexes that were getting highly fragmented (we use Ola Hallengren’s excellent maintenance script which logs this kind of information (and more) in a table) and noticed some  indexes had huge variations in the time taken to reindex them, with a full rebuild taking a fraction of the time a reorg did, and that’s when the penny dropped.

The Fusion-io boost

I promptly turned off reorgs so now it rebuilds at 10% with every index typically getting fully rebuilt in under 5 minutes (multiple indexes are between 10-14 GB in size). Needless to say, this has had a massive impact on the total reindex time. The reindex job on this system used to take anywhere between 4 to 9 hours which was a real problem as this would often encroach well into the business day.
Since we switched off the reorgs, the reindex has never taken more then 2 hours (and the longest run at 114 minutes included 45 minutes of blocking). The graph below shows the performance improvement very er…graphically:

Purdy, ain’t it? As we can see, February 16th was the first night the reindex job used the rebuild option in place of reorgs, and the drop off in the run time could not be more dramatic. For the two weeks from 1st February – 14th February (inclusive) the average duration was 361 minutes, or 6 hours. After the SORT_IN_TEMPDB tweak on the 16th, the average (from the 16th up until today (March 11th) has been just 78 minutes.

Incidentally, the Fusion-io cards in question are 785GB ioDrive2 Mono MLC PCIe Solid State Storage Cards, and yes, technically MS don’t support this tempdb configuration on a pre SQL 2012 cluster, but lots of people are doing it.

The benefits

For us, modifying the reindex job to drop the reorgs in favour of rebuilds on this particular system was a no-brainer. The reindexing finishing in a quarter of the time taken previously on a consistent basis every night has eliminated the uncertainty and inconvenience caused when the original approach caused the index to collide with subsequent maintenance jobs or encroach into the working day of the system.

Furthermore, the database was in simple recovery so we didn’t have to worry about transaction log backups (although a reorg can have a greater impact on transaction log backups).

This is either a benefit that’s so obvious that no-one has bothered blogging about it, or it’s something that’s gone under the radar thus far (I’m obviously hoping it’s the latter). I’ve seen posts hint at it by mentioning index creation is faster etc, and I’ve even read a MS whitepaper analysing tempdb on Fusion-io on a SQL 2008R2 system (which did a reindexing test with the SORT_IN_TEMPB option on, but left the reindexing thresholds on defaults) but no-one, as far as I can tell, has gone the whole hog with regard to adapting their approach to reindexing and just rebuilding indexes in favour of the reorg option to fully exploit their investment in Fusion-io cards (or they just kept quiet about it, whereas I couldn’t help blabbing about it!); I can think of a number of scenarios and previous systems where the improvement in reindex job performance alone would justify this investment.

Are reorgs redundant?

If you have your tempdb on Fusion-io or similar technology, with the application database(s) on a slower IO subsystem with the reindexing strategy left on original settings its an option well worth exploring.

The things to look out for are the bandwidth from host to SAN as on this system we do briefly get close to maxing it out during the reindex now as it reads in the index pages and writes out the rebuilt index. Personally, I’d rather fully exploit our infrastructure than have it burbling along at the fraction of its capacity, which is what it used to be like when most of the indexes were being reorganised.

Another consideration is the latency and throughput difference between the storage subsystem the application dbs are stored on and the Fusion-io cards; against spinning disk SAN based systems there’s no competition, the Fusion-ios will outperfrom the SAN in every category, but if you have e.g. SSDs already in the SAN the difference may not be that substantial, so the switch may not bring about such a dramatic boost. On this particular system, in our environment, there was a huge disparity between SAN performance and the Fusions which has made the switch so beneficial for us.

If alternative HA strategies like log-shopping have been implemented the impact will have to be measured as the costs may outweigh the benefits (thanks Paul!).

A final thing to bear in mind is that if you are fortunate enough to have some older/smaller Fusions lying around (perhaps you’ve had to swap them out as the application dbs have got bigger) it’s a great way of recycling these cards on other systems to give them a new lease of life.

So are reorgs redundant? If you have access to this kind of technology, it’s easy enough to prove on your system.

In summary

“It depends”, and “your mileage may vary” have become well worn cliches, but they are so true. Every system is different and there are a number of factors to consider, as (I hope) the post outlines, but the potential benefits warrant exploring this, if you have not already done so.

The single biggest clue to the source of a performance problem will be from something referred to as the wait stats.

What are waitstats?

In short wait stats are statistics gathered on what resources SQL Server is waiting on internally whilst executing queries.

Why are they useful?

They can often provide very quick and objective evidence of a performance bottleneck by showing cumulative wait times that SQL Server spent on getting access to the CPU, memory or disk access as well as numerous other internal resources.

This post is only going to concentrate on IO related waits, as it continues on from my SQL Server Disk IO posts.

Gathering this information on at least a daily basis can provide invaluable data on a system’s performance over time and can reveal sudden changes in behaviour caused by e.g. a recent change to an application. Speaking from personal experience, using historically captured wait stats and contrasting them with sudden changes to the topmost wait types is an invaluable way of proving that recent application changes were the trigger for a change in behaviour.

How do I gather these wait stats?

That’s fairly straightforward – it’s a T-SQL statement. First, the SQL Server ‘version independent’ method:


The sample output from my local SQL Server 2005 Express installation is below:

Wait Type                        Requests      Wait Time     Signal Wait Time
-------------------------------- ------------- ------------- ----------------
MISCELLANEOUS                    0             0             0
PAGEIOLATCH_NL                   0             0             0
PAGEIOLATCH_KP                   0             0             0
PAGEIOLATCH_SH                   178           327           0
PAGEIOLATCH_UP                   16            436           0
PAGEIOLATCH_EX                   14            0             0
PAGEIOLATCH_DT                   0             0             0
IO_COMPLETION                    173           1279          0
ASYNC_IO_COMPLETION              1             1294          0
CHKPT                            1             202           0
BACKUPIO                         0             0             0
DISKIO_SUSPEND                   0             0             0
IMPPROV_IOWAIT                   0             0             0
WRITELOG                         29            639           15

From SQL Server 2005, the wait stats data has been split into two separate dynamic management views (DMVs) sys.dm_latch_wait_stats and sys.dm_os_wait_stats. The latch wait stats are counters for internal waits within the SQL Server database engine and the OS wait stats accumulate data on waits for external resources such as CPU, disk and memory.

As we’re concentrating on IO waits, its the sys.dm_os_wait_stats we’re interested in.

The following query will grab the IO wait types output:

select * from sys.dm_os_wait_stats
wait_type like 'PAGEIO%' or
wait_type like '%IO_COMPLETION' or
wait_type like 'DISKIO%' or
wait_type like 'BACKUPIO%'or
wait_type like 'WRITE%'
order by wait_time_ms
wait_type               waiting_tasks_count  wait_time_ms  max_wait_time_ms  signal_wait_time_ms
--------------------- -------------------- ------------- ----------------- --------------------
PAGEIOLATCH_SH                   20892941     110133998          17895              332331
PAGEIOLATCH_EX                     392948       7270457           2395               10353
IO_COMPLETION                      719058       1061552            916                6929
PAGEIOLATCH_UP                       4332         81307           3635                1243
BACKUPIO                             3998         16393            240                  18
ASYNC_IO_COMPLETION                   133         14771            805                   0
PAGEIOLATCH_DT                          0             0              0                   0
PAGEIOLATCH_NL                          0             0              0                   0
PAGEIOLATCH_KP                          0             0              0                   0
DISKIO_SUSPEND                          0   

Incidentally, these queries are only meant to show how to grab the data, if you hunt around on the web you’ll find some elegant queries for processing this data.

This KB from MS discusses these wait types (and more).

SQL Server 2008 R2 has even more granular wait-types to isolate IO related waits, but the above query has been limited to allow it to work from SQL Server 2005.

wait_time_ms is the total accumulated wait time (since that SQL Server instance was last restarted) spent on accessing that resource.

max_wait_time_ms is the peak wait time for one of these requests.

signal_wait_time_ms signal wait time is how long was spent waiting to get access to the runnable queue for that resource, and can indicate CPU pressure if this is high.

If these IO related wait types are high in the list of overall waits (see the Waits and Queues article from Microsoft lower down), then you have an IO issue.

This is either down to the performance of the IO subsytem, or the physical design of the database. The design issues could be due to the file/filegroup layout of the database concerned, but also related to the layout of tempdb (in a previous role it never ceased to amaze me how much thought went into the layout of an application database at various customer sites, whilst tempdb was left e.g. at it’s 8MB starting size with (just one) data file and log file sharing the same drives and/or sharing application database drives).

Take a close look at the particular wait type concerned. If the WRITELOG wait type is dominant, you know it’s the log drive if this is backed up by sys.dm_io_virtual_file_stats. If it’s BACKUPIO, we can infer that it’s the backup drive(s).

A word of caution, don’t talk the numbers at face value.  For example, if the PAGEIOLATCH_SH wait type is high in the list, we know it’s the data drive(s), but that does not necessarily mean there is an issue with the disk subsystem, merely that a lot of IO requests are being issued and this could point to database design issues – if some of the tables had better indexes the query optimizer can come up with a more efficient query plan and retrieve the same data with far fewer IOs. Needless to say (almost), but corroborating evidence from the DMVs or perfmon should always be gathered whenever possible.

These are just a few examples, if you gather this data regularly a clear picture can be built up of what performance issues are affecting your systems, and by picking upon trends you can get advance warning of impending issues.

For further information on wait types and what all the output columns mean, there is an excellent Waits and Queues document on the Microsoft website which goes through them and is highly recommended reading.

Thomas Kejser also has an excellent bottleneck analysis script.

The most telling output can be from identifying IO stalls.

SQL Server 2000 had fn_virtual_filestats and this has now been incorporated into sys.dm_io_virtual_file_stats which is coming up in the next post, Tracking SQL Server IO Performance in this SQL Server and Disk IO series.

Many aspects of a database system’s configuration will affect the performance of queries running on that DBMS. However, there is one single component that has the greatest impact on DBMS performance, and that, of course, is disk (IO) access.

It’s a broad topic and all aspects of it are pretty well covered in various blogs and technical articles, but there’s not much out there that gathers it all together. The intention of this series of posts is to try consolidate some of this information and focus in on some of the functions and procedures that can be used to monitor, troubleshoot and configure SQL Server disk configuration and most, importantly, disk performance. It won’t be exhaustive and cover every conceivable option, but will be based on experience gained whilst working at Sybase and Microsoft and sprinkling of ‘real world’ database administration experience.

I’ve split the topic up into smaller sub-topics in order to make it more manageable:

1. Partition offsets
2. Wait Stats
3. Tracking SQL Server IO performance
4. Physical Fragmentation
5. System Monitor (Perfmon)
6. Tempdb

Kinda ambitious, I know, but I don’t blog often and it gives me something to aim for!

There are plenty of excellent articles which cover IO in some depth, so I’m going to try my best to avoid regurgitation and try keep things DBA-centric.  (I highly recommend reading the links at the end of the relevant posts to anyone wanting a deeper understanding of IO issues, from configuration to monitoring and troubleshooting).

Until solid state disk prices come down and become ubiquitous, disk access is still a painfully slow (relatively speaking) mechanical process. An IO (read or write) request is issued by SQL Server which is passed onto the OS. The OS then passes this onto your disk controller or HBA and eventually this data request translates into a mechanical arm sitting above a spinning disk which will read or write the data that SQL Server requested.

What this means in performance terms is that disk access will be measured in milliseconds.

Access times to the other major components of a DBMS that directly affect performance such as CPU and RAM can be measured in microseconds or nanoseconds.

As you can see, it’s an order of magnitude difference.

This single factor alone highlights why reviewing a SQL Server database installation’s disk configuration can be such a valuable and rewarding exercise.  In fact, I begin many of my SQL Server troubleshooting exercises by eliminating IO performance first.

Prior to SQL Server 2000 SP4 it was difficult to identify IO performance issues without a lot of digging around. The biggest clue SQL Server would give would be the occasional mysterious 17883 (or even 17884) errors, and these were not always guaranteed to be IO related, although IO was usually heavily implicated. SP4 for SQL Server 2000 introduced those ‘slow IO’ messages which most DBAs will be familiar with and was a step in the right direction.

SQL Server 2005 introduced various DMVs which would give much greater insight into IO performance, and SQL Server 2008 built on this with the Performance DataWarehouse. These posts are going to cover methods that can be applied from SQL Server 2000 onwards, so will not delve deeply into features and properties only available in later versions, but focus instead on functions and features available across the last three versions of SQL Server as I know there’s still a lot of SQL Server 2000 installations out there.

I’m starting off this series of posts with a discussion about partition offsets. Diving off at the deep end a bit perhaps, but if your disk setup is not based on firm foundations, you’re setting yourself up for one of the most common and easily-avoidable performance issues right from the off, and a IO performance hit of 20-30%.

The partition offset issue is relevant to any disk partitions created prior to Windows Server 2008. What this means is that whilst your OS may be Windows Server 2008, if the disk you are using was migrated from an older OS the chances are that it may still be exposed to the problem.

Also, certain OEM built machines may be exposed to this problem even on Windows Server 2008 as they often include hidden partitions used for e.g. server recovery date.

Partition offset

Without going into too great a level of detail on how a disk is structured both physically and logically, the partition offset is basically where Windows will start writing its first bytes to disk once the partition concerned is first used.

When a disk partition is created in Windows Server 2003 and before, the first 63 sectors are reserved for the MBR as well as disk vendor proprietary information. That means Windows Server 2003 and before (hereon referred to as just Windows) will write its first bytes of data on the 64th sector.

What’s the big deal?

These 63 sectors are invariably 512 bytes (worth checking with your SAN team/disk vendor however as some SANs may use larger sector size but present the sectors as logical 512 bytes sectors to Windows). Therefore the 64th sector will begin at a 31.5KB sector boundary. What this means is that if this is a SQL Server disk, SQL Server will write its first data page on a 31.5K sector boundary, and every nth page will be written across these boundaries.

The big deal therefore is that each of these nth IO operations will result in an additional IO to read or write the data that was on the page that crossed that boundary (the actual value of n will depend on cluster (sector) size and stripe unit size).

What this means in real (performance) terms is that a properly aligned system will yield a 20-30% performance improvement in IO latency (and therefore query execution time). All this for free, without having to analyse SQL Server configuration, indexing strategy, query plans etc etc.

How do I check my offset?

Checking the offset can be done via a wmi console (wmic) command.

wmic /node: <name of server to check> partition get BlockSize, StartingOffset, Name, Index

The above command will show output similar to the following (if the wmic command is being executed locally, there’s no need for the /node parameter):

C:WINDOWSsystem32>wmic /node:myDbServer partition get BlockSize, StartingOffset, Name, Index

BlockSize Index Name StartingOffset
512 0 Disk #0, Partition #0 32256
512 0 Disk #1, Partition #0 32768
512 0 Disk #2, Partition #0 32768
512 0 Disk #3, Partition #0 32256
512 0 Disk #4, Partition #0 32256
512 0 Disk #5, Partition #0 1048576
512 0 Disk #6, Partition #0 1048576

There will be one line of output for each disk and each disk will be identified by its disk number.

Use the Disk Administrator to identify how these drive letters map to your drive letters.

Surprisingly enough, the StartingOffset column is the key column, and will show the starting offset in bytes.

Disks 1 and 2 have 64 sector offset which lines up with e.g. legacy controllers on local disks.

Drives on default values will show a value of 32256, such as disks 0, 3 and 4 in the output above. Divide that by the 512 byte sectors and you’ll get that dreaded value of 63.

Disks 5 and 6 have a 1MB offset which should line up with most modern disk, SAN cache and cluster size scenarios.

What the above output shows is that, depending on the age of the system, there can be a myriad of offsets used which were correct according to how the sytem was configured at the time. Disks 1 and 2 were altered from the default, so at some point in this system’s history someone looked into the issue of partition misalignment and tweaked it for this drive.

So why are disks 3 and 4 on misaligned defaults? Well, have you ever had disks recut to (usually) increase storage. That usually involves blowing away partitions and recreating them. If partition offsetting was not carried out when the partition was created for whatever reason, this is what you can end up with. The newest disks 5 and 6 were created with the latest information to hand, hence the higher offset, but as this slightly contrived example shows (it’s based on real world systems) it pays to be vigilant and review systems that were previously correctly set up.

How do I fix it?

On paper, the fix is very simple; adjust the partition offset. The diskpart command line utility is used to reset the offset.
Move the offset along from the default misaligned 31.5K offset to an offset that works with your cluster size and stripe unit size; 64K, or a multiple of 64K usually works on most systems, but where a SAN is involved, it gets more complicated as stripe unit sizes will vary from vendor to vendor and rig to rig, so you’ll have to get your SAN engineer/vendor involved.

Windows Server 2008 defaults to 1024K offset.

Ideally, the offset needs to start at a point where the IO requests that SQL Server issues (which will be a multiple of 8K, depending on the IO request type being fulfilled), the cluster size, and any caching involved at disk controller/SAN level all line up, thus avoiding additional IO requests.

The downside, of course, is the partition will have to be recreated, so downtime will be involved on standalone servers that have no HA solution implemented.

Realigning the partition offset involves blowing away the disk partition in question, so if you’re not familiar with diskpart, leave it to your relevant storage/Windows team to reset the partition. If you really want to do this yourself, you’ll find step-by-step instructions in the SQLCAT team’s Disk Partition Alignment Best Practices article linked to at the bottom of this post.

The resulting 20-30% improvement in latency times have been verified by the SQLCAT team who have published them in that paper on partition alignment mentioned above, and is highly recommended reading for anyone with an interest in this subject.


If you only read one further article on partition offsets, make it this one!

Disk Partition Alignment Best Practices for SQL Server

SQL Server 2000 I/O Basics

Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000

SQL Server and CPU core usage

How many CPU cores can SQL Server support should be a simple question, you might think. However, judging by how often this question is asked on the SQL Server forums its clear that there is a lot of confusion about exactly how many CPUs a particular edition of SQL Server can use.

The problem is due mainly to incorrect assumptions about Microsoft’s licencing policy, combined with the prevalence of multi-core CPUs. This has led to some incorrect responses to postings about CPU licencing on the SQL Server forums, so I’ve decided it’s worth blogging about to try and clarify the matter.

This post covers SQL Server 2005 – SQL Server 2008R2 editions with a brief update on the significant licencing policy changes in SQL Server 2012,

Up to and including SQL Server 2008R2, Microsoft’s multicore licencing policy was based on the number of sockets on the motherboard, and not the number of cores, but this has changed in SQL Server 2012 and will fall in line with the rest of the industry (a formula will be applied based on the number of cores).

What this means is that for SQL Server 2008R2 and before, for licencing purposes, a dual-core CPU equates to 1 CPU.  For SQL Server 2012, it equates to 2 CPUs.
A quad-core CPU therefore equates to 1 CPU or 4 depending on the SQL Server version etc. etc.
Therefore, If you’re thinking about upgrading from SQL Server 200n to SQL Server 2012 I recommend reading Glenn Berry’s hardware post which has a discussion on the issues to consider, as it could cost you dear if the system(s) concerned have more than 4 cores.

Incidentally, the free SQL Express edition as of SQL Server 2012 has been capped to a maximum of 4 cores.

It’s well worth keeping track of official limitations here, as it can be a struggle to keep up to all the licencing changes!

[Edit (2014-01-22). The changes to the CPU core licencing saga for SQL Server 2012 continues. Even if you have Enterprise edition, if the licencing model being applied is Server + Client Access License (CAL) based then the server is limited to 20 physical cores per instance. Only Enterprise Edition utilising core-based licencing will be able to use the operating system maximum as stipulated in the Compute Capacity Limits by Edition of SQL Server documentation.

If you’re not sure what licencing model is in effect, the following message at the top of the SQL Server error log will confirm core based licencing:

2014-07-07 09:45:01.130 spid68 Microsoft SQL Server 2012 (SP1) - 11.0.3412.0 (X64)
Mar 2 2014 01:25:09
Copyright (c) Microsoft Corporation
Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.3 (Build 9600: )

If the ‘Core-based licensing’ stipulation is absent you’re on CAL licencing. Further down the error log (from a fresh SQL Server restart you will see the following message:

SQL Server detected x sockets with y cores per socket and zz logical processors per socket, zz total logical processors; using 40 logical processors based on SQL Server licensing. This is an informational message; no user action is required.

To change it you will need to get the right key and change the licencing as documented here, or you can just run setup again from the correct licenced CD and use my post on changing the SQL Server edition to make the change.
Note: I had a licencing ‘expert’ from one of the big 2 hardware vendors, who allegedly had a decades experience, try to convince me that I had to reinstall SQL Server to change the licencing model. This is not true, just follow the steps in that changing the SQL Server edition post.

You can verify a particular servers CPU core usage by interrogating the sys.dm_os_sys_info SQL Server dynamic management view (DMV). The following query lists how many CPUs a particular instance of SQL Server can see:

select cpu_count from sys.dm_os_sys_info

(The above query will also confirm only 20 or 40 cores (if hyperthreading is enabled) are actually online.)

The sys.dm_os_sys_info DMV returns a whole host of other useful information that used to be very difficult to obtain without using xp_cmdshell or WMI, so it’s well worth exploring further.

In fact, most of the DMVs have invaluable information if you know what you’re looking for, but that’s a subject area far too large for this post; just peruse Books Online as that gives a complete listing and a good rundown of what each DMV offers.
Just remember that, although SQL Server Express will be able to see multiple CPUs and use them to improve concurrency, SQL Server Express edition is prevented from taking advantage of this to parallelise individual queries, which basically means that an individual query will not be executed across multiple CPUs

How many CPU cores is SQL Server using?

Back to counting CPUs. To list how many CPUs a particular instance is actually using, run the following query which is based on the sys.dm_os_schedulers DMV:

select scheduler_id,cpu_id, status, is_online from sys.dm_os_schedulers where status='VISIBLE ONLINE'

This will return how many CPUs SQL Server is using by listing all the CPU cores it can see. It’s easy enough to modify this query with a count (*) if you just want a count of the CPU cores. Ordinarily, this will always equate to the number of CPUs on the system. However, if CPU affinity is being used to assign specific CPUs to SQL Server (in order to adhere to licencing requirements, for example), this query will show how many CPUs SQL Server is actively using.

Run the query without the where clause to see how many CPUs are offline and not being used (their status will be VISIBLE OFFLINE). The additional rows returned will show up internally used schedulers such as the scheduler retained for the Dedicated Admin Connection, or DAC.

If this query does not list the number of CPUs you are expecting, and you’re definitely not using CPU affinity, check the edition of SQL Server that you are using against the table below.

Note: To confirm that CPU affinity is indeed being used, the cpu_id column will always be less than 255.

CPUs supported by the various SQL Server Editions

A summary of how many CPUs the different editions of SQL Server support is summarised in the table below:

SQL Server version/Edition Express Workgroup Web Standard Enterprise
2005 1 2 n/a 4 Unlimited
2008 1 2 4 4 Unlimited
2012+ 1 Oh dear, you better read this and probably this