I was waiting to implement a change the other day and found the DBCC job was still running. As this was a production system I was not keen on cancelling the DBCC check. I thought about restarting it after I had completed my change, but although the DBCC had already been running for nearly an hour I had no idea how long it would take to complete. Then I remembered that DBCC is one of those commands that populates the percent_complete column in the sys.dm_exec_requests DMV, so I modified my backup/restore progress query to give me an ETA for when the DBCC would complete:

SELECT r.session_id,
       CONVERT(NVARCHAR(22), DB_NAME(r.database_id)) AS [database],
       r.command AS [dbcc command],
       s.last_request_start_time AS [started],
       DATEADD(mi, r.estimated_completion_time / 60000, GETDATE()) AS [finishing],
       DATEDIFF(mi, s.last_request_start_time, DATEADD(mi, r.estimated_completion_time / 60000, GETDATE())) - r.wait_time / 60000 AS [mins left],
       DATEDIFF(mi, s.last_request_start_time, DATEADD(mi, r.estimated_completion_time / 60000, GETDATE())) AS [total wait mins (est)],
       CONVERT(VARCHAR(5), CAST(r.percent_complete AS DECIMAL(4,1))) AS [% complete],
       GETDATE() AS [current time]
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s ON r.[session_id] = s.[session_id]
WHERE UPPER(r.[command]) LIKE 'DBCC %'

A sample output from a test database called er…test is shown below:

DBCC completion check output

This is hardly a new topic and whilst there are a lot of articles out there instructing us on how to stripe tempdb etc I feel there’s a bit of a gap on exactly how to identify a problem with tempdb in the first place and particularly on finding out how much of your overall IO tempdb is responsible for.

I cover that in this post but it is also covered more extensively in my SQL Server and Disk IO series of posts, as the troubleshooting and monitoring principles covered in them are just as valid for tempdb performance monitoring as they are for application databases.

Also, as this is the concluding post in that series I could hardly avoid writing about it!

If an application database is poorly configured in terms of disk layout then generally only that database will be affected, with, for example, queries that generate a lot of disk IO performing poorly. If tempdb is poorly configured in terms of disk layout the whole SQL Server instance will suffer performance problems, for the simple reason that tempdb is effectively a workhorse database that does a lot of work in the background.

Some of this background work tempdb does is obvious, some of it is not so obvious:

  • temporary tables (explicitly created in queries/procs etc)
  • queries that sort (order) data (temporary tables will be created under the covers to sort the data)
  • worktable spools (when a query cannot be processed entirely in memory)
  • Service Broker
  • app databases utilising Snapshot Isolation (to hold versioned data)
  • linked server queries
  • SQL Server internal processes

This is by no means a complete list, and, in all likelihood, as features are added to SQL Server the list will grow rather than shrink.

One way of thinking about tempdb is that it’s like an OS swap file. When the OS runs out of memory the swap file becomes the default dumping ground and performance slows, and in the same way that a poorly sized and configured swap file can cripple a server, a poorly configured tempdb will cripple SQL Server.

The most common tempdb issue that every DBA will hit sooner or later is space usage: a runaway query maxes out tempdb and it runs out of space. Those kinds of issues are pretty easy to solve and are well documented all over the web; check out the links at the end of the post.

This post is going to focus more on the performance issues and they can usually be pinned down to the physical database design. If you’ve been reading the previous posts on this SQL Server and Disk IO series you’ll know how critical it is to size a database well, split it up into multiple files and place it on performant disks.

Unsurprisingly, the same principles apply to tempdb. But from my time working at Microsoft it never ceased to amaze me how many clients went to a lot of trouble with the physical design of their application databases but paid no regard to applying the same principles to tempdb, so it would frequently be left on default settings in C:\Program Files\Microsoft SQL Server\… etc.

But I don’t know how much IO my tempdb is doing

Tempdb performance and its IO load can be easily checked via the Microsoft SQL Server 2012 Performance Dashboard Reports as well as my posts on monitoring SQL Server IO.

Once the reports are installed, open up the main dashboard report and from there click on the Historical IO link. This takes you to the Historical IO Report, et voila! You can now see exactly how much IO your tempdb does in terms of reads, writes, the time taken to read and write and, crucially, the percentage of all IO that each database on that SQL Server instance is responsible for.
If tempdb is taking a significant percentage of that IO it makes sense to make sure it’s optimised as well as it can be.
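
If you would rather go straight to the DMVs than install the reports, a rough sketch along these lines (assuming SQL Server 2005 or later; the figures are cumulative since the last restart rather than for a specific time window) gives a similar per-database percentage breakdown:

-- Each database's share of all IO (reads + writes) done by the instance since the last restart.
SELECT DB_NAME(vfs.database_id) AS [database],
       SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written) AS [total_bytes],
       CAST(100.0 * SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written)
            / SUM(SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written)) OVER () AS DECIMAL(5,2)) AS [pct_of_instance_io]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
GROUP BY DB_NAME(vfs.database_id)
ORDER BY [total_bytes] DESC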

For older versions of SQL Server check out this Codeplex project for dashboard reports that run from SQL Server 2005.

Sizes and stripes and trace flags

We know how much IO tempdb is doing and we want to apply the same tuning principles we would apply to any other database, starting with striping it across multiple files. There’s a lot of material out there on why this is ‘a good thing’ so I’m not going to regurgitate why here.

The first thing to check is that tempdb is not on default settings and in the default location (unless that default location has super-low latency and can handle a lot of IOPs).

Out of the box tempdb (up until at least SQL Server 2012) will be sized with one 8 MB data file and one 512 KB log file. If your SQL Server installation consists of one application database containing one table with one row then you’re good to go. The chances are your SQL Server installation will be a little more substantial than that so tempdb needs to be sized up.

In SSMS right-click on tempdb and select Properties and click on Files under the ‘Select a page’ section of the ‘Database Properties – tempdb’ dialog box:

This will list tempdb’s files, their locations and their initial sizes. If there’s only one data file and one log file at the default sizes (under Initial Size (MB)) we have work to do. Note: the SSMS GUI will round up a default log size of 512KB to 1MB. To prove this, run

sp_helpdb tempdb

in a new query window and review the output for the tempdb log file; it should show the size as 512 KB.

Adding stripes is simply a question of clicking the Add button here and setting the appropriate size, autogrow, path and filename settings.
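
Once the storage is in place, the same change can be scripted. Here’s a minimal sketch; the T:\TempDB path, the 10 GB sizes and the extra file names are placeholders, so adjust them to your own storage and core count (tempdev is the default logical name of the original tempdb data file):

-- Resize the existing data file (MODIFY FILE changes one property at a time)...
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 10240MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILEGROWTH = 1024MB);
-- ...then add three more equally sized stripes.
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 10240MB, FILEGROWTH = 1024MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 10240MB, FILEGROWTH = 1024MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdev4.ndf', SIZE = 10240MB, FILEGROWTH = 1024MB);

Keep all the data files the same size with the same growth settings, otherwise SQL Server’s proportional fill algorithm will favour the larger files and defeat the point of striping.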

You’ll need to provision some fast storage for the additional tempdb stripes first. Ideally, use flash, SSDs or fast local disks. If you’re at the mercy of a SAN administrator, request something akin to RAID 0; you don’t need to worry about mirroring/parity because tempdb is rebuilt on SQL Server startup. Obviously, if a disk keeps failing that will still need addressing, but from a DBA perspective there are no recovery options for tempdb other than restarting SQL Server.

To work out how many stripes you need, keep an eye on the guidance Microsoft publish, as the advice changes over time; anything I put down now will probably be out of date in a couple of years. I start at one quarter of the number of cores for systems with up to 64 cores and (personally) have never found the need to go beyond half the cores, even on systems with more than 32 cores.

On high throughput systems that are very tempdb heavy you may need to up this.

The SQLCAT team (or whatever they are called this month) recommend one stripe per core but these were extreme-scale systems so IMHO not practical for the vast majority of systems out there.

Establishing whether you need more stripes is not that difficult, as a shortage will manifest itself as something called PFS page contention.

Add stripes and enable trace flag 1118 as described in the above article and many others. There’s no downside to it and it’s a de facto step I take on all the SQL Server installations I’m involved with. Check out this Microsoft KB article if you’re still not sure.
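
For reference, the trace flag can be switched on immediately from a query window like this; to make it survive a restart, add -T1118 as a startup parameter via SQL Server Configuration Manager:

DBCC TRACEON (1118, -1);     -- -1 enables it globally, but only until the next restart
DBCC TRACESTATUS (1118, -1); -- confirm it is on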

The elephant in the room is the size of the stripes, and therefore the size of tempdb. There is no right answer for the size of tempdb: it depends on the load. The load is not necessarily how busy the system is right now, or even how busy it is at peak, it’s how big tempdb needs to be when the most tempdb-intensive loads are placed on it, e.g. overnight ETL jobs. If those jobs grow your tempdb to 500GB at 3am, whereas for the other 23 hours of every day your tempdb never holds more than 50GB of data, then your tempdb needs to be sized at 500GB. Otherwise the next run of that overnight job will be crippled by the incremental autogrows back up to 500GB.

The same sizing principle applies to the transaction log file size; make it big enough to cope with the demands of the most intensive jobs to avoid autogrows.

There’s nothing to stop you looking at the tasks that cause excessive tempdb usage (see the ‘Useful Links’ section at the end of this post), as there may be ways to optimise those tasks and reduce that tempdb usage. Sometimes, though, we just have to accept that some procedures or jobs will be heavy tempdb users, so we divide our stripe sizes appropriately so that they add up to the size of database we need, plus a reasonable overhead, both within the overall size of tempdb and as free space on the underlying volume the stripes reside on.

This leads inevitably onto the ‘how many volumes?’ question. The simple answer is to split the data files equally across multiple volumes if you’re on traditional spinning disks. If you’re fortunate enough to have flash storage, which can handle anywhere from 300,000 to 500,000 IOPs, one volume should suffice for most installations; if it’s a high-throughput, latency-sensitive system, stripe across multiple flash devices. Besides, if tempdb is placed on a SAN (remember tempdb on local disks for a failover clustered instance of SQL Server is supported from SQL Server 2012), those multiple volumes you have requested might be placed on overlapping spindles anyway, or shared with other heavy-usage systems, unless you happen to have a SAN admin who really knows their stuff (I think you’re more likely to stumble across a unicorn) or have the luxury of a dedicated SAN.

For failover cluster instances of SQL Server don’t forget to ensure that all the volumes and paths used for striping tempdb are present on all the nodes in the cluster, otherwise tempdb creation will fail and SQL Server will not come online on that node.

SQL Server will have to be restarted for any changes to the striping to take effect.

If SQL Server fails to start after these changes it’s almost always down to a typo in the path and that usually means having to force SQL Server to start with certain command line startup options as documented in this KB article. The article covers recovering a suspect tempdb but it shows what you need to do to restart SQL Server and bypass the creation of tempdb in order to get you into a position where tempdb paths can be corrected.

An aside: sometimes it really is worth finding out what’s using tempdb

On a system I used to manage we investigated tempdb usage as tempdb was generating a lot of slow IO warnings in the error log and the system was really struggling. The previously mentioned dashboard report revealed that 60% of all IO on the system was being generated by tempdb.
By checking the sys.dm_exec_requests DMV via some home-brewed stored procs (which I might one day blog about, as they’ve proven very useful to me) and using Adam Machanic’s excellent sp_whoisactive proc, we isolated the issue to one stored procedure that was doing some string manipulation and being called about 20 times a second. Not only that, but it was explicitly creating and dropping a temp table whenever it was called.
The explicit temp table creation and deletion was removed first and the string manipulation was eventually moved into a CLR function. Tempdb was then responsible for @ 10% of the overall IO of the whole system. Not a difficult change, and a big win on overall system performance and throughput.
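
Those home-brewed procs aren’t published, but a simplified sketch of the kind of query they wrap (based on sys.dm_db_session_space_usage, available from SQL Server 2005 onwards) looks something like this:

-- Net tempdb pages currently allocated per session (user objects + internal objects), in MB.
SELECT session_id,
       ((user_objects_alloc_page_count - user_objects_dealloc_page_count)
      + (internal_objects_alloc_page_count - internal_objects_dealloc_page_count)) * 8 / 1024 AS [tempdb_mb]
FROM sys.dm_db_session_space_usage
WHERE session_id > 50   -- ignore system sessions
ORDER BY [tempdb_mb] DESC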

So to summarise…

  • use the dashboard reports or the blog posts summarised in my SQL Server and IO series of posts to get an idea of how tempdb is performing
  • modify the tempdb configuration, if required, by carefully creating stripes with appropriate sizes and growth settings (don’t forget to test on every node if SQL Server is clustered)
  • check the dashboard reports or DMVs and perfmon stats to get objective metrics on any improvements
  • don’t forget to regularly review these metrics to ensure tempdb is coping

Useful links

Troubleshooting Insufficient Disk Space in tempdb

SQL Server (2005 and 2008) Trace Flag 1118 (-T1118) Usage

Correctly adding data files to tempdb

At a client site recently a SQL Server 2012 (Enterprise, SP1 + CU#9) server with 512GB RAM and 80 cores (40 cores assigned to SQL Server) that had been patched and restarted displayed the following symptoms after an index reorg operation was restarted: CPU saturation, with all cores available to SQL Server (but not all cores on the box) almost totally maxed out and the perfmon data showing an average of 96% CPU usage for those CPUs.

The server was all but inaccessible, with users reporting trouble connecting for 5-10 minutes, after which point CPU usage settled back down and normal performance resumed.

Cancelling the index reorg appeared to have no effect on the duration of the issue.

Shortly after CPU usage returned to normal I found the following messages in the error log:

FlushCache: cleaned up 70316 bufs with 51811 writes in 197020 ms (avoided 17940 new dirty bufs) for db 5:0
average throughput: 2.79 MB/sec, I/O saturation: 481, context switches 1041
last target outstanding: 38800, avgWriteLatency 0
FlushCache: cleaned up 36001 bufs with 26512 writes in 94525 ms (avoided 23060 new dirty bufs) for db 5:0
average throughput: 2.98 MB/sec, I/O saturation: 983, context switches 1107
last target outstanding: 38800, avgWriteLatency 0

These messages are logged (automatically in SQL Server 2012, otherwise via trace flag 3504; see the Further Reading link at the end of this post) when SQL Server runs a checkpoint but the checkpointing process takes longer than the configured recovery interval, which by default is one minute.

The first FlushCache message shows 70316 buffer pages: 70316 x 8K = 549.34MB, divide that by the 197020 ms duration reported gives you that 2.79 MB/sec throughput rate.
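
If you want to sanity-check the numbers in your own FlushCache messages, the arithmetic is easy to reproduce:

-- 70316 pages x 8 KB = ~549 MB flushed; divide by the elapsed time to get the reported throughput.
DECLARE @bufs INT = 70316, @duration_ms INT = 197020;
SELECT CAST(@bufs * 8 / 1024.0 AS DECIMAL(10,2)) AS [mb_flushed],
       CAST((@bufs * 8 / 1024.0) / (@duration_ms / 1000.0) AS DECIMAL(10,2)) AS [mb_per_sec];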

The server was in an Availability Group, so I failed over and tried again; same problem. I failed back and tried a third time and the problem recurred a third time. This (as of the time of writing) is the latest publicly available SQL Server 2012 build and I was getting a bit worried, so I posted the question on #sqlhelp and got the kind of responses I would probably have posted myself: on the face of it, this indicates an IO bottleneck. I wasn’t buying that, though, as the database was on flash storage and had a huge buffer pool (max server memory set to 448 GB) available to it.

I looked at the perfmon data I had from that period and plotted the ‘Disk Transfers/sec’ counters alongside the ‘% Processor Time’ for sqlservr.exe, and sure enough it showed an apparent disk stall coinciding with the CPU spike, with the average disk transfers dropping from @3,400 to < 100 during this period. This appeared to ‘prove’ an IO stall. However, on this server we have disks running on the SAN and local flash storage running off the PCI bus, and all the disks showed this apparent stall, which is what made me a little suspicious.

Sure, with all CPUs maxed throughput everywhere is going to crash, but if this was a genuine IO stall I would expect disks on the specific subsystem this database was using to suffer, not a blanket stall across all disks (and not all available CPUs to be maxed out, for that matter).

As the checkpoint process flushes dirty pages from memory, I homed in on SQL Server’s memory counters, namely ‘Target Server Memory (KB)’ and ‘Total Server Memory (KB)’, and then things started to fall into place.
Shortly after the index reorg operations were triggered, we see a sharp rise in the ‘Total Server Memory (KB)’ counter, so SQL Server appears to be allocating more memory.
However, at the same time CPU usage maxes out, and the CPUs stay saturated until ‘Total Server Memory (KB)’ hits ‘Target Server Memory (KB)’, at which point CPU usage drops very sharply back to normal and we see a sharp drop-off in ‘Total Server Memory (KB)’ (to a point where it’s significantly lower than when the problem started). That drop does not in itself make sense, as the buffer pool has not reached its max server memory limit, but it does confirm the flushing behaviour.
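
Both counters can also be watched from a query window via sys.dm_os_performance_counters, which makes a handy cross-check against the perfmon capture (just a sketch; the perfmon data is what I actually worked from):

SELECT RTRIM(counter_name) AS [counter], cntr_value / 1024 AS [value_mb]
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name IN ('Target Server Memory (KB)', 'Total Server Memory (KB)');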

 

Figure 1 shows the perfmon data, which captured two occurrences of the problem. The blue line is the ‘Target Server Memory (KB)’ counter, which sits at the buffer pool limit of 448 GB; red is the SQL Server ‘% Processor Time’ counter and green is the ‘Total Server Memory (KB)’ counter.

They show the issues which started @ 12:15 and 15:15. The CPU drop off in the middle is from an Availability Group failover/failback during which CU#9 for SQL Server 2012 SP1 was applied as it fixed an issue related to reindexing.

Call me Sherlock, but on the face of it it looks as if the index reorg triggers a race condition within the database engine, where it is both allocating memory into the buffer pool and trying to free memory from the pool, and these two operations don’t play nicely with each other.
The operation increasing the buffer pool size wins, eventually, after reaching its maximum configured size and maxing out the CPUs on the way; at that point the operation clearing dirty pages completes and dumps out those FlushCache messages which implicate the IO subsystem (falsely, IMHO), and we see a corresponding freeing of pages in the perfmon graph and the resumption of normal CPU usage.
An index reorg should be something we can run whenever we like; it should not trigger this behaviour.

Setting max server memory equal to min server memory may be a workaround here, as so many people on Twitter mentioned, but I don’t want to prolong how long SQL Server takes to initialise, so I won’t be doing that, although I have bumped up the min server memory value.
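
For completeness, that workaround would look something like the sketch below (458752 MB being 448 GB, to match this server’s max server memory setting); as I said, I only raised min server memory rather than pinning the two together:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'min server memory (MB)', 458752;   -- 448 GB
EXEC sp_configure 'max server memory (MB)', 458752;   -- 448 GB
RECONFIGURE;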

If you’ve seen this message in your server maybe this will help you get to the bottom of your issue (assuming you have definitely ruled out IO issues).

Further reading:

Bob Dorr’s explanation about these FlushCache messages.

Finally got some breathing space for my first post of the year!

In order to get a major client through Christmas which is traditionally their busiest period of the year (where load normally goes up five-fold), we embarked on a major platform refresh where both the back end SQL Server DBMS was upgraded from SQL 2008R2 to SQL Server 2012 and the hardware was upgraded to 2 x Dell PowerEdge R910s with 4 x 10 (physical) cores and 512GB of RAM in each server.

In order to take the disk I/O bottleneck out of the equation we decided to do away with traditional spinning disk technology, so the SAN was kicked out (sob) and a shiny pair of 2.4TB Fusion-io Duos were added to the server spec sheet.

This would allow us to use AlwaysOn Availability Groups in SQL Server 2012, which have the major upside of allowing the secondaries to be readable instead of completely offline, which, of course, is why having Fusion-io drives in both servers really comes into its own.

But I digress; this post isn’t about boasting about new kit or Fusion-io’s hardware. Rather, it’s a warning about choosing the right PCIe slot for the Fusion-io cards, because although the R910s have plenty of PCIe slots they don’t all run at the same speed. Our main concerns when adding an older 785 GB ioDrive2 Fusion-io card we were recycling from the old kit were that the slot had enough power and that there was room for airflow between the shiny new 2.4TBs that were already installed and these cards we were adding in from the older kit.

When the servers were powered up there was nothing untoward showing up in event logs etc., no bells or sirens, and the card was initialised and formatted and could have been used without issue. It was only when the fio-status command was run as a final check (with the -a option) that any mention was made that all was not as it should be, as the screenshot below shows:

And there you have it, a discreet little “The bandwidth of the PCIe slot is not optimal for the ioMemory.” warning for the newly inserted ioDrive2 card.

We quickly realised that the warning was due to the fact that the newly inserted card was in an x4 slot, and as soon as we stuck it in an x8 slot the warning disappeared and we were good to go. To be fair, the Fusion-io docs do mention an x8 slot in passing, but it’s easily missed as more focus is placed on the height of the slots. If you do have a server with multiple speed slots, make sure you pick the right one, as there’s no ‘loud’ indication that you’re not getting the best out of the card, and the x4 and x8 slots look quite similar, so check your server motherboard docs in advance.

I’m not sure what the performance penalty would have been, but with one of the cards using only half the potential bandwidth I didn’t want to find out.

So, if you’re migrating to meaty new servers be aware of the proliferation of multiple speed PCIe slots and make sure you pick the right speed slot, especially when dealing with multiple Fusion-io cards, and if you’re considering new hardware you might want to add a check to ensure there are enough correct speed slots for your needs.

Everything you need to know about PCI Express

Carrying on my series of posts on SQL Server and Disk IO, it’s time to cover the old stalwart that is perfmon (referred to in Windows as Performance Monitor). Anyone who has delved into any Windows performance issue will have some familiarity with it, so I’m not going to cover what it is or how to use it; one word: Google.

The principal reason for this post is to add my own experiences of using perfmon to track down IO issues.

Starting perfmon

I’m assuming Windows 7 here, so adapt these instructions according to your flavour of Windows:

Go to Start > Run, type perfmon and hit Enter to launch it. This will start perfmon (Performance Monitor) under the Performance > Monitoring Tools > Performance Monitor branch, which is exactly where we need to be.

This is where we create the template trace to monitor SQL Server’s IO performance as far as Windows is concerned. The mechanics of creating and managing perfmon templates and their output are well documented all over the wwitblo (world-wide-inter-twitter-blogosphere), so straight to the important counters.

Press Ctrl+Shift+N to bring up the Add Counters dialog. Scroll down to and expand the Physical Disk section. The counters we need are:

  • Current Disk Queue Length
  • Disk Read Bytes/sec
  • Disk Write Bytes/sec
  • Disk Bytes/sec
  • Disk Reads/sec
  • Disk Transfers/sec
  • Disk Writes/sec
  • Avg. Disk sec/Read
  • Avg. Disk sec/Write
  • Split IO/Sec

An explanation of what these counters reveal is available if you click on Show Description, but I’m going to add my own 2c, otherwise there ain’t much point in me writing this post.

Current disk queue length

This will tell you how many pending IO requests are queued up against that disk, and it’s a sure sign of IO pressure on local/directly attached disks. If your disks are on a SAN or on flash memory these numbers cannot be relied on directly, but if you see high queue lengths when IO-heavy queries/jobs are being run, along with high numbers from some of the other disk IO related counters, then that should pretty much confirm an IO subsystem issue (i.e. don’t be too quick to dismiss this counter). Furthermore, as this counter shows the actual queue length at that point in time, and not an averaged-out figure, I find it more reliable, provided the sample rate is low enough.

Disk Read Bytes/sec & Disk Write Bytes/sec & Disk Bytes/sec

These counters show the disk read and write throughput. In the event of an IO stall, Disk Read Bytes/sec (and usually the corresponding Disk Write Bytes/sec counter) will show a big drop-off in read/write throughput. Disk Bytes/sec is merely the sum of the read and write bytes per second, but I think seeing them independently is more useful as it can show up specific disks as, for example, having fantastic read speeds but poor write speeds.

Disk Reads/sec, Disk Writes/sec & Disk Transfers/sec

These show what they say. I’ve seen some documentation on the wwitblo (from a major SQL monitoring software vendor, no less) stating that these counters can be ignored as (and I’m now paraphrasing here) SANs can be quite complicated and do lots of funky things under the covers, so don’t worry your pretty little head about this counter too much.

I’m inclined to disagree.

These figures are damned useful. Firstly, they’ll show you if drives are being over-utilised by reads or writes. If you’ve got a database that’s, for example, part reporting and part OLTP, and you suspect a particular filegroup is being heavily utilised by some reports, this should show up as heavy read traffic on the partition that hosts the filegroup those reports are hitting, which allows you to look at ways of tweaking performance by, for example, adding files or moving the filegroup to faster disks.
These counters are even more useful than that, though, as I’ve used them to prove a SAN was no longer up to the job of running the database it was hosting: the Total for Disk Transfers/sec was regularly hitting the ceiling of the maximum IOPs the SAN in question could support, and this was manifesting itself in all the classic ways: slow reindexes, slow DBCCs, IO stalls in the SQL error log, slow backups, replication latencies, as well as general database sluggishness as all transactions have to be hardened to the transaction log (don’t forget confirmation from wait stats and DMVs as well).

So, get confirmation of how many IOPs your IO subsystem can support and then use this counter to confirm an IO bottleneck related to IOP throughput, or use it to proactively monitor your storage subsystem so you get early warning of a pending bottleneck. And yes, within a major enterprise there will be multiple SQL Server instances (or other DBMS flavours) plugged into the same SAN, but again, and with more effort, the total IO of these instances can be added up to get a rough idea of how many IOPs the SAN is having to contend with.

Avg. Disk sec/Read & Avg. Disk sec/Write

These counters show the average amount of time it took to read or write the IOs occurring at that time, and this depends greatly on how frequently the data is being sampled and the number of IOs occurring at that time, so a slight increase in the average can mask a significant slowdown. Keep an eye on the peaks as well as the average, which you want as low as possible: ideally less than 10ms if your disks are traditional spinning disks.

Split IO/Sec

If your partition alignment is not correct, this will show the additional IOs that were required to complete a read or write. This was prevalent on Windows Server 2003 and earlier and can be a significant drag on throughput.

With these counters selected we need to select the disks we want to track those counters against under Instances of selected object:

 

Figure 1: Physical Disk counters

 

If the database drives are identified by drive letter then that’s simple enough to do, but if mount points are in use you’ll have to go by the drive number, which you’ll find (along with the LUN) via the Disk Management tool. Once you’ve identified the disks for the database drive(s)/mount point(s) of interest, select them and click OK.
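
If you’re not sure which volumes or mount points the database files actually live on, a quick query against sys.master_files will tell you which paths to go looking for in Disk Management:

SELECT DB_NAME(database_id) AS [database], name AS [logical_name], physical_name
FROM sys.master_files
ORDER BY physical_name;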

Instead of incorporating the disk counters into an existing template, I prefer to use a separate perfmon template just to track IO performance, for reasons I’ll explain further down.

Physical Disk vs. Logical Disk counters

The difference between the Physical Disk and Logical Disk counters is basically that Logical Disk shows the aggregated view of a drive where, for example, drive D: is spread over multiple disks (e.g. 4 (D:), 7 (D:), 9 (D:)), or where a drive is hosted on a disk split into more than one partition, like C: (0) and D: (0). Physical Disk shows the raw underlying drive or mount point, and I always use that, although Microsoft would recommend looking at Logical Disk first and then drilling down.

A separate template to monitor disk counters

The reason I prefer to put the disk counters in their own perfmon template is down to the sample rate. The default perfmon sample rate is 15 seconds and if you stick with that sample rate all the disk counters with the Avg abbreviation will be averaged over that 15 second sample interval.

Why is this a big deal?

Well, it depends (of course!).

Are you trying to track down an IO stall that lasts for 5-6 seconds or are you trying to track down slow IO that lasts for a couple of hours?

If you’re tracking down a 5-6 second IO stall you’re not going to have much luck sampling every 15 seconds as any significant issues lasting 5 or 6 seconds may get almost completely smoothed out when they are averaged over 15 seconds.

For a longer lasting issue you may be able to go for a longer sample interval, but from personal experience I’ve found short sample intervals to be much more revealing, especially when tracking SAN performance as it can be so spikey.

By way of example, a few years back I spent a lot of time tracking down an infrequently occurring IO stall on a SAN for a critical low-latency trading platform.

The IO stalls would typically last @ 5 seconds and sometimes just a couple of seconds and I could never trap them until I got the perfmon sample interval down to 1 second. That’s how low I had to go; and we did find them and prove it was a SAN issue, although we had to enforce the co-operation of the SAN team at the proverbial gunpoint. Good luck with that when the time comes.

The performance impact of this low sample interval was negligible, although the perfmon trace files inevitably got rather large rather quickly and a close eye had to be kept on them.

Finally, if this is a SAN that is being tracked then the SAN admins are going to have their own ‘monitoring’ which may not show a problem.

From personal experience this will be because the SAN ‘monitoring’ will be at a very different sample rate, usually in the region of minutes, not seconds. Therefore, to accurately pick up IO issues you’ll have to persuade your SAN admins to use the same sample interval perfmon is using, otherwise we’re comparing apples to pears: your perfmon data will show huge spikes or dips whilst the SAN ‘tool’ will show a wonderful flat line as it averages everything out over a massive sample interval.

I’ve actually had the head of the SAN storage team argue with me that the 30 minute sample interval they used to monitor their SAN would pick up a 2 second IO stall. I kid you not.

Viewing the data

Some quick notes on reviewing the data gathered in the perfmon file(s). I’m assuming a perfmon file has been loaded into perfmon ready for viewing.

  • Right-click on the graph area in perfmon, go to Properties, click on the Graph tab and set a sensible upper limit for the vertical scale. The default is 100, and if you’re looking at e.g. Disk Read Bytes/sec you might want to up that into the thousands. Select the vertical grid tick box as well to make it easier to read off the numbers.
  • When viewing the data, press Ctrl+H to highlight the selected counter in the display and make that counter’s data stand out. It makes it much easier to isolate the data values when stepping through multiple counters.
  • When looking at a perfmon file which covers a wide time range it can be easier to narrow down that range by right-clicking the graph area, selecting Properties, clicking on the Source tab and scrolling the Time Range scroll bar. Alternatively, whilst looking at the graph, click on the start time you’re interested in and drag your mouse across until the time frame you’re interested in is highlighted, then right-click and select Zoom To.
  • Watch out for scaling. If you’re comparing multiple values, select them, right-click on any of the selected counters and select Scale Selected Counters, as it can be very easy to misread a counter if the scaling chosen for it either exaggerates or understates its true value.

Working through an example

Recently I spotted the following slow IO warning on a production server:

2013-11-04 03:07:00.130 spid5s       SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [XXXXX] in database [YYYYY] (n).  The OS file handle is 0x0000000000000D74.  The offset of the latest long I/O is: 0x00002449860000

I loaded up a perfmon template from that period:

 

Figure 2: IO latency capture in a perfmon trace

A 14 ms average doesn’t look too bad, does it? That 10.144 second max can be ignored (fortunately!) as it coincided with scheduled work on the SAN at that time. Zooming into the 03:07 time period in question paints a slightly different picture for the IO numbers:

Figure 3: Perfmon IO latencies

 

We can see that in the couple of minutes leading up to the slow IO warning in the SQL Server error log, both read and write latencies were spiking up.

I’ve left Avg. Disk sec/Read highlighted and we can see the average IO has jumped up from 14ms to 50ms which is well into poor performance territory. I’ve also added a third counter (Disk Transfers /sec) so we can see how much IO was going on (just against that partition) at that time, and for that period there were 3,016 disk transfers per second with a peak read latency of 0.8 seconds and write latency of 1.5 seconds.

Bearing in mind that Avg. Disk sec/Read and Avg. Disk sec/Write are just averages (and will therefore be averaging out the true latencies of those three-thousand-plus IOs a second), it’s clear the disk subsystem was struggling at that time and, as the error message indicated, at least one IO did not return for at least 15 seconds.
Now, I know that on this system that there just happened to be a reindex running which swamped the IO path at that time, and I know from experience these slow IO warnings coincide with just these sort of disk intensive jobs.

Warnings coinciding with the times these jobs run are usually the first sign that the infrastructure of the system in question is starting to creak, and making plans for pre-emptive action is advisable, or this latency will start creeping into the working day of the system and start affecting users. And guess who they’ll come running to? I’ll give you a clue: it won’t be whoever is responsible for that disk subsystem.

Other factors

After many weeks of trying to track down intermittent IO stalls at a financial institution I used to work at the problem turned out to be a networking issue, but not the normal networking kind of issue.

The first clue was that the perfmon data and the SAN performance data (even when the SAN data was gathered at a sensible low interval rate) never matched. This indicated the problem was between the host and the switches/HBA cards, but these checked out OK as well.

This left only one alternative; this financial institution I worked for could afford stretched Windows clusters; e.g. Node 1 would be in the data centre in the basement, and Node 2 was @ 15 miles away at a DR centre. But Windows clusters need shared storage, don’t they? They do indeed, and the way to make Node 1 and Node 2 think they’re looking at the same SAN is to mirror the SAN over underground dark fibre links. Every write is replicated with microsecond latencies.

This added about 3-4ms to every write operation, but that was more than acceptable, especially because of the DR protection it gave us (without resorting to database mirroring/log-shipping/SQL Server replication etc.). But I digress; as you’ve probably guessed, this is where we tracked the issue down to. Most of the unresolved IO stalls were down to glitches in these links, which the networking team were able to confirm when we gave them precise times and dates from the perfmon data.

Although these glitches happen from time to time – usually when workmen dig up roads and punch through the cables – the networking team never thought to mention it outside their usual circle because, up until that point, none of the systems at that institution were that sensitive to IO latencies.

As you might expect, this has now changed, along with their notification procedures.

Incidentally, when the workmen cut through the cable everything automatically switched to a backup circuit within milliseconds but when that bubbles up to the Windows cluster and those lost IOs have to be re-sent, those milliseconds turn into 7-8 seconds.

If you’re looking after a system which has this kind of technology (the institution in question was using EMC GeoSpan clustering) this might be something worth checking out, if you haven’t already done so.

Finally…

Don’t look at the perfmon numbers in isolation.

If you’re seeing latency spikes, check the number of disk transfers taking place at the same time, in case those IOPs are higher than your disk subsystem can sustain. At the same time, check the wait stats and DMVs for corroborating data on any issues with those drives.
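
For the wait stats side of that corroboration, a quick sketch against sys.dm_os_wait_stats (cumulative since the last restart) covering the main IO-related waits might look like this:

SELECT wait_type, waiting_tasks_count, wait_time_ms,
       CASE WHEN waiting_tasks_count = 0 THEN 0
            ELSE wait_time_ms / waiting_tasks_count END AS [avg_wait_ms]
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
   OR wait_type IN ('IO_COMPLETION', 'WRITELOG')
ORDER BY wait_time_ms DESC;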

The Useful Links section below will help with getting this extra information.

http://blogs.technet.com/b/askcore/archive/2012/03/16/windows-performance-monitor-disk-counters-explained.aspx

How to examine IO subsystem latencies from within SQL Server

Tracking SQL Server IO Performance

IO Wait Stats

SQL Server and Disk IO

File fragmentation? What file fragmentation?

A few years back a SAN admin told me not to worry about physical file fragmentation because “The SAN takes care of it”, and I took them at their word and, probably along with a lot of fellow DBAs who have their back-ends (in a manner of speaking) on a SAN, allowed myself to wallow for too long in this cosy myth.

It ain’t so

Recently, I’ve found that not all SANs “take care of it” as well as you might think.

In fact, the documentation for this particular SAN (which shall remain nameless in order to protect the guilty) suggested that an OS level defragmentation should be carried out regularly (?!).

I’ve hunted around the world-wide-inter-twitter-blogosphere and there appears to be no clear consensus and even some contradictory posts on the matter.

I imagine this is because some SAN vendors handle fragmentation better than others and that, to an extent, files are by definition fragmented as they are striped across multiple disks in a SAN. Even so, fragmentation within these stripes must affect performance on traditional spinning disks.

I’m just going to recount my recent experiences and let everyone make up their own mind. I’ll show how easy it is to ‘see’ file fragmentation occurring on a Windows Server 2008 R2 system running SQL Server 2008 R2 (although file fragmentation itself is a universal issue and not really governed by OS or SQL Server version), and what I noticed once I had dealt with the file fragmentation I uncovered on an established system that had only been running for about a year and a half at the time (but long before it came under my wing, I hasten to add!).

Tool up

First of all, we’ll need a tool which actually shows file fragmentation. If you use Windows Explorer to tell you how fragmented your database files are, expect it to tell you everything’s good, there’s no file fragmentation here, move along now.
That’s because (by default) Windows does not consider any file over 64 MB to be fragmented, and if you’re using the GUI accessed via Windows Explorer to check for it you’ll be none the wiser; only running defrag from the command prompt actually states

“Note: File fragments larger than 64MB are not included in the fragmentation statistics.” as part of the output of a file fragmentation analysis invoked via the /A parameter.

I have to admit, this did fool me for a while as I was investigating one particular database file that was showing significant IO latencies compared to other files in the same database located on the same partition, even after going through all the usual steps of looking at partition alignment, sector sizes, index defragging, high contention tables on that file etc etc.

Resolving file fragmentation

As soon as I uncovered the article outlining the changes to Windows disk defragmentation and did a bit more digging around, I went to that mainstay of all Windows troubleshooting sites, Sysinternals, and downloaded contig.exe. This great little tool will show the file fragmentation of individual files or directories and allow them to be defragmented.

I pointed it at the 180GB problem file and I used the -v parameter to list all the fragments, expecting maybe a couple of dozen fragments. The output can be seen in fig.1 below:

It actually filled the screen and scrolled and scrolled, eventually reporting a shocking 4,818 file fragments (the picture has been altered to show the relevant output and obfuscate the name of the file concerned).

The output lists the starting offset of each fragment and the number of allocation units it contains (multiply this by the allocation unit size the partition was formatted with to get the size of that fragment). Some of the fragments were only 64K in size. I’ve no idea how there could be such tiny fragments (even allowing for default autogrow settings), other than the possibility that at some point there was little free space on the partition concerned and the remaining free space was highly fragmented, so space was allocated wherever it could be found until the partition was resized (although this is purely speculation on my part).

Incidentally, contig.exe can be used to analyse free space fragmentation and report your biggest free fragment via the -f parameter.

Not only that, but it can also defragment files, one file at a time if required, and it can do all that whilst the database is online, although I would personally never risk trying that on a live production system.

I had to deal with this particular issue in a slightly different way (a rough T-SQL sketch follows the list):

  • first, I added additional files to the filegroup
  • then I shrunk the original highly fragmented file which would redistribute the data onto the unfragmented files added to the filegroup and capped the original file to a low maximum size (4GB). I’ll explain why later
  • a reindex was then carried out, as shrinking creates internal index fragmentation
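
Here’s the rough T-SQL shape of those steps; the database, file and table names, paths and sizes are all hypothetical, and in reality each step was done with rather more care:

USE MyDB;
-- 1. Add a new file to the filegroup, on the correctly formatted (64K allocation unit) partition.
ALTER DATABASE MyDB ADD FILE
    (NAME = MyDB_data2, FILENAME = 'E:\Data\MyDB_data2.ndf', SIZE = 50GB, FILEGROWTH = 4GB);
-- 2. Shrink the fragmented primary file so its data is redistributed into the new file(s),
--    then cap it so it cannot grow (and fragment) again.
DBCC SHRINKFILE (MyDB_data, 4096);   -- target size in MB
ALTER DATABASE MyDB MODIFY FILE (NAME = MyDB_data, MAXSIZE = 4GB);
-- 3. Rebuild the indexes, as shrinking introduces internal index fragmentation.
ALTER INDEX ALL ON dbo.MyBigTable REBUILD;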

Ideally, I would have just used contig.exe to clear the fragmentation, but there were other factors involved here which forced my hand. Firstly, the file was on a partition that had been created with an incorrect allocation unit size, and this proved a convenient way of both resolving the fragmentation and moving the bulk of the filegroup to a partition created with a 64K allocation unit size, whilst keeping the database online. As the file concerned was the primary mdf, I could not empty and drop the file, which is what I would ordinarily have done under these circumstances, and a detach/attach operation to move it to the correctly formatted partition would have involved downtime. Secondly (and this may affect others as well), contig.exe does not defragment files on mount points and, unfortunately, this database was utilising mount points.

There are ways around this, like using the subst command, but the system in question was clustered as well, so I wasn’t prepared to mess around with it too much.

What’s the big deal?

Backup performance went up 10% immediately after resolving the file fragmentation, and read latency went from 89ms to 13ms, an improvement of @ 85%. Reindexing performance also improved, because certain indexes were no longer internally fragmenting as frequently as they had been before the physical defragmentation.
This kind of improvement is usually achieved by throwing a lot of money and tin at a problem, but it cost us nothing, so I think it’s a big deal.

See for yourself

It’s not difficult to simulate how a database can become physically fragmented (a T-SQL sketch of these steps follows the list):

  1. Create a small, single data file database with a data file of at least 64MB (on a test system or even your desktop) on a partition that has been in use for a while (i.e. not a pristine empty partition).
  2. Run contig against the mdf file and you’ll see it as one contiguous file (unless the free space on the system concerned has become massively fragmented).
  3. Mimic several autogrows by manually altering the size of the data file and growing it by e.g. 1MB at a time.
  4. Re-run the contig.exe check against that data file. Note the increase in fragments (there probably won’t be a 1:1 ratio as some of the growths will be appended  to existing fragments).
  5. (Optionally) run the Windows fragmentation report which will report no fragmentation as, by default, it’s ignoring files bigger than 64MB.
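
A minimal sketch of steps 1 and 3 (the paths and names are placeholders; run contig.exe against the mdf before and after the loop):

CREATE DATABASE FragDemo
ON (NAME = FragDemo_data, FILENAME = 'C:\Temp\FragDemo.mdf', SIZE = 64MB)
LOG ON (NAME = FragDemo_log, FILENAME = 'C:\Temp\FragDemo_log.ldf', SIZE = 16MB);
GO
-- Mimic 50 x 1MB autogrows by repeatedly bumping the data file size.
DECLARE @size INT = 64, @sql NVARCHAR(200);
WHILE @size < 114
BEGIN
    SET @size = @size + 1;
    SET @sql = N'ALTER DATABASE FragDemo MODIFY FILE (NAME = FragDemo_data, SIZE = ' + CAST(@size AS NVARCHAR(10)) + N'MB);';
    EXEC (@sql);
END;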

OK, so this is a somewhat contrived simulation, but it is what happens to any database that has not been appropriately pre-sized and has been autogrowing over time; after countless autogrows (and probably a few shrinks thrown in for good measure) it will potentially be suffering from performance-impacting file fragmentation on systems using traditional spinning disks.

On media where the seek time is the same for all parts of the disk, like SSDs, the hit is probably not noticeable, but on spinning media, with enough file fragmentation, the impact can be noticed, though usually only with hindsight, after the fragmentation has been resolved.

What’s more, and here’s the kicker, if you’re not using third party file fragmentation reporting tools, you don’t know it’s there as the default Windows file fragmentation report will not return stats for files larger than 64MB.

Will it work for me?

The first thing I would do is check out the IO performance of the database(s) in question.

From personal experience, some of the things to look out for include files performing worse than others on the same partition or SAN back-end even though they are not responsible for most of the IO against that database and particular indexes regularly showing high internal fragmentation.

Check the output of contig.exe when aimed at those files. If you see hundreds of file fragments for an individual file on a partition utilising spinning disk technology, it’s definitely worth exploring.
Use the resolution technique most suitable to the SLAs and the situation; if you have the luxury of downtime, the files can probably be defragmented more quickly as there will be less IO contention. If contig.exe cannot be used to defragment the file you can try the approach I was forced to use.

Whatever you do, get a baseline of the latencies before you start so you can compare this to the corresponding figures after the file fragmentation has been removed. Don’t forget, things like reindex duration, DBCC CHECKDB duration and backup durations before and after resolving the file fragmentation can also provide a tangible indication of improvements (or the lack of).

I’ve finally got around to this, the third part of my SQL Server and Disk IO series of posts:

The sys.dm_io_virtual_file_stats DMV and fn_virtualfilestats function are great tools for extracting granular information about SQL Server IO performance, right down to the individual file in each database.

It gives details of the number of physical reads/writes, amount of data read/written and how much time was spent reading/writing that data. Traditionally, this has been important because IO access is the slowest part of fulfilling any database queries as RAM and CPU access takes nanoseconds or microseconds, whereas typically access to a spinning disk takes several milliseconds. In other words, disk access is thousands of times slower than CPU or RAM access, so improving disk access performance often brings about the biggest bang for the buck, in a manner of speaking.

Lately, with the uptake of SSDs increasing this is starting to become less of an issue, but if like me, some of the systems you administer are still using the ancient technology of spinning disks then getting familiar with this view will help you confirm any IO issues. Incidentally, the forerunner of this DMV was fn_virtualfilestats and goes back to (at least) SQL Server 2000 and still works in the latest SQL Server version (SQL Server 2012, as of writing this post) so this post concentrates on the fn_virtualfilestats function.

I won’t bore you with details of how to query the function as it is already well documented, and my own query which utilises it is below:

;WITH sum_table
AS
(
SELECT [dbid]
, [fileid]
, [numberreads]
, [iostallreadms]
, [numberwrites]
, [iostallwritems]
, [bytesread]
, [byteswritten]
, [iostallms]
, [numberreads] + [numberwrites] AS [total_io]
FROM ::fn_virtualfilestats(db_id(), null) vf
INNER JOIN sys.database_files df
ON vf.[fileid] = df.[file_id]
WHERE df.[type] <> 1
)
SELECT db_name([dbid]) AS db
, file_name([fileid]) AS [file]
, CASE [numberreads]
WHEN 0 THEN 0
ELSE CAST(ROUND([iostallreadms]/([numberreads] * 1.0), 2) AS NUMERIC(8,2)) END AS [read_latency_ms]
, CASE [numberwrites]
WHEN 0 THEN 0
ELSE CAST (ROUND([iostallwritems]/([numberwrites] * 1.0), 2) AS NUMERIC(8,2)) END AS [write_latency_ms]
, [numberreads] AS [#reads]
, [numberwrites] AS [#writes]
, [total_io]
, CAST (ROUND((([bytesread]/1073741824.0)/(SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver_start_time, GETUTCDATE()) / 60.0) from sys.dm_os_sys_info)),3) AS NUMERIC (8,2)) AS [gb_read_per_hr]
, CAST (ROUND((([byteswritten]/1073741824.0)/(SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver_start_time, GETUTCDATE()) / 60.0) from sys.dm_os_sys_info)),3) AS NUMERIC (8,2)) AS [gb_written_per_hr]
, [iostallms] AS [#stalled io]
, CAST (ROUND((([numberreads] * 1.0 ) /[total_io]) * 100,3) AS NUMERIC (6,3)) AS [file_read_pct]
, CAST (ROUND((([numberwrites] * 1.0 ) /[total_io]) * 100,3) AS NUMERIC (6,3)) AS [file_write_pct]
, CAST (ROUND((([numberreads] * 1.0) /(SELECT sum(total_io) FROM sum_table)) * 100,3) AS NUMERIC (6,3)) AS [tot_read_pct_ratio]
, CAST (ROUND(((numberwrites * 1.0) /(SELECT sum(total_io) FROM sum_table)) * 100,3) AS NUMERIC (6,3)) AS [tot_write_pct_ratio]
, CAST (ROUND((([total_io] * 1.0) /(SELECT sum(total_io) FROM sum_table)) * 100,3) AS NUMERIC (6,3)) AS [tot_io_pct_ratio]
, CAST (ROUND(([bytesread]/1073741824.0),2) AS NUMERIC (12,2)) AS [tot_gb read]
, CAST (ROUND(([byteswritten]/1073741824.0),2) AS NUMERIC (12,2)) AS [tot_gb_written]
, (SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver_start_time, GETUTCDATE()) / 1440.0) FROM sys.dm_os_sys_info) AS [uptime_days]
FROM sum_table
GROUP BY [dbid], [fileid], [total_io], [numberreads], [iostallreadms], [numberwrites], [iostallwritems], [bytesread], [byteswritten], [iostallms]
ORDER BY [total_io] DESC
GO

I’ve chosen to use the fn_virtualfilestats function instead of the sys.dm_io_virtual_file_stats DMV as it is present all the way back to SQL Server 2000. [Edit: I’m working on a SQL Server 2000-friendly version of this query at the moment, once I’ve dusted off a copy of SQL Server 2000 and Windows Server 2003 (this is one of the few moments I’ve regretted getting rid of all trace of SQL Server 2000!)]

The main difference between this query and others I’ve found on the blogosphere is that it excludes the log file (because on an OLTP system it will get hammered and generate a lot of IO anyway); once I’ve ruled the log file out as an issue (usually from wait stats and System Monitor (perfmon) data) I’m usually more interested in how busy my data files are and whether there are any hotspots. The query does this by aggregating the IO data and returning not only the percentage of IO that each file does, but the ratio of IO that each file does relative to the total IO against all data files within that database. Mike Lewis has tweaked the query so it scans all databases (see the first comment on this post below).

Automating the collection of this data provides a great way of baselining performance and spotting trends (hint hint). The uptime_days column shows how many days the server has been running, so you can work out the average daily read/write IO. [Edit: as Richard has pointed out in the comments, this is based on a DMV introduced in SQL Server 2008.]

If you want information regarding log file performance, simply comment out the where df.[type] <> 1 line

It’s been tested in battle and I’ve found it invaluable for highlighting high contention filegroups and IO subsystem issues.

A word of caution: before looking at the output and pointing the finger at the disks/SAN, it’s always worth digging behind the stats. If high latencies/reads are found, the issue can boil down to database design, which can be uncovered by investigating the reason for the high number of reads/writes on the relevant filegroup(s). For example, a lack of indexing on a highly read table will result in a lot of table scans, which can account for a substantial amount of IO on the file/filegroup holding that table; adding a well-chosen index or two can eliminate the scans, reducing the IO and (hopefully) the latency. If the IO goes down but the latency does not, you have more ammunition for looking more closely at the IO subsystem, provided that factors like, for example, anti-virus filter drivers have been investigated.

Recently, the output highlighted the slow performance of one particular file in an 18-file database, which was showing latencies of 89ms. After some digging around I found that this particular file was the primary file and (taking an educated guess) probably had default autogrowth settings when it was originally created, and had probably gone through some shrinkfile operations in a previous life, because this one 180GB file was spread over 4,818 physical fragments on the SAN.
Once this was addressed the latency went down to 25ms. Whilst 25ms is not brilliant in itself, it’s still a 70% improvement on its previous performance (and I know there’s room for further improvement, as the partition it’s on has a sub-optimal allocation unit size, for some strange reason). IMHO physical file fragmentation, even on SANs, is a big issue and I’ll be blogging about that particular issue in an upcoming post.

As I was saying, it pays to do a little digging around behind the numbers before jumping to the most obvious conclusion.

I know there are dozens of variations of queries out there that interrogate fn_virtualfilestats or sys.dm_io_virtual_file_stats, but the reason I came up with yet another variation was that, for me, the relativity was difficult to ascertain: on a system that’s been running for a while the IO numbers can be vast, and on a system with a lot of data files it can be difficult to spot the troublesome files/filegroups, hence the addition of percentages and percentage ratios. Another inspiration was the Performance Dashboard Reports (which don’t work on SQL Server 2008 R2 out of the box, but I hear they can be ported back to SQL Server 2008 R2 from SQL Server 2012, or you can try this fix), which do offer percentages for IO but also include the IO on the log files, which can make it difficult to get a clear picture of the IO that relates exclusively to the data files.

The main reason for this post was to highlight the presence and significance of this DMV and also to point out that it should not be used in isolation. Other methods for gathering performance related data, like perfmon and wait stats data are invaluable and all this information combined, together with corroborating evidence like e.g. slow IO warnings in the SQL Server error log will give a more rounded picture of IO performance.

Do you have tempdb on Fusion-io (or equivalent) technology? Is your reindexing job left on default settings (re-orgs at 5-15% and full rebuilds at @ 30%)? If so, then you might be interested in the findings in this post.

The benefits of flash memory as a replacement for spinning drives in order to boost SQL Server performance are well documented. However, I’m not sure if one particular benefit has been fully exploited. This post covers how we reduced the average duration of our overnight reindex job by 80% with the help of just one Fusion-io card (per node).

A bit of background: in order to give a creaking and very tempdb-heavy SQL Server 2008 R2 system a bit of a lift, we shifted tempdb onto Fusion-io cards. The system consists of a 2-node cluster running on a SAN. We did indeed see throughput increase, as well as an improvement in concurrency (as we switched on snapshot isolation at the same time), but some two months after the cards were installed another benefit came to light which has turned the perceived wisdom of reorganising versus rebuilding indexes on its head (disclaimer: for our system, running on our hardware, at any rate).

I’ve been DBA-ing for so long now that the good old 5%/30% general advice for reorganising as opposed to rebuilding an index, as documented in BOL, is so deeply ingrained that I’ve hardly given it a second thought and just followed it automatically, only tweaking it when performance analysis showed an issue with these thresholds. For more background on these thresholds I strongly advise reading Paul Randal’s post on where the guidelines came from.
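In practice the conventional thresholds translate into something like the check below (a sketch only, run against the current database in LIMITED mode to keep the scan cheap; the 1,000-page floor is my own arbitrary cut-off for ignoring tiny indexes):

SELECT OBJECT_NAME(ips.[object_id]) AS [table],
    i.[name] AS [index],
    ips.avg_fragmentation_in_percent,
    ips.page_count,
    CASE
        WHEN ips.avg_fragmentation_in_percent >= 30 THEN 'REBUILD'
        WHEN ips.avg_fragmentation_in_percent >= 5  THEN 'REORGANIZE'
        ELSE 'leave alone'
    END AS [conventional action]
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
INNER JOIN sys.indexes i
    ON i.[object_id] = ips.[object_id]
    AND i.index_id = ips.index_id
WHERE ips.index_id > 0          -- ignore heaps
    AND ips.page_count > 1000   -- ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;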

With tempdb on a much faster IO subsystem than the application database, the SORT_IN_TEMPDB option of the ALTER INDEX command suddenly comes into play. After installing the Fusions I tweaked our reindex job to turn the SORT_IN_TEMPDB option on, made a mental note to follow this up in a few days’ time, and promptly forgot about it. Several weeks later I wanted to track down indexes that were getting highly fragmented (we use Ola Hallengren’s excellent maintenance script, which logs this kind of information (and more) in a table) and noticed some indexes had huge variations in the time taken to reindex them, with a full rebuild taking a fraction of the time a reorg did, and that’s when the penny dropped.
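For a standalone index the tweak amounts to something like this (a minimal sketch with hypothetical table/index names; ONLINE = ON is optional and Enterprise-only, and in our case the equivalent behaviour is set via the maintenance script’s parameters rather than hand-written statements):

-- rebuild a single index, doing the sort work in tempdb (i.e. on the Fusion-io card)
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
REBUILD WITH (SORT_IN_TEMPDB = ON, ONLINE = ON);

-- or rebuild every index on the table in one go
ALTER INDEX ALL ON dbo.Orders
REBUILD WITH (SORT_IN_TEMPDB = ON);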

The Fusion-io boost

I promptly turned off reorgs, so the job now rebuilds at 10% fragmentation, with every index typically getting fully rebuilt in under 5 minutes (several of the indexes are between 10 and 14 GB in size). Needless to say, this has had a massive impact on the total reindex time. The reindex job on this system used to take anywhere between 4 and 9 hours, which was a real problem as it would often encroach well into the business day.
Since we switched off the reorgs the reindex has never taken more than 2 hours (and the longest run, at 114 minutes, included 45 minutes of blocking). The graph below shows the performance improvement very er…graphically:

Purdy, ain’t it? As we can see, February 16th was the first night the reindex job used the rebuild option in place of reorgs, and the drop-off in the run time could not be more dramatic. For the two weeks from 1st February to 14th February (inclusive) the average duration was 361 minutes, or roughly 6 hours. After the SORT_IN_TEMPDB tweak on the 16th, the average (from the 16th up until today, March 11th) has been just 78 minutes.

Incidentally, the Fusion-io cards in question are 785GB ioDrive2 Mono MLC PCIe Solid State Storage Cards, and yes, technically Microsoft don’t support this tempdb configuration on a pre-SQL Server 2012 cluster, but lots of people are doing it.

The benefits

For us, modifying the reindex job to drop the reorgs in favour of rebuilds on this particular system was a no-brainer. With the reindexing now consistently finishing every night in a fraction of the time it took previously, we have eliminated the uncertainty and inconvenience caused when the job collided with subsequent maintenance jobs or encroached into the system’s working day.

Furthermore, the database is in simple recovery, so we didn’t have to worry about transaction log backups (although it’s worth noting that a reorg can have a greater impact on transaction log backups than a rebuild).

This is either a benefit that’s so obvious no-one has bothered blogging about it, or it’s something that has gone under the radar thus far (I’m obviously hoping it’s the latter). I’ve seen posts hint at it by mentioning that index creation is faster, and I’ve even read an MS whitepaper analysing tempdb on Fusion-io on a SQL Server 2008 R2 system (which ran a reindexing test with the SORT_IN_TEMPDB option on, but left the reindexing thresholds on their defaults). As far as I can tell, though, no-one has gone the whole hog and adapted their approach to reindexing, dropping reorgs in favour of rebuilds, to fully exploit their investment in Fusion-io cards (or they just kept quiet about it, whereas I couldn’t help blabbing). I can think of a number of scenarios and previous systems where the improvement in reindex job performance alone would justify this investment.

Are reorgs redundant?

If you have your tempdb on Fusion-io or similar technology, with the application database(s) on a slower IO subsystem and the reindexing strategy left on its original settings, it’s an option well worth exploring.

The main thing to look out for is the bandwidth from host to SAN: on this system we now briefly get close to maxing it out during the reindex, as the job reads in the index pages and writes out the rebuilt index. Personally, I’d rather fully exploit our infrastructure than have it burbling along at a fraction of its capacity, which is what it used to be like when most of the indexes were being reorganised.

Another consideration is the latency and throughput difference between the storage subsystem the application dbs sit on and the Fusion-io cards. Against spinning-disk SAN-based systems there’s no competition; the Fusion-ios will outperform the SAN in every category. But if you already have e.g. SSDs in the SAN the difference may not be that substantial, so the switch may not bring about such a dramatic boost. On this particular system, in our environment, there was a huge disparity between SAN performance and the Fusions, which is what made the switch so beneficial for us.

If alternative HA strategies like log shipping have been implemented, the impact will have to be measured, as the costs may outweigh the benefits (thanks Paul!).

A final thing to bear in mind: if you are fortunate enough to have some older/smaller Fusions lying around (perhaps you’ve had to swap them out as the application dbs have grown), this is a great way of recycling those cards on other systems and giving them a new lease of life.

So are reorgs redundant? If you have access to this kind of technology, it’s easy enough to prove on your system.

In summary

“It depends” and “your mileage may vary” have become well-worn clichés, but they are so true. Every system is different and there are a number of factors to consider, as (I hope) this post outlines, but the potential benefits warrant exploring this if you have not already done so.

For things like automated SQL Server build scripts (and probably many other reasons), finding the default data and log file paths is essential.

Prior to SQL Server 2012 there were all sorts of options, ranging from running sp_helpfile against the master database (which is not foolproof) to constructing a string to establish the right registry key to search (which can be a pain for named instances) and then running an undocumented extended stored procedure called xp_instance_regread to get this information.
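For reference, the old registry-based approach looked something like the sketch below (hedged: xp_instance_regread is undocumented, and the DefaultData/DefaultLog values only exist once the defaults have been explicitly set, so a fallback to the location of the master database files is often needed):

DECLARE @DefaultData nvarchar(512), @DefaultLog nvarchar(512);

EXEC master.dbo.xp_instance_regread
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'DefaultData',
    @DefaultData OUTPUT;

EXEC master.dbo.xp_instance_regread
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'DefaultLog',
    @DefaultLog OUTPUT;

SELECT @DefaultData AS [Default_data_path], @DefaultLog AS [Default_log_path];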

Fortunately, this has all now changed. Gone are the calls to undocumented extended procs and other home-brewed solutions, because now (for those of us working on SQL Server 2012 onwards) we can just call a straight T-SQL function:

SELECT
    SERVERPROPERTY('INSTANCEDEFAULTDATAPATH') as [Default_data_path],
    SERVERPROPERTY('INSTANCEDEFAULTLOGPATH') as [Default_log_path];

I found this entirely by chance whilst tracing SSMS in order to update a server build script.

The SERVERPROPERTY function has been around for a while and provides really useful information about a SQL Server instance, but for some reason the SQL Server 2012 documentation has not been updated to reflect these two new additions to the function, which is a bit of a shame. So I’ve decided to plug them in this post, as well as submitting a Connect bug, as I’m hoping this is just an oversight and they haven’t been deliberately left undocumented.

Incidentally, if you’re wondering how to change the default data path or the default log path, connect to the server in SSMS, right-click on the server in Object Explorer and select Properties. In the Server Properties dialog box, select Database Settings in the object list and make the necessary modifications under Database default locations.
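If you’d rather script the change than click through SSMS, the usual approach is xp_instance_regwrite (again undocumented, so treat this as a sketch; the drive paths shown are placeholders for your own locations, and a service restart may be required before new databases pick up the change):

-- set new default data and log locations (paths below are placeholders)
EXEC master.dbo.xp_instance_regwrite
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'DefaultData', REG_SZ, N'D:\SQLData';

EXEC master.dbo.xp_instance_regwrite
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'DefaultLog', REG_SZ, N'E:\SQLLogs';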

(EDIT: I did get a response the same day I logged the Connect bug and published this post, but it got short shrift and was closed as a “Won’t fix”, which I cannot understand, so if more people vote on it maybe they’ll change their mind…)