I was waiting to implement a change the other day and found the DBCC job was still running. As this was a production system I was not too keen on cancelling the DBCC check. I thought about restarting it after I had completed my change but, although the DBCC had been running for nearly an hour, I had no idea how long it would take to complete. Then I remembered that the DBCC command is one of those commands that populates the percent_complete column in the sys.dm_exec_requests DMV, so I modified my backup/restore progress query to give me an ETA for when the DBCC would complete:

SELECT r.session_id,
       CONVERT(NVARCHAR(22), DB_NAME(r.database_id)) AS [database],
       r.command AS [dbcc command],
       s.last_request_start_time AS [started],
       DATEADD(mi, r.estimated_completion_time/60000, GETDATE()) AS [finishing],
       DATEDIFF(mi, s.last_request_start_time,
                DATEADD(mi, r.estimated_completion_time/60000, GETDATE())) - r.wait_time/60000 AS [mins left],
       DATEDIFF(mi, s.last_request_start_time,
                DATEADD(mi, r.estimated_completion_time/60000, GETDATE())) AS [total wait mins (est)],
       CONVERT(VARCHAR(5), CAST(r.percent_complete AS DECIMAL(4,1))) AS [% complete],
       GETDATE() AS [current time]
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s
        ON r.[session_id] = s.[session_id]
WHERE UPPER(r.[command]) LIKE 'DBCC %'

A sample output from a test database called er…test is shown below:

DBCC completion check output

This is hardly a new topic and, whilst there are a lot of articles out there instructing us on how to stripe tempdb and so on, I feel there’s a bit of a gap on exactly how to identify a problem with tempdb in the first place, and particularly on finding out how much of your overall IO tempdb is responsible for.

I cover that in this post but it is also covered more extensively in my SQL Server and Disk IO series of posts, as the troubleshooting and monitoring principles covered in them are just as valid for tempdb performance monitoring as they are for application databases.

Also, as this is the concluding post in that series I could hardly avoid writing about it!

If an application database is poorly configured in terms of disk layout then generally only that database will be affected, e.g. by queries that generate a lot of disk IO performing poorly. If tempdb is poorly configured in terms of disk layout the whole SQL Server instance will suffer performance problems, for the simple reason that tempdb is effectively a workhorse database that does a lot of work in the background.

Some of the background work tempdb does is obvious, some of it is not so obvious:

  • temporary tables (explicitly created in queries/procs etc)
  • queries that sort (order) data (temporary tables will be created under the covers to sort the data)
  • worktable spools (when a query cannot be processed entirely in memory)
  • Service Broker
  • app databases utilising Snapshot Isolation (to hold versioned data)
  • linked server queries
  • SQL Server internal processes

This is by no means a complete list, and, in all likelihood, as features are added to SQL Server the list will grow rather than shrink.

One way of thinking about tempdb is that it’s like an OS swap file. When the OS runs out of memory the swap file becomes the default dumping ground and performance slows. In the same way that a poorly sized and configured swap file can cripple a server, a poorly configured tempdb will cripple SQL Server.

The most common tempdb issue that every DBA will hit sooner or later is space usage: a runaway query maxes out tempdb and it runs out of space. Those kinds of issues are pretty easy to solve and well documented all over the web; check out the links at the end of the post.

This post is going to focus more on the performance issues, and they can usually be pinned down to the physical database design. If you’ve been reading the previous posts in this SQL Server and Disk IO series you’ll know how critical it is to size a database well, split it up into multiple files and place it on performant disks.

Unsurprisingly, the same principles apply to tempdb. But from my time working at Microsoft it never ceased to amaze me how many clients went to a lot of trouble with the physical design of their application databases but paid no regard to applying the same principles to tempdb, so it would frequently be left on default settings under C:\Program Files\Microsoft SQL Server\ and so on.

But I don’t know how much IO my tempdb is doing

Tempdb performance and its IO load can be easily checked via the Microsoft SQL Server 2012 Performance Dashboard Reports as well as my posts on monitoring SQL Server IO.

Once the reports are installed, open up the main dashboard report and from there click on the Historical IO link. This takes you to the Historical IO Report, et voila! You can now see exactly how much IO your tempdb does in terms of reads, writes, the time taken to read and write and, crucially, the percentage of all IO that each database on that SQL Server instance is responsible for:
If tempdb is taking a significant percentage of that IO it makes sense to make sure it’s optimised as well as it can be.
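If you cannot install the dashboard reports, a similar per-database breakdown can be pulled straight from the DMVs. Below is a minimal sketch of my own (not part of the reports) using sys.dm_io_virtual_file_stats; the figures are cumulative since the last SQL Server restart:

SELECT DB_NAME(vfs.database_id) AS [database],
       SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written) AS total_io_bytes,
       CAST(100.0 * SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written)
            / NULLIF(SUM(SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written)) OVER (), 0)
            AS DECIMAL(5,2)) AS [% of instance IO]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
GROUP BY vfs.database_id
ORDER BY total_io_bytes DESC;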

For older versions of SQL Server check out this Codeplex project for dashboard reports that run from SQL Server 2005.

Sizes and stripes and trace flags

We know how much IO tempdb is doing and we want to apply the same tuning principles we would apply to any other database, starting with striping it across multiple files. There’s a lot of material out there on why this is ‘a good thing’ so I’m not going to regurgitate why here.

The first thing to check is that tempdb is not on default settings and in the default location (unless that default location has super-low latency and can handle a lot of IOPs).

Out of the box tempdb (up until at least SQL Server 2012) will be sized with one 8 MB data file and one 512 KB log file. If your SQL Server installation consists of one application database containing one table with one row then you’re good to go. The chances are your SQL Server installation will be a little more substantial than that so tempdb needs to be sized up.

In SSMS right-click on tempdb and select Properties and click on Files under the ‘Select a page’ section of the ‘Database Properties – tempdb’ dialog box:

This will list tempdb’s files, their locations and their initial sizes. If there’s only one data file and one log file at the default sizes (under Initial Size (MB)) we have work to do. Note: the SSMS GUI will round up the default log size of 512 KB to 1 MB. To prove this, run

sp_helpdb tempdb

in a new query window and review the output for the tempdb log file; it should show the size as 512 KB.
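Alternatively, a quick query against sys.master_files shows the same file sizes without the GUI rounding; a minimal sketch:

SELECT name, physical_name,
       size * 8 AS size_kb,                          -- size is stored in 8 KB pages
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS VARCHAR(10)) + '%'
            ELSE CAST(growth * 8 AS VARCHAR(10)) + ' KB'
       END AS growth
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');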

Adding stripes is simply a question of clicking the Add button here and setting the appropriate size, autogrow, path and filename settings.
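The same can be scripted with ALTER DATABASE if you prefer. This is only a sketch: the logical name tempdev2, the T:\TempDB path and the sizes below are illustrative, not recommendations, so substitute your own layout:

ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 8192MB, FILEGROWTH = 512MB);    -- resize the existing data file

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\TempDB\tempdev2.ndf',                      -- illustrative path
          SIZE = 8192MB, FILEGROWTH = 512MB);                       -- keep all stripes the same size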

You’ll need to provision some fast storage for the additional tempdb stripes first. Ideally, use flash, SSDs or fast local disks. If you’re at the mercy of a SAN administrator, request something akin to RAID 0: you don’t need to worry about mirroring/parity because tempdb is rebuilt on SQL Server startup. Obviously, if there’s a recurring issue it means there’s a disk problem, so that will still need addressing, but from a DBA perspective there are no recovery options for tempdb other than restarting SQL Server.

To work out how many stripes you need, keep an eye on the guidance Microsoft publish, as the advice changes over time. Anything I put down now will probably be out of date in a couple of years. I start at one quarter of the number of cores on systems with up to 64 cores, and (personally) have never found the need to increase it beyond half the number of cores on systems with more than 32 cores.

On high throughput systems that are very tempdb heavy you may need to up this.

The SQLCAT team (or whatever they are called this month) recommend one stripe per core but these were extreme-scale systems so IMHO not practical for the vast majority of systems out there.

Establishing whether you need more stripes is not that difficult, as a shortage will manifest itself as something called PFS page contention.
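A quick way to spot it is to look for PAGELATCH waits on tempdb allocation pages. A minimal sketch (database_id 2 is tempdb; page 1 of each data file, and every 8,088th page after it, is a PFS page, page 2 is the GAM and page 3 the SGAM):

SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
  AND resource_description LIKE '2:%';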

Add stripes and enable trace flag 1118 as described in the above article and many others. There’s no downside to it and it’s a de facto step I take on all the SQL Server installations I’m involved with. Check out this Microsoft KB article if you’re still not sure.
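For reference, one way to turn the flag on for the running instance is shown below; to make it persist across restarts add -T1118 as a startup parameter via SQL Server Configuration Manager:

DBCC TRACEON (1118, -1);    -- -1 applies the flag globally
DBCC TRACESTATUS (1118);    -- confirm it is enabled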

The elephant in the room is the size of the stripes, and therefore the size of tempdb. There is no right answer for the size of tempdb: it depends on the load. The load is not necessarily how busy the system is right now, or even how busy it is at peak, it’s how big tempdb needs to be when the most tempdb-intensive loads are placed on it, e.g. overnight ETL jobs. If those jobs grow tempdb to 500GB at 3am, whereas for the other 23 hours of every day tempdb never holds more than 50GB of data, then tempdb needs to be sized at 500GB. Otherwise the next run of that overnight job will be crippled by autogrows as tempdb incrementally grows back to 500GB.

The same sizing principle applies to the transaction log file size; make it big enough to cope with the demands of the most intensive jobs to avoid autogrows.

There’s nothing to stop you looking at the tasks that cause excessive tempdb usage (see the ‘Useful Links’ section at the end of this post), as there may be ways to optimise those tasks and reduce their tempdb usage. But sometimes we just have to accept that some procedures or jobs will be heavy tempdb users, so we size our stripes so that they add up to the overall tempdb size we need, plus a reasonable overhead, both within tempdb itself and as free space on the underlying volume the stripes reside on.

This leads inevitably onto the ‘how many volumes?’ question. The simple answer is to split the data files equally across multiple volumes if you are on traditional spinning disks. If you’re fortunate enough to have flash storage, which can handle anywhere from 300 to 500K IOPs, one volume should suffice for most installations; if it’s a high-throughput, latency-sensitive system, stripe across multiple flash devices. Besides, if tempdb is placed on a SAN (remember, tempdb on local disks for a failover clustered instance of SQL Server is supported from SQL Server 2012) those multiple volumes you have requested might be placed on overlapping spindles anyway, or shared with other heavy-usage systems, unless you happen to have a SAN admin who really knows their stuff (I think you’re more likely to stumble across a unicorn) or the luxury of a dedicated SAN.

For failover cluster instances of SQL Server don’t forget to ensure that all the volumes and paths used for striping tempdb are present on all the nodes in the cluster, otherwise tempdb creation will fail and SQL Server will not come online on that node.

SQL Server will have to be restarted for any changes to the striping to take effect.

If SQL Server fails to start after these changes it’s almost always down to a typo in a path, and that usually means having to force SQL Server to start with certain command line startup options, as documented in this KB article. The article covers recovering a suspect tempdb, but it shows what you need to do to restart SQL Server and bypass the creation of tempdb, in order to get into a position where the tempdb paths can be corrected.

An aside: sometimes it really is worth finding out what’s using tempdb

On a system I used to manage we investigated tempdb usage as tempdb was generating a lot of slow IO warnings in the error log and the system was really struggling. The previously mentioned dashboard report revealed that 60% of all IO on the system was being generated by tempdb.
By checking the sys.dm_exec_requests DMV via some home-brewed stored procs (which I might one day blog about as they’ve proven very useful to me), and with the help of Adam Machanic’s excellent sp_whoisactive proc, we isolated the issue to one stored procedure that was doing some string manipulation and being called about 20 times a second. Not only that, but it was explicitly creating and dropping a temp table every time it was called.
The explicit temp table creation and deletion was removed first, and the string manipulation was eventually moved into a CLR function. Tempdb was then responsible for @ 10% of the overall IO of the whole system. Not a difficult change, and a big win for overall system performance and throughput.
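For what it’s worth, that kind of home-brewed check can start off as simple as the sketch below (an illustration, not the actual proc), which shows which sessions are currently holding the most tempdb space:

SELECT session_id,
       (user_objects_alloc_page_count - user_objects_dealloc_page_count) * 8 / 1024 AS user_objects_mb,
       (internal_objects_alloc_page_count - internal_objects_dealloc_page_count) * 8 / 1024 AS internal_objects_mb
FROM sys.dm_db_session_space_usage    -- this DMV only tracks tempdb allocations
ORDER BY (user_objects_alloc_page_count - user_objects_dealloc_page_count)
       + (internal_objects_alloc_page_count - internal_objects_dealloc_page_count) DESC;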

So to summarise…

  • use the dashboard reports or the blog posts summarised in my SQL Server and Disk IO series of posts to get an idea of how tempdb is performing
  • modify the tempdb configuration, if required, by carefully creating stripes with appropriate sizes and growth settings (don’t forget to test on every node if SQL Server is clustered)
  • check the dashboard reports or DMVs and perfmon stats to get objective metrics on any improvements
  • don’t forget to regularly review these metrics to ensure tempdb is coping

Useful links

Troubleshooting Insufficient Disk Space in tempdb

SQL Server (2005 and 2008) Trace Flag 1118 (-T1118) Usage

Correctly adding data files to tempdb

At a client site recently, a SQL Server 2012 (Enterprise, SP1 + CU#9) server with 512GB RAM and 80 cores (40 cores assigned to SQL Server) that had been patched and restarted displayed the following symptoms after an index reorg operation was restarted: CPU saturation, with all the cores available to SQL Server (but not all cores on the box) almost totally maxed out, and the perfmon data showing an average of 96% CPU usage for those CPUs.

The server was all but inaccessible, with users reporting trouble connecting, for between 5 and 10 minutes, after which point CPU usage settled back down and normal performance resumed.

Cancelling the index reorg appeared to have no effect on the duration of the issue.

Shortly after CPU usage returned to normal I found the following messages in the error log:

FlushCache: cleaned up 70316 bufs with 51811 writes in 197020 ms (avoided 17940 new dirty bufs) for db 5:0
average throughput: 2.79 MB/sec, I/O saturation: 481, context switches 1041
last target outstanding: 38800, avgWriteLatency 0
FlushCache: cleaned up 36001 bufs with 26512 writes in 94525 ms (avoided 23060 new dirty bufs) for db 5:0
average throughput: 2.98 MB/sec, I/O saturation: 983, context switches 1107
last target outstanding: 38800, avgWriteLatency 0

These messages are logged (automatically in SQL Server 2012, otherwise via trace flag 2504, see Further Reading link at end of post) when SQL Server is running a checkpoint but the checkpointing process took longer than the configured recovery interval, which by default is one minute.
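As an aside, you can confirm what the recovery interval is set to with sp_configure (it is an advanced option); 0 is the default and equates to a target of roughly one minute:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval (min)';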

The first FlushCache message shows 70316 buffer pages: 70316 x 8K = 549.34MB; divide that by the 197020 ms duration reported and you get the 2.79 MB/sec throughput rate.

The server was in an Availability Group, so I failed over and tried again; same problem. I failed back and tried a third time and the problem recurred a third time. This (as of the time of writing) is the latest publicly available SQL Server 2012 build and I was getting a bit worried, so I posted the question on the #sqlhelp alias and got the kind of responses I would probably have posted myself: on the face of it, this indicates an IO bottleneck. I wasn’t buying it, though, as the database was located on flash storage and had a huge buffer pool (max server memory set to 448 GB) available to it.

I looked at the perfmon data I had from that period and drew out the ‘Disk Transfers/sec’ counters alongside ‘% Processor Time’ for sqlservr.exe, and sure enough it showed an apparent disk stall coinciding with the CPU spike, with the average disk transfers dropping from @3,400 to < 100 during this period. This appeared to ‘prove’ an IO stall. However, on this server we have disks running on a SAN as well as local flash storage running off the PCI bus, and all the disks showed this apparent stall; this is what made me a little suspicious.

Sure, with all CPUs maxed, throughput everywhere is going to crash, but if this was a genuine IO stall I would expect the disks on the specific subsystem this database was using to suffer, not a blanket stall across all disks (and not all available CPUs to be maxed out, for that matter).

As the checkpoint process flushes dirty pages from memory, I homed in on SQL Server’s memory counters, namely ‘Target Server Memory (KB)’ and ‘Total Server Memory (KB)’, and then things started to fall into place.
Shortly after the index reorg operations were triggered, we see a sharp rise in the ‘Total Server Memory (KB)’ counter, so SQL Server appears to be allocating more memory.
However, at the same time CPU usage maxes out, and the CPUs stay saturated until ‘Total Server Memory (KB)’ hits ‘Target Server Memory (KB)’, at which point CPU usage descends very sharply back to normal and we see a sharp drop-off in ‘Total Server Memory (KB)’ (to a point where it’s significantly lower than when the problem started). That in itself does not make sense, as the buffer pool had not reached its max server memory limit, but it confirms the flushing behaviour.

 

Figure 1 shows the perfmon data, which captured two occurrences of the problem. The blue line is the ‘Target Server Memory (KB)’ counter, which is at the buffer pool limit of 448 GB, red is the SQL Server ‘% Processor Time’ counter and green is the ‘Total Server Memory (KB)’ counter.

They show the issues, which started @ 12:15 and 15:15. The CPU drop-off in the middle is from an Availability Group failover/failback, during which CU#9 for SQL Server 2012 SP1 was applied as it fixed an issue related to reindexing.

Call me Sherlock, but on the face of it, it looks as if the index reorg triggers a race condition within the database engine where it is both allocating memory into the buffer pool and trying to free memory from the pool, and these two operations don’t play nicely with each other.
The operation increasing the buffer pool size wins eventually, after getting to its maximum configured size and maxing out the CPUs on the way, after which the operation clearing dirty pages completes and dumps out those FlushCache messages that implicate the IO subsystem (falsely, IMHO), and we see a corresponding freeing of pages in the perfmon graph and the resumption of normal CPU usage.
An index reorg is something we should be able to run whenever we like; it should not trigger this behaviour.

Setting max server memory equal to min server memory may be a workaround here, as so many people on Twitter mentioned, but I don’t want to prolong how long SQL Server takes to initialise, so I won’t be doing that, although I have bumped up the min server memory value.
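If you did want to try that workaround, it is just a couple of sp_configure calls. The value below is purely illustrative (it assumes the 448 GB max server memory mentioned above and is not a recommendation):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'min server memory (MB)', 458752;   -- illustrative: 448 GB, i.e. equal to max server memory
EXEC sp_configure 'max server memory (MB)', 458752;
RECONFIGURE;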

If you’ve seen this message on your server, maybe this will help you get to the bottom of your issue (assuming you have definitely ruled out IO issues).

Further reading:

Bob Dorr’s explanation about these FlushCache messages.

Finally got some breathing space for my first post of the year!

In order to get a major client through Christmas, which is traditionally their busiest period of the year (when load normally goes up five-fold), we embarked on a major platform refresh where the back-end SQL Server DBMS was upgraded from SQL Server 2008 R2 to SQL Server 2012 and the hardware was upgraded to 2 x Dell PowerEdge R910s with 4 x 10 (physical) cores and 512GB of RAM in each server.

In order to take the disk I/O bottleneck out of the equation we decided to do away with traditional spinning disk technology, so the SAN was kicked out (sob) and a shiny pair of 2.4TB Fusion-io Duos were added to the server spec sheet.

This would allow us to use AlwaysOn Availability Groups in SQL Server 2012, which have the major upside of allowing the secondaries to be readable instead of being completely offline, which, of course, is why having Fusion-io drives in both servers really comes into its own.

But I digress. This post isn’t about boasting about new kit or Fusion-io’s hardware; rather, it’s a warning about choosing the right PCIe slot for Fusion-io cards. Although the R910s have plenty of PCIe slots, they don’t all run at the same speed, and our main concerns when adding an older 785 GB ioDrive2 Fusion-io card we were recycling from the old kit were that the slot had enough power and that there was room for airflow between the shiny new 2.4TB cards already installed and the cards we were adding from the older kit.

When the servers were powered up there was nothing untoward showing up in the event logs, no bells or sirens, and the card was initialised and formatted and could have been used without issue. It was only when the fio-status command was run as a final check (with the -a option) that any mention was made that all was not as it should be, as the screenshot below shows:

And there you have it, a discreet little “The bandwidth of the PCIe slot is not optimal for the ioMemory.” warning for the newly inserted ioDrive2 card.

We quickly realised that the warning was due to the fact that the newly inserted card was in an x4 slot, and as soon as we moved it to an x8 slot the warning disappeared and we were good to go. To be fair, the Fusion-io docs do mention an x8 slot in passing, but it’s easily missed as more focus is placed on the height of the slots. If you do have a server with slots of different speeds make sure you pick the right one, as there’s no ‘loud’ indication that you’re not getting the best out of the card, and the x4 and x8 slots look quite similar, so check your server motherboard docs in advance.

I’m not sure what the performance penalty would have been, but with one of the cards using only half the potential bandwidth I didn’t want to find out.

So, if you’re migrating to meaty new servers, be aware of the proliferation of PCIe slots of different speeds and make sure you pick the right one, especially when dealing with multiple Fusion-io cards; and if you’re considering new hardware, you might want to add a check to ensure there are enough slots of the correct speed for your needs.

Everything you need to know about PCI Express

Carrying on my series of posts on SQL Server and Disk IO, it’s time to cover the old stalwart that is perfmon (referred to in Windows as Performance Monitor). I know anyone who has delved into a Windows performance issue will have some familiarity with it, so I’m not going to cover what it is or how to use it; one word: Google.

The principal reason for this post is to add my own experiences of using perfmon to track down IO issues.

Starting perfmon

I’m assuming Windows 7 here, so adapt these instructions according to your flavour of Windows:

Go to Start > Run, type perfmon and hit Enter to launch it. This will start perfmon (Performance Monitor) under the Performance > Monitoring Tools > Performance Monitor branch, which is exactly where we need to be.

This is where we create the template trace to monitor SQL Server’s IO performance as far as Windows is concerned. The mechanics of creating and managing perfmon templates and their output is well documented all over the wwitblo (world-wide-inter-twitter-blogosphere), so straight to the important counters.

Press Ctrl+Shift+N to bring up the Add Counters dialog. Scroll down to and expand the Physical Disk section. The counters we need are:

  • Current Disk Queue Length
  • Disk Read Bytes/sec
  • Disk Write Bytes/sec
  • Disk Bytes/sec
  • Disk Reads/sec
  • Disk Transfers/sec
  • Disk Writes/sec
  • Avg. Disk sec/Read
  • Avg. Disk sec/Write
  • Split IO/Sec

An explanation of what these counters reveal is available if you click on Show Description, but I’m going to add my own 2c, otherwise there ain’t much point in me writing this post.

Current Disk Queue Length

This will tell you how many pending IO requests are queued up against that disk, and it is a sure sign of IO pressure on local/directly attached disks. If your disks are on a SAN or on flash storage these numbers cannot be relied on directly, but if you see high queue lengths when IO-heavy queries/jobs are being run, and high numbers being reported by some of the other disk IO related counters, then this should pretty much confirm an IO subsystem issue (i.e. don’t be too quick to dismiss this counter). Furthermore, as this counter shows the actual queue length at that point in time, and not an averaged-out figure, I find it more reliable, provided the sample interval is low enough.

Disk Read Bytes/sec & Disk Write Bytes/sec & Disk Bytes/sec

These counters show the disk read and write throughput. In the event of an IO stall, Disk Read Bytes/sec (and usually the corresponding Disk Write Bytes/sec counter) will show a big drop-off in throughput. Disk Bytes/sec is merely the sum of the read and write bytes per second, but I think seeing them independently is more useful, as it can show up specific disks as e.g. having fantastic read speeds but poor write speeds.

Disk Reads/sec, Disk Writes/sec & Disk Transfers/sec

These show what they say. I’ve seen some documentation on the wwitblo (from a major SQL monitoring software vendor, no less) stating that these counters can be ignored as (and I’m now paraphrasing here) SANs can be quite complicated and do lots of funky things under the covers, so don’t worry your pretty little head about this counter too much.

I’m inclined to disagree.

These figures are damned useful. Firstly, they’ll show you if drives are being over-utilised by reads or writes; if you’ve got a db that’s e.g. part reporting, part OLTP and you suspect a particular filegroup is being heavily utilised by some reports, this should show up as heavy read traffic on the partition that hosts the filegroup those reports are hitting, which allows you to look at ways of tweaking performance by e.g. adding files or moving the filegroup to faster disks.
But these counters are even more useful than that. I’ve used them to prove a SAN was no longer up to the job of running the db it was hosting, because the Total for Disk Transfers/sec was regularly hitting the ceiling of the maximum IOPs the SAN in question could support, and this was manifesting itself in all the classic ways: slow reindexes, slow DBCCs, IO stalls in the SQL error log, slow backups, replication latencies, as well as general db sluggishness as all transactions have to be hardened to the transaction log (don’t forget to get confirmation from wait stats and DMVs as well).

So, get confirmation of how many IOPs your IO subsystem can support and then use this counter to confirm an IO bottleneck related to IOP throughput, or use it to proactively monitor your storage subsystem so you get early warning of a pending bottleneck. And yes, within a major enterprise there will be multiple SQL Server instances (or other DBMS flavours) plugged into the same SAN, but again, and with more effort, the total IO of these instances can be added up to get a rough idea of how many IOPs the SAN is having to contend with.
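As a cross-check from the SQL Server side, the IO-related wait stats will usually corroborate what the Disk Transfers/sec counters are telling you. A minimal sketch (figures are cumulative since the last restart or stats clear):

SELECT wait_type, waiting_tasks_count, wait_time_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('PAGEIOLATCH_SH', 'PAGEIOLATCH_EX', 'WRITELOG', 'IO_COMPLETION')
ORDER BY wait_time_ms DESC;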

Avg. Disk sec/Read & Avg. Disk sec/Write

These show the average amount of time it took to complete the reads or writes occurring at that time. The average depends greatly on how frequently the data is being sampled and on the number of IOs occurring at the time, so a slight increase in the average can mask a significant slowdown. Keep an eye on the peaks as well as the average, which you want as low as possible; ideally less than 10ms if your disks are traditional spinning disks.

Split IO/Sec

If your partition alignment is not correct, this will show the additional IOs that were required to complete a read or write. This was prevalent on Windows Server 2003 and earlier and can be a significant drag on throughput.

With these counters selected we need to select the disks we want to track those counters against under Instances of selected object:

 

Figure 1: Physical Disk counters

 

If the database drives are identified by drive letter then that’s simple enough to do, but if mount points are in use then you’ll have to go by the drive number, and you’ll find the drive number (or LUN) via the Disk Administrator tool. Once you’ve identified the LUNs for the database drive(s)/mount point(s) of interest, select them and click OK.

Instead of incorporating the disk counters into an existing template I prefer to use a separate perfmon template just to track IO performance, for reasons I’ll explain further down.

Physical Disk counters vs Logical Disk counters

The difference between the Physical Disk and Logical Disk counters is basically that Logical Disk shows the aggregated view of a drive where, for example, drive D: is spread over multiple physical disks (e.g. 4 (D:), 7 (D:), 9 (D:)), or where one physical disk is split into more than one partition, like C: and D: both sitting on disk 0. Physical Disk shows the raw underlying drive or mount point, and I always use that, although MS would recommend looking at Logical Disk first and then drilling down.

A separate template to monitor disk counters

The reason I prefer to put the disk counters in their own perfmon template is down to the sample rate. The default perfmon sample rate is 15 seconds and if you stick with that sample rate all the disk counters with the Avg abbreviation will be averaged over that 15 second sample interval.

Why is this a big deal?

Well, it depends (of course!).

Are you trying to track down an IO stall that lasts for 5-6 seconds or are you trying to track down slow IO that lasts for a couple of hours?

If you’re tracking down a 5-6 second IO stall you’re not going to have much luck sampling every 15 seconds as any significant issues lasting 5 or 6 seconds may get almost completely smoothed out when they are averaged over 15 seconds.

For a longer lasting issue you may be able to go for a longer sample interval, but from personal experience I’ve found short sample intervals to be much more revealing, especially when tracking SAN performance as it can be so spikey.

By way of example, a few years back I spent a lot of time tracking down an infrequently occurring IO stall on a SAN for a critical low-latency trading platform.

The IO stalls would typically last @ 5 seconds and sometimes just a couple of seconds and I could never trap them until I got the perfmon sample interval down to 1 second. That’s how low I had to go; and we did find them and prove it was a SAN issue, although we had to enforce the co-operation of the SAN team at the proverbial gunpoint. Good luck with that when the time comes.

The performance impact of this low sample interval was negligible, although the perfmon trace files inevitably got rather large rather quickly and a close eye had to be kept on them.

Finally, if this is a SAN that is being tracked then the SAN admins are going to have their own ‘monitoring’ which may not show a problem.

From personal experience this will be because the SAN ‘monitoring’ will be at a very different sample rate – usually in the region of minutes, not seconds. Therefore, to accurately pick up IO issues you’ll have to persuade your SAN admins to use the same sample interval perfmon is using, otherwise we’re comparing apples to pears; your perfmon data will show huge spikes or dips whilst the SAN ‘tool’ will show a wonderful flat line as it averages everything out over a massive sample interval.

I’ve actually had the head of the SAN storage team argue with me that the 30 minute sample interval they used to monitor their SAN would pick up a 2 second IO stall. I kid you not.

Viewing the data

Some quick notes on reviewing the data gathered in the perfmon file(s). I’m assuming a perfmon file has been loaded into perfmon ready for viewing.

  • Right-click on the graph area in perfmon, go to Properties, click on the Graph tab and set a sensible upper limit for the vertical scale. The default is 100, and if you’re looking at e.g. Disk Read Bytes/sec you might want to up that into the thousands. Select the vertical grid tick box as well to make it easier to read off the numbers.
  • When viewing the data, press Ctrl+H to highlight the selected counter in the display and make that counter’s data stand out. It makes it much easier to isolate values when stepping through multiple counters.
  • When looking at a perfmon file which covers a wide time range, it can be easier to narrow down that time range by right-clicking the graph area, selecting Properties, clicking on the Source tab and adjusting the Time Range scroll bar. Alternatively, whilst looking at the graph, click on the start time you’re interested in and drag your mouse across until the time frame you’re interested in is highlighted, then right-click and select Zoom To.
  • Watch out for scaling. If you’re comparing multiple values, select them, then right-click on any of the selected counters and select Scale Selected Counters, as it can be very easy to misread a counter if the scaling chosen for it either exaggerates or understates its true value.

Working through an example

Recently I spotted the following slow IO warning on a production server:

2013-11-04 03:07:00.130 spid5s       SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [XXXXX] in database [YYYYY] (n).  The OS file handle is 0x0000000000000D74.  The offset of the latest long I/O is: 0x00002449860000

I loaded up a perfmon template from that period:

 

Figure 2: IO latency capture in a perfmon trace

A 14 ms average doesn’t look too bad, does it? That 10.144 second max can be ignored (fortunately!) as it coincided with scheduled work on the SAN at that time. Zooming into the 03:07 time period in question paints a slightly different picture for the IO numbers:

Figure 3: Perfmon IO latencies

 

We can see that in the couple of minutes leading up to the slow IO warning in the SQL Server error log, both read and write latencies were spiking up.

I’ve left Avg. Disk sec/Read highlighted and we can see the average IO time has jumped from 14ms to 50ms, which is well into poor performance territory. I’ve also added a third counter (Disk Transfers/sec) so we can see how much IO was going on (just against that partition) at the time; for that period there were 3,016 disk transfers per second, with a peak read latency of 0.8 seconds and a peak write latency of 1.5 seconds.

Bearing in mind that Avg. Disk sec/Read and Avg. Disk sec/Write are just averages (and will therefore be averaging out the true latencies of those three thousand plus IOs a second), it’s clear the disk subsystem was struggling at that time and, as the error message indicated, at least one IO did not return for at least 15 seconds.
Now, I know that on this system there just happened to be a reindex running which swamped the IO path at that time, and I know from experience that these slow IO warnings coincide with just this sort of disk-intensive job.

Warnings coinciding with the times these jobs run are usually the first sign that the infrastructure of the system in question is starting to creak, and making plans for pre-emptive action is advisable; otherwise this latency will start creeping into the working day of the system and affecting users, and guess who they’ll come running to? I’ll give you a clue: it won’t be whoever is responsible for that disk subsystem.

Other factors

After many weeks of trying to track down intermittent IO stalls at a financial institution I used to work at, the problem turned out to be a networking issue, but not the normal kind of networking issue.

The first clue was that the perfmon data and the SAN performance data (even when the SAN data was gathered at a sensible low interval rate) never matched. This indicated the problem was between the host and the switches/HBA cards, but these checked out OK as well.

This left only one alternative. This financial institution could afford stretched Windows clusters: e.g. Node 1 would be in the data centre in the basement, and Node 2 @ 15 miles away at a DR centre. But Windows clusters need shared storage, don’t they? They do indeed, and the way to make Node 1 and Node 2 think they’re looking at the same SAN is to mirror the SAN over underground dark fibre links. Every write is replicated with microsecond latencies.

This added about 3-4ms to every write operation, but that was more than acceptable, especially given the DR protection it gave us (without resorting to database mirroring/log-shipping/SQL Server replication etc). But I digress; as you’ve probably guessed, this is where we tracked the issue down to: most of the unresolved IO stalls were down to glitches in these links, which the networking team were able to confirm when we gave them precise times and dates from the perfmon data.

Although these glitches happen from time to time – usually when workmen dig up roads and punch through the cables – the networking team never thought to mention it outside their usual circle because, up until that point, none of the systems at that institution were that sensitive to IO latencies.

As you might expect, this has now changed, along with their notification procedures.

Incidentally, when the workmen cut through the cable everything automatically switched to a backup circuit within milliseconds, but by the time that bubbles up to the Windows cluster and those lost IOs have to be re-sent, those milliseconds turn into 7-8 seconds.

If you’re looking after a system which has this kind of technology (the institution in question was using EMC GeoSpan clustering) this might be something worth checking out, if you haven’t already done so.

Finally…

Don’t look at the perfmon numbers in isolation.

If you’re seeing latency spikes, check the number of disk transfers taking place at the same time, in case those IOPs are higher than your disk subsystem can sustain. At the same time, check the wait stats and DMVs for corroborating data on any issues with those drives.
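One such corroborating check is sketched below; it pulls the per-file read and write latencies SQL Server itself has recorded (cumulative since the last restart), which should tally with what perfmon is showing:

SELECT DB_NAME(vfs.database_id) AS [database],
       mf.physical_name,
       vfs.num_of_reads, vfs.num_of_writes,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;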

The Useful Links section below will help with getting this extra information.

http://blogs.technet.com/b/askcore/archive/2012/03/16/windows-performance-monitor-disk-counters-explained.aspx

How to examine IO subsystem latencies from within SQL Server

Tracking SQL Server IO Performance

IO Wait Stats

SQL Server and Disk IO

Many aspects of a database system’s configuration will affect the performance of queries running on that DBMS. However, there is one single component that has the greatest impact on DBMS performance, and that, of course, is disk (IO) access.

It’s a broad topic and all aspects of it are pretty well covered in various blogs and technical articles, but there’s not much out there that gathers it all together. The intention of this series of posts is to try to consolidate some of this information and focus in on some of the functions and procedures that can be used to monitor, troubleshoot and configure SQL Server disk configuration and, most importantly, disk performance. It won’t be exhaustive and cover every conceivable option, but it will be based on experience gained whilst working at Sybase and Microsoft and a sprinkling of ‘real world’ database administration experience.

I’ve split the topic up into smaller sub-topics in order to make it more manageable:

1. Partition offsets
2. Wait Stats
3. Tracking SQL Server IO performance
4. Physical Fragmentation
5. System Monitor (Perfmon)
6. Tempdb

Kinda ambitious, I know, but I don’t blog often and it gives me something to aim for!

There are plenty of excellent articles which cover IO in some depth, so I’m going to try my best to avoid regurgitation and keep things DBA-centric. (I highly recommend reading the links at the end of the relevant posts to anyone wanting a deeper understanding of IO issues, from configuration to monitoring and troubleshooting.)

Until solid state disk prices come down and become ubiquitous, disk access is still a painfully slow (relatively speaking) mechanical process. An IO (read or write) request is issued by SQL Server which is passed onto the OS. The OS then passes this onto your disk controller or HBA and eventually this data request translates into a mechanical arm sitting above a spinning disk which will read or write the data that SQL Server requested.

What this means in performance terms is that disk access will be measured in milliseconds.

Access times to the other major components of a DBMS that directly affect performance such as CPU and RAM can be measured in microseconds or nanoseconds.

As you can see, that is a difference of several orders of magnitude.

This single factor alone highlights why reviewing a SQL Server database installation’s disk configuration can be such a valuable and rewarding exercise. In fact, I begin many of my SQL Server troubleshooting exercises by eliminating IO performance as a cause first.

Prior to SQL Server 2000 SP4 it was difficult to identify IO performance issues without a lot of digging around. The biggest clue SQL Server would give was the occasional mysterious 17883 (or even 17884) error, and these were not always guaranteed to be IO related, although IO was usually heavily implicated. SP4 for SQL Server 2000 introduced those ‘slow IO’ messages which most DBAs will be familiar with, and that was a step in the right direction.

SQL Server 2005 introduced various DMVs which would give much greater insight into IO performance, and SQL Server 2008 built on this with the Performance DataWarehouse. These posts are going to cover methods that can be applied from SQL Server 2000 onwards, so will not delve deeply into features and properties only available in later versions, but focus instead on functions and features available across the last three versions of SQL Server as I know there’s still a lot of SQL Server 2000 installations out there.

I’m starting off this series of posts with a discussion about partition offsets. Diving off at the deep end a bit perhaps, but if your disk setup is not based on firm foundations you’re setting yourself up for one of the most common and easily-avoidable performance issues right from the off, and an IO performance hit of 20-30%.

The partition offset issue is relevant to any disk partitions created prior to Windows Server 2008. What this means is that whilst your OS may be Windows Server 2008, if the disk you are using was migrated from an older OS the chances are that it may still be exposed to the problem.

Also, certain OEM-built machines may be exposed to this problem even on Windows Server 2008, as they often include hidden partitions used for e.g. server recovery data.

Partition offset

Without going into too great a level of detail on how a disk is structured both physically and logically, the partition offset is basically where Windows will start writing its first bytes to disk once the partition concerned is first used.

When a disk partition is created in Windows Server 2003 and before, the first 63 sectors are reserved for the MBR as well as disk vendor proprietary information. That means Windows Server 2003 and before (hereon referred to as just Windows) will write its first bytes of data on the 64th sector.

What’s the big deal?

These 63 sectors are invariably 512 bytes each (worth checking with your SAN team/disk vendor, however, as some SANs may use a larger sector size but present the sectors as logical 512-byte sectors to Windows). Therefore the 64th sector will begin at a 31.5KB boundary. What this means is that if this is a SQL Server disk, SQL Server will write its first data page at a 31.5KB boundary, and every nth page will be written across these boundaries.

The big deal therefore is that each of these nth IO operations will result in an additional IO to read or write the data that was on the page that crossed that boundary (the actual value of n will depend on cluster (sector) size and stripe unit size).
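A quick worked example, assuming a 64K stripe unit purely for illustration: with the default 31.5K offset, a 64K read that starts at 31.5K into the disk ends at 95.5K, so it crosses the stripe boundary at 64K and has to be serviced by two IOs. Start the partition at 64K (or 1024K) instead and the same read sits entirely within a single stripe unit and needs only one IO.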

What this means in real (performance) terms is that a properly aligned system will yield a 20-30% performance improvement in IO latency (and therefore query execution time). All this for free, without having to analyse SQL Server configuration, indexing strategy, query plans etc etc.

How do I check my offset?

Checking the offset can be done via a wmi console (wmic) command.

wmic /node:<name of server to check> partition get BlockSize, StartingOffset, Name, Index

The above command will show output similar to the following (if the wmic command is being executed locally, there’s no need for the /node parameter):

C:\WINDOWS\system32>wmic /node:myDbServer partition get BlockSize, StartingOffset, Name, Index

BlockSize Index Name StartingOffset
512 0 Disk #0, Partition #0 32256
512 0 Disk #1, Partition #0 32768
512 0 Disk #2, Partition #0 32768
512 0 Disk #3, Partition #0 32256
512 0 Disk #4, Partition #0 32256
512 0 Disk #5, Partition #0 1048576
512 0 Disk #6, Partition #0 1048576

There will be one line of output for each disk and each disk will be identified by its disk number.

Use the Disk Administrator tool to identify how these disk numbers map to your drive letters.

Surprisingly enough, the StartingOffset column is the key column, and will show the starting offset in bytes.

Disks 1 and 2 have a 64-sector (32768 byte) offset, which lines up with e.g. legacy controllers on local disks.

Drives on default values will show a value of 32256, such as disks 0, 3 and 4 in the output above. Divide that by the 512-byte sector size and you get that dreaded value of 63.

Disks 5 and 6 have a 1MB offset which should line up with most modern disk, SAN cache and cluster size scenarios.

What the above output shows is that, depending on the age of the system, there can be a myriad of offsets in use, each of which was correct according to how the system was configured at the time. Disks 1 and 2 were altered from the default, so at some point in this system’s history someone looked into the issue of partition misalignment and tweaked it for those drives.

So why are disks 3 and 4 on misaligned defaults? Well, have you ever had disks recut to (usually) increase storage? That usually involves blowing away partitions and recreating them. If partition offsetting was not carried out when the partition was recreated, for whatever reason, this is what you can end up with. The newest disks, 5 and 6, were created with the latest information to hand, hence the higher offset, but as this slightly contrived example shows (it’s based on real world systems) it pays to be vigilant and review systems that were previously correctly set up.

How do I fix it?

On paper, the fix is very simple: adjust the partition offset. The diskpart command line utility is used to reset the offset.
Move the offset along from the default misaligned 31.5K to an offset that works with your cluster size and stripe unit size; 64K, or a multiple of 64K, usually works on most systems, but where a SAN is involved it gets more complicated, as stripe unit sizes vary from vendor to vendor and rig to rig, so you’ll have to get your SAN engineer/vendor involved.

Windows Server 2008 defaults to 1024K offset.

Ideally, the offset needs to start at a point where the IO requests that SQL Server issues (which will be a multiple of 8K, depending on the IO request type being fulfilled), the cluster size, and any caching involved at disk controller/SAN level all line up, thus avoiding additional IO requests.

The downside, of course, is the partition will have to be recreated, so downtime will be involved on standalone servers that have no HA solution implemented.

Realigning the partition offset involves blowing away the disk partition in question, so if you’re not familiar with diskpart, leave it to your relevant storage/Windows team to reset the partition. If you really want to do this yourself, you’ll find step-by-step instructions in the SQLCAT team’s Disk Partition Alignment Best Practices article linked to at the bottom of this post.

The resulting 20-30% improvement in latency times has been verified by the SQLCAT team, who published the figures in the paper on partition alignment mentioned above; it is highly recommended reading for anyone with an interest in this subject.

Links

If you only read one further article on partition offsets, make it this one!

Disk Partition Alignment Best Practices for SQL Server

SQL Server 2000 I/O Basics

Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000

SQL Server listeners

Verifying and troubleshooting SQL Server listener connectivity is a key skill in a DBA’s armoury, but one that is often overlooked, until now (hopefully!).

SQL Server can listen for connection requests via multiple listeners, such as TCP/IP, shared memory or named pipes.

Sometimes it’s necessary to check connectivity by a specific listener.

The first place to check is the error log, to confirm that the relevant listener has started up; the error log is the SQL Server version independent way of confirming listener startup, rather than relying on e.g. the sys.endpoints catalog view available from SQL Server 2005 onwards (if you prefer to use the catalog view, select name, protocol_desc, state_desc from sys.endpoints will list all the listeners and their status).

Use the SQL Server Configuration Manager (SQL Server Network Configuration Section) to view and enable or disable the various listeners, although a SQL Server restart will be involved in starting/stopping them.

The output below is from a SQL Server 2005 server with TCP/IP, shared memory, named pipes and the dedicated administrator connection listeners enabled.

2009-02-24 08:15:40.050 Server Server local connection provider is ready to accept connection on [ \\.\pipe\SQLLocal\SQL2K5DEV ].
2009-02-24 08:15:40.050 Server Server named pipe provider is ready to accept connection on [ \\.\pipe\MSSQL$SQL2K5DEV\sql\query ].
2009-02-24 08:15:40.050 Server Server is listening on [ 'any' 49839].
2009-02-24 08:15:40.050 Server Dedicated admin connection support was established for listening remotely on port 49839

So, the relevant listener is in the list and, as far as SQL Server is concerned, it is up and running. The next step is to check connectivity. The SQL Server client tools are best for this, as they take the problem client application out of the equation (presumably you’re reading this blog because of issues connecting to SQL Server via a specific listener).

Either SQL Server Management Studio (SSMS) or the lightweight console-mode SQLCMD, which is my preferred tool, will do here.

Testing the listener

To test each listener you just need to prefix the server name with the listener name abbreviation. Using SQLCMD as an example and a local instance by the name of .\SQL2K5DEV:

Listener        Prefix   Example Server Name entry in SSMS
Named Pipes     np       np:\\.\pipe\MSSQL$SQL2K5DEV\sql\query
Shared Memory   lpc      lpc:.\SQL2K5DEV
TCP/IP          tcp      tcp:.\SQL2K5DEV,50334   (note the use of the TCP/IP port number)
VIA Protocol    via      via:.\SQL2K5DEV,1433,0  (note the use of the port number and the network card number; 0 if it’s the only card)

In case you didn’t know, the ‘.’ where the host name should be prior to the instance name is a shortcut for the local machine name.

If you prefer to use SQLCMD, just put the listener prefix (as listed in the Prefix column above) in front of the server name supplied to the -S (server name) parameter.

The SQL Browser service

The SQL Browser service’s principal responsibility is to direct incoming connections to the relevant named instance.

If you’re having issues connecting to a named instance, and this service is enabled and running, make sure your firewall has an exception for UDP port 1434, which is used by this service for this connection handling.

If, for whatever reason, this cannot be done, or SQL Browser is not running and you have named instances running on that host, remote clients can still connect by specifying the port number after the host name, as outlined in the previous section. For example, specify myhost.domain.com,5000 in the Server name text box of an SSMS connection to connect to a named instance listening on port 5000 on server myhost.domain.com (note the lack of any reference to the actual instance name, as we only need to provide the host name and port number).

Verifying what listener a connection is using

To verify what protocol a connected client is using, use the following query from SQL Server 2005 onwards:

select net_transport, endpoint_id, connect_time, client_net_address from sys.dm_exec_connections where session_id = @@spid;

Replace @@spid with the spid (or session_id) of the connection you’re interested in.

This will output the listener the relevant connection is using, as well as information like the endpoint id (which can be checked against sys.endpoints), when the connection came in, and its address, as shown below (I dispensed with the client_net_address column to make the output easier to read):

net_transport endpoint_id connect_time
------------- ----------- ----------------------
Named pipe              3 2009-03-03 08:59:07.31

If you can confirm the listener is up and you can connect to it (at least locally), you know it’s not a SQL Server configuration issue and it is more likely to be a firewall/network problem, so make sure the relevant ports you’re listening on are open, e.g. named pipes requires port 445 to be open.

Just to prove it, go to the client machine(s) concerned and verify whether the Microsoft port querying tool can connect to that server. Fortunately, there is a graphical interface for this tool. It is very useful for identifying, or at least narrowing down, connectivity problems (make sure you select SQL Service in the ‘Query predefined service’ dropdown).

Happy troubleshooting!