Tracking SQL Server IO performance

Page content

I’ve finally got around to this, the third part of my SQL Server and Disk IO series of posts:

The sys.dm_io_virtual_file_stats DMV and fn_virtualfilestats function are great tools for extracting granular information about SQL Server IO performance, right down to the individual file in each database.

It gives details of the number of physical reads/writes, amount of data read/written and how much time was spent reading/writing that data. Traditionally, this has been important because IO access is the slowest part of fulfilling any database queries as RAM and CPU access takes nanoseconds or microseconds, whereas typically access to a spinning disk takes several milliseconds. In other words, disk access is thousands of times slower than CPU or RAM access, so improving disk access performance often brings about the biggest bang for the buck, in a manner of speaking.

Lately, with the uptake of SSDs increasing this is starting to become less of an issue, but if like me, some of the systems you administer are still using the ancient technology of spinning disks then getting familiar with this view will help you confirm any IO issues. Incidentally, the forerunner of this DMV was fn_virtualfilestats and goes back to (at least) SQL Server 2000 and still works in the latest SQL Server version (SQL Server 2012, as of writing this post) so this post concentrates on the fn_virtualfilestats function.

I won’t bore you with details of how to query the function as it is already well documented, and my own query which utilises it is below:

;WITH sum\_table
SELECT \[dbid\]
, \[fileid\]
, \[numberreads\]
, \[iostallreadms\]
, \[numberwrites\]
, \[iostallwritems\]
, \[bytesread\]
, \[byteswritten\]
, \[iostallms\]
, \[numberreads\] + \[numberwrites\] AS \[total\_io\] FROM :: fn\_virtualfilestats(db\_id(), null) vf
inner join sys.database\_files df
ON vf.\[fileid\] = df.\[file\_id\]
WHERE df.\[type\] <> 1
SELECT db\_name(\[dbid\]) AS db
, file\_name(\[fileid\]) AS \[file\]
, CASE \[numberreads\]
ELSE CAST(ROUND(\[iostallreadms\]/(\[numberreads\] \* 1.0), 2) AS NUMERIC(8,2)) END AS \[read\_latency\_ms\]
, CASE \[numberwrites\]
ELSE CAST (ROUND(\[iostallwritems\]/(\[numberwrites\] \* 1.0), 2) AS NUMERIC(8,2)) END AS \[write\_latency\_ms\]
, \[numberreads\] AS \[#reads\]
, \[numberwrites\] AS \[#writes\]
, \[total\_io\]
, CAST (ROUND(((\[bytesread\]/1073741824.0)/(SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver\_start\_time, GETUTCDATE()) / 60.0) from\_os\_sys\_info)),3) AS NUMERIC (8,2)) AS \[gb\_read\_per\_hr\]
, CAST (ROUND(((\[byteswritten\]/1073741824.0)/(SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver\_start\_time, GETUTCDATE()) / 60.0) from\_os\_sys\_info)),3) AS NUMERIC (8,2)) AS \[gb\_written\_per\_hr\]
, \[iostallms\] AS \[#stalled io\]
, CAST (ROUND(((\[numberreads\] \* 1.0 ) /\[total\_io\]) \* 100,3) AS NUMERIC (6,3)) AS \[file\_read\_pct\]
, CAST (ROUND(((\[numberwrites\] \* 1.0 ) /\[total\_io\]) \* 100,3) AS NUMERIC (6,3)) AS \[file\_write\_pct\]
, CAST (ROUND(((\[numberreads\] \* 1.0) /(SELECT sum(total\_io) FROM sum\_table)) \* 100,3) AS NUMERIC (6,3)) AS \[tot\_read\_pct\_ratio\]
, CAST (ROUND(((numberwrites \* 1.0) /(SELECT sum(total\_io) FROM sum\_table)) \* 100,3) AS NUMERIC (6,3)) AS \[tot\_write\_pct\_ratio\]
, CAST (ROUND(((\[total\_io\] \* 1.0) /(SELECT sum(total\_io) FROM sum\_table)) \* 100,3) AS NUMERIC (6,3)) AS \[tot\_io\_pct\_ratio\]
, CAST (ROUND((\[bytesread\]/1073741824.0),2) AS NUMERIC (12,2)) AS \[tot\_gb read\]
, CAST (ROUND((\[byteswritten\]/1073741824.0),2) AS NUMERIC (12,2)) AS \[tot\_gb\_written\]
, (SELECT CONVERT(NUMERIC(10,2),DATEDIFF(MI, sqlserver\_start\_time, GETUTCDATE()) / 1440.0) FROM\_os\_sys\_info) AS \[uptime\_days\]
FROM sum\_table
GROUP BY \[dbid\], \[fileid\], \[total\_io\], \[numberreads\], \[iostallreadms\], \[numberwrites\], \[iostallwritems\], \[bytesread\], \[byteswritten\], \[iostallms\]
ORDER BY \[total\_io\] DESC

I’ve chosen to use the fn_virtualfilestats function instead of the sys.dm_io_virtual_file_stats DMV as it is present all the way back to SQL Server 2000. [Edit: I’m working on a SQL Server 2000 friendly version of this query at the moment, once I’ve dusted of a copy of SQL Server 2000 and Windows Server 2003 (this is one of the few moments I’ve regretted getting rid of all trace of SQL Server 2000!)]

The main difference between this query and others I’ve found on the blogosphere is that it excludes the log file (because on an OLTP system it will get hammered and generate a lot of IO anyway), and once I’ve excluded the log file being an issue (usually from wait stats and System Monitor (perfmon) data) I’m usually more interested in how busy my data files are, and if there are any hotspots and it does this by aggregating the IO data and returning the percentage of IO that not only each file does, but the actual ratio of IO that each file does relative to the total IO against all data files within that database. Mike Lewis has tweaked the query so it scans all databases (see the first comment on this post below)

Automating the collection of this data provides a great way of baselining performance and spotting trends (hint hint). The uptime_days column shows how many days the server has been running so you can work out the average daily read/write IO. [Edit: as Richard has pointed out in the comments, this is based on a DMV introduced in SQL Server 2008.)

If you want information regarding log file performance, simply comment out the where df.[type] <> 1 line

It’s been tested in battle and I’ve found it invaluable for highlighting high contention filegroups and IO subsystem issues.

A word of caution, before looking at the output and pointing the finger at the disks/SAN it’s always worth digging behind the stats. If high latencies/reads are found, the issue can boil down to database design which can be uncovered by investigating the reason for the high number of reads/writes on the relevant filegroup(s), e.g. a lack of indexing on a highly read table will result in a lot of table scans which can account for a substantial amount of IO on the file/filegroup holding that table and adding a well chosen index or two can eliminate the scans, reducing the IO and (hopefully) the latency. If the IO goes down, but the latency does not, you have more ammunition for looking more closely at the IO subsystem provided that factors like, for example, anti-virus filter drivers etc have been investigated.

Recently, the output highlighted the slow performance of one particular file in an 18 file database which showed latencies of 89ms. After some digging around I found that this particular file was the primary file and (taking an educated guess) probably had default autogrowth settings when it was originally created, and has probably gone through some shrinkfile operations in a previous life, because this one 180GB file was spread over 4,818 physical fragments on the SAN.
Once this was addressed the latency went down to 25ms. Whilst 25ms is not brilliant in itself it’s still a 70% improvement on its previous performance (and I know there’s room for further improvement as the partition it’s on has an sub-optimal allocation unit size, for some strange reason). IMHO physical file fragmentation, even on SANs is a big issue and I’ll be blogging about that particular issue in an upcoming post.

As I was saying, it pays to do a little digging around behind the numbers before jumping to the most obvious conclusion.

I know there are dozens of variations of queries out there that interrogate fn_virtualfilestats or sys.dm_io_virtual_file_stats but the reason I came up with yet another variation was that, for me, the relativity was difficult to ascertain; on a system that’s been running for while the IO numbers can be vast and on a system with a lot of data files it can be difficult to spot the troublesome files/filegroups, hence the addition of percentages and percentage ratios. Another inspiration was the Performance Dashboard Reports (which don’t work on SQL Server 2008 R2 out of the box but I hear that they can be ported back to SQL Server 2008R2 from SQL Server 2012, or you can try this fix) which do offer percentages for IO, but also includes the IO on the log files which can make it difficult to get a clear picture of IO that was exclusively related to data files.

The main reason for this post was to highlight the presence and significance of this DMV and also to point out that it should not be used in isolation. Other methods for gathering performance related data, like perfmon and wait stats data are invaluable and all this information combined, together with corroborating evidence like e.g. slow IO warnings in the SQL Server error log will give a more rounded picture of IO performance.

SQL Server and Disk IO
Tracking SQL Server IO performance with Performance Monitor (Perfmon)
Misaligned disk partition offsets and SQL Server Performance
SQL Server wait stats
Impact of Fusion-io based tempdb on reindex duration
FlushCache messages might not be an actual IO stall
Tempdb configuration and performance