Tuesday, July 14, 2009

The Value Of Statistics

This isn’t strictly about .NET, but hey, it’s my blog so here it is anyway. Besides, how many .NET developers use SQL Server? Quite a few, I’d guess, so I say it’s relevant.

So a few weeks ago my boss and another colleague of mine were working on some SQL. They had a query that returned the data they wanted, but it was running too slowly (about two minutes a run). They had been studying it for half an hour with no success when my boss came to see me…

“I don’t expect you can help, I just want to spread the misery I guess,” he said. He’s a funny guy. So I followed him to the desk where they were working and he explained the query to me.

The query had about six or seven table joins and possibly a sub-select or two, but was otherwise a fairly typical select statement. My boss ran the query returning just count(*) as well as the actual data he wanted, to prove it wasn’t the quantity of data he was getting back, or any expressions in his select clause, that was taking the time.

Next he showed me that if he removed a join to one table, the query ran in about two seconds. I was about to ask (the obvious question) whether the fields he was joining on were indexed when he said: “And that table only has two records in it, it will never have any more than five. I can’t see how adding an index will help.”

“You want a statistic,” I said with more confidence than I actually felt. I was guessing, albeit an educated guess.

“How do I do that?” he asked.

So I walked him through creating a statistic on the column he was joining to in the table with only two rows. You can do this with T-SQL or with SQL Server Management Studio, which is how we did it… simply navigate to and expand the table in Object Explorer, then right-click on Statistics and choose “New Statistics”. From there it should be pretty easy to figure out what to do; it’s similar to creating an index.
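The T-SQL equivalent is a one-liner. This is just a sketch — the table and column names below are hypothetical stand-ins, not the ones from our actual query:

```sql
-- Hypothetical names: dbo.OrderStatus is the tiny two-row lookup
-- table, and StatusCode is the column being joined on.
CREATE STATISTICS ST_OrderStatus_StatusCode
ON dbo.OrderStatus (StatusCode)
WITH FULLSCAN;  -- read every row when building the statistic;
                -- cheap here because the table is so small
```

For a tiny table FULLSCAN costs nothing and gives the optimiser exact numbers; on big tables SQL Server samples rows by default instead.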

So, what happened after we applied the statistic? Well, my boss sat stunned as his query ran in two seconds again, this time with the offending table join included.

How did the statistic help? Well, the root of the problem was that the SQL Server optimiser (that’s optimizer if you’re an American) had selected an inefficient method for joining the two tables. Why did it do this? Because it had no way of knowing how many rows were in the little table, so it guessed, and it guessed badly. Adding the statistic to the table let the optimiser see how many rows existed, and therefore it chose a better plan… this is much better than manually adjusting the join type or adding index hints, as the statistic still allows the optimiser to choose a different plan later if the number of rows increases dramatically.

At a Tech Ed session I went to a few years ago they said that if you view a query plan and hover over the joins you’ll see two figures of interest: Estimated Rows and Actual Rows. If these values aren’t within spitting distance of each other (that’s assuming you can’t spit very far) then you’ve either got a problem, or could have one... and a statistic of some sort will quite likely help.
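If you prefer to compare the two figures in text rather than hovering over the graphical plan, SET STATISTICS PROFILE gives you both side by side. A sketch, with hypothetical table names:

```sql
-- Ask SQL Server to append actual execution info to the plan output.
SET STATISTICS PROFILE ON;

-- Hypothetical query: join a big table to a small lookup table.
SELECT o.OrderId, s.Description
FROM dbo.Orders AS o
JOIN dbo.OrderStatus AS s ON s.StatusCode = o.StatusCode;

SET STATISTICS PROFILE OFF;
-- A second result set now shows each plan operator with both a
-- Rows (actual) and an EstimateRows column; a large gap between
-- them on a join is the warning sign described above.
```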

Another way to determine if you need statistics is to analyse the query with the Database Engine Tuning Advisor, if you have an edition of SQL Server that includes it. Often when I use this tool it will recommend one or more indexes and several statistics… I always apply the statistics first, without the indexes, and retry my queries to see if the performance increase is sufficient before adding indexes.

Why? Because adding statistics seems (I have only my own anecdotal evidence to stand on here) to add much less overhead to the database than an index. A statistic is basically a count of a number of records, or of records within certain groups (e.g. a statistic on an orders table might include counts of rows in each month for which there are orders). This is usually very small data… even over a million records in a table you might add only a few bytes to the size of your database, unlike an index, which could add gigabytes if it’s a bad one. Also, statistics don’t seem to add all that much overhead to inserts and updates, and you can choose to have them updated asynchronously or manually (presumably at a scheduled interval) if they do cause you a problem. I’ve never had to do this though, and I’m not sure it’s a good idea unless you really know what you’re doing.
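For completeness, the asynchronous and manual options mentioned above are database-level settings and a maintenance command. A sketch, where "YourDatabase" and the table name are placeholders:

```sql
-- Let queries run with the old statistics while a refresh happens
-- in the background, instead of waiting for the update inline.
ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- Or disable automatic updates entirely and refresh on your own
-- schedule (e.g. from a nightly SQL Agent job):
ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS OFF;
UPDATE STATISTICS dbo.Orders;  -- rebuild all statistics on this table
```

As the paragraph above says, leave the defaults alone unless automatic updates are demonstrably causing you trouble.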

I should note here that we had a similar problem a while back when joining data from an XML block (using XQuery inside SQL Server) to a table… the optimiser had no way of knowing the number of rows in the XML, and so it chose bad join plans. Unfortunately in that case we couldn’t create statistics on the XML, so instead we had to alter the T-SQL to explicitly state the join type (merge, loop, etc., not left/right/inner/outer). Once we selected the correct join type for the quantity of data we usually process, the statement again ran in seconds rather than minutes.
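A join hint of that sort sits between the join keywords. This is a hypothetical reconstruction, not our actual query — the variable, paths, and table are made up for illustration:

```sql
-- @xmlData is assumed to be an xml variable already populated.
-- Force a nested-loops join, since the optimiser cannot estimate
-- how many rows the .nodes() call will produce.
DECLARE @xmlData xml = '<statuses><status code="1"/><status code="2"/></statuses>';

SELECT o.OrderId, x.value('@code', 'int') AS StatusCode
FROM @xmlData.nodes('/statuses/status') AS t(x)
INNER LOOP JOIN dbo.Orders AS o
    ON o.StatusCode = x.value('@code', 'int');
```

The trade-off, as noted above, is that a hint is fixed: unlike a statistic, it won’t let the optimiser switch to a better plan if the data volume changes later.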

So, if you’re having a problem with performance when joining tables (especially ones of disparate size), or when joining some non-table data (like XML) to a table, then consider adding statistics or forcing the join type in the SQL statement.

