Wednesday, 11 April 2012

A story of a failed exam.

Microsoft's SQL Server 2012 beta exam 71-464 has failed. Yes, the exam, or rather Prometric, not me.

Yesterday morning I went to an Authorized Prometric Testing Centre ready to sit the beta exam. It was all going well until the actual exam. The time came, I clicked the 'Start' button, and an error message popped up.

Tuesday, 10 April 2012

T-SQL Tuesday #029 - Analytic Functions in SQL Server 2012

I took the last few days to review the new features of SQL Server 2012 in preparation for the beta exams I will sit very soon. I had read about many of them before, but some still came as a nice surprise (better now than on the exam, I suppose).

If I am to choose one useful feature of SQL Server 2012 to write about today, it must be the analytic functions in T-SQL and the improved OVER clause. LAG, LEAD and OVER are what I would call sapper's tools for dealing with time-bomb coding techniques, especially triangular joins. Let me present
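To illustrate with a hypothetical Sales table of my own (not taken from any exam material): the classic running-total triangular join compares every row with all earlier rows, while the 2012 ordered OVER frame and LAG do the same work in a single ordered pass.

```sql
-- Hypothetical table, for illustration only
CREATE TABLE dbo.Sales (saleDate date PRIMARY KEY, amount money NOT NULL);

-- Pre-2012 running total: the triangular join
-- (every row is joined to all earlier rows, so work grows quadratically)
SELECT a.saleDate,
       runningTotal = SUM(b.amount)
FROM dbo.Sales AS a
JOIN dbo.Sales AS b ON b.saleDate <= a.saleDate
GROUP BY a.saleDate;

-- SQL Server 2012: the ordered OVER frame plus LAG/LEAD, one pass over the data
SELECT saleDate,
       amount,
       runningTotal = SUM(amount) OVER (ORDER BY saleDate
                                        ROWS UNBOUNDED PRECEDING),
       prevAmount   = LAG(amount)  OVER (ORDER BY saleDate),
       nextAmount   = LEAD(amount) OVER (ORDER BY saleDate)
FROM dbo.Sales;
```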

Wednesday, 21 March 2012

24 Hours of PASS 2012

The last 24 hours were the 24 Hours of PASS, tightly packed with very interesting sessions about many aspects of SQL Server. A lot of information, a lot of stories and, unfortunately, some technical problems too. Still, time well spent.

My favourite presentations of the day, in the order they happened, were

Wednesday, 14 March 2012

Database table naming conventions

Opinions are like... belly buttons, and so are naming conventions, right? Some people ignore them and some are ready to go to great lengths to prove the superiority of theirs. And the truth is it doesn't really matter, unless your convention happens to be something as silly as Hungarian notation, for example.

When it comes to naming database tables there are two schools of thought: you should use either singular or plural nouns. Some people argue that singular looks better in projections and selections because you tend to think about an individual item there. Some people, and I am among them, think that the noun should be plural as it is semantically closer to what a table is: a set.

But I am easy; I am not going to try to convince anybody that a table called sales will perform better than one called sale. It is just that sales fits better with the way I think about databases, and in turn the way I think about them has on a few occasions proven to result in better performance ;)

Wednesday, 7 March 2012

How not to move locations

It will be a bit of storytelling this time. A story about trust and guts and things you cannot control.

Not so long ago a company that hosted their own servers in-house decided to move offices. All was planned in advance. The servers were to stay in the old location longer than everybody else, until a decent internet connection was installed in the new server room. The ISP was informed, and a new line was ordered with specific instructions that this was not an office move (which it was) but that a completely new line was needed in the new location. A small trick to make sure everything went right.

The plan was that on Friday the new line would be installed and then on Saturday night the servers would be moved. The time came and... nothing happened.

Obviously, the ISP's support very kindly apologised from a very safe distance, said they were very busy and promised it would be done as soon as possible. The whole move was postponed until the following weekend.

Wednesday, 29 February 2012

ALTER COLLATION - deceptive simplicity

A database collation is something developers hardly ever think about. After all, most of the time, it is hidden deep below the surface and concerns itself with such trifling matters as character data types encoding and sorting.

As a developer you hardly ever have to concern yourself with it, but it can occasionally bite with a 'Cannot resolve the collation conflict between X and Y in the equal to operation.' error. It may happen in comparisons, look-ups or joins. That is because when two columns have different collations they cannot be compared directly, and there is no way to do any implicit conversion. Such problems can be easily solved by forcing the collation with COLLATE.
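For example, assuming two hypothetical databases whose customerName columns were created with different collations (the names and the Latin1_General_CI_AS collation below are just examples), forcing one side to an explicit collation resolves the conflict:

```sql
-- Without COLLATE this join would fail with the collation conflict error
SELECT c.customerName
FROM dbA.dbo.Customers AS c
JOIN dbB.dbo.Customers AS o
    ON o.customerName = c.customerName COLLATE Latin1_General_CI_AS;
```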

More surprising, perhaps, are problems with temporary objects, as they use the collation of tempdb and not that of the database you are using. The solution is the same but usually more cumbersome here. For that reason administrators' rule of thumb is to keep all databases in the same collation as tempdb for as long as possible, just to keep collations in their oblivion.
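A sketch of the temp-table case (table and column names are my own): declaring the column with COLLATE DATABASE_DEFAULT up front saves you from forcing the collation in every later comparison.

```sql
-- Without COLLATE DATABASE_DEFAULT this column would get tempdb's collation
CREATE TABLE #lookup
(
    name nvarchar(50) COLLATE DATABASE_DEFAULT
);

-- No conflict now, even if tempdb's collation differs from the database's
SELECT c.name
FROM dbo.Customers AS c
JOIN #lookup AS l ON l.name = c.name;
```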

Wednesday, 22 February 2012

sub-tree performance - summary

Note to self in regards to sub-tree performance in SQL Server.

When working with hierarchical structures in SQL Server, never use the XML data type. It is usually tricky to implement and performs poorly. Recursive CTE queries perform reasonably well and are very elegant in implementation, and as such are suitable for a wide range of applications, but when the tree is big, wide and deep and performance is critical, HierarchyId should be used. On smaller data sets it is not only cumbersome but also slower than the other solutions; the cross-over point seems to be around 1500 nodes.

The old loop solution performs consistently regardless of data size but somewhat lacks elegance.

Detailed reasons for the above can be found in the following posts:

Part 1 - the brief introduction to the problem
Part 2 - test environment setup with all four methods explained
Part 3 - testing, results and the conclusion

Wednesday, 15 February 2012

sub-tree performance - part 3

This is the third part of the series about the performance of tree operations in SQL Server, in which I show how the tests were performed and, more interestingly, what the results were.
Part 1 - the brief introduction to the problem
Part 2 - test environment setup with all four methods explained

The complete code is also available as a github gist here.

We now have the test table and four test user-defined functions, each implemented using a different technique, namely: an old-fashioned loop, a CTE, XML and HierarchyId. First, let's have a look at the execution plans. Three functions (Loop, Xml and HierarchyId) have very similar and simple plans. Here is how it looks for the loop function:

But the plan for the function using a CTE is completely different and looks like this:

For a batch of all four functions in question the relative costs are as follows:
  • 1% GetTreeWithLoop()
  • 97% GetTreeWithCte()
  • 1% GetTreeWithXml()
  • 1% GetTreeWithHierarchyId()
But these are only estimated execution plans and have yet to be put to the test. To compare real performance, we need a more quantitative approach. So I selected 6 test nodes:
  • node 2 with sub-tree of 1 and max depth of 0
  • node 5 with sub-tree of 16 and max depth of 3
  • node 8 with sub-tree of 176 and max depth of 6
  • node 11 with sub-tree of 1792 and max depth of 9
  • node 14 with sub-tree of 8192 and max depth of 12
  • node 1 with sub-tree of 69632 and max depth of 14
and computed the sub-tree count for each one of them 15 times with each of the 4 functions. For each node and function I recorded the average CPU time the server needed to process the request.
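A single timing iteration, assuming the function names introduced in part 2, can be as simple as:

```sql
SET STATISTICS TIME ON;
SELECT COUNT(*) FROM dbo.GetTreeWithCte(11, -1);  -- node 11, no depth limit
SET STATISTICS TIME OFF;  -- the CPU time is reported on the Messages tab
```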

Here are the results:

As you may see, there wasn't much difference until around 1500 nodes. Beyond that point the XML version's time goes through the roof, and for that reason I excluded it from further computation of relative values.

From this graph you may clearly see that the old-school Loop solution can still compete with the two more modern approaches, as its performance relative to the others is stable regardless of the sub-tree size. What is more, it seems to sit exactly in the middle between the two. You may also see that the CTE-based solution seems to be the fastest for small trees and HierarchyId for large ones, with the transition point around 1500 nodes and 8 levels. But you must not forget that in this test scenario depth grows with size as well, and in business scenarios we typically have only a few levels.

So let's see how the performance compares for sub-trees of different sizes but equal depths.

It is worth noticing that all the solutions have an almost linear relation between size and time. Again, the XML method, despite looking all right with small sub-trees, goes through the roof at the other end of the graph. And again, for that reason I excluded it from further computation of relative values.

Here you may clearly see that, time-wise, out of the three the CTE approach is the worst, yet we are talking about a 0.2-0.3 second difference from the best option at any size, so it may still be the preferable solution due to its clean and compact syntax.

And how about Loop vs. HierarchyId? On three-level sub-trees their performance is very close, with Loop having a small advantage for sub-trees of up to around 2500 nodes and HierarchyId on bigger ones. So is the overhead of implementing HierarchyId worth it? On such shallow trees probably not, but when you look at the previous graphs it seems the new type has a lot to offer when it comes to much deeper structures.

2005 FOR XML queries on MSDN
HierarchyId on MSDN
More about HierarchyId on Technet
Excellent article about comparing performance of queries on SQL Server

Wednesday, 8 February 2012

sub-tree performance - part 2

This is the second part of the series about the performance of tree operations in SQL Server, in which I show how the test case was constructed and how the four test methods were implemented.
Part 1 - the brief introduction to the problem
Part 3 - testing, results and the conclusion

The complete code is also available as a github gist here.

To test the performance of hierarchical queries we need a simple self-referencing table populated with test data, as well as test queries implemented using different methods. For flexibility the test queries will be implemented as user-defined functions meeting the following requirements:
  • it will take two parameters:
    • the primary key of the root node from which to start the search
    • the maximum distance from the root node (when -1, search without a depth limit)
  • it will return a table containing the full sub-tree, including the specified root node, up to the specified depth limit
  • output will include the node's primary key, the primary key of the parent node and the distance
First we need to create test data structures and populate it with some test data:
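The full script is in the linked gist; a minimal sketch of the structure it creates might look like this:

```sql
CREATE TABLE dbo.Nodes
(
    nodeId     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    nodeParent int NULL REFERENCES dbo.Nodes (nodeId),  -- self-reference
    nodeValue  int NOT NULL                             -- payload placeholder
);
```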
The above code will create a simple Nodes table with
  • nodeId - the primary key of the table
  • nodeParent - a self-reference to nodeId
  • nodeValue - in a real-life scenario this could be a foreign key to a data table, or it could be replaced by many columns storing record information.

Then the table is populated with test data. The CreateAbcTree stored procedure which does the job uses a CTE and for that reason will run only on SQL 2005 or newer. The created data is structured like this:

In total there are 69632 nodes, with one root and 14 nodes at the first level. The sub-trees starting from those 14 nodes have different depths and sizes, and on each level there is a different number of nodes.

Finally, an index is created on the nodeParent column to improve performance. The test environment is ready.

Now it is time to present the possible solutions.

1. Old school solution with a loop - possible since SQL 2000
How does it work?
Line 8: A temporary table is created which can store a node key and its relative distance from the sub-tree root. The node in question is inserted into the temporary table as a starting point
Line 16: In a loop, for the most recently added records in the temporary table, search for their 1-level descendants and add them to the temporary table.
Line 27: Use the temporary table to return the relevant nodes from the Nodes table
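The listing itself lives in the linked gist, and the line numbers above refer to it. A sketch of the loop-based function following that description (using only pre-2005 syntax, to back the "possible since SQL 2000" claim) might look like this:

```sql
CREATE FUNCTION dbo.GetTreeWithLoop (@rootId int, @maxDepth int)
RETURNS @tree TABLE (nodeId int PRIMARY KEY, nodeParent int NULL, distance int NOT NULL)
AS
BEGIN
    DECLARE @distance int, @rows int;
    SET @distance = 0;

    -- seed the result with the requested root node at distance 0
    INSERT @tree (nodeId, nodeParent, distance)
    SELECT nodeId, nodeParent, 0
    FROM dbo.Nodes
    WHERE nodeId = @rootId;

    SET @rows = @@ROWCOUNT;

    -- breadth-first: expand the most recently added level
    WHILE @rows > 0 AND (@maxDepth = -1 OR @distance < @maxDepth)
    BEGIN
        INSERT @tree (nodeId, nodeParent, distance)
        SELECT n.nodeId, n.nodeParent, @distance + 1
        FROM dbo.Nodes AS n
        JOIN @tree AS t ON n.nodeParent = t.nodeId
        WHERE t.distance = @distance;

        SELECT @rows = @@ROWCOUNT, @distance = @distance + 1;
    END

    RETURN;
END
```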

2. Recursive query using a Common Table Expression - a novelty in SQL Server 2005
How does it work?
Line 5: A CTE is defined
Line 7: Root level is defined
Line 11: Recursive level is defined
Line 13: and joined with root level
Line 17: resulting table is returned

Already at this point one advantage of the CTE over the loop is visible - the clarity of the code. NOTE: because such a CTE is treated as a single query, I was able to use an inline table-valued function to make the code even clearer.
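Again, the actual listing is in the gist (line numbers above refer to it); an inline table-valued function built along those lines might look like this:

```sql
CREATE FUNCTION dbo.GetTreeWithCte (@rootId int, @maxDepth int)
RETURNS TABLE
AS
RETURN
    WITH tree AS
    (
        -- root level: the requested node at distance 0
        SELECT nodeId, nodeParent, distance = 0
        FROM dbo.Nodes
        WHERE nodeId = @rootId

        UNION ALL

        -- recursive level: children of what the previous level found
        SELECT n.nodeId, n.nodeParent, t.distance + 1
        FROM dbo.Nodes AS n
        JOIN tree AS t ON n.nodeParent = t.nodeId
        WHERE @maxDepth = -1 OR t.distance < @maxDepth
    )
    SELECT nodeId, nodeParent, distance
    FROM tree;
```

Note that the recursion depth of such a query is capped by the MAXRECURSION option (100 by default), which the caller can raise if the tree is deeper.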

3. XML Data Type - available since the 2005 version of MS SQL Server
How does it work?
This time we need two functions: one that will be called recursively and will return XML, and then another that will use the XML data and return a table with the results.

Line 3: WITH in this line sets a function option and has nothing to do with the WITH defining a CTE in the previous snippet
Line 9: The GetXmlTree() function is called recursively
Line 26: Sub-tree is computed by GetXmlTree() and stored in a variable
Line 30: cross apply is used to effectively join Nodes table with XML data stored in @xml
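The two gist functions, sketched along the lines described above (line numbers in the walk-through refer to the original listing; also note that scalar UDF recursion in SQL Server is capped at 32 levels, which is plenty for this 14-level tree):

```sql
-- builds the sub-tree as nested <node> elements, calling itself per level
CREATE FUNCTION dbo.GetXmlTree (@rootId int, @maxDepth int)
RETURNS xml
WITH RETURNS NULL ON NULL INPUT
AS
BEGIN
    RETURN
    (
        SELECT n.nodeId AS '@id',
               -- untyped xml column: children are inlined as nested elements
               (SELECT dbo.GetXmlTree(c.nodeId, @maxDepth - 1)
                FROM dbo.Nodes AS c
                WHERE c.nodeParent = n.nodeId AND @maxDepth <> 0
                FOR XML PATH(''), TYPE)
        FROM dbo.Nodes AS n
        WHERE n.nodeId = @rootId
        FOR XML PATH('node'), TYPE
    );
END
GO

-- shreds the XML back into the required result table
CREATE FUNCTION dbo.GetTreeWithXml (@rootId int, @maxDepth int)
RETURNS @tree TABLE (nodeId int PRIMARY KEY, nodeParent int NULL, distance int NOT NULL)
AS
BEGIN
    DECLARE @xml xml;
    SET @xml = dbo.GetXmlTree(@rootId, @maxDepth);

    INSERT @tree (nodeId, nodeParent, distance)
    SELECT d.nodeId, d.nodeParent,
           x.n.value('count(ancestor::node)', 'int')  -- nesting depth = distance
    FROM @xml.nodes('//node') AS x(n)
    CROSS APPLY (SELECT nodeId, nodeParent
                 FROM dbo.Nodes
                 WHERE nodeId = x.n.value('@id', 'int')) AS d;

    RETURN;
END
```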

4. Using HierarchyId Data Type - only SQL Server 2008
It is not that obvious how a HierarchyId column works, but explaining its mechanics is beyond the scope of this article. If you have never used it before, don't worry. For this article, we just want to see its practical benefits in action. However, if you are interested, please check the references at the bottom of this article, where you will find links to sites explaining it.

Firstly we will have to add a new column to the Nodes table and compute its values for all existing rows. The initial query is time consuming - it took my poor laptop over 2 hours - but only because the table was not designed to use HierarchyId from the very beginning, so the existing structure has to be translated into the new construct. That translation is the first script. Once the table has been converted, write operations on it will carry only a small, negligible overhead as a trade-off for performance at read time.

How does it work?
Line 3: new column is added named nodeHid
Line 5: root node is assigned root HierarchyId value
Line 10: temporary table is populated with records for which HierarchyId may be computed (at each run only one child per node may be added)
Line 18: while there is something left
Line 20: compute new HierarchyIds
Line 25: remove records already processed from temporary table
Line 27: find more records that need and can be processed
Line 37: add index on nodeHid to improve performance
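The conversion script itself is in the gist (the line numbers above refer to it). A compact sketch of the same idea, following the one-child-per-parent-per-pass rule described above, might look like this:

```sql
ALTER TABLE dbo.Nodes ADD nodeHid hierarchyid NULL;
GO

-- the root node gets the root path
UPDATE dbo.Nodes SET nodeHid = hierarchyid::GetRoot() WHERE nodeParent IS NULL;

-- work queue: one not-yet-converted child per already-converted parent
DECLARE @todo TABLE
(
    nodeId      int PRIMARY KEY,
    parentHid   hierarchyid NOT NULL,
    lastSibling hierarchyid NULL  -- greatest already-assigned sibling, if any
);

INSERT @todo (nodeId, parentHid, lastSibling)
SELECT MIN(c.nodeId), p.nodeHid,
       (SELECT MAX(s.nodeHid) FROM dbo.Nodes AS s WHERE s.nodeParent = p.nodeId)
FROM dbo.Nodes AS c
JOIN dbo.Nodes AS p ON c.nodeParent = p.nodeId
WHERE c.nodeHid IS NULL AND p.nodeHid IS NOT NULL
GROUP BY p.nodeId, p.nodeHid;

WHILE EXISTS (SELECT * FROM @todo)
BEGIN
    -- place each queued node after its last converted sibling
    UPDATE n
    SET nodeHid = t.parentHid.GetDescendant(t.lastSibling, NULL)
    FROM dbo.Nodes AS n
    JOIN @todo AS t ON n.nodeId = t.nodeId;

    DELETE FROM @todo;

    -- find more records that need and can be processed
    INSERT @todo (nodeId, parentHid, lastSibling)
    SELECT MIN(c.nodeId), p.nodeHid,
           (SELECT MAX(s.nodeHid) FROM dbo.Nodes AS s WHERE s.nodeParent = p.nodeId)
    FROM dbo.Nodes AS c
    JOIN dbo.Nodes AS p ON c.nodeParent = p.nodeId
    WHERE c.nodeHid IS NULL AND p.nodeHid IS NOT NULL
    GROUP BY p.nodeId, p.nodeHid;
END
GO

CREATE INDEX IX_Nodes_nodeHid ON dbo.Nodes (nodeHid);
```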

Now we can create a function that will operate on the newly created HierarchyId column. How does it work?
Line 6: HierarchyId of the sub-tree root is found
Line 11: output is created
Line 15: condition to see if the node is descendant of a potential parent
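The function from the gist, sketched along those lines (line numbers above refer to the original listing), could look like this:

```sql
CREATE FUNCTION dbo.GetTreeWithHierarchyId (@rootId int, @maxDepth int)
RETURNS TABLE
AS
RETURN
    SELECT n.nodeId, n.nodeParent,
           -- distance = difference in tree levels between node and root
           distance = n.nodeHid.GetLevel() - r.nodeHid.GetLevel()
    FROM dbo.Nodes AS r
    -- IsDescendantOf() is true for the node itself, so the root is included
    JOIN dbo.Nodes AS n ON n.nodeHid.IsDescendantOf(r.nodeHid) = 1
    WHERE r.nodeId = @rootId
      AND (@maxDepth = -1
           OR n.nodeHid.GetLevel() - r.nodeHid.GetLevel() <= @maxDepth);
```

Thanks to the index on nodeHid, the IsDescendantOf() predicate becomes a range seek on the sub-tree's paths rather than a scan.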

So we now have our test environment and four different implementations of queries retrieving sub-trees, ready for testing. In the next post I will use them to test the relative performance.

2005 FOR XML queries on MSDN
HierarchyId on MSDN
More about HierarchyId on Technet
Excellent article about comparing performance of queries on SQL Server

Wednesday, 1 February 2012

sub-tree performance - part 1

Graph theory is omnipresent in computing, and graph structures are very often used to model and solve problems. But despite being so theoretically common, graphs very often prove to be problematic. Not because they are inherently difficult, but they seem so because they are neglected and avoided whenever possible.

It is true that for many years it was hard to implement common tree functions in relational database systems, but in 2005 MS SQL Server implemented Common Table Expressions (defined by the SQL:1999 standard), which made truly recursive queries possible. In the same version the XML data type was added, and later, in version 2008, the HierarchyId data type emerged, designed specifically to make operations on trees easier.

Nowadays Transact-SQL gives us plenty of tools to deal with data structured as trees: sub-queries, recursive stored procedures, cursors, the XML data type, temporary tables, CTEs and the HierarchyId data type; thus the only remaining problem is to pick the right one.

This is the first post of a series of three based on an article I wrote on Experts-Exchange over two years ago. Time has gone by, but the data and, more importantly, the findings are as relevant to SQL 2008 R2 now as they were for SQL 2008 then, and for that reason I decided to include it on my new blog.

In the next two posts I will set up a test environment and then attempt to demonstrate and compare four different approaches to getting the descendants of a node from a tree implemented as a self-referencing table in SQL Server.

Part 2 - test environment setup with all four methods explained
Part 3 - testing, results and the conclusion

The complete code is also available as a github gist here.

2005 FOR XML queries on MSDN
HierarchyId on MSDN
More about HierarchyId on Technet
Excellent article about comparing performance of queries on SQL Server