Wednesday, 8 February 2012

sub-tree performance - part 2

This is the second part of the series about performance of tree operations in SQL Server in which I show how the test case was constructed and how four test methods were implemented.
Part 1 - the brief introduction to the problem
Part 3 - testing, results and the conclusion

The complete code is also available as a github gist here.

To test performance of hierarchical queries we need a simple self-referenced table populated with test data as well as test queries implemented using different methods. For flexibility the test queries will be implemented as User Defined Functions meeting the following requirement
  • it will take two parameters
  • primary key of the root node from which to start the search
  • maximum distance from the root node (when -1 then search without depth limit)
  • it will return a table containing full sub-tree including the specified root node up to the specified depth limit
  • output will include node primary key, primary key of parent node and distance
First we need to create test data structures and populate it with some test data:
The above code will create a simple Nodes table with
  • nodeId - primary key for the table
  • nodeParent - self referencing to nodeId
  • nodeValue - in real life scenario it could be a foreign key to data table or instead of that many columns could be used storing record information.

Then the table is populated with test data. The CreateAbcTree stored procedure which does the job is using CTE and for that reason it will run only on SQL 2005 or newer. The created data is structured like this:

In total there are 69632 nodes with one root and 14 nodes at first level. Sub-trees staring from those 14 node have different depths and sizes and on each level there is different number of nodes.

Finally an index is created on the nodeParent column to improve performance. The test environment is ready.

Now is time to present possible solutions.

1. Old school solution with a loop - possible since SQL 2000
How does it work?
Line 8: A temporary table is created which can store node key and it's relative distance from sub-tree root. The node in question is inserted into the temporary table as a start point
Line 16: In a loop for most recently added records to the temporary table search for their 1 level descendants and add them to the temporary table.
Line 27: Use temporary table to return relevant nodes from table

Recursive query using Common Table Expression - novelty in SQL Server 2005
How does it work?
Line 5: A CTE is defined
Line 7: Root level is defined
Line 11: Recursive level is defined
Line 13: and joined with root level
Line 17: resulting table is returned

Already at this point one advantage of CTE over Loop is visible - the clarity of code. NOTE: because such CTE is treated as single query I was able use an inline table-valued function to make the code even clearer.

3. XML Data Type - Available since 2005 version of MS SQL Server
How does it work?
This time we need to have two functions. One that will be called recursively and will return XML. And than another one that will use the XML data and return table with results.

Line 3: WITH in this line sets function option and has nothing to do with WITH defining CTE from previous snippet
Line 9: The GetXmlTree() function is called recursively
Line 26: Sub-tree is computed by GetXmlTree() and stored in a variable
Line 30: cross apply is used to effectively join Nodes table with XML data stored in @xml

4. Using HierarchyId Data Type - only SQL Server 2008
It is not that obvious how a HierarchyId column works but explaining it's mechanics is beyond the scope of this article. If you have never used it before, don't worry. For this Article, we just want to see its practical benefits in action. However if you are interested please check references at the bottom of this article where you will find links to sites explaining it.

Firstly we will have to add a new column to the Nodes table and compute it's values for all existing rows. The initial query is time consuming - took my poor laptop over 2 hours - but it is only due to the fact that the table was not designed to use HierarchyId from the very beginning and now the existing structure has to be translated into a new construct. That translation is the first script. Once the table has been converted write operations on it will have only a very little, negligible overhead as a trade-off for performance at read time.

How does it work?
Line 3: new column is added named nodeHid
Line 5: root node is assigned root HierarchyId value
Line 10: temporary table is populated with records for which HierarchyId may be computed (at each run only one child per nod may be added)
Line 18: while there is something left
Line 20: compute new HierarchyIds
Line 25: remove records already processed from temporary table
Line 27: find more records that need and can be processed
Line 37: add index on nodeHid to improve performance

Now we can create a function that will operate on the newly created HierarchyId column How does it work?
Line 6: HierarchyId of the sub-tree root is found
Line 11: output is created
Line 15: condition to see if the node is descendant of a potential parent

So we now have our test environment and four different implementation of  queries retrieving sub-trees ready for testing. In the next post I will use those to test the relative performance.

2005 FOR XML queries on MSDN
HierarchyId on MSDN
More about HierarchyId on Technet
Excelent article about comparing performance of queries on SQL Server

No comments:

Post a Comment