Row Goals Gone Rogue

This post discusses “row goals”, but with a twist. The point is to illustrate how row goals can cause unnecessarily slow queries. First, run this script:

    USE tempdb
    GO
    IF OBJECT_ID ('even') IS NOT NULL DROP TABLE even;
    IF OBJECT_ID ('odd') IS NOT NULL DROP TABLE odd;
    GO
    CREATE TABLE even (c1 int, c2 CHAR(30));
    CREATE TABLE odd (c1 int, c2 CHAR(30));
    GO
    SET NOCOUNT ON;
    DECLARE @x int;
    SET @x = 1;
    BEGIN TRAN;
    WHILE (@x <= 10000)
    BEGIN
        INSERT INTO even (c1, c2) VALUES (@x * 2, '');
        INSERT INTO odd (c1, c2) VALUES (@x * 2 - 1, '');
        IF @x % 1000 = 0
        BEGIN
            RAISERROR ('Inserted %d rows...', 0, 1, @x) WITH NOWAIT;
            COMMIT TRAN;
            BEGIN TRAN;
        END;
        SET @x = @x + 1;
    END;
    WHILE @@TRANCOUNT > 0 COMMIT TRAN;
    GO

This will insert 10,000 even integers into a table named [even], and the same number of odd values into a table named [odd]. Then consider this simple query:

    -- QUERY #1
    SELECT *
    FROM even t1
    INNER JOIN even t2 ON t1.c1 = t2.c1

This joins the [even] table to itself, and will return all 10,000 rows from the table. If you run it with “SET STATISTICS TIME ON; SET STATISTICS PROFILE ON;” to see the query plan that SQL selects, you should find that it chooses a hash match:

StmtText
------------------------------------------------------------------------------------------
  |--Hash Match(Inner Join, HASH:([t1].[c1])=([t2].[c1]), RESIDUAL:([t2].[c1]=[t1].[c1]))
      |--Table Scan(OBJECT:([tempdb].[dbo].[even] AS [t1]))
      |--Table Scan(OBJECT:([tempdb].[dbo].[even] AS [t2]))

While a hash join is often the fastest join option, a hash join also has the highest initial build cost of any join strategy. Before it can evaluate even a single row for the join, it must first hash all of the rows in one of the join inputs and store them in a hash table; “build cost” refers to this up-front work. Only then can it begin to join the rows in the other input against that hash table. In contrast, a nested loop join has an extremely low build cost: there are no expensive up-front delays before it can produce its first row. But by default SQL optimizes with the goal of minimizing total query execution time, and here the hash join completes the query in a small fraction of the time a loop join would take (on my machine the hash join plan runs in about 100ms), despite its high initial setup cost. So the optimizer is doing the right thing by selecting a hash join for this query. However, watch what happens if you add a “TOP 1” to the query:

    -- QUERY #2
    SELECT TOP 1 *
    FROM even t1
    INNER JOIN even t2 ON t1.c1 = t2.c1

The plan changes to use a loop join, and the query completes in a few milliseconds:

   Rows  EstRows  Executes  EstExecs StmtText
   ----- -------- --------- --------- -------------------------------------------------------------
   1     1        1         1         |--Top(TOP EXPRESSION:((1)))
   1     1        1         1             |--Nested Loops(Inner Join, WHERE:([t2].[c1]=[t1].[c1]))
   1     1        1         1                 |--Table Scan(OBJECT:([tempdb].[dbo].[even] AS [t1]))
   1     10000    1         1                 |--Table Scan(OBJECT:([tempdb].[dbo].[even] AS [t2]))

The addition of a TOP 1 tells the optimizer that it doesn’t have to execute the whole query; it only has to produce enough rows out of each child operator in the query to get a single row out the top. The optimizer takes this into account and correctly estimates that for this join it won’t have to scan many rows at all before finding a match and taking an early exit. As a result, it decides to use a nested loop join, because a loop join should be able to produce that single row and exit long before a hash join could even finish building its hash table. This is also the right thing to do. If you want to prove it to yourself, run query #2 with SET STATISTICS TIME ON, then again with an OPTION (HASH JOIN) hint to force a hash join despite its higher cost. You should find that while the hash join was faster for returning all rows (query #1), the loop join is much faster when only one row must be returned. This query optimizer intelligence is possible because of the “row goal” feature.
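Concretely, the comparison looks like this (timings will vary by machine):

    SET STATISTICS TIME ON;

    -- Default plan: nested loops, which returns the first matching row almost immediately
    SELECT TOP 1 *
    FROM even t1
    INNER JOIN even t2 ON t1.c1 = t2.c1;

    -- Forced hash join: nothing can be returned until the hash table has been built
    SELECT TOP 1 *
    FROM even t1
    INNER JOIN even t2 ON t1.c1 = t2.c1
    OPTION (HASH JOIN);

    SET STATISTICS TIME OFF;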

The TOP operator isn’t the only thing that can cause the optimizer to set a row goal. An OPTION (FAST N) query hint sets a row goal of N: the optimizer costs plans in about the same way that it would with a TOP N, but without artificially restricting the number of rows the query returns. The chosen plan should return the first N rows as quickly as possible, but the end-to-end query execution time might be much slower than you’d get without the hint. Semi-joins and anti-semi-joins, like the ones you get from an EXISTS or NOT EXISTS clause, also implicitly set a row goal of 1.
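For example, this variation of query #1 will most likely get the same nested loops plan that the TOP 1 version did, while still returning all 10,000 joined rows; because of that, its end-to-end run time can be far worse than the unhinted hash join plan:

    -- Row goal of 1, but no restriction on the result set
    SELECT *
    FROM even t1
    INNER JOIN even t2 ON t1.c1 = t2.c1
    OPTION (FAST 1);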

If row goals are a new concept for you, I recommend reading through Paul White’s Row Goals In Depth article before continuing.

Now, let’s run the same TOP 1 query, but instead of joining [even] to itself, we’ll join [even] to the [odd] table:

    -- QUERY #3
    SELECT TOP 1 *
    FROM even t1
    INNER JOIN odd t2 ON t1.c1 = t2.c1

You’ll note that this query is extremely slow relative to the others. On my machine it takes about 9 seconds, several thousand times slower than the equivalent self-join on the [even] table. Now you and I know that the [even] table contains only even integers and the [odd] table contains only odd integers, so we can predict that this query will return no rows. But the query optimizer has no concept of “odd” and “even”. Remember that statistics only provide a high-level summary of the data in a column. Here the statistics on the [even] and [odd] tables indicate that the [c1] columns in the two tables have similar data distributions, so from the query optimizer’s perspective this is the same problem it faced in query #2. It guesses that it will very quickly find a matching row, so it chooses a loop join-based plan to minimize operator startup cost; in fact, it selects the same plan that it used for query #2. Unfortunately, in this case the results are disastrous: the plan slogged through 100 million rows (10,000 * 10,000) looking for a match for the join, and it never found one. The Top operator in this plan reflects the row goal of 1.

   Rows      EstRows  Executes  EstExecs StmtText
   --------- -------- --------- --------- -------------------------------------------------------------
   0         1        1         1         |--Top(TOP EXPRESSION:((1)))
   0         1        1         1             |--Nested Loops(Inner Join, WHERE:([t2].[c1]=[t1].[c1]))
   10000     1        1         1                 |--Table Scan(OBJECT:([tempdb].[dbo].[even] AS [t1]))
   100000000 10000    10000     2                 |--Table Scan(OBJECT:([tempdb].[dbo].[odd] AS [t2]))

What happened here? This mess is caused by a combination of things:

  • The “TOP 1” told the query optimizer that it should choose a plan that is likely to return a single row as fast as possible, rather than optimizing for return of all rows that could satisfy the join.
  • The data in the [even] and [odd] tables exhibit negative correlation. That is, the values in [even] are never going to match a value in [odd], even though the data distribution is similar. (If you can make a prediction about the data in one column based on the data in another column, you can say that those columns are correlated.)
  • There are a few exceptions, but in general you can say that the query optimizer is ignorant of data correlation. In this case, the implication is that the QO can’t know at compile time that it will never find a value in [even] that matches a value in [odd].
  • Where the data distributions described by SQL’s statistics indicate overlapping data ranges, the optimizer is generally optimistic, and assumes that it will find a match if it joins those two sets.

The last item is key. It means that there is a fundamental assumption of positive correlation built into join cardinality estimation. Usually when you join two sets this is the right assumption. For example, consider:

    SELECT * FROM Employee
    INNER JOIN Manager ON Employee.ManagerId = Manager.Id

Every employee’s [ManagerId] field will generally refer to a row that exists in the [Manager] table. But occasionally you join two sets together expecting to get few or no matches, even though the high-level data distributions look similar. The example in this post is artificial, but there are plenty of real-world examples. Consider a data cleansing routine that looks for illegal values in data imported from an untrusted source:

    SELECT * FROM ImportedData
    WHERE RegionCode IN (SELECT RegionCode FROM DisallowedRegions)

That query by itself won’t hit this problem because there’s no row goal. But combine it with an EXISTS so that the QO sets an implied row goal of 1, and you may find yourself waiting for a while:

    IF EXISTS (
        SELECT * FROM ImportedData
        WHERE RegionCode IN (SELECT RegionCode FROM DisallowedRegions))
    BEGIN
        RAISERROR ('Found some bad data', 16, 1)
    END;
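If you do get bitten by this pattern, one possible workaround (a sketch, not the only option) is to remove the row goal, for example by counting the offending rows instead of probing for their existence, and optionally hinting a hash join:

    DECLARE @bad_rows int;

    -- COUNT(*) removes the implied row goal of 1, and the hint steers the
    -- membership check toward a single hash-join pass over the data
    SELECT @bad_rows = COUNT(*)
    FROM ImportedData
    WHERE RegionCode IN (SELECT RegionCode FROM DisallowedRegions)
    OPTION (HASH JOIN);

    IF @bad_rows > 0
    BEGIN
        RAISERROR ('Found some bad data', 16, 1);
    END;

The trade-off is that this version always examines all of the matching rows rather than stopping at the first bad one, so it can be slower than the EXISTS version when bad data really is present.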

This is by-design behavior. While the optimizer is pretty smart, the hypothetical world it models to come up with cost estimates doesn’t always capture all of the nuances of real world data. Usually these simplifying assumptions are good, because they allow SQL to produce good plans without taking too long to compile. But sometimes the simplified model leads to suboptimal plans. It pays to know the optimizer’s limits, so that you can quickly recognize these cases during query tuning.

 

(cross posted to MSDN blog)

 

Don’t depend on expression short circuiting in T-SQL (not even with CASE)

There are a fair number of blog posts and forum discussions regarding expression short circuiting in T-SQL. Some of the most authoritative posts, like this one, come to the following conclusions: (a) You cannot depend on expression evaluation order for things like “WHERE <expr1> OR <expr2>”, since the optimizer might choose a plan that evaluates the second predicate before the first one. But, (b) order of evaluation of the expressions in a CASE statement is fixed, so you can depend on deterministic short circuit evaluation of a CASE statement. For example, this wouldn’t protect you from a divide-by-zero error:

    WHERE (@value = 0) OR ((col1 / @value) = 2)

But the idea is that this variation is functionally equivalent and should protect you from the error:

    WHERE 
        CASE 
            WHEN (@value = 0) THEN 2 
            ELSE (col1 / @value)
        END = 2

Before now that’s the advice I would have offered, too. But I just ran into a situation where a CASE statement does not provide predictable short circuiting. Here’s a simplified repro:

    CREATE FUNCTION dbo.test_case_short_circuit (@input INT)
    RETURNS TABLE 
    AS 
    RETURN (
        SELECT calculated_value = 
            CASE 
                WHEN @input <= 0 THEN 0
                ELSE LOG10 (@input)
            END
    );
    GO

    SELECT * FROM dbo.test_case_short_circuit (0);
    GO

This fails with this error:


Msg 3623, Level 16, State 1, Line 2
An invalid floating point operation occurred.

The LOG10 function raises this error when its input is 0 or a negative value. In some cases it appears that the plan may still evaluate the expression in the second CASE branch even when it won’t be using the value. This is a case where CASE doesn’t provide deterministic short circuiting.

I want to make sure no one takes away the conclusion that SQL Server doesn’t support expression short circuiting. It definitely does. It’s just that you don’t have explicit control over the order of expression evaluation — even with CASE, apparently. And if you’re going to depend on short circuiting, you need a deterministic order of expression evaluation.

What can you do about it? One option would be to always scrub things so that an error isn’t possible even when the CASE branch’s output won’t be used. For example, using “ELSE LOG10 (CASE WHEN @input <= 0 THEN 1 ELSE @input END)” in the repro script doesn’t change the behavior of the function, but avoids the error. Unfortunately, that’s not so pretty.
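In other words, the defensive version of the repro function would look something like this (same output for every input, but LOG10 never sees a non-positive value):

    ALTER FUNCTION dbo.test_case_short_circuit (@input INT)
    RETURNS TABLE 
    AS 
    RETURN (
        SELECT calculated_value = 
            CASE 
                WHEN @input <= 0 THEN 0
                -- Feed LOG10 a safe dummy value when @input is non-positive;
                -- the outer CASE still returns 0 for that branch
                ELSE LOG10 (CASE WHEN @input <= 0 THEN 1 ELSE @input END)
            END
    );
    GO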

UPDATE (4 Mar 2011): To be clear, I’ve used CASE before for its short-circuiting properties, and I don’t intend to go back and revisit all of that old code in light of this example. This seems like an edge case to me. But it’s worth being aware that such edge cases exist if you’re thinking about relying on CASE for short circuiting. The most defensive programming approach would be to write the expression in such a way that it doesn’t require particular short circuiting behavior.

(Cross-posted to my MSDN blog.)

UPDATE (21 Mar 2011): The QO team investigated this and decided that it’s a bug. The problem should be fixed in a future SQL release, but it’s still a small risk on existing releases, so keep an eye out for it if you’re relying on CASE for short circuiting together with a function like LOG.

UPDATE (10 Jun 2011): The owners of this code have marked this bug as fixed. From their comments, it sounds like you are supposed to be able to rely on deterministic order of expression evaluation for CASE statements. But any SQL release in your hands right now will still be vulnerable to this problem — keep an eye out for the issue as you use CASE for short circuiting.

Living with SQL’s 900 Byte Index Key Length Limit



(Cross-posted to my MSDN blog here.)

We recently had a situation where we needed to interface with an external non-relational data source that happened to use really long strings to identify entity instances. These identity strings had theoretically unbounded length, but in practice they were never more than 1000-2000 characters. The database layer needed to accept an identifier of this sort and use it to retrieve some information from the relational store that was related to that object. I’ll call the entity type “widget”, and the long string identifier from the external system the “external key”.

We could just store the external key in the database table and use it to locate the correct widget, but we couldn’t afford to scan the widget table for every lookup. The number of rows in the widget table was expected to be large, so we needed the lookup to be indexed. Unfortunately, SQL has a limit of 900 bytes for the keys in each index, and the external key values can be longer than this. If you create an index on a column or combination of columns that could contain more than 900 bytes, you get this warning:

Warning! The maximum key length is 900 bytes. The index 'xyz' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.
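For reference, an index over a wide varchar column is enough to trigger this. In the following hypothetical repro, the indexed column can hold up to 4000 bytes, so the warning fires as soon as the index is created:

    CREATE TABLE widget_test (
        external_key VARCHAR(4000)
    );

    -- Succeeds, but raises the 900-byte maximum key length warning shown above
    CREATE INDEX xyz ON widget_test (external_key);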

Then if you try to insert a row that has more than 900 bytes in the indexed columns, the insert fails with this error:

Msg 1946, Level 16, State 3, Line 11
Operation failed. The index entry of length 1501 bytes for the index 'xyz' exceeds the maximum length of 900 bytes.

Even if this restriction within SQL didn’t exist, I wouldn’t ever think of creating a table with a >1KB primary key. A surrogate primary key, like an IDENTITY column or a GUID/UNIQUEIDENTIFIER column, should be used instead. You’d use the surrogate as your primary key, and all lookups and foreign keys within the data tier would use the surrogate. But the external system doesn’t know anything about the surrogate keys used in the data tier; the only identifier that the external system can pass to the database is that >900 byte external key value. So while a surrogate primary key is a good idea here, it doesn’t solve the immediate problem: we still need to be able to do an efficient (indexed) lookup of an external key value in the database, even though a column that could store the external key values would be too large to index.

If you find yourself in a similar situation, here are some alternatives to consider:

A. Change the external system to use an identifier that is always less than 900 bytes long.

Examine the entity’s attributes: is there some combination of attributes that together form a candidate key, that is, a set of attributes guaranteed to be unique for each entity instance? Even if the external system doesn’t internally use those attributes as an identifier, it could still pass them to the data tier when it wants to look up a widget. In the database, you’d create a UNIQUE index on that column or combination of columns.
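For example, if some combination of widget attributes happens to be unique, the database side is just a unique index over those columns (the column names here are purely hypothetical):

    -- Hypothetical natural key: manufacturer plus model plus serial number
    CREATE UNIQUE INDEX ux_widgets_natural_key
        ON widgets (manufacturer_id, model_number, serial_number);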

B. Require non-key columns as additional criteria for each lookup.

For example, suppose a widget has a “Name” attribute that is always short enough to be indexed. Name is not guaranteed to be globally unique, so neither the external system nor the relational model can use it as a key. But if the column is fairly selective, it may make lookups nearly as efficient as an index seek on the external key itself. In this example you would index the [name] column in the database, and use a lookup routine like this:

    CREATE PROC usp_get_widget @external_key VARCHAR(MAX), @name NVARCHAR(60) AS
        SELECT *
        FROM widgets
        WHERE external_key = @external_key AND name = @name;

SQL will use a seek on the non-clustered [name] index to narrow the results down to a tiny percentage of the table (hopefully no more than a few rows), then it will do a residual filter to get rid of all but the single row that matches the external key value. Note that you can add more than one supplemental search column if necessary to increase the selectivity of the non-key index seek. Also be sure to note that you still need to do the filter on the [external_key] column. That predicate won’t be satisfied by an index seek, but it’s still necessary for correctness. Consider making the full external key INCLUDEd in the nonclustered index, to minimize the number of bookmark lookups the engine needs to do to find the single matching row.
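The supporting index for this approach might look like the following (the index name is made up); INCLUDE-ing the full external key lets the residual predicate be checked without a bookmark lookup:

    -- Seek on [name], then apply the external_key filter from the included
    -- column without going back to the base table
    CREATE INDEX ix_widgets_name
        ON widgets (name)
        INCLUDE (external_key);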

One obvious downside to this approach is that the additional required search criteria complicate the interface between the database and the external component.

C. Truncate the external key to 900 bytes and index the truncated portion of the key.

For example, your widget table might look like this:

    CREATE TABLE widgets (
        surrogate_key INT PRIMARY KEY IDENTITY,
        external_key VARCHAR(MAX),
        external_key_fragment AS LEFT (external_key, 900),
        ...

You’d create a (nonunique) index on the [external_key_fragment] column, and your lookup procedure might look something like this:

    CREATE PROC usp_get_widget @external_key VARCHAR(MAX) AS
    BEGIN
        SELECT *
        FROM widgets
        WHERE external_key = @external_key
        AND external_key_fragment = LEFT (@external_key, 900);
    END;
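The supporting index is just a nonunique index on the fragment column (index name made up). Note that if your version of SQL types the LEFT() expression over the VARCHAR(MAX) column as VARCHAR(MAX), you may need to wrap the computed column definition in an explicit CAST to VARCHAR(900) before it can be indexed:

    -- Nonunique: many external keys could share the same first 900 bytes
    CREATE INDEX ix_widgets_external_key_fragment
        ON widgets (external_key_fragment);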

Of course, this assumes that some predictable 900-byte portion of the external key is reasonably selective. If it’s possible for a large portion of the rows to have the same initial 900 bytes (assuming the fragment is generated in the way shown above), then this won’t help any – the optimizer will correctly estimate that a scan would be more efficient than a seek, and you’ll still end up scanning the entire table for each lookup.

D. Break up or parse the external key into its component parts, and index a selective subset of the parts.

This assumes that the long external key is really a serialized version of a composite key. It must have some underlying structure, and it must be practical to break the key into its component parts within the data tier. It also assumes that some portion of this structured key is selective enough to support an efficient index seek that returns only a few rows.

For example, suppose that the external key was a string that actually encoded a hierarchy like “RootIdentifierX\ChildIdentifierY\GrandChildIdentifierZ\…” If you’re willing to push knowledge of the structure of this serialized key down into the data tier, the database could parse it apart and use the identifiers for the first few levels of the hierarchy to do a secondary lookup that could be satisfied via an index seek. You’d need to store those parts of the external key in separate columns (so they could be indexed) in addition to the full external key. Once you’ve indexed the portions that you can, consider adding the remainder as an INCLUDEd column in the index; that would allow the index seek to locate the single matching row without resorting to a bookmark lookup. (It would also make it less likely that the optimizer would choose to scan the table because it guessed that a seek on the component parts wouldn’t be selective enough.)
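A sketch of what that might look like, with invented column names and sizes:

    CREATE TABLE widgets (
        surrogate_key INT PRIMARY KEY IDENTITY,
        external_key VARCHAR(MAX),
        -- Parsed out of the external key by the data tier
        root_identifier VARCHAR(400),
        child_identifier VARCHAR(400),
        external_key_remainder VARCHAR(MAX)
    );

    -- Seek on the parsed parts; the INCLUDEd remainder lets the full key be
    -- verified without a bookmark lookup
    CREATE INDEX ix_widgets_key_parts
        ON widgets (root_identifier, child_identifier)
        INCLUDE (external_key_remainder);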

E. Hash the external key and index the hash.

This is essentially building a hash table and persisting it in the database. With this approach, your widgets table might look something like this:

    CREATE TABLE widgets (
        surrogate_key INT PRIMARY KEY IDENTITY,
        external_key VARCHAR(MAX),
        -- A computed column that hashes the external key (MD5 returns a
        -- 128-bit hash).
        external_key_hash AS CAST (HASHBYTES('MD5', external_key) AS VARBINARY(16)),
        ...

The [external_key_hash] column would be indexed, and the lookup proc would do this:

    CREATE PROC usp_get_widget @external_key VARCHAR(MAX) AS
        SELECT *
        FROM widgets
        WHERE external_key = @external_key
            AND external_key_hash = CAST (HASHBYTES('MD5', @external_key) AS VARBINARY(16));

It could be argued that the CAST(HASHBYTES(…)) stuff should be hidden within a UDF (ditto for alternatives (C) and (D)):

    CREATE FUNCTION dbo.fn_hash_external_key (@urn VARCHAR(MAX))
    RETURNS VARBINARY(16)
    WITH SCHEMABINDING
    AS
    BEGIN
        RETURN HASHBYTES ('MD5', @urn);
    END;

The computed column and the lookup proc would both reference this function so that the details of the hash function live in a single place. That would indeed be cleaner, but before taking that route be sure you’re aware of the potential problems with scalar UDF performance that are described at http://www.developerfusion.com/community/blog-entry/8389063/tsql-scalar-functions-are-evil/ and http://sqlblog.com/blogs/adam_machanic/archive/2006/08/04/scalar-functions-inlining-and-performance-an-entertaining-title-for-a-boring-post.aspx.
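For completeness, here’s what the wired-up version would look like (a sketch under the same assumptions as above; note the scalar UDF caveat just mentioned):

    CREATE TABLE widgets (
        surrogate_key INT PRIMARY KEY IDENTITY,
        external_key VARCHAR(MAX),
        -- Both the column and the lookup proc call the UDF, so the hash
        -- definition lives in exactly one place
        external_key_hash AS dbo.fn_hash_external_key (external_key)
    );
    GO

    CREATE PROC usp_get_widget @external_key VARCHAR(MAX) AS
        SELECT *
        FROM widgets
        WHERE external_key = @external_key
            AND external_key_hash = dbo.fn_hash_external_key (@external_key);
    GO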

Of all of these, option (A) is the only one that I find very aesthetically pleasing. The others could work in a pinch, though. Please drop me a comment if I’m overlooking some other alternative here.
