I am tuning a slow query and have narrowed the root of the problem down to the very beginning of the execution plan, where SQL Server makes a bad estimate on a WHERE ... IS NULL filter that supports a left anti-join. SQL Server estimates 1 row and therefore favours index scans inside a nested loop, thinking it will only execute them once, when in fact they are executed several thousand times:
I've managed to create an MCVE to replicate the problem.
Set up the test environment
/* INSERT 35000 distinct random numbers into a table */
CREATE TABLE #TableA
(
    ID BIGINT NULL
)

INSERT INTO #TableA
SELECT DISTINCT TOP 35000 a.Random
FROM
(
    SELECT TOP 50000 ABS(CHECKSUM(NEWID())) % 20000000 AS Random
    FROM sys.messages
) a
GO

/* Add a further 15000 rows that already exist in the table.
   Use a loop to increase the possibility of duplicates */
INSERT INTO #TableA
SELECT TOP 1000 ID
FROM #TableA a
ORDER BY NEWID()
GO 15

/* Insert 10000 numbers into another table, that are in the first table */
CREATE TABLE #TableB
(
    ID BIGINT NOT NULL
)

INSERT INTO #TableB
SELECT TOP 10000 *
FROM #TableA

/* Insert 80000 distinct random numbers that are not in the first table */
INSERT INTO #TableB
SELECT DISTINCT TOP 80000 a.Random
FROM
(
    SELECT TOP 100000 ABS(CHECKSUM(NEWID())) % 2000000 AS Random
    FROM sys.messages
) a
LEFT JOIN #TableA b ON a.Random = b.ID
WHERE b.ID IS NULL
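As a quick, optional sanity check (not part of the repro itself, just illustrative), the row counts and the overlap between the two tables can be verified with something like this:

/* Rough expectations: ~50000 rows in #TableA, ~90000 in #TableB, and an overlap of IDs copied from #TableA */
SELECT
    (SELECT COUNT(*) FROM #TableA) AS TableA_Rows,
    (SELECT COUNT(*) FROM #TableB) AS TableB_Rows,
    (SELECT COUNT(DISTINCT a.ID)
     FROM #TableA a
     WHERE EXISTS (SELECT 1 FROM #TableB b WHERE b.ID = a.ID)) AS Distinct_IDs_In_Both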
Then, the query that exhibits the problem is:
SELECT a.ID
FROM #TableA a
LEFT JOIN #TableB b ON a.ID = b.ID
WHERE b.ID IS NULL
This is a fairly simple "show me all the IDs in TableA that are not in TableB".
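To compare the estimated and actual row counts on the Filter operator, I look at the actual execution plan; for anyone reproducing this, one way to capture it is with SET STATISTICS XML (the "Include Actual Execution Plan" option in SSMS shows the same information):

SET STATISTICS XML ON;

SELECT a.ID
FROM #TableA a
LEFT JOIN #TableB b ON a.ID = b.ID
WHERE b.ID IS NULL;

SET STATISTICS XML OFF;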
The execution plan from my test environment is here
We can see a very similar thing happening here as in the plan above, in terms of the Filter operator: SQL Server joins the two tables together, then filters down to the records that are in the left table but not the right table, and it massively underestimates the number of rows that match that predicate.
If I force legacy cardinality estimation, I get a much better estimate on that operator.
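(For reference, one way to force the legacy estimator for just this query on SQL Server 2016 SP1 and later is the USE HINT query hint; on earlier builds OPTION (QUERYTRACEON 9481) is the equivalent.)

SELECT a.ID
FROM #TableA a
LEFT JOIN #TableB b ON a.ID = b.ID
WHERE b.ID IS NULL
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));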
I believe one of the key differences between the old and new estimators is their assumption about the correlation between predicates: the old one assumes there is little correlation between two predicates, whereas the new estimator is more optimistic and assumes a higher correlation?
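As a rough illustration of that difference for conjunctive predicates (my understanding, and it may not map directly onto the anti-join/IS NULL filter here): with two predicates of 10% selectivity each on 100,000 rows, the legacy CE assumes independence and estimates 0.1 × 0.1 × 100,000 = 1,000 rows, whereas the new CE's exponential backoff gives roughly 0.1 × 0.1^(1/2) × 100,000 ≈ 3,162 rows.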
My questions are:
- What causes this underestimation on the newer cardinality estimator?
- Is there a way to fix it other than forcing the legacy cardinality estimation model?