govind: Hint

Wednesday, July 23, 2008

Anti Join algorithm

This article was written in oracle9i release 9.2.0.6.0. The below test result may varry from Oracle one release to another release.

An “anti-join” between two tables returns rows from the first table where no matches are found in the second table. An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is found. Anti-joins are written using the NOT EXISTS or NOT IN constructs. These two constructs differ in how they handle nulls.

Suppose you have the DEPT and EMP tables in the SCOTT schema.

SQL> select count(*) from emp;

COUNT(*)
----------
458752
SQL> select count(*) from dept;

COUNT(*)
----------
5
SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT',TABNAME => 'EMP
',ESTIMATE_PERCENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT',TABNAME => 'DEP
T',ESTIMATE_PERCENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

Let us test the Anti Join

We want to list all the departments in department table where the department does not have any employees in employee table.

Here is one way to write the query.
SQL> select count(*) from(
2 SELECT D1.deptno
3 FROM dept D1
4 MINUS
5 SELECT D2.deptno
6 FROM dept D2, emp E2
7 WHERE D2.deptno = E2.deptno)
8 /

COUNT(*)
----------
2
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=2162 Card=1)
1 0 SORT (AGGREGATE)
2 1 VIEW (Cost=2162 Card=5)
3 2 MINUS
4 3 SORT (UNIQUE NOSORT) (Cost=7 Card=5 Bytes=15)
5 4 INDEX (FULL SCAN) OF 'SYS_C0010206' (UNIQUE)
6 3 SORT (UNIQUE) (Cost=2155 Card=461910 Bytes=2771460)
7 6 NESTED LOOPS (Cost=161 Card=461910 Bytes=2771460)
8 7 INDEX (FAST FULL SCAN) OF 'IDX_EMP' (NON-UNIQUE)
(Cost=161 Card=461910 Bytes=1385730)
9 7 INDEX (UNIQUE SCAN) OF 'SYS_C0010206' (UNIQUE)

The above query will give the desired results, but it might be clearer to write the query using an anti-join:

SQL> SELECT COUNT(D.deptno)
2 FROM dept D
3 WHERE D.deptno NOT IN
4 (
5 SELECT E.deptno
6 FROM emp E
7 WHERE E.deptno IS NOT NULL
8 ) ;

COUNT(D.DEPTNO)
---------------
2
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=806 Card=1 Bytes=6)
1 0 SORT (AGGREGATE)
2 1 NESTED LOOPS (ANTI) (Cost=806 Card=2 Bytes=12)
3 2 INDEX (FULL SCAN) OF 'SYS_C0010206' (UNIQUE)
4 2 INDEX (FAST FULL SCAN) OF 'IDX_EMP' (NON-UNIQUE) (Cost=161 Card=277146 Bytes=831438)

We can write the above query by using NOT EXISTS clause. The below query will also produce the same result as above.

SELECT d.deptno, d.dname
FROM dept D
WHERE NOT EXISTS (SELECT 1
FROM emp E WHERE E.deptno = D.deptno)
/

Here is another good candidate for an ANTI-JOIN access path

Suppose you want a list of customers who have not placed an order within the last ten days. You might start with a query that looks like:
SELECT C.short_name, C.customer_id
FROM customers C
WHERE NOT EXISTS(
SELECT 1
FROM orders O
WHERE O.customer_id = C.customer_id
AND O.order_date > SYSDATE - 10)
ORDER BY C.short_name;
/

SELECT C.short_name, C.customer_id
FROM customers C
WHERE C.customer_id NOT IN(
SELECT O.customer_id
FROM orders O
WHERE O.order_date > SYSDATE - 10
AND O.customer_id IS NOT NULL)
ORDER BY C.short_name
/

In Oracle 9i, an anti-join can be performed using the nested loops, merge, or hash join algorithms. As with a conventional join, Oracle will choose the join algorithm with the lowest cost. Oracle provides the NL_AJ, MERGE_AJ, and HASH_AJ hints in order for you to manipulate the anti-join process if you need to. As with the anti-join hints, the hint is applied to the subquery and not the main body of the query itself.

Semi Join algorithm

This article was written in oracle9i release 9.2.0.6.0. The below test result may varry from Oracle one release to another release.

A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned.

Suppose you have the DEPT and EMP tables in the SCOTT schema.

SQL> select count(*) from emp;

COUNT(*)
----------
458752
SQL> select count(*) from dept;

COUNT(*)
----------
4
SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT',TABNAME => 'EMP
',ESTIMATE_PERCENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT',TABNAME => 'DEP
T',ESTIMATE_PERCENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

Let us test the semi join.

We want a list all departments in dept table if department has atleast one employee in emp table.

Here is conventional way to write the query.

SELECT D.deptno, D.dname FROM dept D, emp E WHERE E.deptno = D.deptno ORDER BY D.deptno;

The problem here is, if the department has 100 employees, then department will appear 100 times. We can eliminate the duplicate records by using the DISTINCT clause. But again, oracle would do more work to get the output.

Here is total time it took to complete the query

SQL> declare
2 l_start NUMBER DEFAULT DBMS_UTILITY.GET_TIME;
3 v_cnt number;
4 begin
5 SELECT count(distinct D.deptno) into v_cnt
6 FROM dept D, emp E
7 WHERE E.deptno = D.deptno
8 ORDER BY D.deptno;
9 dbms_output.put_line(to_char(ROUND(ROUND((DBMS_UTILITY.GET_TIME - l_start)/100, 2)/60,3)));
10 end;
11 /
.009
Here is the execution plan for the query

SQL> SELECT count(distinct D.deptno)
2 FROM dept D, emp E
3 WHERE E.deptno = D.deptno
4 ORDER BY D.deptno;

COUNT(DISTINCTD.DEPTNO)
-----------------------
3
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=160 Card=1 Bytes
=6)

1 0 SORT (GROUP BY)
2 1 NESTED LOOPS (Cost=160 Card=455570 Bytes=2733420)
3 2 INDEX (FAST FULL SCAN) OF 'IDX_EMP' (NON-UNIQUE) (Cost
=160 Card=455570 Bytes=1366710)

4 2 INDEX (UNIQUE SCAN) OF 'SYS_C0010206' (UNIQUE)

We can use semi join between the DEPT and EMP tables instead of a conventional join:

The below query will list the departments that have at least one employee. Whether a department has one employee or 100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department as soon as it finds the first employee in a department, instead of finding all of the employees in each department.

Here is total time it took to complete the query

SQL> declare
2 l_start NUMBER DEFAULT DBMS_UTILITY.GET_TIME;
3 v_cnt number;
4 begin
5 SELECT count(D.deptno) into v_cnt
6 FROM dept D
7 WHERE EXISTS
8 (
9 SELECT 1
10 FROM emp E
11 WHERE E.deptno = D.deptno
12 )
13 ORDER BY D.deptno;
14 dbms_output.put_line(to_char(ROUND(ROUND((DBMS_UTILITY.GET_TIME - l_start)/100, 2)/60,3)));
15 end;
16 /
.001

Here is the execution plan

SQL> SELECT count(D.deptno)
2 FROM dept D
3 WHERE EXISTS
4 (
5 SELECT 1
6 FROM emp E
7 WHERE E.deptno = D.deptno
8 )
9 ORDER BY D.deptno;

COUNT(D.DEPTNO)
---------------
3

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=641 Card=1 Bytes =6)

1 0 SORT (AGGREGATE)
2 1 NESTED LOOPS (SEMI) (Cost=641 Card=3 Bytes=18)
3 2 INDEX (FULL SCAN) OF 'SYS_C0010206' (UNIQUE)
4 2 INDEX (FAST FULL SCAN) OF 'IDX_EMP' (NON-UNIQUE) (Cost
=160 Card=341678 Bytes=1025034)

Here is the another good candidate for SEMI-JOIN algorithim.

Display list of gold-status customers who have placed an order within the last three days. You might start with a query that looks like:

SELECT C.short_name, C.customer_id
FROM customers C
WHERE C.customer_type = 'Gold'
AND EXISTS(
SELECT 1
FROM orders O
WHERE O.customer_id = C.customer_id
AND O.order_date > SYSDATE - 3)
ORDER BY C.short_name
/

Note : If the query contains a DISTINCT/UNION clause, then Oracle cannot transform EXISTS or IN clauses into something that could use a semi-join access path. Semi join can be used in Nested loop, Hash Join & Sort Merge. If semi join method is not possible, then optimizer use the conventional join with lowest cost. Oracle provides hints(NL_SJ, HASH_SJ, and MERGE_SJ) to enforce semi join algorithm. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself.