govind: Hash Join

Tuesday, September 23, 2008

Optimizer Mode - FIRST_ROWS Vs ALL_ROWS

What circumstances we use ALL_ROWS and what circumstances we use FIRST_ROWS optimizer mode? This article is written in oracle9i.

First_rows attempts to optimize the query to get the very first row back to the client as fast as possible. This is good for an interactive client server environment where the client runs a query and shows the user the first 10 rows or so and waits for them to page down to get more.

All_rows attempts to optimize the query to get the very last row as fast as possible. This makes sense in a stored procedure for example where the client does not regain control until the stored procedure completes. You don't care if you have to wait to get the first row if the last row gets back to you twice as fast. In a client server/interactive application you may well care about that.

In TOAD or SQL Navigator, When we select the data, it display immediately. But it does not mean that, it is faster. If we scroll down, it might be fetching the data in the background mode. First_rows is best place for OLTP environment. Also in some reporting environment, if user wants to see initial data first and later see the rest of the data, then first_rows is good option. When we run the query in the stored procedure, first_rows would not be a good choice, all_rows is good option here, because, there is no use to fetch the first few records immediatley inside the stored procedure.

Let us demonstrate the FIRST_ROWS/ALL_ROWS optimizer hint.

Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.8.0 - Production

SQL> create table testtable as select * from user_objects;

Table created.

SQL> create index idx on testtable(object_type);

Index created.

SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCREPORT',TABNAME => 'TESTTABLE',ESTIMATE_PER
CENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

SQL> select count(*) from testtable;

COUNT(*)
----------
5619712

SQL> select count(*) from testtable where object_type='TABLE';

COUNT(*)
----------
2392064

SQL> set autotrace traceonly exp;

SQL> select /*+ all_rows */ * from testtable where object_type='TABLE';

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: ALL_ROWS (Cost=4316 Card=62
3914 Bytes=53032690)

1 0 TABLE ACCESS (FULL) OF 'TESTTABLE' (Cost=4316 Card=623914
Bytes=53032690)

In TESTTABLE table, we have around 5 million records, the above query returns half of the records. Optimizer use full table scan when we use all_rows hint. Because, it needs to read all the rows before it display the data in the screen. The cost for the all_rows is 4316.

SQL> select /*+ first_rows */ * FROM TESTTABLE where object_type='TABLE';

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: FIRST_ROWS (Cost=51502 Card
=623914 Bytes=53032690)

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TESTTABLE' (Cost=51502 C
ard=623914 Bytes=53032690)

2 1 INDEX (RANGE SCAN) OF 'IDX' (NON-UNIQUE) (Cost=1604 Card
=623914)

Optimizer use the index scan when we use the first_rows hint. Because, it use the index scan and reads first few rows to display in the screen, then it reads the rest of the data. The cost is 51502.
so first_rows hint looks faster, but it is really not. In this example, the cost is 12 times more in index scan(first_rows) when we compared to All_rows(full table scan).

When do we use FIRST_ROWS?

To answer this question, we can use first_rows when user want to see the first few rows immediately. It is mostly used in OLTP, some reporting environment.

When do we use ALL_ROWS?

To answer this question, we can use all_rows when user want to process all the rows before we see the output.. Mostly used in OLAP. All_rows use less resource when compared to first_rows.

Important factor in FIRST_ROWS

1. It prefer to use the index scan
2. It prefer to use nested loop join over hash joins. Because, nested loop joins data as selected. but hash join hashes the data in hash table which takes time.
3. Good for OLTP

Important factor in ALL_ROWS

1. It use both index scan & full table scan depends on how many blocks optimizer is reading in the table.
2. Good for OLAP
3. It most likly to use hash join, again depends upon other factors.

Saturday, September 13, 2008

Nested Loop, Hash Join, Sort Merge Join, Cartesian join difference

This article is written in oracle 9.2.0.8.

Nested loop Joins The optimizer uses nested loop joins when joining small number of rows, with a good driving condition between the two tables. It drives from the outer loop to the inner loop. The inner loop is iterated for every row returned from the outer loop, ideally by an index scan. It is inefficient when join returns large number of rows (typically, more than 10,000 rows is considered large), and the optimizer might choose not to use it.

The cost is calculated as below.
cost = access cost of A + (access cost of B * no_of_rows from A)

A nested loop join involves the following steps:

1. The optimizer determines the driving table and designates it as the outer table.
2. The other table is designated as the inner table.
3. For every row in the outer table, Oracle accesses all the rows in the inner table.

For instance,
scott@DB1.US.ORACLE.COM> SELECT emp.empno, dept.dname
2 FROM EMP , DEPT
3 WHERE dept.deptno = 10
4 AND emp.deptno = dept.deptno
5 /

EMPNO DNAME
---------- --------------
7782 ACCOUNTING
7839 ACCOUNTING
7934 ACCOUNTING

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=5 Bytes=85)
1 0 NESTED LOOPS (Cost=3 Card=5 Bytes=85)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'DEPT' (Cost=1 Card=1 Bytes=10)
3 2 INDEX (UNIQUE SCAN) OF 'PK_DEPT' (UNIQUE)
4 1 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=5 Bytes=35)

We can also force the Nested loop join hint as below.

SELECT /*+ USE_NL(emp dept) */ emp.empno, dept.dname
FROM EMP , DEPT WHERE dept.deptno = 10
AND emp.deptno = dept.deptno

Hash Join Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows. This method is best used when the smaller table fits in available memory. The optimizer uses a hash join to join two tables if they are joined using an equijoin(joins with equals predicates) and large amount of data needs to be joined.

The cost of a Hash loop join is calculated by the following formula:
cost=(access cost of A*no_of_hash partitions of B) + access cost of B

For instance,
scott@DB1.US.ORACLE.COM> ;
1 SELECT emp.empno, dept.dname
2 FROM EMP , DEPT
3* WHERE emp.deptno = dept.deptno
scott@DB1.US.ORACLE.COM>
/
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=5 Card=14 Bytes=238)
1 0 HASH JOIN (Cost=5 Card=14 Bytes=238)
2 1 TABLE ACCESS (FULL) OF 'DEPT' (Cost=2 Card=7 Bytes=70)
3 1 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=14 Bytes=98)

We can also force the Hash join hint as below.
SELECT /*+USE_HASH(emp dept) */ emp.empno, dept.dname
FROM EMP , DEPT
WHERE emp.deptno = dept.deptno

Sort-Merge Join Sort merge joins can be used to join rows from two independent sources. Hash joins generally perform better than sort merge joins. On the other hand, sort merge joins can perform better than hash joins if both of the following conditions exist:

1. The row sources are sorted already.
2. A sort operation does not have to be done.

Sort merge joins are almost exclusively used for non-equi joins (>, <, BETWEEN). Sort merge joins perform better than nested loop joins for large data sets. (You cannot use hash joins unless there is an equality condition). In a merge join, there is no concept of a driving table.
The join consists of two steps:
Sort join operation: Both the inputs are sorted on the join key.
Merge join operation: The sorted lists are merged together.

The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:

1. The join condition between two tables is not an equi-join.
2. OPTIMIZER_MODE is set to RULE.
3. HASH_JOIN_ENABLED is false.
4. Because of sorts already required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.
5. The optimizer thinks that the cost of a hash join is higher, based on the settings of HASH_AREA_SIZE and SORT_AREA_SIZE.

The cost of a sort merge join is calculated by the following formula:
cost = access cost of A + access cost of B + (sort cost of A + sort cost of B)

For instance, scott@DB1.US.ORACLE.COM> SELECT emp.empno, dept.dname
2 FROM EMP , DEPT
3 WHERE emp.deptno < dept.deptno

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=5 Bytes=85)
1 0 MERGE JOIN (Cost=6 Card=5 Bytes=85)
2 1 SORT (JOIN) (Cost=2 Card=7 Bytes=70)
3 2 TABLE ACCESS (BY INDEX ROWID) OF 'DEPT' (Cost=2 Card=7 Bytes=70)
4 3 INDEX (FULL SCAN) OF 'PK_DEPT' (UNIQUE) (Cost=1 Card =7)
5 1 SORT (JOIN) (Cost=4 Card=14 Bytes=98)
6 5 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=14 Bytes=98)

We can also force the Hash join hint as below.

scott@DB1.US.ORACLE.COM> SELECT /*+USE_MERGE(emp dept) */
emp.empno, dept.dname
2 FROM EMP , DEPT
3 WHERE emp.deptno < dept.deptno

Cartesian join A Cartesian join is used when one or more of the tables does not have any join conditions to any other tables in the statement. The optimizer joins every row from one data source with all the row from the other data source, creating the Cartesian product of the two sets.

For instance,
scott@DB1.US.ORACLE.COM> select * from emp,dept;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=16 Card=98 Bytes=5194)
1 0 MERGE JOIN (CARTESIAN) (Cost=16 Card=98 Bytes=5194)
2 1 TABLE ACCESS (FULL) OF 'DEPT' (Cost=2 Card=7 Bytes=112)
3 1 BUFFER (SORT) (Cost=14 Card=14 Bytes=518)
4 3 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=14 Bytes=518)