Sunday, September 16, 2007

Query to determine the clustering factor before you create the index

I found the following query useful to determine the clustering factor for the indexes that has not been created yet.

select count(1) clustering_factor
from
(
select dbms_rowid.rowid_block_number(rowid) block_no,
list of the indexed columns,
count(1) num_rows,
LAG(dbms_rowid.rowid_block_number(rowid))
over (order by list of the indexed columns) prev_block
from table_name
group by dbms_rowid.rowid_block_number(rowid),
list of the indexed columns
order by list of the indexed columns
)
where block_no<>prev_block or prev_block is null


Let's take a look at the following example:

SQL> create table test
2 tablespace example
3 as select
4 mod(rownum,8) col1
5 lpad('x',1000) col2
6 from all_objects
7 where rownum<=1000;


Using the following query, we can see that there are about 7 rows per block (block size 8192) that makes the records with same COL1 to be scattered across different blocks:

SQL> select num_rows_in_block, count(1) blocks
2 from
3 (
4 select block_no, count(1) num_rows_in_block
5 from
6 (
7 select dbms_rowid.rowid_block_number(rowid) block_no
8 from test
9 )
10 group by block_no
11 )
12* group by num_rows_in_block
SQL> /

NUM_ROWS_IN_BLOCK BLOCKS
----------------- ----------
6 1
7 142

Let see how big would be the clustering factor of an index built on TEST(COL1).

SQL> select count(1) clustering_factor
2 from
3 (
4 select dbms_rowid.rowid_block_number(rowid) block_no,
5 col1,
6 count(1) num_rows,
7 LAG(dbms_rowid.rowid_block_number(rowid))
8 over (order by col1) prev_block
9 from test
10 group by dbms_rowid.rowid_block_number(rowid), col1
11 order by col1
12 )
13* where block_no<>prev_block or prev_block is null
SQL> /

CLUSTERING_FACTOR
-----------------
1000

The result was pretty obvious.
Now let's add another column COL3 in the table TEST and set its value to mod(col1, 4).
This will make the blocks to contain at most two records with the same value of COL3.

SQL> alter table test
2 add col3 number;

Table altered.

SQL> update test set col3=mod(col1,4);

1000 rows updated.

SQL> commit;

Commit complete.

SQL>

Now let see what would be the value of the clustering factor of the index created on COL3 column:

SQL> select count(1) clustering_factor
2 from
3 (
4 select dbms_rowid.rowid_block_number(rowid) block_no,
5 col3,
6 count(1) num_rows,
7 LAG(dbms_rowid.rowid_block_number(rowid))
8 over (order by col3) prev_block
9 from test
10 group by dbms_rowid.rowid_block_number(rowid), col3
11 order by col3
12 )
13* where block_no<>prev_block or prev_block is null
SQL> /

CLUSTERING_FACTOR
-----------------
572

Now let's create the actual indexes and see what their clustering factor would be:

SQL> create index test_col1_idx on test(col1);

Index created.

SQL> create index test_col3_idx on test(col3);

Index created.

SQL> select index_name, clustering_factor
2 from user_indexes
3* where table_name='TEST'
SQL> /

INDEX_NAME CLUSTERING_FACTOR
------------------------------ -----------------
TEST_COL1_IDX 1000
TEST_COL3_IDX 572

Perfect insights about the importance of the clustering factor for calculating the cost I found in Jonathan Lewis's book Cost-Based Oracle Fundamentals.

Let's go back to the very first query in this post. Why I think this query is useful?
Because sometime, even though I think some columns are perfect combination to create index on, the index might not be used by the optimizer because the clustering factor is big.
It can give you an initial clue how effective some index could be.