histogram是oracle为cbo提供更精确的成本估计而设计的一种直方图数据。histogram能提供列的数据分布,每次分析表后列的分布信息将会被保存在统计表里面,分析时默认的histogram size是75,意
思就是采用75个buckets来表示数据分布。
histogram分为2种类型,基于高度的histogram和基于值的histogram
基于高度的histogram
当histogram buckets的数量少于列的distinct value时,oracle会采用基于高度的直方图反映数据分布,每个bucket容纳相同数量的值。
基本格式如下所示
SQL> column column_name format a20;
SQL> column ENDPOINT_ACTUAL_VALUE format a20;
SQL> SELECT COLUMN_NAME,ENDPOINT_NUMBER, ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE
2 FROM DBA_HISTOGRAMS
3 WHERE TABLE_NAME ='TEST' AND COLUMN_NAME='OBJECT_ID'
4 ORDER BY ENDPOINT_NUMBER;
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU
-------------------- --------------- -------------- --------------------
OBJECT_ID 29 1
OBJECT_ID 44 2
OBJECT_ID 59 3
OBJECT_ID 74 4
OBJECT_ID 75 76
这里'OBJECT_ID'列有从1到76不同的76个值,bucket数量为75个,所以采取了基于高度的直方图。
可以看到1的值占据了1-29号bucket,2的值占据了30-44号bucket,3的值占据了45-59号bucket,4的值占据了60-74号bucket,值5-76占据了75号bucket。所以值1占的比例最大。
基于值的histogram
当histogram buckets>=列的distinct values时,那么Oracle会使用基于值的histogram,每个值将会占据一个bucket,来看一下
SQL> column column_name format a20;
SQL> column ENDPOINT_ACTUAL_VALUE format a20;
SQL> SELECT COLUMN_NAME,ENDPOINT_NUMBER, ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE
2 FROM DBA_HISTOGRAMS
3 WHERE TABLE_NAME ='TEST' AND COLUMN_NAME='OBJECT_ID'
4 ORDER BY ENDPOINT_NUMBER;
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU
-------------------- --------------- -------------- --------------------
OBJECT_ID 19928 1
OBJECT_ID 29927 2
OBJECT_ID 39926 3
OBJECT_ID 49838 4
OBJECT_ID 49839 5
OBJECT_ID 49840 6
OBJECT_ID 49841 7
OBJECT_ID 49842 8
OBJECT_ID 49843 9
OBJECT_ID 49844 10
OBJECT_ID 49845 11
OBJECT_ID 49846 12
OBJECT_ID 49847 13
OBJECT_ID 49848 14
OBJECT_ID 49849 15
OBJECT_ID 49850 16
OBJECT_ID 49851 17
OBJECT_ID 49852 18
OBJECT_ID 49853 19
OBJECT_ID 49854 20
OBJECT_ID 49855 21
OBJECT_ID 49856 22
OBJECT_ID 49857 23
OBJECT_ID 49858 24
OBJECT_ID 49859 25
OBJECT_ID 49860 26
OBJECT_ID 49861 27
OBJECT_ID 49862 28
OBJECT_ID 49863 29
OBJECT_ID 49864 30
OBJECT_ID 49865 31
OBJECT_ID 49866 32
OBJECT_ID 49867 33
OBJECT_ID 49868 34
OBJECT_ID 49869 35
OBJECT_ID 49870 36
OBJECT_ID 49871 37
OBJECT_ID 49872 38
OBJECT_ID 49873 39
OBJECT_ID 49874 40
OBJECT_ID 49875 41
OBJECT_ID 49876 42
OBJECT_ID 49877 43
OBJECT_ID 49878 44
OBJECT_ID 49879 45
OBJECT_ID 49880 46
OBJECT_ID 49881 47
OBJECT_ID 49882 48
OBJECT_ID 49883 49
OBJECT_ID 49884 50
OBJECT_ID 49885 51
OBJECT_ID 49886 52
OBJECT_ID 49887 53
OBJECT_ID 49888 54
OBJECT_ID 49889 55
OBJECT_ID 49890 56
OBJECT_ID 49891 57
OBJECT_ID 49892 58
OBJECT_ID 49893 59
OBJECT_ID 49894 60
OBJECT_ID 49895 61
OBJECT_ID 49896 62
OBJECT_ID 49897 63
OBJECT_ID 49898 64
OBJECT_ID 49899 65
OBJECT_ID 49900 66
OBJECT_ID 49901 67
OBJECT_ID 49902 68
OBJECT_ID 49903 69
OBJECT_ID 49904 70
OBJECT_ID 49905 71
OBJECT_ID 49906 72
OBJECT_ID 49907 73
OBJECT_ID 49908 74
OBJECT_ID 49909 75
很明显可以看出与基于高度的histogram的区别,基于值的histogram为每一个值提供了一个bucket,从上面看到值1有19928行,值2有9999行,值3有9999行,值4有9912行,值5-75都分别只有1行,同样
我们可以看出值1占的比例最大。
直方图的数据我们也看到了,但cbo怎么利用它来计算成本呢,做个10053看一下
alter session set events'10053 trace name context forever,level 1';
select object_name from test where object_id=1;
alter session set events'10053 trace name context off';
提取trace文件中的一段
***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
Table: TEST Alias: TEST
#Rows: 49909 #Blks: 707 AvgRowLen: 95.00
Index Stats::
Index: IND_TEST_OBJECT_ID Col#: 4
LVLS: 1 #LB: 179 #DK: 5 LB/K: 35.00 DB/K: 175.00 CLUF: 879.00
***************************************
SINGLE TABLE ACCESS PATH
Column (#4): OBJECT_ID(NUMBER)
AvgLen: 2.00 NDV: 75 Nulls: 0 Density: 1.0018e-05 Min: 1 Max: 75
Histogram: Freq #Bkts: 75 UncompBkts: 49909 EndPtVals: 75
Table: TEST Alias: TEST
Card: Original: 49909 Rounded: 19928 Computed: 19928.00 Non Adjusted: 19928.00
Access Path: TableScan
Cost: 158.56 Resp: 158.56 Degree: 0
Cost_io: 156.00 Cost_cpu: 18011198
Resp_io: 156.00 Resp_cpu: 18011198
Access Path: index (AllEqRange)
Index: IND_TEST_OBJECT_ID
resc_io: 423.00 resc_cpu: 11183699
ix_sel: 0.39929 ix_sel_with_filters: 0.39929
Cost: 424.59 Resp: 424.59 Degree: 1
Best:: AccessPath: TableScan
Cost: 158.56 Degree: 1 Resp: 158.56 Card: 19928.00 Bytes: 0
上面的selectivity就是通过histogram来计算的,如果不存在histogram的话那么selectivity=(1/number of distinct values),这里的selectivity=(值为1的行数/总行数)=(19928/49909)=0.39929
再根据索引扫描成本计算公式
io_cost=resc_io= blevel+FF*leaf_blocks+FF*clustering_factor
=1+0.39929*179+0.39929*879
=1+71.47291+350.97591
=423
所以index访问成本大于table scan成本,最终cbo选择的路线为采用tablescan.
如果选择object_id=75呢?
alter session set events'10053 trace name context forever,level 1';
select object_name from test where object_id=75;
alter session set events'10053 trace name context off';
***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
Table: TEST Alias: TEST
#Rows: 49909 #Blks: 707 AvgRowLen: 95.00
Index Stats::
Index: IND_TEST_OBJECT_ID Col#: 4
LVLS: 1 #LB: 179 #DK: 75 LB/K: 2.00 DB/K: 11.00 CLUF: 879.00
***************************************
SINGLE TABLE ACCESS PATH
Column (#4): OBJECT_ID(NUMBER)
AvgLen: 2.00 NDV: 75 Nulls: 0 Density: 1.0018e-05 Min: 1 Max: 75
Histogram: Freq #Bkts: 75 UncompBkts: 49909 EndPtVals: 75
Table: TEST Alias: TEST
Card: Original: 49909 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
Access Path: TableScan
Cost: 158.56 Resp: 158.56 Degree: 0
Cost_io: 156.00 Cost_cpu: 18011198
Resp_io: 156.00 Resp_cpu: 18011198
Access Path: index (AllEqRange)
Index: IND_TEST_OBJECT_ID
resc_io: 2.00 resc_cpu: 15503
ix_sel: 1.0018e-05 ix_sel_with_filters: 1.0018e-05
Cost: 2.00 Resp: 2.00 Degree: 1
Best:: AccessPath: IndexRange Index: IND_TEST_OBJECT_ID
Cost: 2.00 Degree: 1 Resp: 2.00 Card: 0.50 Bytes: 0
对于non-popular value(数量比较少的值,比如在基于高度histogram里跨度不超过一个bucket或在基于值的histogram里行数比较少的值),oracle采取了不再去计算selectivity,而是采用density来
做为
做为selectivity,从HIST_HEAD$里可以看到density,oracle采用density function(密度函数)来估计列的密度。
SQL> select OBJ#,COL#,BUCKET_CNT,ROW_CNT,SAMPLE_SIZE,MINIMUM,MAXIMUM,DISTCNT,DENSITY from sys.HIST_HEAD$ where obj#=51933 AND COL#=4;
OBJ# COL# BUCKET_CNT ROW_CNT SAMPLE_SIZE MINIMUM MAXIMUM DISTCNT DENSITY
---------- ---------- ---------- ---------- ----------- ---------- ---------- ---------- ----------
51933 4 49909 75 49909 1 75 75 .000010018
io_cost=resc_io= blevel+FF*leaf_blocks+FF*clustering_factor
=1+1.0018e-05*179+1.0018e-05*879
=1+0.001793222+0.008805822
=2(因为最少会读2个块)
所以最后cbo选择了index
dbms_stats.gather_table_stats(ownname => 'OWNER',tabname => 'TABLE_NAME',method_opt=>'for all indexed columns size auto',cascade => true);