SQL QUERY之不的不说的predicate-kinfinger-ChinaUnix博客

kinfingerasage.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

kinfinger

博客访问： 1810755
博文数量： 335
博客积分： 4690
博客等级：上校
技术积分： 4341
用户组：普通用户
注册时间： 2010-05-08 21:38

个人简介

无聊之人--除了技术，还是技术，你懂得

文章分类

全部博文（335）

ORACEL（4）
REDIS（1）
REDIS（0）
LINUX/UNIX（4）
PHP（1）
COGNOS（1）
COBOL（4）
CICS（3）
EXCEL（2）
MYSQL（7）
DB2（72）
TWS（0）
SA（0）
mainframe（16）
web（10）

javascript（1）
APUE（43）
REXX（7）
work（13）
life（13）
python（95）
c/c++（26）
asage（7）
未分配的博文（6）

文章存档

2016年（29）

2015年（18）

2014年（7）

2013年（86）

2012年（90）

2011年（105）

我的朋友

相关博文

SQL QUERY之不的不说的predicate

分类： DB2/Informix

2012-09-12 11:16:25

本文主要从predicate的概念，分类，以及predicate度访问路径的影响来进行介绍，主要的目的是通过介绍让你对你的sql有一个大体的优化方向，顺便介绍了一下DB2的 runstat utility的部分语法。需要注意的是catalog中的统计信息表对访问路径的影响是很大的，虽然我们不明白具体的机制，但我们可以给DB2优化器提供足够的多信息，让优化器进行选择，达到最优的访问路径

本文涉及的内容主要涉及三张catalog表

SYSCOLUMNS(cardinality）

SYSCOLDIST(frequency）

SYSCOLDIST_HIST(histgram)

Predicate

Concept: A predicate specifies a condition that is true, false, or unknown about a given row

or group.

对于一个给定的行或是行组，predicate可以用来规定条件的真假或是未知，

它主要是用来描述数据的属性，它主要出现在WHERE,HAVING,ON clause

Predicate的一般形式是：

COLUMN_NAME OPERATOR CONDITION_VALUE

Predicate的结果只能是真，假，或是未知，这样对于一个特定的行进行匹配的时候，就可以确定该行是否满足该predicate，也就是说predicate是用来限定（qualified）行的

Predicate依据operator进行分类，主要包含：

Subquery predicate ： any predicate that includes another select statement

For example :

C1 in (select c10 from t_c where xxxx )

Equal predicate：any predicate that is not a subquery predicate and has an equal operator and no not operator,also included are predicates of the form

C1 is null

C is not distinct from

Range predicate：not a subquery predicate and contains one of the following operaotrs:

Between and

IN predicate ： a predicate of the form :

Column in (list of values)

按照predicate 的结果对where条件的影响有可以分为：

Boolean term predicate:

A Boolean term predicate(BTP) is a simple or compound predicate that, when it is evaluated false for a particular row, makes the entire where clause false for that particular row.

所谓BTP就是一个简单或是符合predicate，针对某一行，当该predicate为假时，整个where条件在那个特定的行业为假，该行也就不满足条件（not-qualified）

Non_boolean term predicate

下面举例说明：

假设P1,P2,P3都是simple predicate

SELECT * FROM T1 WHERE P1 AND (P2 OR P3)

针对上面的SQL ：

P1 simplge BTP

P2 SIMPLE NON-BTP

P3 SIMPLE NON-BTP

P2 OR P3 COMPUND BTP

P1 AND (P2 OR P3) COMPUND BTP

针对predicate，还需要解释一点就是having中的predicate是不用来选择access path的

Not predicate: any predicate that is a not subquery predicate and contains a not operator.

一个sql中的predicate的处理可以包含两个stage，因此predicate分为两种

Stage1 predicate --sargable(search argument-able )可用来搜索的参数吧，自己理解

Stage 2 predicate ----residual (剩余的)

Predicate 是按一下顺序进行求值处理的：stage

1 首先应用可索引predicate，

2 stage 1 predicate中没有选作matching predicate的predicate，这样的predicate引用了index column，这个过程也被成为index screening。

3 剩余其它的stage 1 predicate

After data page access,stage 1 prediate are applied to the data

4 stage 2 predicate are applied on the returned data rows

在每一阶段，predicate是按如下的数据类型进行处理的：

1 equal predicate ( incluce in predicate that contain only one item and between predicates that contain the save value twice

2 range predicate and predicate of the forms : column is not null,多说一句，range predicate可以终止索引predicate的匹配（前面的博文已涉及）

3 all other predicate type

下面我们涉及predicate的重点内容：predicate的filter Factors(FF)是如何影响访问路径，只有明白它，你才能写出更加高效的SQL QUERY.

所谓filter Factors,是一个介于0到1之间的值，它表明了predicate为真时，表中的行所占的比例，这些行被这些predicate所限定。DB2使用FF来估计predicate所限定的行的数目，通常对于一个simple predicate，FF是下面三个变量的函数：（参考predicate的构成）

Constant value in the predicate

The operator in the predicte

Statistic on the column in the predicate

上述三个变量，前两个是你可以控制的，第三个变量的统计信息是保存在catalog中的，而catalog中的统计信息你可以通过运行带有update 的ruanstat utility 来进行实时更新。下面给出一些简单的predicate的FF的default value

Predicate FF

Col = constant 1/25

Col <> constant 1 – (1/25)

Col IS NULL 1/25

Col IS NOT DISTINCT FROM 1/25

Col IS DISTINCT FROM 1 – (1/25)

Col IN (constant list) (number of constants)/25

Col Op constant 1/3

Col LIKE constant 1/10

其实说白了，FF的值就是某一值出现在表中行的一个统计特性，即出现的概率值,默认某一个值出现的概率是服从均匀分布(uniform distribution).对于其他比较复杂的predicate，可以使用插值公式来进行计算（interpolation formulation）。

当你收集足够多的统计信息时，DB2优化器进行访问路径选择时，估计的也就越准确，从而可以选出最优的访问路径。

通常DB2收集两种统计信息

Frequency ： the percentage of rows in the table that that contains a value for a column of set of columns

Cardinality

The number of distinct value in a set of columns

这两个统计信息侧重点不同，frequency描述某一个值在表中所占的百分比，而cardinality描述了针对一列或是多列所包含的distinct值的多少，二者都能影响access path的选择。

其中frequency的统计信息放在SYSIBM.SYSCOLDIST 表中，需要关注的列：

COLVALUE FREQUENCYF

该值可以通过runstat utility来收集，在这里顺带解释一下runstat的语法，

索引包含的列数可以能是一列或是多列，因此针对列统计信息的收集的语法包含两种

COLUMN

ALL

Specifies that statistics are to be gathered on all columns in the table.

COLUMN_NAME

Specifies the columns on which statistics are to be gathered.

COLUMN-GUOUP SPEC

Indicates that the specified set of columns are to be treated as a group. This

option enables RUNSTATS to collect a cardinality value on the specified

column group. 针对该column group收集distinct value

FREQVAL

Indicates, when specified with the COLGROUP option, that frequency statistics

are also to be gathered for the specified group of columns. (COLGROUP

indicates that cardinality statistics are to be gathered.) One group of statistics is

gathered for each column.

该参数同colgroup 一同使用，用来收集特定colgroup 的统计信息，其中colgroup表明了cardinality信息会被收集），colgroup中的每一列的统计信息都会被收集。

COUNT integer

Indicates the number of frequently occurring values to be collected from

the specified column group. For example, COUNT 20 means that DB2 collects 20 frequently occurring values from the column group. You must

specify a value for integer; no default value is assumed.

Be careful when specifying a high value for COUNT. Specifying a value of

1000 or more can increase the prepare time for some SQL statements.

MOST

Indicates that the utility is to collect the most frequently occurring values

for the specified set of columns when COLGROUP is specified.

BOTH

Indicates that the utility is to collect the most and the least frequently

occurring values for the specified set of columns when COLGROUP is

specified. If COUNT is n, the utility collects the n least frequently

occurring values and the n most frequently occurring values.

LEAST

Indicates that the utility is to collect the least frequently occurring values

for the specified set of columns when COLGROUP is specified.

总而言之，freval选项在启用colgroup后，用来收集colgroup中某个值出现次数前几(COUNTER)最多(MOST)或是最少(least)，或者都有的值(both include most and least)值的统计信息的统计信息，同时还会收集colgroup每一列的cardinality的统计信息

DB2在选用统计信息的时候，如果针对某一列或是colgroup，你收集的前十个值，那么剩下的值会按均匀分布的概率进行估算。

通常针对col frequency统计信息的收集都是针对single col或是single multi-column ，采集前十或是某些特殊的统计信息，如

Example frequency statistics for values on the YRS_OF_EXPERIENCE column in

an EMPLOYEE table

VALUE FREQUENCY

2 10%

25 15%

26 15%

27 7%

12 0.02%

13 0.01%

40 0.0001%

41 0.00001%

上述的统计信息收集了出现最多前五个，最少前五个的统计信息，但是面对如下查询

Equality predicate with unmatched value:

SELECT EMPID FROM EMPLOYEE T

WHERE T.YRS_OF_EXPERIENCE = 6;

v Range predicate:

SELECT T.EMPID FROM EMPLOYEE T

WHERE T.YRS_OF_EXPERIENCE BETWEEN 5 AND 10;

v Non-local predicate:

SELECT T1.EMPID FROM EMPLOYEE T1, OPENJOBS T2

WHERE T1.SPECIALTY = T2.AREA AND T1.YRS_OF_EXPERIENCE > T2.YRS_OF_EXPERIENCE;

Frequency对FF的估算并不能起作用，如果没有相关的统计信息，DB2只能使用interpolation来计算，为了解决这种问题，你还可以使用runstat utility来收集直方图统计信息以帮助解决FF的计算。在直方图统计信息中，DB2将某一个值的区间进行等分，区间的数目取决于你histogram option 中的NUMQUANTILES，每一个区间都有一个quantilenno，DB2针对每一个区间收集统计信息，下表是numquantiles=7 时，上一个表的直方图统计信息