在MySQL里执行计划对子查询no unnest subquery的有:SUBQUERY、DEPENDENT SUBQUERY、UNCACHEABLE SUBQUERY。
文档上说DEPENDENT SUBQUERY和UNCACHEABLE SUBQUERY不同,DEPENDENT SUBQUERY对于外部的每个不同的关联key执行一次子查询,UNCACHEABLE SUBQUERY意思是没有缓存,也就是外部的每行执行一次子查询。
文档10.8.2 EXPLAIN Output Format:
DEPENDENT SUBQUERY evaluation differs from UNCACHEABLE SUBQUERY evaluation.
For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context.
For UNCACHEABLE SUBQUERY, the subquery is re-evaluated for each row of the outer context.
通过测试,发现和文档有点出入,MySQL貌似对SUBQUERY有缓存优化,DEPENDENT SUBQUERY实际没有看到缓存优化(有优化,但是效果不明显)。
SUBQUERY和DEPENDENT SUBQUERY实际上是没有unnest的子查询的算法,子查询没有unnest,必须是外层结果来驱动子查询,所以,外层结果的行数直接影响子查询执行次数,而子查询执行次数,直接影响整个没有unnest的子查询效率。
一般数据库subquery cache做法(以ORACLE为例):
一般情况下,大多数据库对未unnest的子查询做了缓存优化,这种缓存优化怎么做呢?因为外层与子查询是通过关联key比较的,那么做cache,也就是需要一块内存区域保存外层的key与子查询比较的结果(key-value结构),一般用hash table保存,每一个key保存一行结果即可,像ORACLE
可以保存1024行左右。减少的子查询次数,和外层表的key存储顺序有关,相同值存储越连续,减少的子查询执行次数越多,反而按key存储越离散,则子查询执行次数越多。
在ORACLE里外层驱动子查询不能unnest的有一般子查询(比如子查询有OR)、update关联子查询、标量子查询等,特点是执行计划被驱动部分谓词有系统绑定变量,比如access("C"=:B1),这个:B1是ORACLE CBO自己定义的。
这样subquery cache优化,对于外部行关联值基数(distinct值)特别小的,可以极大减少子查询执行次数,从而提高执行效率,这就是为什么有时候在ORACLE里标量子查询比JOIN效率更高的原因。
在MySQL里执行计划有SUBQUERY和ORACLE里的unnest子查询类似,有缓存,DEPENDENT SUBQUERY貌似没有看到缓存(但是看到改写有cache).
MariaDB有subquery cache:
下面分别研究下MySQL里的subquery cache以及和Oracle的对比。
1)MySQL 执行计划select_type:SUBQUERY,也即不能unnest的子查询materilization有subquery cache
select_type:SUBQUERY对应的子查询使用materialization物化,物化过程只执行一次。
我们希望的是真正需要执行子查询才真正做子查询物化,
比如外部有条件
返回0行,则子查询不需要做物化(
这个后面会说,过滤条件写的位置在子查询前还是后有影响),
需要子查询结果时做materilization,整个过程只做一次,
在内存里创建临时表并按照关联key剔重,并创建索引,然后外部表的行按照子查询关联条件驱动materialize表多次,
这种子查询是有subquery cache优化的,具体子查询被驱动多少次,要看外部表传入的关联key相同值是否连续(这个和ORACLE的FILTER子查询类似)。
materialization只执行一次,会按照关联key剔重然后建立索引,对于没有索引的,剔重后结果集少的,很适用,临时表创建好后,外部结果驱动临时表多次。
按照subquery cache重复的可能减少子查询执行(要看外部行key存储连续程度)。
2)DEPENDENT SUBQUERY就是普通的in to exists方式,外部表每一行会驱动子查询一次,文档上说有缓存,测试没有看到(但是看到改写有cache),实际
树形计划显示不能减少loops次数,效果不是很明显。
3)对于标量子查询,MySQL因为不能走select_type:SUBQUERY,只能走DEPENDENT SUBQUERY,没有cache优化,所以要慎用
另外还有个注意点:
MySQL里subquery unnest走SUBQUERY、DEPENDENT SUBQUERY、UNCACHEABLE SUBQUERY的,如果主表还有过滤条件,
过滤条件写在子查询前还是后,子查询驱动顺序不一样:
1)如果条件不能走索引:这个是规则式的,在ORACLE基于CBO的优化器下没有这种区别
from a where a.cond and subquery 是先执行a.cond过滤,然后每一行驱动执行subquery,较好。
from a where subquery and a.cond 这个是先a的每行驱动执行subquery,如果子查询返回true,则判断a.cond,false则不执行a.cond,一般效率差。
2)如果条件能走索引:MySQL不管condition位置,先执行condition过滤,然后执行子查询,也就是上面2个写法没有区别。
3) 使用subquery hints的,in/exists不能互相转换为相同执行计划,subquery hints.
对应的exists就是intoexists策略,是DEPENDENT SUBQUERY
in的可以指定intoexists或materialization策略。
如果不用hints,MySQL 8对于in和exists可以等价转换。
一般主流数据库都会做subquery cache优化,比如Oracle,这种优化有时候还是很有用的,比如外部表驱动子查询的关联key基数很小,这时候实际执行效率也是很高的,如果没有subquery cache优化,外部每一行驱动一次子查询,必然效率低很多。
MySQL建表语句:
-
create table subq_t1(a int,b int);
-
-
set @@cte_max_recursion_depth = 100000;
-
insert into subq_t1(a,b)
-
WITH recursive t(a,b) AS (
-
SELECT 1 a,1 b
-
from dual
-
UNION ALL
-
SELECT 1,b+1
-
FROM t
-
WHERE t.b<100000
-
)
-
select * from t;
-
-
insert into subq_t1(a,b)
-
WITH recursive t(a,b) AS (
-
SELECT 2 a,1 b
-
from dual
-
UNION ALL
-
SELECT 2,b+1
-
FROM t
-
WHERE t.b<100000
-
)
-
select * from t;
-
-
insert into subq_t1(a,b)
-
WITH recursive t(a,b) AS (
-
SELECT 3 a,1 b
-
from dual
-
UNION ALL
-
SELECT 3,b+1
-
FROM t
-
WHERE t.b<100000
-
)
-
select * from t;
-
-
create table subq_t2(a int,b int);
-
insert into subq_t2(a,b)
-
WITH recursive t(a,b) AS (
-
SELECT 1 a,1 b
-
from dual
-
UNION ALL
-
SELECT a+1,b+1
-
FROM t
-
WHERE t.b<100000
-
)
-
select * from t;
-
-
create index idx_subq_t2 on subq_t2(a);
subq_t1表的a只有3个值:
-
select count(*),count(distinct a) from subq_t1;
-
+----------+-------------------+
-
| count(*) | count(distinct a) |
-
+----------+-------------------+
-
| 300000 | 3 |
-
+----------+-------------------+
subq_t2的a都是不重复的:
-
select count(*),count(distinct a) from subq_t2;
-
+----------+-------------------+
-
| count(*) | count(distinct a) |
-
+----------+-------------------+
-
| 100000 | 100000 |
-
+----------+-------------------+
###标量子查询貌似只能走DEPENDENT SUBQUERY,没有cache
本文对于MySQL子查询研究的都是不能unnest的,具体通过hints控制,也就是select_type:SUBQUERY、DEPENDENT SUBQUERY、UNCACHEABLE SUBQUERY.
能够unnest subquery的,可以走pullout,firstmatch,loosescan,materialization,dupsweedout方式,这个实际是转为JOIN方式,按照JOIN方式走,MySQL对于join走
nested loops的没有cache优化。
1.MySQL子查询计划是SUBQUERY (materialization)的,有subquery cache
exists+subquery(@subq materialization)也走不了materialization,还是和intoexists一样,走DEPENDENT SUBQUERY,只不过改写没有cache.
当然,如果exists不写hints,可能转为in+materialization,也可以有缓存。
SQL如下:
-
explain select/*+subquery(@subq materialization)*/ *
-
from subq_t1 t1
-
where exists
-
(select/*+qb_name(subq)*/ 1 from subq_t2 t2 where t1.a=t2.a);
执行计划如下
-
+----+--------------------+-------+------------+------+---------------+-------------+---------+----------------+--------+----------+-------------+
-
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
-
+----+--------------------+-------+------------+------+---------------+-------------+---------+----------------+--------+----------+-------------+
-
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 299886 | 100.00 | Using where |
-
| 2 | DEPENDENT SUBQUERY | t2 | NULL | ref | idx_subq_t2 | idx_subq_t2 | 5 | employees.t1.a | 1 | 100.00 | Using index |
-
+----+--------------------+-------+------------+------+---------------+-------------+---------+----------------+--------+----------+-------------+
-
2 rows in set, 2 warnings (0.00 sec)
和预期的走SUBQUERY不一样,走的是DEPENDENT SUBQUERY。
查看改写的,虽然是intoexists,但是改写没有,用subquery(@subq intoexists)+in子查询方式虽然执行计划显示一样,但是改写有
-
show warnings\G
-
*************************** 1. row ***************************
-
Level: Note
-
Code: 1276
-
Message: Field or reference 'employees.t1.a' of SELECT #2 was resolved in SELECT #1
-
*************************** 2. row ***************************
-
Level: Note
-
Code: 1003
-
Message: /* select#1 */ select /*+ SUBQUERY(@`subq` MATERIALIZATION) */ `employees`.`t1`.`a` AS `a`,`employees`.`t1`.`b` AS `b` from `employees`.`subq_t1` `t1` where exists(/* select#2 */ select /*+ QB_NAME(`subq`) */ 1 from `employees`.`subq_t2` `t2` where (`employees`.`t1`.`a` = `employees`.`t2`.`a`))
-
2 rows in set (0.00 sec)
从执行计划看,DEPENDENT SUBQUERY 走intoexists策略,外部表subq_t1有30w行,但是a字段只有3个不同值,子查询被驱动30w次,没有减少子查询执行次数。
-
explain analyze
-
select/*+subquery(@subq materialization)*/ *
-
from subq_t1 t1
-
where exists
-
(select/*+qb_name(subq)*/ 1 from subq_t2 t2 where t1.a=t2.a)\G
-
-
*************************** 1. row ***************************
-
EXPLAIN: -> Filter: exists(select #2) (cost=30156.85 rows=299886) (actual time=0.055..1248.470 rows=300000 loops=1)
-
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.031..260.777 rows=300000 loops=1)
-
-> Select #2 (subquery in condition; dependent)
-
-> Limit: 1 row(s) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
-
-> Covering index lookup on t2 using idx_subq_t2 (a=t1.a) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
-
-
1 row in set, 1 warning (1.31 sec)
改为in+materialization:
-
explain
-
select/*+subquery(@subq materialization)*/ *
-
from subq_t1 t1
-
where t1.a in
-
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2);
现在执行计划符合预期,走SUBQUERY,也就是materialization,可以通过树形计划看到:
-
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+--------+----------+-------------+
-
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
-
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+--------+----------+-------------+
-
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 299886 | 100.00 | Using where |
-
| 2 | SUBQUERY | t2 | NULL | index | idx_subq_t2 | idx_subq_t2 | 5 | NULL | 100113 | 100.00 | Using index |
-
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+--------+----------+-------------+
-
2 rows in set, 1 warning (0.01 sec)
可以看扩展信息,子查询使用materialization。
-
show warnings\G
-
*************************** 1. row ***************************
-
Level: Note
-
Code: 1003
-
Message: /* select#1 */ select /*+ SUBQUERY(@`subq` MATERIALIZATION) */ `employees`.`t1`.`a` AS `a`,`employees`.`t1`.`b` AS `b` from `employees`.`subq_t1` `t1` where <in_optimizer>(`employees`.`t1`.`a`,`employees`.`t1`.`a` in ( <materialize> (/* select#2 */ select /*+ QB_NAME(`subq`) */ `employees`.`t2`.`a` from `employees`.`subq_t2` `t2` where true ), <primary_index_lookup>(`employees`.`t1`.`a` in <temporary table> on <auto_distinct_key> where ((`employees`.`t1`.`a` = `<materialized_subquery>`.`a`)))))
-
1 row in set (0.00 sec)
检验子查询执行次数:可以看到主表subq_t1返回30w行,如果没有缓存,物化表应该被驱动30w次,但是现在是loops=4,说明有缓存,subq_t1的a字段只有3个不同值,现在物化表被驱动4次,
比预想的多1次,这还是在subq_t1的a字段是按顺序存储的情况下,如果离散存储,则子查询驱动次数更多。
-
explain analyze
-
select/*+subquery(@subq materialization)*/ *
-
from subq_t1 t1
-
where t1.a in
-
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2)\G
materialization有subquery cache,子查询执行次数少,外部行30w,子查询物化表只执行4次查询,
执行时间从intoexists的1248ms到335ms,效率提升明显。
-
*************************** 1. row ***************************
-
EXPLAIN: -> Filter: <in_optimizer>(t1.a,t1.a in (select #2)) (cost=30156.85 rows=299886) (actual time=124.672..335.684 rows=300000 loops=1)
-
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.026..175.501 rows=300000 loops=1)
-
-> Select #2 (subquery in condition; run only once)
-
-> Filter: ((t1.a = `<materialized_subquery>`.a)) (cost=20078.95..20078.95 rows=1) (actual time=31.166..31.166 rows=1 loops=4)
-
-> Limit: 1 row(s) (cost=20078.85..20078.85 rows=1) (actual time=31.164..31.164 rows=1 loops=4)
-
-> Index lookup on <materialized_subquery> using <auto_distinct_key> (a=t1.a) (actual time=31.163..31.163 rows=1 loops=4)
-
-> Materialize with deduplication (cost=20078.85..20078.85 rows=100113) (actual time=124.626..124.626 rows=100000 loops=1)
-
-> Covering index scan on t2 using idx_subq_t2 (cost=10067.55 rows=100113) (actual time=0.007..64.168 rows=100000 loops=1)
-
-
1 row in set (0.36 sec)
总结:MySQL对不能unnest的子查询,只有SUBQUERY方式(走materilization)才有subquery cache优化,exists写hints不能转为SUBQUERY方式,不写可能能转。
2.MySQL的DEPENDENT SUBQUERY,如果是intoexists对外部表字段cache,但是不能减少子查询执行次数,效果不是太明显
改为in+intoexists看看:
-
explain
-
select/*+subquery(@subq intoexists)*/ *
-
from subq_t1 t1
-
where t1.a in
-
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2);
-
-
+----+--------------------+-------+------------+----------------+---------------+-------------+---------+------+--------+----------+-------------+
-
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
-
+----+--------------------+-------+------------+----------------+---------------+-------------+---------+------+--------+----------+-------------+
-
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 299886 | 100.00 | Using where |
-
| 2 | DEPENDENT SUBQUERY | t2 | NULL | index_subquery | idx_subq_t2 | idx_subq_t2 | 5 | func | 1 | 100.00 | Using index |
-
+----+--------------------+-------+------------+----------------+---------------+-------------+---------+------+--------+----------+-------------+
-
2 rows in set, 1 warning (0.00 sec)
看起来in+intoexists也做了优化,改写里有((`employees`.`t1`.`a`) ,实际上有多大效果,看explain analyze:
-
show warnings\G
-
*************************** 1. row ***************************
-
Level: Note
-
Code: 1003
-
Message: /* select#1 */ select /*+ SUBQUERY(@`subq` INTOEXISTS) */ `employees`.`t1`.`a` AS `a`,`employees`.`t1`.`b` AS `b` from `employees`.`subq_t1` `t1` where <in_optimizer>(`employees`.`t1`.`a`,<exists>(<index_lookup>(<cache>(`employees`.`t1`.`a`) in subq_t2 on idx_subq_t2)))
-
1 row in set (0.00 sec)
查看树形计划:走索引有a=(t1.a),但是索引还是loops=300000,没有减少执行次数:
-
explain analyze
-
select/*+subquery(@subq intoexists)*/ *
-
from subq_t1 t1
-
where t1.a in
-
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2)\G
-
*************************** 1. row ***************************
-
EXPLAIN: -> Filter: <in_optimizer>(t1.a,<exists>(select #2)) (cost=30156.85 rows=299886) (actual time=0.131..939.998 rows=300000 loops=1)
-
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.047..241.175 rows=300000 loops=1)
-
-> Select #2 (subquery in condition; dependent)
-
-> Limit: 1 row(s) (actual time=0.002..0.002 rows=1 loops=300000)
-
-> Covering index lookup on t2 using idx_subq_t2 (a=<cache>(t1.a)) (actual time=0.002..0.002 rows=1 loops=300000)
-
1 row in set (0.99 sec)
对于DEPENDENT SUBQUERY方式,树形计划里显示a=(t1.a),也就是在索引查找时做了cache优化,而不是减少索引扫描的次数。
索引查询平均每次0.002ms.原来exists方式没有,in+intoexists的字段cache,优化效果不是很明显,只在外部表subq_t1的全表扫描上降了30%时间,
而不是子查询表的执行次数减少。
-
explain analyze
-
select/*+subquery(@subq materialization)*/ *
-
from subq_t1 t1
-
where exists
-
(select/*+qb_name(subq)*/ 1 from subq_t2 t2 where t1.a=t2.a)\G
-
-
*************************** 1. row ***************************
-
EXPLAIN: -> Filter: exists(select #2) (cost=30156.85 rows=299886) (actual time=0.070..1230.870 rows=300000 loops=1)
-
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.029..256.100 rows=300000 loops=1)
-
-> Select #2 (subquery in condition; dependent)
-
-> Limit: 1 row(s) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
-
-> Covering index lookup on t2 using idx_subq_t2 (a=t1.a) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
-
-
1 row in set, 1 warning (1.29 sec)
总结:
DEPENDENT SUBQUERY实际上是不能unnest subquery的保底执行方法,经常效率比较烂,子查询能够unnest较好,有5种策略可以使用,能充分使用join/semi join/anti join算法提高效率。
不能unnest的只有materialization和intoexists两种策略,完全是外表驱动子查询走FILTER,只有materialization才有缓存减少子查询执行次数的优化。
下一篇:
MySQL Subquery Cache vs Oracle Subquery Cache子查询缓存_MySQL部分_PART2
阅读(407) | 评论(0) | 转发(0) |