MySQL Subquery Cache vs Oracle Subquery Cache子查询缓存_MySQL部分

博客访问： 1464344
博文数量： 180
博客积分： 0
博客等级：民兵
技术积分： 3926
用户组：普通用户
注册时间： 2011-03-30 13:00

个人简介

About me:Oracle ACE pro,optimistic,passionate and harmonious. Focus on ORACLE,MySQL and other database programming,peformance tuning,db design, j2ee,Linux/AIX,Architecture tech,etc

文章分类

全部博文（180）

CBO（57）
Tuning&Performan（35）
Concepts&Archite（4）
TroubleShooting（5）
New Feature（7）
SQL（52）
PL/SQL（8）
Miscellaneous（3）
MySQL（9）
未分配的博文（0）

文章存档

2025年（8）

2024年（27）

2023年（28）

2022年（43）

2020年（62）

2014年（3）

2013年（9）

我的朋友

相关博文

MySQL Subquery Cache vs Oracle Subquery Cache子查询缓存_MySQL部分_PART1

分类： Mysql/postgreSQL

2024-05-23 15:49:04

在MySQL里执行计划对子查询no unnest subquery的有：SUBQUERY、DEPENDENT SUBQUERY、UNCACHEABLE SUBQUERY。
文档上说DEPENDENT SUBQUERY和UNCACHEABLE SUBQUERY不同，DEPENDENT SUBQUERY对于外部的每个不同的关联key执行一次子查询，UNCACHEABLE SUBQUERY意思是没有缓存，也就是外部的每行执行一次子查询。

文档10.8.2 EXPLAIN Output Format：
DEPENDENT SUBQUERY evaluation differs from UNCACHEABLE SUBQUERY evaluation.
For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context.
For UNCACHEABLE SUBQUERY, the subquery is re-evaluated for each row of the outer context.

通过测试，发现和文档有点出入，MySQL貌似对SUBQUERY有缓存优化，DEPENDENT SUBQUERY实际没有看到缓存优化（有优化，但是效果不明显）。

SUBQUERY和DEPENDENT SUBQUERY实际上是没有unnest的子查询的算法，子查询没有unnest,必须是外层结果来驱动子查询，所以，外层结果的行数直接影响子查询执行次数,而子查询执行次数，直接影响整个没有unnest的子查询效率。

一般数据库subquery cache做法（以ORACLE为例）：
一般情况下，大多数据库对未unnest的子查询做了缓存优化，这种缓存优化怎么做呢？因为外层与子查询是通过关联key比较的，那么做cache,也就是需要一块内存区域保存外层的key与子查询比较的结果（key-value结构），一般用hash table保存，每一个key保存一行结果即可，像ORACLE
可以保存1024行左右。减少的子查询次数，和外层表的key存储顺序有关，相同值存储越连续，减少的子查询执行次数越多，反而按key存储越离散，则子查询执行次数越多。

在ORACLE里外层驱动子查询不能unnest的有一般子查询（比如子查询有OR）、update关联子查询、标量子查询等，特点是执行计划被驱动部分谓词有系统绑定变量，比如access("C"=:B1)，这个:B1是ORACLE CBO自己定义的。

这样subquery cache优化,对于外部行关联值基数（distinct值）特别小的，可以极大减少子查询执行次数，从而提高执行效率，这就是为什么有时候在ORACLE里标量子查询比JOIN效率更高的原因。

在MySQL里执行计划有SUBQUERY和ORACLE里的unnest子查询类似，有缓存，DEPENDENT SUBQUERY貌似没有看到缓存(但是看到改写有cache).
MariaDB有subquery cache:

下面分别研究下MySQL里的subquery cache以及和Oracle的对比。

1）MySQL 执行计划select_type:SUBQUERY,也即不能unnest的子查询materilization有subquery cache

select_type:SUBQUERY对应的子查询使用materialization物化，物化过程只执行一次。
我们希望的是真正需要执行子查询才真正做子查询物化，比如外部有条件
返回0行，则子查询不需要做物化（这个后面会说，过滤条件写的位置在子查询前还是后有影响），需要子查询结果时做materilization,整个过程只做一次，
在内存里创建临时表并按照关联key剔重，并创建索引，然后外部表的行按照子查询关联条件驱动materialize表多次，

这种子查询是有subquery cache优化的，具体子查询被驱动多少次，要看外部表传入的关联key相同值是否连续（这个和ORACLE的FILTER子查询类似）。

materialization只执行一次，会按照关联key剔重然后建立索引，对于没有索引的，剔重后结果集少的，很适用，临时表创建好后，外部结果驱动临时表多次。
按照subquery cache重复的可能减少子查询执行（要看外部行key存储连续程度）。

2）DEPENDENT SUBQUERY就是普通的in to exists方式，外部表每一行会驱动子查询一次，文档上说有缓存，测试没有看到（但是看到改写有cache),实际
树形计划显示不能减少loops次数，效果不是很明显。

3）对于标量子查询，MySQL因为不能走select_type:SUBQUERY,只能走DEPENDENT SUBQUERY，没有cache优化，所以要慎用

另外还有个注意点：
MySQL里subquery unnest走SUBQUERY、DEPENDENT SUBQUERY、UNCACHEABLE SUBQUERY的，如果主表还有过滤条件，
过滤条件写在子查询前还是后，子查询驱动顺序不一样：

1）如果条件不能走索引：这个是规则式的，在ORACLE基于CBO的优化器下没有这种区别
from a where a.cond and subquery 是先执行a.cond过滤，然后每一行驱动执行subquery,较好。
from a where subquery and a.cond 这个是先a的每行驱动执行subquery,如果子查询返回true,则判断a.cond,false则不执行a.cond,一般效率差。

2）如果条件能走索引：MySQL不管condition位置，先执行condition过滤，然后执行子查询，也就是上面2个写法没有区别。

3) 使用subquery hints的，in/exists不能互相转换为相同执行计划，subquery hints.
对应的exists就是intoexists策略，是DEPENDENT SUBQUERY
in的可以指定intoexists或materialization策略。
如果不用hints,MySQL 8对于in和exists可以等价转换。

一般主流数据库都会做subquery cache优化，比如Oracle，这种优化有时候还是很有用的，比如外部表驱动子查询的关联key基数很小，这时候实际执行效率也是很高的，如果没有subquery cache优化，外部每一行驱动一次子查询，必然效率低很多。

MySQL建表语句：

点击(此处)折叠或打开

create table subq_t1(a int,b int);
set @@cte_max_recursion_depth = 100000;
insert into subq_t1(a,b)
WITH recursive t(a,b) AS (
SELECT 1 a,1 b
from dual
UNION ALL
SELECT 1,b+1
FROM t
WHERE t.b<100000
)
select * from t;
insert into subq_t1(a,b)
WITH recursive t(a,b) AS (
SELECT 2 a,1 b
from dual
UNION ALL
SELECT 2,b+1
FROM t
WHERE t.b<100000
)
select * from t;
insert into subq_t1(a,b)
WITH recursive t(a,b) AS (
SELECT 3 a,1 b
from dual
UNION ALL
SELECT 3,b+1
FROM t
WHERE t.b<100000
)
select * from t;
create table subq_t2(a int,b int);
insert into subq_t2(a,b)
WITH recursive t(a,b) AS (
SELECT 1 a,1 b
from dual
UNION ALL
SELECT a+1,b+1
FROM t
WHERE t.b<100000
)
select * from t;
create index idx_subq_t2 on subq_t2(a);

subq_t1表的a只有3个值：

点击(此处)折叠或打开

select count(*),count(distinct a) from subq_t1;
+----------+-------------------+
| count(*) | count(distinct a) |
+----------+-------------------+
| 300000 | 3 |
+----------+-------------------+

subq_t2的a都是不重复的：

点击(此处)折叠或打开

select count(*),count(distinct a) from subq_t2;
+----------+-------------------+
| count(*) | count(distinct a) |
+----------+-------------------+
| 100000 | 100000 |
+----------+-------------------+

###标量子查询貌似只能走DEPENDENT SUBQUERY,没有cache

本文对于MySQL子查询研究的都是不能unnest的，具体通过hints控制，也就是select_type:SUBQUERY、DEPENDENT SUBQUERY、UNCACHEABLE SUBQUERY.

能够unnest subquery的，可以走pullout,firstmatch,loosescan,materialization,dupsweedout方式，这个实际是转为JOIN方式，按照JOIN方式走，MySQL对于join走
nested loops的没有cache优化。

1.MySQL子查询计划是SUBQUERY （materialization)的，有subquery cache

exists+subquery(@subq materialization)也走不了materialization，还是和intoexists一样，走DEPENDENT SUBQUERY，只不过改写没有cache.
当然，如果exists不写hints,可能转为in+materialization,也可以有缓存。

SQL如下：

点击(此处)折叠或打开

explain select/*+subquery(@subq materialization)*/ *
from subq_t1 t1
where exists
(select/*+qb_name(subq)*/ 1 from subq_t2 t2 where t1.a=t2.a);

执行计划如下

点击(此处)折叠或打开

+----+--------------------+-------+------------+------+---------------+-------------+---------+----------------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+-------------+---------+----------------+--------+----------+-------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 299886 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | t2 | NULL | ref | idx_subq_t2 | idx_subq_t2 | 5 | employees.t1.a | 1 | 100.00 | Using index |
+----+--------------------+-------+------------+------+---------------+-------------+---------+----------------+--------+----------+-------------+
2 rows in set, 2 warnings (0.00 sec)

和预期的走SUBQUERY不一样，走的是DEPENDENT SUBQUERY。

查看改写的，虽然是intoexists,但是改写没有,用subquery(@subq intoexists)+in子查询方式虽然执行计划显示一样，但是改写有

点击(此处)折叠或打开

show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1276
Message: Field or reference 'employees.t1.a' of SELECT #2 was resolved in SELECT #1
*************************** 2. row ***************************
Level: Note
Code: 1003
Message: /* select#1 */ select /*+ SUBQUERY(@`subq` MATERIALIZATION) */ `employees`.`t1`.`a` AS `a`,`employees`.`t1`.`b` AS `b` from `employees`.`subq_t1` `t1` where exists(/* select#2 */ select /*+ QB_NAME(`subq`) */ 1 from `employees`.`subq_t2` `t2` where (`employees`.`t1`.`a` = `employees`.`t2`.`a`))
2 rows in set (0.00 sec)

从执行计划看，DEPENDENT SUBQUERY 走intoexists策略，外部表subq_t1有30w行，但是a字段只有3个不同值，子查询被驱动30w次,没有减少子查询执行次数。

点击(此处)折叠或打开

explain analyze
select/*+subquery(@subq materialization)*/ *
from subq_t1 t1
where exists
(select/*+qb_name(subq)*/ 1 from subq_t2 t2 where t1.a=t2.a)\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: exists(select #2) (cost=30156.85 rows=299886) (actual time=0.055..1248.470 rows=300000 loops=1)
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.031..260.777 rows=300000 loops=1)
-> Select #2 (subquery in condition; dependent)
-> Limit: 1 row(s) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
-> Covering index lookup on t2 using idx_subq_t2 (a=t1.a) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
1 row in set, 1 warning (1.31 sec)

改为in+materialization：

点击(此处)折叠或打开

explain
select/*+subquery(@subq materialization)*/ *
from subq_t1 t1
where t1.a in
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2);

现在执行计划符合预期，走SUBQUERY，也就是materialization，可以通过树形计划看到：

点击(此处)折叠或打开

+----+-------------+-------+------------+-------+---------------+-------------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 299886 | 100.00 | Using where |
| 2 | SUBQUERY | t2 | NULL | index | idx_subq_t2 | idx_subq_t2 | 5 | NULL | 100113 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+--------+----------+-------------+
2 rows in set, 1 warning (0.01 sec)

可以看扩展信息，子查询使用materialization。

点击(此处)折叠或打开

show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: /* select#1 */ select /*+ SUBQUERY(@`subq` MATERIALIZATION) */ `employees`.`t1`.`a` AS `a`,`employees`.`t1`.`b` AS `b` from `employees`.`subq_t1` `t1` where <in_optimizer>(`employees`.`t1`.`a`,`employees`.`t1`.`a` in ( <materialize> (/* select#2 */ select /*+ QB_NAME(`subq`) */ `employees`.`t2`.`a` from `employees`.`subq_t2` `t2` where true ), <primary_index_lookup>(`employees`.`t1`.`a` in <temporary table> on <auto_distinct_key> where ((`employees`.`t1`.`a` = `<materialized_subquery>`.`a`)))))
1 row in set (0.00 sec)

检验子查询执行次数：可以看到主表subq_t1返回30w行，如果没有缓存，物化表应该被驱动30w次，但是现在是loops=4，说明有缓存,subq_t1的a字段只有3个不同值，现在物化表被驱动4次，
比预想的多1次，这还是在subq_t1的a字段是按顺序存储的情况下，如果离散存储，则子查询驱动次数更多。

点击(此处)折叠或打开

explain analyze
select/*+subquery(@subq materialization)*/ *
from subq_t1 t1
where t1.a in
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2)\G

materialization有subquery cache,子查询执行次数少，外部行30w,子查询物化表只执行4次查询，
执行时间从intoexists的1248ms到335ms，效率提升明显。

点击(此处)折叠或打开

*************************** 1. row ***************************
EXPLAIN: -> Filter: <in_optimizer>(t1.a,t1.a in (select #2)) (cost=30156.85 rows=299886) (actual time=124.672..335.684 rows=300000 loops=1)
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.026..175.501 rows=300000 loops=1)
-> Select #2 (subquery in condition; run only once)
-> Filter: ((t1.a = `<materialized_subquery>`.a)) (cost=20078.95..20078.95 rows=1) (actual time=31.166..31.166 rows=1 loops=4)
-> Limit: 1 row(s) (cost=20078.85..20078.85 rows=1) (actual time=31.164..31.164 rows=1 loops=4)
-> Index lookup on <materialized_subquery> using <auto_distinct_key> (a=t1.a) (actual time=31.163..31.163 rows=1 loops=4)
-> Materialize with deduplication (cost=20078.85..20078.85 rows=100113) (actual time=124.626..124.626 rows=100000 loops=1)
-> Covering index scan on t2 using idx_subq_t2 (cost=10067.55 rows=100113) (actual time=0.007..64.168 rows=100000 loops=1)
1 row in set (0.36 sec)

总结：MySQL对不能unnest的子查询，只有SUBQUERY方式（走materilization)才有subquery cache优化，exists写hints不能转为SUBQUERY方式，不写可能能转。

2.MySQL的DEPENDENT SUBQUERY，如果是intoexists对外部表字段cache,但是不能减少子查询执行次数，效果不是太明显

改为in+intoexists看看：

点击(此处)折叠或打开

explain
select/*+subquery(@subq intoexists)*/ *
from subq_t1 t1
where t1.a in
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2);
+----+--------------------+-------+------------+----------------+---------------+-------------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+----------------+---------------+-------------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 299886 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | t2 | NULL | index_subquery | idx_subq_t2 | idx_subq_t2 | 5 | func | 1 | 100.00 | Using index |
+----+--------------------+-------+------------+----------------+---------------+-------------+---------+------+--------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

看起来in+intoexists也做了优化，改写里有((`employees`.`t1`.`a`) ，实际上有多大效果，看explain analyze:

点击(此处)折叠或打开

show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: /* select#1 */ select /*+ SUBQUERY(@`subq` INTOEXISTS) */ `employees`.`t1`.`a` AS `a`,`employees`.`t1`.`b` AS `b` from `employees`.`subq_t1` `t1` where <in_optimizer>(`employees`.`t1`.`a`,<exists>(<index_lookup>(<cache>(`employees`.`t1`.`a`) in subq_t2 on idx_subq_t2)))
1 row in set (0.00 sec)

查看树形计划：走索引有a=(t1.a)，但是索引还是loops=300000,没有减少执行次数：

点击(此处)折叠或打开

explain analyze
select/*+subquery(@subq intoexists)*/ *
from subq_t1 t1
where t1.a in
(select/*+qb_name(subq)*/ t2.a from subq_t2 t2)\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: <in_optimizer>(t1.a,<exists>(select #2)) (cost=30156.85 rows=299886) (actual time=0.131..939.998 rows=300000 loops=1)
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.047..241.175 rows=300000 loops=1)
-> Select #2 (subquery in condition; dependent)
-> Limit: 1 row(s) (actual time=0.002..0.002 rows=1 loops=300000)
-> Covering index lookup on t2 using idx_subq_t2 (a=<cache>(t1.a)) (actual time=0.002..0.002 rows=1 loops=300000)
1 row in set (0.99 sec)

对于DEPENDENT SUBQUERY方式，树形计划里显示a=(t1.a)，也就是在索引查找时做了cache优化，而不是减少索引扫描的次数。
索引查询平均每次0.002ms.原来exists方式没有,in+intoexists的字段cache,优化效果不是很明显，只在外部表subq_t1的全表扫描上降了30%时间，
而不是子查询表的执行次数减少。

点击(此处)折叠或打开

explain analyze
select/*+subquery(@subq materialization)*/ *
from subq_t1 t1
where exists
(select/*+qb_name(subq)*/ 1 from subq_t2 t2 where t1.a=t2.a)\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: exists(select #2) (cost=30156.85 rows=299886) (actual time=0.070..1230.870 rows=300000 loops=1)
-> Table scan on t1 (cost=30156.85 rows=299886) (actual time=0.029..256.100 rows=300000 loops=1)
-> Select #2 (subquery in condition; dependent)
-> Limit: 1 row(s) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
-> Covering index lookup on t2 using idx_subq_t2 (a=t1.a) (cost=0.35 rows=1) (actual time=0.002..0.002 rows=1 loops=300000)
1 row in set, 1 warning (1.29 sec)

总结:
DEPENDENT SUBQUERY实际上是不能unnest subquery的保底执行方法，经常效率比较烂，子查询能够unnest较好,有5种策略可以使用，能充分使用join/semi join/anti join算法提高效率。

不能unnest的只有materialization和intoexists两种策略,完全是外表驱动子查询走FILTER，只有materialization才有缓存减少子查询执行次数的优化。

下一篇：MySQL Subquery Cache vs Oracle Subquery Cache子查询缓存_MySQL部分_PART2

阅读(767) | 评论(0) | 转发(0) |

上一篇：ORACLE递归with实现group by快速查找min/max值

下一篇：MySQL Subquery Cache vs Oracle Subquery Cache子查询缓存_MySQL部分_PART2

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6