CASE 1. Don't treat SQL like a general-purpose procedural language
Don't treat SQL like a general-purpose procedural language.
SQL is a set-oriented language, which is fundamentally different from ordinary procedural languages.
In most cases, set-oriented SQL (joins, aggregation, filtering) is far more efficient than procedural SQL (for loops, iteration, recursion, cursors).
In the example below, both approaches compute (1 + 2 + 3 + ... + 1000) * (1 + 2 + 3 + ... + 1000); the second, set-based version is several times faster than the loop version.
-- calculate (1 + 2 + 3... + 1000) * (1 + 2 + 3... + 1000)
-- 1 use for .. loop
-- EXECUTE TIME: 1202ms
do $$
declare
  x bigint;
  y bigint;
  sum_x bigint := 0;
  sum_y bigint := 0;
begin
  for x in 1..1000 loop
    for y in 1..1000 loop
      sum_x := sum_x + x;
      sum_y := sum_y + y;
    end loop;
  end loop;
  sum_x := sum_x * sum_y;
  raise info 'sum = [%]', sum_x;
end$$;
-- 2 use generate_series cross join
-- EXECUTE TIME: 212ms
do $$
declare
  sum bigint := 0;
begin
  select sum(x) * sum(y) into sum
  from generate_series(1, 1000) x, generate_series(1, 1000) y;
  raise info 'sum = [%]', sum;
end$$;
CASE 2. Use EXISTS for existence checks
Use EXISTS for existence checks instead of the count = 0 pattern.
EXISTS stops as soon as it finds a matching row, while COUNT has to scan the whole table.
In the example below, the EXISTS version is more than 10x faster than the COUNT version.
-- test for existence
-- use exists rather than count = 0
create table test_existance(id int);
insert into test_existance select id from generate_series(1, 1000000) id;
-- EXECUTE TIME: 142ms
select count(1) = 0 from test_existance where id = 100;
-- EXECUTE TIME: 10ms
select exists (select 1 from test_existance where id = 100);
CASE 3. Be aware of the difference between CTEs and sub-queries
Consider the following example:
-- CTE cannot use "push down" rule
create table cte_vs_subquery (id int, kind text);
insert into cte_vs_subquery select id, (id / 100000)::text from generate_series(1, 10000000) id;
-- EXECUTE TIME: 3325ms
-- kind in filter was not pushed down
with cte as ( select kind, count(*) cnt from cte_vs_subquery group by kind )
select kind, cnt from cte
where kind in ( '1', '2' , '3', '4') order by cnt;
-- EXECUTE TIME: 2228ms
-- kind in filter was pushed down
select kind, cnt from ( select kind, count(*) cnt from cte_vs_subquery group by kind) t
where kind in ( '1', '2' , '3', '4') order by cnt;
Their execution plans are different:
QUERY PLAN
"Sort (cost=125892.60..125892.61 rows=4 width=40)"
" Sort Key: cte.cnt"
" CTE cte"
" -> HashAggregate (cost=125885.56..125887.56 rows=200 width=32)"
" -> Seq Scan on cte_vs_subquery (cost=0.00..98673.04 rows=5442504 width=32)"
" -> CTE Scan on cte (cost=0.00..5.00 rows=4 width=40)"
" Filter: (kind = ANY ('{1,2,3,4}'::text[]))"
QUERY PLAN
"Sort (cost=126429.93..126429.94 rows=4 width=40)"
" Sort Key: (count(*))"
" -> HashAggregate (cost=126429.81..126429.85 rows=4 width=32)"
" -> Seq Scan on cte_vs_subquery (cost=0.00..125885.56 rows=108850 width=32)"
" Filter: (kind = ANY ('{1,2,3,4}'::text[]))"
These two statements implement the same logic, so where do they differ? In the CTE version, the filter on kind is applied only after the CTE has been fully evaluated; in the sub-query version, the filter on kind is applied while the table is being scanned. The latter is clearly more efficient, because many unneeded rows are eliminated before the aggregation, avoiding redundant aggregation work.
The reason is that here the CTE acts as an optimization fence: the planner does not push predicates from the outer query down into it.
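Note: starting with PostgreSQL 12 the planner can inline a plain CTE on its own, and the behaviour can be controlled explicitly with the MATERIALIZED / NOT MATERIALIZED keywords. A minimal sketch, assuming PostgreSQL 12 or later:
-- PostgreSQL 12+: ask the planner to inline the CTE, so the kind filter
-- can be pushed down into the scan just like in the sub-query version
with cte as not materialized ( select kind, count(*) cnt from cte_vs_subquery group by kind )
select kind, cnt from cte
where kind in ( '1', '2' , '3', '4') order by cnt;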
CASE 4. Avoid nested loop joins on large tables
Avoid nested loop joins on large tables; prefer hash-based algorithms.
The prerequisite is knowing which join predicates cannot be hashed; recall how a hash table is built: a hash join needs an equality condition whose operator supports hashing.
-- prefer hash join first; know when a join predicate is hashable
--
create table f_tbl (id int, kind_ids text);
create table d_tbl (kind_id text, kind text);
insert into f_tbl
select id, array_to_string(array[id/10000, id/10000 + 1, id/10000 + 2], '|')
from generate_series(1, 1000000) id;
insert into d_tbl
select i kind_id , 'kind_'||(i + 10000)::text kind
from generate_series(1, 102) i;
-- do not wrap the sub-query in a CTE (see CASE 3)
-- EXECUTE TIME: 3690ms
select id, kind from f_tbl
join d_tbl on kind_id = any (string_to_array(kind_ids, '|'))
where kind in ('kind_10001', 'kind_10002', 'kind_10102');
-- EXECUTE TIME: 2036ms
select id, kind from
(select id, unnest(string_to_array(kind_ids, '|')) kind_id from f_tbl ) f
join d_tbl d on f.kind_id = d.kind_id
where kind in ('kind_10001', 'kind_10002', 'kind_10102');
Compare the two execution plans:
QUERY PLAN
"Nested Loop (cost=0.00..249792.57 rows=422614 width=36)"
" Join Filter: (d_tbl.kind_id = ANY (string_to_array(f_tbl.kind_ids, '|'::text)))"
" -> Seq Scan on f_tbl (cost=0.00..12055.38 rows=664938 width=36)"
" -> Materialize (cost=0.00..21.89 rows=13 width=64)"
" -> Seq Scan on d_tbl (cost=0.00..21.82 rows=13 width=64)"
" Filter: (kind = ANY ('{kind_10001,kind_10002,kind_10102}'::text[]))"
QUERY PLAN
"Hash Join (cost=21.99..1302057.09 rows=4322097 width=36)"
" Hash Cond: ((unnest(string_to_array(f_tbl.kind_ids, '|'::text))) = d.kind_id)"
" -> Seq Scan on f_tbl (cost=0.00..344524.38 rows=66493800 width=36)"
" -> Hash (cost=21.82..21.82 rows=13 width=64)"
" -> Seq Scan on d_tbl d (cost=0.00..21.82 rows=13 width=64)"
" Filter: (kind = ANY ('{kind_10001,kind_10002,kind_10102}'::text[]))"
Comparing the two queries above: in the first, the join predicate checks whether a value appears in an array; such a condition cannot use a hash join (nor a sort-merge join), so the only option left is a nested loop join.
In the second, the string is first split into its elements, turning each row into several rows, and the result is then joined to the dimension table; the join condition becomes a plain equality, which is exactly what a hash join can work with.
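If you are unsure whether a particular operator can drive a hash join or a merge join, the pg_operator catalog records this; a small check, here for the built-in text equality operator:
-- oprcanhash / oprcanmerge show whether '=' on text supports hash join / merge join
select oprname, oprleft::regtype, oprright::regtype, oprcanhash, oprcanmerge
from pg_operator
where oprname = '=' and oprleft = 'text'::regtype and oprright = 'text'::regtype;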
CASE 5. Replace long or repeatedly executed CASE WHEN lists with a mapping table
Replace a long or frequently executed CASE WHEN list with a mapping table. CASE WHEN evaluates its branches one by one, in order; if the list is long or the expression is evaluated many times, creating a temporary table as a mapping table and joining against it is usually faster.
In fact, a mapping table is not only about efficiency: it is also more readable and easier to maintain.
-- create test table and insert data
create table tbl(id1 text);
insert into tbl values
('1'), ('2'), ('3'), ('4'), ('5'), ('6'), ('7'), ('8'),
('9'), ('10'), ('11'), ('12'), ('13'), ('14'), ('15'), ('16');
-- blow the 16 rows up to 10,000,000 rows by repeatedly re-selecting them
with recursive rct(id1) as
(select id1 from tbl union all select id1 from rct )
insert into tbl
select * from rct limit 10000000;
-- use case when list
-- EXECUTE TIME: 344918ms
create table tmp1 as
select case id1 when '1' then 'a'
when '2' then 'b' when '3' then 'c' when '4' then 'd' when '5' then 'e'
when '6' then 'f' when '7' then 'g' when '8' then 'h' when '9' then 'i'
when '10' then 'j' when '11' then 'k' when '12' then 'l' when '13' then 'm'
when '14' then 'n' when '15' then 'o' when '16' then 'p'
else '' end as id2 from tbl;
-- create a value map (EXECUTE TIME: 11ms)
create table id_map(id1 text, id2 text);
insert into id_map values
('1', 'a'), ('2', 'b'), ('3', 'c'), ('4', 'd'), ('5', 'e'), ('6', 'f'), ('7', 'g'), ('8', 'h'),
('9', 'i'), ('10', 'j'), ('11', 'k'), ('12', 'l'), ('13', 'm'), ('14', 'n'), ('15', 'o'), ('16', 'p');
-- EXECUTE TIME: 26193ms
create table tmp2 as
select id2 from tbl natural join id_map;
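Note that a plain natural join drops rows whose id1 has no entry in id_map, while the CASE WHEN version maps them to '' through its ELSE branch. If those rows must be kept, a left join with coalesce gives the same result; a sketch (tmp3 is just an illustrative target, not timed here):
-- keep rows with no mapping, like the ELSE '' branch does
create table tmp3 as
select coalesce(m.id2, '') as id2
from tbl t left join id_map m on t.id1 = m.id1;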
CASE 6. Watch out for expressions and type casts on indexed columns
In the example below, the first query casts the indexed int column to float before filtering; the second is identical except that there is no cast.
-- CASE 6
-- a cast on the indexed column prevents the index from being used
create table tbl_with_index (col1 int, col2 int);
insert into tbl_with_index select generate_series(1, 100000), generate_series(1, 100000);
create index on tbl_with_index(col1);
select * from tbl_with_index where col1::float > 10.0;
select * from tbl_with_index where col1 > 10;
Look at their execution plans:
QUERY PLAN
"Seq Scan on tbl_with_index (cost=0.00..1943.00 rows=33333 width=8)"
" Filter: ((col1)::double precision > 10::double precision)"
QUERY PLAN
"Bitmap Heap Scan on tbl_with_index (cost=626.62..1486.29 rows=33333 width=8)"
" Recheck Cond: (col1 > 10)"
" -> Bitmap Index Scan on tbl_with_index_col1_idx (cost=0.00..618.29 rows=33333 width=0)"
" Index Cond: (col1 > 10)"
Execution times are not shown here; the point is simply that in this situation the index cannot be used. (For this particular query, with such a non-selective predicate, an index scan might in fact be slower than a sequential scan anyway.)
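If the cast or expression really cannot be avoided, one option is an expression index, so that the filter matches the indexed expression exactly; a sketch:
-- index the cast expression itself, making the col1::float predicate indexable
create index on tbl_with_index ((col1::float));
select * from tbl_with_index where col1::float > 10.0;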
We often index geometry columns; but when distances or areas need to be quantified in metric units (meters / square meters), the columns are frequently cast to geography, and this deserves special attention.
For example, suppose t1 and t2 both have a spatial index on their geometry column, and the two tables are joined with a predicate such as ST_Distance(geometry::geography, geometry::geography) < 100.
That predicate cannot use the indexes, and if both tables are large, the result is a disaster.
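A common fix in PostGIS is ST_DWithin, which expresses the same "within 100 meters" condition in a form the planner can match against a spatial index. A sketch, assuming hypothetical tables t1 and t2 that each have an id column and a geometry column geom:
-- index the geography cast, then let ST_DWithin do the distance test
create index on t1 using gist ((geom::geography));
create index on t2 using gist ((geom::geography));
select t1.id, t2.id
from t1 join t2 on ST_DWithin(t1.geom::geography, t2.geom::geography, 100);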
CASE 7. Use TRUNCATE instead of DELETE to empty a table
Use TRUNCATE instead of DELETE to empty a table; TRUNCATE is much faster, because DELETE marks every row dead one by one (leaving the space to be reclaimed later by VACUUM), while TRUNCATE simply discards the table's data files.
-- delete VS truncate;
-- truncate is better
create table delete_vs_trucate (col1 int, col2 int);
insert into delete_vs_trucate select generate_series(1, 100000), generate_series(1, 100000);
-- EXECUTE TIME: 191ms
delete from delete_vs_trucate;
-- EXECUTE TIME: 11ms
truncate delete_vs_trucate;
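The difference also shows up on disk: after DELETE the table keeps its old pages until VACUUM reclaims them, while TRUNCATE releases the files immediately. A quick way to observe this (exact sizes will vary):
-- compare the on-disk size right after DELETE vs. right after TRUNCATE
select pg_size_pretty(pg_total_relation_size('delete_vs_trucate'));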