Chinaunix首页 | 论坛 | 博客
  • 博客访问: 1172356
  • 博文数量: 166
  • 博客积分: 0
  • 博客等级: 民兵
  • 技术积分: 3765
  • 用 户 组: 普通用户
  • 注册时间: 2011-03-30 13:00
个人简介

About me:Oracle ACE pro,optimistic,passionate and harmonious. Focus on ORACLE,MySQL and other database programming,peformance tuning,db design, j2ee,Linux/AIX,Architecture tech,etc

文章分类

全部博文(166)

文章存档

2024年(21)

2023年(28)

2022年(43)

2020年(62)

2014年(3)

2013年(9)

分类: Oracle

2024-05-06 15:29:59

  本文主要讲解如何从一个表里快速随机获取前N行的方法,用较少IO,利用索引的快速扫描,快速从表里获取随机的前N行数据。 
 
建表语句如下:

点击(此处)折叠或打开

  1. 建表语句如下:
  2.  drop table t;
  3.  create table t(id number,name varchar2(1000),ext varchar2(1000));
  4.  insert into t
  5.  select level,rpad('t',500,'x')||level,rpad('t',500,'x')||level
  6.  from dual
  7.  connect by level<=300000;
  8.  commit;
  9.  create index idx_t on t(id);
  10.  
  11.  exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'t',no_invalidate=>false);
  12. t表总共有30w行:
     select count(*) from t;


      COUNT(*)
    ----------
        300000
         

1.原始方法,全表扫描+排序,随机取10条记录

全表扫描,因为是随机的,所以要读完所有30w行再随机排序取10行,效率非常低
SQL如下:

点击(此处)折叠或打开

  1. select * from (
  2. select * from t order by dbms_random.value
  3. ) where rownum <= 10;
执行计划如下:

点击(此处)折叠或打开

  1. Plan hash value: 3299198703

  2. ------------------------------------------------------------------------------------------------------------------------------
  3. | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem |
  4. ------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.81 | 43171 | 22014 | | | |
  6. |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.81 | 43171 | 22014 | | | |
  7. | 2 | VIEW | | 1 | 300K| 10 |00:00:00.81 | 43171 | 22014 | | | |
  8. |* 3 | SORT ORDER BY STOPKEY| | 1 | 300K| 10 |00:00:00.81 | 43171 | 22014 | 106K| 106K|96256 (0)|
  9. | 4 | TABLE ACCESS FULL | T | 1 | 300K| 300K|00:00:00.18 | 43171 | 22014 | | | |
  10. ------------------------------------------------------------------------------------------------------------------------------

  11. Predicate Information (identified by operation id):
  12. ---------------------------------------------------

  13.    1 - filter(ROWNUM<=10)
  14.    3 - filter(ROWNUM<=10)
通过执行计划可以看到ID=4,读了表T的全部数据300K行,然后排序取10行。耗时0.81s,逻辑读43171,从执行计划看,主要耗时在排序上,
因为是对30w行随机排序,所以效率很低,后续优化要考虑避免扫描全量数据以及避免全量数据的随机排序。

2.改进方法,利用覆盖索引提高效率

因为id有索引,索引比表小,可以采用覆盖索引,先从索引里随机获取前10行,然后再到表里查。

可以从下面看到IDX_T索引只有6M,比表345M少很多:

点击(此处)折叠或打开

  1. OWNER SEGMENT_NAME SEGMENT_TYPE SIZE_MB INIEXT MAXEXT
  2. ---------- ------------------------------ ------------------ ---------- ---------- ----------
  3. DINGJUN123 T TABLE 345 65536 1048576
  4. DINGJUN123 IDX_T INDEX 6 65536 1048576
为了能够走覆盖索引,将id改为not null:

点击(此处)折叠或打开

  1. alter table t modify id not null;
SQL如下:

点击(此处)折叠或打开

  1. select * from t where id in (
  2. select id
  3. from(
  4. select id from t order by dbms_random.value
  5. ) where rownum <= 10
  6. );
执行计划如下:

点击(此处)折叠或打开

  1. Plan hash value: 753993260

  2. ---------------------------------------------------------------------------------------------------------------------------------------
  3. | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem |
  4. ---------------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.41 | 699 | 5 | | | |
  6. | 1 | NESTED LOOPS | | 1 | | 10 |00:00:00.41 | 699 | 5 | | | |
  7. | 2 | NESTED LOOPS | | 1 | 10 | 10 |00:00:00.41 | 689 | 0 | | | |
  8. | 3 | VIEW | VW_NSO_1 | 1 | 10 | 10 |00:00:00.41 | 676 | 0 | | | |
  9. | 4 | HASH UNIQUE | | 1 | 10 | 10 |00:00:00.41 | 676 | 0 | 1397K| 1397K| 1259K (0)|
  10. |* 5 | COUNT STOPKEY | | 1 | | 10 |00:00:00.41 | 676 | 0 | | | |
  11. | 6 | VIEW | | 1 | 300K| 10 |00:00:00.41 | 676 | 0 | | | |
  12. |* 7 | SORT ORDER BY STOPKEY | | 1 | 300K| 10 |00:00:00.41 | 676 | 0 | 2048 | 2048 | 2048 (0)|
  13. | 8 | INDEX FAST FULL SCAN | IDX_T | 1 | 300K| 300K|00:00:00.03 | 676 | 0 | | | |
  14. |* 9 | INDEX RANGE SCAN | IDX_T | 10 | 1 | 10 |00:00:00.01 | 13 | 0 | | | |
  15. | 10 | TABLE ACCESS BY INDEX ROWID| T | 10 | 1 | 10 |00:00:00.01 | 10 | 5 | | | |
  16. ---------------------------------------------------------------------------------------------------------------------------------------

  17. Predicate Information (identified by operation id):
  18. ---------------------------------------------------

  19.    5 - filter(ROWNUM<=10)
  20.    7 - filter(ROWNUM<=10)
  21.    9 - access("ID"="ID")
通过覆盖索引,0.41s,逻辑读从4.3w降低到676,主要耗时还是在随机排序上(占95%以上)。

3.使用索引min/max快速扫描构造随机数提高效率

优化主要集中在下面两点:
1)怎么避免扫描全量数据
2)怎么避免对全量数据随机排序
避免扫描全量数据,必须构造一个ID值,从此ID处开始扫描,然后用ID排序,则可以消除以上两点。方法如下:

索引的min,max查询效率很高(ORACLE不能写在同一个SELECT里否则不能用快速扫描),只需要扫描左右侧索引NO.1叶子节点,可以实现快速扫描,
然后利用floor(tmp.min_id + (tmp.max_id - tmp.min_id + 1) * dbms_random.value) 构造随机id,然后与原表关联,利用索引列id排序,
因为是索引列,可以利用索引消除排序,然后取100条,为了防止断号(因为取10条的话,这个id是连续的,也不够连续,如果id本身不是连续的,
也可能取不到10条),先取100条,然后从100条里随机取10条,比全扫描效率高很多

SQL如下:

点击(此处)折叠或打开

  1. select *
  2.   from t
  3.  where id in
  4.        (select id
  5.           from (select id
  6.                   from (select id
  7.                           from (select id
  8.                                   from t
  9.                                  where t.id >=
  10.                                        (select floor(tmp.min_id +
  11.                                                      (tmp.max_id - tmp.min_id + 1) *
  12.                                                      dbms_random.value) id
  13.                                           from (select (select max(id) from t) max_id,
  14.                                                        (select min(id) from t) min_id
  15.                                                   from dual) tmp)
  16.                                  order by t.id)
  17.                          where rownum <= 100)
  18.                  order by dbms_random.value)
  19.          where rownum <= 10);
执行计划如下:

点击(此处)折叠或打开

  1. Plan hash value: 72274151

  2. -----------------------------------------------------------------------------------------------------------------------------------------
  3. | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
  4. -----------------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.01 | 30 | | | |
  6. | 1 | NESTED LOOPS | | 1 | | 10 |00:00:00.01 | 30 | | | |
  7. | 2 | NESTED LOOPS | | 1 | 10 | 10 |00:00:00.01 | 20 | | | |
  8. | 3 | VIEW | VW_NSO_1 | 1 | 10 | 10 |00:00:00.01 | 7 | | | |
  9. | 4 | HASH UNIQUE | | 1 | 10 | 10 |00:00:00.01 | 7 | 1397K| 1397K| 847K (0)|
  10. |* 5 | COUNT STOPKEY | | 1 | | 10 |00:00:00.01 | 7 | | | |
  11. | 6 | VIEW | | 1 | 100 | 10 |00:00:00.01 | 7 | | | |
  12. |* 7 | SORT ORDER BY STOPKEY | | 1 | 100 | 10 |00:00:00.01 | 7 | 2048 | 2048 | 2048 (0)|
  13. | 8 | VIEW | | 1 | 100 | 100 |00:00:00.01 | 7 | | | |
  14. |* 9 | COUNT STOPKEY | | 1 | | 100 |00:00:00.01 | 7 | | | |
  15. | 10 | VIEW | | 1 | 100 | 100 |00:00:00.01 | 7 | | | |
  16. |* 11 | INDEX RANGE SCAN | IDX_T | 1 | 100 | 100 |00:00:00.01 | 7 | | | |
  17. | 12 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 | | | |
  18. | 13 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  19. | 14 | INDEX FULL SCAN (MIN/MAX) | IDX_T | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  20. | 15 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  21. | 16 | INDEX FULL SCAN (MIN/MAX)| IDX_T | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  22. |* 17 | INDEX RANGE SCAN | IDX_T | 10 | 1 | 10 |00:00:00.01 | 13 | | | |
  23. | 18 | TABLE ACCESS BY INDEX ROWID | T | 10 | 1 | 10 |00:00:00.01 | 10 | | | |
  24. -----------------------------------------------------------------------------------------------------------------------------------------

  25. Predicate Information (identified by operation id):
  26. ---------------------------------------------------

  27.    5 - filter(ROWNUM<=10)
  28.    7 - filter(ROWNUM<=10)
  29.    9 - filter(ROWNUM<=100)
  30.   11 - access("T"."ID">=)
  31.   17 - access("ID"="ID")
从执行计划看,改进的SQL逻辑读只需要30,执行时间0.01s。

4)使用索引min/max+递归with构造前N个随机数
和3类似,只不过这里使用递归with,先构造一个随机id,然后递归,取100条,然后随机排序后取10条。
如下所示:
SQL如下:

点击(此处)折叠或打开

  1. with tmp(rn,id) as
  2. (
  3. select 1 rn,floor(tmp.min_id +(tmp.max_id - tmp.min_id + 1) * dbms_random.value) id
  4.         from (select (select max(id) from t) max_id,
  5.                       (select min(id) from t) min_id
  6.                      from dual) tmp
  7. union all
  8. select rn+1,id
  9. from tmp
  10. where rn <100
  11. )
  12. select * from t where id in (
  13. select id from (
  14. select id+rn as id from tmp
  15. order by dbms_random.value
  16. ) where rownum <= 10
  17. );
执行计划如下:

点击(此处)折叠或打开

  1. Plan hash value: 2701499939

  2. --------------------------------------------------------------------------------------------------------------------------------------------------
  3. | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
  4. --------------------------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.01 | 22 | | | |
  6. | 1 | NESTED LOOPS | | 1 | | 10 |00:00:00.01 | 22 | | | |
  7. | 2 | NESTED LOOPS | | 1 | 1 | 10 |00:00:00.01 | 13 | | | |
  8. | 3 | VIEW | VW_NSO_1 | 1 | 2 | 10 |00:00:00.01 | 4 | | | |
  9. | 4 | HASH UNIQUE | | 1 | 1 | 10 |00:00:00.01 | 4 | 1397K| 1397K| 1092K (0)|
  10. |* 5 | COUNT STOPKEY | | 1 | | 10 |00:00:00.01 | 4 | | | |
  11. | 6 | VIEW | | 1 | 2 | 10 |00:00:00.01 | 4 | | | |
  12. |* 7 | SORT ORDER BY STOPKEY | | 1 | 2 | 10 |00:00:00.01 | 4 | 2048 | 2048 | 2048 (0)|
  13. | 8 | VIEW | | 1 | 2 | 100 |00:00:00.01 | 4 | | | |
  14. | 9 | UNION ALL (RECURSIVE WITH) BREADTH FIRST| | 1 | | 100 |00:00:00.01 | 4 | | | |
  15. | 10 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  16. | 11 | INDEX FULL SCAN (MIN/MAX) | IDX_T | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  17. | 12 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  18. | 13 | INDEX FULL SCAN (MIN/MAX) | IDX_T | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  19. | 14 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 | | | |
  20. | 15 | RECURSIVE WITH PUMP | | 100 | | 99 |00:00:00.01 | 0 | | | |
  21. |* 16 | INDEX RANGE SCAN | IDX_T | 10 | 1 | 10 |00:00:00.01 | 9 | | | |
  22. | 17 | TABLE ACCESS BY INDEX ROWID | T | 10 | 1 | 10 |00:00:00.01 | 9 | | | |
  23. --------------------------------------------------------------------------------------------------------------------------------------------------

  24. Predicate Information (identified by operation id):
  25. ---------------------------------------------------

  26.    5 - filter(ROWNUM<=10)
  27.    7 - filter(ROWNUM<=10)
  28.   16 - access("ID"="ID")
逻辑读22,执行时间0.01s。

总结:
从一个表里随机取10条记录的关键在于:
1)避免扫描全部数据
2)避免对全部数据随机排序
所以关键在于构造一个随机ID,然后用这个ID为起始点去表里找,可以走索引查找,然后可以随机
取100条(为了避免ID不连续的断号问题),减少扫描和排序量,充分利用索引的高效扫描
提高效率。
构造随机ID,利用扫描索引取min,max可以快速扫描,然后利用min_id为起点+随机数,
这个随机数用(tmp.max_id - tmp.min_id + 1) * dbms_random.value
构造,然后用floor取整即可。




阅读(5913) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~