分类:
2008-10-16 19:10:15
在报表数据库的后台alert文件中发现了这个错误,简单记录一下问题的诊断和解决过程。数据库版本9204 for Solaris sparc64。 错误信息如下:
Errors in file /u1/oracle/admin/repdb01/bdump/repdb01_j000_9888.trc: ORA-12012: error on auto execute of job 394 ORA-04030: out of process memory when trying to allocate 1032704 bytes (hash- join subh,kllcqas:kllsltba) ORA-06512: at "JSGOV_OLD.INSERT_HOS_INFO", line 9 ORA-06512: at "JSGOV_OLD.P_GEN_STAT", line 11 ORA-06512: at line 1 |
这个JOB是昨天才添加到数据库中的,而运行这个JOB的用户是从其他数据库迁移到当前数据库中的。
产生问题的情况有很多种,有可能是本地配置和远端配置的区别造成的;也可能是由于源数据库是9201,而当前数据库是9204,版本的差异造成了执行计划改变;还有可能是迁移过程中出现了错误从而引起了问题。
从错误本身观察,是由于无法为HASH JOIN分配1M的内存所导致的。观察数据库的PGA内存设置:
SQL> SHOW PARAMETER PGA NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ pga_aggregate_target big integer 838860800 |
对于一个报表系统来说,这个设置确实小了一点,但是考虑这个数据库在处理很多比当前数据量大得多的情况都未出现这个问题,基本上可以确定不是系统参数设置造成的。
如果排除第一种情况,那么无论是迁移出现了问题,还是版本差异的问题,最大的可能性都是执行计划发生了变化,那么现在就需要找到出现问题的SQL语句,检查执行计划。
根据错误信息给出的过程名称和位置提示,可以轻易的找到出现问题的SQL语句,不过由于SQL语句太长,而且和问题的关系并不太大,这里将SQL语句省略,只列出这个SQL语句对应的执行计划:
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY); PLAN_TABLE_OUTPUT ----------------------------------------------------------------------------------------------- ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 495 | 16455 | | 1 | NESTED LOOPS | | 1 | 495 | 16455 | | 2 | HASH JOIN | | 1 | 457 | 16454 | | 3 | HASH JOIN | | 1 | 195 | 16429 | | 4 | MERGE JOIN CARTESIAN | | 1 | 130 | 10952 | | 5 | VIEW | | 1 | 65 | 5476 | | 6 | SORT GROUP BY | | 1 | 50 | 5476 | | 7 | VIEW | | 1 | 50 | 5476 | | 8 | SORT GROUP BY | | 1 | 77 | 5476 | | 9 | TABLE ACCESS FULL | ORD_HIT_COMM | 1 | 77 | 5473 | | 10 | BUFFER SORT | | 1 | 65 | 10952 | | 11 | VIEW | | 1 | 65 | 5476 | | 12 | SORT GROUP BY | | 1 | 50 | 5476 | | 13 | VIEW | | 1 | 50 | 5476 | | 14 | SORT GROUP BY | | 1 | 77 | 5476 | | 15 | TABLE ACCESS FULL | ORD_HIT_COMM | 1 | 77 | 5473 | | 16 | VIEW | | 1 | 65 | 5476 | | 17 | SORT GROUP BY | | 1 | 52 | 5476 | | 18 | TABLE ACCESS FULL | ORD_HIT_COMM | 1 | 52 | 5473 | | 19 | VIEW | | 1 | 262 | 24 | | 20 | SORT GROUP BY | | 1 | 344 | 24 | | 21 | NESTED LOOPS | | 1 | 344 | 21 | | 22 | INLIST ITERATOR | | | | | | 23 | TABLE ACCESS BY INDEX ROWID| SW_PLAT_CAT_ORG | 18 | 5148 | 3 | | 24 | INDEX RANGE SCAN | IDX_SW_PLAT_CAT_ORG_ENABLE | 7 | | 2 | | 25 | INLIST ITERATOR | | | | | | 26 | TABLE ACCESS BY INDEX ROWID| SW_PLAT_CAT_BUYER | 1 | 58 | 1 | | 27 | INDEX UNIQUE SCAN | PK_SW_PLAT_CAT_BUYER | 1 | | | | 28 | INLIST ITERATOR | | | | | | 29 | TABLE ACCESS BY INDEX ROWID | PLT_PLAT | 1 | 38 | 1 | | 30 | INDEX UNIQUE SCAN | PK_PLT_PLAT | 1 | | | ------------------------------------------------------------------------------------------------- Note: cpu costing is off, PLAN_TABLE' is old version 38 rows selected. |
虽然SQL本身写的有缺点,但是绝对不应该产生这种包含笛卡儿积的执行计划,检查SQL并没有发现缺少关联条件的情况,即问题和SQL本身并不大,虽然SQL有很多可以优化的地方,但是这并不是产生笛卡儿积的关键因素。
观察执行计划本身,除了笛卡儿积之外,另人比较疑惑的一点就是返回记录数,认为全表扫描ORD_HIT_COMM仅仅返回一条记录,这显然是有问题的:
SQL> SELECT COUNT(*) FROM ORD_HIT_COMM; COUNT(*) ---------- 1688957 |
从这一点上判断,可以很容易的断定是统计信息出现了问题,检查ORD_HIT_COMM的统计信息:
SQL> SELECT TABLE_NAME, NUM_ROWS FROM USER_TABLES 2 WHERE TABLE_NAME = 'ORD_HIT_COMM'; TABLE_NAME NUM_ROWS ------------------------------ ---------- ORD_HIT_COMM 1681983 |
本以为不存在统计信息,或者得到一个很小的值,没想到统计信息基本上是准确的,那么是哪里出现的问题呢。
观察SQL语句,发现对ORD_HIT_COMM表唯一的限制条件是ENABLE_FLAG=1,而这个限制条件其实对过滤数据来说没有多大的意义,不过检查执行计划:
SQL> EXPLAIN PLAN FOR SELECT * FROM ORD_HIT_COMM 2 WHERE ENABLE_FLAG = 1; Explained. SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY); PLAN_TABLE_OUTPUT ----------------------------------------------------------------------------------- ---------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ---------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 733 | 5473 | | 1 | TABLE ACCESS FULL | ORD_HIT_COMM | 1 | 733 | 5473 | ---------------------------------------------------------------------- Note: cpu costing is off, PLAN_TABLE' is old version 9 rows selected. 看来问题多半出现在ENABLE_FLAG列的统计信息上。 SQL> SELECT COLUMN_NAME, NUM_DISTINCT, NUM_NULLS, DENSITY, NUM_BUCKETS 2 FROM USER_TAB_COLUMNS 3 WHERE TABLE_NAME = 'ORD_HIT_COMM' 4 AND COLUMN_NAME = 'ENABLE_FLAG'; COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY NUM_BUCKETS ------------------------------ ------------ ---------- ---------- ----------- ENABLE_FLAG 2 0 2.9971E-07 1 |
在9i的环境,根据DENSITY来确定返回的记录数,因此得到1条记录的结果是很正常的:
SQL> SELECT 1681983 * 2.9971E-07 FROM DUAL; 1681983*2.9971E-07 ------------------ .504107125 |
显然这时需要删除错误的统计信息,并重新收集统计信息:
SQL> EXEC DBMS_STATS.DELETE_TABLE_STATS(USER, 'ORD_HIT_COMM') PL/SQL procedure successfully completed. SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ORD_HIT_COMM') PL/SQL procedure successfully completed. |
检查统计信息中的DENSITY值:
SQL> SELECT COLUMN_NAME, NUM_DISTINCT, NUM_NULLS, DENSITY, NUM_BUCKETS 2 FROM USER_TAB_COLUMNS 3 WHERE TABLE_NAME = 'ORD_HIT_COMM' 4 AND COLUMN_NAME = 'ENABLE_FLAG'; COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY NUM_BUCKETS ------------------------------ ------------ ---------- ---------- ----------- ENABLE_FLAG 2 0 .5 1 |
下面检查访问ORD_HIT_COMM的执行计划,检查优化器认为的返回记录数:
SQL> EXPLAIN PLAN FOR 2 SELECT * FROM ORD_HIT_COMM WHERE ENABLE_FLAG = 1; Explained. SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY); PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------- ---------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ---------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 844K| 591M| 10902 | | 1 | TABLE ACCESS FULL | ORD_HIT_COMM | 844K| 591M| 10902 | ---------------------------------------------------------------------- Note: cpu costing is off, PLAN_TABLE' is old version 9 rows selected. |
现在统计信息已经恢复正常,检查一下出现问题的SQL语句执行计划是否正常:
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- | Id | Operation |Name |Rows| Bytes |TempSpc| Cost | ---------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5| 2475 | | 51159 | | 1 | HASH JOIN | | 5| 2475 | | 51159 | | 2 | HASH JOIN | | 7| 3010 | | 32489 | | 3 | HASH JOIN | | 9| 3285 | | 13819 | | 4 | MERGE JOIN CARTESIAN | | 13| 3900 | | 26 | | 5 | VIEW | | 1| 262 | | 24 | | 6 | SORT GROUP BY | | 1| 344 | | 24 | | 7 | NESTED LOOPS | | 1| 344 | | 21 | | 8 | INLIST ITERATOR | | | | | | | 9 | TABLE ACCESS BY INDEX ROWID|SW_PLAT_CAT_ORG | 18| 5148 | | 3 | | 10 | INDEX RANGE SCAN |IDX_SW_PLAT_CAT_ORG_ENABLE| 7| | | 2 | | 11 | INLIST ITERATOR | | | | | | | 12 | TABLE ACCESS BY INDEX ROWID|SW_PLAT_CAT_BUYER | 1| 58 | | 1 | | 13 | INDEX UNIQUE SCAN |PK_SW_PLAT_CAT_BUYER | 1| | | | | 14 | BUFFER SORT | | 13| 494 | | 26 | | 15 | INLIST ITERATOR | | | | | | | 16 | TABLE ACCESS BY INDEX ROWID |PLT_PLAT | 13| 494 | | 2 | | 17 | INDEX RANGE SCAN |PK_PLT_PLAT | 13| | | 1 | | 18 | VIEW | |4027| 255K| | 13792 | | 19 | SORT GROUP BY | |4027| 204K| | 13792 | | 20 | TABLE ACCESS FULL |ORD_HIT_COMM |645K| 32M| | 10902 | | 21 | VIEW | |4027| 255K| | 18669 | | 22 | SORT GROUP BY | |4027| 196K| | 18669 | | 23 | VIEW | |645K| 30M| | 18669 | | 24 | SORT GROUP BY | |645K| 47M| 114M| 18669 | | 25 | TABLE ACCESS FULL |ORD_HIT_COMM |645K| 47M| | 10902 | | 26 | VIEW | |4027| 255K| | 18669 | | 27 | SORT GROUP BY | |4027| 196K| | 18669 | | 28 | VIEW | |645K| 30M| | 18669 | | 29 | SORT GROUP BY | |645K| 47M| 114M| 18669 | | 30 | TABLE ACCESS FULL |ORD_HIT_COMM |645K| 47M| | 10902 | ---------------------------------------------------------------------------------------------------- Note: cpu costing is off, PLAN_TABLE' is old version 38 rows selected. |
这个执行计划至少已经不那么离谱了,下面只需要重新运行JOB就可以了