常见latch的争用分析以及解决办法

chenyu

一、buffer cache 中的CBC latch（cache buffer chains）,这表名在buffer中查找数据块时发生了latch争用，则有两种可能，一种是系统的latch确实不够用，二是大量进程集中访问某些数据块(热点快)导致latch争用，现实中可能大多是这种情况；怎么判断到底是那种情况导致latch争用呢？我们可以查询v$latch_children视图，如果发现latch平均分布在各个子latch中说明latch不够可以增加cpu（增加CPU可以增加latch）或修改隐藏参数,如果分布不均匀说明出现热点块，解决热点块可以通过下面的两个方法：1、对这个热点快对象进行分区；2、增加热点快对象的pctfree，把块分散开；为什么分区和增加pctfree可以解决热点块的问题呢？原因如下：
对表分区:假如一个表中有city字段，某个块中有北京、天津、青岛三个city的数据，这时有三个session分别访问这个块中三个城市的数据行，哪么这个块就会被三个session访问，如果对这张表city列进行范围分区，哪么北京、天津、青岛的数据肯定分布在不同的分区（不同的数据块）中，这时三个session就会访问不同的数据块了，从而解决热点块的问题；
增加pctfree:假如原pctfree由10%增加到20%,哪么数据块中用来存放数据的空间就减少10%，原来存放在这10%中的数据就会放到其他的数据块中，这样假如原来的session访问这10%的数据此时就会访问其他的数据块，从而解决热点块的问题；
解决热点快的流程
1、查找v$system_event查看系统等待事件，这里假设发现cbc latch争用严重
2、查找v$latch_children找到那些子latch争用比较厉害
select addr,name,gets,spin_gets,sleeps,hash from v$latch where name = 'cache buffers chains' order by sleeps
3、查找热点块属于那个对象
select t1.object_name,t2.file#,t2.DBAblk
from dba_objects t1,(select file#,dbablk,obj,addr,hladdr from x$bh where hladdr = '？') t2 where t1.object_id = t2.obj;  //？就是上一步查到的addr
4、定位导致热点快的SQL语句
select sql_text,executions from v$sqlarea where sql_text like '%？%';  //？就是上一步查询到的object_name
我们根据SQL语句的执行次数就应该能够判断出是那些SQL语句导致的热点快
模拟发生热点块时的检查步骤：
SQL> select rowid,ename from emp;
ROWID             ENAME
------------------ ----------
AAACYVAAEAAAAA6AAA SMITH
AAACYVAAEAAAAA6AAB ALLEN
AAACYVAAEAAAAA6AAC WARD
AAACYVAAEAAAAA6AAD JONES
AAACYVAAEAAAAA6AAE MARTIN
AAACYVAAEAAAAA6AAF BLAKE
AAACYVAAEAAAAA6AAG CLARK
AAACYVAAEAAAAA6AAH SCOTT
AAACYVAAEAAAAA6AAI KING //查询这行数据
AAACYVAAEAAAAA6AAJ TURNER
AAACYVAAEAAAAA6AAK ADAMS
AAACYVAAEAAAAA6AAL JAMES
AAACYVAAEAAAAA6AAM FORD
AAACYVAAEAAAAA6AAN MILLER
14 rows selected.
--find.sql
declare
  name varchar2(100);
begin
  loop
select ename into name from emp where rowid = 'AAACYVAAEAAAAA6AAI';
  end loop;
end;
/
exit;
--run.sql
sqlplus -s scott/tiger@PROD @find.sql &
sqlplus -s scott/tiger@PROD @find.sql &
sqlplus -s scott/tiger@PROD @find.sql
chmod +x run.sql
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: cache buffers chains                                              99       164
运行一段时间后
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: cache buffers chains                                           477       835 //latch争用比较严重
--查看子latch，得到地址
select addr,latch#,child#,name,hash,gets,misses,sleeps,spin_gets from v$latch_children where name = 'cache buffers chains' order by sleeps;
ADDR       LATCH#    CHILD# NAME                               HASH    GETS    MISSES    SLEEPS  SPIN_GETS
-------- ---------- ---------- ------------------------------ ---------- ---------- ---------- ---------- ----------
31D476B0       122       870 cache buffers chains          3563305585 6230630    63862       908    63095
--查看热点快属于那个对象
select t1.object_name from dba_objects t1,x$bh t2 where t1.object_id = t2.obj and t2.hladdr='31D476B0';
OBJECT_NAME
-----------------------
I_OBJ1
C_OBJ#_INTCOL#
I_H_OBJ#_COL#
SMON_SCN_TO_TIME
WRI$_ADV_TASKS_IDX_01
I_WRI$_OPTSTAT_H_ST
I_WRI$_OPTSTAT_H_ST
EMP
BIGTAB
BIGTAB
BIGTAB
BIGTAB
WRH$_LATCH
WRH$_WAITSTAT
WRH$_TABLESPACE_STAT
从上面的结果发现emp和bigtab两张表比较可疑，假如现在我们不知道是那张表发生热点快
--查询导致热点快的SQL
select sql_text,executions from v$sqlarea where upper(sql_text) like '%EMP%' or upper(sql_text) like '%BIGTAB%' order by executions;
SQL_TEXT                                                    EXECUTIONS
------------------------------------------------------------ ----------
declare name varchar2(100); begin loop    select ename          0
into name from emp where rowid = 'AAACYVAAEAAAAA6AAI'; end
loop; end;
select sql_text,executions from v$sqlarea where upper(sql_te       2
xt) like '%EMP%' or upper(sql_text) like '%BIGTAB%' order by
executions
select value$ from props$ where name='DEFAULT_TEMP_TABLESPAC       3
E'
select sql_id,TEMPSEG_SIZE,MAX_MEM_USED,WORK_AREA_SIZE from          3
v$sql_workarea_active
select dbms_metadata.get_ddl('TABLESPACE','TEMP') from dual          3
select sql_id,MAX_TEMPSEG_SIZE,LAST_TEMPSEG_SIZE from v$sql_       6
workarea order by MAX_TEMPSEG_SIZE,LAST_TEMPSEG_SIZE
select obj#, dataobj#, part#, hiboundlen, hiboundval, ts#, f       184
ile#, block#, pctfree$, pctused$, initrans, maxtrans, flags,
analyzetime, samplesize, rowcnt, blkcnt, empcnt, avgspc, ch
ncnt, avgrln, length(bhiboundval), bhiboundval from tabpart$
where bo# = :1 order by part#
select t.ts#,t.file#,t.block#,nvl(t.bobj#,0),nvl(t.tab#,0),t    16165
.intcols,nvl(t.clucols,0),t.audit$,t.flags,t.pctfree$,t.pctu
sed$,t.initrans,t.maxtrans,t.rowcnt,t.blkcnt,t.empcnt,t.avgs
pc,t.chncnt,t.avgrln,t.analyzetime,t.samplesize,t.cols,t.pro
perty,nvl(t.degree,1),nvl(t.instances,1),t.avgspc_flb,t.flbc
nt,t.kernelcols,nvl(t.trigflag, 0),nvl(t.spare1,0),nvl(t.spa
re2,0),t.spare4,t.spare6,ts.cachedblk,ts.cachehit,ts.logical
read from tab$ t, tab_stats$ ts where t.obj#= :1 and t.obj#
= ts.obj# (+)
SELECT ENAME FROM EMP WHERE ROWID = 'AAACYVAAEAAAAA6AAI'    13374107
至此发现是由于SELECT ENAME FROM EMP WHERE ROWID = 'AAACYVAAEAAAAA6AAI'这个SQL语句导致的CBC latch的争用，下一步就可以按照上面的步骤进行处理了；
二、shared pool latch: 从shared pool中找空闲空间时需要获取,排他latch，如果这个latch比较高，大部分原因是因为SQL没共享,这时一般情况下hard parse也相应比较高，解决方法是查询v$sql或v$sqlarea，查询是那些SQL。
模拟shared pool latch的测试
SQL> select count(*) from bigtab;
  COUNT(*)
----------
1000000
SQL> desc bigtab;
Name                                     Null? Type
----------------------------------------- -------- ----------------------------
ID                                              NUMBER
NAME                                              VARCHAR2(32)
--查看开始时的shared pool情况
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: shared pool                                                    318       161
--运行程序
declare
  name varchar2(50);
  sqlstr varchar2(100);
begin
  for i in 1..1000000000 loop
sqlstr := 'select name from bigtab where id = ' || i;
execute immediate sqlstr into name; //未绑定变量,模拟硬解析
  end loop;
end;
/
--再次查看shared pool情况，发现有争用但是不明显
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: shared pool                                                    324       162
--当前数据库设置,sga自动管理，可能是由于sga比较大，SQL有比较简单所以shared pool的争用不明显
SQL> show parameter shared_pool_size
NAME                               TYPE       VALUE
------------------------------------ ----------- ------------------------------
shared_pool_size                   big integer 0
SQL> show parameter sga_target
NAME                               TYPE       VALUE
------------------------------------ ----------- ------------------------------
sga_target                         big integer 300M
--换一种分析思路，如果shared pool争用比较厉害，那么一般是由于没有使用绑定变量导致大量硬解析造成的，这时我们查询系统硬解析的情况
SQL> select name,value from v$sysstat where name like '%parse%hard%';
NAME                               VALUE
------------------------------ ----------
parse count (hard)                142897
SQL> /
NAME                               VALUE
------------------------------ ----------
parse count (hard)                143182 //比较明显
SQL> /
NAME                               VALUE
------------------------------ ----------
parse count (hard)                143673 //比较明显
--查询是由于那些SQL语句导致的硬解析，从结果中发现有大量的select name from bigtab where id = ？的语句没有使用绑定变量，至此问题已经找到；
select sql_text,executions from v$sqlarea order by sql_text,executions desc;
SQL_TEXT                                                    EXECUTIONS
------------------------------------------------------------ ----------
......
select name from bigtab where id = 93510                            1
select name from bigtab where id = 93511                            1
select name from bigtab where id = 93512                            1
select name from bigtab where id = 93513                            1
select name from bigtab where id = 93514                            1
select name from bigtab where id = 93515                            1
.......

三、library cache latch:产生共享SQL和找共享SQL时需要获取,排他latch;
如果shared pool latch不高但是library cache latch很高时表示有很多相同的SQL要执行(软解析)，需要到shared_pool的链中查找，这时查询v$sqlarea视图看那个SQL的executions比较大；
模拟library cache latch测试
--把session_cached_cursors设置为0，确保发生软解析
SQL> alter system set session_cached_cursors=0 scope=spfile;
System altered.
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area  314572800 bytes
Fixed Size                1219184 bytes
Variable Size             88081808 bytes
Database Buffers       218103808 bytes
Redo Buffers             7168000 bytes
Database mounted.
Database opened.
SQL> show parameter session_cached_cursors;
NAME                               TYPE       VALUE
------------------------------------ ----------- ------------------------------
session_cached_cursors             integer    0
--表进行建立索引
SQL> select index_name from user_indexes where table_name = 'BIGTAB';
INDEX_NAME
------------------------------
IDX_BIGTAB_ID
--find.sql
declare
  str varchar2(50);
  n integer;
begin
  for i in 1..10 loop
n := floor(dbms_random.value*1000000);
select name into str from bigtab where id = n;
  end loop;
end;
/
exit;
--run.sql
sqlplus -s scott/tiger@prod @find.sql &
sqlplus -s scott/tiger@prod @find.sql &
sqlplus -s scott/tiger@prod @find.sql
chmod +x run.sql
--开始执行程序之前的library cache情况
SQL> select event,TOTAL_WAITS,TIME_WAITED from v$system_event where event like '%library cache%';
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
library cache load lock                                                 2          1
--执行程序
./run.sql
--一段时间后再次查询library cache争用情况
SQL> select event,TOTAL_WAITS,TIME_WAITED from v$system_event where event like '%library cache%';
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: library cache                                                    1          3
library cache pin                                                       3       11
library cache load lock                                                 2          1
/
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: library cache                                                    23       52  //争用出现
latch: library cache pin                                                 5       12
library cache pin                                                       3       11
library cache load lock                                                 2          1
--查询子latch的地址和详细的争用情况
SQL> select addr,latch#,child#,name,hash,gets,misses,sleeps,spin_gets from v$latch_children where name = 'library cache' order by sleeps;
ADDR       LATCH#    CHILD# NAME                               HASH    GETS    MISSES    SLEEPS  SPIN_GETS
-------- ---------- ---------- ------------------------------ ---------- ---------- ---------- ---------- ----------
313C3B3C       214       1 library cache                3055961779    24242       18       0       18
313C3A74       214       3 library cache                3055961779    41359       31       2       29
313C3AD8       214       2 library cache                3055961779 2707431    15096       79    15020 //争用显著
--查询执行次数多的SQL语句
select * from (select sql_text,executions from v$sqlarea order by executions desc) where rownum <= 10;
SQL_TEXT                                                    EXECUTIONS
------------------------------------------------------------ ----------
SELECT NAME FROM BIGTAB WHERE ID = :B1                         2457736 //发现问题，实有与这个SQL引起的library cache latch争用
select /*+ rule */ bucket_cnt, row_cnt, cache_cnt, null_cnt,    1960
timestamp#, sample_size, minimum, maximum, distcnt, lowval,
hival, density, col#, spare1, spare2, avgcln from hist_head
$ where obj#=:1 and intcol#=:2
select /*+ rule */ bucket, endpoint, col#, epvalue from hist       744
grm$ where obj#=:1 and intcol#=:2 and row#=:3 order by bucke
t
select timestamp from fixed_obj$ where obj# = :1                   597
select type#,blocks,extents,minexts,maxexts,extsize,extpct,u       555
ser#,iniexts,NVL(lists,65535),NVL(groups,65535),cachehint,hw
mincr, NVL(spare1,0),NVL(scanhint,0) from seg$ where ts#=:1
and file#=:2 and block#=:3
select intcol#,nvl(pos#,0),col#,nvl(spare1,0) from ccol$ whe       492
re con#=:1
select o.owner#,o.name,o.namespace,o.remoteowner,o.linkname,       324
o.subname,o.dataobj#,o.flags from obj$ o where o.obj#=:1
select obj#,type#,ctime,mtime,stime,status,dataobj#,flags,oi       320
d$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and
namespace=:3 and remoteowner is null and linkname is null a
nd subname is null
select pos#,intcol#,col#,spare1,bo#,spare2 from icol$ where       243
obj#=:1
select count(*) from sys.job$ where (next_date > sysdate) an       204
d (next_date < (sysdate+5/86400))

P4 · 发表于 2010-11-12 15:05:16

不错，顶一个

P6 · 发表于 2010-11-12 17:42:17

上面是对三种latch的解决方案。
这三种latch是最常见的latch。
希望大家认真的阅读和理解，当然前提是大家对latch的工作机制以及两种pool的内存结构有比较清楚的理解。

P5 · 发表于 2014-5-7 08:28:28

不错，顶一个学习一下

图文精华

常见latch的争用分析以及解决办法