kevin.zhang 发表于 2012-5-1 00:42:49

【案例6】Trouble Shooting: CRS-4534: Cannot communicate with Event Manager

本帖最后由 kevin.zhang 于 2012-5-1 00:55 编辑

来自:
http://dbakevin.blogspot.com/2012/04/trouble-shooting-crs-4534-cannot.html
转载请注明。

Here is a three nodes Rac cluster.
All DBs are up and running fine on all the nodes.
oracle $ ps -ef|grep smon|wc -l
43
oracle $ ps -ef|grep tns|wc -l
50
oracle $ ps -ef|grep d.bin
root      8177   10 Apr28 ?      00:18:29 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle    8550   1 22 Apr28 ?      08:52:59 /stage/grid/11.2.0/grid/bin/oraagent.bin
root   26480   10 Apr28 ?      00:03:26 /stage/grid/11.2.0/grid/bin/ohasd.bin reboot
oracle   29210   10 Apr28 ?      00:00:00 /stage/grid/11.2.0/grid/bin/mdnsd.bin
oracle   29232   10 Apr28 ?      00:01:09 /stage/grid/11.2.0/grid/bin/gpnpd.bin
oracle   29257   10 Apr28 ?      00:05:32 /stage/grid/11.2.0/grid/bin/gipcd.bin
root   29307   10 Apr28 ?      00:00:13 /stage/grid/11.2.0/grid/bin/cssdmonitor
root   29334   10 Apr28 ?      00:00:10 /stage/grid/11.2.0/grid/bin/cssdagent
oracle   29351   18 Apr28 ?      03:25:51 /stage/grid/11.2.0/grid/bin/ocssd.bin
root   29540   10 Apr28 ?      00:20:57 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle   29556   10 Apr28 ?      00:00:18 /stage/grid/11.2.0/grid/bin/diskmon.bin -d -f
root   30160   10 Apr28 ?      00:00:33 /stage/grid/11.2.0/grid/bin/octssd.bin reboot
oracle   30350   10 Apr28 ?      00:00:04 /stage/grid/11.2.0/grid/bin/evmd.bin
root   31947   10 Apr28 ?      00:03:50 /stage/grid/11.2.0/grid/bin/crsd.bin reboot

From above output red part we aslo can see evmd.bin process is up.
But when use crsctl command, it shows that Event Manger not work on node 2 and node 3.
oracle $ crsctl check cluster -all
**************************************************************
cihcissdb758:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
cihcissdb759:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
cihcissdb760:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************

Checked cluster alert.log, it also indicate EVMD started normal:
2012-04-28 18:29:48.154
CRS-1401:EVMD started on node cihcissdb759.

Below is from evmd.log:
2012-04-28 18:29:17.518: [ EVMD] EVMD Starting
2012-04-28 18:29:17.518: [ EVMD] Initializing OCR
2012-04-28 18:29:17.518: [ EVMD] EVMD running in cluster mode
2012-04-28 18:29:18.524: [ EVMD] Waiting for OCR. Seconds elapsed: 0
2012-04-28 18:29:48.136: [ EVMD] Get OCR context succeeded
2012-04-28 18:29:48.153: [ EVMD] Initializing Diagnostics Settings
2012-04-28 18:29:48.224: [ EVMD] Authorization database built successfully.
2012-04-28 18:29:49.105: [ EVMAPP] EVMD Started
2012-04-28 18:29:49.123: [ EVMD] Authorization database built successfully.

All seems fine. Then why "CRS-4534: Cannot communicate with Event Manager"?

Let's check the current status for evmd:
oracle $ crsctl stat res -t -init
-------------------------------------------------------
NAME             TARGET          STATE       SERVER   
-------------------------------------------------------
ora.evmd      1ONLINE   INTERMEDIATE cihcissdb760

The current status INTERMEDIATE means the evmd did starting but not totally completed.

Below recorded in oracle_agent.logfile:
2012-04-28 18:45:50.936: [    AGFW] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: OFFLINE
.................
2012-04-28 18:46:23.441: [    AGFW] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: STARTING
.................
2012-04-28 20:41:27.768: [    AGFW] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
.................

Wecan see the evmd was starting in progress, status changed from UNKNOWN --> OFFLINE --> STARTING --> PARTIAL.
And next the status should change from PARTIAL to ONLINE.
But from the logfile, we don't see the final step, it never happen.

It failed at the final step due to below error message:
2012-04-28 20:43:27.787: {0:0:2} clsn_agent::abort {
2012-04-28 20:43:27.787: {0:0:2} abort {
2012-04-28 20:43:27.787: {0:0:2} abort command: check
2012-04-28 20:43:27.787: {0:0:2} tryActionLock {
2012-04-28 20:44:27.816: {0:0:2} did not get lock
2012-04-28 20:44:27.816: {0:0:2} tryActionLock }
2012-04-28 20:44:27.816: {0:0:2} clsn_agent::abort: Exception LockAcquireTimeoutException
2012-04-28 20:44:27.816: {0:0:2} clsn_agent::abort, agent exiting }

This is very similar to bug 11807012: oraagent.bin Exits After Check Timed Out

But this bug should have been fixed in 11.2.0.2.3,and our server already patched 11gR2 Patch Set Update 4.

It looks like that patch set update don't real fully solved that bug.


Raised SR and will check with Oracle Support further.

页: [1]
查看完整版本: 【案例6】Trouble Shooting: CRS-4534: Cannot communicate with Event Manager