From:
http://dbakevin.blogspot.com/2012/04/trouble-shooting-crs-4534-cannot.html
Please credit the source when reposting.
Here is a three-node RAC cluster.
All databases are up and running fine on all the nodes.
oracle $ ps -ef|grep smon|wc -l
43
oracle $ ps -ef|grep tns|wc -l
50
oracle $ ps -ef|grep d.bin
root 8177 1 0 Apr28 ? 00:18:29 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle 8550 1 22 Apr28 ? 08:52:59 /stage/grid/11.2.0/grid/bin/oraagent.bin
root 26480 1 0 Apr28 ? 00:03:26 /stage/grid/11.2.0/grid/bin/ohasd.bin reboot
oracle 29210 1 0 Apr28 ? 00:00:00 /stage/grid/11.2.0/grid/bin/mdnsd.bin
oracle 29232 1 0 Apr28 ? 00:01:09 /stage/grid/11.2.0/grid/bin/gpnpd.bin
oracle 29257 1 0 Apr28 ? 00:05:32 /stage/grid/11.2.0/grid/bin/gipcd.bin
root 29307 1 0 Apr28 ? 00:00:13 /stage/grid/11.2.0/grid/bin/cssdmonitor
root 29334 1 0 Apr28 ? 00:00:10 /stage/grid/11.2.0/grid/bin/cssdagent
oracle 29351 1 8 Apr28 ? 03:25:51 /stage/grid/11.2.0/grid/bin/ocssd.bin
root 29540 1 0 Apr28 ? 00:20:57 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle 29556 1 0 Apr28 ? 00:00:18 /stage/grid/11.2.0/grid/bin/diskmon.bin -d -f
root 30160 1 0 Apr28 ? 00:00:33 /stage/grid/11.2.0/grid/bin/octssd.bin reboot
oracle 30350 1 0 Apr28 ? 00:00:04 /stage/grid/11.2.0/grid/bin/evmd.bin
root 31947 1 0 Apr28 ? 00:03:50 /stage/grid/11.2.0/grid/bin/crsd.bin reboot
From the above output we can also see that the evmd.bin process is up.
But the crsctl command shows that the Event Manager is not working on node 2 and node 3.
oracle $ crsctl check cluster -all
**************************************************************
cihcissdb758:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
cihcissdb759:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
cihcissdb760:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
Checking the cluster alert.log, it also indicates that EVMD started normally:
2012-04-28 18:29:48.154
[evmd(30579)]CRS-1401:EVMD started on node cihcissdb759.
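(For reference, the clusterware alert.log lives under the node directory of the grid home by default in 11.2; a quick way to view it on the affected node, using our $GRID_HOME of /stage/grid/11.2.0/grid:)
oracle $ tail -100 /stage/grid/11.2.0/grid/log/cihcissdb759/alertcihcissdb759.log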
Below is from evmd.log:
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD Starting
2012-04-28 18:29:17.518: [ EVMD][1928129552] Initializing OCR
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD running in cluster mode
2012-04-28 18:29:18.524: [ EVMD][1928129552] Waiting for OCR. Seconds elapsed: 0
2012-04-28 18:29:48.136: [ EVMD][1928129552] Get OCR context succeeded
2012-04-28 18:29:48.153: [ EVMD][1928129552] Initializing Diagnostics Settings
2012-04-28 18:29:48.224: [ EVMD][1928129552] Authorization database built successfully.
2012-04-28 18:29:49.105: [ EVMAPP][1928129552] EVMD Started
2012-04-28 18:29:49.123: [ EVMD][1928129552] Authorization database built successfully.
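(The evmd.log quoted above sits under the same node directory; assuming the default 11.2 log layout, it can be checked like this:)
oracle $ tail -50 /stage/grid/11.2.0/grid/log/cihcissdb759/evmd/evmd.log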
All seems fine. Then why "CRS-4534: Cannot communicate with Event Manager"?
Let's check the current status for evmd:
oracle $ crsctl stat res -t -init
-------------------------------------------------------
NAME TARGET STATE SERVER
-------------------------------------------------------
ora.evmd 1 ONLINE INTERMEDIATE cihcissdb760
The current status INTERMEDIATE means that evmd has started but has not fully completed its startup.
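(For a bit more runtime detail on a resource stuck in INTERMEDIATE, e.g. its STATE_DETAILS, the verbose status of the init resource can be checked; a minimal sketch:)
oracle $ crsctl stat res ora.evmd -init -v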
Below is recorded in the oraagent logfile:
2012-04-28 18:45:50.936: [ AGFW][1113385280] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: OFFLINE
.................
2012-04-28 18:46:23.441: [ AGFW][1114609984] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: STARTING
.................
2012-04-28 20:41:27.768: [ AGFW][1116698944] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
.................
We can see that the evmd startup was in progress; the status changed from UNKNOWN --> OFFLINE --> STARTING --> PARTIAL.
The next change should have been from PARTIAL to ONLINE.
But in the logfile we never see that final step; it never happened.
It failed at the final step due to the error message below:
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort command: check
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock {
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] did not get lock
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock }
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort: Exception LockAcquireTimeoutException
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort, agent exiting }
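(For reference, these entries come from the oraagent log of the ohasd stack; a quick way to pull out just the ora.evmd state transitions, assuming the default 11.2 agent log path under the grid home:)
oracle $ grep "ora.evmd 1 1 state changed" /stage/grid/11.2.0/grid/log/cihcissdb759/agent/ohasd/oraagent_oracle/oraagent_oracle.log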
This is very similar to bug 11807012: oraagent.bin Exits After Check Timed Out.
But that bug should have been fixed in 11.2.0.2.3, and our server has already been patched with the 11gR2 Patch Set Update 4.
It looks like that patch set update does not really fully resolve the bug.
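(To double-check whether the fix for bug 11807012 is actually present in our grid home, opatch can list the bugs fixed by the installed patches; a minimal check, assuming the standard OPatch location under the grid home:)
oracle $ /stage/grid/11.2.0/grid/OPatch/opatch lsinventory -bugs_fixed | grep 11807012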
Raised an SR and will check with Oracle Support further.