From:
http://dbakevin.blogspot.com/2012/04/trouble-shooting-crs-4534-cannot.html
Please credit the source when reposting.
Here is a three-node RAC cluster.
All databases are up and running fine on all the nodes.
oracle $ ps -ef|grep smon|wc -l
43
oracle $ ps -ef|grep tns|wc -l
50
oracle $ ps -ef|grep d.bin
root 8177 1 0 Apr28 ? 00:18:29 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle 8550 1 22 Apr28 ? 08:52:59 /stage/grid/11.2.0/grid/bin/oraagent.bin
root 26480 1 0 Apr28 ? 00:03:26 /stage/grid/11.2.0/grid/bin/ohasd.bin reboot
oracle 29210 1 0 Apr28 ? 00:00:00 /stage/grid/11.2.0/grid/bin/mdnsd.bin
oracle 29232 1 0 Apr28 ? 00:01:09 /stage/grid/11.2.0/grid/bin/gpnpd.bin
oracle 29257 1 0 Apr28 ? 00:05:32 /stage/grid/11.2.0/grid/bin/gipcd.bin
root 29307 1 0 Apr28 ? 00:00:13 /stage/grid/11.2.0/grid/bin/cssdmonitor
root 29334 1 0 Apr28 ? 00:00:10 /stage/grid/11.2.0/grid/bin/cssdagent
oracle 29351 1 8 Apr28 ? 03:25:51 /stage/grid/11.2.0/grid/bin/ocssd.bin
root 29540 1 0 Apr28 ? 00:20:57 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle 29556 1 0 Apr28 ? 00:00:18 /stage/grid/11.2.0/grid/bin/diskmon.bin -d -f
root 30160 1 0 Apr28 ? 00:00:33 /stage/grid/11.2.0/grid/bin/octssd.bin reboot
oracle 30350 1 0 Apr28 ? 00:00:04 /stage/grid/11.2.0/grid/bin/evmd.bin
root 31947 1 0 Apr28 ? 00:03:50 /stage/grid/11.2.0/grid/bin/crsd.bin reboot
From the above output we can also see that the evmd.bin process is up.
But the crsctl command shows that the Event Manager is not working on node 2 and node 3.
oracle $ crsctl check cluster -all
**************************************************************
cihcissdb758:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
cihcissdb759:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
cihcissdb760:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
Checking the cluster alert.log, it also indicates that EVMD started normally:
2012-04-28 18:29:48.154
[evmd(30579)]CRS-1401:EVMD started on node cihcissdb759.
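(For reference, the clusterware alert.log lives under the node directory of the grid home by default in 11.2; a quick way to view it on the affected node, using our $GRID_HOME of /stage/grid/11.2.0/grid:)
oracle $ tail -100 /stage/grid/11.2.0/grid/log/cihcissdb759/alertcihcissdb759.log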
Below is from evmd.log:
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD Starting
2012-04-28 18:29:17.518: [ EVMD][1928129552] Initializing OCR
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD running in cluster mode
2012-04-28 18:29:18.524: [ EVMD][1928129552] Waiting for OCR. Seconds elapsed: 0
2012-04-28 18:29:48.136: [ EVMD][1928129552] Get OCR context succeeded
2012-04-28 18:29:48.153: [ EVMD][1928129552] Initializing Diagnostics Settings
2012-04-28 18:29:48.224: [ EVMD][1928129552] Authorization database built successfully.
2012-04-28 18:29:49.105: [ EVMAPP][1928129552] EVMD Started
2012-04-28 18:29:49.123: [ EVMD][1928129552] Authorization database built successfully.
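(The evmd.log quoted above sits under the same node directory; assuming the default 11.2 log layout, it can be checked like this:)
oracle $ tail -50 /stage/grid/11.2.0/grid/log/cihcissdb759/evmd/evmd.log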
All seems fine. Then why "CRS-4534: Cannot communicate with Event Manager"?
Let's check the current status for evmd:
oracle $ crsctl stat res -t -init
-------------------------------------------------------
NAME TARGET STATE SERVER
-------------------------------------------------------
ora.evmd 1 ONLINE INTERMEDIATE cihcissdb760
The current status INTERMEDIATE means that evmd has started but has not fully completed its startup.
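(For a bit more runtime detail on a resource stuck in INTERMEDIATE, e.g. its STATE_DETAILS, the verbose status of the init resource can be checked; a minimal sketch:)
oracle $ crsctl stat res ora.evmd -init -v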
Below is recorded in the oraagent logfile:
2012-04-28 18:45:50.936: [ AGFW][1113385280] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: OFFLINE
.................
2012-04-28 18:46:23.441: [ AGFW][1114609984] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: STARTING
.................
2012-04-28 20:41:27.768: [ AGFW][1116698944] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
.................
We can see that the evmd startup was in progress; the status changed from UNKNOWN --> OFFLINE --> STARTING --> PARTIAL.
The next change should have been from PARTIAL to ONLINE.
But in the logfile we never see that final step; it never happened.
It failed at the final step due to the error message below:
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort command: check
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock {
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] did not get lock
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock }
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort: Exception LockAcquireTimeoutException
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort, agent exiting }
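(For reference, these entries come from the oraagent log of the ohasd stack; a quick way to pull out just the ora.evmd state transitions, assuming the default 11.2 agent log path under the grid home:)
oracle $ grep "ora.evmd 1 1 state changed" /stage/grid/11.2.0/grid/log/cihcissdb759/agent/ohasd/oraagent_oracle/oraagent_oracle.log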
This is very similar to bug 11807012: oraagent.bin Exits After Check Timed Out.
But that bug should have been fixed in 11.2.0.2.3, and our server has already been patched with the 11gR2 Patch Set Update 4.
It looks like that patch set update does not really fully resolve the bug.
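(To double-check whether the fix for bug 11807012 is actually present in our grid home, opatch can list the bugs fixed by the installed patches; a minimal check, assuming the standard OPatch location under the grid home:)
oracle $ /stage/grid/11.2.0/grid/OPatch/opatch lsinventory -bugs_fixed | grep 11807012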
Raised an SR and will check with Oracle Support further.