Friday, October 21, 2011

Using Bonded Network Device Can Cause OCFS2 to Detect Network Outage

Problem: Using Bonded Network Device Can Cause OCFS2 to Detect Network Outage
During a network interface card (NIC) failure test, one or two minutes after the NIC cable is removed from one node, the second node restarts automatically.

Symptoms:
Oracle 10.2.0.4 with ASM / CRS / Database / OCFS2 / ASMLib


Here is the full test that was carried out and the problem encountered.

1. There are two NICs: one onboard and one add-on network card.
2. The two cards are bonded for the public IP.
3. If I remove the cable from the add-on card, everything works fine. CRS detects the change and switches to the other network card.
4. The next stage is to remove the cable from the onboard network card. When I do so, RAC detects that the cable has been removed and starts communicating over the other network card. But after some time, about 2 to 3 minutes, it reports that the voting disk hangs, then it panics and restarts both servers.

The heartbeat cable is on the add-on card.


Log files:
ocssd.log (s1gz0ss016)
================
[ CSSD]CLSS-3000: reconfiguration successful, incarnation 12 with 2 nodes

[ CSSD]CLSS-3001: local node number 2, master node number 1

[ CSSD]2010-01-25 15:23:39.930 [1287670080] >TRACE: clssgmReconfigThread: completed for reconfig(12), with status(1)
[ CSSD]2010-01-25 15:23:40.015 [1245710656] >TRACE: clssgmCommonAddMember: clsomon joined (2/0x1000000/#CSS_CLSSOMON)
[ CSSD]2010-01-25 15:37:19.190 [1245710656] >TRACE: clscsendx: (0x2aaaac089bb0) Connection not active

[ CSSD]2010-01-25 15:37:19.190 [1245710656] >TRACE: clssgmSendClient: Send failed rc 6, con (0x2aaaac089bb0), client (0x2aaaac089eb0), proc ((nil))
[ CSSD]2010-01-25 15:43:52.310 >USER: Copyright 2010, Oracle version 10.2.0.4.0
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=s1gz0ss016DBG_CSSD))
[ CSSD]2010-01-25 15:43:52.310 >USER: CSS daemon log for node s1gz0ss016, number 2, in cluster crs
……………………
[ CSSD]2010-01-25 15:46:38.827 [1250961728] >WARNING: clssnmPollingThread: node s1gz0ss015 (1) at 50% heartbeat fatal, eviction in 29.760 seconds
…………………..
[ CSSD]2010-01-25 15:47:07.885 [1250961728] >WARNING: clssnmPollingThread: node s1gz0ss015 (1) at 90% heartbeat fatal, eviction in 0.710 seconds
[ CSSD]2010-01-25 15:47:08.597 [1250961728] >TRACE: clssnmPollingThread: Eviction started for node s1gz0ss015 (1), flags 0x0001, state 3, wt4c 0


ocssd.log (s1gz0ss015)
================
[ CSSD]CLSS-3001: local node number 1, master node number 1

[ CSSD]2010-01-25 15:43:20.545 [1279928640] >TRACE: clssgmReconfigThread: completed for reconfig(14), with status(1)
[ CSSD]2010-01-25 15:51:54.617 >USER: Copyright 2010, Oracle version 10.2.0.4.0
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=s1gz0ss015DBG_CSSD))
……………….
[ CSSD]2010-01-25 16:25:21.074 [1248102720] >WARNING: clssnmPollingThread: node s1gz0ss016 (2) at 50% heartbeat fatal, eviction in 29.450 seconds
[ CSSD]2010-01-25 16:25:21.074 [1248102720] >TRACE: clssnmPollingThread: node s1gz0ss016 (2) is impending reconfig, flag 1037, misstime 30550
[ CSSD]2010-01-25 16:25:21.074 [1248102720] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >TRACE: clssnmDiskPMT: stale disk (84550 ms) (0//u02/oradata/vote/vote_data1.dbf)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >TRACE: clssnmDiskPMT: stale disk (84550 ms) (1//u03/oradata/vote/vote_data2.dbf)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >TRACE: clssnmDiskPMT: stale disk (84550 ms) (2//u03/oradata/vote1/vote_data3.dbf)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >ERROR: clssnmDiskPMT: Aborting, 3 of 3 voting disks unavailable
[ CSSD]2010-01-25 16:25:21.765 [1185163584] >ERROR: ###################################
[ CSSD]2010-01-25 16:25:21.765 [1185163584] >ERROR: clssscExit: CSSD aborting
[ CSSD]2010-01-25 16:25:21.765 [1185163584] >ERROR: ###################################
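
The s1gz0ss015 trace above shows why CSSD aborted: every voting disk had been stale for longer than the disk timeout in effect at that moment. As a rough illustration, the Python sketch below pulls both numbers out of the trace lines and reports how far each disk was past the limit. The function name and the regular expressions are my own and only match the exact wording of the lines quoted in this post; this is not an Oracle tool.

```python
import re

# Trace lines copied from the s1gz0ss015 ocssd.log excerpt in this post.
LOG = """\
[ CSSD]2010-01-25 16:25:21.074 [1248102720] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >TRACE: clssnmDiskPMT: stale disk (84550 ms) (0//u02/oradata/vote/vote_data1.dbf)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >TRACE: clssnmDiskPMT: stale disk (84550 ms) (1//u03/oradata/vote/vote_data2.dbf)
[ CSSD]2010-01-25 16:25:21.764 [1185163584] >TRACE: clssnmDiskPMT: stale disk (84550 ms) (2//u03/oradata/vote1/vote_data3.dbf)
"""

# Hypothetical patterns, written only for the message wording shown above.
TIMEOUT_RE = re.compile(r"diskTimeout set to \((\d+)\)ms")
STALE_RE = re.compile(r"stale disk \((\d+) ms\) \((\d+)/(\S+)\)")


def stale_voting_disks(log_text):
    """Return (disk timeout in ms, [(voting disk path, ms past the timeout)])."""
    timeout_ms = None
    overdue = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timeout_ms = int(match.group(1))
            continue
        match = STALE_RE.search(line)
        if match and timeout_ms is not None:
            stale_ms = int(match.group(1))
            if stale_ms > timeout_ms:
                overdue.append((match.group(3), stale_ms - timeout_ms))
    return timeout_ms, overdue


if __name__ == "__main__":
    timeout, disks = stale_voting_disks(LOG)
    print("disk timeout (ms):", timeout)
    for path, excess in disks:
        print(path, "is", excess, "ms past the timeout")
```

All three voting disks sit on OCFS2 paths, so when OCFS2 drops its network connection they all go stale together, which matches the "3 of 3 voting disks unavailable" error.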

OS system logs (s1gz0ss015)
====================

Comments
--------
Jan 25 16:23:15 S1GZ0SS015 kernel: bnx2: eth0 NIC Copper Link is Down
Jan 25 16:23:16 S1GZ0SS015 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 25 16:23:16 S1GZ0SS015 kernel: bonding: bond0: now running without any active interface !
Jan 25 16:23:19 S1GZ0SS015 snmpd[7769]: Connection from UDP: [127.0.0.1]:57543 
Jan 25 16:23:19 S1GZ0SS015 snmpd[7769]: Received SNMP packet(s) from UDP: [127.0.0.1]:57543 
Jan 25 16:23:34 S1GZ0SS015 snmpd[7769]: Connection from UDP: [127.0.0.1]:36167 
Jan 25 16:23:34 S1GZ0SS015 snmpd[7769]: Received SNMP packet(s) from UDP: [127.0.0.1]:36167 
Jan 25 16:23:45 S1GZ0SS015 kernel: o2net: connection to node S1GZ0SS016 (num 1) at 10.254.55.23:7777 has been idle for 30.0 seconds, shutting it down.
Jan 25 16:23:45 S1GZ0SS015 kernel: (0,2):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1264422195.99914 now 1264422225.99378 dr 1264422195.99905 adv 1264422195.99925:1264422195.99926 func (d24f33b7:505) 1264422143.100124:1264422143.100126)
Jan 25 16:23:45 S1GZ0SS015 kernel: o2net: no longer connected to node S1GZ0SS016 (num 1) at 10.254.55.23:7777
…………………
…………………
Jan 25 16:25:50 S1GZ0SS015 kernel: (19284,5):dlm_wait_for_node_death:370 8C697813DF044440B4D31E99E57E44EF: waiting 5000ms for notification of death of node 1
Jan 25 16:25:52 S1GZ0SS015 snmpd[7769]: Connection from UDP: [127.0.0.1]:38343 
Jan 25 16:25:52 S1GZ0SS015 snmpd[7769]: Received SNMP packet(s) from UDP: [127.0.0.1]:38343 
Jan 25 16:25:54 S1GZ0SS015 kernel: (11141,2):ocfs2_dlm_eviction_cb:98 device (8,129): dlm has evicted node 1
Jan 25 16:25:54 S1GZ0SS015 kernel: (11141,2):ocfs2_dlm_eviction_cb:98 device (8,209): dlm has evicted node 1
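
To see how quickly OCFS2 gave up on the interconnect, it helps to measure the gap between the link going down and o2net closing the connection. The Python sketch below does that for three of the syslog lines above. The helper names are hypothetical, the year 2010 is taken from the ocssd.log timestamps (syslog lines carry no year), and the month parsing assumes an English locale.

```python
from datetime import datetime

# Lines copied from the OS system log excerpt in this post.
SYSLOG = """\
Jan 25 16:23:15 S1GZ0SS015 kernel: bnx2: eth0 NIC Copper Link is Down
Jan 25 16:23:16 S1GZ0SS015 kernel: bonding: bond0: now running without any active interface !
Jan 25 16:23:45 S1GZ0SS015 kernel: o2net: no longer connected to node S1GZ0SS016 (num 1) at 10.254.55.23:7777
"""


def first_time(log_text, needle, year=2010):
    """Timestamp of the first line containing `needle`, or None if absent."""
    for line in log_text.splitlines():
        if needle in line:
            stamp = " ".join(line.split()[:3])  # e.g. "Jan 25 16:23:15"
            return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return None


def seconds_between(log_text, start_needle, end_needle):
    """Seconds from the first start_needle line to the first end_needle line."""
    start = first_time(log_text, start_needle)
    end = first_time(log_text, end_needle)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    gap = seconds_between(SYSLOG, "Link is Down", "o2net: no longer connected")
    print("seconds from link down to o2net disconnect:", gap)
```

The gap lines up with the "has been idle for 30.0 seconds" message from o2net, which is the timeout the solution below raises.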

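Solution:

The problem is with network bonding failover.
1. Replace the private IP with the public IP in the OCFS2 configuration.
2. Set the heartbeat dead threshold to 61 when the o2cb configuration prompts for it: Specify heartbeat dead threshold (>=7) [61]: 61
3. Set the network idle timeout to 60000 ms when the o2cb configuration prompts for it: Specify network idle timeout in ms (>=5000) [10000]: 60000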
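
To get a feel for what the two values mean in wall-clock time, here is a small Python sketch. It assumes the relationship described in the OCFS2 documentation, where the disk heartbeat runs every 2 seconds and a node is declared dead after (threshold - 1) missed iterations; please verify that against the documentation for your OCFS2 release. The function names and the 45-second outage are hypothetical and only for illustration.

```python
def heartbeat_dead_seconds(threshold, interval_s=2):
    """Seconds before a node is declared dead on disk heartbeat.

    Assumes a 2-second heartbeat interval and (threshold - 1) iterations.
    """
    return (threshold - 1) * interval_s


def idle_timeout_survives(idle_timeout_ms, outage_s):
    """True if a network outage of outage_s seconds is shorter than the idle timeout."""
    return outage_s < idle_timeout_ms / 1000


if __name__ == "__main__":
    # Values from the solution steps above.
    print("heartbeat dead after (s):", heartbeat_dead_seconds(61))
    # Hypothetical 45-second bonding outage against the new 60000 ms timeout.
    print("survives 45 s outage:", idle_timeout_survives(60000, 45))
```

With a threshold of 61 the disk heartbeat tolerates about 120 seconds, and a 60000 ms idle timeout gives the bond up to 60 seconds to recover before o2net shuts the connection down.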

References:

Note 423183.1: Using Bonded Network Device Can Cause OCFS2 to Detect Network Outage

