Hello,
I have been stuck on a problem for 4 days. I am setting up a cluster
with 2 nodes, one primary and one secondary.
Here are the architecture and the configuration files:
2 IBM servers with RAID 1
DRBD version 8.2.6 (api:88/proto:86-88) and heartbeat installed on both servers
############################################
############################################
drbd.conf:
#
# drbd.conf
#
resource r1 {
  protocol B;
  #incon-degr-cmd "halt -f";
  #incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
  #handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f"; }
  startup {
    #degr-wfc-timeout 120; # 2 minutes.
  }
  disk {
    #on-io-error detach;
  }
  net {
    #sndbuf-size 512k;
    #timeout 60; # 6 seconds (unit = 0.1 seconds)
    #connect-int 10; # 10 seconds (unit = 1 second)
    #ping-int 10; # 10 seconds (unit = 1 second)
    #ping-timeout 50; # 500 ms (unit = 0.1 seconds)
    #max-buffers 8000;
    #max-epoch-size 8000;
  }
  syncer {
    rate 2048;
    #group 1;
    #al-extents 257;
  }
  on serv11 {
    device /dev/drbd0;
    disk /dev/sda4;
    address 192.168.1.246:7788;
    meta-disk internal;
  }
  on serv12 {
    device /dev/drbd0;
    disk /dev/sda4;
    address 192.168.1.247:7788;
    meta-disk internal;
  }
}
############################################
/etc/ha.d/ha.cf:
#bcast eth0 is not used because the network contains 2 clusters, so we use unicast instead
ucast eth0 192.168.1.247
#baud 19200
#serial /dev/ttyS0
#bcast eth1
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 10
warntime 6
initdead 60
udpport 694
node serv11
node serv12
auto_failback off
##############################################
/etc/ha.d/haresources:
serv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 IPaddr::192.168.1.250 MailTo::toto@toto.com::Cluster1-StatusUpdated fetchmail
#################################################
So, when I restart heartbeat on the primary node, the secondary node takes over,
mounts the /data partition and becomes primary -----> correct behavior.
The problem is that when I unplug the primary node -----> the cluster hangs. Here is the log on
the secondary node:
#####################################################
Oct 10 12:16:31 serv12 kernel: drbd0: PingAck did not arrive in time.
Oct 10 12:16:31 serv12 kernel: drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 10 12:16:31 serv12 kernel: drbd0: asender terminated
Oct 10 12:16:31 serv12 kernel: drbd0: Terminating thread asender
Oct 10 12:16:31 serv12 kernel: drbd0: short read expecting header on sock: r=-512
Oct 10 12:16:31 serv12 kernel: drbd0: Writing meta data super block now.
Oct 10 12:16:31 serv12 kernel: drbd0: tl_clear()
Oct 10 12:16:31 serv12 kernel: drbd0: Connection closed
Oct 10 12:16:31 serv12 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Oct 10 12:16:31 serv12 kernel: drbd0: receiver terminated
Oct 10 12:16:31 serv12 kernel: drbd0: receiver (re)started
Oct 10 12:16:31 serv12 kernel: drbd0: conn( Unconnected -> WFConnection )
Oct 10 12:16:31 serv12 heartbeat[2810]: WARN: node serv11: is dead
Oct 10 12:16:31 serv12 heartbeat[2810]: WARN: No stonith device configured.
Oct 10 12:16:31 serv12 heartbeat[2810]: WARN: Shared disks are not protected.
Oct 10 12:16:31 serv12 heartbeat[2810]: info: Resources being acquired from serv11.
Oct 10 12:16:31 serv12 heartbeat[2810]: info: Link serv11:eth2 dead.
Oct 10 12:16:31 serv12 heartbeat[3039]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct 10 12:16:31 serv12 heartbeat: info: Running /etc/ha.d/rc.d/status status
Oct 10 12:16:31 serv12 heartbeat[3040]: info: No local resources [/usr/lib/heartbeat/ResourceManagement listkeys serv12] to acquire.
Oct 10 12:16:31 serv12 heartbeat[2810]: debug: StartNextRemoteRscReq(): child count 1
Oct 10 12:16:31 serv12 heartbeat: info: Taking over resource group drbddisk::r1
Oct 10 12:16:31 serv12 heartbeat: info: Acquiring resource group: serv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 IPaddr::192.168.1.250 MailTo::otot@ttt::Cluster1-StatusUpdated fetchmail
Oct 10 12:16:31 serv12 heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r1 start
Oct 10 12:16:31 serv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:31 serv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:31 serv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:32 serv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:32 serv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:50 serv12 heartbeat: debug: /etc/ha.d/resource.d/drbddisk r1 start done. RC=1
Oct 10 12:16:50 serv12 heartbeat: ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
Oct 10 12:16:50 serv12 heartbeat: CRIT: Giving up resources due to failure of drbddisk::r1
Oct 10 12:16:50 serv12 heartbeat: info: Releasing resource group: serv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 IPaddr::192.168.1.250 MailTo::@::Cluster1-StatusUpdated fetchmail
Oct 10 12:16:50 serv12 heartbeat: info: Running /etc/init.d/fetchmail stop
Oct 10 12:16:50 serv12 heartbeat: debug: Starting /etc/init.d/fetchmail stop
Oct 10 12:16:50 serv12 heartbeat: debug: /etc/init.d/fetchmail stop done. RC=0
Oct 10 12:16:50 serv12 heartbeat: info: Running /etc/ha.d/resource.d/MailTo Cluster1-StatusUpdated stop
Oct 10 12:16:50 serv12 heartbeat: debug: Starting /etc/ha.d/resource.d/MailTo Cluster1-StatusUpdated stop
Oct 10 12:16:50 serv12 heartbeat: debug: /etc/ha.d/resource.d/MailTo l Cluster1-StatusUpdated stop done. RC=0
Oct 10 12:16:50 serv12 heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.250 stop
Oct 10 12:16:50 serv12 heartbeat: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.1.250 stop
Oct 10 12:16:50 serv12 heartbeat: debug: /etc/ha.d/resource.d/IPaddr 192.168.1.250 stop done. RC=0
Oct 10 12:16:50 serv12 heartbeat: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Oct 10 12:16:50 serv12 heartbeat: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Oct 10 12:16:50 serv12 heartbeat: WARNING: Filesystem /data not mounted?
Oct 10 12:16:50 serv12 heartbeat: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop done. RC=0
Oct 10 12:16:50 serv12 heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r1 stop
Oct 10 12:16:50 serv12 heartbeat: debug: Starting /etc/ha.d/resource.d/drbddisk r1 stop
Oct 10 12:16:50 serv12 heartbeat: debug: /etc/ha.d/resource.d/drbddisk r1 stop done. RC=0
Oct 10 12:16:50 serv12 heartbeat: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Oct 10 12:16:50 serv12 heartbeat[2810]: info: mach_down takeover complete.
Oct 10 12:16:50 serv12 heartbeat: info: mach_down takeover complete for node serv11.
##############################################
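One possible reading of the log above: serv12 was still a SyncTarget when the link dropped, so its local disk is Inconsistent and DRBD refuses the promotion that drbddisk asks for. The small Python sketch below only illustrates that reading. It parses the cs/st/ds fields of a DRBD state line like the ones in the log and applies the rule stated in the kernel message (at least one disk must be UpToDate). The names parse_drbd_state and can_promote are hypothetical helpers, not part of DRBD or heartbeat.

```python
import re

# Matches the cs/st/ds fields of a DRBD kernel state line such as
# "state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }".
# Optional whitespace is tolerated so that loosely pasted logs also parse.
STATE_RE = re.compile(
    r"cs:\s*(?P<cs>\w+)\s+"
    r"st:\s*(?P<st_local>\w+)\s*/\s*(?P<st_peer>\w+)\s+"
    r"ds:\s*(?P<ds_local>\w+)\s*/\s*(?P<ds_peer>\w+)"
)


def parse_drbd_state(line):
    """Return the connection, role and disk states found in a log line."""
    match = STATE_RE.search(line)
    if match is None:
        raise ValueError("no DRBD state found in line")
    return match.groupdict()


def can_promote(state):
    """Assumption taken from the kernel message only: promotion needs
    at least one UpToDate disk (local or peer)."""
    return "UpToDate" in (state["ds_local"], state["ds_peer"])


if __name__ == "__main__":
    line = ("Oct 10 12:16:31 serv12 kernel: drbd0: state = { cs:WFConnection "
            "st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }")
    state = parse_drbd_state(line)
    print(state["ds_local"], state["ds_peer"])  # prints: Inconsistent DUnknown
    print(can_promote(state))                   # prints: False
```

With the state from the log, neither disk is UpToDate, which matches the "State change failed" message and the RC=1 returned by drbddisk.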
Note: when I plug the primary node's network cable back in, the cluster unblocks and the resources come up on the primary.
Thanks a lot in advance.
Lassaad.