Discussion:
[DRBD-user] DRBD keeps doing full resync
Viktor Villafuerte
2015-08-14 07:00:48 UTC
Hi all,

I've got a problem very similar to the one described in

http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html


OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
kernel: kernel-2.6.32-573.1.1.el6.x86_64
DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64


Two-node cluster where I've encountered problems after upgrading to the
latest kernel plus Corosync, Pacemaker and DRBD.

The secondary does a full resync and seems to finish it successfully, but
then 'PingAck did not arrive in time' happens and the resync repeats.
This keeps happening in a loop thereafter.

Please note that I could not observe any network problems, nor any
problems with the resync or any other services, except at the very end
of the resync.

I've tried rebooting the secondary (target) node, invalidating the data
and forcing a full resync by hand, waiting through about 4 rounds of the
loop :) But to no avail.

Currently I'm in the process of downgrading the kernel to see what
happens, but as it's a production server this will take a little time
(outage) :/

I'm also fairly new to DRBD so please point out anything obvious I may
be missing.

thanks

v


Logs:

primary node (SyncSource)

Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Connection closed
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected )
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams )
Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [3465])
Aug 12 15:22:48 dbpg02 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:48 dbpg02 kernel: block drbd0: self 453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: was SyncSource, missed the resync finished event, corrected myself:
Aug 12 15:22:49 dbpg02 kernel: block drbd0: self 453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:162 flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: uuid_compare()=1 by rule 34
Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent )
Aug 12 15:22:49 dbpg02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Aug 12 15:22:50 dbpg02 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Aug 12 15:22:50 dbpg02 kernel: block drbd0: Began resync as SyncSource (will sync 672 KB [168 bits set]).
Aug 12 15:22:51 dbpg02 kernel: block drbd0: updated sync UUID 453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197:FAE6FAB0FA09B197
Aug 12 15:22:52 dbpg02 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 672 K/sec)
Aug 12 15:22:52 dbpg02 kernel: block drbd0: updated UUIDs 453310593995C7C3:0000000000000000:803193A96DCAEE9C:FAE7FAB0FA09B197
Aug 12 15:22:52 dbpg02 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Aug 12 15:22:52 dbpg02 crmd[5746]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: sock was shut down by peer
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: short read (expected size 16)
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: meta connection shut down by peer.
Aug 12 15:22:53 dbpg02 kernel: block drbd0: new current UUID EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Connection closed
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm fence-peer drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: invoked for drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:54 dbpg02 crmd[2719]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: INFO peer is reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-drbd0-ms-drbd0'
Aug 12 15:22:54 dbpg02 pengine[2718]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 12 15:22:54 dbpg02 pengine[2718]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 12 15:22:54 dbpg02 pengine[2718]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 12 15:22:54 dbpg02 pengine[2718]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 15:22:54 dbpg02 pengine[2718]: notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-20.bz2
Aug 12 15:22:54 dbpg02 crmd[2719]: notice: run_graph: Transition 8 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-20.bz2): Complete
Aug 12 15:22:54 dbpg02 crmd[2719]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm fence-peer drbd0 exit code 4 (0x400)
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: fence-peer helper returned 4 (peer was fenced)
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: pdsk( DUnknown -> Outdated )
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams )
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [3465])
Aug 12 15:22:55 dbpg02 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:55 dbpg02 kernel: block drbd0: self EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 flags:0
Aug 12 15:22:56 dbpg02 kernel: block drbd0: peer 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:56 dbpg02 kernel: block drbd0: uuid_compare()=2 by rule 80
Aug 12 15:22:56 dbpg02 kernel: block drbd0: Becoming sync source due to disk states.
Aug 12 15:22:56 dbpg02 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
Aug 12 15:22:56 dbpg02 kernel: block drbd0: bitmap WRITE of 6144 pages took 218 jiffies
Aug 12 15:22:57 dbpg02 kernel: block drbd0: 768 GB (201320439 bits) marked out-of-sync by on disk bit-map.



secondary node (SyncTarget)


Aug 12 15:22:45 dbpg01 kernel: block drbd0: Resync done (total 8 sec; paused 0 sec; 13212 K/sec)
Aug 12 15:22:45 dbpg01 kernel: block drbd0: updated UUIDs 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:45 dbpg01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Aug 12 15:22:45 dbpg01 crm-unfence-peer.sh[7419]: invoked for drbd0
Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: sock was shut down by peer
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: short read (expected size 16)
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: meta connection shut down by peer.
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Connection closed
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams )
Aug 12 15:22:48 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [7326])
Aug 12 15:22:48 dbpg01 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:48 dbpg01 kernel: block drbd0: self 453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: was SyncTarget, peer missed the resync finished event, corrected peer:
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:152 flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: uuid_compare()=-1 by rule 35
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
Aug 12 15:22:49 dbpg01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 12 15:22:51 dbpg01 kernel: block drbd0: updated sync uuid 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:51 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Aug 12 15:22:51 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will sync 672 KB [168 bits set]).
Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Connection closed
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected )
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network protocol version 101
Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol level
Aug 12 15:22:54 dbpg01 crmd[6750]: notice: handle_request: Current ping state: S_NOT_DC
Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams )
Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from drbd_r_drbd0 [7326])
Aug 12 15:22:55 dbpg01 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:55 dbpg01 kernel: block drbd0: self 803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 flags:0
Aug 12 15:22:56 dbpg01 kernel: block drbd0: peer EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 flags:0
Aug 12 15:22:56 dbpg01 kernel: block drbd0: uuid_compare()=-2 by rule 60
Aug 12 15:22:56 dbpg01 kernel: block drbd0: Becoming sync target due to disk states.
Aug 12 15:22:56 dbpg01 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
Aug 12 15:22:56 dbpg01 kernel: block drbd0: bitmap WRITE of 6144 pages took 298 jiffies
Aug 12 15:22:56 dbpg01 kernel: block drbd0: 768 GB (201320439 bits) marked out-of-sync by on disk bit-map.
Aug 12 15:22:57 dbpg01 attrd[6748]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd0 (<null>)
Aug 12 15:22:57 dbpg01 attrd[6748]: notice: attrd_perform_update: Sent delete 19: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, set=(null), section=status
Aug 12 15:22:57 dbpg01 attrd[6748]: notice: attrd_perform_update: Sent delete 21: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, set=(null), section=status
Aug 12 15:22:57 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 12 15:22:57 dbpg01 crmd[7488]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 12 15:22:57 dbpg01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 12 15:22:58 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 12 15:22:58 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 12 15:22:59 dbpg01 kernel: block drbd0: updated sync uuid 453410593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:59 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget )
Aug 12 15:22:59 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will sync 805281756 KB [201320439 bits set]).
--
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265
Lars Ellenberg
2015-08-18 16:59:18 UTC
Post by Viktor Villafuerte
Hi all,
I've got a problem very similar to the one described in
http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html
OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
kernel: kernel-2.6.32-573.1.1.el6.x86_64
DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64
Two-node cluster where I've encountered problems after upgrading to the
latest kernel plus Corosync, Pacemaker and DRBD.
Did you upgrade both nodes yet?
Post by Viktor Villafuerte
The secondary does a full resync and seems to finish it successfully, but
then 'PingAck did not arrive in time' happens and the resync repeats.
This keeps happening in a loop thereafter.
You could increase the "ping timeout" for now,
and see if that gets you over this odd issue.
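A minimal sketch of what that could look like (the resource name r0 is a
placeholder; note that ping-timeout is given in tenths of a second):

```
resource r0 {
  net {
    ping-timeout 50;   # 5 seconds; unit is 1/10 second, default is 5 (i.e. 500 ms)
    # ping-int 10;     # seconds between keep-alive pings, default 10
  }
}
```

After editing the config, `drbdadm adjust r0` on both nodes should apply
the new value without restarting the resource.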
Post by Viktor Villafuerte
Please note that I could not observe any network problems, nor any
problems with the resync or any other services, except at the very end
of the resync.
I've tried rebooting the secondary (target) node, invalidating the data
and forcing a full resync by hand, waiting through about 4 rounds of the
loop :) But to no avail.
Currently I'm in the process of downgrading the kernel to see what
happens, but as it's a production server this will take a little time
(outage) :/
I'm also fairly new to DRBD so please point out anything obvious I may
be missing.
thanks
v
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
Viktor Villafuerte
2015-08-18 23:37:25 UTC
Hi Lars,

thanks for your answer. Please see more details below
Post by Lars Ellenberg
Post by Viktor Villafuerte
Hi all,
I've got a problem very similar to the one described in
http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html
OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
kernel: kernel-2.6.32-573.1.1.el6.x86_64
DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64
Two-node cluster where I've encountered problems after upgrading to the
latest kernel plus Corosync, Pacemaker and DRBD.
Did you upgrade both nodes yet?
Yes, both were upgraded. When I first failed Corosync + friends over onto
the newly upgraded node (from the not-yet-upgraded node), only a partial
sync happened and no problems were found. After upgrading the second node
a full resync was requested, and it never recovered from that.
Post by Lars Ellenberg
Post by Viktor Villafuerte
The secondary does a full resync and seems to finish it successfully, but
then 'PingAck did not arrive in time' happens and the resync repeats.
This keeps happening in a loop thereafter.
You could increase the "ping timeout" for now,
and see if that gets you over this odd issue.
I'll give this a go, thanks. I assume that will require a restart..?
Post by Lars Ellenberg
Post by Viktor Villafuerte
Please note that I could not observe any network problems, nor any
problems with the resync or any other services, except at the very end
of the resync.
I've tried rebooting the secondary (target) node, invalidating the data
and forcing a full resync by hand, waiting through about 4 rounds of the
loop :) But to no avail.
Currently I'm in the process of downgrading the kernel to see what
happens, but as it's a production server this will take a little time
(outage) :/
I downgraded both nodes to kernel-2.6.32-504.23.4.el6.x86_64 but there was
no change, as the sync just did another full loop :/
Post by Lars Ellenberg
Post by Viktor Villafuerte
I'm also fairly new to DRBD so please point out anything obvious I may
be missing.
thanks
v
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
http://lists.linbit.com/mailman/listinfo/drbd-user
--
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265
Viktor Villafuerte
2015-08-19 04:35:26 UTC
Post by Viktor Villafuerte
Hi Lars,
thanks for your answer. Please see more details below
Post by Lars Ellenberg
Post by Viktor Villafuerte
Hi all,
I've got a problem very similar to the one described in
http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html
OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
kernel: kernel-2.6.32-573.1.1.el6.x86_64
DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64
Two-node cluster where I've encountered problems after upgrading to the
latest kernel plus Corosync, Pacemaker and DRBD.
Did you upgrade both nodes yet?
Yes, both were upgraded. When I first failed Corosync + friends over onto
the newly upgraded node (from the not-yet-upgraded node), only a partial
sync happened and no problems were found. After upgrading the second node
a full resync was requested, and it never recovered from that.
Post by Lars Ellenberg
Post by Viktor Villafuerte
The secondary does a full resync and seems to finish it successfully, but
then 'PingAck did not arrive in time' happens and the resync repeats.
This keeps happening in a loop thereafter.
You could increase the "ping timeout" for now,
and see if that gets you over this odd issue.
I'll give this a go, thanks. I assume that will require a restart..?
I've increased the ping-timeout value to 5 seconds just before the sync
finished, and only on the SyncTarget side.

After the sync finished it didn't start again, but ended up in the
following state - logs below

(see more explanation after the logs)
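(A change like this can also be applied at runtime, e.g. with something
like the following sketch, where r0 stands for the actual resource name
and the value is in tenths of a second:

```
drbdadm net-options --ping-timeout=50 r0   # 50 = 5 seconds
```
)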

SyncSource

Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
Aug 19 12:29:27 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 19 12:29:27 dbpg02 pengine[6414]: notice: process_pe_message: Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-77.bz2
Aug 19 12:29:27 dbpg02 crmd[6415]: notice: run_graph: Transition 57 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-77.bz2): Complete
Aug 19 12:29:27 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected )
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 19 12:29:27 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 19 12:29:28 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:31 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:32 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:35 dbpg02 kernel: device eth1 entered promiscuous mode
Aug 19 12:29:36 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:36 dbpg02 pengine[6414]: notice: process_pe_message: Calculated Transition 58: /var/lib/pacemaker/pengine/pe-input-78.bz2
Aug 19 12:29:36 dbpg02 crmd[6415]: notice: run_graph: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-78.bz2): Complete
Aug 19 12:29:36 dbpg02 crmd[6415]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:36 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:37 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:49 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:50 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:54 dbpg02 pacemakerd[8284]: notice: crm_add_logfile: Additional logging available in /var/log/corosync/corosync.log
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:55 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:29:59 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:30:00 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:30:04 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:30:05 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: Connection closed
Aug 19 12:30:09 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:30:10 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:30:21 dbpg02 kernel: drbd drbd0: sock_recvmsg returned -11
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: conn( WFConnection -> BrokenPipe )
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: short read (expected size 8)
Aug 19 12:30:22 dbpg02 kernel: drbd drbd0: Connection closed
..
.. repeat sock_recvmsg..
..




SyncTarget

Aug 19 12:29:27 dbpg01 kernel: block drbd0: Resync done (total 24958 sec; paused 0 sec; 32264 K/sec)
Aug 19 12:29:27 dbpg01 kernel: block drbd0: updated UUIDs BFF7A490315C8E1C:0000000000000000:08956592516D4BB2:08946592516D4BB3
Aug 19 12:29:27 dbpg01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Aug 19 12:29:27 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Aug 19 12:29:27 dbpg01 crm-unfence-peer.sh[5847]: invoked for drbd0
Aug 19 12:29:27 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: sock was shut down by peer
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 19 12:29:27 dbpg01 kernel: drbd drbd0: short read (expected size 16)
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: meta connection shut down by peer.
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Connection closed
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected )
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 19 12:29:28 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 19 12:29:32 dbpg01 kernel: device eth1 entered promiscuous mode
Aug 19 12:29:36 dbpg01 attrd[4345]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd0 (1000)
Aug 19 12:29:36 dbpg01 attrd[4345]: notice: attrd_perform_update: Sent update 14: master-drbd0=1000
Aug 19 12:29:38 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:29:38 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:29:56 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:01 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:11 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:29 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:29 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:35 dbpg01 crmd[6239]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 19 12:30:47 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:30:47 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:04 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:09 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:09 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:12 dbpg01 kernel: device eth1 left promiscuous mode
Aug 19 12:31:30 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:35 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:35 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:31:49 dbpg01 kernel: drbd drbd0: initial packet S crossed
Aug 19 12:32:06 dbpg01 kernel: drbd drbd0: initial packet S crossed
..
.. repeat initial packet S crossed
..



However, the sync didn't start again from 0%, which was a (good) change.
I saw the initial packets crossing but nothing happened afterwards, so I
thought maybe the timeout was too long. So I reset it to 1 s, after which
the secondary caught on, did a partial sync of whatever was out of sync
(about 5 seconds), and now both nodes are UpToDate :)



Would you have any possible explanation for this? I did read the man
pages about timeouts etc., but I wonder why, when the connection is
broken after the sync, it sometimes does a full sync while this time it
did only a partial sync? The connection was broken just the same
(according to the logs).
Post by Viktor Villafuerte
Post by Lars Ellenberg
Post by Viktor Villafuerte
Please note that I could not observe any network problems, nor any
problems with the resync or any other services, except at the very end
of the resync.
I've tried reboot the secondary (target) node, invalidating the data
and forcing full resync by hand, wating about 4 rounds of the loop :)
But to no avail.
Currently I'm in process of downgrading the kernel to see what happens
but as it's a production server this will take little time (outage) :/
Downgraded both nodes to kernel-2.6.32-504.23.4.el6.x86_64 but no
change; the sync just did another full round :/
Post by Lars Ellenberg
Post by Viktor Villafuerte
I'm also fairly new to DRBD so please point out anything obvious I may
be missing.
thanks
v
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
http://lists.linbit.com/mailman/listinfo/drbd-user
--
Regards
Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265
Helmut Wollmersdorfer
2015-08-20 07:14:19 UTC
Permalink
Post by Viktor Villafuerte
Would you have any possible explanation for this? I did read the man pages
about timeouts etc., but I wonder why sometimes, when the connection breaks
after a sync, it does a full sync, while this time it did only a partial
sync? The connection was broken just the same (according to the logs).
Maybe network hiccups. I had some inexplicable problems when migrating DRBD devices to other clusters using the service IPs. Sometimes it worked, sometimes not.

After adding

net {
    data-integrity-alg crc32c;   # crc32c, sha1, md5
}
syncer {
    csums-alg crc32c;   # crc32c, sha1, md5
}

to the configuration, this sort of problem went away.

HTH

Helmut Wollmersdorfer
Cesar Peschiera
2015-08-20 17:35:57 UTC
Permalink
Post by Helmut Wollmersdorfer
net {
data-integrity-alg crc32c; # crc32c, sha1, md5
}
This option should only be used for testing purposes, never in a
production environment.
This directive commonly breaks the replication link, causing DRBD to
disconnect and, obviously, leaving blocks out of sync (OOS).
Post by Helmut Wollmersdorfer
syncer {
csums-alg crc32c; # crc32c, sha1, md5
}
I think it is better to use "sha1", as it is more robust.
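
For illustration, the matching setting might look roughly like this (a
sketch only; note that in 8.4-style configs csums-alg belongs in the net
section, the separate syncer section is 8.3 syntax -- see drbd.conf(5)
for your version):

net {
    csums-alg sha1;   # checksum-based resync: more CPU, but only changed blocks travel over the wire
}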

And finally, it is highly advisable to use the latest stable versions:
- drbd-utils 8.9.3
- drbd-8.4.6 or drbd-9.0.0 (depending on how many DRBD replicated
volumes you intend to run)

Three hardware tips (common mistakes many people make):
1- Never use Realtek NICs; at least for me, DRBD didn't work well with
that brand.
(though I haven't tested the latest driver version... ???)

2- Moreover, if you have servers with Intel processors and your NICs
support the I/OAT DMA Engine, it is better to enable it in the server's
BIOS. With it enabled, the NICs have direct access to the server's memory
(which obviously gains speed) and they consume fewer CPU cycles.

3- It is always advisable to update the firmware of every hardware
component in your servers (NICs, BIOS, chipsets, iDRAC, RAID controllers, etc.)


Viktor Villafuerte
2015-08-20 23:28:49 UTC
Permalink
Thanks all for your input.

Setting ping-timeout to 1s (instead of the default 0.5s) seems to be
working for me. As I had recovered the other node and made it primary,
I had to do a HW replacement (everything that could go wrong with this
upgrade did go wrong), so I got to test this again. I left the default
in place and the PingAck problem happened; then I set ping-timeout to 1s
and went to sleep. In the morning DRBD was in UpToDate/UpToDate, which
made me very happy.

Here's the command in case it helps anybody..

# 10 => 10 x 0.1s = 1s
drbdadm net-options --ping-timeout=10 <resource>
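
If it helps, the same value can also be made persistent in the resource
file, roughly like this (a sketch only; "r0" is a placeholder resource
name, check drbd.conf(5) for your version):

resource r0 {
    net {
        ping-timeout 10;   # in tenths of a second: 10 => 1s (default 5 => 0.5s)
    }
}

followed by 'drbdadm adjust r0' to apply the change.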


thanks

v
Post by Cesar Peschiera
Post by Helmut Wollmersdorfer
net {
data-integrity-alg crc32c; # crc32c, sha1, md5
}
This option should only be used for testing purposes, never in a
production environment.
This directive commonly breaks the replication link, causing DRBD to
disconnect and, obviously, leaving blocks out of sync (OOS).
Post by Helmut Wollmersdorfer
syncer {
csums-alg crc32c; # crc32c, sha1, md5
}
I think it is better to use "sha1", as it is more robust.
- drbd-utils 8.9.3
- drbd-8.4.6 or drbd-9.0.0 (depending on how many DRBD replicated
volumes you intend to run)
1- Never use Realtek NICs; at least for me, DRBD didn't work well with
that brand.
(though I haven't tested the latest driver version... ???)
2- Moreover, if you have servers with Intel processors and your NICs
support the I/OAT DMA Engine, it is better to enable it in the server's
BIOS. With it enabled, the NICs have direct access to the server's memory
(which obviously gains speed) and they consume fewer CPU cycles.
3- It is always advisable to update the firmware of every hardware
component in your servers (NICs, BIOS, chipsets, iDRAC, RAID controllers, etc.)
--
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265