Discussion: [DRBD-user] 8 Zettabytes out-of-sync?
Jarno Elonen
2018-10-31 18:46:13 UTC
I've got several DRBD 9 resources that constantly show *UpToDate* with an
out-of-sync count of 9223372036854774304 KiB (just under 8 ZiB).

Any idea what might cause this and how to fix it?

Example:

# drbdsetup status --verbose --statistics vm-106-disk-1
vm-106-disk-1 node-id:0 role:Primary suspended:no
write-ordering:flush
volume:0 minor:1003 disk:UpToDate quorum:yes
size:16777688 read:215779 written:22369564 al-writes:89 bm-writes:0
upper-pending:0
lower-pending:0 al-suspended:no blocked:no
mox-a node-id:1 connection:Connected role:Secondary congested:no
ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
received:215116 sent:22368903 out-of-sync:9223372036854774304
pending:0 unacked:0
mox-c node-id:2 connection:Connected role:Secondary congested:no
ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
received:1188 sent:19884428 out-of-sync:0 pending:0 unacked:0
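
For reference, here is how those two odd-looking numbers decode; both sit
just below a power of two, and 2^63 KiB is exactly 8 ZiB (the KiB unit for
out-of-sync is an assumption on my part, but it matches the ZiB figure).
Quick check with plain bash arithmetic:

  # out-of-sync is only 1504 short of 2^63 (and 2^63 KiB = 8 ZiB):
  echo $(( 9223372036854775807 - 9223372036854774304 + 1 ))   # prints 1504
  # rs-in-flight is exactly what -1808 looks like as an unsigned 64-bit number:
  printf '%u\n' -1808                                          # prints 18446744073709549808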

Version info:
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by ***@mox-b,
2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)

-Jarno
Jarno Elonen
2018-11-01 15:32:45 UTC
Okay, today one of these resources suffered sudden, severe filesystem
corruption on the primary.

On the other hand, the secondaries (which showed 8 ZiB out-of-sync) were
still mountable after I disconnected the corrupted primary. No idea how
current the secondaries' data was, but drbdtop still showed them as
connected and 8 ZiB out-of-sync.

This is getting quite worrisome. Is anyone else experiencing this with DRBD
9? Is something really wrong in my setup, or are there perhaps some
known instabilities in DRBD 9.0.15-1?

-Jarno
Post by Jarno Elonen
I've got several DRBD 9 resources that constantly show *UpToDate* with an
out-of-sync count of 9223372036854774304 KiB (just under 8 ZiB).
Any idea what might cause this and how to fix it?
[...]
Jarno Elonen
2018-11-01 20:30:59 UTC
Here's some more info.
dmesg shows some suspicious-looking log messages, such as:

1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive bitmap'
by drbd_r_vm-117-s[5038]

2) Wrong magic value 0xffff0007 in protocol version 114

3) peer request with dagtag 399201392 not found
got_peer_ack [drbd] failed

4) Rejecting concurrent remote state change 2226202936 because of state
change 2923939731
Ignoring P_TWOPC_ABORT packet 2226202936.

5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but bitmap
already locked for 'write from resync_finished' by drbd_w_vm-117-s[2812]
md_sync_timer expired! Worker calls drbd_md_sync().

6) incompatible discard-my-data settings
conn( Connecting -> Disconnecting )
error receiving P_PROTOCOL, e: -5 l: 7!

Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All of them
are on API version 16:

== mox-a ==
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by ***@mox-a,
2018-10-28 03:08:58
Transports (api:16): tcp (9.0.15-1)

== mox-b ==
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by ***@mox-b,
2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)

== mox-c ==
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by ***@mox-c,
2018-10-28 05:45:05
Transports (api:16): tcp (9.0.16-1)

== mox-d ==
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by ***@mox-d,
2018-10-29 00:22:23
Transports (api:16): tcp (9.0.16-1)
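
(The version blocks above are the same information /proc/drbd reports; if
anyone wants to compare their own nodes, a loop like the following collects
it in one go, assuming root SSH access between the hosts:)

  for h in mox-a mox-b mox-c mox-d; do
      echo "== $h =="
      ssh "$h" cat /proc/drbd
  done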

Running Proxmox (5.2-2), as you'd guess from the host names. The DRBD
resources are managed by LINSTOR.
Post by Jarno Elonen
Okay, today one of these resources suffered sudden, severe filesystem
corruption on the primary.
[...]
Jarno Elonen
2018-11-02 08:45:16 UTC
More clues:

Just witnessed a resync (after an invalidate) steadily go from 100%
out-of-sync to 0% (after several automatic disconnects and reconnects).
Immediately after reaching 0%, it jumped to a negative -<very-large-number>%!
After that, drbdtop started showing 8.0 ZiB out-of-sync.

Looks like a severe wrap-around bug.
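
The arithmetic fits: if an unsigned 64-bit out-of-sync counter is decremented
slightly past zero at the end of a resync, it wraps to a value just under
2^64 instead of going negative, and anything derived from it (the percentage,
the ZiB figure) turns nonsensical. A toy bash illustration of the integer
behaviour, not of DRBD's actual code path:

  oos=1000
  oos=$(( oos - 3000 ))   # the counter ends up 2000 "below zero"
  printf '%u\n' "$oos"    # prints 18446744073709549616, i.e. 2^64 - 2000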

-Jarno
Post by Jarno Elonen
Here's some more info.
[...]
Eddie Chapman
2018-11-02 10:33:33 UTC
Post by Jarno Elonen
Just witnessed a resync (after an invalidate) steadily go from 100%
out-of-sync to 0% (after several automatic disconnects and reconnects).
Immediately after reaching 0%, it jumped to a negative -<very-large-number>%!
After that, drbdtop started showing 8.0 ZiB out-of-sync.
Looks like a severe wrap-around bug.
-Jarno
[...]
Not exactly the same issue you are seeing, but I have had an issue this
week with a newly created resource on a 9.0.16-1 primary against a
9.0.13-1 secondary.

As soon as I started writing to the new primary, the secondary started
repeatedly disconnecting with the error:

drbd resource274 primary.host: Unexpected data packet ? (0x0036)

followed by a resync (and then the same error again, followed by another
resync, and so on).

Probably completely unrelated to your issues, and I know there are a
_lot_ of bug fixes between 9.0.13-1 and 9.0.16-1 (and I _do_ have a
long-overdue update of the secondary planned very soon).

Theoretically, different 9.0.x kernel module versions should be able to
work together (same API). In practice, though, I avoid it and usually update
DRBD and the kernel at the same time on all nodes.

So it could be that 9.0.16-1 has particular problems co-operating with
earlier versions, perhaps more so than other releases.

Eddie
Yannis Milios
2018-11-02 10:16:54 UTC
Post by Jarno Elonen
This is getting quite worrisome. Is anyone else experiencing this with
DRBD 9? Is something really wrong in my setup, or are there perhaps some
known instabilities in DRBD 9.0.15-1?
Yes, I have been facing this as well on all "recent" versions of DRBD 9
(currently I'm on 9.0.16-1), on some of the resources. The way I usually
get that sorted is by disconnecting and discarding the secondaries, but
yes, I would agree with you that it looks a bit worrisome...
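
Roughly, on the affected secondary that means something like the following
(resource name is only an example; --discard-my-data throws away whatever
the secondary has, so be sure the primary really is the copy you want to
keep):

  drbdadm disconnect vm-106-disk-1
  drbdadm connect --discard-my-data vm-106-disk-1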

Yannis
Roberto Resoli
2018-11-02 13:23:03 UTC
Post by Yannis Milios
Post by Jarno Elonen
This is getting quite worrisome. Is anyone else experiencing this with
DRBD 9? Is something really wrong in my setup, or are there perhaps some
known instabilities in DRBD 9.0.15-1?
Same here, mostly after rebooting a node.
Post by Yannis Milios
Yes, I have been facing this as well on all "recent" versions of DRBD 9
(currently I'm on 9.0.16-1), on some of the resources. The way I usually
get that sorted is by disconnecting and discarding the secondaries,
I found that taking the resource down on the primary node (disconnecting is
not sufficient) and then adjusting it solves the problem, without any resync.
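
In other words, something along these lines on the primary node (resource
name is only an example; nothing may be using the device while it is down):

  drbdadm down vm-106-disk-1
  drbdadm adjust vm-106-disk-1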
Post by Yannis Milios
but yes, I would agree with you that it looks a bit worrisome...
Imho definitely some bug ...

Bye,
rob
Post by Yannis Milios
Yannis