RHEL 8.4: libceph considers osdmap corrupt if numbering is discontinuous
If I try to map an rbd image from a cluster where the osd numbering is discontinuous, it fails with the message
[85115.711643] libceph: corrupt full osdmap (-2) epoch 248869 off 10063
The cluster runs nautilus 14.2.18 on RHEL 7. The ceph-common package on the client is also nautilus and the client OS is RHEL 8. The mapping succeeds from clients running RHEL 7. I tried with the mimic, nautilus and octopus ceph-client packages.
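For anyone wanting to check whether their own cluster has discontinuous osd numbering, here is a minimal sketch. It assumes the ids come from `ceph osd ls`, which prints one osd id per line; the sample ids and the function names are made up for illustration and are not taken from the affected cluster.

```python
def parse_osd_ls(text):
    """Parse `ceph osd ls`-style output: one integer osd id per line."""
    return [int(token) for token in text.split()]


def find_missing_osd_ids(osd_ids):
    """Return the ids that are absent from the range 0..max(osd_ids)."""
    present = set(osd_ids)
    if not present:
        return []
    return [i for i in range(max(present) + 1) if i not in present]


# Hypothetical sample output, not from the cluster in this report.
sample = "0\n1\n2\n5\n6\n9\n"
print(find_missing_osd_ids(parse_osd_ls(sample)))  # -> [3, 4, 7, 8]
```

An empty result means the numbering is contiguous from 0.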
# modinfo libceph
filename:       /lib/modules/4.18.0-305.3.1.el8_4.x86_64/kernel/net/ceph/libceph.ko.xz
license:        GPL
description:    Ceph core library
author:         Patience Warnick <firstname.lastname@example.org>
author:         Yehuda Sadeh <email@example.com>
author:         Sage Weil <firstname.lastname@example.org>
rhelversion:    8.4
srcversion:     4A720ED724979ABE2F86C68
depends:        libcrc32c,dns_resolver
intree:         Y
name:           libceph
vermagic:       4.18.0-305.3.1.el8_4.x86_64 SMP mod_unload modversions
The dmesg output with kernel debugging enabled on the client was too large to attach. It can be downloaded from here.
#2 Updated by Ilya Dryomov 4 days ago
The linked output is incomplete -- it starts with
[85113.782261] front: 00012370: 00 00 79 a7 0f 00 10 00 00 00 02 00 1a b9 0a 65 ..y............e
which is in the middle of the message dump.
Could you please attach the entire output? Just regular dmesg, no need to enable libceph debugging, but I need to see everything starting with the monitor session being established.
#5 Updated by Ilya Dryomov 4 days ago
No need for additional output, I managed to reproduce with the attached osdmap:
[ 531.999627] libceph: no match of type 1 in addrvec
[ 532.001980] libceph: corrupt full osdmap (-2) epoch 248869 off 10063 (00000000ca5d89f8 of 000000003ecd15f3-000000002276d85d)
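To spot this failure in collected dmesg logs, the "corrupt full osdmap" line can be picked apart with a regular expression. This is only a sketch: the pattern is derived from the two log lines quoted in this thread, and the function name and result keys are made up for illustration. The number in parentheses is treated as a negated errno value, which for -2 corresponds to ENOENT.

```python
import errno
import re

# Pattern inferred from the log lines quoted in this thread (assumption).
CORRUPT_OSDMAP_RE = re.compile(
    r"libceph: corrupt (?P<kind>\w+) osdmap \((?P<err>-?\d+)\) "
    r"epoch (?P<epoch>\d+) off (?P<off>\d+)"
)


def parse_corrupt_osdmap(line):
    """Return the fields of a 'corrupt ... osdmap' line, or None if no match."""
    m = CORRUPT_OSDMAP_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    return {
        "kind": m.group("kind"),
        "errno": err,
        # Map the (negated) error code to its symbolic errno name.
        "errno_name": errno.errorcode.get(abs(err), "unknown"),
        "epoch": int(m.group("epoch")),
        "offset": int(m.group("off")),
    }


line = "[  532.001980] libceph: corrupt full osdmap (-2) epoch 248869 off 10063"
print(parse_corrupt_osdmap(line))
```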
This is fixed upstream, but not in CentOS/RHEL 8.4:
The fix was backported to 5.11.20 and 5.12.3.
Kernels 5.10 and older are not affected.
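The version boundaries above can be turned into a quick check for upstream kernels. This is a rough sketch, not an authoritative test: the function name and return strings are made up, kernels newer than the 5.12 series are only assumed to carry the fix because it is described as fixed upstream, and distro kernels such as the 4.18.0-305 el8 kernel in this report carry backports, so their version number says nothing either way.

```python
import re


def classify_kernel(release):
    """Classify a `uname -r` style release string against this thread's data."""
    # Distro (el*) kernels contain backports; the version alone is not enough.
    if re.search(r"\.el\d", release):
        return "distro kernel, cannot tell from version"
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if series <= (5, 10):
        return "not affected"
    if series == (5, 11):
        return "fixed" if patch >= 20 else "affected"
    if series == (5, 12):
        return "fixed" if patch >= 3 else "affected"
    # Assumption: later upstream series include the upstream fix.
    return "presumably fixed"


for rel in ("5.10.37", "5.11.19", "5.11.20", "5.12.3",
            "4.18.0-305.3.1.el8_4.x86_64"):
    print(rel, "->", classify_kernel(rel))
```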