Bug #51112
RHEL 8.4: libceph considers osdmap corrupt if numbering is discontinuous
Description
If I try to map an RBD image from a cluster where OSD numbering is discontinuous, it fails with the message
[85115.711643] libceph: corrupt full osdmap (-2) epoch 248869 off 10063
The cluster runs Nautilus 14.2.18 on RHEL 7. The ceph-common package on the client is also Nautilus and the OS is RHEL 8. The mapping succeeds from clients running RHEL 7. I tried with Mimic, Nautilus and Octopus ceph-client packages.
# modinfo libceph
filename:       /lib/modules/4.18.0-305.3.1.el8_4.x86_64/kernel/net/ceph/libceph.ko.xz
license:        GPL
description:    Ceph core library
author:         Patience Warnick <patience@newdream.net>
author:         Yehuda Sadeh <yehuda@hq.newdream.net>
author:         Sage Weil <sage@newdream.net>
rhelversion:    8.4
srcversion:     4A720ED724979ABE2F86C68
depends:        libcrc32c,dns_resolver
intree:         Y
name:           libceph
vermagic:       4.18.0-305.3.1.el8_4.x86_64 SMP mod_unload modversions
The dmesg output from the client with kernel debugging enabled was too large to attach. It can be downloaded from here:
https://s3.datacloud.helsinki.fi/matti:public/client-dmesg-with-kernel-debug-enabled
History
#1 Updated by Ilya Dryomov almost 2 years ago
- Subject changed from RHEL 8: libceph considers osdmap corrupt if numbering is discontinuous to RHEL 8.4: libceph considers osdmap corrupt if numbering is discontinuous
- Assignee set to Ilya Dryomov
#2 Updated by Ilya Dryomov almost 2 years ago
The linked output is incomplete -- it starts with
85113.782261] front: 00012370: 00 00 79 a7 0f 00 10 00 00 00 02 00 1a b9 0a 65 ..y............e
which is in the middle of the message dump.
Could you please attach the entire output? Just regular dmesg, no need to enable libceph debugging, but I need to see everything starting with the monitor session being established.
#3 Updated by Ilya Dryomov almost 2 years ago
- Category changed from rbd to libceph
#4 Updated by Ilya Dryomov almost 2 years ago
- Status changed from New to Need More Info
#5 Updated by Ilya Dryomov almost 2 years ago
No need for additional output, I managed to reproduce with the attached osdmap:
[ 531.999627] libceph: no match of type 1 in addrvec
[ 532.001980] libceph: corrupt full osdmap (-2) epoch 248869 off 10063 (00000000ca5d89f8 of 000000003ecd15f3-000000002276d85d)
This is fixed upstream, but not in CentOS/RHEL 8.4:
The fix was backported to 5.11.20 and 5.12.3.
Kernels 5.10 and older are not affected.
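The version ranges stated in this comment can be written down as a small check. This is only a sketch for upstream (kernel.org) release strings: the function names are hypothetical, and treating 5.13 and later as fixed is an assumption based on "fixed upstream". It does not apply to distribution kernels such as the RHEL 8.4 4.18.0-305 build in this report, which carry backports and are affected despite the low version number.

```python
def parse_version(release):
    """Turn an upstream release string such as '5.11.19' into (5, 11, 19)."""
    # Drop any suffix after a dash (e.g. '-rc1') and pad missing fields with 0.
    parts = release.split("-")[0].split(".")
    nums = [int(p) for p in parts[:3]]
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums)


def is_affected_upstream(release):
    """Return True if an upstream kernel release falls in the affected range."""
    v = parse_version(release)
    if v < (5, 11, 0):
        # Per the comment above: 5.10 and older are not affected.
        return False
    if v[:2] == (5, 11):
        # Fix was backported to 5.11.20.
        return v < (5, 11, 20)
    if v[:2] == (5, 12):
        # Fix was backported to 5.12.3.
        return v < (5, 12, 3)
    # Assumption: 5.13 and later already contain the upstream fix.
    return False


if __name__ == "__main__":
    for rel in ["5.10.40", "5.11.19", "5.11.20", "5.12.2", "5.12.3"]:
        print(rel, is_affected_upstream(rel))
```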
#7 Updated by Daniel van der Ster almost 2 years ago
Same issue here.
We worked around it by filling all the ID gaps with "ceph osd new `uuid`", then purging those newly created gap OSDs with "ceph osd purge <id>".
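To see which IDs would need filling, the gaps can be computed from the list of OSD IDs in use. The sketch below is illustrative only: find_id_gaps and workaround_commands are hypothetical helper names, the input list is made up, and it assumes that each "ceph osd new" call is given the lowest free ID. It only prints the two commands quoted above and runs nothing against a cluster; on a real cluster the purge step may also ask for a confirmation flag.

```python
def find_id_gaps(osd_ids):
    """Return the unused OSD ids below the highest id in use, ascending."""
    used = set(osd_ids)
    if not used:
        return []
    return [i for i in range(max(used)) if i not in used]


def workaround_commands(osd_ids):
    """Build the command strings for the fill-then-purge workaround."""
    gaps = find_id_gaps(osd_ids)
    # One "ceph osd new" per gap; assumes each call takes the lowest free id.
    create = ["ceph osd new `uuid`" for _ in gaps]
    # Then purge exactly the ids that were created to fill the gaps.
    purge = ["ceph osd purge {}".format(i) for i in gaps]
    return create + purge


if __name__ == "__main__":
    # Made-up example: ids 3, 4, 7 and 8 are missing.
    for cmd in workaround_commands([0, 1, 2, 5, 6, 9]):
        print(cmd)
```

The list of IDs in use would come from the cluster itself, for example by parsing the output of "ceph osd ls".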
#8 Updated by Ilya Dryomov almost 2 years ago
- Status changed from Need More Info to Pending Backport
Moving to Pending Backport, will update once the fix lands in CentOS/RHEL 8.4.
#9 Updated by Loïc Dachary almost 2 years ago
- Target version deleted (v14.2.22)
#10 Updated by Matti Saarinen almost 2 years ago
Sorry for the delay in replying. I have been on holiday. Thanks for fixing the issue. I will test it soon.
#11 Updated by Ernesto Puerta 10 months ago
- Tags set to backport_processed