Bug #24396
osd crashes in on_local_recover due to stray clone (Closed)
Description
Hello,
Today one of our Ceph clusters went into HEALTH_WARN state, and a couple of OSDs subsequently crashed and fell into a boot loop.
Cluster details: version Luminous 12.2.5 stable (updated from 12.2.4 a month ago); pool type replicated with 2 replicas; 60 OSDs; SSD storage (direct storage, not a caching tier).
The initial problem was
2018-06-03 12:38:12.834210 7f94b17b1700 -1 log_channel(cluster) log [ERR] : scrub 12.ade 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b is an unexpected clone
So we ran a deep scrub and then a repair on that PG, and after the repair the OSD started flapping with assert(p != recovery_info.ss.clone_snaps.end()):
2018-06-03 23:56:32.719840 7f7cb759a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction)' thread 7f7cb759a700 time 2018-06-03 23:56:32.715249
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
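For reference, the scrub and repair above were triggered with the standard PG commands (a sketch; the PG id is taken from the scrub error above):

```shell
# Deep-scrub the PG that reported the unexpected clone
ceph pg deep-scrub 12.ade

# After reviewing the scrub result, attempt a repair of the same PG
ceph pg repair 12.ade
```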
We marked it down and out of the cluster, and then a second OSD started showing the exact same symptoms. Since then I have read extensively about using ceph-objectstore-tool to remove the clone metadata for rbd_data.36da116b8b4567, but have failed to achieve it. For example, even:
ceph-objectstore-tool --cluster XXXX --pgid 12.ade --op info --data-path /dev/disk/by-partuuid/cf8a45fe-0105-4b5c-81d9-cfb21e1a638d --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a
gives me a syntax error.
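One likely cause of the syntax error: --data-path expects the OSD's mounted data directory, not a raw partition device. Below is a hedged sketch of how the tool is typically invoked for this kind of fix; the mount path and OSD id are assumed standard defaults, the OSD must be stopped first, and the remove-clone-metadata step should be double-checked against the tool's help output before running:

```shell
# Stop the OSD first; ceph-objectstore-tool needs exclusive access to the store
systemctl stop ceph-osd@45

# Assumed standard filestore mount point for osd.45 (adjust to your layout)
OSD=/var/lib/ceph/osd/ceph-45

# List objects in the PG, filtered by name, to get the JSON object spec
ceph-objectstore-tool --data-path "$OSD" --journal-path "$OSD/journal" \
    --pgid 12.ade --op list rbd_data.36da116b8b4567.000000000003bd5f

# Remove the stray clone metadata, passing the JSON spec printed above
# ($OBJ_JSON is a placeholder; clone id 1275b is from the scrub error)
ceph-objectstore-tool --data-path "$OSD" --journal-path "$OSD/journal" \
    --pgid 12.ade "$OBJ_JSON" remove-clone-metadata 1275b
```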
We now have 4 pgs down at the moment because of the second OSD going down on its own:
pg 12.2e is down, acting [19]
pg 12.26d is down, acting [20]
pg 12.d2b is down, acting [11]
pg 12.ecb is down, acting [20]
(The OSDs stuck in the boot loop are 45 and 53.)
I enabled OSD debug logging at level 20 and can see many errors like the following for different RBD images sharing the corrupted PGs:
2018-06-03 23:40:10.014258 7f3e920a2d00 20 read_log_and_missing 339927'44627733 (0'0) error 12:77c225f0:::rbd_data.231c236b8b4567.0000000000011837:head by client.6233256.0:46661 0.000000 -2
2018-06-03 23:40:09.168845 7f3e920a2d00 20 read_log_and_missing 340045'59503259 (0'0) error 12:ed2d2e35:::rbd_data.5f2b656b8b4567.0000000000029022:head by client.6233217.0:359414 0.000000 0
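In case it helps reproduce the logging above: OSD debug level can be raised roughly like this (a sketch; osd.53 is just the example from this report):

```shell
# Raise OSD debug logging at runtime on a daemon that stays up
ceph tell osd.53 injectargs '--debug-osd 20'

# For an OSD stuck in a boot loop, set it in ceph.conf instead,
# so it takes effect at startup:
#   [osd]
#   debug osd = 20
```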
I think at this point we lack the expertise to take further steps toward fixing the assert(p != recovery_info.ss.clone_snaps.end()) problem.
Could somebody who has experienced (I have seen a few similar threads) and resolved this problem point us in the right direction?
Thanks in advance for any information on this.
P.S. The OSD log for osd.53, which we left running in the boot loop, shows progress in backfilling data to other PGs. I am not sure whether this is wise at this point.