Bug #24396 (closed): osd crashes in on_local_recover due to stray clone

Added by Krasimir Velkov almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Today one of our Ceph clusters

    version: Luminous 12.2.5 stable (updated from 12.2.4 a month ago)
    pool type: replicated, 2 replicas
    OSDs: 60
    storage: SSD (direct storage, not a caching tier)

has gone into HEALTH_WARN state, and a couple of OSDs subsequently crashed and fell into a boot loop.

The initial problem was:

2018-06-03 12:38:12.834210 7f94b17b1700 -1 log_channel(cluster) log [ERR] : scrub 12.ade 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b is an unexpected clone

So we tried a pg deep-scrub and a pg repair on that PG, and after the latter the OSD started flapping with assert(p != recovery_info.ss.clone_snaps.end()):

2018-06-03 23:56:32.719840 7f7cb759a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction)' thread 7f7cb759a700 time 2018-06-03 23:56:32.715249
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
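
(For reference, the deep scrub and repair above amount to the standard CLI commands against pg 12.ade, roughly:)

    ceph pg deep-scrub 12.ade
    ceph pg repair 12.ade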

We took it down and marked it out of the cluster, and then a second OSD started showing exactly the same symptoms. Since then I have read extensively about using ceph-objectstore-tool to remove the clone metadata for rbd_data.36da116b8b4567, but failed to achieve it

(for example, even:

ceph-objectstore-tool --cluster XXXX --pgid 12.ade --op info --data-path /dev/disk/by-partuuid/cf8a45fe-0105-4b5c-81d9-cfb21e1a638d --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a

gives me a syntax error)
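
(A sketch of the invocation pattern that is usually suggested for this, assuming FileStore with the OSD stopped and its data directory mounted at /var/lib/ceph/osd/XXXX-45 -- the mount point is an assumption, the journal device and the hex snap id 1275b are taken from the report above; --data-path normally points at the mounted OSD directory rather than the raw partition:)

    # PG info for 12.ade (run with the OSD daemon stopped)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/XXXX-45 \
        --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a \
        --pgid 12.ade --op info

    # list the objects of the problem image to get the JSON object spec of the stray clone
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/XXXX-45 \
        --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a \
        --pgid 12.ade --op list rbd_data.36da116b8b4567.000000000003bd5f

    # remove the stray clone metadata for that snap id
    # (check whether the tool expects the clone id in decimal or hex)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/XXXX-45 \
        --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a \
        --pgid 12.ade '<json object spec from --op list>' remove-clone-metadata 1275b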

We now have 4 PGs down because of the second OSD going down on its own:

    pg 12.2e is down, acting [19]
    pg 12.26d is down, acting [20]
    pg 12.d2b is down, acting [11]
    pg 12.ecb is down, acting [20]

(The OSDs stuck in the boot loop are 45 and 53.)
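
(The list above looks like ceph health detail output; the individual PGs can presumably be inspected further with, e.g.:)

    ceph health detail
    ceph pg 12.2e query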

I enabled OSD debug level 20 and can see a lot of the following errors for different RBD images sharing the corrupted PGs:

2018-06-03 23:40:10.014258 7f3e920a2d00 20 read_log_and_missing 339927'44627733 (0'0) error    12:77c225f0:::rbd_data.231c236b8b4567.0000000000011837:head by client.6233256.0:46661 0.000000 -2

2018-06-03 23:40:09.168845 7f3e920a2d00 20 read_log_and_missing 340045'59503259 (0'0) error    12:ed2d2e35:::rbd_data.5f2b656b8b4567.0000000000029022:head by client.6233217.0:359414 0.000000 0
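
(Debug logging here was presumably raised along these lines, either live via injectargs or persistently in ceph.conf:)

    # live, on a running OSD
    ceph tell osd.53 injectargs '--debug-osd 20'

    # or persistently, in ceph.conf under [osd]
    # debug osd = 20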

I think at this point we lack the expertise to take further steps towards fixing the problem with assert(p != recovery_info.ss.clone_snaps.end()).

Could somebody who has possibly experienced (I have seen a few similar threads) and resolved this problem point us in the right direction?

Thanks in advance for any information on this.

P.S. The OSD log for 53, which we left running in the boot loop, shows progress in backfilling data to other PGs. I am not sure if this is wise at this point.


Related issues (3: 0 open, 3 closed)

    Related to RADOS - Bug #23875: Removal of snapshot with corrupt replica crashes osd (Resolved, David Zafman)
    Copied to Ceph - Backport #24469: luminous: osd crashes in on_local_recover due to stray clone (Resolved, Nathan Cutler)
    Copied to Ceph - Backport #24470: mimic: osd crashes in on_local_recover due to stray clone (Resolved, Nathan Cutler)
