Bug #24396: osd crashes in on_local_recover due to stray clone - Ceph - Ceph

Actions

Copy link

Bug #24396

closed

osd crashes in on_local_recover due to stray clone

Added by Krasimir Velkov almost 6 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

Category:

OSD

Target version:

% Done:

Source:

Tags:

Backport:

mimic,luminous

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

v12.2.5

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Hello,

Today one of our ceph clusters

[

version (luminous 12.2.5 stable - updated from 12.2.4 a month ago).

pool type - replicated - replicas - 2

osds - 60

storage - ssd (direct storage, not a caching tier)

]

has gone into HEALTH_WARN state and a couple of OSDs subsequently crashed and fell into a boot loop.

The initial problem was

2018-06-03 12:38:12.834210 7f94b17b1700 -1 log_channel(cluster) log [ERR] : scrub 12.ade 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b is an unexpected clone

So we tried pg deep scrub and pg repair for that pg and after the latter the OSD started flapping with assert(p != recovery_info.ss.clone_snaps.end())

2018-06-03 23:56:32.719840 7f7cb759a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction)' thread 7f7cb759a700 time 2018-06-03 23:56:32.715249
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

We took it down and outed it from the cluster and then a second OSD started showing the exact same symptoms. Since then I have read extensively on using ceph-objectstore-tool to remove the clone metadata for rbd_data.36da116b8b4567 but failed to achieve it

(for example even :

ceph-objectstore-tool --cluster XXXX --pgid 12.ade --op info --data-path /dev/disk/by-partuuid/cf8a45fe-0105-4b5c-81d9-cfb21e1a638d --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a

gives me a syntax error)

We now have 4 pgs down at the moment because of the second OSD going down on its own:

    pg 12.2e is down, acting [19]
    pg 12.26d is down, acting [20]
    pg 12.d2b is down, acting [11]
    pg 12.ecb is down, acting [20]

(OSDs stuck in boot loop are 45, 53)

I enabled osd debug 20 and can see a lot of the following errors for different rbd images sharing the corrupted pgs:

2018-06-03 23:40:10.014258 7f3e920a2d00 20 read_log_and_missing 339927'44627733 (0'0) error    12:77c225f0:::rbd_data.231c236b8b4567.0000000000011837:head by client.6233256.0:46661 0.000000 -2

2018-06-03 23:40:09.168845 7f3e920a2d00 20 read_log_and_missing 340045'59503259 (0'0) error    12:ed2d2e35:::rbd_data.5f2b656b8b4567.0000000000029022:head by client.6233217.0:359414 0.000000 0

I think at this point we are lacking the expertise to make further steps towards fixing the problem with assert(p != recovery_info.ss.clone_snaps.end())

Can somebody that possibly experienced (I have seen a few similar threads) and resolved the problem lead us in the right direction.

Thanks in advance for any information on this.

P.S. The OSD log for 53, which we left running in the boot loop show progress in backfilling data to other pgs. I am not sure if this is wise at this point.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Krasimir Velkov almost 6 years ago

forgot to add - storage backend - bluestore

Actions

Copy link

Updated by Sage Weil almost 6 years ago

2018-06-03 19:54:31.862509 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : 12.ade shard 45 missing 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b
2018-06-03 19:54:31.862684 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : repair 12.ade 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b is an unexpected clone
2018-06-03 19:54:40.757966 7f94b17b1700 -1 osd.45 pg_epoch: 338908 pg[12.ade( v 338908'102929773 (338905'102928227,338908'102929773] local-lis/les=338296/338297 n=5834 ec=5621/5621 lis/c 338296/338296 les/c/f 338297/338304/277941 338296/338296/338296) [45,12] r=0 lpr=338296 crt=338908'102929773 lcod 338908'102929772 mlcod 338908'102929772 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 12:7b5c9ad6:::rbd_data.36c6736b8b4567.0000000000027506:fdf7 (have MIN)
2018-06-03 19:54:44.902110 7f94b17b1700 -1 osd.45 pg_epoch: 338908 pg[12.ade( v 338908'102929804 (338905'102928227,338908'102929804] local-lis/les=338296/338297 n=5834 ec=5621/5621 lis/c 338296/338296 les/c/f 338297/338304/277941 338296/338296/338296) [45,12] r=0 lpr=338296 crt=338908'102929804 lcod 338908'102929803 mlcod 338908'102929803 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 12:7b5dd00c:::rbd_data.36c6736b8b4567.00000000000158fb:ff6c (have MIN)
2018-06-03 19:54:48.152352 7f94ad7a9700 -1 osd.45 pg_epoch: 338908 pg[12.ade( v 338908'102929814 (338905'102928227,338908'102929814] local-lis/les=338296/338297 n=5834 ec=5621/5621 lis/c 338296/338296 les/c/f 338297/338304/277941 338296/338296/338296) [45,12] r=0 lpr=338296 crt=338908'102929814 lcod 338908'102929813 mlcod 338908'102929813 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 12:7b5ec943:::rbd_data.36c6736b8b4567.000000000001a0aa:fdf7 (have MIN)
2018-06-03 19:54:51.893305 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : 12.ade repair 1 missing, 0 inconsistent objects
2018-06-03 19:54:51.893326 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : 12.ade repair 2 errors, 1 fixed
2018-06-03 19:54:51.899931 7f94ad7a9700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f94ad7a9700 time 2018-06-03 19:54:51.895565
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x560ca732a780]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x12b1) [0x560ca6efc4f1]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x74b) [0x560ca707937b]
 4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x560ca707c526]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x224) [0x560ca707d274]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x560ca6fa32a0]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x560ca6f0e61c]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x560ca6d92069]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x560ca7014ee7]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x560ca6dc0f7e]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x560ca7330299]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560ca7332230]
 13: (()+0x7e25) [0x7f94cc28fe25]
 14: (clone()+0x6d) [0x7f94cb38334d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

For the original crash, it looks like the stray clone was also missing on one of the shards, so the recovery/repair logic got confused and tried to recover it.

Not sure how that led to the subsequent crashes...

Actions

Copy link

Updated by Krasimir Velkov almost 6 years ago

Hello,

Thanks to the patch you provided we have been able to compile a temporary ceph binary and recover the cluster and then started the faulty osds with the default ceph binary.

I will deep scrub the faulty pg to ensure the stray clone is not present.

Thanks a lot for your help.

Actions

Copy link

Updated by Sage Weil almost 6 years ago

for the record here is the hacky workaround:

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 8637c63d40..9f0d212a61 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -355,14 +355,16 @@ void PrimaryLogPG::on_local_recover(
            recovery_info.oi.legacy_snaps.end());
     } else {
       auto p = recovery_info.ss.clone_snaps.find(hoid.snap);
-      assert(p != recovery_info.ss.clone_snaps.end());  // hmm, should we warn?
-      snaps.insert(p->second.begin(), p->second.end());
+      if (p != recovery_info.ss.clone_snaps.end()) {  // hmm, should we warn?
+    snaps.insert(p->second.begin(), p->second.end());
+      }
     }
     dout(20) << " snaps " << snaps << dendl;
-    snap_mapper.add_oid(
-      recovery_info.soid,
-      snaps,
-      &_t);
+    if (!snaps.empty())
+      snap_mapper.add_oid(
+    recovery_info.soid,
+    snaps,
+    &_t);
   }
   if (!is_delete && pg_log.get_missing().is_missing(recovery_info.soid) &&
       pg_log.get_missing().get_items().find(recovery_info.soid)->second.need > recovery_info.version) {

Actions

Copy link

Updated by Greg Farnum almost 6 years ago

Project changed from Ceph to RADOS
Category deleted (~~OSD~~)

Actions

Copy link

Updated by Sage Weil almost 6 years ago

Tracker changed from Support to Bug
Project changed from RADOS to Ceph
Subject changed from A couple of OSDs in boot loop to osd crashes in on_local_recover due to stray clone
Category set to OSD
Status changed from New to In Progress
Tags deleted (~~osd, ceph, boot, loop, assert, clone, snaps~~)
Backport set to mimic,luminous
Regression set to No
Severity set to 3 - minor

Actions

Copy link

Updated by David Zafman almost 6 years ago

Related to Bug #23875: Removal of snapshot with corrupt replica crashes osd added

Actions

Copy link

Updated by Sage Weil almost 6 years ago

Status changed from In Progress to Fix Under Review

https://github.com/ceph/ceph/pull/22396
https://github.com/ceph/ceph/pull/22428

Actions

Copy link

Updated by David Zafman almost 6 years ago

This is related to 23875 because in that case a stray clone (due to missing SnapSet), crashes replica during snapshot removal.

There is always a tension between crashing to prevent further corruption and tolerating inconsistencies in the hope that a repair can be performed.

Actions

Copy link

#10

Updated by Sage Weil almost 6 years ago

Status changed from Fix Under Review to Pending Backport
Priority changed from Normal to High

Actions

Copy link

#11

Updated by Kefu Chai almost 6 years ago

Status changed from Pending Backport to 12

http://pulpito.ceph.com/kchai-2018-06-05_05:07:24-rados-wip-kefu-testing-2018-06-05-0923-distro-basic-smithi/2629444/

Actions

Copy link

#12

Updated by Kefu Chai almost 6 years ago

Status changed from 12 to Pending Backport

Actions

Copy link

#13

Updated by Nick Fisk almost 6 years ago

I seem to be hitting this bug. Whats the best way for a quick fix to get the PG back to full health?

1. Cherry pick this fix and build a new luminous binary
2. Remove the stray snapshot file from the OSD. Either by filesystem deletion or objectstore tool?

Actions

Copy link

#14