Project

General

Profile

Bug #24396

osd crashes in on_local_recover due to stray clone

Added by Krasimir Velkov 8 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
Start date:
06/04/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Hello,

Today one of our ceph clusters

[

version (luminous 12.2.5 stable - updated from 12.2.4 a month ago).

pool type - replicated - replicas - 2

osds - 60

storage - ssd (direct storage, not a caching tier)

]

has gone into HEALTH_WARN state and a couple of OSDs subsequently crashed and fell into a boot loop.

The initial problem was

2018-06-03 12:38:12.834210 7f94b17b1700 -1 log_channel(cluster) log [ERR] : scrub 12.ade 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b is an unexpected clone

So we tried pg deep scrub and pg repair for that pg and after the latter the OSD started flapping with assert(p != recovery_info.ss.clone_snaps.end())

2018-06-03 23:56:32.719840 7f7cb759a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction)' thread 7f7cb759a700 time 2018-06-03 23:56:32.715249
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

We took it down and outed it from the cluster and then a second OSD started showing the exact same symptoms. Since then I have read extensively on using ceph-objectstore-tool to remove the clone metadata for rbd_data.36da116b8b4567 but failed to achieve it

(for example even :

ceph-objectstore-tool --cluster XXXX --pgid 12.ade --op info --data-path /dev/disk/by-partuuid/cf8a45fe-0105-4b5c-81d9-cfb21e1a638d --journal-path /dev/disk/by-uuid/bcaebbc5-fbf3-4638-bb09-07cc554b616a

gives me a syntax error)

We now have 4 pgs down at the moment because of the second OSD going down on its own:

    pg 12.2e is down, acting [19]
    pg 12.26d is down, acting [20]
    pg 12.d2b is down, acting [11]
    pg 12.ecb is down, acting [20]

(OSDs stuck in boot loop are 45, 53)

I enabled osd debug 20 and can see a lot of the following errors for different rbd images sharing the corrupted pgs:

2018-06-03 23:40:10.014258 7f3e920a2d00 20 read_log_and_missing 339927'44627733 (0'0) error    12:77c225f0:::rbd_data.231c236b8b4567.0000000000011837:head by client.6233256.0:46661 0.000000 -2

2018-06-03 23:40:09.168845 7f3e920a2d00 20 read_log_and_missing 340045'59503259 (0'0) error    12:ed2d2e35:::rbd_data.5f2b656b8b4567.0000000000029022:head by client.6233217.0:359414 0.000000 0

I think at this point we are lacking the expertise to make further steps towards fixing the problem with assert(p != recovery_info.ss.clone_snaps.end())

Can somebody that possibly experienced (I have seen a few similar threads) and resolved the problem lead us in the right direction.

Thanks in advance for any information on this.

P.S. The OSD log for 53, which we left running in the boot loop show progress in backfilling data to other pgs. I am not sure if this is wise at this point.


Related issues

Related to RADOS - Bug #23875: Removal of snapshot with corrupt replica crashes osd In Progress 04/25/2018
Copied to Ceph - Backport #24469: luminous: osd crashes in on_local_recover due to stray clone Resolved
Copied to Ceph - Backport #24470: mimic: osd crashes in on_local_recover due to stray clone Resolved

History

#1 Updated by Krasimir Velkov 8 months ago

forgot to add - storage backend - bluestore

#2 Updated by Sage Weil 8 months ago

2018-06-03 19:54:31.862509 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : 12.ade shard 45 missing 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b
2018-06-03 19:54:31.862684 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : repair 12.ade 12:7b59badd:::rbd_data.36da116b8b4567.000000000003bd5f:1275b is an unexpected clone
2018-06-03 19:54:40.757966 7f94b17b1700 -1 osd.45 pg_epoch: 338908 pg[12.ade( v 338908'102929773 (338905'102928227,338908'102929773] local-lis/les=338296/338297 n=5834 ec=5621/5621 lis/c 338296/338296 les/c/f 338297/338304/277941 338296/338296/338296) [45,12] r=0 lpr=338296 crt=338908'102929773 lcod 338908'102929772 mlcod 338908'102929772 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 12:7b5c9ad6:::rbd_data.36c6736b8b4567.0000000000027506:fdf7 (have MIN)
2018-06-03 19:54:44.902110 7f94b17b1700 -1 osd.45 pg_epoch: 338908 pg[12.ade( v 338908'102929804 (338905'102928227,338908'102929804] local-lis/les=338296/338297 n=5834 ec=5621/5621 lis/c 338296/338296 les/c/f 338297/338304/277941 338296/338296/338296) [45,12] r=0 lpr=338296 crt=338908'102929804 lcod 338908'102929803 mlcod 338908'102929803 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 12:7b5dd00c:::rbd_data.36c6736b8b4567.00000000000158fb:ff6c (have MIN)
2018-06-03 19:54:48.152352 7f94ad7a9700 -1 osd.45 pg_epoch: 338908 pg[12.ade( v 338908'102929814 (338905'102928227,338908'102929814] local-lis/les=338296/338297 n=5834 ec=5621/5621 lis/c 338296/338296 les/c/f 338297/338304/277941 338296/338296/338296) [45,12] r=0 lpr=338296 crt=338908'102929814 lcod 338908'102929813 mlcod 338908'102929813 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 12:7b5ec943:::rbd_data.36c6736b8b4567.000000000001a0aa:fdf7 (have MIN)
2018-06-03 19:54:51.893305 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : 12.ade repair 1 missing, 0 inconsistent objects
2018-06-03 19:54:51.893326 7f94ad7a9700 -1 log_channel(cluster) log [ERR] : 12.ade repair 2 errors, 1 fixed
2018-06-03 19:54:51.899931 7f94ad7a9700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f94ad7a9700 time 2018-06-03 19:54:51.895565
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x560ca732a780]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x12b1) [0x560ca6efc4f1]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x74b) [0x560ca707937b]
 4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x560ca707c526]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x224) [0x560ca707d274]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x560ca6fa32a0]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x560ca6f0e61c]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x560ca6d92069]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x560ca7014ee7]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x560ca6dc0f7e]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x560ca7330299]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560ca7332230]
 13: (()+0x7e25) [0x7f94cc28fe25]
 14: (clone()+0x6d) [0x7f94cb38334d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

For the original crash, it looks like the stray clone was also missing on one of the shards, so the recovery/repair logic got confused and tried to recover it.

Not sure how that led to the subsequent crashes...

#3 Updated by Krasimir Velkov 8 months ago

Hello,

Thanks to the patch you provided we have been able to compile a temporary ceph binary and recover the cluster and then started the faulty osds with the default ceph binary.

I will deep scrub the faulty pg to ensure the stray clone is not present.

Thanks a lot for your help.

#4 Updated by Sage Weil 8 months ago

for the record here is the hacky workaround:

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 8637c63d40..9f0d212a61 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -355,14 +355,16 @@ void PrimaryLogPG::on_local_recover(
            recovery_info.oi.legacy_snaps.end());
     } else {
       auto p = recovery_info.ss.clone_snaps.find(hoid.snap);
-      assert(p != recovery_info.ss.clone_snaps.end());  // hmm, should we warn?
-      snaps.insert(p->second.begin(), p->second.end());
+      if (p != recovery_info.ss.clone_snaps.end()) {  // hmm, should we warn?
+    snaps.insert(p->second.begin(), p->second.end());
+      }
     }
     dout(20) << " snaps " << snaps << dendl;
-    snap_mapper.add_oid(
-      recovery_info.soid,
-      snaps,
-      &_t);
+    if (!snaps.empty())
+      snap_mapper.add_oid(
+    recovery_info.soid,
+    snaps,
+    &_t);
   }
   if (!is_delete && pg_log.get_missing().is_missing(recovery_info.soid) &&
       pg_log.get_missing().get_items().find(recovery_info.soid)->second.need > recovery_info.version) {

#5 Updated by Greg Farnum 8 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#6 Updated by Sage Weil 8 months ago

  • Tracker changed from Support to Bug
  • Project changed from RADOS to Ceph
  • Subject changed from A couple of OSDs in boot loop to osd crashes in on_local_recover due to stray clone
  • Category set to OSD
  • Status changed from New to In Progress
  • Tags deleted (osd, ceph, boot, loop, assert, clone, snaps)
  • Backport set to mimic,luminous
  • Regression set to No
  • Severity set to 3 - minor

#7 Updated by David Zafman 8 months ago

  • Related to Bug #23875: Removal of snapshot with corrupt replica crashes osd added

#8 Updated by Sage Weil 8 months ago

  • Status changed from In Progress to Need Review

#9 Updated by David Zafman 8 months ago

This is related to 23875 because in that case a stray clone (due to missing SnapSet), crashes replica during snapshot removal.

There is always a tension between crashing to prevent further corruption and tolerating inconsistencies in the hope that a repair can be performed.

#10 Updated by Sage Weil 8 months ago

  • Status changed from Need Review to Pending Backport
  • Priority changed from Normal to High

#12 Updated by Kefu Chai 7 months ago

  • Status changed from Verified to Pending Backport

#13 Updated by Nick Fisk 7 months ago

I seem to be hitting this bug. Whats the best way for a quick fix to get the PG back to full health?

1. Cherry pick this fix and build a new luminous binary
2. Remove the stray snapshot file from the OSD. Either by filesystem deletion or objectstore tool?

#14 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #24469: luminous: osd crashes in on_local_recover due to stray clone added

#15 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #24470: mimic: osd crashes in on_local_recover due to stray clone added

#16 Updated by Nathan Cutler 5 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF