osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2'
These have been happening consistently against the rgw:multisite suite, which sets 'wait-for-scrub: false'.
2017-05-15 18:29:46.037468 16fbb700 10 osd.2 108 not yet active; waiting for peering wq to drain 2017-05-15 18:29:46.154089 31346700 -1 /home/jenkins-build/.../src/osd/OSD.h: In function 'virtual void OSD::PeeringWQ::_clear()' thread 31346700 time 2017-05-15 18:29:46.043928 /home/jenkins-build/.../src/osd/OSD.h: 1954: FAILED assert(peering_queue.empty()) ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaeeb70] 2: (()+0x4d8e1c) [0x5e0e1c] 3: (ThreadPool::stop(bool)+0x2e5) [0xaf2425] 4: (OSD::shutdown()+0xb74) [0x5b2294] 5: (OSD::handle_signal(int)+0x11f) [0x5b363f] 6: (SignalHandler::entry()+0x1d7) [0xab2b87] 7: (()+0x7dc5) [0xc733dc5] 8: (clone()+0x6d) [0xd87f73d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-05-15 18:27:59.011467 31473700 20 osd.0 96 kicking pg 10.6 2017-05-15 18:27:59.011557 31473700 30 osd.0 pg_epoch: 96 pg[10.6( empty local-lis/les=93/94 n=0 ec=56 lis/c 93/93 les/c/f 94/96/0 93/93/93)  r=0 lpr=93 crt=0'0 mlcod 0'0 active+undersized+degraded] lock 2017-05-15 18:27:59.012779 31473700 -1 osd.0 96 pgid 10.6 has ref count of 2 2017-05-15 18:27:59.118433 31473700 -1 *** Caught signal (Aborted) ** in thread 31473700 thread_name:signal_handler ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1) 1: (()+0x9a981f) [0xab181f] 2: (()+0xf370) [0xc73b370] 3: (gsignal()+0x37) [0xd7bd1d7] 4: (abort()+0x148) [0xd7be8c8] 5: (OSD::shutdown()+0x190f) [0x5b302f] 6: (OSD::handle_signal(int)+0x11f) [0x5b363f] 7: (SignalHandler::entry()+0x1d7) [0xab2b87] 8: (()+0x7dc5) [0xc733dc5] 9: (clone()+0x6d) [0xd87f73d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
- Priority changed from Normal to Immediate
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: ceph version 12.0.1-2422-g0d194cc (0d194cc84cf22302e6f62c1cba71d85c99637937) 2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 1: (()+0x9abc8f) [0xab3c8f] 2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 2: (()+0xf370) [0xc74e370] 2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 3: (gsignal()+0x37) [0xd7d01d7] 2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 4: (abort()+0x148) [0xd7d18c8] 2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 5: (OSD::shutdown()+0x190f) [0x5b39cf] 2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 6: (OSD::handle_signal(int)+0x11f) [0x5b3fdf] 2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 7: (SignalHandler::entry()+0x1d7) [0xab4ff7] 2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 8: (()+0x7dc5) [0xc746dc5] 2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 9: (clone()+0x6d) [0xd89273d] 2017-05-15T19:58:02.622 INFO:tasks.ceph.osd.2.smithi136.stderr:2017-05-15 19:58:02.560591 32406700 -1 osd.2 703 pgid 210.7 has ref count of 2
#6 Updated by Greg Farnum 9 months ago
- Status changed from Need Review to In Progress
Actually, this doesn't seem to be the issue. There's a race present but it's pretty narrow since we were already having the AsyncReserver cancel the callback in Trimming::exit(). Trying to reproduce with the trim sleep...
#7 Updated by Greg Farnum 9 months ago
I'm not sure these are actually the same bug. Sage's involves snaptrimming, but the others have no mention of it in the OSD log at all. And I've got another log here with the ref count crash where the PG in question is actually in snaptrim (not snaptrim_wait or WaitTrimTimer).
#8 Updated by Greg Farnum 9 months ago
I'm starting to think the leak is in peering now. All of them (except Casey's top one, which I didn't check and is about the peering_queue) involve PGs with a min_last_complete_on_disk of 0'0 — ie, one of the OSDs was being backfilled. Except they're in active+clean, so not sure how to resolve that?
#10 Updated by Greg Farnum 9 months ago
So, we identified today that it looks like the op Backoff code may have an issue with holding PGRefs past when we expect them to die, although it's not a certain thing. Anyway, the patch is very small.
More generally though, we shut down several different timers/AsyncReservers/etc in OSDService::shutdown. This happens after the pg refcount check and lots of these are given PGRefs via input Context objects. Many of them are cleaned up but a few of them seem broken to me at first glance (and, in general, it's really hard to validate or when writing new code to realize that they need to get cleaned up separately). These include reserver_finisher, objecter_finisher, recovery_request_timer, and snap_sleep_timer (though this one at least should get cleaned up via its state exit() function). It's not clear to me if we can maybe just skip doing the PGRef counting assert until after we do OSDService shutdown? I don't think we can shutdown these utility objects earlier since they can be invoked by various other things.
I'm also a little confused about the snap_trimmer state machine when it shuts down. I've done testing where PGs which are in the WaitTrimTimer state (with associated Context holding a PGRef in snap_sleep_timer) are successfully exiting, which as far as I can tell must mean the Reset() event getting emitted by PrimaryLogPG::on_change() is triggering WaitTrimTimer::exit(). But it's not printing out the log message about exiting that state? So I don't know what's going on. (This was one of the ways I initially assumed it was broken, especially as we've started seeing the same pgref shutdown bug on a Jewel user with the backports.)
#11 Updated by Greg Farnum 9 months ago
Actually there's a PrimaryLogPG::on_change and a PGBackend::on_change, they are distinct, and the PrimaryLogPG one is only invoked on start_peering_interval() in RecoveryState::Reset::react(AdvMap).
So now I'm again just confused that the WaitTrimTimer's OnTimer callback isn't causing the PGRef assert failure.
#12 Updated by Greg Farnum 9 months ago
- Status changed from In Progress to Need Review
Okay, my local testing was using init-ceph, and the way it repeatedly invokes kill signals means the OSD wasn't getting shut down cleanly anyway, and I was missing the asserts. But they were present.
So the proximate cause is that the snap_trimmer machine wasn't getting Reset() on shutdown; the PR has patches doing that (and closing some other holes) now. Will test against the rgw-multisite and see if that triggers any other issues, but I'm feeling pretty good.
#15 Updated by Greg Farnum 9 months ago
Well, it's not exactly great but we didn't see the same issues on multisite:
(The rados issue there is #19900; I didn't look into the multisite failure expect to check it didn't include crashed OSDs.)