Bug #19931
closed
osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2'
Added by Casey Bodley almost 7 years ago.
Updated over 6 years ago.
Description
These have been happening consistently against the rgw:multisite suite, which sets 'wait-for-scrub: false'.
An example run: http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/teuthology.log
http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira095/log/c1-osd.2.log.gz
2017-05-15 18:29:46.037468 16fbb700 10 osd.2 108 not yet active; waiting for peering wq to drain
2017-05-15 18:29:46.154089 31346700 -1 /home/jenkins-build/.../src/osd/OSD.h: In function 'virtual void OSD::PeeringWQ::_clear()' thread 31346700 time 2017-05-15 18:29:46.043928
/home/jenkins-build/.../src/osd/OSD.h: 1954: FAILED assert(peering_queue.empty())
ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaeeb70]
2: (()+0x4d8e1c) [0x5e0e1c]
3: (ThreadPool::stop(bool)+0x2e5) [0xaf2425]
4: (OSD::shutdown()+0xb74) [0x5b2294]
5: (OSD::handle_signal(int)+0x11f) [0x5b363f]
6: (SignalHandler::entry()+0x1d7) [0xab2b87]
7: (()+0x7dc5) [0xc733dc5]
8: (clone()+0x6d) [0xd87f73d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira061/log/c2-osd.0.log.gz
2017-05-15 18:27:59.011467 31473700 20 osd.0 96 kicking pg 10.6
2017-05-15 18:27:59.011557 31473700 30 osd.0 pg_epoch: 96 pg[10.6( empty local-lis/les=93/94 n=0 ec=56 lis/c 93/93 les/c/f 94/96/0 93/93/93) [0] r=0 lpr=93 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
2017-05-15 18:27:59.012779 31473700 -1 osd.0 96 pgid 10.6 has ref count of 2
2017-05-15 18:27:59.118433 31473700 -1 *** Caught signal (Aborted) **
in thread 31473700 thread_name:signal_handler
ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
1: (()+0x9a981f) [0xab181f]
2: (()+0xf370) [0xc73b370]
3: (gsignal()+0x37) [0xd7bd1d7]
4: (abort()+0x148) [0xd7be8c8]
5: (OSD::shutdown()+0x190f) [0x5b302f]
6: (OSD::handle_signal(int)+0x11f) [0x5b363f]
7: (SignalHandler::entry()+0x1d7) [0xab2b87]
8: (()+0x7dc5) [0xc733dc5]
9: (clone()+0x6d) [0xd87f73d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
- Related to Bug #17704: osd: leaked pg refs on shutdown added
- Priority changed from Normal to Immediate
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: ceph version 12.0.1-2422-g0d194cc (0d194cc84cf22302e6f62c1cba71d85c99637937)
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 1: (()+0x9abc8f) [0xab3c8f]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 2: (()+0xf370) [0xc74e370]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 3: (gsignal()+0x37) [0xd7d01d7]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 4: (abort()+0x148) [0xd7d18c8]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 5: (OSD::shutdown()+0x190f) [0x5b39cf]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 6: (OSD::handle_signal(int)+0x11f) [0x5b3fdf]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 7: (SignalHandler::entry()+0x1d7) [0xab4ff7]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 8: (()+0x7dc5) [0xc746dc5]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 9: (clone()+0x6d) [0xd89273d]
2017-05-15T19:58:02.622 INFO:tasks.ceph.osd.2.smithi136.stderr:2017-05-15 19:58:02.560591 32406700 -1 osd.2 703 pgid 210.7 has ref count of 2
on /a/sage-2017-05-15_17:07:41-rados-wip-sage-testing2---basic-smithi/1180651
- Assignee set to Greg Farnum
Yep, the snap trimming takes PGRefs and doesn't have a good way of dropping them for shutdown. Looking into it...`
(Note: definitely not sure the peering_queue assert is the same one. It could be related though.)
- Status changed from New to Fix Under Review
- Status changed from Fix Under Review to In Progress
Actually, this doesn't seem to be the issue. There's a race present but it's pretty narrow since we were already having the AsyncReserver cancel the callback in Trimming::exit(). Trying to reproduce with the trim sleep...
I'm not sure these are actually the same bug. Sage's involves snaptrimming, but the others have no mention of it in the OSD log at all. And I've got another log here with the ref count crash where the PG in question is actually in snaptrim (not snaptrim_wait or WaitTrimTimer).
I'm starting to think the leak is in peering now. All of them (except Casey's top one, which I didn't check and is about the peering_queue) involve PGs with a min_last_complete_on_disk of 0'0 — ie, one of the OSDs was being backfilled. Except they're in active+clean, so not sure how to resolve that?
Oh, that's just what happens on peering until you commit something to the PG or make progress in backfill/logging. Never mind.
So, we identified today that it looks like the op Backoff code may have an issue with holding PGRefs past when we expect them to die, although it's not a certain thing. Anyway, the patch is very small.
More generally though, we shut down several different timers/AsyncReservers/etc in OSDService::shutdown. This happens after the pg refcount check and lots of these are given PGRefs via input Context objects. Many of them are cleaned up but a few of them seem broken to me at first glance (and, in general, it's really hard to validate or when writing new code to realize that they need to get cleaned up separately). These include reserver_finisher, objecter_finisher, recovery_request_timer, and snap_sleep_timer (though this one at least should get cleaned up via its state exit() function). It's not clear to me if we can maybe just skip doing the PGRef counting assert until after we do OSDService shutdown? I don't think we can shutdown these utility objects earlier since they can be invoked by various other things.
I'm also a little confused about the snap_trimmer state machine when it shuts down. I've done testing where PGs which are in the WaitTrimTimer state (with associated Context holding a PGRef in snap_sleep_timer) are successfully exiting, which as far as I can tell must mean the Reset() event getting emitted by PrimaryLogPG::on_change() is triggering WaitTrimTimer::exit(). But it's not printing out the log message about exiting that state? So I don't know what's going on. (This was one of the ways I initially assumed it was broken, especially as we've started seeing the same pgref shutdown bug on a Jewel user with the backports.)
Actually there's a PrimaryLogPG::on_change and a PGBackend::on_change, they are distinct, and the PrimaryLogPG one is only invoked on start_peering_interval() in RecoveryState::Reset::react(AdvMap).
So now I'm again just confused that the WaitTrimTimer's OnTimer callback isn't causing the PGRef assert failure.
- Status changed from In Progress to Fix Under Review
Okay, my local testing was using init-ceph, and the way it repeatedly invokes kill signals means the OSD wasn't getting shut down cleanly anyway, and I was missing the asserts. But they were present.
So the proximate cause is that the snap_trimmer machine wasn't getting Reset() on shutdown; the PR has patches doing that (and closing some other holes) now. Will test against the rgw-multisite and see if that triggers any other issues, but I'm feeling pretty good.
- Copied to Backport #20084: OSDs assert on shutdown when PGs are in snaptrim_wait() state added
- Status changed from Fix Under Review to Resolved
- Status changed from Resolved to Pending Backport
- Backport changed from jewel to jewel, kraken
- Copied to Backport #20516: kraken: osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2' added
- Status changed from Pending Backport to Resolved
Also available in: Atom
PDF