Bug #19931: osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2' - Ceph - Ceph

Actions

Copy link

Bug #19931

closed

osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2'

Added by Casey Bodley almost 7 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

Greg Farnum

Category:

Target version:

% Done:

Source:

Tags:

Backport:

jewel, kraken

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

These have been happening consistently against the rgw:multisite suite, which sets 'wait-for-scrub: false'.

An example run: http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/teuthology.log

http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira095/log/c1-osd.2.log.gz

2017-05-15 18:29:46.037468 16fbb700 10 osd.2 108  not yet active; waiting for peering wq to drain
2017-05-15 18:29:46.154089 31346700 -1 /home/jenkins-build/.../src/osd/OSD.h: In function 'virtual void OSD::PeeringWQ::_clear()' thread 31346700 time 2017-05-15 18:29:46.043928
/home/jenkins-build/.../src/osd/OSD.h: 1954: FAILED assert(peering_queue.empty())

 ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaeeb70]
 2: (()+0x4d8e1c) [0x5e0e1c]
 3: (ThreadPool::stop(bool)+0x2e5) [0xaf2425]
 4: (OSD::shutdown()+0xb74) [0x5b2294]
 5: (OSD::handle_signal(int)+0x11f) [0x5b363f]
 6: (SignalHandler::entry()+0x1d7) [0xab2b87]
 7: (()+0x7dc5) [0xc733dc5]
 8: (clone()+0x6d) [0xd87f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira061/log/c2-osd.0.log.gz

2017-05-15 18:27:59.011467 31473700 20 osd.0 96  kicking pg 10.6
2017-05-15 18:27:59.011557 31473700 30 osd.0 pg_epoch: 96 pg[10.6( empty local-lis/les=93/94 n=0 ec=56 lis/c 93/93 les/c/f 94/96/0 93/93/93) [0] r=0 lpr=93 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
2017-05-15 18:27:59.012779 31473700 -1 osd.0 96 pgid 10.6 has ref count of 2
2017-05-15 18:27:59.118433 31473700 -1 *** Caught signal (Aborted) **
 in thread 31473700 thread_name:signal_handler

 ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
 1: (()+0x9a981f) [0xab181f]
 2: (()+0xf370) [0xc73b370]
 3: (gsignal()+0x37) [0xd7bd1d7]
 4: (abort()+0x148) [0xd7be8c8]
 5: (OSD::shutdown()+0x190f) [0x5b302f]
 6: (OSD::handle_signal(int)+0x11f) [0x5b363f]
 7: (SignalHandler::entry()+0x1d7) [0xab2b87]
 8: (()+0x7dc5) [0xc733dc5]
 9: (clone()+0x6d) [0xd87f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Casey Bodley almost 7 years ago

Related to Bug #17704: osd: leaked pg refs on shutdown added

Actions

Copy link

Updated by Sage Weil almost 7 years ago

Priority changed from Normal to Immediate

2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: ceph version 12.0.1-2422-g0d194cc (0d194cc84cf22302e6f62c1cba71d85c99637937)
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 1: (()+0x9abc8f) [0xab3c8f]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 2: (()+0xf370) [0xc74e370]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 3: (gsignal()+0x37) [0xd7d01d7]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 4: (abort()+0x148) [0xd7d18c8]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 5: (OSD::shutdown()+0x190f) [0x5b39cf]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 6: (OSD::handle_signal(int)+0x11f) [0x5b3fdf]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 7: (SignalHandler::entry()+0x1d7) [0xab4ff7]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 8: (()+0x7dc5) [0xc746dc5]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 9: (clone()+0x6d) [0xd89273d]
2017-05-15T19:58:02.622 INFO:tasks.ceph.osd.2.smithi136.stderr:2017-05-15 19:58:02.560591 32406700 -1 osd.2 703 pgid 210.7 has ref count of 2

on /a/sage-2017-05-15_17:07:41-rados-wip-sage-testing2---basic-smithi/1180651

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

Assignee set to Greg Farnum

Yep, the snap trimming takes PGRefs and doesn't have a good way of dropping them for shutdown. Looking into it...`

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

(Note: definitely not sure the peering_queue assert is the same one. It could be related though.)

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

Status changed from New to Fix Under Review

https://github.com/ceph/ceph/pull/15214

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

Status changed from Fix Under Review to In Progress

Actually, this doesn't seem to be the issue. There's a race present but it's pretty narrow since we were already having the AsyncReserver cancel the callback in Trimming::exit(). Trying to reproduce with the trim sleep...

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

I'm not sure these are actually the same bug. Sage's involves snaptrimming, but the others have no mention of it in the OSD log at all. And I've got another log here with the ref count crash where the PG in question is actually in snaptrim (not snaptrim_wait or WaitTrimTimer).

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

I'm starting to think the leak is in peering now. All of them (except Casey's top one, which I didn't check and is about the peering_queue) involve PGs with a min_last_complete_on_disk of 0'0 — ie, one of the OSDs was being backfilled. Except they're in active+clean, so not sure how to resolve that?

Actions

Copy link

Updated by Greg Farnum almost 7 years ago

Oh, that's just what happens on peering until you commit something to the PG or make progress in backfill/logging. Never mind.

Actions

Copy link

#10

Updated by Greg Farnum almost 7 years ago

So, we identified today that it looks like the op Backoff code may have an issue with holding PGRefs past when we expect them to die, although it's not a certain thing. Anyway, the patch is very small.

More generally though, we shut down several different timers/AsyncReservers/etc in OSDService::shutdown. This happens after the pg refcount check and lots of these are given PGRefs via input Context objects. Many of them are cleaned up but a few of them seem broken to me at first glance (and, in general, it's really hard to validate or when writing new code to realize that they need to get cleaned up separately). These include reserver_finisher, objecter_finisher, recovery_request_timer, and snap_sleep_timer (though this one at least should get cleaned up via its state exit() function). It's not clear to me if we can maybe just skip doing the PGRef counting assert until after we do OSDService shutdown? I don't think we can shutdown these utility objects earlier since they can be invoked by various other things.

I'm also a little confused about the snap_trimmer state machine when it shuts down. I've done testing where PGs which are in the WaitTrimTimer state (with associated Context holding a PGRef in snap_sleep_timer) are successfully exiting, which as far as I can tell must mean the Reset() event getting emitted by PrimaryLogPG::on_change() is triggering WaitTrimTimer::exit(). But it's not printing out the log message about exiting that state? So I don't know what's going on. (This was one of the ways I initially assumed it was broken, especially as we've started seeing the same pgref shutdown bug on a Jewel user with the backports.)

Actions

Copy link

#11

Updated by Greg Farnum almost 7 years ago

Actually there's a PrimaryLogPG::on_change and a PGBackend::on_change, they are distinct, and the PrimaryLogPG one is only invoked on start_peering_interval() in RecoveryState::Reset::react(AdvMap).
So now I'm again just confused that the WaitTrimTimer's OnTimer callback isn't causing the PGRef assert failure.

Actions

Copy link

#12

Updated by Greg Farnum almost 7 years ago

Status changed from In Progress to Fix Under Review

Okay, my local testing was using init-ceph, and the way it repeatedly invokes kill signals means the OSD wasn't getting shut down cleanly anyway, and I was missing the asserts. But they were present.

So the proximate cause is that the snap_trimmer machine wasn't getting Reset() on shutdown; the PR has patches doing that (and closing some other holes) now. Will test against the rgw-multisite and see if that triggers any other issues, but I'm feeling pretty good.

Actions

Copy link

#13