Project

General

Profile

Bug #19931

osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2'

Added by Casey Bodley 2 months ago. Updated 21 days ago.

Status:
Pending Backport
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
Start date:
05/15/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

These have been happening consistently against the rgw:multisite suite, which sets 'wait-for-scrub: false'.

An example run: http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/teuthology.log

http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira095/log/c1-osd.2.log.gz

2017-05-15 18:29:46.037468 16fbb700 10 osd.2 108  not yet active; waiting for peering wq to drain
2017-05-15 18:29:46.154089 31346700 -1 /home/jenkins-build/.../src/osd/OSD.h: In function 'virtual void OSD::PeeringWQ::_clear()' thread 31346700 time 2017-05-15 18:29:46.043928
/home/jenkins-build/.../src/osd/OSD.h: 1954: FAILED assert(peering_queue.empty())

 ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaeeb70]
 2: (()+0x4d8e1c) [0x5e0e1c]
 3: (ThreadPool::stop(bool)+0x2e5) [0xaf2425]
 4: (OSD::shutdown()+0xb74) [0x5b2294]
 5: (OSD::handle_signal(int)+0x11f) [0x5b363f]
 6: (SignalHandler::entry()+0x1d7) [0xab2b87]
 7: (()+0x7dc5) [0xc733dc5]
 8: (clone()+0x6d) [0xd87f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira061/log/c2-osd.0.log.gz

2017-05-15 18:27:59.011467 31473700 20 osd.0 96  kicking pg 10.6
2017-05-15 18:27:59.011557 31473700 30 osd.0 pg_epoch: 96 pg[10.6( empty local-lis/les=93/94 n=0 ec=56 lis/c 93/93 les/c/f 94/96/0 93/93/93) [0] r=0 lpr=93 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
2017-05-15 18:27:59.012779 31473700 -1 osd.0 96 pgid 10.6 has ref count of 2
2017-05-15 18:27:59.118433 31473700 -1 *** Caught signal (Aborted) **
 in thread 31473700 thread_name:signal_handler

 ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
 1: (()+0x9a981f) [0xab181f]
 2: (()+0xf370) [0xc73b370]
 3: (gsignal()+0x37) [0xd7bd1d7]
 4: (abort()+0x148) [0xd7be8c8]
 5: (OSD::shutdown()+0x190f) [0x5b302f]
 6: (OSD::handle_signal(int)+0x11f) [0x5b363f]
 7: (SignalHandler::entry()+0x1d7) [0xab2b87]
 8: (()+0x7dc5) [0xc733dc5]
 9: (clone()+0x6d) [0xd87f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Related to Ceph - Bug #17704: osd: leaked pg refs on shutdown Resolved 10/26/2016
Copied to Ceph - Backport #20084: OSDs assert on shutdown when PGs are in snaptrim_wait() state In Progress
Copied to Ceph - Backport #20516: kraken: osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2' New

History

#1 Updated by Casey Bodley 2 months ago

  • Related to Bug #17704: osd: leaked pg refs on shutdown added

#2 Updated by Sage Weil 2 months ago

  • Priority changed from Normal to Immediate
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: ceph version 12.0.1-2422-g0d194cc (0d194cc84cf22302e6f62c1cba71d85c99637937)
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 1: (()+0x9abc8f) [0xab3c8f]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 2: (()+0xf370) [0xc74e370]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 3: (gsignal()+0x37) [0xd7d01d7]
2017-05-15T19:58:02.613 INFO:tasks.ceph.osd.2.smithi136.stderr: 4: (abort()+0x148) [0xd7d18c8]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 5: (OSD::shutdown()+0x190f) [0x5b39cf]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 6: (OSD::handle_signal(int)+0x11f) [0x5b3fdf]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 7: (SignalHandler::entry()+0x1d7) [0xab4ff7]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 8: (()+0x7dc5) [0xc746dc5]
2017-05-15T19:58:02.614 INFO:tasks.ceph.osd.2.smithi136.stderr: 9: (clone()+0x6d) [0xd89273d]
2017-05-15T19:58:02.622 INFO:tasks.ceph.osd.2.smithi136.stderr:2017-05-15 19:58:02.560591 32406700 -1 osd.2 703 pgid 210.7 has ref count of 2

on /a/sage-2017-05-15_17:07:41-rados-wip-sage-testing2---basic-smithi/1180651

#3 Updated by Greg Farnum 2 months ago

  • Assignee set to Greg Farnum

Yep, the snap trimming takes PGRefs and doesn't have a good way of dropping them for shutdown. Looking into it...`

#4 Updated by Greg Farnum 2 months ago

(Note: definitely not sure the peering_queue assert is the same one. It could be related though.)

#5 Updated by Greg Farnum 2 months ago

  • Status changed from New to Need Review

#6 Updated by Greg Farnum 2 months ago

  • Status changed from Need Review to In Progress

Actually, this doesn't seem to be the issue. There's a race present but it's pretty narrow since we were already having the AsyncReserver cancel the callback in Trimming::exit(). Trying to reproduce with the trim sleep...

#7 Updated by Greg Farnum 2 months ago

I'm not sure these are actually the same bug. Sage's involves snaptrimming, but the others have no mention of it in the OSD log at all. And I've got another log here with the ref count crash where the PG in question is actually in snaptrim (not snaptrim_wait or WaitTrimTimer).

#8 Updated by Greg Farnum 2 months ago

I'm starting to think the leak is in peering now. All of them (except Casey's top one, which I didn't check and is about the peering_queue) involve PGs with a min_last_complete_on_disk of 0'0 — ie, one of the OSDs was being backfilled. Except they're in active+clean, so not sure how to resolve that?

#9 Updated by Greg Farnum 2 months ago

Oh, that's just what happens on peering until you commit something to the PG or make progress in backfill/logging. Never mind.

#10 Updated by Greg Farnum 2 months ago

So, we identified today that it looks like the op Backoff code may have an issue with holding PGRefs past when we expect them to die, although it's not a certain thing. Anyway, the patch is very small.

More generally though, we shut down several different timers/AsyncReservers/etc in OSDService::shutdown. This happens after the pg refcount check and lots of these are given PGRefs via input Context objects. Many of them are cleaned up but a few of them seem broken to me at first glance (and, in general, it's really hard to validate or when writing new code to realize that they need to get cleaned up separately). These include reserver_finisher, objecter_finisher, recovery_request_timer, and snap_sleep_timer (though this one at least should get cleaned up via its state exit() function). It's not clear to me if we can maybe just skip doing the PGRef counting assert until after we do OSDService shutdown? I don't think we can shutdown these utility objects earlier since they can be invoked by various other things.

I'm also a little confused about the snap_trimmer state machine when it shuts down. I've done testing where PGs which are in the WaitTrimTimer state (with associated Context holding a PGRef in snap_sleep_timer) are successfully exiting, which as far as I can tell must mean the Reset() event getting emitted by PrimaryLogPG::on_change() is triggering WaitTrimTimer::exit(). But it's not printing out the log message about exiting that state? So I don't know what's going on. (This was one of the ways I initially assumed it was broken, especially as we've started seeing the same pgref shutdown bug on a Jewel user with the backports.)

#11 Updated by Greg Farnum 2 months ago

Actually there's a PrimaryLogPG::on_change and a PGBackend::on_change, they are distinct, and the PrimaryLogPG one is only invoked on start_peering_interval() in RecoveryState::Reset::react(AdvMap).
So now I'm again just confused that the WaitTrimTimer's OnTimer callback isn't causing the PGRef assert failure.

#12 Updated by Greg Farnum 2 months ago

  • Status changed from In Progress to Need Review

Okay, my local testing was using init-ceph, and the way it repeatedly invokes kill signals means the OSD wasn't getting shut down cleanly anyway, and I was missing the asserts. But they were present.

So the proximate cause is that the snap_trimmer machine wasn't getting Reset() on shutdown; the PR has patches doing that (and closing some other holes) now. Will test against the rgw-multisite and see if that triggers any other issues, but I'm feeling pretty good.

#13 Updated by Greg Farnum 2 months ago

  • Copied to Backport #20084: OSDs assert on shutdown when PGs are in snaptrim_wait() state added

#14 Updated by Nathan Cutler 2 months ago

  • Backport set to jewel

#15 Updated by Greg Farnum 2 months ago

Well, it's not exactly great but we didn't see the same issues on multisite:
http://pulpito.ceph.com/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/

(The rados issue there is #19900; I didn't look into the multisite failure expect to check it didn't include crashed OSDs.)

#16 Updated by Greg Farnum 2 months ago

  • Status changed from Need Review to Resolved

#17 Updated by Nathan Cutler about 1 month ago

  • Status changed from Resolved to Pending Backport

#18 Updated by Josh Durgin 21 days ago

  • Backport changed from jewel to jewel, kraken

#19 Updated by Nathan Cutler 21 days ago

  • Copied to Backport #20516: kraken: osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2' added

Also available in: Atom PDF