Bug #19931

closed

osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2'

Added by Casey Bodley almost 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: jewel, kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

These aborts have been happening consistently in runs of the rgw:multisite suite, which sets 'wait-for-scrub: false'.

An example run: http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/teuthology.log

http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira095/log/c1-osd.2.log.gz

2017-05-15 18:29:46.037468 16fbb700 10 osd.2 108  not yet active; waiting for peering wq to drain
2017-05-15 18:29:46.154089 31346700 -1 /home/jenkins-build/.../src/osd/OSD.h: In function 'virtual void OSD::PeeringWQ::_clear()' thread 31346700 time 2017-05-15 18:29:46.043928
/home/jenkins-build/.../src/osd/OSD.h: 1954: FAILED assert(peering_queue.empty())

 ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaeeb70]
 2: (()+0x4d8e1c) [0x5e0e1c]
 3: (ThreadPool::stop(bool)+0x2e5) [0xaf2425]
 4: (OSD::shutdown()+0xb74) [0x5b2294]
 5: (OSD::handle_signal(int)+0x11f) [0x5b363f]
 6: (SignalHandler::entry()+0x1d7) [0xab2b87]
 7: (()+0x7dc5) [0xc733dc5]
 8: (clone()+0x6d) [0xd87f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
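
The backtrace above shows ThreadPool::stop() reaching OSD::PeeringWQ::_clear() while peering_queue still held entries, shortly after the OSD logged that it was waiting for the peering wq to drain. A minimal sketch of that failure shape follows. It is a toy model, not Ceph code: ToyWorkQueue, drain(), clear() and stop_succeeds() are hypothetical names, and clear() throws instead of asserting so the failure can be observed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <stdexcept>

// Toy stand-in for a work queue. All names are hypothetical, not Ceph's.
class ToyWorkQueue {
 public:
  void enqueue(int item) { queue_.push_back(item); }

  // Process (here: simply discard) every queued item; returns how many.
  std::size_t drain() {
    std::size_t n = queue_.size();
    queue_.clear();
    return n;
  }

  // Same shape as the failing check: clearing is only legal on an empty
  // queue. Throws instead of asserting so callers can observe the failure.
  void clear() {
    if (!queue_.empty()) {
      throw std::logic_error("queue not empty at clear");
    }
  }

  bool empty() const { return queue_.empty(); }

 private:
  std::deque<int> queue_;
};

// Returns true if stopping (which clears the queue) would succeed.
bool stop_succeeds(ToyWorkQueue& wq, bool drain_first) {
  if (drain_first) {
    wq.drain();
  }
  try {
    wq.clear();
    return true;
  } catch (const std::logic_error&) {
    return false;
  }
}
```

In this model, stopping without draining first fails whenever work is still queued, which is the condition the assert in the log reports.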

http://qa-proxy.ceph.com/teuthology/cbodley-2017-05-15_13:57:03-rgw:multisite-master---basic-mira/1180877/remote/mira061/log/c2-osd.0.log.gz

2017-05-15 18:27:59.011467 31473700 20 osd.0 96  kicking pg 10.6
2017-05-15 18:27:59.011557 31473700 30 osd.0 pg_epoch: 96 pg[10.6( empty local-lis/les=93/94 n=0 ec=56 lis/c 93/93 les/c/f 94/96/0 93/93/93) [0] r=0 lpr=93 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
2017-05-15 18:27:59.012779 31473700 -1 osd.0 96 pgid 10.6 has ref count of 2
2017-05-15 18:27:59.118433 31473700 -1 *** Caught signal (Aborted) **
 in thread 31473700 thread_name:signal_handler

 ceph version 12.0.1-2381-g6db74cf (6db74cf102db927fabea554aec15bfcc2199b3c1)
 1: (()+0x9a981f) [0xab181f]
 2: (()+0xf370) [0xc73b370]
 3: (gsignal()+0x37) [0xd7bd1d7]
 4: (abort()+0x148) [0xd7be8c8]
 5: (OSD::shutdown()+0x190f) [0x5b302f]
 6: (OSD::handle_signal(int)+0x11f) [0x5b363f]
 7: (SignalHandler::entry()+0x1d7) [0xab2b87]
 8: (()+0x7dc5) [0xc733dc5]
 9: (clone()+0x6d) [0xd87f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
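
The message 'pgid 10.6 has ref count of 2' followed by abort() inside OSD::shutdown() reads like a shutdown-time check that each PG is referenced only by its owner. The sketch below illustrates such a check under stated assumptions: ToyPG and leaked_at_shutdown() are hypothetical, and std::shared_ptr::use_count() stands in for whatever reference counter the OSD actually uses.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a placement group; not Ceph's PG class.
struct ToyPG {
  std::string pgid;
};

using ToyRegistry = std::map<std::string, std::shared_ptr<ToyPG>>;

// Returns the ids of PGs that something other than the registry still
// references. A count of 1 means only the registry holds the PG.
std::vector<std::string> leaked_at_shutdown(const ToyRegistry& registry) {
  std::vector<std::string> leaked;
  for (const auto& entry : registry) {
    if (entry.second.use_count() != 1) {
      leaked.push_back(entry.first);
    }
  }
  return leaked;
}
```

If any other holder keeps a reference to a PG at shutdown, its count is 2 rather than 1 and the check flags it, matching the count in the log line.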


Related issues 3 (0 open, 3 closed)

Related to Ceph - Bug #17704: osd: leaked pg refs on shutdown (Resolved, 10/26/2016)
Copied to Ceph - Backport #20084: OSDs assert on shutdown when PGs are in snaptrim_wait() state (Resolved, Greg Farnum)
Copied to Ceph - Backport #20516: kraken: osds abort on shutdown with assert(peering_queue.empty()) or 'pgid X has ref count of 2' (Rejected)