Bug #20000

closed

osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())

Added by xw zhang almost 7 years ago. Updated over 4 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

version:
root@node0:~# ceph -v
ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)

bluestore + ec + overwrite, ec k/m = 4/2

2017-05-19 17:14:08.553617 7f290b016c80 -1 /build/ceph-12.0.2/src/common/shared_cache.hpp: In function 'SharedLRU<K, V, C, H>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less<unsigned int>; H = std::hash<unsigned int>]' thread 7f290b016c80 time 2017-05-19 17:14:08.533752
/build/ceph-12.0.2/src/common/shared_cache.hpp: 107: FAILED assert(weak_refs.empty())

ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55aa6593a072]
2: (()+0x79d7f3) [0x55aa653c87f3]
3: (OSDService::~OSDService()+0x158) [0x55aa65350138]
4: (OSD::~OSD()+0x125) [0x55aa653a5f15]
5: (OSD::~OSD()+0x9) [0x55aa653a66b9]
6: (main()+0x30ff) [0x55aa6532787f]
7: (__libc_start_main()+0xf0) [0x7f2908481830]
8: (_start()+0x29) [0x55aa65333cc9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this
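
For readers unfamiliar with the assert: the trace shows the `SharedLRU<unsigned int, const OSDMap>` destructor running from `~OSDService()` while `weak_refs` is not empty, meaning someone still holds a reference to a cached OSDMap at shutdown. The following is a minimal sketch of that pattern, not Ceph's code. `FakeMap`, `MiniSharedCache`, `add()`, and `leaked()` are hypothetical names, and the sketch reports leftover entries instead of asserting in its destructor.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the cached value (the real cache holds const OSDMap).
struct FakeMap {
  unsigned epoch;
};

// Simplified, hypothetical cache in the spirit of SharedLRU; not Ceph's code.
class MiniSharedCache {
  // Bookkeeping lives in a shared block so that a deleter running after the
  // cache object is gone still has a valid map to update.
  struct State {
    std::mutex lock;
    std::map<unsigned, std::weak_ptr<const FakeMap>> weak_refs;
  };
  std::shared_ptr<State> state_ = std::make_shared<State>();

public:
  // Create a value, hand out a shared_ptr to it, and track it by weak_ptr.
  std::shared_ptr<const FakeMap> add(unsigned epoch) {
    std::shared_ptr<State> st = state_;
    std::shared_ptr<const FakeMap> ref(
        new FakeMap{epoch}, [st](const FakeMap* p) {
          {
            // The last holder went away: drop the tracking entry.
            std::lock_guard<std::mutex> g(st->lock);
            st->weak_refs.erase(p->epoch);
          }
          delete p;
        });
    std::lock_guard<std::mutex> g(st->lock);
    assert(st->weak_refs.count(epoch) == 0);  // sketch: one entry per epoch
    st->weak_refs[epoch] = ref;
    return ref;
  }

  // Number of values that outside holders still reference.
  std::size_t leaked() const {
    std::lock_guard<std::mutex> g(state_->lock);
    return state_->weak_refs.size();
  }

  // One line per live entry: "<epoch> with <n> refs".
  std::string dump_weak_refs() const {
    std::lock_guard<std::mutex> g(state_->lock);
    std::string out;
    for (const auto& kv : state_->weak_refs) {
      out += std::to_string(kv.first) + " with " +
             std::to_string(kv.second.use_count()) + " refs\n";
    }
    return out;
  }
};
```

In this sketch, destroying the cache is only "clean" when `leaked()` returns 0. The real destructor enforces the equivalent condition with `assert(weak_refs.empty())`, which is why a single component that keeps an OSDMap reference past shutdown is enough to trigger the failure.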

Related issues 3 (0 open, 3 closed)

Related to RADOS - Bug #20273: osd/OSD.h: 1957: FAILED assert(peering_queue.empty()) (Resolved, Sage Weil, 06/12/2017)
Related to RADOS - Bug #20432: pgid 0.7 has ref count of 2 (Resolved, Kefu Chai, 06/27/2017)
Related to RADOS - Bug #21823: on_flushed: object ... obc still alive (ec + cache tiering) (Can't reproduce, 10/17/2017)

Actions #1

Updated by Sage Weil almost 7 years ago

  • Status changed from New to 12
  • Priority changed from Normal to Urgent
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr:dump_weak_refs 0xf6b5480 weak_refs: 521 = 0xf590950 with 5 refs
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr:
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr:     0> 2017-06-06 23:24:29.556669 9635080 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.2-2450-g0804911/rpm/el7/BUILD/ceph-12.0.2-2450-g0804911/src/common/shared_cache.hpp: In function 'SharedLRU<K, V, C, H>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less<unsigned int>; H = std::hash<unsigned int>]' thread 9635080 time 2017-06-06 23:24:29.520142
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.2-2450-g0804911/rpm/el7/BUILD/ceph-12.0.2-2450-g0804911/src/common/shared_cache.hpp: 105: FAILED assert(weak_refs.empty())
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr:
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr: ceph version  12.0.2-2450-g0804911 (0804911540df0c38883250e101725c383f2486b5) luminous (dev)
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xafb9f0]
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr: 2: (()+0x5036da) [0x60b6da]
2017-06-06T23:24:29.946 INFO:tasks.ceph.osd.1.smithi011.stderr: 3: (OSDService::~OSDService()+0x17c) [0x594f3c]
2017-06-06T23:24:29.947 INFO:tasks.ceph.osd.1.smithi011.stderr: 4: (OSD::~OSD()+0x133) [0x5e37a3]
2017-06-06T23:24:29.947 INFO:tasks.ceph.osd.1.smithi011.stderr: 5: (OSD::~OSD()+0x9) [0x5e3de9]
2017-06-06T23:24:29.947 INFO:tasks.ceph.osd.1.smithi011.stderr: 6: (main()+0x2f48) [0x4dc4e8]
2017-06-06T23:24:29.947 INFO:tasks.ceph.osd.1.smithi011.stderr: 7: (__libc_start_main()+0xf5) [0xd5c5b35]
2017-06-06T23:24:29.947 INFO:tasks.ceph.osd.1.smithi011.stderr: 8: (()+0x46d4a6) [0x5754a6]

/a/sage-2017-06-06_21:54:14-rados-wip-sage-testing-distro-basic-smithi/1265660
Actions #2

Updated by Zengran Zhang almost 7 years ago

We found that the msg threads are still running after the `delete osd` in an asyncmsg environment, because asyncmsg::wait() does not join its workers' threads. Could this cause the issue here?
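
To illustrate the hypothesis above, here is a minimal sketch of how a thread that is not joined can keep a reference alive while the owner is already tearing down. This is not the async messenger code. `PinningWorker` and `stop_and_join()` are hypothetical names, and a `shared_ptr<const int>` stands in for a cached OSDMap reference.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical worker that pins a shared object until it is told to stop.
class PinningWorker {
  std::promise<void> stop_;
  std::thread thread_;
  bool stopped_ = false;

public:
  explicit PinningWorker(std::shared_ptr<const int> pinned) {
    assert(pinned != nullptr);
    std::promise<void> started;
    std::future<void> started_f = started.get_future();
    std::future<void> stop_f = stop_.get_future();
    thread_ = std::thread(
        [pinned = std::move(pinned), started = std::move(started),
         stop_f = std::move(stop_f)]() mutable {
          started.set_value();  // the reference is now held by this thread
          stop_f.wait();        // keep it pinned until asked to stop
          pinned.reset();       // release before the thread exits
        });
    started_f.wait();  // return only once the worker holds the reference
  }

  // Ask the worker to stop and wait for it to finish.
  void stop_and_join() {
    if (!stopped_) {
      stop_.set_value();
      stopped_ = true;
    }
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  ~PinningWorker() { stop_and_join(); }
};
```

While the worker is running, the owner sees an extra reference (`use_count() == 2` in this sketch); only after `stop_and_join()` returns is the count back to 1. A shutdown path that destroys the cache before joining such threads would still find live references, which would match the symptom reported here if the hypothesis holds.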

Actions #3

Updated by Sage Weil almost 7 years ago

  • Related to Bug #20273: osd/OSD.h: 1957: FAILED assert(peering_queue.empty()) added
Actions #4

Updated by Sage Weil almost 7 years ago

Could be... maybe also #20273?

Actions #5

Updated by Sage Weil almost 7 years ago

  • Status changed from 12 to Need More Info
Actions #6

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
Actions #7

Updated by Yuri Weinstein almost 7 years ago

Also in http://qa-proxy.ceph.com/teuthology/yuriw-2017-06-21_01:02:43-rgw-master_2017_6_21-distro-basic-smithi/1307264/teuthology.log

2017-06-21T01:36:20.088 INFO:tasks.ceph.c2.osd.2.smithi044.stderr:/build/ceph-12.0.3-1946-g503de20/src/common/shared_cache.hpp: In function 'SharedLRU<K, V, C, H>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less<unsigned int>; H = std::hash<unsigned int>]' thread 96903c0 time 2017-06-21 01:36:20.091545
2017-06-21T01:36:20.089 INFO:tasks.ceph.c2.osd.2.smithi044.stderr:/build/ceph-12.0.3-1946-g503de20/src/common/shared_cache.hpp: 108: FAILED assert(weak_refs.empty())
2017-06-21T01:36:20.127 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: ceph version 12.0.3-1946-g503de20 (503de209275b5d54a41747e19bca5495259bec43) luminous (dev)
2017-06-21T01:36:20.127 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0xaea5be]
2017-06-21T01:36:20.127 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 2: (()+0x512445) [0x61a445]
2017-06-21T01:36:20.127 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 3: (OSDService::~OSDService()+0x16c) [0x5a871c]
2017-06-21T01:36:20.127 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 4: (OSD::~OSD()+0x123) [0x5f5c83]
2017-06-21T01:36:20.127 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 5: (OSD::~OSD()+0x9) [0x5f62a9]
2017-06-21T01:36:20.128 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 6: (main()+0x2ae7) [0x4f3037]
2017-06-21T01:36:20.128 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 7: (__libc_start_main()+0xf5) [0xc760f45]
2017-06-21T01:36:20.128 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: 8: (()+0x485096) [0x58d096]
2017-06-21T01:36:20.128 INFO:tasks.ceph.c2.osd.2.smithi044.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #8

Updated by Casey Bodley almost 7 years ago

These osd assertion failures reproduce consistently on shutdown in the rgw:multisite suite.

Actions #9

Updated by Sage Weil almost 7 years ago

  • Related to Bug #20432: pgid 0.7 has ref count of 2 added
Actions #10

Updated by Sage Weil almost 7 years ago

/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331957

Actions #11

Updated by Patrick Donnelly almost 7 years ago

/a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333726
/a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333675

Actions #12

Updated by Kefu Chai almost 7 years ago

  • Priority changed from Urgent to High

Lowering the priority since we haven't spotted it for a while.

Actions #13

Updated by Sage Weil over 6 years ago

  • Related to Bug #21823: on_flushed: object ... obc still alive (ec + cache tiering) added
Actions #14

Updated by Sage Weil over 6 years ago

  • Status changed from Need More Info to Can't reproduce
Actions #15

Updated by Sage Weil over 5 years ago

  • Status changed from Can't reproduce to 12

I see a zillion of these in this run:

http://pulpito.ceph.com/teuthology-2019-01-05_03:09:02-powercycle-master-distro-basic-smithi/#

for example,

/a/teuthology-2019-01-05_03:09:02-powercycle-master-distro-basic-smithi/3424141

2019-01-07 20:43:38.514 7f98ab336900  1 --  shutdown_connections 
2019-01-07 20:43:38.514 7f98ab336900  1 --  wait complete.
2019-01-07 20:43:38.518 7f98ab336900 -1 leaked refs:
dump_weak_refs 0xbf768e0 weak_refs: 24 = 0xc2a6d00 with 1 refs

2019-01-07 20:43:38.522 7f98ab336900 -1 /build/ceph-14.0.1-2314-g4331a92/src/common/shared_cache.hpp: In function 'SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = const OSDMap]' thread 7f98ab336900 time 2019-01-07 20:43:38.519177
/build/ceph-14.0.1-2314-g4331a92/src/common/shared_cache.hpp: 121: FAILED ceph_assert(weak_refs.empty())

 ceph version 14.0.1-2314-g4331a92 (4331a92ab70f878fac574bd60a1ce3bc310680f2) nautilus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x82e1a3]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x82e37e]
 3: ceph-osd() [0x9d5001]
 4: (OSDService::~OSDService()+0x144) [0x988d14]
 5: (OSD::~OSD()+0x2c3) [0x98e7b3]
 6: (OSD::~OSD()+0x9) [0x98ede9]

Actions #16

Updated by Josh Durgin over 4 years ago

  • Status changed from 12 to Can't reproduce

Re-open if this still occurs.
