Bug #7995


osd shutdown: ./common/shared_cache.hpp: 93: FAILED assert(weak_refs.empty())

Added by Sage Weil about 10 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-04-04_02:30:19-rados-master-testing-basic-plana/168602

    -3> 2014-04-04 14:53:32.441869 4036900 10 -- 0.0.0.0:6809/31028 wait: waiting for pipes  to close
    -2> 2014-04-04 14:53:32.442008 4036900 10 -- 0.0.0.0:6809/31028 wait: done.
    -1> 2014-04-04 14:53:32.442139 4036900  1 -- 0.0.0.0:6809/31028 shutdown complete.
     0> 2014-04-04 14:53:32.706163 4036900 -1 ./common/shared_cache.hpp: In function 'SharedLRU<K, V>::~SharedLRU() [with K = unsigned int, V = const OSDMap]' thread 4036900 time 2014-04-04 14:53:32.556532
./common/shared_cache.hpp: 93: FAILED assert(weak_refs.empty())

 ceph version 0.78-556-g18201ef (18201efd65d88aa3fb21eb971132bd599b215a85)
 1: (OSDService::~OSDService()+0xa97) [0x62a287]
 2: (OSD::~OSD()+0x113) [0x642153]
 3: (OSD::~OSD()+0x9) [0x6432c9]
 4: (main()+0x1f8f) [0x60d85f]
 5: (__libc_start_main()+0xed) [0x6c4b76d]
 6: ceph-osd() [0x6118b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Files

ceph-osd.2.log (17.8 MB) - Sahana Lokeshappa, 09/19/2014 01:55 AM
osd-assert-weak_refs.txt.gz (3.2 KB) - Eric Eastman, 10/02/2015 08:37 AM
osd-tree.txt.gz (1.36 KB) - Eric Eastman, 10/02/2015 08:37 AM

Related issues 4 (0 open, 4 closed)

Has duplicate Ceph - Bug #8022: coredumps found in librbd tests (Duplicate, 04/07/2014)

Has duplicate Ceph - Bug #10901: FAILED assert(weak_refs.empty()) (outstanding OSDMap ref at shutdown) (Duplicate, 02/17/2015)

Has duplicate Ceph - Bug #14265: ./common/shared_cache.hpp: 107: FAILED assert(weak_refs.empty()) (Duplicate, 01/06/2016)

Has duplicate Ceph - Bug #14458: "shared_cache.hpp: 104: FAILED assert(weak_refs.empty())" in rados-hammer-distro-basic-mira (Duplicate, 01/21/2016)

Actions #1

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/a/teuthology-2014-04-06_01:10:23-ceph-deploy-firefly-distro-basic-vps/173432

Actions #2

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Duplicate

this seems to be the same as #7891

Actions #3

Updated by Sage Weil almost 10 years ago

  • Status changed from Duplicate to 12
  • Priority changed from Urgent to High

ubuntu@teuthology:/a/samuelj-2014-05-08_12:44:29-rados-firefly-testing-basic-plana/243892

Actions #4

Updated by Sage Weil almost 10 years ago

i don't think this is a dup of #7891. the pgs have been cleaned up at this point.

Actions #5

Updated by Haomai Wang almost 10 years ago

I haven't explored the failure in this case, but it reminds me that when I tried shared_cache for the KeyValueStore header cache, I found something problematic.

In SharedCache, if we call the "clear" method to evict a key from the cache, we must ensure that no other place still holds the evicted key! For example, we insert "key1" into SharedCache, then we get the shared_ptr for "key1". At some point we call "clear" to evict "key1", but a caller still holds the shared_ptr pointing to "key1". So the SharedCache::remove method won't be called and "weak_refs" won't erase "key1". As a result, if we look up "key1" in the cache, we will find that "weak_refs" still has an entry for "key1", and unexpected results can happen.

Back to this case: I think someone must still hold a key, so when we release the SharedCache, "weak_refs" still has elements.

I'm not sure my guess is right; I'm just recalling my experience with SharedCache.

Actions #6

Updated by Samuel Just almost 10 years ago

clear() should be better named; it actually only removes the LRU ref to the key. If there are other refs around, the key will remain in weak_refs, which is correct. I don't think we ever use that method for the OSDMap cache.

This assert most likely means that an OSDMapRef remains alive at this point, either through a leak or through a structure we failed to clear.
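
For illustration only, here is a minimal sketch of those semantics with hypothetical names (this is not the actual common/shared_cache.hpp code): clear() drops only the cache's own strong references, weak_refs keeps an entry as long as any shared_ptr to the value survives anywhere, and the destructor asserts that nothing is left.

#include <cassert>
#include <list>
#include <map>
#include <memory>

// Minimal, hypothetical sketch of the SharedLRU semantics described above --
// not the actual common/shared_cache.hpp implementation.
template <typename K, typename V>
struct MiniSharedLRU {
  std::map<K, std::weak_ptr<V>> weak_refs;  // every live value, keyed by K
  std::list<std::shared_ptr<V>> lru;        // strong refs owned by the cache itself

  ~MiniSharedLRU() {
    lru.clear();
    // The failing assert: by destruction time no caller may still hold a ref.
    assert(weak_refs.empty());
  }

  std::shared_ptr<V> add(const K& key, V* value) {
    // The deleter erases the weak_refs entry when the last shared_ptr drops,
    // so weak_refs tracks exactly the values still referenced somewhere.
    std::shared_ptr<V> p(value, [this, key](V* v) {
      weak_refs.erase(key);
      delete v;
    });
    weak_refs[key] = p;
    lru.push_front(p);
    return p;
  }

  // clear() only drops the cache's own (LRU) strong refs.  A value some caller
  // still holds stays in weak_refs -- correct behaviour, as noted above.
  void clear() { lru.clear(); }
};

int main() {
  MiniSharedLRU<unsigned, int> cache;
  std::shared_ptr<int> held = cache.add(1853, new int(42));  // caller keeps a ref
  cache.clear();  // cache forgets its own ref; weak_refs still holds key 1853
  held.reset();   // last external ref dropped -> weak_refs entry erased
  return 0;       // destructor assert passes; had 'held' leaked (like an
                  // OSDMapRef at OSD shutdown), it would have fired
}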

Actions #7

Updated by Sage Weil almost 10 years ago

two nearly-identical yamls hit this:

ubuntu@teuthology:/a/teuthology-2014-06-27_23:00:47-rgw-master-testing-basic-plana/331383
ubuntu@teuthology:/a/teuthology-2014-06-27_23:00:47-rgw-master-testing-basic-plana/331389

ubuntu@teuthology:/a/teuthology-2014-06-27_23:00:47-rgw-master-testing-basic-plana$ diff 331389/orig.config.yaml 331383/orig.config.yaml
1c1
< archive_path: /var/lib/teuthworker/archive/teuthology-2014-06-27_23:00:47-rgw-master-testing-basic-plana/331389
---
> archive_path: /var/lib/teuthworker/archive/teuthology-2014-06-27_23:00:47-rgw-master-testing-basic-plana/331383
4c4
< rgw_pool_type/ec.yaml tasks/rgw_swift.yaml validater/valgrind.yaml}
---
> rgw_pool_type/ec-cache.yaml tasks/rgw_swift.yaml validater/valgrind.yaml}
6c6
< job_id: '331389'
---
> job_id: '331383'
61a62
> cache-pools: true
98c99
< worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.10805
---
> worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.13546

... which suggests this is probably pretty reproducible with this:

roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
suite: rgw
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- rgw:
    client.0:
      valgrind:
      - --tool=memcheck
- swift:
    client.0:
      rgw_server: client.0
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 62027ec9f4e7ebcf9ea129bba78476309bc82702
    valgrind:
      mds:
      - --tool=memcheck
      mon:
      - --tool=memcheck
      - --leak-check=full
      - --show-reachable=yes
      osd:
      - --tool=memcheck
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 62027ec9f4e7ebcf9ea129bba78476309bc82702
  rgw:
    ec-data-pool: true
  s3tests:
    branch: master
  workunit:
    sha1: 62027ec9f4e7ebcf9ea129bba78476309bc82702
Actions #8

Updated by Sage Weil almost 10 years ago

  • Priority changed from High to Urgent
Actions #10

Updated by Sage Weil almost 10 years ago

  • Assignee set to Sage Weil
Actions #11

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Fix Under Review

https://github.com/ceph/ceph/pull/2081

I suspect the original incidents for this bug were triggered by a different leak, but this captures the current ones that I can reproduce. We can reopen if/when it reoccurs.

Actions #12

Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Resolved
Actions #13

Updated by John Spray over 9 years ago

  • Status changed from Resolved to New

This is happening again:

http://pulpito.front.sepia.ceph.com/john-2014-08-22_02:21:21-rados-wip-objecter-testing-basic-multi/440848/

    -1> 2014-08-21 20:59:20.676251 4038b00  1 -- 0.0.0.0:6814/3952 shutdown complete.
     0> 2014-08-21 20:59:20.801080 4038b00 -1 ./common/shared_cache.hpp: In function 'SharedLRU<K, V>::~SharedLRU() [with K = unsigned int, V = const OSDMap]' thread 4038b00 time 2014-08-21 20:59:20.722476
./common/shared_cache.hpp: 93: FAILED assert(weak_refs.empty())

 ceph version 0.84-630-gc10b489 (c10b48963dc879ec97a238447905abc34e06caa6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xa98d8f]
 2: (OSDService::~OSDService()+0xbb6) [0x62b0e6]
 3: (OSD::~OSD()+0x115) [0x63e315]
 4: (OSD::~OSD()+0x9) [0x63f4c9]
 5: (main()+0x1fe9) [0x60bd29]
 6: (__libc_start_main()+0xed) [0x683b76d]
 7: ceph-osd() [0x60fec9]
Actions #15

Updated by Sage Weil over 9 years ago

added patches to master that will dump the weak_refs on shutdown
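
As a rough sketch of what such a dump could look like (this is not the actual patch; the helper name and signature below are hypothetical), the idea is just to walk the surviving weak_refs and report the key, address and outstanding ref count, in the spirit of the "dump_weak_refs ... with N refs" lines quoted below:

#include <iostream>
#include <map>
#include <memory>

// Illustrative only: report each map epoch that still has live references.
template <typename K, typename V>
void dump_weak_refs(const std::map<K, std::weak_ptr<V>>& weak_refs,
                    std::ostream& out) {
  out << "leaked refs:\n";
  for (const auto& kv : weak_refs) {
    if (auto p = kv.second.lock()) {  // still alive -> someone leaked a ref
      out << "  " << kv.first << " = " << p.get()
          << " with " << (p.use_count() - 1) << " refs\n";  // -1 for our lock()
    }
  }
}

int main() {
  std::map<unsigned, std::weak_ptr<int>> weak_refs;
  auto leaked = std::make_shared<int>(0);  // stands in for a leaked OSDMapRef
  weak_refs[1853] = leaked;
  dump_weak_refs(weak_refs, std::cerr);    // prints: 1853 = <addr> with 1 refs
  return 0;
}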

Actions #16

Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info
Actions #17

Updated by Sage Weil over 9 years ago

    -1> 2014-08-29 01:23:09.102704 4038ac0 -1 leaked refs:
dump_weak_refs 0xaa0ff88 weak_refs: 1853 = 0x2c0477e0 with 1 refs

ubuntu@teuthology:/a/sage-2014-08-28_16:08:59-rados-master-testing-basic-multi/458983
Actions #18

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/teuthology-2014-08-31_02:30:01-rados-next-testing-basic-multi/463412 (but didn't have debug enabled, so we don't know which map it was)

Actions #19

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/teuthology-2014-09-08_02:32:01-rados-master-testing-basic-multi/472466
(most recent osdmap)

Actions #20

Updated by Sahana Lokeshappa over 9 years ago

It reproduced; please find the logs attached.

Here is the snippet:

2014-09-19 10:27:02.228364 7f86d73a2700 0 log_channel(default) log [INF] : 2.f1 restarting backfill on osd.57 from (0'0,0'0] MAX to 6820'24528
2014-09-19 10:27:08.614353 7f86ea1b1700 -1 osd.2 13201 heartbeat_check: no reply from osd.1 ever on either front or back, first ping sent 2014-09-19 10:26:48.508945 (cutoff 2014-09-19 10:26:48.614351)
2014-09-19 10:27:09.748657 7f86d7ba3700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f86d7ba3700 time 2014-09-19 10:27:09.745965
osd/PGLog.cc: 546: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db]
2: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c89) [0x765e59]
3: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x7a9d1c]
4: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x380) [0x7e1ea0]
5: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x81a944]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x805d6b]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7b5b8e]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2b0) [0x6a3140]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6f83c8]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb68866]
11: (ThreadPool::WorkThread::entry()+0x10) [0xb69910]
12: (()+0x8182) [0x7f86f1936182]
13: (clone()+0x6d) [0x7f86efd1c30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #21

Updated by Sahana Lokeshappa over 9 years ago

Please ignore the previous update. Here is the correct one:

While some osds were in a nearfull situation, shutting down one of the osds led to the crash below.

Here is the snippet:

2014-09-19 12:15:51.270234 7f20ac26b900 10 -- 0.0.0.0:6828/16757 reaper done
2014-09-19 12:15:51.270235 7f20ac26b900 10 -- 0.0.0.0:6828/16757 wait: waiting for pipes to close
2014-09-19 12:15:51.270237 7f20ac26b900 10 -- 0.0.0.0:6828/16757 wait: done.
2014-09-19 12:15:51.270238 7f20ac26b900 1 -- 0.0.0.0:6828/16757 shutdown complete.
2014-09-19 12:15:51.312746 7f20ac26b900 -1 leaked refs:
dump_weak_refs 0x4b25940 weak_refs: 15678 = 0x13ef0a00 with 7 refs

2014-09-19 12:15:51.322518 7f20ac26b900 -1 ./common/shared_cache.hpp: In function 'SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = const OSDMap]' thread 7f20ac26b900 time 2014-09-19 12:15:51.321024
./common/shared_cache.hpp: 98: FAILED assert(weak_refs.empty())

ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db]
2: (SharedLRU<unsigned int, OSDMap const>::~SharedLRU()+0x1b4) [0x6e0184]
3: (OSDService::~OSDService()+0x16c) [0x677d9c]
4: (OSD::~OSD()+0x115) [0x6bb975]
5: (OSD::~OSD()+0x9) [0x6bc5a9]
6: (main()+0x260f) [0x6439bf]
7: (__libc_start_main()+0xf5) [0x7f20a9220ec5]
8: /usr/bin/ceph-osd() [0x65bc07]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2014-09-19 12:15:49.756077 7f207f2e6700 30 -- 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46569 s=2 pgs=241 cs=271 l=0 c=0x17e64680). bl_pos 8389954 b_off 0 leftinchunk 531 buffer len 15 writing 15
-9999> 2014-09-19 12:15:49.756082 7f207f2e6700 30 -- 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46569 s=2 pgs=241 cs=271 l=0 c=0x17e64680). bl_pos 8389969 b_off 0 leftinchunk 516 buffer len 31 writing 31

Actions #22

Updated by Sage Weil over 9 years ago

  • Priority changed from Urgent to High
Actions #23

Updated by Samuel Just over 9 years ago

ubuntu@teuthology:/a/samuelj-2014-10-01_18:59:42-rados-giant-wip-testing-old-vanilla-basic-multi/524988

Actions #24

Updated by Varada Kari over 9 years ago

Looking at the ceph-osd.2.log uploaded by Sahana.
Prior to the reported problem, there was one more crash while merging the PGLog entries, and some more backfills were in progress for some PGs.

-9> 2014-09-19 10:27:33.604368 7f83cb923700 5 osd.2 pg_epoch: 13227 pg[2.e3( v 6791'26630 (4155'23631,6791'26630] lb fbd5f8e3/rbd_data.13f92ae8944a.000000000001c877/head//2 local-les=0 n=510 ec=188 les/c 13222/11333 13226/13226/11427) [12,57,2]/[35,18] r=-1 lpr=13226 pi=11332-13225/28 crt=0'0 lcod 0'0 remapped NOTIFY] exit Started/Stray 1.028444 7 0.000191
-8> 2014-09-19 10:27:33.604382 7f83cb923700 5 osd.2 pg_epoch: 13227 pg[2.e3( v 6791'26630 (4155'23631,6791'26630] lb fbd5f8e3/rbd_data.13f92ae8944a.000000000001c877/head//2 local-les=0 n=510 ec=188 les/c 13222/11333 13226/13226/11427) [12,57,2]/[35,18] r=-1 lpr=13226 pi=11332-13225/28 crt=0'0 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive
-7> 2014-09-19 10:27:33.604396 7f83cb923700 5 osd.2 pg_epoch: 13227 pg[2.e3( v 6791'26630 (4155'23631,6791'26630] lb fbd5f8e3/rbd_data.13f92ae8944a.000000000001c877/head//2 local-les=0 n=510 ec=188 les/c 13222/11333 13226/13226/11427) [12,57,2]/[35,18] r=-1 lpr=13226 pi=11332-13225/28 crt=0'0 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive/RepNotRecovering
-6> 2014-09-19 10:27:33.604401 7f83d4134700 1 -- 10.242.42.164:6807/8845 <== osd.60 10.242.42.178:6804/2488 6 ==== pg_info(1 pgs e13227:2.d3) v4 ==== 742+0+0 (223357553 0 0) 0x15523a20 con 0x14713b80
-5> 2014-09-19 10:27:33.604410 7f83d4134700 5 -- op tracker -- , seq: 247, time: 2014-09-19 10:27:33.604356, event: header_read, op: pg_info(1 pgs e13227:2.d3)
-4> 2014-09-19 10:27:33.604417 7f83d4134700 5 -- op tracker -- , seq: 247, time: 2014-09-19 10:27:33.604357, event: throttled, op: pg_info(1 pgs e13227:2.d3)
-3> 2014-09-19 10:27:33.604421 7f83d4134700 5 -- op tracker -- , seq: 247, time: 2014-09-19 10:27:33.604387, event: all_read, op: pg_info(1 pgs e13227:2.d3)
-2> 2014-09-19 10:27:33.604427 7f83d4134700 5 -- op tracker -- , seq: 247, time: 2014-09-19 10:27:33.604407, event: dispatched, op: pg_info(1 pgs e13227:2.d3)
-1> 2014-09-19 10:27:33.604431 7f83d4134700 5 -- op tracker -- , seq: 247, time: 2014-09-19 10:27:33.604430, event: waiting_for_osdmap, op: pg_info(1 pgs e13227:2.d3)
0> 2014-09-19 10:27:33.604411 7f83cc124700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f83cc124700 time 2014-09-19 10:27:33.601896
osd/PGLog.cc: 546: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

For a couple of hours osd.2 was down.

Once osd.2 came back online, backfilling started for PG 2.7b8 (some backfills were already happening for the same PG on osd.58 prior to this crash).

2014-09-19 12:15:17.646933 7f20ac26b900 0 osd.2 13227 crush map has features 1073741824, adjusting msgr requires for clients
2014-09-19 12:15:17.646941 7f20ac26b900 0 osd.2 13227 crush map has features 1073741824 was 8705, adjusting msgr requires for mons
2014-09-19 12:15:17.646943 7f20ac26b900 0 osd.2 13227 crush map has features 1073741824, adjusting msgr requires for osds
2014-09-19 12:15:17.646970 7f20ac26b900 0 osd.2 13227 load_pgs
2014-09-19 12:15:19.320789 7f20ac26b900 0 osd.2 13227 load_pgs opened 175 pgs
2014-09-19 12:15:19.325741 7f2098e80700 0 osd.2 13227 ignoring osdmap until we have initialized
2014-09-19 12:15:19.326218 7f2098e80700 0 osd.2 13227 ignoring osdmap until we have initialized
2014-09-19 12:15:19.327051 7f2084d57700 0 -- 0.0.0.0:6825/16757 >> 10.242.42.174:6801/21548 pipe(0x499ab00 sd=32 :57349 s=1 pgs=0 cs=0 l=0 c=0x12fa0000).connect claims to be 10.242.42.174:6801/45986 not 10.242.42.174:6801/21548 - wrong node!

...

2014-09-19 12:15:48.293732 7f20811fe700 -1 osd.2 15678 *** Got signal Terminated ***
2014-09-19 12:15:48.293815 7f20811fe700 0 osd.2 15678 prepare_to_stop telling mon we are shutting down
2014-09-19 12:15:48.519110 7f2098e80700 0 osd.2 15678 got_stop_ack starting shutdown
2014-09-19 12:15:48.519130 7f20811fe700 0 osd.2 15678 prepare_to_stop starting shutdown

2014-09-19 12:15:48.554036 7f20811fe700 -1 osd.2 15678 shutdown

...
2014-09-19 12:15:48.554443 7f207f2e6700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=1 pgs=105 cs=2 l=0 c=0x17e64680).reply.authorizer_len=36
2014-09-19 12:15:48.554468 7f207f2e6700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).connect success 3, lossy = 0, features 17592186044415
2014-09-19 12:15:48.554481 7f207f2e6700 20 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).connect starting reader
2014-09-19 12:15:48.554523 7f207f2e6700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).writer: state = open policy.server=0
2014-09-19 12:15:48.554532 7f207f2e6700 20 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).writer encoding 181 features 17592186044415 0xbdd8d00 pg_backfill(progress 2.7b8 e 15678/15678 lb 854e8fb8/rbd_data.13672ae8944a.0000000000021292/head//2) v3
2014-09-19 12:15:48.554554 7f207f2e6700 20 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).writer signed seq # 181): sig = 2459147136158861754
2014-09-19 12:15:48.554563 7f207f2e6700 20 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).writer sending 181 0xbdd8d00
2014-09-19 12:15:48.554567 7f207f2e6700 30 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680). bl_pos 0 b_off 0 leftinchunk 758 buffer len 758 writing 758
2014-09-19 12:15:48.554578 7f207f2e6700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).writer: state = open policy.server=0
2014-09-19 12:15:48.554582 7f207f2e6700 20 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).writer encoding 182 features 17592186044415 0x1516ca00 MOSDPGPush(2.7b8 15678 [PushOp(b66e8fb8/rbd_data.13ff2ae8944a.0000000000020b50/head//2, version: 4075'26386, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(b66e8fb8/rbd_data.13ff2ae8944a.0000000000020b50/head//2@4075'26386, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(5bce8fb8/rbd_data.13fc2ae8944a.000000000001a1e2/head//2, version: 4108'26458, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(5bce8fb8/rbd_data.13fc2ae8944a.000000000001a1e2/head//2@4108'26458, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2

2014-09-19 12:15:50.383864 7f207f2e6700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46636 s=2 pgs=298 cs=385 l=0 c=0x17e64680).writer: state = open policy.server=0
2014-09-19 12:15:50.383889 7f207f2e6700 20 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46636 s=2 pgs=298 cs=385 l=0 c=0x17e64680).writer half-reencoding 183 features 17592186044415 0x17ef4800 MOSDPGPush(2.7b8 15678 [PushOp(277f8fb8/rbd_data.13672ae8944a.00000000000205fc/head//2, version: 4060'26194, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(277f8fb8/rbd_data.13672ae8944a.00000000000205fc/head//2@4060'26194, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(69bf8fb8/rbd_data.13f92ae8944a.0000000000009345/head//2, version: 3523'24386, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(69bf8fb8/rbd_data.13f92ae8944a.0000000000009345/head//2@3523'24386, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2

...
2014-09-19 12:15:48.556307 7f207f1e5700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).requeue_sent MOSDPGPush(2.7b8 15678 [PushOp(b66e8fb8/rbd_data.13ff2ae8944a.0000000000020b50/head//2, version: 4075'26386, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(b66e8fb8/rbd_data.13ff2ae8944a.0000000000020b50/head//2@4075'26386, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(5bce8fb8/rbd_data.13fc2ae8944a.000000000001a1e2/head//2, version: 4108'26458, data_included: [0~4194304], data_size: 4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(5bce8fb8/rbd_data.13fc2ae8944a.000000000001a1e2/head//2@4108'26458, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 for resend seq 182 (182)
2014-09-19 12:15:48.556327 7f207f1e5700 10 – 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=18 :46410 s=2 pgs=107 cs=3 l=0 c=0x17e64680).requeue_sent pg_backfill(progress 2.7b8 e 15678/15678 lb 854e8fb8/rbd_data.13672ae8944a.0000000000021292/head//2) v3 for resend seq 181 (181)

.....

-455> 2014-09-19 12:15:51.264698 7f20a5443700 10 -- 10.242.42.164:6825/16757 reaper deleted pipe 0x499c680
-454> 2014-09-19 12:15:51.264700 7f20ac26b900 10 -- 10.242.42.164:6827/16757 wait: waiting for dispatch queue
-453> 2014-09-19 12:15:51.264700 7f20a5443700 10 -- 10.242.42.164:6825/16757 reaper reaping pipe 0x16bb4940 10.242.42.176:6818/27198
-452> 2014-09-19 12:15:51.264702 7f20a5443700 10 -- 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=12 :46713 s=4 pgs=369 cs=527 l=0 c=0x17e64680).discard_queue
-451> 2014-09-19 12:15:51.264706 7f20a5443700 20 -- 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=12 :46713 s=4 pgs=369 cs=527 l=0 c=0x17e64680). discard 0x1516ca00
-450> 2014-09-19 12:15:51.264724 7f20a5443700 20 -- 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=12 :46713 s=4 pgs=369 cs=527 l=0 c=0x17e64680). discard 0x15a21e00
-449> 2014-09-19 12:15:51.264738 7f20a5443700 20 -- 10.242.42.164:6825/16757 >> 10.242.42.176:6818/27198 pipe(0x16bb4940 sd=12 :46713 s=4 pgs=369 cs=527 l=0 c=0x17e64680). discard 0x17ef4800
-448> 2014-09-19 12:15:51.264742 7f20ac26b900 10 -- 10.242.42.164:6827/16757 wait: dispatch queue is stopped
-447> 2014-09-19 12:15:51.264746 7f20ac26b900 20 -- 10.242.42.164:6827/16757 wait: stopping accepter thread

....

So the push op requests were discarded, but the reference on the corresponding epoch was still there, leading to the crash?
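
As a toy sketch of that hypothesis (not actual Ceph code; every type below is a stand-in), the point is simply that any queued message which still embeds a map reference keeps the corresponding weak_refs entry alive until that reference is actually released:

#include <deque>
#include <memory>

// Stand-ins for OSDMap / OSDMapRef; the real types live in the OSD.
struct OSDMapStub { unsigned epoch; };
using OSDMapRefStub = std::shared_ptr<const OSDMapStub>;

// A queued message (think MOSDPGPush) pins the map it was built against.
struct QueuedPush {
  OSDMapRefStub map;
  // ... payload omitted ...
};

int main() {
  auto map = std::make_shared<const OSDMapStub>(OSDMapStub{15678});
  std::deque<QueuedPush> out_q;
  out_q.push_back(QueuedPush{map});  // message sits in the pipe's out queue
  map.reset();                       // the OSD drops its own reference at shutdown...
  // ...but as long as the queued (later "discarded") message's reference is not
  // actually released, the map cache's weak_refs entry for this epoch cannot
  // disappear, and the shutdown assert fires.
  bool still_pinned = out_q.front().map.use_count() > 0;
  return still_pinned ? 0 : 1;
}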

Actions #25

Updated by Samuel Just over 9 years ago

ubuntu@teuthology:/a/samuelj-2014-10-15_20:19:09-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/551397/remote

Actions #26

Updated by Samuel Just over 9 years ago

ubuntu@teuthology:/a/samuelj-2014-10-22_14:27:22-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/566853/remote

Actions #27

Updated by Samuel Just over 9 years ago

ubuntu@teuthology:/a/samuelj-2014-10-24_23:51:24-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/571474/remote

Actions #28

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/teuthology-2014-11-10_02:32:01-rados-giant-distro-basic-multi/594038

Actions #29

Updated by Sage Weil over 9 years ago

  • Assignee deleted (Sage Weil)
Actions #31

Updated by Loïc Dachary about 9 years ago

http://pulpito.ceph.com/loic-2015-02-02_23:31:31-rados-giant-backports---basic-multi/736802/

2015-02-05 06:28:48.097484 4038b00 -1 leaked refs:
dump_weak_refs 0x7d424c8 weak_refs: 1178 = 0xa6fac10 with 1 refs

Actions #32

Updated by David Zafman almost 9 years ago

  • Regression set to No

Saw this while testing a branch:

ubuntu@teuthology:/a/dzafman-2015-05-12_08:43:26-rados-wip-10809-11135-10290---basic-multi/887717

Actions #33

Updated by Yuri Weinstein almost 9 years ago

Saw in run: http://pulpito.ceph.com/teuthology-2015-05-20_21:00:01-rados-next-distro-basic-multi/
Job: ['902503']

Assertion: ./common/shared_cache.hpp: 104: FAILED assert(weak_refs.empty())
ceph version 9.0.0-532-g055c0c4 (055c0c480e668581df421ebedc0c6e8f23e2aa0b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xc32deb]
 2: (SharedLRU<unsigned int, OSDMap const>::~SharedLRU()+0x332) [0x718482]
 3: (OSDService::~OSDService()+0x16c) [0x6ad67c]
 4: (OSD::~OSD()+0x115) [0x6f3035]
 5: (OSD::~OSD()+0x9) [0x6f3be9]
 6: (main()+0x283d) [0x6673ad]
 7: (__libc_start_main()+0xf5) [0x776aec5]
 8: ceph-osd() [0x693b77]
Actions #34

Updated by Loïc Dachary almost 9 years ago

The osd with the failed assert was running with valgrind.

Updated by Eric Eastman over 8 years ago

Looking through my OSD logs on a Ceph v9.0.3 cluster running on Trusty with the 4.2 kernel, I am seeing multiple instances of this FAILED assert message:

/var/log/ceph/ceph-osd.47.log-    -7> 2015-10-01 21:13:21.719550 7f85e4552980 10 -- :/227190 wait: closing pipes
/var/log/ceph/ceph-osd.47.log-    -6> 2015-10-01 21:13:21.719551 7f85e4552980 10 -- :/227190 reaper
/var/log/ceph/ceph-osd.47.log-    -5> 2015-10-01 21:13:21.719552 7f85e4552980 10 -- :/227190 reaper done
/var/log/ceph/ceph-osd.47.log-    -4> 2015-10-01 21:13:21.719553 7f85e4552980 10 -- :/227190 wait: waiting for pipes  to close
/var/log/ceph/ceph-osd.47.log-    -3> 2015-10-01 21:13:21.719555 7f85e4552980 10 -- :/227190 wait: done.
/var/log/ceph/ceph-osd.47.log-    -2> 2015-10-01 21:13:21.719556 7f85e4552980  1 -- :/227190 shutdown complete.
/var/log/ceph/ceph-osd.47.log-    -1> 2015-10-01 21:13:21.740289 7f85e4552980 -1 leaked refs:
/var/log/ceph/ceph-osd.47.log-dump_weak_refs 0x3f52908 weak_refs: 34630 = 0x7356d80 with 2 refs
/var/log/ceph/ceph-osd.47.log-
/var/log/ceph/ceph-osd.47.log-     0> 2015-10-01 21:13:21.743409 7f85e4552980 -1 ./common/shared_cache.hpp: In function 'SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = const OSDMap]' thread 7f85e4552980 time 2015-10-01 21:13:21.740320
/var/log/ceph/ceph-osd.47.log:./common/shared_cache.hpp: 104: FAILED assert(weak_refs.empty())
/var/log/ceph/ceph-osd.47.log-
/var/log/ceph/ceph-osd.47.log- ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da)
/var/log/ceph/ceph-osd.47.log- 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xc4689b]
/var/log/ceph/ceph-osd.47.log- 2: (SharedLRU<unsigned int, OSDMap const>::~SharedLRU()+0x318) [0x717068]
/var/log/ceph/ceph-osd.47.log- 3: (OSDService::~OSDService()+0x13a) [0x6af84a]
/var/log/ceph/ceph-osd.47.log- 4: (OSD::~OSD()+0x117) [0x6f3487]
/var/log/ceph/ceph-osd.47.log- 5: (OSD::~OSD()+0x9) [0x6f40e9]
/var/log/ceph/ceph-osd.47.log- 6: (main()+0x2873) [0x669203]
/var/log/ceph/ceph-osd.47.log- 7: (__libc_start_main()+0xf5) [0x7f85e146bec5]
/var/log/ceph/ceph-osd.47.log- 8: /usr/bin/ceph-osd() [0x695a47]

The cluster has 6 OSD servers, each with 30 drives, but several drives are out of service, so there are 168 active.

The output of grepping through all the OSD servers with the command:
grep -C10 'assert(weak_refs.empty())' /var/log/ceph/*log
is attached, along with the ceph osd tree output. I tried to attach the gzipped OSD 47 log file, but the submission failed, saying it was too big.

Most of the testing on this system has been with cephfs.

System info:

uname -a
Linux dfss04 4.2.0-040200-generic #201508301530 SMP Sun Aug 30 19:31:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

ceph -v
ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da)

ceph -s
  cluster c261c2dc-5e29-11e5-98ba-68b599c50db0
     health HEALTH_WARN
            37 requests are blocked > 32 sec
     monmap e1: 3 mons at {dfmon01=10.16.51.21:6789/0,dfmon02=10.16.51.22:6789/0,dfmon03=10.16.51.23:6789/0}
            election epoch 6, quorum 0,1,2 dfmon01,dfmon02,dfmon03
     mdsmap e3261: 1/1/1 up {0=dfmds01=up:active}, 1 up:standby
     osdmap e36229: 176 osds: 168 up, 168 in
      pgmap v655714: 18496 pgs, 4 pools, 48674 GB data, 12379 kobjects
            143 TB used, 103 TB / 246 TB avail
               18490 active+clean
                   6 active+clean+scrubbing

free
             total       used       free     shared    buffers     cached
Mem:      65948092   64324828    1623264       1128      16572   41075828
-/+ buffers/cache:   23232428   42715664
Swap:     15998972      29732   15969240

This system is using btrfs for the OSDs with the following settings:
filestore_btrfs_snap = false
filestore_btrfs_clone_range = false
filestore_journal_parallel = false
osd_mount_options_btrfs = rw,noatime,autodefrag,user_subvol_rm_allowed
osd_mkfs_options_btrfs = -f -m single -n 32768

Actions #37

Updated by Loïc Dachary over 8 years ago

  • Status changed from Need More Info to 12
Actions #38

Updated by Samuel Just over 8 years ago

  • Has duplicate Bug #14265: ./common/shared_cache.hpp: 107: FAILED assert(weak_refs.empty()) added
Actions #39

Updated by Sage Weil over 8 years ago

/a/sage-2016-01-12_06:26:42-rados-wip-sage-testing---basic-smithi/25450

Actions #40

Updated by Sage Weil about 8 years ago

  • Has duplicate Bug #14458: "shared_cache.hpp: 104: FAILED assert(weak_refs.empty())" in rados-hammer-distro-basic-mira added
Actions #41

Updated by Yuri Weinstein about 8 years ago

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-04-11_02:00:06-rados-infernalis-distro-basic-openstack/
Job: 33378
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-04-11_02:00:06-rados-infernalis-distro-basic-openstack/33378/teuthology.log

2016-04-11T08:20:06.020 INFO:tasks.ceph.osd.0.target064172.stderr:./common/shared_cache.hpp: In function 'SharedLRU<K, V, C>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less<unsigned int>]' thread 403c4c0 time 2016-04-11 08:20:05.957718
2016-04-11T08:20:06.020 INFO:tasks.ceph.osd.0.target064172.stderr:./common/shared_cache.hpp: 104: FAILED assert(weak_refs.empty())
2016-04-11T08:20:06.020 INFO:tasks.ceph.osd.0.target064172.stderr:2016-04-11 08:20:05.945194 403c4c0 -1 leaked refs:
2016-04-11T08:20:06.020 INFO:tasks.ceph.osd.0.target064172.stderr:dump_weak_refs 0x83e1110 weak_refs: 795 = 0x2c62d820 with 1 refs
2016-04-11T08:20:06.021 INFO:tasks.ceph.osd.0.target064172.stderr:
2016-04-11T08:20:06.052 INFO:tasks.ceph.osd.0.target064172.stderr: ceph version 9.2.1-14-geff3ff4 (eff3ff4cbe9bfef7c6429b183f7dc0a16359c395)
2016-04-11T08:20:06.052 INFO:tasks.ceph.osd.0.target064172.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x9c5e7b]
2016-04-11T08:20:06.052 INFO:tasks.ceph.osd.0.target064172.stderr: 2: (()+0x37987d) [0x48187d]
2016-04-11T08:20:06.052 INFO:tasks.ceph.osd.0.target064172.stderr: 3: (OSDService::~OSDService()+0x1e9) [0x418c89]
2016-04-11T08:20:06.052 INFO:tasks.ceph.osd.0.target064172.stderr: 4: (OSD::~OSD()+0x123) [0x45cc93]
2016-04-11T08:20:06.052 INFO:tasks.ceph.osd.0.target064172.stderr: 5: (OSD::~OSD()+0x9) [0x45d2b9]
2016-04-11T08:20:06.053 INFO:tasks.ceph.osd.0.target064172.stderr: 6: (main()+0x2a4e) [0x3d03ae]
2016-04-11T08:20:06.053 INFO:tasks.ceph.osd.0.target064172.stderr: 7: (__libc_start_main()+0xf5) [0x74daec5]
2016-04-11T08:20:06.053 INFO:tasks.ceph.osd.0.target064172.stderr: 8: (()+0x2f8007) [0x400007]
2016-04-11T08:20:06.053 INFO:tasks.ceph.osd.0.target064172.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #42

Updated by Sage Weil almost 8 years ago

/a/teuthology-2016-05-16_22:00:02-rados-jewel-distro-basic-smithi/197309

Actions #43

Updated by Sage Weil about 7 years ago

  • Status changed from 12 to Can't reproduce