Bug #21259

bluestore: segv in BlueStore::TwoQCache::_trim

Added by Henrik Korkuc 3 months ago. Updated about 1 month ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
Correctness/Safety
Target version:
-
Start date:
09/06/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
luminous
Needs Doc:
No
Component(RADOS):
BlueStore

Description

Sometimes an OSD segfaults:

     0> 2017-09-06 06:12:14.448894 7f9544649700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9544649700 thread_name:bstore_mempool

The cluster is currently backfilling, but I am not sure whether that is related, and at the moment I am unable to reproduce this.

OSD log attached.

ceph-osd.76.log.gz (262 KB) Henrik Korkuc, 09/06/2017 06:25 AM


Related issues

Related to RADOS - Bug #22039: bluestore: segv during unmount Resolved 11/03/2017

History

#1 Updated by Josh Durgin 3 months ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Correctness/Safety
  • Priority changed from Normal to Urgent
  • Component(RADOS) BlueStore added

Crash from the logs is:


 ceph version 12.2.0-6-gbcf8970 (bcf89707d6c3c3086f001d08c754095b9e7a85f1) luminous (rc)
 1: (()+0xa33dbc) [0xa1d31dcdbc]
 2: (()+0xf890) [0x7f9554120890]
 3: (BlueStore::TwoQCache::_trim(unsigned long, unsigned long)+0x327) [0xa1d309c6b7]
 4: (BlueStore::Cache::trim(unsigned long, float, float, float)+0x469) [0xa1d306d519]
 5: (BlueStore::MempoolThread::entry()+0x155) [0xa1d30739c5]
 6: (()+0x8064) [0x7f9554119064]
 7: (clone()+0x6d) [0x7f955320d62d]
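For context on what `_trim` is doing when it faults: BlueStore's TwoQCache keeps onodes on warm/hot queues and a background thread (`bstore_mempool`, the thread in the crash above) walks those queues to evict entries. The sketch below is an illustrative 2Q-style trim, not Ceph's actual code; the names (`TwoQSketch`, `warm_`, `hot_`) and policy are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Illustrative 2Q-style cache trim, loosely modeled on the shape of
// BlueStore::TwoQCache::_trim. All names here are hypothetical.
struct Entry {
  std::string key;
  bool hot;  // promoted to the hot queue on second access
};

class TwoQSketch {
  std::list<Entry> warm_;  // probationary queue (A1in-like)
  std::list<Entry> hot_;   // frequently-used queue (Am-like)
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;

public:
  void insert(const std::string& k) {
    auto it = index_.find(k);
    if (it == index_.end()) {
      warm_.push_front(Entry{k, false});
      index_[k] = warm_.begin();
    } else if (!it->second->hot) {
      // second touch: promote warm -> hot (splice keeps iterators valid)
      hot_.splice(hot_.begin(), warm_, it->second);
      it->second->hot = true;
    } else {
      hot_.splice(hot_.begin(), hot_, it->second);  // move to front of hot
    }
  }

  // Trim to at most max entries, evicting warm before hot. A trim like
  // this runs in a background thread; if entries are freed under its
  // feet without proper locking, the list walk dereferences dangling
  // pointers, which is the kind of failure this segv points at.
  void trim(std::size_t max) {
    while (size() > max) {
      std::list<Entry>* q = warm_.empty() ? &hot_ : &warm_;
      index_.erase(q->back().key);
      q->pop_back();
    }
  }

  std::size_t size() const { return warm_.size() + hot_.size(); }
};
```

In the real cache the trim loop runs under the cache lock; the crash suggests an entry reachable from the queues was already destroyed.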

There are no further bluestore log lines before the crash:

   -30> 2017-09-06 06:12:14.400572 7f955063c700  5 -- 10.30.0.7:6805/108707 >> 10.30.4.33:6833/51572 conn(0xa252e4e800 :6805 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4155 cs=7 l=0). rx osd.349 seq 45830 0xa25f5673c0 MOSDPGPush(1.d8s4 25646/15542 [PushOp(1:1b0776d3:::rbd_dat
a.2.6fa882ae8944a.00000000000d9edd:head, version: 6576'21852, data_included: [0~1048576], data_size: 1048576, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(1:1b0776d3:::rbd_data.2.6fa882ae8944a.00000000000d9edd:head@6576'21852, size: 4
194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_
complete:false, omap_recovered_to:, omap_complete:true, error:false))]) v3
   -29> 2017-09-06 06:12:14.400607 7f955063c700  1 -- 10.30.0.7:6805/108707 <== osd.349 10.30.4.33:6833/51572 45830 ==== MOSDPGPush(1.d8s4 25646/15542 [PushOp(1:1b0776d3:::rbd_data.2.6fa882ae8944a.00000000000d9edd:head, version: 6576'21852, data_included: [0~1048576], data_size: 104
8576, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(1:1b0776d3:::rbd_data.2.6fa882ae8944a.00000000000d9edd:head@6576'21852, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!fi
rst, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false))]) v3 ==== 1049660+0+0 (2554025136 
0 0) 0xa25f5673c0 con 0xa252e4e800
   -28> 2017-09-06 06:12:14.405897 7f953ce3a700  1 -- 10.30.0.7:6805/108707 --> 10.30.24.17:6805/14430 -- MOSDECSubOpReadReply(1.1accs0 25646/15544 ECSubReadReply(tid=349362, attrs_read=0)) v2 -- 0xa20a110000 con 0
   -27> 2017-09-06 06:12:14.407165 7f955063c700  5 -- 10.30.0.7:6820/108707 >> 10.164.4.2:0/81437 conn(0xa1e3ff9000 :6820 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=45 cs=1 l=1). rx osd.363 seq 37194 0xa251ae1c00 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.402077) v4
   -26> 2017-09-06 06:12:14.407199 7f955063c700  1 -- 10.30.0.7:6820/108707 <== osd.363 10.164.4.2:0/81437 37194 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.402077) v4 ==== 2004+0+0 (1125801279 0 0) 0xa251ae1c00 con 0xa1e3ff9000
   -25> 2017-09-06 06:12:14.407214 7f955063c700  1 -- 10.30.0.7:6820/108707 --> 10.164.4.2:0/81437 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.402077) v4 -- 0xa251b4c200 con 0
   -24> 2017-09-06 06:12:14.407321 7f9550e3d700  5 -- 10.30.0.7:6807/108707 >> 10.164.4.2:0/81437 conn(0xa2603e5800 :6807 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=82 cs=1 l=1). rx osd.363 seq 37194 0xa20285b800 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.402077) v4
   -23> 2017-09-06 06:12:14.407337 7f9550e3d700  1 -- 10.30.0.7:6807/108707 <== osd.363 10.164.4.2:0/81437 37194 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.402077) v4 ==== 2004+0+0 (1125801279 0 0) 0xa20285b800 con 0xa2603e5800
   -22> 2017-09-06 06:12:14.407346 7f9550e3d700  1 -- 10.30.0.7:6807/108707 --> 10.164.4.2:0/81437 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.402077) v4 -- 0xa23e3c8000 con 0
   -21> 2017-09-06 06:12:14.414556 7f955063c700  5 -- 10.30.0.7:6805/108707 >> 10.30.10.5:6837/48854 conn(0xa2527fb000 :6805 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7035 cs=25 l=0). rx osd.394 seq 22954 0xa240e76a00 pg_backfill(progress 1.1654s2 e 25646/25646 lb 1:2a6c8150
:::rbd_data.2.6fae12ae8944a.00000000001ca50f:26) v3
   -20> 2017-09-06 06:12:14.414590 7f955063c700  1 -- 10.30.0.7:6805/108707 <== osd.394 10.30.10.5:6837/48854 22954 ==== pg_backfill(progress 1.1654s2 e 25646/25646 lb 1:2a6c8150:::rbd_data.2.6fae12ae8944a.00000000001ca50f:26) v3 ==== 979+0+0 (316524789 0 0) 0xa240e76a00 con 0xa2527
fb000
   -19> 2017-09-06 06:12:14.419852 7f955063c700  5 -- 10.30.0.7:6820/108707 >> 10.164.19.2:0/25733 conn(0xa1e78ef000 :6820 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=340 cs=1 l=1). rx osd.367 seq 37551 0xa251ae1c00 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.414651) v4
   -18> 2017-09-06 06:12:14.419875 7f955063c700  1 -- 10.30.0.7:6820/108707 <== osd.367 10.164.19.2:0/25733 37551 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.414651) v4 ==== 2004+0+0 (1640435177 0 0) 0xa251ae1c00 con 0xa1e78ef000
   -17> 2017-09-06 06:12:14.419888 7f955063c700  1 -- 10.30.0.7:6820/108707 --> 10.164.19.2:0/25733 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.414651) v4 -- 0xa251b4c200 con 0
   -16> 2017-09-06 06:12:14.419977 7f955163e700  5 -- 10.30.0.7:6807/108707 >> 10.164.19.2:0/25733 conn(0xa207c6b800 :6807 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=340 cs=1 l=1). rx osd.367 seq 37551 0xa251c96a00 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.414651) v4
   -15> 2017-09-06 06:12:14.420005 7f955063c700  5 -- 10.30.0.7:6820/108707 >> 10.30.10.5:0/48854 conn(0xa2522e6800 :6820 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=249 cs=1 l=1). rx osd.394 seq 37522 0xa251ae5400 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.413955) v4
   -14> 2017-09-06 06:12:14.420018 7f955063c700  1 -- 10.30.0.7:6820/108707 <== osd.394 10.30.10.5:0/48854 37522 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.413955) v4 ==== 2004+0+0 (3857141869 0 0) 0xa251ae5400 con 0xa2522e6800
   -13> 2017-09-06 06:12:14.420017 7f955163e700  1 -- 10.30.0.7:6807/108707 <== osd.367 10.164.19.2:0/25733 37551 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.414651) v4 ==== 2004+0+0 (1640435177 0 0) 0xa251c96a00 con 0xa207c6b800
   -12> 2017-09-06 06:12:14.420027 7f955063c700  1 -- 10.30.0.7:6820/108707 --> 10.30.10.5:0/48854 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.413955) v4 -- 0xa251b4c200 con 0
   -11> 2017-09-06 06:12:14.420071 7f955163e700  1 -- 10.30.0.7:6807/108707 --> 10.164.19.2:0/25733 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.414651) v4 -- 0xa1e3637000 con 0
   -10> 2017-09-06 06:12:14.420181 7f955163e700  5 -- 10.30.0.7:6807/108707 >> 10.30.10.5:0/48854 conn(0xa2522e5000 :6807 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=249 cs=1 l=1). rx osd.394 seq 37522 0xa251b09000 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.413955) v4
    -9> 2017-09-06 06:12:14.420203 7f955163e700  1 -- 10.30.0.7:6807/108707 <== osd.394 10.30.10.5:0/48854 37522 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.413955) v4 ==== 2004+0+0 (3857141869 0 0) 0xa251b09000 con 0xa2522e5000
    -8> 2017-09-06 06:12:14.420221 7f955163e700  1 -- 10.30.0.7:6807/108707 --> 10.30.10.5:0/48854 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.413955) v4 -- 0xa1e3637000 con 0
    -7> 2017-09-06 06:12:14.420826 7f953be38700  1 -- 10.30.0.7:6805/108707 --> 10.30.37.34:6845/58057 -- MOSDECSubOpReadReply(1.1e70s0 25646/16413 ECSubReadReply(tid=515210, attrs_read=0)) v2 -- 0xa23dac7700 con 0
    -6> 2017-09-06 06:12:14.423508 7f955163e700  5 -- 10.30.0.7:6807/108707 >> 10.30.12.66:0/81266 conn(0xa252457000 :6807 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=76 cs=1 l=1). rx osd.3 seq 37375 0xa251c96a00 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.418873) v4
    -5> 2017-09-06 06:12:14.423526 7f955163e700  1 -- 10.30.0.7:6807/108707 <== osd.3 10.30.12.66:0/81266 37375 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.418873) v4 ==== 2004+0+0 (350090004 0 0) 0xa251c96a00 con 0xa252457000
    -4> 2017-09-06 06:12:14.423538 7f955163e700  1 -- 10.30.0.7:6807/108707 --> 10.30.12.66:0/81266 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.418873) v4 -- 0xa1e3637000 con 0
    -3> 2017-09-06 06:12:14.423568 7f9550e3d700  5 -- 10.30.0.7:6820/108707 >> 10.30.12.66:0/81266 conn(0xa2525a9000 :6820 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=38 cs=1 l=1). rx osd.3 seq 37375 0xa253f8c000 osd_ping(ping e25646 stamp 2017-09-06 06:12:14.418873) v4
    -2> 2017-09-06 06:12:14.423599 7f9550e3d700  1 -- 10.30.0.7:6820/108707 <== osd.3 10.30.12.66:0/81266 37375 ==== osd_ping(ping e25646 stamp 2017-09-06 06:12:14.418873) v4 ==== 2004+0+0 (350090004 0 0) 0xa253f8c000 con 0xa2525a9000
    -1> 2017-09-06 06:12:14.423620 7f9550e3d700  1 -- 10.30.0.7:6820/108707 --> 10.30.12.66:0/81266 -- osd_ping(ping_reply e25646 stamp 2017-09-06 06:12:14.418873) v4 -- 0xa23e3c8000 con 0
     0> 2017-09-06 06:12:14.448894 7f9544649700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9544649700 thread_name:bstore_mempool

#2 Updated by Mike Lowe 3 months ago

I have also hit this segfault on 5 of 244 OSDs in the past 24 hours. It has not been a major problem yet, as systemd kicks them back online in relatively short order.

#3 Updated by Henrik Korkuc 3 months ago

I thought that all the needed information was between "begin dump of recent events" and "end dump of recent events", so I trimmed the log to just those lines.

I have now posted a log covering a full day of that OSD; it contains one abort and two "FAILED assert(p.second->shared_blob_set.empty())" asserts.

ceph-post-file: 562cbecf-8027-431e-a944-9932a086721b

#4 Updated by Sage Weil 3 months ago

I've pushed a branch, wip-21259, to shaman. Packages should appear at shaman.ceph.com in an hour or so. They add some debug output during bluestore shutdown that should help track this down. Do you mind giving it a try? Thanks!

#5 Updated by Henrik Korkuc 3 months ago

It looks like the build failed: https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=trusty,DIST=trusty,MACHINE_SIZE=huge/6101/console

/build/ceph-12.2.0-47-g152806d/src/os/ObjectStore.h: In member function 'void BlueStore::OnodeSpace::dump(int)':
/build/ceph-12.2.0-47-g152806d/src/os/ObjectStore.h:73:16: error: invalid use of non-static data member 'ObjectStore::cct'
   CephContext* cct;
                ^
In file included from /build/ceph-12.2.0-47-g152806d/src/include/Context.h:19:0,
                 from /build/ceph-12.2.0-47-g152806d/src/common/Cond.h:19,
                 from /build/ceph-12.2.0-47-g152806d/src/common/Finisher.h:19,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.h:36,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:22:
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:36:22: error: from this location
 #define dout_context cct
                      ^
/build/ceph-12.2.0-47-g152806d/src/common/dout.h:51:7: note: in definition of macro 'dout_impl'
   if (cct->_conf->subsys.should_gather(sub, v)) {   \
       ^
/build/ceph-12.2.0-47-g152806d/src/common/debug.h:23:17: note: in expansion of macro 'ldout'
 #define dout(v) ldout((dout_context), v)
                 ^
/build/ceph-12.2.0-47-g152806d/src/common/debug.h:23:24: note: in expansion of macro 'dout_context'
 #define dout(v) ldout((dout_context), v)
                        ^
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:1621:5: note: in expansion of macro 'dout'
     dout(lvl) << i.first << " : " << *i.second << dendl;
     ^
In file included from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.h:39:0,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:22:
/build/ceph-12.2.0-47-g152806d/src/os/ObjectStore.h:73:16: error: invalid use of non-static data member 'ObjectStore::cct'
   CephContext* cct;
                ^
Scanning dependencies of target librbd
In file included from /build/ceph-12.2.0-47-g152806d/src/include/Context.h:19:0,
                 from /build/ceph-12.2.0-47-g152806d/src/common/Cond.h:19,
                 from /build/ceph-12.2.0-47-g152806d/src/common/Finisher.h:19,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.h:36,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:22:
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:36:22: error: from this location
 #define dout_context cct
                      ^
/build/ceph-12.2.0-47-g152806d/src/common/dout.h:56:37: note: in definition of macro 'dout_impl'
     ceph::logging::Entry *_dout_e = cct->_log->create_entry(v, sub, &_log_exp_length); \
                                     ^
/build/ceph-12.2.0-47-g152806d/src/common/debug.h:23:17: note: in expansion of macro 'ldout'
 #define dout(v) ldout((dout_context), v)
                 ^
/build/ceph-12.2.0-47-g152806d/src/common/debug.h:23:24: note: in expansion of macro 'dout_context'
 #define dout(v) ldout((dout_context), v)
                        ^
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:1621:5: note: in expansion of macro 'dout'
     dout(lvl) << i.first << " : " << *i.second << dendl;
     ^
In file included from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.h:39:0,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:22:
/build/ceph-12.2.0-47-g152806d/src/os/ObjectStore.h:73:16: error: invalid use of non-static data member 'ObjectStore::cct'
   CephContext* cct;
                ^
In file included from /build/ceph-12.2.0-47-g152806d/src/include/Context.h:19:0,
                 from /build/ceph-12.2.0-47-g152806d/src/common/Cond.h:19,
                 from /build/ceph-12.2.0-47-g152806d/src/common/Finisher.h:19,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.h:36,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:22:
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:36:22: error: from this location
 #define dout_context cct
                      ^
/build/ceph-12.2.0-47-g152806d/src/common/dout.h:61:22: note: in definition of macro 'dout_impl'
     auto _dout_cct = cct;      \
                      ^
/build/ceph-12.2.0-47-g152806d/src/common/debug.h:23:17: note: in expansion of macro 'ldout'
 #define dout(v) ldout((dout_context), v)
                 ^
/build/ceph-12.2.0-47-g152806d/src/common/debug.h:23:24: note: in expansion of macro 'dout_context'
 #define dout(v) ldout((dout_context), v)
                        ^
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:1621:5: note: in expansion of macro 'dout'
     dout(lvl) << i.first << " : " << *i.second << dendl;
     ^
make[4]: Leaving directory `/build/ceph-12.2.0-47-g152806d/obj-x86_64-linux-gnu'
make -f src/librbd/CMakeFiles/librbd.dir/build.make src/librbd/CMakeFiles/librbd.dir/build
/build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:1621:26: error: cannot bind 'std::basic_ostream<char>' lvalue to 'std::basic_ostream<char>&&'
     dout(lvl) << i.first << " : " << *i.second << dendl;
                          ^
In file included from /usr/include/c++/4.8/iostream:39:0,
                 from /build/ceph-12.2.0-47-g152806d/src/include/cpp-btree/btree.h:109,
                 from /build/ceph-12.2.0-47-g152806d/src/include/cpp-btree/btree_set.h:27,
                 from /build/ceph-12.2.0-47-g152806d/src/os/bluestore/BlueStore.cc:20:
/usr/include/c++/4.8/ostream:602:5: error:   initializing argument 1 of 'std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&&, const _Tp&) [with _CharT = char; _Traits = std::char_traits<char>; _Tp = BlueStore::Onode]'
     operator<<(basic_ostream<_CharT, _Traits>&& __os, const _Tp& __x)
     ^

#6 Updated by Henrik Korkuc 3 months ago

By the way, I am running on Debian Jessie; it looks like shaman does not build for it, so I am going to build it myself. I also saw multiple fixes pushed to the luminous branch; maybe it is worth building those too?

#7 Updated by Sage Weil 3 months ago

Sorry, I missed your earlier message. Repushed the branch; should build now. It also includes the other bluestore fixes queued up for 12.2.1. Can you give it a try and see if you can reproduce the crash on shutdown? Thanks!

#8 Updated by Henrik Korkuc 3 months ago

I upgraded the cluster to 12.2.0-178-gba746cd (ba746cd14ddd70a4f24a734f83ff9d276dd327d1) last week (to mitigate the aio submission deadlock). I have not experienced any segfaults since.

#9 Updated by Josh Durgin 3 months ago

  • Status changed from New to Resolved

#10 Updated by Kefu Chai 2 months ago

  • Status changed from Resolved to New

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40646.html

12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) still has this issue.

#11 Updated by Sage Weil 2 months ago

https://github.com/ceph/ceph/pull/18103 is the luminous backport; I will ping you on this ticket once it is merged.

You'll need to enable 'osd debug shutdown = true' in the [osd] section of ceph.conf to crank up debugging.
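For reference, that setting goes in the OSD section of ceph.conf; a minimal fragment (option name exactly as given above, everything else illustrative):

```ini
[osd]
# extra bluestore debug output during shutdown, to help track this segv
osd debug shutdown = true
```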

#12 Updated by Sage Weil 2 months ago

  • Status changed from New to Need More Info

#13 Updated by Sage Weil 2 months ago

  • Subject changed from Ocasionally Bluestore OSDs segfault to bluestore: segv in BlueStore::TwoQCache::_trim

#14 Updated by Sage Weil 2 months ago

OK, the patch is merged to the latest luminous branch, which you can install from https://shaman.ceph.com/.

#15 Updated by Patrick Donnelly about 2 months ago

Crash here:

2017-10-24T23:36:30.284 INFO:tasks.ceph.osd.1.smithi001.stderr:2017-10-24 23:36:30.264 34c57700 -1 received  signal: Terminated from  PID: 20208 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.1.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 1  UID: 0
2017-10-24T23:36:30.285 INFO:tasks.ceph.osd.1.smithi001.stderr:2017-10-24 23:36:30.268 34c57700 -1 osd.1 38 *** Got signal Terminated ***
2017-10-24T23:36:30.589 INFO:tasks.ceph.osd.1.smithi001.stderr:2017-10-24 23:36:30.535 34c57700 -1 osd.1 38 shutdown
2017-10-24T23:36:33.463 INFO:tasks.ceph.osd.1.smithi001.stderr:*** Caught signal (Segmentation fault) **
2017-10-24T23:36:33.463 INFO:tasks.ceph.osd.1.smithi001.stderr: in thread 34c57700 thread_name:signal_handler
2017-10-24T23:36:33.516 INFO:tasks.ceph.osd.1.smithi001.stderr: ceph version 13.0.0-2328-gfc31752 (fc317524b8a24874a4a57c192e5da46ea11d593f) mimic (dev)
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 1: (()+0xa8f831) [0xb97831]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 2: (()+0xf5e0) [0xc57c5e0]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 3: (operator<<(std::ostream&, bluestore_shared_blob_t const&)+0xc) [0xad0e0c]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 4: (operator<<(std::ostream&, BlueStore::SharedBlob const&)+0xba) [0xa367aa]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 5: (BlueStore::SharedBlobSet::dump(CephContext*, int)+0x16e) [0xa3695e]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 6: (BlueStore::_flush_cache()+0x1ae) [0xa63e8e]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 7: (BlueStore::umount()+0x1e6) [0xa644c6]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 8: (OSD::shutdown()+0x1685) [0x690125]
2017-10-24T23:36:33.517 INFO:tasks.ceph.osd.1.smithi001.stderr: 9: (OSD::handle_signal(int)+0x11f) [0x69092f]
2017-10-24T23:36:33.518 INFO:tasks.ceph.osd.1.smithi001.stderr: 10: (SignalHandler::entry()+0x1d7) [0xb98b97]
2017-10-24T23:36:33.518 INFO:tasks.ceph.osd.1.smithi001.stderr: 11: (()+0x7e25) [0xc574e25]
2017-10-24T23:36:33.518 INFO:tasks.ceph.osd.1.smithi001.stderr: 12: (clone()+0x6d) [0xd4bb34d]
2017-10-24T23:36:33.558 INFO:tasks.ceph.osd.1.smithi001.stderr:2017-10-24 23:36:33.433 34c57700 -1 bluestore(/var/lib/ceph/osd/ceph-1) _flush_cache stray shared blobs on 3.4s0_head
2017-10-24T23:36:33.559 INFO:tasks.ceph.osd.1.smithi001.stderr:2017-10-24 23:36:33.504 34c57700 -1 *** Caught signal (Segmentation fault) **

From: /ceph/teuthology-archive/pdonnell-2017-10-24_21:18:53-multimds-wip-pdonnell-testing-201710240526-testing-basic-smithi/1771028/teuthology.log

Very reproducible. Same crash in many other jobs.

Sage believes this crash reproduces this bug.

17:08:00            @sage | can probably put it under http://tracker.ceph.com/issues/21259, the bug it was added to track down.
17:08:41            @sage | (the fact that the debug dump is crashing means you're probably reproducing that bug, which is good :)

#16 Updated by Sage Weil about 1 month ago

  • Related to Bug #22039: bluestore: segv during unmount added

#17 Updated by Sage Weil about 1 month ago

  • Status changed from Need More Info to Duplicate
