Bug #48468
ceph-osd crashes before coming back up
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Hi hi,
I'm having trouble with 3 OSDs that are never able to come back up in the cluster after having been manually marked "out".
Custom Ceph config:
osd_memory_target = 2147483648
bluestore_cache_autotune = true
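For reference, 2147483648 bytes is 2 GiB. Settings like these are normally applied through the monitors' central config database, roughly as follows (a sketch of the usual commands, not necessarily the exact ones used on this cluster):
$ ceph config set osd osd_memory_target 2147483648
$ ceph config set osd bluestore_cache_autotune true
$ ceph config get osd.2 osd_memory_target   # should report 2147483648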
When the OSD container starts, it consumes ~35 GB of memory and finally segfaults (for debugging purposes I added a 54 GB swap on two Optane drives).
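The swap was set up with the standard tools, roughly as below; the device paths are placeholders, not the actual partitions used:
$ sudo mkswap /dev/nvme0n1p1 && sudo swapon /dev/nvme0n1p1
$ sudo mkswap /dev/nvme1n1p1 && sudo swapon /dev/nvme1n1p1
$ swapon --show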
$ sudo docker stats --no-stream
CONTAINER ID   NAME                                              CPU %     MEM USAGE / LIMIT     MEM %    NET I/O   BLOCK I/O       PIDS
007c3908f0d6   ceph-e93d934c-903d-4a98-ae45-7cfcac568657-osd.2   282.36%   13.85GiB / 15.45GiB   89.64%   0B / 0B   11GB / 1.59MB   69

$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           15Gi        15Gi       133Mi       0.0Ki       179Mi        53Mi
Swap:          54Gi        17Gi        36Gi

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd   free  buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  6 19019576 129656   952 192492 5280 6532  6768  6742    4    8  1 12 81  5  0
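To confirm what the daemon itself thinks its target is, the running OSD can also be queried over its admin socket; something like the following (container name taken from the docker stats output above, the socket path inside the container may need adjusting):
$ sudo docker exec ceph-e93d934c-903d-4a98-ae45-7cfcac568657-osd.2 ceph daemon osd.2 config get osd_memory_target
$ sudo docker exec ceph-e93d934c-903d-4a98-ae45-7cfcac568657-osd.2 ceph daemon osd.2 dump_mempools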
Feel free to ask if you need more information or further testing.
Kind regards
PS: here is the segfault dump:
debug    -23> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -22> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -21> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -20> 2020-12-04T20:16:19.304+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -19> 2020-12-04T20:16:19.820+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -18> 2020-12-04T20:16:20.060+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36742332416 unmapped: 671744 heap: 36743004160 old mem: 134217728 new mem: 134217728
debug    -17> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -16> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -15> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -14> 2020-12-04T20:16:21.064+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36885151744 unmapped: 458752 heap: 36885610496 old mem: 134217728 new mem: 134217728
debug    -13> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -12> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -11> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -10> 2020-12-04T20:16:21.268+0000 7fdbd7629700  5 bluestore.MempoolThread(0x55db2f9cba98) _resize_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 65725840 meta_alloc: 67108864 meta_used: 103623 data_alloc: 67108864 data_used: 0
debug     -9> 2020-12-04T20:16:21.532+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug     -8> 2020-12-04T20:16:22.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37013979136 unmapped: 606208 heap: 37014585344 old mem: 134217728 new mem: 134217728
debug     -7> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -6> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -5> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug     -4> 2020-12-04T20:16:23.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37101379584 unmapped: 679936 heap: 37102059520 old mem: 134217728 new mem: 134217728
debug     -3> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -2> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -1> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug      0> 2020-12-04T20:16:23.676+0000 7fdbc3807700 -1 *** Caught signal (Aborted) **
 in thread 7fdbc3807700 thread_name:tp_osd_tp

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (()+0x12dd0) [0x7fdbe68a5dd0]
 2: (pthread_kill()+0x35) [0x7fdbe68a2a65]
 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x258) [0x55db2601ed48]
 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x262) [0x55db2601f392]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x7a3) [0x55db25a12973]
 6: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa4) [0x55db25a14634]
 7: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55db25c460c6]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55db25a074df]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55db26040224]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55db26042e84]
 11: (()+0x82de) [0x7fdbe689b2de]
 12: (clone()+0x43) [0x7fdbe55d2e83]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 0 client
  10/10 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7fdbbe7fd700 / osd_srv_heartbt
  7fdbbeffe700 / tp_osd_tp
  7fdbbf7ff700 / tp_osd_tp
  7fdbc0801700 / tp_osd_tp
  7fdbc1803700 / tp_osd_tp
  7fdbc2004700 / tp_osd_tp
  7fdbc2805700 / tp_osd_tp
  7fdbc3006700 / tp_osd_tp
  7fdbc3807700 / tp_osd_tp
  7fdbc4008700 / tp_osd_tp
  7fdbc4809700 / tp_osd_tp
  7fdbc500a700 / tp_osd_tp
  7fdbc580b700 / tp_osd_tp
  7fdbc600c700 / tp_osd_tp
  7fdbc680d700 / tp_osd_tp
  7fdbc700e700 / osd_srv_agent
  7fdbd0821700 / rocksdb:dump_st
  7fdbd161d700 / fn_anonymous
  7fdbd49fc700 / ms_dispatch
  7fdbd7629700 / bstore_mempool
  7fdbde03c700 / safe_timer
  7fdbdf8af700 / safe_timer
  7fdbe00b0700 / signal_handler
  7fdbe18b3700 / service
  7fdbe20b4700 / msgr-worker-2
  7fdbe28b5700 / msgr-worker-1
  7fdbe30b6700 / msgr-worker-0
  7fdbe8b32f40 / ceph-osd
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2020-12-04T20:16:23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 6 didn't terminate the process?
terminate called after throwing an instance of 'boost::wrapexcept<boost::bad_get>'
  what():  boost::bad_get: failed value get using boost::get
*** Caught signal (Segmentation fault) **
 in thread 7fdbd39fa700 thread_name:safe_timer
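The same report should also be retrievable through the crash module; the ID below is taken from the log_file path in the dump above:
$ ceph crash ls
$ ceph crash info 2020-12-04T20:16:23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57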