Bug #48468

ceph-osd crashes before coming back up

Added by Clément Hampaï over 3 years ago. Updated over 2 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi hi,

I'm having trouble with 3 OSDs that are never able to come back up in the cluster after I manually marked them "out".
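
For context, I took them out and brought them back with the standard commands (osd.2 shown as an example; a sketch of what I ran, not an exact transcript):

$ ceph osd out 2
$ ceph osd in 2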

custom ceph config:

osd_memory_target = 2147483648
bluestore_cache_autotune = true
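
(For reference, the equivalent runtime commands, assuming the settings live in the mon config database rather than in ceph.conf, would be:)

$ ceph config set osd osd_memory_target 2147483648
$ ceph config set osd bluestore_cache_autotune true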

When the OSD container starts, it consumes ~35 GB of memory and finally segfaults (for debugging purposes I've added 54 GB of swap on two Optane drives).
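
The swap was set up roughly like this (the device paths below are placeholders, not my actual partitions):

$ sudo mkswap /dev/nvme0n1p1   # hypothetical Optane partition 1
$ sudo swapon /dev/nvme0n1p1
$ sudo mkswap /dev/nvme1n1p1   # hypothetical Optane partition 2
$ sudo swapon /dev/nvme1n1p1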

$ sudo docker stats --no-stream
CONTAINER ID        NAME                                                          CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
007c3908f0d6        ceph-e93d934c-903d-4a98-ae45-7cfcac568657-osd.2               282.36%             13.85GiB / 15.45GiB   89.64%              0B / 0B             11GB / 1.59MB       69

$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           15Gi        15Gi       133Mi       0.0Ki       179Mi        53Mi
Swap:          54Gi        17Gi        36Gi

$ vmstat 
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  6 19019576 129656    952 192492 5280 6532  6768  6742    4    8  1 12 81  5  0

Feel free to ask if you need more information or testing.
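
If it helps, I can also capture the OSD's own memory accounting while it balloons, e.g. (assuming the admin socket is reachable from inside the container):

$ ceph daemon osd.2 dump_mempools
$ ceph tell osd.2 heap stats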

Kind regards

PS: here is the segfault dump. Note the prioritycache lines in it: the mapped heap climbs past 36 GB while the target stays at 2147483648 and the autotuner has already shrunk the cache to 134217728 bytes.

debug    -23> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -22> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -21> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -20> 2020-12-04T20:16:19.304+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -19> 2020-12-04T20:16:19.820+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -18> 2020-12-04T20:16:20.060+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36742332416 unmapped: 671744 heap: 36743004160 old mem: 134217728 new mem: 134217728
debug    -17> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -16> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -15> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -14> 2020-12-04T20:16:21.064+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36885151744 unmapped: 458752 heap: 36885610496 old mem: 134217728 new mem: 134217728
debug    -13> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -12> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -11> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -10> 2020-12-04T20:16:21.268+0000 7fdbd7629700  5 bluestore.MempoolThread(0x55db2f9cba98) _resize_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 65725840 meta_alloc: 67108864 meta_used: 103623 data_alloc: 67108864 data_used: 0
debug     -9> 2020-12-04T20:16:21.532+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug     -8> 2020-12-04T20:16:22.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37013979136 unmapped: 606208 heap: 37014585344 old mem: 134217728 new mem: 134217728
debug     -7> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -6> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -5> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug     -4> 2020-12-04T20:16:23.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37101379584 unmapped: 679936 heap: 37102059520 old mem: 134217728 new mem: 134217728
debug     -3> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -2> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -1> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug      0> 2020-12-04T20:16:23.676+0000 7fdbc3807700 -1 *** Caught signal (Aborted) **
 in thread 7fdbc3807700 thread_name:tp_osd_tp

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (()+0x12dd0) [0x7fdbe68a5dd0]
 2: (pthread_kill()+0x35) [0x7fdbe68a2a65]
 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x258) [0x55db2601ed48]
 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x262) [0x55db2601f392]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x7a3) [0x55db25a12973]
 6: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa4) [0x55db25a14634]
 7: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55db25c460c6]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55db25a074df]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55db26040224]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55db26042e84]
 11: (()+0x82de) [0x7fdbe689b2de]
 12: (clone()+0x43) [0x7fdbe55d2e83]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 0 client
  10/10 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7fdbbe7fd700 / osd_srv_heartbt
  7fdbbeffe700 / tp_osd_tp
  7fdbbf7ff700 / tp_osd_tp
  7fdbc0801700 / tp_osd_tp
  7fdbc1803700 / tp_osd_tp
  7fdbc2004700 / tp_osd_tp
  7fdbc2805700 / tp_osd_tp
  7fdbc3006700 / tp_osd_tp
  7fdbc3807700 / tp_osd_tp
  7fdbc4008700 / tp_osd_tp
  7fdbc4809700 / tp_osd_tp
  7fdbc500a700 / tp_osd_tp
  7fdbc580b700 / tp_osd_tp
  7fdbc600c700 / tp_osd_tp
  7fdbc680d700 / tp_osd_tp
  7fdbc700e700 / osd_srv_agent
  7fdbd0821700 / rocksdb:dump_st
  7fdbd161d700 / fn_anonymous
  7fdbd49fc700 / ms_dispatch
  7fdbd7629700 / bstore_mempool
  7fdbde03c700 / safe_timer
  7fdbdf8af700 / safe_timer
  7fdbe00b0700 / signal_handler
  7fdbe18b3700 / service
  7fdbe20b4700 / msgr-worker-2
  7fdbe28b5700 / msgr-worker-1
  7fdbe30b6700 / msgr-worker-0
  7fdbe8b32f40 / ceph-osd
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2020-12-04T20:16:23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 6 didn't terminate the process?
terminate called after throwing an instance of 'boost::wrapexcept<boost::bad_get>'
  what():  boost::bad_get: failed value get using boost::get
*** Caught signal (Segmentation fault) **
 in thread 7fdbd39fa700 thread_name:safe_timer


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #53729: ceph-osd takes all memory before oom on boot (Resolved, Nitzan Mordechai)
