Bug #48468
ceph-osd crashes before coming back up again
% Done: 0%
Description
Hi hi,
I'm in trouble with 3 OSDs that have never been able to come back up in the cluster after I manually marked them "out".
custom ceph config:
osd_memory_target = 2147483648
bluestore_cache_autotune = true
When starting the OSD container, it consumes ~35 GB of memory and finally seg faults (for debugging purposes I've added 54 GB of swap on two Optane drives).
$ sudo docker stats --no-stream
CONTAINER ID   NAME                                              CPU %     MEM USAGE / LIMIT     MEM %    NET I/O   BLOCK I/O       PIDS
007c3908f0d6   ceph-e93d934c-903d-4a98-ae45-7cfcac568657-osd.2   282.36%   13.85GiB / 15.45GiB   89.64%   0B / 0B   11GB / 1.59MB   69

$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           15Gi        15Gi       133Mi       0.0Ki       179Mi        53Mi
Swap:          54Gi        17Gi        36Gi

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  6 19019576 129656    952 192492 5280 6532  6768  6742    4    8  1 12 81  5  0
Feel free to ask if you need more information or testing.
Kind regards
PS: here is the Segfault dump
debug    -23> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -22> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -21> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -20> 2020-12-04T20:16:19.304+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -19> 2020-12-04T20:16:19.820+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -18> 2020-12-04T20:16:20.060+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36742332416 unmapped: 671744 heap: 36743004160 old mem: 134217728 new mem: 134217728
debug    -17> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -16> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -15> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -14> 2020-12-04T20:16:21.064+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36885151744 unmapped: 458752 heap: 36885610496 old mem: 134217728 new mem: 134217728
debug    -13> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -12> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -11> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -10> 2020-12-04T20:16:21.268+0000 7fdbd7629700  5 bluestore.MempoolThread(0x55db2f9cba98) _resize_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 65725840 meta_alloc: 67108864 meta_used: 103623 data_alloc: 67108864 data_used: 0
debug     -9> 2020-12-04T20:16:21.532+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug     -8> 2020-12-04T20:16:22.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37013979136 unmapped: 606208 heap: 37014585344 old mem: 134217728 new mem: 134217728
debug     -7> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -6> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -5> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug     -4> 2020-12-04T20:16:23.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37101379584 unmapped: 679936 heap: 37102059520 old mem: 134217728 new mem: 134217728
debug     -3> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -2> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -1> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug      0> 2020-12-04T20:16:23.676+0000 7fdbc3807700 -1 *** Caught signal (Aborted) **
 in thread 7fdbc3807700 thread_name:tp_osd_tp

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (()+0x12dd0) [0x7fdbe68a5dd0]
 2: (pthread_kill()+0x35) [0x7fdbe68a2a65]
 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x258) [0x55db2601ed48]
 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x262) [0x55db2601f392]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x7a3) [0x55db25a12973]
 6: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa4) [0x55db25a14634]
 7: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55db25c460c6]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55db25a074df]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55db26040224]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55db26042e84]
 11: (()+0x82de) [0x7fdbe689b2de]
 12: (clone()+0x43) [0x7fdbe55d2e83]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 0 client
  10/10 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7fdbbe7fd700 / osd_srv_heartbt
  7fdbbeffe700 / tp_osd_tp
  7fdbbf7ff700 / tp_osd_tp
  7fdbc0801700 / tp_osd_tp
  7fdbc1803700 / tp_osd_tp
  7fdbc2004700 / tp_osd_tp
  7fdbc2805700 / tp_osd_tp
  7fdbc3006700 / tp_osd_tp
  7fdbc3807700 / tp_osd_tp
  7fdbc4008700 / tp_osd_tp
  7fdbc4809700 / tp_osd_tp
  7fdbc500a700 / tp_osd_tp
  7fdbc580b700 / tp_osd_tp
  7fdbc600c700 / tp_osd_tp
  7fdbc680d700 / tp_osd_tp
  7fdbc700e700 / osd_srv_agent
  7fdbd0821700 / rocksdb:dump_st
  7fdbd161d700 / fn_anonymous
  7fdbd49fc700 / ms_dispatch
  7fdbd7629700 / bstore_mempool
  7fdbde03c700 / safe_timer
  7fdbdf8af700 / safe_timer
  7fdbe00b0700 / signal_handler
  7fdbe18b3700 / service
  7fdbe20b4700 / msgr-worker-2
  7fdbe28b5700 / msgr-worker-1
  7fdbe30b6700 / msgr-worker-0
  7fdbe8b32f40 / ceph-osd
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2020-12-04T20:16:23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 6 didn't terminate the process?
terminate called after throwing an instance of 'boost::wrapexcept<boost::bad_get>'
  what():  boost::bad_get: failed value get using boost::get
*** Caught signal (Segmentation fault) **
 in thread 7fdbd39fa700 thread_name:safe_timer
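For context, the prioritycache tune_memory entries above already show the failure mode: the autotuner's target is the configured 2 GiB, but the process heap is far past it. A quick arithmetic check of those numbers (Python, values copied from the "debug -18" entry above):

```python
# Values copied from the "prioritycache tune_memory" entry in the crash dump.
target = 2147483648   # osd_memory_target (bytes)
mapped = 36742332416  # heap mapped by the OSD process (bytes)

print(f"target:    {target / 2**30:.1f} GiB")   # 2.0 GiB
print(f"mapped:    {mapped / 2**30:.1f} GiB")   # 34.2 GiB
print(f"overshoot: {mapped / target:.1f}x the configured target")  # 17.1x
```

So the OSD is roughly 17x over its memory target, which matches the ~35 GB the container is observed to consume before the seg fault.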
Files
Updated by Clément Hampaï over 3 years ago
- File 2020-12-04T20_16_23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57.tar.gz added
I'm adding the crash report as well
Updated by Igor Fedotov over 3 years ago
- Project changed from 18 to RADOS
I believe this isn't a ceph-deploy issue...
Updated by Clément Hampaï over 3 years ago
Igor Fedotov wrote:
I believe this isn't a ceph-deploy issue...
Probably not indeed, my bad
Updated by Igor Fedotov over 3 years ago
Just to mention - telemetry reports show multiple crashes inside HeartbeatMap::_check for different clusters.
Hence this might be worth a priority raise.
Updated by Clément Hampaï over 3 years ago
Igor Fedotov wrote:
Just to mention - telemetry reports show multiple crashes inside HeartbeatMap::_check for different clusters.
Hence this might be worth a priority raise.
Alright, thanks for mentioning it. I'm raising the issue.
Updated by Clément Hampaï over 3 years ago
Little update: I've tried with v15.2.8.
Sadly same behaviour.
Updated by Clément Hampaï over 3 years ago
- File 2020-12-26T13_17_37.106551Z_0c29f139-ade1-4a07-990a-11a4e1b270f9.tar.gz added
Tried with a systemd service instead of the docker container, same behaviour as well.
I've submitted the new crash-report.
Updated by Sage Weil about 3 years ago
- Status changed from New to Need More Info
Hi Clement,
Can you reproduce this with logs?
ceph config set osd.2 debug_osd 20     # or whichever osd(s)
ceph config set osd.2 log_to_file true
and ceph-post-file the resulting log in /var/log/ceph/<cluster-fsid>/?
Updated by Clément Hampaï almost 3 years ago
Hi Sage,
Hum, I've finally managed to recover my cluster after countless OSD restarts, until they somehow started up successfully.
The issue occurred after modifying a pool's PG count, which made the OSDs crash and left some of them unable to restart properly.
I'm currently unable to reproduce the issue, but I'll for sure keep you up to date if it occurs again.
Thanks a lot !
Updated by Neha Ojha over 2 years ago
- Priority changed from Urgent to Normal
Reducing priority for now.
Updated by Gonzalo Aguilar Delgado over 2 years ago
Hi I'm having the same problem.
-7> 2021-12-25T12:05:37.491+0100 7fd15c920640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fd15c920640' had suicide timed out after 1500.000000000s
-6> 2021-12-25T12:05:37.683+0100 7fd167936640 10 monclient: tick
-5> 2021-12-25T12:05:37.683+0100 7fd167936640 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2021-12-25T12:05:07.685922+0100)
-4> 2021-12-25T12:05:37.719+0100 7fd167936640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
-3> 2021-12-25T12:05:38.719+0100 7fd167936640 10 monclient: tick
-2> 2021-12-25T12:05:38.719+0100 7fd167936640 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2021-12-25T12:05:08.720135+0100)
-1> 2021-12-25T12:05:38.719+0100 7fd167936640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
0> 2021-12-25T12:05:39.295+0100 7fd15c920640 -1 *** Caught signal (Aborted) **
in thread 7fd15c920640 thread_name:tp_osd_tp
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x46520) [0x7fd1c4b56520]
2: pthread_kill()
3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x485) [0x559eb61883b5]
4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x74) [0x559eb6188c04]
5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x706) [0x559eb5b32586]
6: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xca) [0x559eb5b3447a]
7: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x5b) [0x559eb5d7bb9b]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9c3) [0x559eb5b25f73]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x434) [0x559eb61aa284]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x559eb61ac5b4]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x98927) [0x7fd1c4ba8927]
12: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 1 osd
0/ 5 optracker
0/ 5 objclass
10/10 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140537102464576 / tp_osd_tp
140537169573440 / tp_osd_tp
140537177966144 / tp_osd_tp
140537186358848 / tp_osd_tp
140537194751552 / tp_osd_tp
140537362605632 / safe_timer
140537379391040 / ms_dispatch
140538376353344 / tp_fstore_op
140538458064448 / fn_jrn_objstore
140538483242560 / filestore_sync
140538741188160 / msgr-worker-2
140538779006528 / io_context_pool
140538858620480 / io_context_pool
140538875405888 / admin_socket
140538883798592 / msgr-worker-1
140538892191296 / msgr-worker-0
140538922681024 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
But this is:
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
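Both backtraces abort inside ceph::HeartbeatMap::_check, which kills the process when a worker thread (here tp_osd_tp) has not reset its heartbeat within the suicide grace period (1500 s in the log above). A minimal conceptual sketch of that check in Python — a toy model for readers, not the actual Ceph implementation:

```python
import time

class HeartbeatHandle:
    """Toy model of ceph::heartbeat_handle_d: tracks when a worker last checked in."""
    def __init__(self, name, grace, suicide_grace):
        self.name = name
        self.grace = grace                  # warn threshold (seconds)
        self.suicide_grace = suicide_grace  # abort threshold (seconds)
        self.last_reset = time.monotonic()

    def reset_timeout(self):
        # Called by the worker thread (e.g. tp_osd_tp) whenever it makes progress.
        self.last_reset = time.monotonic()

def check(handle, now=None):
    """Return 'ok', 'timed out', or 'suicide' depending on how stale the handle is."""
    now = time.monotonic() if now is None else now
    age = now - handle.last_reset
    if age > handle.suicide_grace:
        return "suicide"     # the real code raises SIGABRT here, as in the backtrace
    if age > handle.grace:
        return "timed out"   # the real code only logs a health warning
    return "ok"

h = HeartbeatHandle("OSD::osd_op_tp", grace=15, suicide_grace=1500)
print(check(h, now=h.last_reset + 10))    # ok
print(check(h, now=h.last_reset + 100))   # timed out
print(check(h, now=h.last_reset + 2000))  # suicide
```

In this crash the thread is stuck swapping during peering (OSD::advance_pg in the backtrace), so it never resets its heartbeat and the suicide grace fires.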
Updated by Gonzalo Aguilar Delgado over 2 years ago
Clément Hampaï wrote:
Hi Sage,
Hum, I've finally managed to recover my cluster after countless OSD restarts, until they somehow started up successfully.
The issue occurred after modifying a pool's PG count, which made the OSDs crash and left some of them unable to restart properly.
I'm currently unable to reproduce the issue, but I'll for sure keep you up to date if it occurs again.
Thanks a lot !
Can you explain how you resolved it?
Updated by Clément Hampaï over 2 years ago
Hey Gonzalo,
It was some time ago, but from memory I created a huge swapfile (~50 GB) and restarted the OSDs multiple times, one after another, when they crashed.
After some restarts one failed OSD was able to boot properly again, then a second, and a third. At that point my unknown PGs were online again.
I hope this helps you fix your cluster too. Keep me up to date if you want.
KR,
Clément
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #53729: ceph-osd takes all memory before oom on boot added
Updated by Neha Ojha over 2 years ago
Gonzalo Aguilar Delgado wrote:
Hi I'm having the same problem.
-7> 2021-12-25T12:05:37.491+0100 7fd15c920640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fd15c920640' had suicide timed out after 1500.000000000s
[full log and crash dump quoted above]
But this is:
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
Can you provide the info Sage asked for in https://tracker.ceph.com/issues/48468#note-9, and also the output of "ceph daemon osd.X dump_mempools" for the problematic OSD?
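For anyone following along: dump_mempools returns JSON with per-pool "items" and "bytes" counters, and sorting the pools by bytes quickly shows what is eating memory. A sketch of that summarization (the sample JSON below is illustrative, in the shape of dump_mempools output, not data from the affected cluster):

```python
import json

# Illustrative fragment shaped like `ceph daemon osd.X dump_mempools` output;
# the pool names are real mempool names, but the numbers are made up.
sample = json.loads("""
{
  "mempool": {
    "by_pool": {
      "bluestore_cache_data": {"items": 1000, "bytes": 4096000},
      "bluestore_cache_onode": {"items": 50000, "bytes": 33600000},
      "osd_pglog": {"items": 9000000, "bytes": 2900000000},
      "buffer_anon": {"items": 1200, "bytes": 8500000}
    }
  }
}
""")

pools = sample["mempool"]["by_pool"]
# Sort pools by bytes, largest first, to spot the biggest consumer.
for name, stats in sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True):
    print(f"{name:>25}: {stats['bytes'] / 2**20:8.1f} MiB ({stats['items']} items)")
```

In the made-up sample, osd_pglog dominates, which would be consistent with a crash triggered by a PG-count change as reported earlier in this thread.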
Updated by Igor Fedotov over 2 years ago
@Neha, @Gonzalo - to avoid the mess, let's use https://tracker.ceph.com/issues/53729 for further communication on the issue.
Updated by Neha Ojha over 2 years ago
Igor Fedotov wrote:
@Neha, @Gonzalo - to avoid the mess, let's use https://tracker.ceph.com/issues/53729 for further communication on the issue.
sounds good