Bug #48468
ceph-osd crashes before coming back up again
% Done: 0%
Description
Hi hi,
I'm in trouble with 3 OSDs that have never been able to come back up in the cluster after I manually marked them "out".
custom ceph config:
osd_memory_target = 2147483648
bluestore_cache_autotune = true
When starting the OSD container, it consumes ~35 GB of memory and finally seg faults (for debugging purposes I've added 54 GB of swap on two Optane drives).
$ sudo docker stats --no-stream
CONTAINER ID   NAME                                              CPU %     MEM USAGE / LIMIT     MEM %    NET I/O   BLOCK I/O       PIDS
007c3908f0d6   ceph-e93d934c-903d-4a98-ae45-7cfcac568657-osd.2   282.36%   13.85GiB / 15.45GiB   89.64%   0B / 0B   11GB / 1.59MB   69

$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           15Gi        15Gi       133Mi       0.0Ki       179Mi        53Mi
Swap:          54Gi        17Gi        36Gi

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  6 19019576 129656    952 192492 5280 6532  6768  6742    4    8  1 12 81  5  0
Feel free to ask if you need more information or testing.
Kind regards
PS: here is the Segfault dump
debug    -23> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -22> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -21> 2020-12-04T20:16:19.092+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -20> 2020-12-04T20:16:19.304+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -19> 2020-12-04T20:16:19.820+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug    -18> 2020-12-04T20:16:20.060+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36742332416 unmapped: 671744 heap: 36743004160 old mem: 134217728 new mem: 134217728
debug    -17> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -16> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -15> 2020-12-04T20:16:20.136+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -14> 2020-12-04T20:16:21.064+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 36885151744 unmapped: 458752 heap: 36885610496 old mem: 134217728 new mem: 134217728
debug    -13> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 tick
debug    -12> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug    -11> 2020-12-04T20:16:21.148+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug    -10> 2020-12-04T20:16:21.268+0000 7fdbd7629700  5 bluestore.MempoolThread(0x55db2f9cba98) _resize_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 65725840 meta_alloc: 67108864 meta_used: 103623 data_alloc: 67108864 data_used: 0
debug     -9> 2020-12-04T20:16:21.532+0000 7fdbbe7fd700  5 osd.2 211841 heartbeat osd_stat(store_statfs(0x4a951d4000/0x40000000/0x7440000000, data 0x2a88fcef36/0x296ae28000, compress 0x0/0x0/0x0, omap 0x6064ae, meta 0x3f9f9b52), peers [] op hist [])
debug     -8> 2020-12-04T20:16:22.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37013979136 unmapped: 606208 heap: 37014585344 old mem: 134217728 new mem: 134217728
debug     -7> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -6> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -5> 2020-12-04T20:16:22.152+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug     -4> 2020-12-04T20:16:23.072+0000 7fdbd7629700  5 prioritycache tune_memory target: 2147483648 mapped: 37101379584 unmapped: 679936 heap: 37102059520 old mem: 134217728 new mem: 134217728
debug     -3> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 tick
debug     -2> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- start
debug     -1> 2020-12-04T20:16:23.196+0000 7fdbdf8af700 10 osd.2 211841 do_waiters -- finish
debug      0> 2020-12-04T20:16:23.676+0000 7fdbc3807700 -1 *** Caught signal (Aborted) **
 in thread 7fdbc3807700 thread_name:tp_osd_tp

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (()+0x12dd0) [0x7fdbe68a5dd0]
 2: (pthread_kill()+0x35) [0x7fdbe68a2a65]
 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x258) [0x55db2601ed48]
 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x262) [0x55db2601f392]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x7a3) [0x55db25a12973]
 6: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa4) [0x55db25a14634]
 7: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55db25c460c6]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55db25a074df]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55db26040224]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55db26042e84]
 11: (()+0x82de) [0x7fdbe689b2de]
 12: (clone()+0x43) [0x7fdbe55d2e83]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 0 client
  10/10 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7fdbbe7fd700 / osd_srv_heartbt
  7fdbbeffe700 / tp_osd_tp
  7fdbbf7ff700 / tp_osd_tp
  7fdbc0801700 / tp_osd_tp
  7fdbc1803700 / tp_osd_tp
  7fdbc2004700 / tp_osd_tp
  7fdbc2805700 / tp_osd_tp
  7fdbc3006700 / tp_osd_tp
  7fdbc3807700 / tp_osd_tp
  7fdbc4008700 / tp_osd_tp
  7fdbc4809700 / tp_osd_tp
  7fdbc500a700 / tp_osd_tp
  7fdbc580b700 / tp_osd_tp
  7fdbc600c700 / tp_osd_tp
  7fdbc680d700 / tp_osd_tp
  7fdbc700e700 / osd_srv_agent
  7fdbd0821700 / rocksdb:dump_st
  7fdbd161d700 / fn_anonymous
  7fdbd49fc700 / ms_dispatch
  7fdbd7629700 / bstore_mempool
  7fdbde03c700 / safe_timer
  7fdbdf8af700 / safe_timer
  7fdbe00b0700 / signal_handler
  7fdbe18b3700 / service
  7fdbe20b4700 / msgr-worker-2
  7fdbe28b5700 / msgr-worker-1
  7fdbe30b6700 / msgr-worker-0
  7fdbe8b32f40 / ceph-osd
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2020-12-04T20:16:23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 6 didn't terminate the process?
terminate called after throwing an instance of 'boost::wrapexcept<boost::bad_get>'
  what():  boost::bad_get: failed value get using boost::get
*** Caught signal (Segmentation fault) **
 in thread 7fdbd39fa700 thread_name:safe_timer
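For context, the prioritycache tune_memory entries above already show the failure mode: the autotuner's target is the configured 2 GiB, but the process heap is far past it. A quick arithmetic check of those numbers (Python, values copied from the "debug -18" entry above):

```python
# Values copied from the "prioritycache tune_memory" entry in the crash dump.
target = 2147483648   # osd_memory_target (bytes)
mapped = 36742332416  # heap mapped by the OSD process (bytes)

print(f"target:    {target / 2**30:.1f} GiB")   # 2.0 GiB
print(f"mapped:    {mapped / 2**30:.1f} GiB")   # 34.2 GiB
print(f"overshoot: {mapped / target:.1f}x the configured target")  # 17.1x
```

So the OSD is roughly 17x over its memory target, which matches the ~35 GB the container is observed to consume before the seg fault.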
Files
Updated by Clément Hampaï over 3 years ago
- File 2020-12-04T20_16_23.666863Z_7c804440-12b3-4398-a660-4ec9bd52dd57.tar.gz added
I'm adding the crash report as well
Updated by Igor Fedotov over 3 years ago
- Project changed from 18 to RADOS
I believe this isn't a ceph-deploy issue...
Updated by Clément Hampaï over 3 years ago
Igor Fedotov wrote:
I believe this isn't a ceph-deploy issue...
Probably not indeed, my bad
Updated by Igor Fedotov over 3 years ago
Just to mention - telemetry reports show multiple crashes inside HeartbeatMap::_check for different clusters.
Hence this might be worth a priority raise.
Updated by Clément Hampaï over 3 years ago
Igor Fedotov wrote:
Just to mention - telemetry reports show multiple crashes inside HeartbeatMap::_check for different clusters.
Hence this might be worth a priority raise.
Alright, thanks for mentioning it. I'm raising the issue.
Updated by Clément Hampaï over 3 years ago
Little update: I've tried with v15.2.8.
Sadly same behaviour.
Updated by Clément Hampaï over 3 years ago
- File 2020-12-26T13_17_37.106551Z_0c29f139-ade1-4a07-990a-11a4e1b270f9.tar.gz added
Tried with a systemd service instead of the docker container, same behaviour as well.
I've submitted the new crash-report.
Updated by Sage Weil about 3 years ago
- Status changed from New to Need More Info
Hi Clement,
Can you reproduce this with logs?
ceph config set osd.2 debug_osd 20     # or whichever osd(s)
ceph config set osd.2 log_to_file true
and ceph-post-file the resulting log in /var/log/ceph/<cluster-fsid>/?
Updated by Clément Hampaï almost 3 years ago
Hi Sage,
Hum, I've finally managed to recover my cluster after countless OSD restarts, until they somehow started up successfully.
The issue occurred after modifying a pool's PG count, which made the OSDs crash and left some of them unable to restart properly.
I'm currently unable to reproduce the issue, but I'll for sure keep you up to date if it occurs again.
Thanks a lot !
Updated by Neha Ojha over 2 years ago
- Priority changed from Urgent to Normal
Reducing priority for now.
Updated by Gonzalo Aguilar Delgado over 2 years ago
Hi I'm having the same problem.
-7> 2021-12-25T12:05:37.491+0100 7fd15c920640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fd15c920640' had suicide timed out after 1500.000000000s
-6> 2021-12-25T12:05:37.683+0100 7fd167936640 10 monclient: tick
-5> 2021-12-25T12:05:37.683+0100 7fd167936640 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2021-12-25T12:05:07.685922+0100)
-4> 2021-12-25T12:05:37.719+0100 7fd167936640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
-3> 2021-12-25T12:05:38.719+0100 7fd167936640 10 monclient: tick
-2> 2021-12-25T12:05:38.719+0100 7fd167936640 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2021-12-25T12:05:08.720135+0100)
-1> 2021-12-25T12:05:38.719+0100 7fd167936640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
0> 2021-12-25T12:05:39.295+0100 7fd15c920640 -1 *** Caught signal (Aborted) **
in thread 7fd15c920640 thread_name:tp_osd_tp
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x46520) [0x7fd1c4b56520]
2: pthread_kill()
3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x485) [0x559eb61883b5]
4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x74) [0x559eb6188c04]
5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x706) [0x559eb5b32586]
6: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xca) [0x559eb5b3447a]
7: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x5b) [0x559eb5d7bb9b]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9c3) [0x559eb5b25f73]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x434) [0x559eb61aa284]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x559eb61ac5b4]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x98927) [0x7fd1c4ba8927]
12: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 1 osd
0/ 5 optracker
0/ 5 objclass
10/10 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140537102464576 / tp_osd_tp
140537169573440 / tp_osd_tp
140537177966144 / tp_osd_tp
140537186358848 / tp_osd_tp
140537194751552 / tp_osd_tp
140537362605632 / safe_timer
140537379391040 / ms_dispatch
140538376353344 / tp_fstore_op
140538458064448 / fn_jrn_objstore
140538483242560 / filestore_sync
140538741188160 / msgr-worker-2
140538779006528 / io_context_pool
140538858620480 / io_context_pool
140538875405888 / admin_socket
140538883798592 / msgr-worker-1
140538892191296 / msgr-worker-0
140538922681024 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
But this is:
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
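Both backtraces abort inside ceph::HeartbeatMap::_check, which kills the process when a worker thread (here tp_osd_tp) has not reset its heartbeat within the suicide grace period (1500 s in the log above). A minimal conceptual sketch of that check in Python — a toy model for readers, not the actual Ceph implementation:

```python
import time

class HeartbeatHandle:
    """Toy model of ceph::heartbeat_handle_d: tracks when a worker last checked in."""
    def __init__(self, name, grace, suicide_grace):
        self.name = name
        self.grace = grace                  # warn threshold (seconds)
        self.suicide_grace = suicide_grace  # abort threshold (seconds)
        self.last_reset = time.monotonic()

    def reset_timeout(self):
        # Called by the worker thread (e.g. tp_osd_tp) whenever it makes progress.
        self.last_reset = time.monotonic()

def check(handle, now=None):
    """Return 'ok', 'timed out', or 'suicide' depending on how stale the handle is."""
    now = time.monotonic() if now is None else now
    age = now - handle.last_reset
    if age > handle.suicide_grace:
        return "suicide"     # the real code raises SIGABRT here, as in the backtrace
    if age > handle.grace:
        return "timed out"   # the real code only logs a health warning
    return "ok"

h = HeartbeatHandle("OSD::osd_op_tp", grace=15, suicide_grace=1500)
print(check(h, now=h.last_reset + 10))    # ok
print(check(h, now=h.last_reset + 100))   # timed out
print(check(h, now=h.last_reset + 2000))  # suicide
```

In this crash the thread is stuck swapping during peering (OSD::advance_pg in the backtrace), so it never resets its heartbeat and the suicide grace fires.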
Updated by Gonzalo Aguilar Delgado over 2 years ago
Clément Hampaï wrote:
Hi Sage,
Hum, I've finally managed to recover my cluster after countless OSD restarts, until they somehow started up successfully.
The issue occurred after modifying a pool's PG count, which made the OSDs crash and left some of them unable to restart properly.
I'm currently unable to reproduce the issue, but I'll for sure keep you up to date if it occurs again.
Thanks a lot !
Can you explain how you resolved it?
Updated by Clément Hampaï over 2 years ago
Hey Gonzalo,
It was some time ago, but from memory I created a huge swapfile (~50 GB) and restarted the OSDs multiple times, one after another, when they crashed.
After some restarts one failed OSD was able to boot properly again, then a second, and a third. At that point my unknown PGs were online again.
I hope this helps you fix your cluster too. Keep me up to date if you want.
KR,
Clément
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #53729: ceph-osd takes all memory before oom on boot added
Updated by Neha Ojha over 2 years ago
Gonzalo Aguilar Delgado wrote:
Hi I'm having the same problem.
-7> 2021-12-25T12:05:37.491+0100 7fd15c920640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fd15c920640' had suicide timed out after 1500.000000000s
[full log and crash dump quoted above]
But this is:
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
Can you provide the info Sage asked for in https://tracker.ceph.com/issues/48468#note-9, and also the output of "ceph daemon osd.X dump_mempools" for the problematic OSD?
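For anyone following along: dump_mempools returns JSON with per-pool "items" and "bytes" counters, and sorting the pools by bytes quickly shows what is eating memory. A sketch of that summarization (the sample JSON below is illustrative, in the shape of dump_mempools output, not data from the affected cluster):

```python
import json

# Illustrative fragment shaped like `ceph daemon osd.X dump_mempools` output;
# the pool names are real mempool names, but the numbers are made up.
sample = json.loads("""
{
  "mempool": {
    "by_pool": {
      "bluestore_cache_data": {"items": 1000, "bytes": 4096000},
      "bluestore_cache_onode": {"items": 50000, "bytes": 33600000},
      "osd_pglog": {"items": 9000000, "bytes": 2900000000},
      "buffer_anon": {"items": 1200, "bytes": 8500000}
    }
  }
}
""")

pools = sample["mempool"]["by_pool"]
# Sort pools by bytes, largest first, to spot the biggest consumer.
for name, stats in sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True):
    print(f"{name:>25}: {stats['bytes'] / 2**20:8.1f} MiB ({stats['items']} items)")
```

In the made-up sample, osd_pglog dominates, which would be consistent with a crash triggered by a PG-count change as reported earlier in this thread.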
Updated by Igor Fedotov over 2 years ago
@Neha, @Gonzalo - to avoid the mess, let's use https://tracker.ceph.com/issues/53729 for further communication on the issue.
Updated by Neha Ojha over 2 years ago
Igor Fedotov wrote:
@Neha, @Gonzalo - to avoid the mess, let's use https://tracker.ceph.com/issues/53729 for further communication on the issue.
sounds good