Feature #58158
OSD container's shutdown timeout should be raised (and configurable)
Description
OSD starts are very slow in Quincy v17.2.5. Disabling osd_fast_shutdown has no impact:
2022-12-04T09:22:31.147+0000 7f209515b3c0 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
History
#1 Updated by Igor Fedotov over 1 year ago
Could you please set osd_fast_shutdown to false and debug_bluestore to 20, then restart the OSD and share the shutdown/startup log?
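For reference, one way to apply these settings through the centralized config (a sketch; target a single daemon, e.g. osd.<id>, if you prefer):
ceph config set osd osd_fast_shutdown false
ceph config set osd debug_bluestore 20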
#2 Updated by imirc tw over 1 year ago
Igor Fedotov wrote:
Could you please set osd_fast_shutdown to false and debug_bluestore to 20, then restart the OSD and share the shutdown/startup log?
Which part are you interested in? It is pretty large :)
#3 Updated by Igor Fedotov over 1 year ago
Well, let it be -50,000 and +50,000 lines from the OSD restart point. Below is a sample log line that indicates such a point in Pacific:
2022-10-06T12:08:06.419+0000 7f134009d100 0 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), process ceph-osd, pid 5881
#4 Updated by imirc tw over 1 year ago
- File sample.part1.log.gz added
- File sample.part2.log.gz added
- File sample.part3.log.gz added
- File sample.part4.log.gz added
Needed to split it into several parts due to the upload limit; took -100k/+100k to be sure.
#5 Updated by Igor Fedotov over 1 year ago
I can see ~17 seconds missing in the log:
2022-12-05T09:15:31.918+0000 7f820508e700 20 bluestore.BufferSpace(0x555cf0ee27f8 in 0x555bf55e5b00) _clear
???
2022-12-05T09:15:48.126+0000 7efcc632d3c0 0 set uid:gid to 167:167 (ceph:ceph)
2022-12-05T09:15:48.126+0000 7efcc632d3c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process ceph-osd, pid 1
Is that the case in the original log, or is this just an improper/incomplete snippet?
Also, could you please re-collect the same log with debug_bluestore set to 10? The current one is too verbose, and the termination stage is filled entirely with data cache cleanup, e.g.:
2022-12-05T09:15:31.326+0000 7f820508e700 20 bluestore.sharedblob(0x555d6cf3fc70) put 0x555d6cf3fc70 removing self from set 0x555bf7269ed0
2022-12-05T09:15:31.326+0000 7f820508e700 20 bluestore.BufferSpace(0x555d6cf3fc88 in 0x555bf64c8400) _clear
#6 Updated by imirc tw over 1 year ago
I noticed the same, though I'm not sure why. I will create a new log with debug set to 10.
#7 Updated by imirc tw over 1 year ago
Again, a gap:
2022-12-05T14:23:33.437+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3f9a0 in 0x562deb647000) clear 5578
2022-12-05T14:23:33.569+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded3aaaa0 in 0x562deb674800) clear 4774
2022-12-05T14:23:33.684+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad0aa0 in 0x562deb65a000) clear 11370
2022-12-05T14:23:48.340+0000 7f896f4fc3c0 0 set uid:gid to 167:167 (ceph:ceph)
2022-12-05T14:23:48.340+0000 7f896f4fc3c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process ceph-osd, pid 1
I guess the important stuff happens inside the gap?
#8 Updated by imirc tw over 1 year ago
For the record, the shutdown log before the gap:
2022-12-05T14:23:24.924+0000 7efcaf895700 0 osd.8 2149519 got_stop_ack starting shutdown
2022-12-05T14:23:24.924+0000 7efcbdb19700 0 osd.8 2149519 prepare_to_stop starting shutdown
2022-12-05T14:23:24.924+0000 7efcbdb19700 0 osd.8 2149519 shutdown
2022-12-05T14:23:25.062+0000 7efcb02fe700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_submit_unlock osr 0x562dec03f600 1 ios pending
2022-12-05T14:23:25.062+0000 7efcb02fe700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_submit_unlock osr 0x562e098fc9a0 1 ios pending
2022-12-05T14:23:25.062+0000 7efcb02fe700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_submit_unlock osr 0x562deba13a20 1 ios pending
2022-12-05T14:23:25.062+0000 7efcb02fe700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_submit_unlock osr 0x562debcb3080 1 ios pending
2022-12-05T14:23:25.063+0000 7efcb4b07700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_aio_finish osr 0x562dec03f600
2022-12-05T14:23:25.063+0000 7efcb4b07700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_aio_finish osr 0x562e098fc9a0
2022-12-05T14:23:25.063+0000 7efcb4b07700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_aio_finish osr 0x562deba13a20
2022-12-05T14:23:25.063+0000 7efcb4b07700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_aio_finish osr 0x562debcb3080
2022-12-05T14:23:25.078+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) queue_transactions ch 0x562debad0b40 meta
2022-12-05T14:23:25.078+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _do_write_small 0x0~24e
2022-12-05T14:23:25.078+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _write meta #-1:7b3f43c4:::osd_superblock:0# 0x0~24e = 0
2022-12-05T14:23:25.078+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_calc_cost 0x562deba67b00 cost 670803 (1 ios * 670000 + 803 bytes)
2022-12-05T14:23:25.078+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562deba67b00 prepare
2022-12-05T14:23:25.078+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562deba67b00 io_done
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562deba67b00 kv_submitted
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562e33ba2f00 deferred_cleanup
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562e33ba2f00 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562ea4bdcc00 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562ebe966000 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562f4e227500 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562df21c5800 deferred_cleanup
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562df21c5800 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562e24d10c00 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562e24d11b00 deferred_cleanup
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562e24d11b00 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562ffa718900 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562e2ad4d200 deferred_cleanup
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562e2ad4d200 []
2022-12-05T14:23:25.118+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562e24bfb800 []
2022-12-05T14:23:25.719+0000 7efcb02fe700 5 bluestore.MempoolThread(0x562dea802b38) _resize_shards cache_size: 7408818583 kv_alloc: 1442840576 kv_used: 190418160 kv_onode_alloc: 1744830464 kv_onode_used: 1744797840 meta_alloc: 1577058304 meta_used: 1337675628 data_alloc: 805306368 data_used: 756465664
2022-12-05T14:23:27.028+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _osr_drain_all
2022-12-05T14:23:27.028+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_submit_unlock osr 0x562e098fcdc0 1 ios pending
2022-12-05T14:23:27.029+0000 7efcb4b07700 10 bluestore(/var/lib/ceph/osd/ceph-8) _deferred_aio_finish osr 0x562e098fcdc0
2022-12-05T14:23:27.043+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_state_proc txc 0x562deba67b00 deferred_cleanup
2022-12-05T14:23:27.043+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_release_alloc(sync) 0x562deba67b00 []
2022-12-05T14:23:27.043+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _osr_drain_all done
2022-12-05T14:23:27.043+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _txc_finish empty zombie osr 0x562e098fcdc0 already reaped
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) allocation stats probe 0: cnt: 823935 frags: 824198 size: 102560845824
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) probe -1: 0, 0, 0
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) probe -2: 0, 0, 0
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) probe -4: 0, 0, 0
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) probe -8: 0, 0, 0
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) probe -16: 0, 0, 0
2022-12-05T14:23:27.043+0000 7efcb02fe700 0 bluestore(/var/lib/ceph/osd/ceph-8) ------------
2022-12-05T14:23:27.044+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _kv_stop
2022-12-05T14:23:27.044+0000 7efcacaf7700 10 bluestore(/var/lib/ceph/osd/ceph-8) _kv_finalize_thread finish
2022-12-05T14:23:27.044+0000 7efcac2f6700 10 bluestore(/var/lib/ceph/osd/ceph-8) _kv_sync_thread finish
2022-12-05T14:23:27.045+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _kv_stop stopping finishers
2022-12-05T14:23:27.046+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _kv_stop stopped
2022-12-05T14:23:27.046+0000 7efcbdb19700 10 bluestore(/var/lib/ceph/osd/ceph-8) _shutdown_cache
2022-12-05T14:23:27.183+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad0c80 in 0x562deb534000) clear 644
2022-12-05T14:23:27.202+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad08c0 in 0x562deb536800) clear 1
2022-12-05T14:23:27.202+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad0500 in 0x562deb647000) clear 1
2022-12-05T14:23:27.202+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cd7c0 in 0x562deb66c800) clear 1143
2022-12-05T14:23:27.238+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cd220 in 0x562deb657000) clear 85
2022-12-05T14:23:27.241+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cd9a0 in 0x562deb63a000) clear 18025
2022-12-05T14:23:27.743+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cdb80 in 0x562deb634800) clear 231
2022-12-05T14:23:27.750+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cdd60 in 0x562deb534000) clear 498
2022-12-05T14:23:27.766+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece06500 in 0x562deb667000) clear 891
2022-12-05T14:23:27.796+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad1f40 in 0x562deb644800) clear 5
2022-12-05T14:23:27.796+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece068c0 in 0x562deb667000) clear 922
2022-12-05T14:23:27.826+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece07220 in 0x562deb64f000) clear 20019
2022-12-05T14:23:28.326+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece077c0 in 0x562deb67c800) clear 702
2022-12-05T14:23:28.347+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad15e0 in 0x562deb66c800) clear 964
2022-12-05T14:23:28.373+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded3aae60 in 0x562deb667000) clear 956
2022-12-05T14:23:28.400+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe0bf40 in 0x562deb647000) clear 615
2022-12-05T14:23:28.419+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9ccaa0 in 0x562deb652000) clear 295
2022-12-05T14:23:28.428+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deceb66e0 in 0x562deb65f000) clear 7739
2022-12-05T14:23:28.652+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece06aa0 in 0x562deb65a000) clear 182
2022-12-05T14:23:28.657+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deceb7040 in 0x562deb654800) clear 4422
2022-12-05T14:23:28.777+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deceb7b80 in 0x562deb654800) clear 4429
2022-12-05T14:23:28.892+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad1220 in 0x562deb657000) clear 18480
2022-12-05T14:23:29.301+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad1040 in 0x562deb64f000) clear 14
2022-12-05T14:23:29.301+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad0e60 in 0x562deb634800) clear 9251
2022-12-05T14:23:29.543+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad06e0 in 0x562deb677000) clear 1
2022-12-05T14:23:29.543+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35fd60 in 0x562deb63f000) clear 4646
2022-12-05T14:23:29.666+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35f5e0 in 0x562deb66f000) clear 3203
2022-12-05T14:23:29.753+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35f220 in 0x562deb662000) clear 9820
2022-12-05T14:23:29.988+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deceb6320 in 0x562deb65c800) clear 13499
2022-12-05T14:23:30.301+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35f040 in 0x562deb64a000) clear 4614
2022-12-05T14:23:30.412+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35ee60 in 0x562deb63c800) clear 4540
2022-12-05T14:23:30.525+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe0b400 in 0x562deb634800) clear 2456
2022-12-05T14:23:30.585+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece066e0 in 0x562deb677000) clear 953
2022-12-05T14:23:30.610+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece07400 in 0x562deb64c800) clear 3819
2022-12-05T14:23:30.702+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe0bd60 in 0x562deb67a000) clear 5725
2022-12-05T14:23:30.842+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35e320 in 0x562deb674800) clear 4805
2022-12-05T14:23:30.952+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cd5e0 in 0x562deb539000) clear 234
2022-12-05T14:23:30.959+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece079a0 in 0x562deb534000) clear 4454
2022-12-05T14:23:31.063+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3fd60 in 0x562deb66f000) clear 3264
2022-12-05T14:23:31.142+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece06e60 in 0x562deb634800) clear 2498
2022-12-05T14:23:31.204+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded3abb80 in 0x562deb65c800) clear 2056
2022-12-05T14:23:31.254+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35e140 in 0x562deb642000) clear 5931
2022-12-05T14:23:31.398+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cc500 in 0x562deb637000) clear 1713
2022-12-05T14:23:31.440+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded3ab220 in 0x562deb67c800) clear 6490
2022-12-05T14:23:31.589+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deceb79a0 in 0x562deb632000) clear 1413
2022-12-05T14:23:31.625+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35f9a0 in 0x562deb534000) clear 4591
2022-12-05T14:23:31.729+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece07f40 in 0x562deb539000) clear 12425
2022-12-05T14:23:32.029+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35e6e0 in 0x562deb63f000) clear 4732
2022-12-05T14:23:32.144+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3f400 in 0x562deb539000) clear 2023
2022-12-05T14:23:32.194+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece06140 in 0x562deb65f000) clear 913
2022-12-05T14:23:32.217+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35fb80 in 0x562deb63c800) clear 4629
2022-12-05T14:23:32.329+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3ff40 in 0x562deb65c800) clear 2132
2022-12-05T14:23:32.382+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cc140 in 0x562deb63c800) clear 482
2022-12-05T14:23:32.395+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cdf40 in 0x562deb64a000) clear 487
2022-12-05T14:23:32.408+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb6315e0 in 0x562deb65a000) clear 1607
2022-12-05T14:23:32.448+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35f7c0 in 0x562deb674800) clear 4653
2022-12-05T14:23:32.558+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded35f400 in 0x562deb539000) clear 2066
2022-12-05T14:23:32.609+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3ee60 in 0x562deb63a000) clear 338
2022-12-05T14:23:32.618+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad1d60 in 0x562deb536800) clear 489
2022-12-05T14:23:32.633+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3ec80 in 0x562deb637000) clear 1747
2022-12-05T14:23:32.677+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded3ab7c0 in 0x562deb672000) clear 532
2022-12-05T14:23:32.690+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece07d60 in 0x562deb654800) clear 494
2022-12-05T14:23:32.703+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece06320 in 0x562deb667000) clear 8437
2022-12-05T14:23:32.904+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3f5e0 in 0x562deb64a000) clear 4585
2022-12-05T14:23:33.011+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9cc6e0 in 0x562deb654800) clear 505
2022-12-05T14:23:33.025+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad1400 in 0x562deb534000) clear 4557
2022-12-05T14:23:33.131+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe0b5e0 in 0x562deb644800) clear 43
2022-12-05T14:23:33.132+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562dece07040 in 0x562deb637000) clear 1747
2022-12-05T14:23:33.175+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3e8c0 in 0x562deb66f000) clear 3258
2022-12-05T14:23:33.253+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3f220 in 0x562deb652000) clear 2696
2022-12-05T14:23:33.318+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb9ccc80 in 0x562deb672000) clear 68
2022-12-05T14:23:33.319+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562deb631400 in 0x562deb64a000) clear 4604
2022-12-05T14:23:33.425+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe0a320 in 0x562deb672000) clear 519
2022-12-05T14:23:33.437+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debe3f9a0 in 0x562deb647000) clear 5578
2022-12-05T14:23:33.569+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562ded3aaaa0 in 0x562deb674800) clear 4774
2022-12-05T14:23:33.684+0000 7efcbdb19700 10 bluestore.OnodeSpace(0x562debad0aa0 in 0x562deb65a000) clear 11370
2022-12-05T14:23:48.340+0000 7f896f4fc3c0 0 set uid:gid to 167:167 (ceph:ceph)
2022-12-05T14:23:48.340+0000 7f896f4fc3c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process ceph-osd, pid 1
Also, not sure whether it makes a difference, but these OSDs run in podman containers managed by cephadm.
#9 Updated by imirc tw over 1 year ago
I think this is container-related. I suspect the container has a default timeout of 10 seconds, after which the process is sent a SIGKILL. So this really is a cephadm/podman issue: the OSD process is killed too fast, before it has finished shutting down correctly.
#10 Updated by Igor Fedotov over 1 year ago
imirc tw wrote:
I think this is container-related. I suspect the container has a default timeout of 10 seconds, after which the process is sent a SIGKILL. So this really is a cephadm/podman issue: the OSD process is killed too fast, before it has finished shutting down correctly.
Yeah, looks like that to me as well... Can you tune this timeout up?
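For context: podman stop sends the container's stop signal (SIGTERM here) and escalates to SIGKILL once the grace period expires, and the default grace period is 10 seconds. Per the podman man pages, it can be raised per invocation or baked into the container at creation time, roughly like this (placeholder container id, elided flags):
podman stop -t 30 {containerid}
podman run --stop-signal=SIGTERM --stop-timeout=30 ...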
#11 Updated by imirc tw over 1 year ago
Yeah, that does the trick. I added a 30s timeout to unit.stop, which solves the issue:
for i in /var/lib/ceph/{uuid}/osd.*; do sed -i 's/podman stop/podman stop -t 30/g' $i/unit.stop; done
However, the proper fix would be to add the timeout to the unit.run command so that podman stop without arguments works as well (in case you run podman stop {containerid} manually for some reason):
for i in /var/lib/ceph/{uuid}/osd.*; do sed -i 's/--stop-signal=SIGTERM/--stop-signal=SIGTERM --stop-timeout=30/g' $i/unit.run; done
This probably needs an update in cephadm to add the stop-timeout to the unit.run file it generates for OSDs.
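To illustrate, after such a change the generated files would contain something along these lines (illustrative excerpts with elided flags and placeholder names, not the actual cephadm templates):
/usr/bin/podman run ... --stop-signal=SIGTERM --stop-timeout=30 ... {image}   # unit.run
/usr/bin/podman stop -t 30 {containerid}                                      # unit.stop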
#12 Updated by Igor Fedotov over 1 year ago
- Tracker changed from Bug to Feature
- Project changed from Ceph to Orchestrator
- Subject changed from Slow OSD starts due to NCB::__restore_allocator to OSD container's shutdown timeout should be raised (and configurable)
- Category deleted (OSD)
The existing 10-second default may cause a non-graceful OSD shutdown, which dramatically slows down the subsequent OSD restart.
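For anyone hitting this, the grace period actually in effect for a running container can be checked via podman inspect (placeholder container name; it should report 10 unless raised):
podman inspect --format '{{.Config.StopTimeout}}' {containerid}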
#13 Updated by Igor Fedotov over 1 year ago
Just FYI: I reassigned the ticket to Orchestrator project and reworded the subject.
imirc tw wrote:
Yeah, that does the trick. I added a 30s timeout to unit.stop, which solves the issue.
[...] However, the proper fix would be to add the timeout to the unit.run command so that podman stop without arguments works as well (in case you run podman stop {containerid} manually for some reason).
[...]
This probably needs an update in cephadm to add the stop-timeout to the unit.run file it generates for OSDs.
#14 Updated by imirc tw over 1 year ago
Thanks for your support!
#15 Updated by Redouane Kachach Elhichou about 1 year ago
- Assignee set to Redouane Kachach Elhichou
#16 Updated by Redouane Kachach Elhichou about 1 year ago
- Pull request ID set to 50655
#17 Updated by Redouane Kachach Elhichou about 1 year ago
- Status changed from New to In Progress
#19 Updated by Backport Bot 12 months ago
- Copied to Backport #59238: quincy: OSD container's shutdown timeout should be raised (and configurable) added
#20 Updated by Backport Bot 12 months ago
- Tags set to backport_processed