Documentation #58590
osd_op_thread_suicide_timeout is not documented
Description
There are plenty of references to the configuration option osd_op_thread_suicide_timeout,
but this option is not explained anywhere. Places that mention it include:
- Red Hat's (obsolete?) 2018 presentation "Optimizing Ceph Object Storage For Production in Multisite Clouds"
- CERN's (more obsolete?) tuning presentation.
- Debugging documentation at https://docs.ceph.com/en/latest/dev/developer_guide/debugging-gdb/
- Mailing lists etc.
I see that in Quincy the default value seems to be 150.
Every other config variable that I have come across is documented on docs.ceph.com, but this one is not, or at least a web search limited to site:docs.ceph.com
does not find it.
It would be nice to know what this variable does and why several tuning guides talk about increasing it, often while discussing how to reduce the effect of network problems.
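Until the option is documented, one way to see what a cluster itself reports is to ask the ceph CLI. The following is a minimal sketch, assuming a node with the ceph command and admin credentials; the helper name query_option is hypothetical and not part of Ceph. It uses the "ceph config help" and "ceph config get" subcommands and makes no claim about what they print on any particular release.

```shell
# Hypothetical helper: print the built-in help text and the currently
# configured value of a Ceph option for the "osd" section.
query_option() {
  opt="$1"
  # Bail out early when the ceph CLI is not installed on this host.
  if ! command -v ceph >/dev/null 2>&1; then
    echo "ceph CLI not found; run this on a node with cluster access" >&2
    return 127
  fi
  # Description, type, and default as compiled into the binaries.
  ceph config help "$opt"
  # Value the cluster configuration database holds for OSD daemons.
  ceph config get osd "$opt"
}

# Example: query_option osd_op_thread_suicide_timeout
```

The call is left as a comment so that the snippet can be sourced without contacting a cluster.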
History
#1 Updated by Zac Dover about 2 months ago
[zdover@fedora doc]$ grep -ir "osd_op_thread_suicide_timeout" *
[zdover@fedora doc]$ grep -ir "suicide" *
changelog/v0.67.10.txt: mon: Monitor: suicide on start if mon has been removed from monmap
changelog/v0.80.11.txt: make the all osd/filestore thread pool suicide timeouts separately configurable
changelog/v0.80.11.txt: OSD: add scrub_finalize_wq suicide timeout
changelog/v0.80.11.txt: OSD: add scrub_wq suicide timeout
changelog/v0.80.11.txt: OSD: add op_wq suicide timeout
changelog/v0.80.11.txt: OSD: add remove_wq suicide timeout
changelog/v0.80.11.txt: OSD: add snap_trim_wq suicide timeout
changelog/v0.80.11.txt: OSD: add recovery_wq suicide timeout
changelog/v0.80.11.txt: place OPTION
changelog/v0.80.11.txt: OSD: add command_wq suicide timeout
changelog/v0.80.11.txt: place OPTION
changelog/v0.87.1.txt: immediately after a call to suicide() completes, this needs
changelog/v0.94.10.txt: hammer: tests: OSDs commit suicide in rbd suite when testing on btrfs
changelog/v0.94.3.txt: make the all osd/filestore thread pool suicide timeouts separately configurable
changelog/v0.94.3.txt: OSD: add command_wq suicide timeout
changelog/v0.94.3.txt: OSD: add remove_wq suicide timeout
changelog/v0.94.3.txt: OSD: add scrub_wq suicide timeout
changelog/v0.94.3.txt: OSD: add snap_trim_wq suicide timeout
changelog/v0.94.3.txt: OSD: add recovery_wq suicide timeout
changelog/v0.94.3.txt: OSD: add op_wq suicide timeout
changelog/v0.94.4.txt: osd suicide timeout during peering - search for missing objects
changelog/v0.94.6.txt: config_opts: increase suicide timeout to 300 to match recovery
changelog/v0.94.6.txt: config_opts: increase suicide timeout to 300 to match recovery
changelog/v10.2.4.txt: For one user, this was causing OSD suicides when scrub ran because it
changelog/v12.2.2.txt: luminous: librbd: object map batch update might cause OSD suicide timeout
rados/configuration/filestore-config-ref.rst:``filestore_op_thread_suicide_timeout``
rados/troubleshooting/troubleshooting-osd.rst:If the daemon stopped because of a heartbeat failure or ``suicide timeout``,
radosgw/config-ref.rst:.. confval:: rgw_op_thread_suicide_timeout
[zdover@fedora doc]$
#2 Updated by Zac Dover about 2 months ago
https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ (from an email from Neha Ojha to Zac Dover)
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/ - the location where this variable should be documented, as far as I can tell (as of 29 Jan 2023 0555AEST)
#3 Updated by Zac Dover about 2 months ago
ceph/src/common/options/global.yaml.in is the file in which these variables are documented.
#4 Updated by Zac Dover about 2 months ago
#5 Updated by Zac Dover about 2 months ago
common/options/osd.yaml.in:- name: osd_op_thread_suicide_timeout
common/options/rgw.yaml.in:- name: rgw_op_thread_suicide_timeout
common/options/global.yaml.in:- name: osd_op_thread_suicide_timeout
common/options/global.yaml.in:- name: filestore_op_thread_suicide_timeout
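The grep output above only shows where the option is declared. To read the full entry, a small awk filter can print everything from the matching "- name:" line up to the next one. This is a sketch, assuming each option entry in the *.yaml.in files starts with "- name: " at column 0, as the grep output suggests; the function name show_option is hypothetical.

```shell
# Hypothetical helper: print the YAML entry for one option.
#   $1 = option name
#   $2 = path to an options file such as common/options/osd.yaml.in
show_option() {
  awk -v opt="$1" '
    # Each new "- name:" line starts an entry; keep printing only
    # while we are inside the entry whose name matches opt.
    /^- name: / { printing = ($3 == opt) }
    printing
  ' "$2"
}

# Example, run from the ceph/src directory of a source checkout:
#   show_option osd_op_thread_suicide_timeout common/options/osd.yaml.in
```

The printed entry is whatever description and default the source tree declares for the option.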