Documentation #58590: osd_op_thread_suicide_timeout is not documented - RADOS - Ceph

Documentation #58590

open

osd_op_thread_suicide_timeout is not documented

Added by Voja Molani about 1 year ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: Zac Dover
Category: Documentation
Target version:
% Done:
Spent time: 3:00 h
Tags:
Backport:
Reviewed:
Affected Versions: Ceph - v17.2.5
Pull request ID:

Description

There are plenty of references to the configuration option osd_op_thread_suicide_timeout, but this config variable is not explained anywhere.

For example, these refer to the variable:

Red Hat's (obsolete?) 2018 presentation "Optimizing Ceph Object Storage For Production in Multisite Clouds"
CERN's (more obsolete?) tuning presentation.
Debugging documentation at https://docs.ceph.com/en/latest/dev/developer_guide/debugging-gdb/
Mailing lists etc.

I see that in Quincy the default value seems to be 150.
Every other config variable that I have come across has been documented on docs.ceph.com, but this one is not, or at least a web search limited to site:docs.ceph.com does not find it.

It would be nice to know what this variable does and why several tuning guides talk about increasing it, often in the context of reducing the effect of network problems.
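
Until the option is documented, a minimal sketch of how one might inspect it on a running cluster, assuming the `ceph` CLI and suitable credentials are available. `ceph config help` and `ceph config get` are existing subcommands; the wrapper name `show_osd_option` is hypothetical and only for illustration. No output is shown because it depends on the release and the cluster.

```shell
# Hypothetical helper (not part of Ceph): print what the cluster reports
# about an OSD option. Assumes a reachable cluster and a working `ceph` CLI.
show_osd_option() {
    opt="$1"
    # Built-in help text for the option, as compiled into this release
    ceph config help "$opt"
    # Value reported by the cluster configuration for the osd section
    ceph config get osd "$opt"
}

# Example call (commented out so the snippet can be sourced safely):
# show_osd_option osd_op_thread_suicide_timeout
```

This only reports what the release itself says about the option; it does not explain when changing it is appropriate.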

Updated by Zac Dover about 1 year ago

[zdover@fedora doc]$ grep -ir "osd_op_thread_suicide_timeout" *
[zdover@fedora doc]$ grep -ir "suicide" *
changelog/v0.67.10.txt: mon: Monitor: suicide on start if mon has been removed from monmap
changelog/v0.80.11.txt: make the all osd/filestore thread pool suicide timeouts separately configurable
changelog/v0.80.11.txt: OSD: add scrub_finalize_wq suicide timeout
changelog/v0.80.11.txt: OSD: add scrub_wq suicide timeout
changelog/v0.80.11.txt: OSD: add op_wq suicide timeout
changelog/v0.80.11.txt: OSD: add remove_wq suicide timeout
changelog/v0.80.11.txt: OSD: add snap_trim_wq suicide timeout
changelog/v0.80.11.txt: OSD: add recovery_wq suicide timeout
changelog/v0.80.11.txt: place OPTION
changelog/v0.80.11.txt: OSD: add command_wq suicide timeout
changelog/v0.80.11.txt: place OPTION
changelog/v0.87.1.txt: immediately after a call to suicide() completes, this needs
changelog/v0.94.10.txt: hammer: tests: OSDs commit suicide in rbd suite when testing on btrfs
changelog/v0.94.3.txt: make the all osd/filestore thread pool suicide timeouts separately configurable
changelog/v0.94.3.txt: OSD: add command_wq suicide timeout
changelog/v0.94.3.txt: OSD: add remove_wq suicide timeout
changelog/v0.94.3.txt: OSD: add scrub_wq suicide timeout
changelog/v0.94.3.txt: OSD: add snap_trim_wq suicide timeout
changelog/v0.94.3.txt: OSD: add recovery_wq suicide timeout
changelog/v0.94.3.txt: OSD: add op_wq suicide timeout
changelog/v0.94.4.txt: osd suicide timeout during peering - search for missing objects
changelog/v0.94.6.txt: config_opts: increase suicide timeout to 300 to match recovery
changelog/v0.94.6.txt: config_opts: increase suicide timeout to 300 to match recovery
changelog/v10.2.4.txt: For one user, this was causing OSD suicides when scrub ran because it
changelog/v12.2.2.txt: luminous: librbd: object map batch update might cause OSD suicide timeout
rados/configuration/filestore-config-ref.rst:``filestore_op_thread_suicide_timeout``
rados/troubleshooting/troubleshooting-osd.rst:If the daemon stopped because of a heartbeat failure or ``suicide timeout``,
radosgw/config-ref.rst:.. confval:: rgw_op_thread_suicide_timeout
[zdover@fedora doc]$

Updated by Zac Dover about 1 year ago

https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ (from an email from Neha Ojha to Zac Dover)

https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/ - the location where this variable should be documented, as far as I can tell (as of 29 Jan 2023 0555AEST)

Updated by Zac Dover about 1 year ago

ceph/src/common/options/global.yaml.in is the file in which these variables are documented.

Updated by Zac Dover about 1 year ago

https://github.com/ceph/ceph/pull/49905

Updated by Zac Dover about 1 year ago

common/options/osd.yaml.in:- name: osd_op_thread_suicide_timeout
common/options/rgw.yaml.in:- name: rgw_op_thread_suicide_timeout
common/options/global.yaml.in:- name: osd_op_thread_suicide_timeout
common/options/global.yaml.in:- name: filestore_op_thread_suicide_timeout
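
A listing like the one above could be produced with a recursive grep over the option definition files. The following is a sketch only: the helper name `find_option_defs` is hypothetical, and it assumes, as the output above suggests, that each option definition starts with `- name:` at the beginning of a line.

```shell
# Hypothetical helper: list option definitions whose name contains a pattern.
# Usage: find_option_defs <options-dir> <pattern>
find_option_defs() {
    dir="$1"
    pattern="$2"
    # -r recurses into the directory; each match is printed as "file:line".
    # Definitions are assumed to begin with "- name:" in column one.
    grep -r -e "^- name: .*${pattern}" "$dir"
}

# Example call from the src/ directory of a Ceph checkout (commented out):
# find_option_defs common/options op_thread_suicide_timeout
```

The order in which grep reports files is not guaranteed, so the lines may come out in a different order than shown above.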

Updated by Anthony D'Atri 11 months ago

My sense is that options that are marked `advanced` don't necessarily warrant detailed documentation. There are nearly two thousand of these now, and the ones that operators generally should touch number fewer than fifty, I suspect. The obscure and advanced ones are subject to change at any time.

Updated by Voja Molani 11 months ago

> options that are marked `advanced` don't necessarily warrant detailed documentation

Yes, that may be true. But the particular variable that this issue concerns is referred to in many places; see the OP for a list. The worst thing to do is to read or watch a "tuning presentation" and blindly change a variable without even knowing what it does!
