Bug #48274
mon_osd_adjust_heartbeat_grace blocks OSDs from being marked as down
Description
I encountered a situation which I also posted on the users list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HZXM2DQTNKE55CS35IJMJKYVEO7237VQ/#SBW4X46UHMUNHC7HCDTPBST3LMSIONSM
The cluster (15.2.2) has a fairly simple CRUSHMap:
root default
  rack 206
    host A
  rack 207
    host B
  rack 208
    host C
We tried to perform a failure of a rack.
The MON in that rack was marked as down, but the OSDs were only marked down after roughly 900 seconds.
This morning (2020-11-18) we tried this test again and I noticed this:
2020-11-18T02:33:21.498-0500 7f849ae8c700 0 log_channel(cluster) log [DBG] : osd.58 reported failed by osd.105
2020-11-18T02:33:21.498-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader).osd e108196 osd.58 has 1 reporters, 82.847365 grace (20.000000 + 62.8474 + 5.67647e-167), max_failed_since 2020-11-18T02:32:59.498216-0500
2020-11-18T02:33:21.498-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader).log v9218943 logging 2020-11-18T02:33:21.499338-0500 mon.CEPH2-MON1-206-U39 (mon.0) 165 : cluster [DBG] osd.58 reported failed by osd.105
2020-11-18T02:33:21.814-0500 7f8499689700 7 mon.CEPH2-MON1-206-U39@0(leader).log v9218944 update_from_paxos applying incremental log 9218944 2020-11-18T02:33:21.499338-0500 mon.CEPH2-MON1-206-U39 (mon.0) 165 : cluster [DBG] osd.58 reported failed by osd.105
2020-11-18T02:33:22.158-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader).osd e108196 preprocess_query osd_failure(failed timeout osd.58 [v2:10.101.13.137:6811/2762,v1:10.101.13.137:6819/2762] for 22sec e108196 v108196) v4 from osd.90 v2:10.101.13.12:6830/2708
2020-11-18T02:33:22.158-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader).osd e108196 preprocess_failure new: osd.58 [v2:10.101.13.137:6811/2762,v1:10.101.13.137:6819/2762], from osd.90
2020-11-18T02:33:22.158-0500 7f849ae8c700 7 mon.CEPH2-MON1-206-U39@0(leader).osd e108196 prepare_update osd_failure(failed timeout osd.58 [v2:10.101.13.137:6811/2762,v1:10.101.13.137:6819/2762] for 22sec e108196 v108196) v4 from osd.90 v2:10.101.13.12:6830/2708
2020-11-18T02:33:22.158-0500 7f849ae8c700 1 mon.CEPH2-MON1-206-U39@0(leader).osd e108196 prepare_failure osd.58 [v2:10.101.13.137:6811/2762,v1:10.101.13.137:6819/2762] from osd.90 is reporting failure:1
2020-11-18T02:33:22.158-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader) e1 no_reply to osd.90 v2:10.101.13.12:6830/2708 osd_failure(failed timeout osd.58 [v2:10.101.13.137:6811/2762,v1:10.101.13.137:6819/2762] for 22sec e108196 v108196) v4
After reading it thoroughly this is what I noticed:
osd.58 has 1 reporters, 82.847365 grace (20.000000 + 62.8474 + 5.67647e-167)
Somehow the MONs were adding time to the grace period.
After looking through the source code (mon/OSDMonitor.cc: OSDMonitor::check_failure) I saw that the config option mon_osd_adjust_heartbeat_grace was influencing the grace period.
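From the log line above (82.847365 grace = 20.000000 + 62.8474 + 5.67647e-167), the effective grace looks like the base grace plus decayed "laggy" contributions. A rough Python sketch of how such an adjustment behaves, simplified from what check_failure appears to do; the halflife value, parameter names, and the single-term form here are my assumptions, not the exact Ceph code:

```python
import math

def adjusted_grace(base_grace, failed_for, laggy_probability, laggy_interval,
                   halflife=3600.0):
    # Exponential decay: the laggy contribution halves every `halflife`
    # seconds, so recent lagginess weighs more than old lagginess.
    decay_k = math.log(0.5) / halflife
    decay = math.exp(failed_for * decay_k)
    # An OSD with a history of being laggy gets its grace period inflated
    # by (decayed interval * probability) on top of the base grace.
    return base_grace + decay * laggy_interval * laggy_probability

# With no laggy history the grace stays at the 20 s base; with a long
# laggy history it can grow well beyond that, delaying the down-marking.
print(adjusted_grace(20.0, 22.0, 0.0, 0.0))    # roughly 20.0
print(adjusted_grace(20.0, 22.0, 0.7, 90.0))   # roughly 82.7
```

This would explain the behaviour seen here: the grace keeps being extended, so reports from surviving OSDs never exceed it and the failed OSDs are not marked down.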
I changed mon_osd_adjust_heartbeat_grace to false and the OSDs were marked as down right away.
I played with setting mon_osd_reporter_subtree_level to rack, but this didn't change anything.
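For reference, this is roughly how I changed both settings at runtime (assuming the standard `ceph config set` CLI available in Octopus):

```shell
# Disable the dynamic grace adjustment so failure reports are checked
# against the plain heartbeat grace (20 s by default):
ceph config set mon mon_osd_adjust_heartbeat_grace false

# Also tried (no effect in this case): require failure reporters to come
# from a different rack instead of a different host.
ceph config set mon mon_osd_reporter_subtree_level rack
```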
This setting has been in the code for a while, but I have never seen this happen before.
Updated by Sage Weil about 3 years ago
- Project changed from Ceph to RADOS
- Category deleted (Monitor)