Bug #49231

MONs unresponsive over extended periods of time

Added by Frank Schilder about 3 years ago. Updated over 1 year ago.

Priority: 1 - critical

Version: 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)

I'm repeatedly observing that the MONs become unresponsive over extended periods of time, for example, the 70 minutes reported here. I observed two 70-minute instances during an 11-hour window recently, but have seen this problem for over a year now. Symptoms:

- the MONs use 100% CPU
- booting OSDs cannot join the cluster
- MONs don't respond to MGR and OSD beacons while still marking them out, leading to storage service downtime
- for some reason, the MDS cluster is unaffected
- peering that starts does not complete; this can lead to stuck IO
- changes to the cluster map are not applied (not even detected/handled?)
- IO continues as long as nothing goes down (services running match last active cluster map)
- admin commands (ceph status etc.) are extremely slow

I'm not able to correlate this with client IO or reproduce it. The most significant indicator in our monitoring is a large increase in network traffic on the MON servers; see the packet reports attached.

Additional information:

- the outage starts on all MONs at the same time, but the MONs come back at different times
- the outage duration of each MON is roughly proportional to their number of client connections (open sessions)
- the MONs are in quorum the whole time
- mon_status does not show anything unusual, the states are: leader, peon, peon; all in quorum
- sometimes slow ops are reported for some MONs
- the combined MON cluster outage continues until the last MON gets back to normal, a single MON in this state drags everything down
- the incidents do not follow IO patterns, but seem to be more frequent under heavier load
- our MON DB is on RAID10 arrays of 6 high-performance SSDs, DB size currently 2GB (I just added a lot of OSDs and the cluster is re-balancing), usual size 150-400MB
- network is operating fine, 2x10G LACP bond on the MONs
- RAM usage of MONs is low and does not increase during incident
- we have a large number of snapshots on our RBD pools

I suspect that a MON-client sync might be the issue. I have observed something similar (but not the same) during synchronization of a freshly deployed MON. Here, the reason was stated to be a large number of snapshots. Setting `mon_sync_max_payload_size=4096` solved the issue.
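
For reference, that earlier workaround can be applied at runtime through the config database (available from mimic on); a sketch, assuming the cluster accepts `ceph config set`:

```shell
# Limit each mon sync payload to 4 KiB (the value that resolved the
# earlier fresh-MON synchronization problem mentioned above)
ceph config set mon mon_sync_max_payload_size 4096
```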

I currently have no idea how to reproduce or debug this further, and I have not found anything that would allow me to predict incidents. The logs contain nothing unusual as far as I can tell. As a half-working mitigation, I set the MGR and OSD beacon timeouts to 24h to avoid these services going down. I'm praying that we don't have a disk fail during these incidents.
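
The beacon-timeout mitigation could look like the following sketch; the exact option names (mon_mgr_beacon_grace for MGR beacons, mon_osd_report_timeout for OSD reports) are my assumption, since the post does not name them:

```shell
# Workaround only: keep the MONs from marking MGRs/OSDs out while they
# are unresponsive. 86400 s = 24 h. Revert once the real issue is fixed.
ceph config set mon mon_mgr_beacon_grace 86400
ceph config set mon mon_osd_report_timeout 86400
```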

graphs.pdf - Packet reports during incident (graphs, ganglia). (103 KB) Frank Schilder, 02/10/2021 10:38 AM


#1 Updated by Frank Schilder about 3 years ago

Update: I'm now seeing this issue 2 to 3 times a day; it's getting really irritating. Possibly it is due to the large rebalancing operation going on?

I had one just now and dared to restart one of the unresponsive MONs. This immediately restores normal operation and the MONs do not get back into unresponsive mode right away.

#2 Updated by Frank Schilder about 3 years ago

I think I found the reason for this behaviour. I managed to pull extended logs during an incident and saw that the MONs were trying to send a huge number of osd_map updates to all clients. In between, they reported a lossy channel, but this was self-inflicted; the network was up and operating fine at the time. Searching the logs for osd_map gave very few hits, some pointing me to the parameter osd_map_message_max_bytes. Apparently the default value can be too large, so I have now reduced it to

global advanced osd_map_message_max_bytes 1048576

This seems to fix it, but I need to wait longer before I can confirm that.
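
Applied via the config database, the change above would look like this sketch (the `global advanced` in the setting dump suggests it was set cluster-wide; `mon.a` below is a placeholder daemon name):

```shell
# Cap osd_map messages at 1 MiB instead of the larger default
ceph config set global osd_map_message_max_bytes 1048576
# Check the value a monitor actually sees
ceph config get mon.a osd_map_message_max_bytes
```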

For future reference, I would like to add that the first time our cluster was affected, we observed that:

- a large number of OSDs was marked down,
- MGRs were marked down and entered restart-loops and
- the cluster was extremely unresponsive to admin commands

all for no apparent reason. Only later did we discover that the MON daemons were causing this. Additional information showing how it looks to an admin is in the ceph-users thread "Ceph meltdown, need help". Our first reaction was to increase beacon timeouts to prevent services being marked out, which does not really fix anything. I hope the message max bytes option above is the real fix.

#3 Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)

#4 Updated by Frank Schilder almost 3 years ago

After running for a few months with the modified setting, it seems to fix the issue. I still see CPU load and network packet peaks caused by this problem, as well as the occasional slow MON ops warning, but these windows are now down to 2 minutes max. I will soon repeat an OSD deployment that triggered massive problems last time and would like to reduce this parameter even further. Is there a strict lower bound for this message size? For example, would something very small like 4096 work? It would be really helpful if this parameter were covered in some detail in the documentation.

#5 Updated by Frank Schilder almost 2 years ago

Experimenting further, I found the value osd_map_message_max_bytes=16384 to be the best choice; with larger or smaller values I observed longer outage windows. I have now also deployed 2 more monitors (5 MONs in total). Since then, I have not observed the problem again.

To conclude, my observations of the network traffic overload seemed to indicate that the MONs were sending packets too fast, leading to errors. In response to the errors, the MONs appeared to resend immediately instead of waiting a short amount of time, which led to a total overload of the interface with a lot of packets dropped. Some throttling in the relevant part of the code should solve this.
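
The throttling suggested above is essentially retry-with-backoff. This is an illustrative sketch in Python, not Ceph code; the function name, delays, and retry count are invented for the example:

```python
import random
import time

def send_with_backoff(send, payload, max_retries=6, base_delay=0.01):
    """Attempt to send a payload; on failure, wait an exponentially
    growing, jittered delay before retrying instead of resending
    immediately (which is what the MON behaviour above looked like).
    Returns the number of attempts needed, or raises after max_retries."""
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        if send(payload):
            return attempt
        time.sleep(delay)
        # Double the delay and add random jitter so many senders
        # hitting the same congested link do not retry in lockstep.
        delay *= 2 * (1 + random.random())
    raise RuntimeError("send failed after {} attempts".format(max_retries))
```

The jitter matters here: with 1000+ clients being updated simultaneously, synchronized retries would recreate the same burst that caused the drops.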

#6 Updated by Radoslaw Zarzynski almost 2 years ago

Thanks for the update! Could you say a bit more about the version you've tested this on? Is it mimic, maybe?

#7 Updated by Frank Schilder almost 2 years ago

Yes, I observed this on mimic latest stable.

I think this issue is still present, but not easy to observe. My guess is that one needs a large number of ceph-fs kernel clients to see it happening. It seems to occur only when sending osd_map updates overruns the network queue, which in turn seems to happen only when a large number of clients must be sent updates more or less simultaneously.
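
A back-of-envelope calculation illustrates why a large osd_map message size can overrun a 10G link when many clients are updated at once. The 10 MiB default is my assumption for the mimic-era value; the 1000-client count matches the kclient population mentioned in this thread:

```python
def burst_seconds(clients, msg_bytes, link_bits_per_s=10e9):
    """Wire time a link needs to drain one osd_map update burst,
    assuming every client is sent a full-size message near-simultaneously."""
    total_bits = clients * msg_bytes * 8
    return total_bits / link_bits_per_s

# Assumed mimic-era default (10 MiB) vs the reduced value from this thread.
default = burst_seconds(1000, 10 * 1024 * 1024)  # ~8.4 s of solid wire time
reduced = burst_seconds(1000, 16384)             # ~0.013 s
```

Several seconds of back-to-back full-link traffic per map update would plausibly fill the send queues and produce exactly the drop-and-immediate-resend spiral described above.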

I found the hint to change osd_map_message_max_bytes in an old ceph blog post. Unfortunately, I can't find it any more. I would like to link it to this case, because it seems to be related to my observations.

#8 Updated by Radoslaw Zarzynski almost 2 years ago

Mimic is EOL :-(. Would you be able to upgrade soon?

#9 Updated by Frank Schilder over 1 year ago

We are planning to upgrade to Octopus. However, I do not believe we can reproduce the issue here. The above config has worked for us.

That said, it should not be difficult to find the MON code fragment that uses osd_map_message_max_bytes as a parameter and check

  • if it has even changed since mimic (mimic may be EOL, but the code certainly isn't) and
  • if it contains a loop over network IO that incorrectly responds to NIC port overload.

There is, in fact, a similar (identical?) problem in the MON code configured by mon_sync_max_payload_size. Some throttling seems to be missing that should kick in when a NIC link is fully utilised and network send requests err out. At the moment, I believe the code just resends immediately, overloading the NIC port (memory bandwidth is so much higher than a 10G ETH link).

#10 Updated by Frank Schilder over 1 year ago

OK, I did some more work, and it looks like I can trigger the issue with some certainty by failing an MDS that has been up for a long time. What I have observed twice recently is that when the MDS starts clientreplay, the MONs all start their unusually high network activity at the same time. Things become unresponsive during this period. With our config settings above, such a window lasts 2:30 minutes. I can try the same once we are on Octopus. I'm afraid that seeing this issue requires 1000+ ceph-fs kclient connections, so I can't build a reproducer for a test cluster.
