Bug #46445

nautilus client may hunt for a mon for a very long time if msg v2 is not enabled on mons

Added by Mykola Golub about 1 month ago. Updated 10 days ago.

Status: Pending Backport
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

The problem is observed with a nautilus client. For newer client versions the situation happens to be much better (see below). Still, it seems important to improve this for nautilus clients, because this is where the situation is most common, during an upgrade from pre-nautilus: the mons and the client are already upgraded, but msg v2 is not enabled because the osds are not upgraded yet. Another valid case is when someone does not want to enable msg v2 on the mons for some reason.

So, if msg v2 is not enabled on the mons, the monclient may spend a long time (> 10 sec) hunting for a mon. I am attaching a detailed log with an example of such a session (a `ceph status` command). If the protocols are not explicitly specified in the `mon_host` parameter (just mon IPs), the client tries both v1 and v2 addresses. If it tries v2 first, it gets a "connection refused" error but keeps retrying this address until the monclient `tick()` is called. The tick sees that we are still in "hunting" mode and reopens the connection, this time choosing a different address, which may be v1.
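To make the address selection concrete, here is a minimal Python sketch of how a single `mon_host` entry could expand into candidate addresses. The helper `candidate_addrs` is hypothetical and is not Ceph code. It assumes the default mon ports (3300 for msg v2, 6789 for v1), handles only bare IPv4 addresses and entries with an explicit `v1:`/`v2:` prefix, and does not model the order in which a real client picks among the candidates.

```python
V2_PORT = 3300  # default msg v2 mon port
V1_PORT = 6789  # default legacy (v1) mon port


def candidate_addrs(entry):
    """Expand one mon_host entry into the addresses a client may try.

    Hypothetical helper for illustration only (IPv4, no bracketed
    address vectors).
    """
    entry = entry.strip()
    if entry.startswith(("v1:", "v2:")):
        # Protocol given explicitly: only that address is a candidate.
        return [entry]
    # Bare IP: both protocols are candidates, so a v2 attempt can be
    # refused when msg v2 is not enabled on the mon.
    return ["v2:%s:%d" % (entry, V2_PORT), "v1:%s:%d" % (entry, V1_PORT)]


print(candidate_addrs("10.0.0.1"))
print(candidate_addrs("v1:10.0.0.1:6789"))
```

Under these assumptions, spelling out the v1 protocol in `mon_host` leaves no v2 candidate to be refused.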

The main problem is that on nautilus the first monclient `tick()` fires only after a 10 sec interval. Judging by the code [1], I think it was assumed that we are already in "hunting" mode when the first tick is scheduled, so that "mon_client_hunt_interval" is used. In reality we are not in that mode yet, so "mon_client_ping_interval" is used, which is 10 sec.

On octopus it is much faster: although the first tick is still scheduled when we are not in "hunting" mode, "mon_client_log_interval" is used for the tick interval [2], which is 1 sec by default.
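The interval selection described above can be summarized in a small Python model. This is a simplified sketch, not the MonClient code: the function `tick_interval` and the `DEFAULTS` table are made up for illustration, the default values are the ones quoted in this report, and any backoff applied to the hunt interval is ignored.

```python
# Default intervals in seconds, as quoted in this report.
DEFAULTS = {
    "mon_client_hunt_interval": 3.0,
    "mon_client_ping_interval": 10.0,
    "mon_client_log_interval": 1.0,
}


def tick_interval(release, hunting, conf=DEFAULTS):
    """Delay before the next tick in this simplified model."""
    if hunting:
        # In "hunting" mode the hunt interval applies on both releases.
        return conf["mon_client_hunt_interval"]
    if release == "nautilus":
        return conf["mon_client_ping_interval"]
    if release == "octopus":
        return conf["mon_client_log_interval"]
    raise ValueError("unknown release: %r" % (release,))


# First tick today: scheduled while not yet in "hunting" mode.
print(tick_interval("nautilus", hunting=False))  # 10.0
print(tick_interval("octopus", hunting=False))   # 1.0
# First tick if it were scheduled in "hunting" mode.
print(tick_interval("nautilus", hunting=True))   # 3.0
```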

So the fix seems to be to make sure the first tick is scheduled when we are already in "hunting" mode (which I think is what was originally intended), so that "mon_client_hunt_interval" is used. Note, though, that in this case hunting will become slower for post-nautilus clients, because "mon_client_hunt_interval" is 3 sec by default.

Perhaps we then need to consider decreasing "mon_client_hunt_interval"? Or maybe the monclient should try to reopen the connection (pick another address) right after it gets "connection refused", instead of retrying the same address until the tick fires?
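To compare the options, here is a rough Python sketch of how long a client would wait before first trying a v1 address. It is a toy model under stated assumptions, not a measurement: `seconds_until_v1` is a hypothetical function, every v2 attempt is assumed to be refused instantly, the same tick delay is used for every tick, and the address order is fixed rather than picked by the client.

```python
def seconds_until_v1(order, tick_delay, reopen_on_refused):
    """Seconds until a v1 address is first tried (toy model).

    order: protocols of the addresses in the order they are picked,
           e.g. ["v2", "v1"].
    tick_delay: seconds the client stays on a refused address before a
                tick reopens the connection.
    reopen_on_refused: if True, move to the next address immediately on
                       "connection refused" instead of waiting for a tick.
    """
    elapsed = 0.0
    for proto in order:
        if proto == "v1":
            return elapsed
        # v2 is refused: either move on at once or wait for the tick.
        if not reopen_on_refused:
            elapsed += tick_delay
    raise ValueError("no v1 address in order")


order = ["v2", "v1"]
print(seconds_until_v1(order, 10.0, reopen_on_refused=False))  # 10.0
print(seconds_until_v1(order, 3.0, reopen_on_refused=False))   # 3.0
print(seconds_until_v1(order, 10.0, reopen_on_refused=True))   # 0.0
```

In this model, reopening on "connection refused" removes the dependency on the tick interval entirely, while switching the first tick to the hunt interval only shortens the wait.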

[1] https://github.com/ceph/ceph/blob/v14.2.9/src/mon/MonClient.cc#L863
[2] https://github.com/ceph/ceph/blob/v15.2.3/src/mon/MonClient.cc#L953

ceph-14.2.9-debug-ms-10-debug-monc-10.log (102 KB) Mykola Golub, 07/10/2020 07:41 AM

History

#1 Updated by Mykola Golub about 1 month ago

  • Backport set to octopus, nautilus

#2 Updated by Mykola Golub 28 days ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 36065

#3 Updated by Kefu Chai 10 days ago

  • Status changed from Fix Under Review to Pending Backport
