Bug #46445
Closed
nautilis client may hunt for mon very long if msg v2 is not enabled on mons
Description
The problem is observed with a nautilus client. For newer client versions the situation happens to be much better (see below). Still, it seems important to improve this for nautilus clients, because this is where the situation is most common during an upgrade from pre-nautilus: the mons and the client are already upgraded, but msg v2 is not enabled because the OSDs are not upgraded yet. Another valid case is when someone does not want to enable msg v2 on the mons for some reason.
So, if msg v2 is not enabled on the mons, the monclient may spend a long time (> 10 sec) hunting for a mon. I am attaching a detailed log with an example of such a session (`ceph status` command). If the protocols are not explicitly specified in the `mon_host` parameter (just mon IPs), the client tries both the v1 and v2 addresses. If it tries v2 first, it gets a "connection refused" error but keeps retrying this address until the monclient `tick()` is called. The tick sees that we are still in "hunting" mode and reopens the connection, this time choosing a different address, which may be v1.
The main problem is that on nautilus the first monclient `tick()` fires only after a 10 sec interval. According to the code [1], I think it was assumed that we are in "hunting" mode when the first tick is scheduled, so that "mon_client_hunt_interval" is used. In reality we are not in that mode yet, and "mon_client_ping_interval" is used, which is 10 sec.
On octopus it is much faster: although the first tick is still scheduled when we are not in "hunting" mode, "mon_client_log_interval" is used for the tick interval [2], which is 1 sec by default.
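To make the comparison concrete, the delay before the first tick can be modelled roughly as below. This is a simplified sketch, not the actual MonClient code: `Release`, `TickConfig` and `next_tick_delay` are hypothetical names, and the default values are the ones quoted in this report.

```cpp
#include <cassert>

// Hypothetical model of the tick scheduling described in this report.
// It is NOT the MonClient implementation; names are illustrative only.
enum class Release { Nautilus, Octopus };

struct TickConfig {
  double ping_interval = 10.0;  // "mon_client_ping_interval", seconds
  double hunt_interval = 3.0;   // "mon_client_hunt_interval", seconds
  double log_interval = 1.0;    // "mon_client_log_interval", seconds
};

// Delay (in seconds) before the next tick, depending on whether the
// client is already in "hunting" mode when the tick is scheduled.
inline double next_tick_delay(Release release, bool hunting,
                              const TickConfig& cfg) {
  assert(cfg.ping_interval > 0.0 && cfg.hunt_interval > 0.0 &&
         cfg.log_interval > 0.0);
  if (hunting) {
    return cfg.hunt_interval;
  }
  // Not hunting: nautilus falls back to the ping interval, while
  // octopus (per this report) ends up using the log interval.
  return release == Release::Nautilus ? cfg.ping_interval : cfg.log_interval;
}
```

With these defaults, `next_tick_delay(Release::Nautilus, false, TickConfig{})` gives 10 sec, while the same call for `Release::Octopus` gives 1 sec.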
So the fix seems to be to make sure the first tick is scheduled when we are in "hunting" mode (which I think is what was originally intended), so that "mon_client_hunt_interval" is used. Note, though, that in this case it will become slower for post-nautilus clients, because "mon_client_hunt_interval" is 3 sec by default.
Perhaps we then need to consider decreasing "mon_client_hunt_interval" further? Or maybe the monclient should reopen the connection (pick another address) right after it gets "connection refused", instead of retrying the same address until the tick fires?
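The difference between the two retry strategies can be sketched with a toy model. It assumes the client walks a fixed list of addresses in order, whereas the real monclient chooses the next address itself. `MonAddr`, `Strategy` and `time_to_connect` are hypothetical names, and `attempt_cost` is an assumed per-attempt cost of a refused connection.

```cpp
#include <cassert>
#include <vector>

// Hypothetical candidate address: a label and whether the mon accepts it.
struct MonAddr {
  const char* label;  // e.g. "v2" or "v1"
  bool accepts;       // false models "connection refused"
};

enum class Strategy {
  WaitForTick,      // keep retrying the refused address until the tick
  ReopenOnRefusal,  // move to the next address right after the refusal
};

// Seconds elapsed before the first accepted connection, or -1.0 if no
// address in the list accepts. Addresses are tried in the given order.
inline double time_to_connect(const std::vector<MonAddr>& order, Strategy s,
                              double tick_interval, double attempt_cost) {
  assert(tick_interval >= 0.0 && attempt_cost >= 0.0);
  double elapsed = 0.0;
  for (const MonAddr& addr : order) {
    if (addr.accepts) {
      return elapsed;
    }
    // A refused address costs a full tick interval when we wait for the
    // tick, but only the attempt itself when we reopen immediately.
    elapsed += (s == Strategy::WaitForTick) ? tick_interval : attempt_cost;
  }
  return -1.0;
}
```

For a v2-then-v1 order, `WaitForTick` costs one full tick interval (10 sec with the nautilus first tick, 3 sec with the hunt interval), while `ReopenOnRefusal` costs only the refused attempt.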
[1] https://github.com/ceph/ceph/blob/v14.2.9/src/mon/MonClient.cc#L863
[2] https://github.com/ceph/ceph/blob/v15.2.3/src/mon/MonClient.cc#L953
Updated by Mykola Golub almost 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 36065
Updated by Kefu Chai over 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 3 years ago
- Copied to Backport #46951: octopus: nautilis client may hunt for mon very long if msg v2 is not enabled on mons added
Updated by Nathan Cutler over 3 years ago
- Copied to Backport #46952: nautilus: nautilis client may hunt for mon very long if msg v2 is not enabled on mons added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".