Bug #46445

closed

nautilus client may hunt for a mon for a very long time if msg v2 is not enabled on mons

Added by Mykola Golub almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The problem is observed with a nautilus client. For newer client versions the situation happens to be much better (see below). Still, it seems important to improve this for nautilus clients, because this is where the situation is most common: during an upgrade from pre-nautilus, the mons and clients are already upgraded but msg v2 is not yet enabled because the osds have not been upgraded. Another valid case is when someone does not want to enable msg v2 on the mons for some reason.

So, if msg v2 is not enabled on the mons, the monclient may spend a long time (> 10 sec) hunting for a mon. I am attaching a detailed log with an example of such a session (`ceph status` command). If the protocols are not explicitly specified in the `mon_host` parameter (just mon IPs), the client tries both v1 and v2 addresses. If it tries v2 first, it gets a "connection refused" error, but it keeps retrying this address until the monclient `tick()` is called. `tick()` sees that we are still in "hunting" mode and reopens the connection, this time choosing a different address, which may be v1.

The main problem is that on nautilus the first monclient `tick()` fires only after a 10 sec interval. Judging by the code [1], I think it was assumed that we are already in "hunting" mode when the first tick is scheduled, so that "mon_client_hunt_interval" is used. In reality we are not in that mode yet, so "mon_client_ping_interval" is used, which is 10 sec.

On octopus it is much faster: although the first tick is still scheduled when we are not in "hunting" mode, "mon_client_log_interval" is used for the tick interval [2], which is 1 sec by default.

So the fix seems to be to make sure the first tick is scheduled when we are in "hunting" mode (which I think is what was intended initially), so that "mon_client_hunt_interval" is used. Note, though, that in this case it will become slower for post-nautilus clients, because "mon_client_hunt_interval" is 3 sec by default.
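To make the interval selection described above concrete, here is a minimal C++ sketch. It is not the actual `MonClient` code: `TickConf`, `Client` and `first_tick_delay` are hypothetical names, and the only facts it encodes are the defaults and the per-version behaviour stated in this description.

```cpp
// Defaults as quoted in this description; the struct itself is hypothetical.
struct TickConf {
  double mon_client_hunt_interval = 3.0;   // sec
  double mon_client_ping_interval = 10.0;  // sec
  double mon_client_log_interval = 1.0;    // sec
};

// Hypothetical tag for the client version being modelled.
enum class Client { Nautilus, Octopus };

// Delay before the first tick(), following the description above:
// in "hunting" mode the hunt interval applies; otherwise nautilus
// uses the ping interval and octopus uses the log interval.
double first_tick_delay(const TickConf& conf, Client client, bool hunting) {
  if (hunting) {
    return conf.mon_client_hunt_interval;
  }
  if (client == Client::Nautilus) {
    return conf.mon_client_ping_interval;
  }
  return conf.mon_client_log_interval;
}
```

With the defaults, the not-hunting case gives 10 sec for nautilus and 1 sec for octopus, while scheduling the first tick in "hunting" mode gives 3 sec for both, which is why the proposed fix helps nautilus but slows down newer clients.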

Perhaps we need to consider decreasing "mon_client_hunt_interval" further, then? Or maybe the monclient should try to reopen the connection (pick another address) right after it gets "connection refused", instead of retrying the same address until the tick fires?
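A rough way to compare these options is to model the time lost before the client reaches an address that accepts the connection. The sketch below is only an illustration under simplifying assumptions: `Attempt` and `hunt_delay` are hypothetical names, each address is tried once in order, and the client moves to the next address only after `switch_delay` seconds.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical outcome of trying one candidate address.
enum class Attempt { Refused, Connected };

// Seconds lost before reaching an address that accepts the connection,
// assuming the client waits switch_delay seconds after each refused
// address before trying the next one. Returns -1 if every address refuses.
double hunt_delay(const std::vector<Attempt>& attempts, double switch_delay) {
  for (std::size_t i = 0; i < attempts.size(); ++i) {
    if (attempts[i] == Attempt::Connected) {
      return static_cast<double>(i) * switch_delay;
    }
  }
  return -1.0;
}
```

For one refused v2 address followed by a working v1 address, this model gives 10 sec with the nautilus tick interval, 3 sec with the default "mon_client_hunt_interval", and 0 sec if the connection is reopened immediately on "connection refused".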

[1] https://github.com/ceph/ceph/blob/v14.2.9/src/mon/MonClient.cc#L863
[2] https://github.com/ceph/ceph/blob/v15.2.3/src/mon/MonClient.cc#L953


Files


Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #46951: octopus: nautilus client may hunt for a mon for a very long time if msg v2 is not enabled on mons (Resolved, Nathan Cutler)
Copied to RADOS - Backport #46952: nautilus: nautilus client may hunt for a mon for a very long time if msg v2 is not enabled on mons (Resolved, Nathan Cutler)
Actions #1

Updated by Mykola Golub almost 4 years ago

  • Backport set to octopus, nautilus
Actions #2

Updated by Mykola Golub almost 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 36065
Actions #3

Updated by Kefu Chai over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #4

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #46951: octopus: nautilus client may hunt for a mon for a very long time if msg v2 is not enabled on mons added
Actions #5

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #46952: nautilus: nautilus client may hunt for a mon for a very long time if msg v2 is not enabled on mons added
Actions #6

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
