Bug #37753

closed

mgr will refuse connection from the monitor who starts behind it

Added by Xinying Song over 5 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Consider a three-monitor cluster in which mon-A and mon-B are active and in quorum. Start a mgr, then start mon-C. The mgr does not recognize mon-C. When a query command such as 'ceph pg dump' is sent to mon-C, mon-C tries to connect to the mgr, but the mgr marks the session down after a few steps. If we are lucky, the query result reaches mon-C before the mgr marks the session down. If not, mon-C keeps retrying the connection until it succeeds.


Related issues 2 (0 open, 2 closed)

Copied to mgr - Backport #38109: mimic: mgr will refuse connection from the monitor who starts behind it (Resolved, Prashant D)
Copied to mgr - Backport #38110: luminous: mgr will refuse connection from the monitor who starts behind it (Rejected)
Actions #2

Updated by Xinying Song over 5 years ago

It seems no one cares about this issue, but I'd still like to give a more detailed description.

I'm using `juju` to deploy the Ceph environment. The Ceph version is Luminous. The charm files for juju are downloaded from the official charm store, and we use the ceph_exporter provided by DigitalOcean for Prometheus.

When all those components had been deployed, we found that queries to ceph_exporter were dramatically slow. Further investigation showed that ceph_exporter sends a `pg dump` command to monitor-A through the librados interface 'rados_mon_command'. monitor-A then delegates the query to mgr-A and waits for the result that mgr-A should return. However, monitor-A always failed to read the result from mgr-A, even though mgr-A had in fact returned it successfully, and mon-A kept resending the query until it finally got the result. According to the monitor log, when monitor-A tried to read the result returned from mgr-A, it got a 'peer close connection' error.

After reading a lot of the related source code, we found the root cause: the mgr-A service was started before mon-A joined the quorum, so it doesn't know about mon-A. When mon-A tries to establish a connection to mgr-A, mgr-A first accepts it (in DaemonServer::handle_open()). Later mon-A (strictly speaking, the mgr client inside mon-A) sends an MMgrReport message to mgr-A. mgr-A handles it (in DaemonServer::handle_report()), finds that it has no knowledge of mon-A (in DaemonServer::daemon_state), and closes the connection on its own.

Although this problem can be avoided by starting the Ceph components in strictly the right order, we still think Ceph could handle it more gracefully. All we need to do is handle monmap changes in the mgr.
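
To make the sequence above easier to follow, here is a minimal toy model in Python. It is only a sketch of the behaviour described in this comment, not Ceph code and not the actual patch: `ToyMgr`, its set-based `daemon_state`, and the `handle_monmap_change` hook are hypothetical names invented for illustration. Only the method names `handle_open` and `handle_report` mirror the DaemonServer functions mentioned above.

```python
class ToyMgr:
    """Hypothetical stand-in for the mgr's DaemonServer; not Ceph code."""

    def __init__(self, monmap):
        # Metadata is collected once, for the monitors known at mgr startup.
        self.daemon_state = {("mon", name) for name in monmap}

    def handle_open(self, key):
        # The session is accepted without consulting daemon_state.
        return True

    def handle_report(self, key):
        # A report from a daemon with no known metadata is rejected,
        # which in the real mgr leads to the connection being closed.
        return key in self.daemon_state

    def handle_monmap_change(self, monmap):
        # Assumed fix: learn about monitors that joined after startup.
        self.daemon_state |= {("mon", name) for name in monmap}


mgr = ToyMgr(["A", "B"])           # mgr starts while only A and B are known
late_mon = ("mon", "C")            # monitor that starts after the mgr

accepted = mgr.handle_open(late_mon)        # session is opened
before_fix = mgr.handle_report(late_mon)    # report is rejected

mgr.handle_monmap_change(["A", "B", "C"])   # mgr reacts to the new monmap
after_fix = mgr.handle_report(late_mon)     # report is now accepted
```

In this model the late monitor's session is opened, its first report is rejected, and the report is accepted only after the mgr has processed the updated monmap.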

Actions #3

Updated by Xinying Song over 5 years ago

Here is a simple way to observe the problem in the mgr.
1. prepare ceph.conf with 3 monitors.
2. init and start mon.A and mon.B
3. init and start mgr.A with debug_mgr=5
4. init and start mon.C
5. tail -f /var/log/ceph/ceph-mgr.A.log |grep 'mon,'

Then you will see log lines like 'mgr.server handle_report rejecting report from mon,C, since we do not have its metadata now.' occurring periodically. This indicates that the mgr doesn't update its daemon_state info as expected.
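
If grepping by eye is not enough, the rejections can also be counted per daemon. The following is a rough Python sketch that assumes the log lines contain the message exactly as quoted above. The helper name `count_rejections` and the sample lines are made up for illustration, and a real mgr log line will carry extra prefix fields (timestamp, thread id, and so on) that the regex simply skips over.

```python
import re
from collections import Counter

# Matches the rejection message quoted above, e.g. "... from mon,C, since ...".
REJECT_RE = re.compile(
    r"handle_report rejecting report from (\w+),(\S+?), "
    r"since we do not have its metadata"
)


def count_rejections(lines):
    """Count rejected reports per (daemon type, daemon name)."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Made-up sample input: two rejection lines and one unrelated line.
sample = [
    "mgr.server handle_report rejecting report from mon,C, "
    "since we do not have its metadata now.",
    "some unrelated log line",
    "mgr.server handle_report rejecting report from mon,C, "
    "since we do not have its metadata now.",
]
counts = count_rejections(sample)
```

To run it against a real log, pass an open file object instead of `sample`, for example `count_rejections(open("/var/log/ceph/ceph-mgr.A.log"))`.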

Actions #4

Updated by Mykola Golub over 5 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 25725
Actions #5

Updated by Kefu Chai over 5 years ago

  • Assignee set to Xinying Song
Actions #6

Updated by Kefu Chai over 5 years ago

  • Category set to ceph-mgr
Actions #7

Updated by Kefu Chai about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to mimic,luminous
Actions #8

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38109: mimic: mgr will refuse connection from the monitor who starts behind it added
Actions #9

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38110: luminous: mgr will refuse connection from the monitor who starts behind it added
Actions #10

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
