Bug #37753

mgr will refuse connection from the monitor who starts behind it

Added by Xinying Song 8 months ago. Updated 7 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: ceph-mgr
Target version: -
Start date: 12/25/2018
Due date:
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

For example, take a 3-monitor cluster in which mon-A and mon-B are active and in quorum. Start a mgr, then start mon-C. The mgr does not recognize mon-C. When we send a query command such as 'ceph pg dump' to mon-C, mon-C tries to connect to the mgr, but the mgr marks the session down after a few steps. If we are lucky, the query result reaches mon-C before the mgr marks the session down. If not, mon-C keeps retrying the connection to the mgr until it succeeds.


Related issues

Copied to mgr - Backport #38109: mimic: mgr will refuse connection from the monitor who starts behind it (Resolved)
Copied to mgr - Backport #38110: luminous: mgr will refuse connection from the monitor who starts behind it (Need More Info)

History

#2 Updated by Xinying Song 8 months ago

It seems no one cares about this issue, but I'd still like to give a more detailed description.

I'm using `juju` to deploy the Ceph environment. The Ceph version is Luminous. The charm files for juju were downloaded from the official charm store, and we use the ceph_exporter provided by DigitalOcean for Prometheus.
After all those components were deployed, we found that queries to ceph_exporter were dramatically slow. Further investigation showed that ceph_exporter sends a `pg dump` command to monitor-A through the librados interface 'rados_mon_command'. monitor-A then delegates the query to mgr-A and waits for the result that mgr-A should return. However, monitor-A repeatedly failed to read the result even though mgr-A had in fact returned it, and mon-A kept resending the query until it finally got the result. According to the monitor log, when monitor-A tried to read the result returned by mgr-A, it got a 'peer close connection' error.
After reading a lot of the related source code, we found the root cause: the mgr-A service was started before mon-A was in the quorum, so it does not know mon-A. When mon-A tries to establish a connection to mgr-A, mgr-A first accepts it (in DaemonServer::handle_open()). Later mon-A (strictly speaking, the mgr client inside mon-A) sends an MMgrReport message to mgr-A. mgr-A handles it (in DaemonServer::handle_report()), finds that it has no knowledge of mon-A (in DaemonServer::daemon_state), and closes the connection on its own.
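The sequence above can be modelled in a few lines. This is a minimal Python sketch, not Ceph code: `ToyMgr` and its fields are hypothetical, the method names only mirror the `DaemonServer` functions named above, and the real logic lives in the C++ mgr. It assumes the mgr learned the monitor set once at startup and uses the monitor names from the description (mon-C is the late starter).

```python
class ToyMgr:
    """Toy stand-in for the mgr's DaemonServer (illustrative only)."""

    def __init__(self, known_mons):
        # Snapshot of the monitors that were in quorum when the mgr started.
        self.daemon_state = {("mon", name) for name in known_mons}
        self.open_sessions = set()

    def handle_open(self, key):
        # The open is accepted whether or not the daemon is known.
        self.open_sessions.add(key)
        return True

    def handle_report(self, key):
        # A report from a daemon without an entry closes the session.
        if key not in self.daemon_state:
            self.open_sessions.discard(key)
            return False
        return True


mgr = ToyMgr(known_mons=["A", "B"])  # mgr started while only A and B were in quorum
late_mon = ("mon", "C")              # mon-C joins afterwards

accepted = mgr.handle_open(late_mon)
report_ok = mgr.handle_report(late_mon)
print(accepted, report_ok, late_mon in mgr.open_sessions)
```

In this model the open succeeds but the first report drops the session, which matches the 'peer close connection' error seen on the monitor side.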

Although this problem can be avoided by starting the Ceph components in strictly the right order, we still think Ceph could handle it more elegantly. All we need to do is handle monmap changes in the mgr.
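One way to picture "handling monmap changes" is a reconciliation step that runs whenever the mgr sees a new monmap. The sketch below is an assumption about the general shape of such a step, written in Python for brevity. It is not the change that was actually merged, and `reconcile_mons` and the `(type, name)` key layout are hypothetical.

```python
def reconcile_mons(daemon_state, monmap_names):
    """Return a copy of daemon_state whose mon entries match the monmap.

    Entries for other daemon types are left untouched. Dropping monitors
    that are no longer in the monmap is a choice made for this sketch.
    """
    wanted = {("mon", name) for name in monmap_names}
    non_mon = {key for key in daemon_state if key[0] != "mon"}
    return non_mon | wanted


# State captured when only mon-A and mon-B were in quorum.
state = {("mon", "A"), ("mon", "B"), ("osd", "0")}

# A new monmap arrives that also lists mon-C.
state = reconcile_mons(state, ["A", "B", "C"])
print(sorted(state))
```

With a step like this in place, a report from the late monitor would find an entry and the session would stay open, regardless of start order.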

#3 Updated by Xinying Song 7 months ago

Here is a simple way to observe the problem in the mgr.
1. prepare ceph.conf with 3 monitors.
2. init and start mon.A and mon.B
3. init and start mgr.A with debug_mgr=5
4. init and start mon.C
5. tail -f /var/log/ceph/ceph-mgr.A.log |grep 'mon,'

Then you will see log lines like 'mgr.server handle_report rejecting report from mon,C, since we do not have its metadata now.' appear periodically. This indicates that the mgr does not update its daemon_state info as expected.
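To see how often each daemon is rejected rather than just watching the tail, the log can be summarised with a short script. This is a rough sketch that assumes the message text is exactly the fragment quoted above. The regular expression and the `count_rejections` helper are hypothetical, and the sample list simply repeats that quoted fragment.

```python
import re
from collections import Counter

# Matches the message fragment quoted in this report, e.g. "... from mon,C, since ...".
REJECT_RE = re.compile(
    r"handle_report rejecting report from (\w+),([^,\s]+), "
    r"since we do not have its metadata now"
)


def count_rejections(lines):
    """Count rejected reports per (daemon type, daemon name)."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "mgr.server handle_report rejecting report from mon,C, since we do not have its metadata now.",
    "mgr.server handle_report rejecting report from mon,C, since we do not have its metadata now.",
    "some unrelated line",
]
print(count_rejections(sample))
```

To run it against a real log, pass an open file instead of the sample list, for example the /var/log/ceph/ceph-mgr.A.log path from step 5.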

#4 Updated by Mykola Golub 7 months ago

  • Status changed from New to Need Review
  • Pull request ID set to 25725

#5 Updated by Kefu Chai 7 months ago

  • Assignee set to Xinying Song

#6 Updated by Kefu Chai 7 months ago

  • Category set to ceph-mgr

#7 Updated by Kefu Chai 7 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to mimic,luminous

#8 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #38109: mimic: mgr will refuse connection from the monitor who starts behind it added

#9 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #38110: luminous: mgr will refuse connection from the monitor who starts behind it added
