Bug #37753: mgr will refuse connection from the monitor who starts behind it - mgr - Ceph

Bug #37753

closed

mgr will refuse connection from the monitor who starts behind it

Added by Xinying Song over 5 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Xinying Song

Category:

ceph-mgr

Target version:

% Done:

Source:

Tags:

Backport:

mimic,luminous

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v12.2.5

ceph-qa-suite:

Pull request ID:

25725

Crash signature (v1):

Crash signature (v2):

Description

For example, take a 3-monitor cluster in which mon-A and mon-B are active and in quorum. Start a mgr, then start mon-C. The mgr does not recognize mon-C. When we send a query command such as 'ceph pg dump' to mon-C, mon-C tries to connect to the mgr, but the mgr marks the session down after a few steps. If we are lucky, the query result reaches mon-C before the mgr marks the session down. If not, mon-C keeps retrying the connection to the mgr until it succeeds.

Related issues 2 (0 open — 2 closed)

Updated by Xinying Song over 5 years ago

PR: https://github.com/ceph/ceph/pull/25725

Updated by Xinying Song over 5 years ago

It seems no one has picked up this issue yet, but I'd still like to give a more detailed description.

I'm using `juju` to deploy the Ceph environment. The Ceph version is Luminous. The charm files for juju are downloaded from the official charm store, and we use the ceph_exporter provided by DigitalOcean for Prometheus.
Once all those components were deployed, we found that queries to ceph_exporter were dramatically slow. Further investigation showed that ceph_exporter sends a `pg dump` command to monitor-A through the librados interface 'rados_mon_command'. monitor-A then delegates the query to mgr-A and waits for the result. However, monitor-A repeatedly failed to read the result from mgr-A even though mgr-A had returned it successfully, and mon-A kept resending the query until it finally got the result. According to the monitor log, when monitor-A tried to read the result returned by mgr-A, it got a 'peer close connection' error.
After reading a lot of the related source code, we found the root cause: the mgr-A service was started before mon-A joined the quorum, so it does not know about mon-A. When mon-A tries to establish a connection to mgr-A, mgr-A first accepts it (in DaemonServer::handle_open()). Later mon-A (strictly speaking, the mgr client inside mon-A) sends an MMgrReport message to mgr-A. mgr-A handles it (in DaemonServer::handle_report()), finds that it has no knowledge of mon-A (in DaemonServer::daemon_state), and closes the connection on its own.

Although this problem can be avoided by starting the Ceph components in strictly the right order, we still think Ceph could handle it more gracefully. All we need to do is handle monmap changes in the mgr.
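
The failure mode and the proposed fix can be illustrated with a toy model. This is only a sketch, not Ceph's code: `ToyMgr`, `on_monmap`, the `handle_monmap_changes` flag, and the list-of-names monmap are hypothetical stand-ins, and only the names `handle_open`, `handle_report`, and `daemon_state` are borrowed from the description above. It assumes the mgr knows only the monitors that were present when it started.

```python
class ToyMgr:
    """Toy stand-in for the mgr's DaemonServer; not the real implementation."""

    def __init__(self, known_mons, handle_monmap_changes):
        # daemon_state holds the daemons the mgr has metadata for (assumed
        # here to be exactly the monitors present at mgr startup).
        self.daemon_state = {("mon", name) for name in known_mons}
        self.handle_monmap_changes = handle_monmap_changes
        self.sessions = set()

    def on_monmap(self, monmap):
        # Proposed fix: refresh daemon_state whenever the monmap changes.
        if self.handle_monmap_changes:
            self.daemon_state |= {("mon", name) for name in monmap}

    def handle_open(self, key):
        # The connection is always accepted at first.
        self.sessions.add(key)

    def handle_report(self, key):
        # A report from an unknown daemon makes the mgr drop the session.
        if key not in self.daemon_state:
            self.sessions.discard(key)
            return False
        return True


def late_monitor_report_accepted(handle_monmap_changes):
    """Start the mgr with mon.A and mon.B, then let mon.C join and report."""
    mgr = ToyMgr(known_mons=["A", "B"],
                 handle_monmap_changes=handle_monmap_changes)
    mgr.on_monmap(["A", "B", "C"])  # mon.C joins after the mgr started
    mgr.handle_open(("mon", "C"))
    return mgr.handle_report(("mon", "C"))
```

In this model, `late_monitor_report_accepted(False)` returns `False` (the report from mon.C is rejected and the session dropped), while `late_monitor_report_accepted(True)` returns `True`.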

Updated by Xinying Song over 5 years ago

Here is a simple way to observe the problem in the mgr:
1. Prepare ceph.conf with 3 monitors.
2. Init and start mon.A and mon.B.
3. Init and start mgr.A with debug_mgr=5.
4. Init and start mon.C.
5. tail -f /var/log/ceph/ceph-mgr.A.log | grep 'mon,'

Then you will see log lines like 'mgr.server handle_report rejecting report from mon,C, since we do not have its metadata now.' occurring periodically. This indicates that the mgr does not update its daemon_state info as expected.
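
As an alternative to watching the log by eye, the rejections can be tallied per daemon. The sketch below assumes the log lines contain the message exactly as quoted above, with the daemon written as `type,name`; the function names and the regular expression are illustrative, not part of Ceph.

```python
import re
from collections import Counter

# Matches the rejection message quoted above; the "type,name" key format
# is taken from that message and assumed to be stable.
REJECT_RE = re.compile(r"handle_report rejecting report from (\w+),([^,\s]+)")


def count_rejections(lines):
    """Return a Counter mapping (daemon_type, daemon_name) to rejection count."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def count_rejections_in_file(path):
    """Tally rejections in an mgr log file, e.g. the ceph-mgr.A.log above."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_rejections(log)
```

Calling `count_rejections_in_file("/var/log/ceph/ceph-mgr.A.log")` after step 4 would show whether the count for `("mon", "C")` keeps growing across runs.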

Updated by Mykola Golub over 5 years ago

Status changed from New to Fix Under Review
Pull request ID set to 25725

Updated by Kefu Chai over 5 years ago

Assignee set to Xinying Song

Updated by Kefu Chai over 5 years ago

Category set to ceph-mgr

Updated by Kefu Chai about 5 years ago

Status changed from Fix Under Review to Pending Backport
Backport set to mimic,luminous

Updated by Nathan Cutler about 5 years ago

Copied to Backport #38109: mimic: mgr will refuse connection from the monitor who starts behind it added

Updated by Nathan Cutler about 5 years ago

Copied to Backport #38110: luminous: mgr will refuse connection from the monitor who starts behind it added

Updated by Nathan Cutler about 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
