Bug #19134: mds suicide during multimds test

closed

Added by Zheng Yan about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: Kefu Chai

Severity: 3 - minor

Description

2017-03-02 15:08:30.681545 7f7bcb897700  0 mds.0.bal   mds.0 mdsload<[863.489,828.768 2521.02]/[15.2384,687.851 1390.94], req 4.36541e+06, hr 0, qlen 13, cpu 2.39> = 4.36784e+06 ~ 2521.02
2017-03-02 15:08:30.681561 7f7bcb897700  0 mds.0.bal   mds.1 mdsload<[0,0 0]/[0,0 0], req 2.6468e+06, hr 0, qlen 0, cpu 0.61> = 2.65163e+06 ~ 1530.47
2017-03-02 15:08:30.681568 7f7bcb897700  0 mds.0.bal   mds.2 mdsload<[0,0 0]/[0,0 0], req 2.63005e+06, hr 0, qlen 0, cpu 2.39> = 2.63005e+06 ~ 1518.01
2017-03-02 15:08:30.681574 7f7bcb897700  0 mds.0.bal   mds.3 mdsload<[0,0 0]/[0,0 0], req 2.63264e+06, hr 0, qlen 0, cpu 1.42> = 2.63264e+06 ~ 1519.5
2017-03-02 15:08:30.681580 7f7bcb897700  0 mds.0.bal   mds.4 mdsload<[0,0 0]/[0,0 0], req 2.59067e+06, hr 0, qlen 0, cpu 0.68> = 2.59067e+06 ~ 1495.28
2017-03-02 15:08:30.681586 7f7bcb897700  0 mds.0.bal   mds.5 mdsload<[0,0 0]/[0,0 0], req 3.83805e+06, hr 0, qlen 0, cpu 1.42> = 3.84498e+06 ~ 2219.24
2017-03-02 15:08:30.681592 7f7bcb897700  0 mds.0.bal   mds.6 mdsload<[0,0 0]/[0,0 0], req 3.88322e+06, hr 0, qlen 0, cpu 0.68> = 3.88327e+06 ~ 2241.34
2017-03-02 15:08:50.881856 7f7bc8891700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:50.881870 7f7bc8891700  1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:08:52.547892 7f7bcc899700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:54.881952 7f7bc8891700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:54.881968 7f7bc8891700  1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:08:57.548010 7f7bcc899700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:58.882016 7f7bc8891700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:58.882028 7f7bc8891700  1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:09:02.548127 7f7bcc899700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:09:02.634374 7f7bc9092700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2017-03-02 15:09:02.636029 7f7bcb897700  1 mds.d handle_mds_map i (172.21.15.22:6805/2203629154) dne in the mdsmap, new instance has larger gid 4121, suicide
2017-03-02 15:09:02.636034 7f7bcb897700  1 mds.d suicide.  wanted state up:active
2017-03-02 15:09:02.636362 7f7bcb897700  1 mds.0.6 shutdown: shutting down rank 0

This happened in a 7-MDS cluster. It looks like a busy loop caused the heartbeat timeout. The strange part is that the log reports a new instance, even though there is no standby MDS.
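To make the sequence in the log easier to follow, here is a minimal sketch of a heartbeat check with a 15-second grace, matching the "had timed out after 15" messages. It is a toy model and not Ceph's HeartbeatMap: the names ToyHeartbeat and try_send_beacon are hypothetical, and time is plain integer seconds.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: NOT Ceph's HeartbeatMap. All names here are hypothetical.
struct ToyHeartbeat {
  int64_t grace_s;     // allowed gap between resets (15 in the log above)
  int64_t deadline_s;  // time after which the worker counts as timed out

  ToyHeartbeat(int64_t now_s, int64_t grace)
      : grace_s(grace), deadline_s(now_s + grace) {
    assert(grace > 0);
  }

  // Called by the worker thread whenever it makes progress.
  void reset_timeout(int64_t now_s) { deadline_s = now_s + grace_s; }

  // Called by other threads to check on the worker.
  bool is_healthy(int64_t now_s) const { return now_s <= deadline_s; }
};

// Returns true if a beacon would be sent, false if it is skipped because
// the heartbeat is unhealthy ("skipping beacon, heartbeat map not healthy").
bool try_send_beacon(const ToyHeartbeat& hb, int64_t now_s) {
  return hb.is_healthy(now_s);
}
```

In this model, a worker that last reset its timeout at t=30 and then spins in a busy loop has its beacons skipped from t=46 onward, until reset_timeout() is called again.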

Files

mon.0.log (994 KB)

Zheng Yan, 03/03/2017 02:32 AM

Updated by Zheng Yan about 7 years ago

Category set to 90

Updated by Zheng Yan about 7 years ago

Status changed from New to 12

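The reason is as follows.

Beacon::is_laggy() calls MonClient::reopen_session(). MonClient::_reopen_session() resets MonClient::active_con, which causes MonClient::get_global_id() to return 0 until the MAuthReply is received. The global_id of 0 confuses MDSDaemon::handle_mds_map(): the daemon does not find itself in the mdsmap, and the existing entry with gid 4121 (presumably its own) looks like a newer instance with a larger gid, so it suicides.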
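The race can be illustrated with a small self-contained model. This is a sketch under assumptions, not the actual Ceph code: ToyMonClient, ToyMdsMap and toy_handle_mds_map are hypothetical stand-ins, the map is reduced to daemon name -> gid, and the suicide condition is inferred from the log message "new instance has larger gid 4121".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for MonClient: the gid is only known while an
// authenticated connection is active.
struct ToyMonClient {
  bool has_active_con = false;
  uint64_t active_con_gid = 0;

  // Returns 0 while no connection is active (the confusing case).
  uint64_t get_global_id() const { return has_active_con ? active_con_gid : 0; }

  // Drops the active connection; the gid is unknown until auth completes.
  void reopen_session() { has_active_con = false; }

  // Models receipt of the auth reply that carries the gid.
  void handle_auth_reply(uint64_t gid) {
    assert(gid != 0);  // in this model a real gid is never 0
    has_active_con = true;
    active_con_gid = gid;
  }
};

// Hypothetical, heavily simplified MDS map: daemon name -> gid.
typedef std::map<std::string, uint64_t> ToyMdsMap;

enum class Decision { KeepRunning, Suicide };

// Assumed logic: if our gid is absent from the map but an entry with our
// name has a larger gid, treat it as a newer instance and give up.
Decision toy_handle_mds_map(const ToyMonClient& monc,
                            const std::string& my_name,
                            const ToyMdsMap& mdsmap) {
  const uint64_t my_gid = monc.get_global_id();
  for (ToyMdsMap::const_iterator it = mdsmap.begin(); it != mdsmap.end(); ++it) {
    if (it->second == my_gid) return Decision::KeepRunning;  // found ourselves
  }
  ToyMdsMap::const_iterator same_name = mdsmap.find(my_name);
  if (same_name != mdsmap.end() && same_name->second > my_gid) {
    return Decision::Suicide;  // "new instance has larger gid"
  }
  return Decision::KeepRunning;
}
```

With a map containing {"d": 4121}, the toy daemon keeps running while its client reports gid 4121, but decides to suicide if the same map is handled between reopen_session() and handle_auth_reply(), because 4121 > 0.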
