Bug #19134

mds suicide during multimds test

Added by Zheng Yan 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 03/03/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

2017-03-02 15:08:30.681545 7f7bcb897700  0 mds.0.bal   mds.0 mdsload<[863.489,828.768 2521.02]/[15.2384,687.851 1390.94], req 4.36541e+06, hr 0, qlen 13, cpu 2.39> = 4.36784e+06 ~ 2521.02
2017-03-02 15:08:30.681561 7f7bcb897700  0 mds.0.bal   mds.1 mdsload<[0,0 0]/[0,0 0], req 2.6468e+06, hr 0, qlen 0, cpu 0.61> = 2.65163e+06 ~ 1530.47
2017-03-02 15:08:30.681568 7f7bcb897700  0 mds.0.bal   mds.2 mdsload<[0,0 0]/[0,0 0], req 2.63005e+06, hr 0, qlen 0, cpu 2.39> = 2.63005e+06 ~ 1518.01
2017-03-02 15:08:30.681574 7f7bcb897700  0 mds.0.bal   mds.3 mdsload<[0,0 0]/[0,0 0], req 2.63264e+06, hr 0, qlen 0, cpu 1.42> = 2.63264e+06 ~ 1519.5
2017-03-02 15:08:30.681580 7f7bcb897700  0 mds.0.bal   mds.4 mdsload<[0,0 0]/[0,0 0], req 2.59067e+06, hr 0, qlen 0, cpu 0.68> = 2.59067e+06 ~ 1495.28
2017-03-02 15:08:30.681586 7f7bcb897700  0 mds.0.bal   mds.5 mdsload<[0,0 0]/[0,0 0], req 3.83805e+06, hr 0, qlen 0, cpu 1.42> = 3.84498e+06 ~ 2219.24
2017-03-02 15:08:30.681592 7f7bcb897700  0 mds.0.bal   mds.6 mdsload<[0,0 0]/[0,0 0], req 3.88322e+06, hr 0, qlen 0, cpu 0.68> = 3.88327e+06 ~ 2241.34
2017-03-02 15:08:50.881856 7f7bc8891700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:50.881870 7f7bc8891700  1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:08:52.547892 7f7bcc899700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:54.881952 7f7bc8891700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:54.881968 7f7bc8891700  1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:08:57.548010 7f7bcc899700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:58.882016 7f7bc8891700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:58.882028 7f7bc8891700  1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:09:02.548127 7f7bcc899700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:09:02.634374 7f7bc9092700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2017-03-02 15:09:02.636029 7f7bcb897700  1 mds.d handle_mds_map i (172.21.15.22:6805/2203629154) dne in the mdsmap, new instance has larger gid 4121, suicide
2017-03-02 15:09:02.636034 7f7bcb897700  1 mds.d suicide.  wanted state up:active
2017-03-02 15:09:02.636362 7f7bcb897700  1 mds.0.6 shutdown: shutting down rank 0

This happened in a 7-MDS cluster. It looks like a busy loop caused the heartbeat timeout. The strange part is why the log reports a new instance, given that there is no standby MDS.

mon.0.log (994 KB) Zheng Yan, 03/03/2017 02:32 AM

History

#1 Updated by Zheng Yan 5 months ago

  • Category set to multi-MDS

#2 Updated by Zheng Yan 5 months ago

  • Status changed from New to Verified

The reason is that Beacon::is_laggy() calls MonClient::reopen_session(). MonClient::_reopen_session() resets MonClient::active_con, which causes MonClient::get_global_id() to return 0 until the MAuthReply is received. The global_id of 0 confuses MDSDaemon::handle_mds_map().
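
The sequence above can be illustrated with a minimal, self-contained sketch. This is not the Ceph code: SketchMonClient, SketchMDSMap, MapAction, and the comparison inside this handle_mds_map are hypothetical stand-ins, and the "larger gid" check is an assumption inferred from the log message. It only shows how a temporary global_id of 0 could make a daemon's own map entry look like a newer instance.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for MonClient: the global id is only available
// while an authenticated monitor connection exists.
struct SketchMonClient {
  bool has_active_con = false;
  std::uint64_t authenticated_gid = 0;

  // Models the session reset: the active connection is dropped.
  void reopen_session() { has_active_con = false; }

  // Models receipt of the auth reply: the gid becomes available again.
  void handle_auth_reply(std::uint64_t gid) {
    authenticated_gid = gid;
    has_active_con = true;
  }

  // Returns 0 while no active connection exists.
  std::uint64_t get_global_id() const {
    return has_active_con ? authenticated_gid : 0;
  }
};

// Hypothetical stand-in for the MDS map: daemon name -> gid.
struct SketchMDSMap {
  std::map<std::string, std::uint64_t> gid_by_name;
};

enum class MapAction { Continue, Suicide };

// Assumed check: if the map holds an entry for this daemon's name with a
// larger gid than the one the client currently reports, treat it as a
// newer instance and shut down.
MapAction handle_mds_map(const SketchMonClient& monc,
                         const SketchMDSMap& mdsmap,
                         const std::string& my_name) {
  const std::uint64_t my_gid = monc.get_global_id();
  const auto it = mdsmap.gid_by_name.find(my_name);
  if (it != mdsmap.gid_by_name.end() && it->second > my_gid) {
    return MapAction::Suicide;
  }
  return MapAction::Continue;
}
```

In this sketch, a daemon named "d" whose map entry carries gid 4121 continues normally while its client reports 4121, but calling reopen_session() and handling a map before the next handle_auth_reply() makes the same entry compare as larger than 0. Whether 4121 in the log above was the daemon's own gid is an assumption made here for illustration.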

#3 Updated by Zheng Yan 5 months ago

  • Status changed from Verified to Need Review

#4 Updated by Kefu Chai 4 months ago

  • Project changed from fs to Ceph
  • Category deleted (multi-MDS)
  • Status changed from Need Review to Resolved
  • Assignee set to Kefu Chai
