Bug #19134
closed
mds suicide during multimds test
% Done: 0%
Regression: No
Severity: 3 - minor
Description
2017-03-02 15:08:30.681545 7f7bcb897700 0 mds.0.bal mds.0 mdsload<[863.489,828.768 2521.02]/[15.2384,687.851 1390.94], req 4.36541e+06, hr 0, qlen 13, cpu 2.39> = 4.36784e+06 ~ 2521.02
2017-03-02 15:08:30.681561 7f7bcb897700 0 mds.0.bal mds.1 mdsload<[0,0 0]/[0,0 0], req 2.6468e+06, hr 0, qlen 0, cpu 0.61> = 2.65163e+06 ~ 1530.47
2017-03-02 15:08:30.681568 7f7bcb897700 0 mds.0.bal mds.2 mdsload<[0,0 0]/[0,0 0], req 2.63005e+06, hr 0, qlen 0, cpu 2.39> = 2.63005e+06 ~ 1518.01
2017-03-02 15:08:30.681574 7f7bcb897700 0 mds.0.bal mds.3 mdsload<[0,0 0]/[0,0 0], req 2.63264e+06, hr 0, qlen 0, cpu 1.42> = 2.63264e+06 ~ 1519.5
2017-03-02 15:08:30.681580 7f7bcb897700 0 mds.0.bal mds.4 mdsload<[0,0 0]/[0,0 0], req 2.59067e+06, hr 0, qlen 0, cpu 0.68> = 2.59067e+06 ~ 1495.28
2017-03-02 15:08:30.681586 7f7bcb897700 0 mds.0.bal mds.5 mdsload<[0,0 0]/[0,0 0], req 3.83805e+06, hr 0, qlen 0, cpu 1.42> = 3.84498e+06 ~ 2219.24
2017-03-02 15:08:30.681592 7f7bcb897700 0 mds.0.bal mds.6 mdsload<[0,0 0]/[0,0 0], req 3.88322e+06, hr 0, qlen 0, cpu 0.68> = 3.88327e+06 ~ 2241.34
2017-03-02 15:08:50.881856 7f7bc8891700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:50.881870 7f7bc8891700 1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:08:52.547892 7f7bcc899700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:54.881952 7f7bc8891700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:54.881968 7f7bc8891700 1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:08:57.548010 7f7bcc899700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:58.882016 7f7bc8891700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:08:58.882028 7f7bc8891700 1 mds.beacon.d _send skipping beacon, heartbeat map not healthy
2017-03-02 15:09:02.548127 7f7bcc899700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-02 15:09:02.634374 7f7bc9092700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2017-03-02 15:09:02.636029 7f7bcb897700 1 mds.d handle_mds_map i (172.21.15.22:6805/2203629154) dne in the mdsmap, new instance has larger gid 4121, suicide
2017-03-02 15:09:02.636034 7f7bcb897700 1 mds.d suicide. wanted state up:active
2017-03-02 15:09:02.636362 7f7bcb897700 1 mds.0.6 shutdown: shutting down rank 0
This happened in a 7-MDS cluster. It looks like a busy loop caused the heartbeat timeout. The strange part is that the log reports a new instance, even though there is no standby MDS.
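To put a number on the stall, the timestamps in the log can be compared directly. The snippet below is only an illustrative sketch: it takes three lines verbatim from the log above and measures the time from the last balancer message to the first heartbeat warning and to the `reset_timeout` message. The helper names (`parse_ts`, `seconds_between`) are made up for this example, and reading the gap as a busy loop is the reporter's guess, not something the script proves.

```python
from datetime import datetime

# Three lines copied verbatim from the log in this report.
LOG_LINES = [
    "2017-03-02 15:08:30.681592 7f7bcb897700 0 mds.0.bal mds.6 mdsload<[0,0 0]/[0,0 0], req 3.88322e+06, hr 0, qlen 0, cpu 0.68> = 3.88327e+06 ~ 2241.34",
    "2017-03-02 15:08:50.881856 7f7bc8891700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15",
    "2017-03-02 15:09:02.634374 7f7bc9092700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15",
]


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f")


def seconds_between(first, second):
    """Seconds elapsed from the timestamp of `first` to that of `second`."""
    return (parse_ts(second) - parse_ts(first)).total_seconds()


last_balancer, first_warning, heartbeat_reset = LOG_LINES

# Time until the heartbeat map first reported 'MDSRank' as timed out.
until_warning = seconds_between(last_balancer, first_warning)
# Time until the 'MDSRank' heartbeat was reset again.
until_reset = seconds_between(last_balancer, heartbeat_reset)

print(f"first warning after {until_warning:.1f}s, reset after {until_reset:.1f}s")
```

Both gaps are longer than the 15-second grace quoted in the `had timed out after 15` messages, which fits the picture of the rank being unresponsive for that whole period.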
Updated by Zheng Yan about 7 years ago
- Status changed from New to 12
The reason is that Beacon::is_laggy() calls MonClient::reopen_session(). MonClient::_reopen_session() resets MonClient::active_con, which causes MonClient::get_global_id() to return 0 until the MAuthReply is received. The global_id of 0 confuses MDSDaemon::handle_mds_map().
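The effect of a transient global_id of 0 can be shown with a toy model. This is a simplified sketch under stated assumptions, not the Ceph code: the map is a plain dict from gid to daemon name, `check_mds_map` and its return values are hypothetical, and it assumes the entry with gid 4121 in the log is the daemon's own entry (the report says there is no standby).

```python
def check_mds_map(mdsmap, name, my_gid):
    """Toy version of the gid check described above (hypothetical helper).

    mdsmap: dict mapping gid -> daemon name (a stand-in for the real MDSMap).
    Returns "in_map" if our gid is present, "suicide" if another entry with
    our name has a larger gid, and "not_in_map" otherwise.
    """
    if my_gid in mdsmap:
        return "in_map"
    for gid, daemon_name in mdsmap.items():
        # An entry with our name and a larger gid looks like a newer instance.
        if daemon_name == name and gid > my_gid:
            return "suicide"
    return "not_in_map"


# The map contains only this daemon, registered under gid 4121 (assumption).
mdsmap = {4121: "d"}

# Normal case: the monitor session is established and the gid is known.
normal = check_mds_map(mdsmap, "d", my_gid=4121)

# During session reopen the gid temporarily reads as 0, so the daemon no
# longer finds itself and mistakes its own entry for a newer instance.
during_reopen = check_mds_map(mdsmap, "d", my_gid=0)

print(normal, during_reopen)
```

In this model the daemon's own map entry is what triggers the "new instance has larger gid" path, which would explain the message appearing without any standby MDS.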
Updated by Zheng Yan about 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai about 7 years ago
- Project changed from CephFS to Ceph
- Category deleted (90)
- Status changed from Fix Under Review to Resolved
- Assignee set to Kefu Chai