Bug #5290
mds: crash whilst trying to reconnect
Status: Closed
Description
Hi,
Recently I experienced an issue with the MDS servers in my cluster. The cluster storage itself was absolutely fine, but whenever an MDS was in the reconnect state it would segfault. The last entries in the log file look something like this:
-25> 2013-06-07 09:17:09.263806 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.32>
-24> 2013-06-07 09:17:11.085358 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 11 (currently up:reconnect)
-23> 2013-06-07 09:17:11.085401 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
-22> 2013-06-07 09:17:11.086720 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 11 rtt 0.001332
-21> 2013-06-07 09:17:14.263846 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.45>
-20> 2013-06-07 09:17:15.085497 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 12 (currently up:reconnect)
-19> 2013-06-07 09:17:15.085567 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
-18> 2013-06-07 09:17:15.086674 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 12 rtt 0.001147
-17> 2013-06-07 09:17:17.262087 7f5d652a0700 10 monclient: tick
-16> 2013-06-07 09:17:17.262117 7f5d652a0700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-06-07 09:16:47.262115)
-15> 2013-06-07 09:17:17.262145 7f5d652a0700 10 monclient: renew subs? (now: 2013-06-07 09:17:17.262145; renew after: 2013-06-07 09:19:04.581996) -- no
-14> 2013-06-07 09:17:19.085675 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 13 (currently up:reconnect)
-13> 2013-06-07 09:17:19.085720 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
-12> 2013-06-07 09:17:19.086956 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 13 rtt 0.001248
-11> 2013-06-07 09:17:19.263898 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.5>
-10> 2013-06-07 09:17:23.085831 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 14 (currently up:reconnect)
 -9> 2013-06-07 09:17:23.085883 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
 -8> 2013-06-07 09:17:23.087073 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 14 rtt 0.001206
 -7> 2013-06-07 09:17:24.264002 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.62>
 -6> 2013-06-07 09:17:24.264057 7f5d6419d700 10 mds.0.server reconnect timed out
 -5> 2013-06-07 09:17:24.264061 7f5d6419d700  1 mds.0.server reconnect gave up on client.0 172.22.10.25:0/4275051773
 -4> 2013-06-07 09:17:24.264071 7f5d6419d700  1 mds.0.server reconnect gave up on client.5664 172.22.10.33:0/3106616745
 -3> 2013-06-07 09:17:24.264082 7f5d6419d700  1 mds.0.server reconnect gave up on client.5693 172.22.10.30:0/10349
 -2> 2013-06-07 09:17:24.264084 7f5d6419d700  1 mds.0.server reconnect gave up on client.5706 172.22.10.37:0/3475748832
 -1> 2013-06-07 09:17:24.264086 7f5d6419d700  1 mds.0.server reconnect gave up on client.16777216 172.22.10.24:0/1217318668
  0> 2013-06-07 09:17:24.286387 7f5d6419d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5d6419d70

 ceph version 0.63 (054e96cf79e960894ef7e33a4d13635d3ad2a1b9)
 1: /usr/bin/ceph-mds() [0x8556ba]
 2: (()+0xfcb0) [0x7f5d6ae98cb0]
 3: (Server::reconnect_tick()+0x2d5) [0x517f95]
 4: (MDS::tick()+0x36c) [0x4bec6c]
 5: (Context::complete(int)+0xa) [0x4b521a]
 6: (SafeTimer::timer_thread()+0x425) [0x7a8175]
 7: (SafeTimerThread::entry()+0xd) [0x7a8dad]
 8: (()+0x7e9a) [0x7f5d6ae90e9a]
 9: (clone()+0x6d) [0x7f5d69d31ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I traced this back to the reconnect_tick method in src/mds/Server.cc. Specifically, the call to mds->sessionmap.get_session was returning NULL, so dereferencing the result to log "gave up on client" produced the segfault.
--- ceph-0.61.3.orig/src/mds/Server.cc
+++ ceph-0.61.3/src/mds/Server.cc
@@ -684,7 +684,10 @@ void Server::reconnect_tick()
        p != client_reconnect_gather.end();
        ++p) {
     Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(p->v));
-    dout(1) << "reconnect gave up on " << session->info.inst << dendl;
+    if (session == NULL)
+      dout(1) << "reconnect to NULL session" << dendl;
+    else
+      dout(1) << "reconnect gave up on " << session->info.inst << dendl;
     failed_reconnects++;
   }
   client_reconnect_gather.clear();
Applying the above patch results in a working MDS (though I have little idea what the implications are), and a number of NULL sessions are logged:
2013-06-10 10:03:05.364613 7f320962b700 10 mds.0.server reconnect timed out
2013-06-10 10:03:05.364617 7f320962b700  1 mds.0.server reconnect gave up on client.0 172.22.10.25:0/4275051773
2013-06-10 10:03:05.364628 7f320962b700  1 mds.0.server reconnect gave up on client.5664 172.22.10.33:0/3106616745
2013-06-10 10:03:05.364630 7f320962b700  1 mds.0.server reconnect gave up on client.5693 172.22.10.30:0/10349
2013-06-10 10:03:05.364632 7f320962b700  1 mds.0.server reconnect gave up on client.5706 172.22.10.37:0/3475748832
2013-06-10 10:03:05.364637 7f320962b700  1 mds.0.server reconnect gave up on client.16777216 172.22.10.24:0/1217318668
2013-06-10 10:03:05.364639 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364641 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364642 7f320962b700  1 mds.0.server reconnect gave up on client.0 172.22.10.25:0/4275051773
2013-06-10 10:03:05.364643 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364644 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364645 7f320962b700  1 mds.0.server reconnect to NULL session
I did try with 0.63 and the issue is still present there.
Updated by Zheng Yan almost 11 years ago
Looks like session map corruption.
Damien, please upload the session map. You can find where it is with "ceph osd map metadata mds0_sessionmap".
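For reference, one way to grab the object once it has been located: use "ceph osd map" to find which PG/OSD holds it, then fetch it from the pool with rados. This is a sketch; the pool name "metadata" matches the default CephFS metadata pool of this era, and the output filename is just a chosen example.

```shell
# Show which PG and OSDs hold the sessionmap object in the metadata pool.
ceph osd map metadata mds0_sessionmap

# Fetch the object into a local file (sessionmap.bin is an arbitrary name)
# so it can be attached to the ticket.
rados -p metadata get mds0_sessionmap sessionmap.bin
```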
Updated by Damien Churchill almost 11 years ago
Hi Zheng,
Is this what you mean?
Updated by Sage Weil almost 11 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Updated by Sage Weil almost 11 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-20_13:32:57-fs-master-testing-basic/41231
logs in logs subdir
same with
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-20_13:32:57-fs-master-testing-basic/41230
Updated by Sage Weil over 10 years ago
I'm inclined to call this "can't reproduce". There was a locking fix recently that also covered the session_map, which could conceivably have been to blame.
Updated by Damien Churchill over 10 years ago
I certainly haven't been hit by this again, so if you consider it resolved...
Updated by Sage Weil over 10 years ago
- Status changed from New to Can't reproduce