Bug #5290

closed

mds: crash whilst trying to reconnect

Added by Damien Churchill almost 11 years ago. Updated over 10 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

Recently I experienced an issue with the MDS servers in my cluster: the cluster storage would be absolutely fine, but while in the reconnect state the MDS would segfault. The last entries in the log file look something along these lines:

   -25> 2013-06-07 09:17:09.263806 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.32>
   -24> 2013-06-07 09:17:11.085358 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 11 (currently up:reconnect)
   -23> 2013-06-07 09:17:11.085401 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
   -22> 2013-06-07 09:17:11.086720 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 11 rtt 0.001332
   -21> 2013-06-07 09:17:14.263846 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.45>
   -20> 2013-06-07 09:17:15.085497 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 12 (currently up:reconnect)
   -19> 2013-06-07 09:17:15.085567 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
   -18> 2013-06-07 09:17:15.086674 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 12 rtt 0.001147
   -17> 2013-06-07 09:17:17.262087 7f5d652a0700 10 monclient: tick
   -16> 2013-06-07 09:17:17.262117 7f5d652a0700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-06-07 09:16:47.262115)
   -15> 2013-06-07 09:17:17.262145 7f5d652a0700 10 monclient: renew subs? (now: 2013-06-07 09:17:17.262145; renew after: 2013-06-07 09:19:04.581996) -- no
   -14> 2013-06-07 09:17:19.085675 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 13 (currently up:reconnect)
   -13> 2013-06-07 09:17:19.085720 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
   -12> 2013-06-07 09:17:19.086956 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 13 rtt 0.001248
   -11> 2013-06-07 09:17:19.263898 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.5>
   -10> 2013-06-07 09:17:23.085831 7f5d6419d700 10 mds.0.109 beacon_send up:reconnect seq 14 (currently up:reconnect)
    -9> 2013-06-07 09:17:23.085883 7f5d6419d700 10 monclient: _send_mon_message to mon.node22 at 172.22.10.22:6789/0
    -8> 2013-06-07 09:17:23.087073 7f5d662a2700 10 mds.0.109 handle_mds_beacon up:reconnect seq 14 rtt 0.001206
    -7> 2013-06-07 09:17:24.264002 7f5d6419d700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.62>
    -6> 2013-06-07 09:17:24.264057 7f5d6419d700 10 mds.0.server reconnect timed out
    -5> 2013-06-07 09:17:24.264061 7f5d6419d700  1 mds.0.server reconnect gave up on client.0 172.22.10.25:0/4275051773
    -4> 2013-06-07 09:17:24.264071 7f5d6419d700  1 mds.0.server reconnect gave up on client.5664 172.22.10.33:0/3106616745
    -3> 2013-06-07 09:17:24.264082 7f5d6419d700  1 mds.0.server reconnect gave up on client.5693 172.22.10.30:0/10349
    -2> 2013-06-07 09:17:24.264084 7f5d6419d700  1 mds.0.server reconnect gave up on client.5706 172.22.10.37:0/3475748832
    -1> 2013-06-07 09:17:24.264086 7f5d6419d700  1 mds.0.server reconnect gave up on client.16777216 172.22.10.24:0/1217318668
     0> 2013-06-07 09:17:24.286387 7f5d6419d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5d6419d700

 ceph version 0.63 (054e96cf79e960894ef7e33a4d13635d3ad2a1b9)
 1: /usr/bin/ceph-mds() [0x8556ba]
 2: (()+0xfcb0) [0x7f5d6ae98cb0]
 3: (Server::reconnect_tick()+0x2d5) [0x517f95]
 4: (MDS::tick()+0x36c) [0x4bec6c]
 5: (Context::complete(int)+0xa) [0x4b521a]
 6: (SafeTimer::timer_thread()+0x425) [0x7a8175]
 7: (SafeTimerThread::entry()+0xd) [0x7a8dad]
 8: (()+0x7e9a) [0x7f5d6ae90e9a]
 9: (clone()+0x6d) [0x7f5d69d31ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I traced this back to the reconnect_tick method in src/mds/Server.cc: specifically, the call to mds->sessionmap.get_session was returning NULL, so the attempt to log "gave up on client" (which dereferences the session) produced the segfault.

--- ceph-0.61.3.orig/src/mds/Server.cc
+++ ceph-0.61.3/src/mds/Server.cc
@@ -684,7 +684,10 @@ void Server::reconnect_tick()
      p != client_reconnect_gather.end();
      ++p) {
       Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(p->v));
-      dout(1) << "reconnect gave up on " << session->info.inst << dendl;
+      if (session == NULL)
+        dout(1) << "reconnect to NULL session" << dendl;
+      else
+        dout(1) << "reconnect gave up on " << session->info.inst << dendl;
       failed_reconnects++;
     }
     client_reconnect_gather.clear();

Applying the above patch results in a working MDS (though I have little idea what the implications are), and a number of NULL sessions are logged:

2013-06-10 10:03:05.364613 7f320962b700 10 mds.0.server reconnect timed out
2013-06-10 10:03:05.364617 7f320962b700  1 mds.0.server reconnect gave up on client.0 172.22.10.25:0/4275051773
2013-06-10 10:03:05.364628 7f320962b700  1 mds.0.server reconnect gave up on client.5664 172.22.10.33:0/3106616745
2013-06-10 10:03:05.364630 7f320962b700  1 mds.0.server reconnect gave up on client.5693 172.22.10.30:0/10349
2013-06-10 10:03:05.364632 7f320962b700  1 mds.0.server reconnect gave up on client.5706 172.22.10.37:0/3475748832
2013-06-10 10:03:05.364637 7f320962b700  1 mds.0.server reconnect gave up on client.16777216 172.22.10.24:0/1217318668
2013-06-10 10:03:05.364639 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364641 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364642 7f320962b700  1 mds.0.server reconnect gave up on client.0 172.22.10.25:0/4275051773
2013-06-10 10:03:05.364643 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364644 7f320962b700  1 mds.0.server reconnect to NULL session
2013-06-10 10:03:05.364645 7f320962b700  1 mds.0.server reconnect to NULL session

I did try with 0.63 and the issue is still present there.
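
As an aside, here is a minimal sketch of a slightly more informative guard (my own illustration, not part of the patch above): it also logs the client id taken from the reconnect gather set, so the NULL-session entries could be matched to specific clients. It assumes the same loop variables as in the diff and is untested:

    Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(p->v));
    if (session == NULL)
      // no session for this client id; log the id from the gather set
      // instead of dereferencing a NULL pointer
      dout(1) << "reconnect gave up on client." << p->v
              << " (no session in sessionmap)" << dendl;
    else
      dout(1) << "reconnect gave up on " << session->info.inst << dendl;
    failed_reconnects++;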


Files

usessionmap__head_3270C60B__1 (17 Bytes) - Damien Churchill, 06/12/2013 06:10 AM
#1

Updated by Sage Weil almost 11 years ago

  • Priority changed from Normal to High
#2

Updated by Ian Colle almost 11 years ago

  • Assignee set to Sage Weil
#3

Updated by Zheng Yan almost 11 years ago

Looks like session map corruption.

Damien, please upload the session map. You can find where it is with "ceph osd map metadata mds0_sessionmap".

#4

Updated by Damien Churchill almost 11 years ago

Hi Zheng,

Is this what you mean?

#5

Updated by Sage Weil almost 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
#6

Updated by Sage Weil almost 11 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-20_13:32:57-fs-master-testing-basic/41231
logs are in the logs subdir

Same with:
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-20_13:32:57-fs-master-testing-basic/41230

#7

Updated by Sage Weil over 10 years ago

I'm inclined to call this "can't reproduce". There was a locking fix recently that covered the session_map too, which could conceivably be blamed.

#8

Updated by Damien Churchill over 10 years ago

I certainly haven't been hit by this again, so if you consider it resolved...

#9

Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce