Bug #19566

MDS crash on mgr message during shutdown

Added by John Spray about 1 month ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
04/10/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Component(FS):
Needs Doc:
No

Description

   -28> 2017-04-10 03:05:44.978454 1122e700 10 mds.beacon.a-s handle_mds_beacon down:dne seq 436 rtt 0.072707
   -27> 2017-04-10 03:05:44.997378 14e35700  1 mds.0.7 shutdown: shutting down rank 0
   -26> 2017-04-10 03:05:45.000413 14e35700  5 mds.0.log shutdown
   -25> 2017-04-10 03:05:45.005520 14e35700  1 mds.0.journaler.mdlog(rw) shutdown
   -24> 2017-04-10 03:05:45.009103 14e35700  7 mds.0.cache WARNING: mdcache shutdown with non-empty cache
   -23> 2017-04-10 03:05:45.010305 14e35700 15 mds.0.cache show_subtrees
   -22> 2017-04-10 03:05:45.010456 14e35700 10 mds.0.cache |__ 0    auth [dir 1 / [2,head] auth v=2969 cv=1/1 REP dir_auth=0 state=1610612738|complete f(v0 m2017-04-10 03:04:35.708557) n(v321 rc2017-04-10 03:04:35.708557) hs=0+1,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 authpin=0 0x132a0490]
   -21> 2017-04-10 03:05:45.010815 14e35700 10 mds.0.cache |__ 0    auth [dir 100 ~mds0/ [2,head] auth v=4143 cv=1/1 dir_auth=0 state=1610612738|complete f(v0 10=0+10) n(v180 rc2017-04-10 03:04:35.708557 b43297798 257=235+22) hs=10+0,ss=0+0 dirty=10 | child=1 subtree=1 dirty=1 authpin=0 0x132a07a8]
   -20> 2017-04-10 03:05:45.012075 14e35700  1 mds.0.journaler.pq(rw) shutdown
   -19> 2017-04-10 03:05:45.021235 14e35700  1 -- 172.21.15.21:6808/2917499623 >> 172.21.15.4:6805/25237 conn(0x13376ab0 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).mark_down
   -18> 2017-04-10 03:05:45.029119 14e35700  1 -- 172.21.15.21:6808/2917499623 >> 172.21.15.21:6804/23652 conn(0x1338bef0 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).mark_down
   -17> 2017-04-10 03:05:45.030067 14e35700  1 -- 172.21.15.21:6808/2917499623 >> 172.21.15.21:6800/23653 conn(0x1337c9d0 :-1 s=STATE_OPEN pgs=4 cs=1 l=1).mark_down
   -16> 2017-04-10 03:05:45.031359 e2b7700  1 -- 172.21.15.21:6808/2917499623 reap_dead start
   -15> 2017-04-10 03:05:45.055861 14e35700  5 asok(0xcec3fe0) unregister_command objecter_requests
   -14> 2017-04-10 03:05:45.057152 14e35700 10 monclient: shutdown
   -13> 2017-04-10 03:05:45.058490 14e35700  1 -- 172.21.15.21:6808/2917499623 >> 172.21.15.21:6789/0 conn(0x1a7692d0 :-1 s=STATE_OPEN pgs=36 cs=1 l=1).mark_down
   -12> 2017-04-10 03:05:45.062892 e2b7700  1 -- 172.21.15.21:6808/2917499623 reap_dead start
   -11> 2017-04-10 03:05:45.079585 14e35700  1 -- 172.21.15.21:6808/2917499623 shutdown_connections
   -10> 2017-04-10 03:05:45.087903 1122e700  4 mgrc ms_handle_reset ms_handle_reset con 0x132434e0
    -9> 2017-04-10 03:05:45.092045 1122e700  4 mgrc reconnect Terminating session with 172.21.15.4:6800/25066
    -8> 2017-04-10 03:05:45.093512 1122e700  1 -- 172.21.15.21:6808/2917499623 >> 172.21.15.4:6800/25066 conn(0x132434e0 :-1 s=STATE_CLOSED pgs=24 cs=1 l=1).mark_down
    -7> 2017-04-10 03:05:45.096545 1122e700  4 mgrc reconnect Starting new session with 172.21.15.4:6800/25066
    -6> 2017-04-10 03:05:45.099042 1122e700  1 -- 172.21.15.21:6808/2917499623 --> 172.21.15.4:6800/25066 -- mgropen(a-s) v1 -- 0xd063df0 con 0
    -5> 2017-04-10 03:05:45.112209 8f9a980  1 -- 172.21.15.21:6808/2917499623 shutdown_connections
    -4> 2017-04-10 03:05:45.116279 f2b9700 10 mds.a-s MDSDaemon::ms_get_authorizer type=mgr
    -3> 2017-04-10 03:05:45.116922 f2b9700  0 monclient: build_authorizer for mgr, but no auth is available now
    -2> 2017-04-10 03:05:45.119639 f2b9700  0 -- 172.21.15.21:6808/2917499623 >> 172.21.15.4:6800/25066 conn(0x19fd54f0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    -1> 2017-04-10 03:05:45.121514 f2b9700 10 mds.a-s MDSDaemon::ms_get_authorizer type=mgr
     0> 2017-04-10 03:05:45.195683 f2b9700 -1 *** Caught signal (Segmentation fault) **

Note that these failures are initially reported as valgrind failures rather than as a crash:
http://pulpito.ceph.com/jspray-2017-04-10_01:12:12-fs-wip-jcsp-testing-20170409-distro-basic-smithi/1006036/
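
When reading a dump like the one above, it can help to group the events by thread ID. Here that grouping makes the ordering easier to see: thread 14e35700 runs the shutdown (including `monclient: shutdown`), thread 1122e700 then has mgrc start a new session, and thread f2b9700 asks for a mgr authorizer, gets "no auth is available now", and is the thread that catches the segfault. Below is a minimal sketch of that grouping in Python. The regular expression is an assumption derived only from the lines in this ticket, and `group_by_thread` and `SAMPLE` are hypothetical names used for illustration; this is not a Ceph tool.

```python
import re
from collections import defaultdict

# Assumed line layout, inferred from the dump in this ticket:
#   "<index>> <date> <time> <thread id (hex)> <debug level> <message>"
LINE_RE = re.compile(
    r"^\s*(?P<idx>-?\d+)>\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+"
    r"(?P<level>-?\d+)\s+"
    r"(?P<msg>.*)$"
)


def group_by_thread(lines):
    """Return {thread_id: [(index, message), ...]} in input order."""
    by_thread = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # lines that do not fit the assumed layout are skipped
            by_thread[m.group("thread")].append(
                (int(m.group("idx")), m.group("msg"))
            )
    return dict(by_thread)


# A few lines copied from the dump above.
SAMPLE = [
    "   -27> 2017-04-10 03:05:44.997378 14e35700  1 mds.0.7 shutdown: shutting down rank 0",
    "   -14> 2017-04-10 03:05:45.057152 14e35700 10 monclient: shutdown",
    "    -7> 2017-04-10 03:05:45.096545 1122e700  4 mgrc reconnect Starting new session with 172.21.15.4:6800/25066",
    "    -3> 2017-04-10 03:05:45.116922 f2b9700  0 monclient: build_authorizer for mgr, but no auth is available now",
    "     0> 2017-04-10 03:05:45.195683 f2b9700 -1 *** Caught signal (Segmentation fault) **",
]

if __name__ == "__main__":
    for thread, events in group_by_thread(SAMPLE).items():
        print(thread)
        for idx, msg in events:
            print(f"  {idx:>4}> {msg}")
```

Feeding the full dump through the same function (for example, the lines of a saved log file) gives the per-thread view for every event, not just the sampled ones.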

History

#1 Updated by John Spray about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to John Spray

#2 Updated by John Spray about 1 month ago

  • Status changed from In Progress to Need Review

#3 Updated by John Spray about 1 month ago

  • Status changed from Need Review to Resolved
