Bug #43596


mds: crash when enabling msgr v2 due to lost contact

Added by Dan van der Ster over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, MDSMonitor
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just upgraded from Mimic v13.2.7 to Nautilus v14.2.6, and when we enable msgr v2 on the mon that an MDS is connected to, the MDS respawns.
Note that we run v1 on the nonstandard port 6790, so we need to enable v2 on the mons one by one.
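For reference, enabling v2 one mon at a time with a nonstandard v1 port means re-declaring each mon's address list individually. A rough sketch of the kind of command involved (the mon name and IP are taken from the monmap below; the exact invocation should be checked against the msgr2 transition docs):

    # add the default v2 port (3300) while keeping v1 on 6790, one mon at a time
    ceph mon set-addrs cephdwight-mon-1633994557 [v2:188.184.86.25:3300,v1:188.184.86.25:6790]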

Here's the MDS log from just after enabling v2 on the mon at 188.184.86.25:

   -42> 2020-01-14 11:59:38.628 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 19001 / 1013921 inodes have caps, 25557 caps, 0.0252061 caps per inode
   -41> 2020-01-14 11:59:39.629 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 19001 / 1013921 inodes have caps, 25557 caps, 0.0252061 caps per inode
   -40> 2020-01-14 11:59:40.629 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 18997 / 1013921 inodes have caps, 25553 caps, 0.0252022 caps per inode
   -39> 2020-01-14 11:59:41.401 7fb7034a4700 10 monclient: _send_mon_message to mon.cephdwight-mon-1633994557 at v1:188.184.86.25:6790/0
   -38> 2020-01-14 11:59:41.630 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 18997 / 1013921 inodes have caps, 25553 caps, 0.0252022 caps per inode
   -37> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: handle_monmap mon_map magic: 0 v1
   -36> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient:  got monmap 43 from mon.cephdwight-mon-1633994557 (according to old e43)
   -35> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: dump:
epoch 43
fsid dd535a7e-4647-4bee-853d-f34112615f81
last_changed 2020-01-14 11:59:41.883881
created 2014-03-06 14:41:11.600153
min_mon_release 14 (nautilus)
0: [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon.cephdwight-mon-1633994557
1: [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0] mon.cephdwight-mon-f7df6839c6
2: [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0] mon.cephdwight-mon-d8788e3256

   -34> 2020-01-14 11:59:41.888 7fb7064aa700  1 monclient:  mon.0 has (v2) addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] but i'm connected to v1:188.184.86.25:6790/0, reconnecting
   -33> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: _reopen_session rank -1
   -32> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-f7df6839c6 con 0x555b82718800 addr [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0]
   -31> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-d8788e3256 con 0x555b82718400 addr [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0]
   -30> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-1633994557 con 0x555ba0827c00 addr [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0]
   -29> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
   -28> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
   -27> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
   -26> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): _renew_subs
   -25> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): get_auth_request con 0x555b82718800 auth_method 0
   -24> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
   -23> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): _init_auth method 2
   -22> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more payload 9
   -21> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more payload_len 9
   -20> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
   -19> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): get_auth_request con 0x555b82718400 auth_method 0
   -18> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
   -17> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): _init_auth method 2
   -16> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more payload 9
   -15> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more payload_len 9
   -14> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
   -13> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient(hunting): handle_auth_done global_id 1024259360 payload 1198
   -12> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _finish_hunting 0
   -11> 2020-01-14 11:59:41.898 7fb708f39700  1 monclient: found mon.cephdwight-mon-d8788e3256
   -10> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _send_mon_message to mon.cephdwight-mon-d8788e3256 at v2:188.185.66.208:3300/0
    -9> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _finish_auth 0
    -8> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-01-14 11:59:11.898795)
    -7> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: handle_monmap mon_map magic: 0 v1
    -6> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient:  got monmap 43 from mon.cephdwight-mon-d8788e3256 (according to old e43)
    -5> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: dump:
epoch 43
fsid dd535a7e-4647-4bee-853d-f34112615f81
last_changed 2020-01-14 11:59:41.883881
created 2014-03-06 14:41:11.600153
min_mon_release 14 (nautilus)
0: [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon.cephdwight-mon-1633994557
1: [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0] mon.cephdwight-mon-f7df6839c6
2: [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0] mon.cephdwight-mon-d8788e3256

    -4> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: handle_config config(1 keys) v1
    -3> 2020-01-14 11:59:41.900 7fb7064aa700  1 mds.cephdwightmds2 Updating MDS map to version 70675 from mon.2
    -2> 2020-01-14 11:59:41.900 7fb704ca7700  4 set_mon_vals no callback set
    -1> 2020-01-14 11:59:41.900 7fb7064aa700  1 mds.cephdwightmds2 Map removed me (mds.-1 gid:1024259360) from cluster due to lost contact; respawning
     0> 2020-01-14 11:59:41.900 7fb7064aa700  1 mds.cephdwightmds2 respawn!

There was also a case where restarting a standby MDS pushed out a new mon map, which triggered the same `lost contact` respawn as above.

And at one point, while I was trying to stabilize the cluster, I saw the following.

MDS blacklisted:

2020-01-14 12:02:18.959 7effe452e700  1 mds.cephdwightmds1 Updating MDS map to version 70680 from mon.2
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680 handle_mds_map i am now mds.0.70680
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680 handle_mds_map state change up:boot --> up:replay
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680 replay_start
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680  recovery set is 
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680  waiting for osdmap 293949 (which blacklists prior instance)
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening inotable
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening sessionmap
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening mds log
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening purge queue (async)
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: loading open file table (async)
2020-01-14 12:02:18.965 7effddd21700  2 mds.0.70680 Booting: 0: opening snap table
2020-01-14 12:02:18.967 7effddd21700 -1 MDSIOContextBase: blacklisted!  Restarting...
2020-01-14 12:02:18.967 7effddd21700  1 mds.cephdwightmds1 respawn!
2020-01-14 12:02:18.967 7effded23700  0 mds.0.journaler.pq(ro) error getting journal off disk
2020-01-14 12:02:18.967 7effded23700 -1 mds.0.purge_queue operator(): Error -108 loading Journaler
2020-01-14 12:02:18.967 7effded23700  1 mds.0.purge_queue _go_readonly: going readonly because internal IO failed: Cannot send after transport endpoint shutdown

Other MDS booting:

2020-01-14 12:02:19.992 7f2ed7a85700  1 mds.cephdwightmds2 Updating MDS map to version 70682 from mon.2
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682 handle_mds_map i am now mds.0.70682
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682 handle_mds_map state change up:boot --> up:replay
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682 replay_start
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682  recovery set is 
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682  waiting for osdmap 293950 (which blacklists prior instance)
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening inotable
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening sessionmap
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening mds log
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening purge queue (async)
2020-01-14 12:02:19.996 7f2ed1278700  2 mds.0.70682 Booting: 0: loading open file table (async)
2020-01-14 12:02:19.996 7f2ed1278700  2 mds.0.70682 Booting: 0: opening snap table
2020-01-14 12:02:20.014 7f2ed0a77700  2 mds.0.70682 Booting: 1: loading/discovering base inodes
2020-01-14 12:02:20.014 7f2ed0a77700  0 mds.0.cache creating system inode with ino:0x100
2020-01-14 12:02:20.014 7f2ed0a77700  0 mds.0.cache creating system inode with ino:0x1
2020-01-14 12:02:20.017 7f2ed1278700  2 mds.0.70682 Booting: 2: replaying mds log
2020-01-14 12:02:20.017 7f2ed1278700  2 mds.0.70682 Booting: 2: waiting for purge queue recovered
2020-01-14 12:02:24.146 7f2ecfa75700  1 mds.0.70682 Finished replaying journal
2020-01-14 12:02:24.146 7f2ecfa75700  1 mds.0.70682 making mds journal writeable
2020-01-14 12:02:24.146 7f2ecfa75700  2 mds.0.70682 i am alone, moving to state reconnect

The cluster seems stable now, after enabling v2 on all mons and restarting all MDSs.
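As a sanity check, dumping the mon map should show every mon advertising both a v2 and a v1 address, e.g.:

    # each mon should list addrs like [v2:<ip>:3300/0,v1:<ip>:6790/0]
    ceph mon dump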
