Bug #43596

mds: crash when enable msgr v2 due to lost contact

Added by Dan van der Ster about 1 month ago. Updated 18 days ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, MDSMonitor
Labels (FS):
crash
Pull request ID:
Crash signature:

Description

We just upgraded from mimic v13.2.7 to v14.2.6, and when we enable msgr v2 on the mon to which an MDS is connected, the MDS respawns.
Note that we run v1 on the nonstandard port 6790, so we need to enable v2 one by one on the mons.
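Because v1 is on a nonstandard port here, v2 cannot be switched on cluster-wide in one step; each mon's advertised addresses have to be set explicitly. A minimal sketch of the per-mon step (the mon name and addresses are this cluster's, and the command matches the one visible in the mon log in comment #2):

```shell
# Enable msgr v2 on one mon at a time by advertising both addrs.
# Sketch only -- substitute your own mon name and IP.
ceph mon set-addrs cephdwight-mon-1633994557 \
    "[v2:188.184.86.25:3300,v1:188.184.86.25:6790]"

# Verify the monmap now lists both protocols for that mon:
ceph mon dump | grep cephdwight-mon-1633994557
```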

Here's the mds log just after enabling v2 on mon at 188.184.86.25:

   -42> 2020-01-14 11:59:38.628 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 19001 / 1013921 inodes have caps, 25557 caps, 0.0252061 caps per inode
   -41> 2020-01-14 11:59:39.629 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 19001 / 1013921 inodes have caps, 25557 caps, 0.0252061 caps per inode
   -40> 2020-01-14 11:59:40.629 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 18997 / 1013921 inodes have caps, 25553 caps, 0.0252022 caps per inode
   -39> 2020-01-14 11:59:41.401 7fb7034a4700 10 monclient: _send_mon_message to mon.cephdwight-mon-1633994557 at v1:188.184.86.25:6790/0
   -38> 2020-01-14 11:59:41.630 7fb701ca1700  2 mds.0.cache Memory usage:  total 3173228, rss 2729004, heap 332028, baseline 332028, 18997 / 1013921 inodes have caps, 25553 caps, 0.0252022 caps per inode
   -37> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: handle_monmap mon_map magic: 0 v1
   -36> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient:  got monmap 43 from mon.cephdwight-mon-1633994557 (according to old e43)
   -35> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: dump:
epoch 43
fsid dd535a7e-4647-4bee-853d-f34112615f81
last_changed 2020-01-14 11:59:41.883881
created 2014-03-06 14:41:11.600153
min_mon_release 14 (nautilus)
0: [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon.cephdwight-mon-1633994557
1: [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0] mon.cephdwight-mon-f7df6839c6
2: [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0] mon.cephdwight-mon-d8788e3256

   -34> 2020-01-14 11:59:41.888 7fb7064aa700  1 monclient:  mon.0 has (v2) addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] but i'm connected to v1:188.184.86.25:6790/0, reconnecting
   -33> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: _reopen_session rank -1
   -32> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-f7df6839c6 con 0x555b82718800 addr [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0]
   -31> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-d8788e3256 con 0x555b82718400 addr [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0]
   -30> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-1633994557 con 0x555ba0827c00 addr [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0]
   -29> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
   -28> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
   -27> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
   -26> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): _renew_subs
   -25> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): get_auth_request con 0x555b82718800 auth_method 0
   -24> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
   -23> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): _init_auth method 2
   -22> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more payload 9
   -21> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more payload_len 9
   -20> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
   -19> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): get_auth_request con 0x555b82718400 auth_method 0
   -18> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
   -17> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): _init_auth method 2
   -16> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more payload 9
   -15> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more payload_len 9
   -14> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
   -13> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient(hunting): handle_auth_done global_id 1024259360 payload 1198
   -12> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _finish_hunting 0
   -11> 2020-01-14 11:59:41.898 7fb708f39700  1 monclient: found mon.cephdwight-mon-d8788e3256
   -10> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _send_mon_message to mon.cephdwight-mon-d8788e3256 at v2:188.185.66.208:3300/0
    -9> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _finish_auth 0
    -8> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-01-14 11:59:11.898795)
    -7> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: handle_monmap mon_map magic: 0 v1
    -6> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient:  got monmap 43 from mon.cephdwight-mon-d8788e3256 (according to old e43)
    -5> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: dump:
epoch 43
fsid dd535a7e-4647-4bee-853d-f34112615f81
last_changed 2020-01-14 11:59:41.883881
created 2014-03-06 14:41:11.600153
min_mon_release 14 (nautilus)
0: [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon.cephdwight-mon-1633994557
1: [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0] mon.cephdwight-mon-f7df6839c6
2: [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0] mon.cephdwight-mon-d8788e3256

    -4> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: handle_config config(1 keys) v1
    -3> 2020-01-14 11:59:41.900 7fb7064aa700  1 mds.cephdwightmds2 Updating MDS map to version 70675 from mon.2
    -2> 2020-01-14 11:59:41.900 7fb704ca7700  4 set_mon_vals no callback set
    -1> 2020-01-14 11:59:41.900 7fb7064aa700  1 mds.cephdwightmds2 Map removed me (mds.-1 gid:1024259360) from cluster due to lost contact; respawning
     0> 2020-01-14 11:59:41.900 7fb7064aa700  1 mds.cephdwightmds2 respawn!

There was also a case where restarting a standby pushed out a new mon map, triggering the same `lost contact` as above.

And at one point when I was trying to stabilize the cluster I had this:

mds blacklisted:

2020-01-14 12:02:18.959 7effe452e700  1 mds.cephdwightmds1 Updating MDS map to version 70680 from mon.2
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680 handle_mds_map i am now mds.0.70680
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680 handle_mds_map state change up:boot --> up:replay
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680 replay_start
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680  recovery set is 
2020-01-14 12:02:18.961 7effe452e700  1 mds.0.70680  waiting for osdmap 293949 (which blacklists prior instance)
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening inotable
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening sessionmap
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening mds log
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: opening purge queue (async)
2020-01-14 12:02:18.964 7effddd21700  2 mds.0.70680 Booting: 0: loading open file table (async)
2020-01-14 12:02:18.965 7effddd21700  2 mds.0.70680 Booting: 0: opening snap table
2020-01-14 12:02:18.967 7effddd21700 -1 MDSIOContextBase: blacklisted!  Restarting...
2020-01-14 12:02:18.967 7effddd21700  1 mds.cephdwightmds1 respawn!
2020-01-14 12:02:18.967 7effded23700  0 mds.0.journaler.pq(ro) error getting journal off disk
2020-01-14 12:02:18.967 7effded23700 -1 mds.0.purge_queue operator(): Error -108 loading Journaler
2020-01-14 12:02:18.967 7effded23700  1 mds.0.purge_queue _go_readonly: going readonly because internal IO failed: Cannot send after transport endpoint shutdown

other mds booting:

2020-01-14 12:02:19.992 7f2ed7a85700  1 mds.cephdwightmds2 Updating MDS map to version 70682 from mon.2
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682 handle_mds_map i am now mds.0.70682
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682 handle_mds_map state change up:boot --> up:replay
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682 replay_start
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682  recovery set is 
2020-01-14 12:02:19.993 7f2ed7a85700  1 mds.0.70682  waiting for osdmap 293950 (which blacklists prior instance)
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening inotable
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening sessionmap
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening mds log
2020-01-14 12:02:19.995 7f2ed1278700  2 mds.0.70682 Booting: 0: opening purge queue (async)
2020-01-14 12:02:19.996 7f2ed1278700  2 mds.0.70682 Booting: 0: loading open file table (async)
2020-01-14 12:02:19.996 7f2ed1278700  2 mds.0.70682 Booting: 0: opening snap table
2020-01-14 12:02:20.014 7f2ed0a77700  2 mds.0.70682 Booting: 1: loading/discovering base inodes
2020-01-14 12:02:20.014 7f2ed0a77700  0 mds.0.cache creating system inode with ino:0x100
2020-01-14 12:02:20.014 7f2ed0a77700  0 mds.0.cache creating system inode with ino:0x1
2020-01-14 12:02:20.017 7f2ed1278700  2 mds.0.70682 Booting: 2: replaying mds log
2020-01-14 12:02:20.017 7f2ed1278700  2 mds.0.70682 Booting: 2: waiting for purge queue recovered
2020-01-14 12:02:24.146 7f2ecfa75700  1 mds.0.70682 Finished replaying journal
2020-01-14 12:02:24.146 7f2ecfa75700  1 mds.0.70682 making mds journal writeable
2020-01-14 12:02:24.146 7f2ecfa75700  2 mds.0.70682 i am alone, moving to state reconnect

It seems to be stable now after enabling v2 on all mons and restarting all mds's.
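For reference, a quick way to confirm every mon is advertising v2 before restarting the MDSs (a sketch using the standard CLI; the grep patterns assume the monmap dump format shown above):

```shell
# Each rank line should list both a v2:...:3300 and a v1:... addr.
ceph mon dump | grep -E '^[0-9]+:'

# Mons still lacking a v2 addr (should print 0):
ceph mon dump 2>/dev/null | grep -E '^[0-9]+:' | grep -vc 'v2:'
```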

History

#1 Updated by Patrick Donnelly about 1 month ago

  • Subject changed from mds crash when enable msgr v2 due to lost contact to mds: crash when enable msgr v2 due to lost contact
  • Status changed from New to Need More Info
  • Source set to Community (user)
  • Backport set to nautilus
  • Component(FS) MDS, MDSMonitor added
  • Labels (FS) crash added

It seems to be stable now after enabling v2 on all mons and restarting all mds's.

Did you also upgrade the MDS?

Unfortunately, this can sometimes happen in rare-ish cases when restarting the monitors. Do you have the monitor logs too?

#2 Updated by Dan van der Ster about 1 month ago

Yes, the MDS was also upgraded to 14.2.6.

Below is the mon log from when I changed its addr. (The full file for that day is at ceph-post-file: 5d13e0fb-9cc8-4722-ad6c-2b1966211d2c.)

The strange thing is that the MDS says that map e70675 removed it due to lost contact, but when the mon prints e70675 it is still up:active.

2020-01-14 11:59:41.882 7f3cdba7f700  0 mon.cephdwight-mon-1633994557@0(leader) e42 handle_command mon_command({"prefix": "mon set-addrs", "name": "cephdwight-mon-1633994557", "addrs": "[v2:188.184.86.25:3300,v1:188.184.86.25:6790]"} v 0) v1
2020-01-14 11:59:41.882 7f3cdba7f700  0 log_channel(audit) log [INF] : from='client.1024303127 ' entity='client.admin' cmd=[{"prefix": "mon set-addrs", "name": "cephdwight-mon-1633994557", "addrs": "[v2:188.184.86.25:3300,v1:188.184.86.25:6790]"}]: dispatch
2020-01-14 11:59:41.889 7f3ce263f700  0 mon.cephdwight-mon-1633994557@0(leader) e43  monmap addrs for rank 0 changed, i am v1:188.184.86.25:6790/0, monmap is [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0], respawning
2020-01-14 11:59:41.889 7f3ce263f700  0 mon.cephdwight-mon-1633994557@0(leader) e43 respawn
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  e: '/usr/bin/ceph-mon'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  0: '/usr/bin/ceph-mon'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  1: '-f'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  2: '--cluster'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  3: 'ceph'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  4: '--id'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  5: 'cephdwight-mon-1633994557'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  6: '--setuser'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  7: 'ceph'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  8: '--setgroup'
2020-01-14 11:59:41.889 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  9: 'ceph'
2020-01-14 11:59:41.890 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43 respawning with exe /usr/bin/ceph-mon
2020-01-14 11:59:41.890 7f3ce263f700  1 mon.cephdwight-mon-1633994557@0(leader) e43  exe_path /proc/self/exe
2020-01-14 11:59:41.932 7fa8a5028040  0 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable), process ceph-mon, pid 2304
2020-01-14 11:59:41.932 7fa8a5028040  0 pidfile_write: ignore empty --pid-file
2020-01-14 11:59:41.965 7fa8a5028040  0 load: jerasure load: lrc load: isa 
2020-01-14 11:59:41.968 7fa8a5028040  1 leveldb: Recovering log #16526187
2020-01-14 11:59:41.974 7fa8a5028040  1 leveldb: Level-0 table #16526214: started
2020-01-14 11:59:41.992 7fa8a5028040  1 leveldb: Level-0 table #16526214: 2664212 bytes OK
2020-01-14 11:59:41.994 7fa8a5028040  1 leveldb: Delete type=3 #16525234

2020-01-14 11:59:41.995 7fa8a5028040  1 leveldb: Delete type=0 #16526187

2020-01-14 11:59:41.995 7fa8a5028040 -1 WARNING: 'mon addr' config option v2:188.184.86.25:6790/0 does not match monmap file
         continuing with monmap configuration
2020-01-14 11:59:41.996 7fa8a5028040  0 starting mon.cephdwight-mon-1633994557 rank 0 at public addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] at bind addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon_data /var/lib/ceph/mon/ceph-cephdwight-mon-1633994557 fsid dd535a7e-4647-4bee-853d-f34112615f81
2020-01-14 11:59:41.997 7fa8a5028040  1 mon.cephdwight-mon-1633994557@-1(???) e43 preinit fsid dd535a7e-4647-4bee-853d-f34112615f81
2020-01-14 11:59:41.997 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(???).mds e70675 new map
2020-01-14 11:59:41.997 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(???).mds e70675 print_map
e70675
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   70673
flags   1e
created 2016-11-02 13:48:49.499990
modified        2020-01-14 11:45:13.836504
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  293944
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=1024259360}
failed  
damaged 
stopped 1,2
data_pools      [173]
metadata_pool   174
inline_data     disabled
balancer        
standby_count_wanted    1
1024259360:     v1:137.138.122.17:6801/4076417850 'cephdwightmds2' mds.0.70670 up:active seq 8

Standby daemons:

1024263064:     v1:188.184.95.158:6801/1464356353 'cephdwightmds1' mds.-1.0 up:standby seq 1
1024263091:     v1:188.184.83.152:6801/7247617 'cephdwightmds0' mds.-1.0 up:standby seq 2

2020-01-14 11:59:41.999 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 432629239337189376, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 432629239337189376, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 3314933000854323200, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 432629239337189376, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040  1 mon.cephdwight-mon-1633994557@-1(???).paxosservice(auth 623158..623227) refresh upgraded, format 0 -> 3
2020-01-14 11:59:42.012 7fa8a5028040  0 mon.cephdwight-mon-1633994557@-1(probing) e43  my rank is now 0 (was -1)
2020-01-14 11:59:42.016 7fa88cab9700  0 log_channel(cluster) log [INF] : mon.cephdwight-mon-1633994557 calling monitor election
2020-01-14 11:59:42.016 7fa88cab9700  1 mon.cephdwight-mon-1633994557@0(electing).elector(8072) init, last seen epoch 8072
2020-01-14 11:59:42.018 7fa88cab9700 -1 mon.cephdwight-mon-1633994557@0(electing) e43 failed to get devid for : udev_device_new_from_subsystem_sysname failed on ''
2020-01-14 11:59:42.020 7fa88cab9700  0 log_channel(cluster) log [INF] : mon.cephdwight-mon-1633994557 is new leader, mons cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 in quorum (ranks 0,1,2)
2020-01-14 11:59:42.026 7fa88cab9700  0 log_channel(cluster) log [DBG] : monmap e43: 3 mons at {cephdwight-mon-1633994557=[v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0],cephdwight-mon-d8788e3256=[v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0],cephdwight-mon-f7df6839c6=[v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0]}
2020-01-14 11:59:42.026 7fa88cab9700  0 log_channel(cluster) log [DBG] : fsmap cephfs:1 {0=cephdwightmds2=up:active} 2 up:standby
2020-01-14 11:59:42.026 7fa88cab9700  0 log_channel(cluster) log [DBG] : osdmap e293946: 60 total, 60 up, 60 in
2020-01-14 11:59:42.033 7fa88cab9700  0 log_channel(cluster) log [DBG] : mgrmap e431: cephdwight-mon-1633994557(active, since 45m), standbys: cephdwight-mon-f7df6839c6, cephdwight-mon-d8788e3256
2020-01-14 11:59:42.033 7fa88cab9700  0 log_channel(cluster) log [INF] : Health check cleared: MON_MSGR2_NOT_ENABLED (was: 1 monitors have not enabled msgr2)
2020-01-14 11:59:42.034 7fa88cab9700  0 log_channel(cluster) log [INF] : Cluster is now healthy
2020-01-14 11:59:42.052 7fa893040700  0 log_channel(cluster) log [INF] : overall HEALTH_OK
2020-01-14 11:59:42.356 7fa88cab9700  0 log_channel(cluster) log [INF] : daemon mds.cephdwightmds2 restarted
2020-01-14 11:59:42.356 7fa88cab9700  1 mon.cephdwight-mon-1633994557@0(leader).mds e70675 fail_mds_gid 1024259360 mds.cephdwightmds2 role 0
2020-01-14 11:59:42.356 7fa88cab9700  1 mon.cephdwight-mon-1633994557@0(leader).osd e293946 do_prune osdmap full prune enabled
2020-01-14 11:59:42.369 7fa893040700 -1 mon.cephdwight-mon-1633994557@0(leader).osd e293946 register_cache_with_pcm not using rocksdb
2020-01-14 11:59:42.370 7fa893040700  1 mon.cephdwight-mon-1633994557@0(leader).osd e293947 e293947: 60 total, 60 up, 60 in
2020-01-14 11:59:42.372 7fa893040700  0 log_channel(cluster) log [DBG] : osdmap e293947: 60 total, 60 up, 60 in
2020-01-14 11:59:42.467 7fa88cab9700  0 mon.cephdwight-mon-1633994557@0(leader) e43 handle_command mon_command({"prefix": "df", "format": "json"} v 0) v1
2020-01-14 11:59:42.467 7fa88cab9700  0 log_channel(audit) log [DBG] : from='client.1024331298 128.142.167.20:0/4106550175' entity='client.castorcert' cmd=[{"prefix": "df", "format": "json"}]: dispatch
2020-01-14 11:59:43.104 7fa88f2be700  0 log_channel(cluster) log [WRN] : Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2020-01-14 11:59:43.104 7fa88f2be700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2020-01-14 11:59:43.120 7fa88a274700  1 leveldb: Compacting 1@0 + 3@1 files
2020-01-14 11:59:43.132 7fa893040700  0 mon.cephdwight-mon-1633994557@0(leader).mds e70676 new map
2020-01-14 11:59:43.132 7fa893040700  0 mon.cephdwight-mon-1633994557@0(leader).mds e70676 print_map
e70676
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   70676
flags   1e
created 2016-11-02 13:48:49.499990
modified        2020-01-14 11:59:43.104834
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  293947
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {}
failed  0
damaged 
stopped 1,2
data_pools      [173]
metadata_pool   174
inline_data     disabled
balancer        
standby_count_wanted    1

Standby daemons:

1024263064:     v1:188.184.95.158:6801/1464356353 'cephdwightmds1' mds.-1.0 up:standby seq 1
1024263091:     v1:188.184.83.152:6801/7247617 'cephdwightmds0' mds.-1.0 up:standby seq 2
1024303151:     [v2:137.138.122.17:6800/2692603670,v1:137.138.122.17:6801/2692603670] 'cephdwightmds2' mds.-1.0 up:standby seq 1

2020-01-14 11:59:43.132 7fa88a274700  1 leveldb: Generated table #16526216: 130 keys, 2153003 bytes
2020-01-14 11:59:43.134 7fa893040700  0 log_channel(cluster) log [DBG] : mds.? [v2:137.138.122.17:6800/2692603670,v1:137.138.122.17:6801/2692603670] up:boot
2020-01-14 11:59:43.134 7fa893040700  1 mon.cephdwight-mon-1633994557@0(leader).mds e70676  taking over failed mds.0 with 1024263064/cephdwightmds1 v1:188.184.95.158:6801/1464356353
2020-01-14 11:59:43.134 7fa893040700  0 log_channel(cluster) log [INF] : Standby daemon mds.cephdwightmds1 assigned to filesystem cephfs as rank 0
2020-01-14 11:59:43.134 7fa893040700  0 log_channel(cluster) log [INF] : Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2020-01-14 11:59:43.136 7fa893040700  0 log_channel(cluster) log [DBG] : fsmap cephfs:0/1 3 up:standby, 1 failed
2020-01-14 11:59:43.144 7fa893040700  0 mon.cephdwight-mon-1633994557@0(leader).mds e70677 new map
2020-01-14 11:59:43.144 7fa893040700  0 mon.cephdwight-mon-1633994557@0(leader).mds e70677 print_map
e70677
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   70677
flags   1e
created 2016-11-02 13:48:49.499990
modified        2020-01-14 11:59:43.135648
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  293947
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=1024263064}
failed  
damaged 
stopped 1,2
data_pools      [173]
metadata_pool   174
inline_data     disabled
balancer        
standby_count_wanted    1
1024263064:     v1:188.184.95.158:6801/1464356353 'cephdwightmds1' mds.0.70677 up:replay seq 1

Standby daemons:

1024263091:     v1:188.184.83.152:6801/7247617 'cephdwightmds0' mds.-1.0 up:standby seq 2
1024303151:     [v2:137.138.122.17:6800/2692603670,v1:137.138.122.17:6801/2692603670] 'cephdwightmds2' mds.-1.0 up:standby seq 1

2020-01-14 11:59:43.144 7fa88a274700  1 leveldb: Generated table #16526217: 93 keys, 2104982 bytes
2020-01-14 11:59:43.145 7fa893040700  0 log_channel(cluster) log [DBG] : fsmap cephfs:1/1 {0=cephdwightmds1=up:replay} 2 up:standby
2020-01-14 11:59:43.156 7fa88a274700  1 leveldb: Generated table #16526218: 88 keys, 2107771 bytes
2020-01-14 11:59:43.165 7fa88a274700  1 leveldb: Generated table #16526219: 76 keys, 1610808 bytes
2020-01-14 11:59:43.165 7fa88a274700  1 leveldb: Compacted 1@0 + 3@1 files => 7976564 bytes
2020-01-14 11:59:43.165 7fa88a274700  1 leveldb: compacted to: files[ 0 4 4 1 57 0 0 ]
2020-01-14 11:59:43.165 7fa88a274700  1 leveldb: Delete type=2 #16526190

2020-01-14 11:59:43.166 7fa88a274700  1 leveldb: Delete type=2 #16526191

2020-01-14 11:59:43.166 7fa88a274700  1 leveldb: Delete type=2 #16526192

2020-01-14 11:59:43.167 7fa88a274700  1 leveldb: Delete type=2 #16526214

2020-01-14 11:59:46.618 7fa88cab9700  0 mon.cephdwight-mon-1633994557@0(leader) e43 handle_command mon_command({"prefix": "df", "format": "json"} v 0) v1

#3 Updated by Patrick Donnelly about 1 month ago

That's indeed very odd. I looked through the code but didn't find a good reason why this would happen. It is interesting that the MonClient had just reconnected to the monitor but the global ID shouldn't/didn't change. So I'm currently stumped...

#4 Updated by Patrick Donnelly 28 days ago

  • Status changed from Need More Info to New
  • Priority changed from Normal to Low

#5 Updated by Dan van der Ster 18 days ago

We just updated our second cluster to nautilus and saw the exact same MDS respawn at the moment we enabled msgr2:

   -33> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: handle_monmap mon_map magic: 0 v1
   -32> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient:  got monmap 6 from mon.cephjim-mon-54686b1b98 (according to old e6)
   -31> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: dump:
epoch 6
fsid 4da6fd06-b069-49af-901f-c9513baabdbd
last_changed 2020-01-30 10:46:48.787499
created 2017-11-09 14:53:29.757237
min_mon_release 14 (nautilus)
0: [v2:137.138.32.34:3300/0,v1:137.138.32.34:6789/0] mon.cephjim-mon-decd03f337
1: [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0] mon.cephjim-mon-54686b1b98
2: [v2:137.138.148.253:3300/0,v1:137.138.148.253:6789/0] mon.cephjim-mon-f92526a47a

   -30> 2020-01-30 10:46:48.797 7f933fb53700  1 monclient:  mon.1 has (v2) addrs [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0] but i'm connected to v1:137.138.76.42:6789/0, reconnecting
   -29> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: _reopen_session rank -1
   -28> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): picked mon.cephjim-mon-decd03f337 con 0x5653db389c00 addr [v2:137.138.32.34:3300/0,v1:137.138.32.34:6789/0]
   -27> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): picked mon.cephjim-mon-f92526a47a con 0x5653da8b3400 addr [v2:137.138.148.253:3300/0,v1:137.138.148.253:6789/0]
   -26> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): picked mon.cephjim-mon-54686b1b98 con 0x5653f93d5c00 addr [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0]
   -25> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): start opening mon connection
   -24> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): start opening mon connection
   -23> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): start opening mon connection
   -22> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): _renew_subs
   -21> 2020-01-30 10:46:48.804 7f933b34a700  2 mds.0.cache Memory usage:  total 1199956, rss 764872, heap 332028, baseline 332028, 3332 / 271652 inodes have caps, 4735 caps, 0.0174304 caps per inode
   -20> 2020-01-30 10:46:49.016 7f93425e2700 10 monclient(hunting): get_auth_request con 0x5653da8b3400 auth_method 0
   -19> 2020-01-30 10:46:49.016 7f93425e2700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
   -18> 2020-01-30 10:46:49.016 7f93425e2700 10 monclient(hunting): _init_auth method 2
   -17> 2020-01-30 10:46:49.025 7f93425e2700 10 monclient(hunting): handle_auth_reply_more payload 9
   -16> 2020-01-30 10:46:49.025 7f93425e2700 10 monclient(hunting): handle_auth_reply_more payload_len 9
   -15> 2020-01-30 10:46:49.025 7f93425e2700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
   -14> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient(hunting): handle_auth_done global_id 76134616 payload 1166
   -13> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _finish_hunting 0
   -12> 2020-01-30 10:46:49.042 7f93425e2700  1 monclient: found mon.cephjim-mon-f92526a47a
   -11> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _send_mon_message to mon.cephjim-mon-f92526a47a at v2:137.138.148.253:3300/0
   -10> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _send_mon_message to mon.cephjim-mon-f92526a47a at v2:137.138.148.253:3300/0
    -9> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _finish_auth 0
    -8> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-01-30 10:46:19.043349)
    -7> 2020-01-30 10:46:49.067 7f933fb53700 10 monclient: handle_monmap mon_map magic: 0 v1
    -6> 2020-01-30 10:46:49.067 7f933fb53700 10 monclient:  got monmap 6 from mon.cephjim-mon-f92526a47a (according to old e6)
    -5> 2020-01-30 10:46:49.067 7f933fb53700 10 monclient: dump:
epoch 6
fsid 4da6fd06-b069-49af-901f-c9513baabdbd
last_changed 2020-01-30 10:46:48.787499
created 2017-11-09 14:53:29.757237
min_mon_release 14 (nautilus)
0: [v2:137.138.32.34:3300/0,v1:137.138.32.34:6789/0] mon.cephjim-mon-decd03f337
1: [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0] mon.cephjim-mon-54686b1b98
2: [v2:137.138.148.253:3300/0,v1:137.138.148.253:6789/0] mon.cephjim-mon-f92526a47a

    -4> 2020-01-30 10:46:49.096 7f933fb53700 10 monclient: handle_config config(1 keys) v1
    -3> 2020-01-30 10:46:49.096 7f933fb53700  1 mds.hpc-be143 Updating MDS map to version 62877 from mon.2
    -2> 2020-01-30 10:46:49.096 7f933fb53700  1 mds.hpc-be143 Map removed me (mds.-1 gid:76134616) from cluster due to lost contact; respawning
    -1> 2020-01-30 10:46:49.096 7f933fb53700  1 mds.hpc-be143 respawn!
     0> 2020-01-30 10:46:49.096 7f933e350700  4 set_mon_vals no callback set

Our upgrade procedure follows the doc:
- upgrade mons to v14.2.6
- upgrade mgrs to v14.2.6
- upgrade osds to v14.2.6
- upgrade mds (standbys stopped, only 1 active) to v14.2.6
- ceph osd require-osd-release nautilus
- ceph mon enable-msgr2 (at this point the active mds respawns)
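The same procedure as a shell sketch. The daemon restart commands are generic assumptions (adapt to your deployment tooling); only the last two commands are taken verbatim from the list above:

```shell
# Rolling upgrade to nautilus, one daemon class at a time,
# restarting each class after its packages are upgraded.
systemctl restart ceph-mon.target   # mons first
systemctl restart ceph-mgr.target   # then mgrs
systemctl restart ceph-osd.target   # then osds, host by host
systemctl restart ceph-mds.target   # mds last: standbys stopped, only 1 active

# Final steps from the procedure above:
ceph osd require-osd-release nautilus
ceph mon enable-msgr2               # at this point the active MDS respawned
```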
