Bug #43596
mds: crash when enable msgr v2 due to lost contact
Description
We just upgraded from mimic v13.2.7 to nautilus v14.2.6, and when we enable msgr v2 on the mon that an MDS is connected to, the MDS respawns.
Note that we run v1 on nonstandard port 6790, so we need to enable v2 one-by-one on the mons.
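The per-mon step is done with `mon set-addrs` (the same command that shows up in the mon log later in this ticket). A minimal sketch using this cluster's first mon; the `ceph mon dump` check afterwards is just one way to verify, not part of the original procedure:

```shell
# Enable v2 on a single mon by rewriting its advertised address list
# (needed here because v1 runs on the nonstandard port 6790, so the
# cluster-wide "ceph mon enable-msgr2" shortcut doesn't cover it).
ceph mon set-addrs cephdwight-mon-1633994557 "[v2:188.184.86.25:3300,v1:188.184.86.25:6790]"

# The mon respawns itself with the new addrs; confirm they took effect
# before moving on to the next mon.
ceph mon dump | grep cephdwight-mon-1633994557
```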
Here's the mds log just after enabling v2 on mon at 188.184.86.25:
-42> 2020-01-14 11:59:38.628 7fb701ca1700 2 mds.0.cache Memory usage: total 3173228, rss 2729004, heap 332028, baseline 332028, 19001 / 1013921 inodes have caps, 25557 caps, 0.0252061 caps per inode
-41> 2020-01-14 11:59:39.629 7fb701ca1700 2 mds.0.cache Memory usage: total 3173228, rss 2729004, heap 332028, baseline 332028, 19001 / 1013921 inodes have caps, 25557 caps, 0.0252061 caps per inode
-40> 2020-01-14 11:59:40.629 7fb701ca1700 2 mds.0.cache Memory usage: total 3173228, rss 2729004, heap 332028, baseline 332028, 18997 / 1013921 inodes have caps, 25553 caps, 0.0252022 caps per inode
-39> 2020-01-14 11:59:41.401 7fb7034a4700 10 monclient: _send_mon_message to mon.cephdwight-mon-1633994557 at v1:188.184.86.25:6790/0
-38> 2020-01-14 11:59:41.630 7fb701ca1700 2 mds.0.cache Memory usage: total 3173228, rss 2729004, heap 332028, baseline 332028, 18997 / 1013921 inodes have caps, 25553 caps, 0.0252022 caps per inode
-37> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: handle_monmap mon_map magic: 0 v1
-36> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: got monmap 43 from mon.cephdwight-mon-1633994557 (according to old e43)
-35> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: dump: epoch 43 fsid dd535a7e-4647-4bee-853d-f34112615f81 last_changed 2020-01-14 11:59:41.883881 created 2014-03-06 14:41:11.600153 min_mon_release 14 (nautilus) 0: [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon.cephdwight-mon-1633994557 1: [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0] mon.cephdwight-mon-f7df6839c6 2: [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0] mon.cephdwight-mon-d8788e3256
-34> 2020-01-14 11:59:41.888 7fb7064aa700 1 monclient: mon.0 has (v2) addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] but i'm connected to v1:188.184.86.25:6790/0, reconnecting
-33> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient: _reopen_session rank -1
-32> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-f7df6839c6 con 0x555b82718800 addr [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0]
-31> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-d8788e3256 con 0x555b82718400 addr [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0]
-30> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): picked mon.cephdwight-mon-1633994557 con 0x555ba0827c00 addr [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0]
-29> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
-28> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
-27> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): start opening mon connection
-26> 2020-01-14 11:59:41.888 7fb7064aa700 10 monclient(hunting): _renew_subs
-25> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): get_auth_request con 0x555b82718800 auth_method 0
-24> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
-23> 2020-01-14 11:59:41.891 7fb709f3b700 10 monclient(hunting): _init_auth method 2
-22> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more payload 9
-21> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more payload_len 9
-20> 2020-01-14 11:59:41.894 7fb709f3b700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
-19> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): get_auth_request con 0x555b82718400 auth_method 0
-18> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
-17> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): _init_auth method 2
-16> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more payload 9
-15> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more payload_len 9
-14> 2020-01-14 11:59:41.896 7fb708f39700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
-13> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient(hunting): handle_auth_done global_id 1024259360 payload 1198
-12> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _finish_hunting 0
-11> 2020-01-14 11:59:41.898 7fb708f39700 1 monclient: found mon.cephdwight-mon-d8788e3256
-10> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _send_mon_message to mon.cephdwight-mon-d8788e3256 at v2:188.185.66.208:3300/0
-9> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _finish_auth 0
-8> 2020-01-14 11:59:41.898 7fb708f39700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-01-14 11:59:11.898795)
-7> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: handle_monmap mon_map magic: 0 v1
-6> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: got monmap 43 from mon.cephdwight-mon-d8788e3256 (according to old e43)
-5> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: dump: epoch 43 fsid dd535a7e-4647-4bee-853d-f34112615f81 last_changed 2020-01-14 11:59:41.883881 created 2014-03-06 14:41:11.600153 min_mon_release 14 (nautilus) 0: [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon.cephdwight-mon-1633994557 1: [v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0] mon.cephdwight-mon-f7df6839c6 2: [v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0] mon.cephdwight-mon-d8788e3256
-4> 2020-01-14 11:59:41.900 7fb7064aa700 10 monclient: handle_config config(1 keys) v1
-3> 2020-01-14 11:59:41.900 7fb7064aa700 1 mds.cephdwightmds2 Updating MDS map to version 70675 from mon.2
-2> 2020-01-14 11:59:41.900 7fb704ca7700 4 set_mon_vals no callback set
-1> 2020-01-14 11:59:41.900 7fb7064aa700 1 mds.cephdwightmds2 Map removed me (mds.-1 gid:1024259360) from cluster due to lost contact; respawning
0> 2020-01-14 11:59:41.900 7fb7064aa700 1 mds.cephdwightmds2 respawn!
There was also a case where restarting a standby pushed out a new mon map, triggering the same `lost contact` as above.
And at one point when I was trying to stabilize the cluster I had this:
mds blacklisted:
2020-01-14 12:02:18.959 7effe452e700 1 mds.cephdwightmds1 Updating MDS map to version 70680 from mon.2
2020-01-14 12:02:18.961 7effe452e700 1 mds.0.70680 handle_mds_map i am now mds.0.70680
2020-01-14 12:02:18.961 7effe452e700 1 mds.0.70680 handle_mds_map state change up:boot --> up:replay
2020-01-14 12:02:18.961 7effe452e700 1 mds.0.70680 replay_start
2020-01-14 12:02:18.961 7effe452e700 1 mds.0.70680 recovery set is
2020-01-14 12:02:18.961 7effe452e700 1 mds.0.70680 waiting for osdmap 293949 (which blacklists prior instance)
2020-01-14 12:02:18.964 7effddd21700 2 mds.0.70680 Booting: 0: opening inotable
2020-01-14 12:02:18.964 7effddd21700 2 mds.0.70680 Booting: 0: opening sessionmap
2020-01-14 12:02:18.964 7effddd21700 2 mds.0.70680 Booting: 0: opening mds log
2020-01-14 12:02:18.964 7effddd21700 2 mds.0.70680 Booting: 0: opening purge queue (async)
2020-01-14 12:02:18.964 7effddd21700 2 mds.0.70680 Booting: 0: loading open file table (async)
2020-01-14 12:02:18.965 7effddd21700 2 mds.0.70680 Booting: 0: opening snap table
2020-01-14 12:02:18.967 7effddd21700 -1 MDSIOContextBase: blacklisted! Restarting...
2020-01-14 12:02:18.967 7effddd21700 1 mds.cephdwightmds1 respawn!
2020-01-14 12:02:18.967 7effded23700 0 mds.0.journaler.pq(ro) error getting journal off disk
2020-01-14 12:02:18.967 7effded23700 -1 mds.0.purge_queue operator(): Error -108 loading Journaler
2020-01-14 12:02:18.967 7effded23700 1 mds.0.purge_queue _go_readonly: going readonly because internal IO failed: Cannot send after transport endpoint shutdown
other mds booting:
2020-01-14 12:02:19.992 7f2ed7a85700 1 mds.cephdwightmds2 Updating MDS map to version 70682 from mon.2
2020-01-14 12:02:19.993 7f2ed7a85700 1 mds.0.70682 handle_mds_map i am now mds.0.70682
2020-01-14 12:02:19.993 7f2ed7a85700 1 mds.0.70682 handle_mds_map state change up:boot --> up:replay
2020-01-14 12:02:19.993 7f2ed7a85700 1 mds.0.70682 replay_start
2020-01-14 12:02:19.993 7f2ed7a85700 1 mds.0.70682 recovery set is
2020-01-14 12:02:19.993 7f2ed7a85700 1 mds.0.70682 waiting for osdmap 293950 (which blacklists prior instance)
2020-01-14 12:02:19.995 7f2ed1278700 2 mds.0.70682 Booting: 0: opening inotable
2020-01-14 12:02:19.995 7f2ed1278700 2 mds.0.70682 Booting: 0: opening sessionmap
2020-01-14 12:02:19.995 7f2ed1278700 2 mds.0.70682 Booting: 0: opening mds log
2020-01-14 12:02:19.995 7f2ed1278700 2 mds.0.70682 Booting: 0: opening purge queue (async)
2020-01-14 12:02:19.996 7f2ed1278700 2 mds.0.70682 Booting: 0: loading open file table (async)
2020-01-14 12:02:19.996 7f2ed1278700 2 mds.0.70682 Booting: 0: opening snap table
2020-01-14 12:02:20.014 7f2ed0a77700 2 mds.0.70682 Booting: 1: loading/discovering base inodes
2020-01-14 12:02:20.014 7f2ed0a77700 0 mds.0.cache creating system inode with ino:0x100
2020-01-14 12:02:20.014 7f2ed0a77700 0 mds.0.cache creating system inode with ino:0x1
2020-01-14 12:02:20.017 7f2ed1278700 2 mds.0.70682 Booting: 2: replaying mds log
2020-01-14 12:02:20.017 7f2ed1278700 2 mds.0.70682 Booting: 2: waiting for purge queue recovered
2020-01-14 12:02:24.146 7f2ecfa75700 1 mds.0.70682 Finished replaying journal
2020-01-14 12:02:24.146 7f2ecfa75700 1 mds.0.70682 making mds journal writeable
2020-01-14 12:02:24.146 7f2ecfa75700 2 mds.0.70682 i am alone, moving to state reconnect
It seems to be stable now after enabling v2 on all mons and restarting all mds's.
Updated by Patrick Donnelly over 4 years ago
- Subject changed from "mds crash when enable msgr v2 due to lost contact" to "mds: crash when enable msgr v2 due to lost contact"
- Status changed from New to Need More Info
- Source set to Community (user)
- Backport set to nautilus
- Component(FS) MDS, MDSMonitor added
- Labels (FS) crash added
It seems to be stable now after enabling v2 on all mons and restarting all mds's.
Did you also upgrade the MDS?
Unfortunately when restarting the monitors this can sometimes happen in rare-ish cases. Do you have the monitor logs too?
Updated by Dan van der Ster over 4 years ago
Yes the MDS was upgraded to 14.2.6 also.
Below is the mon log from when I changed its addr. (The full file for that day is at ceph-post-file: 5d13e0fb-9cc8-4722-ad6c-2b1966211d2c.)
The strange thing is that the mds says that map e70675 removed it due to lost contact, but when the mon prints e70675 it is still up:active.
2020-01-14 11:59:41.882 7f3cdba7f700 0 mon.cephdwight-mon-1633994557@0(leader) e42 handle_command mon_command({"prefix": "mon set-addrs", "name": "cephdwight-mon-1633994557", "addrs": "[v2:188.184.86.25:3300,v1:188.184.86.25:6790]"} v 0) v1
2020-01-14 11:59:41.882 7f3cdba7f700 0 log_channel(audit) log [INF] : from='client.1024303127 ' entity='client.admin' cmd=[{"prefix": "mon set-addrs", "name": "cephdwight-mon-1633994557", "addrs": "[v2:188.184.86.25:3300,v1:188.184.86.25:6790]"}]: dispatch
2020-01-14 11:59:41.889 7f3ce263f700 0 mon.cephdwight-mon-1633994557@0(leader) e43 monmap addrs for rank 0 changed, i am v1:188.184.86.25:6790/0, monmap is [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0], respawning
2020-01-14 11:59:41.889 7f3ce263f700 0 mon.cephdwight-mon-1633994557@0(leader) e43 respawn
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 e: '/usr/bin/ceph-mon'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 0: '/usr/bin/ceph-mon'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 1: '-f'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 2: '--cluster'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 3: 'ceph'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 4: '--id'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 5: 'cephdwight-mon-1633994557'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 6: '--setuser'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 7: 'ceph'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 8: '--setgroup'
2020-01-14 11:59:41.889 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 9: 'ceph'
2020-01-14 11:59:41.890 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 respawning with exe /usr/bin/ceph-mon
2020-01-14 11:59:41.890 7f3ce263f700 1 mon.cephdwight-mon-1633994557@0(leader) e43 exe_path /proc/self/exe
2020-01-14 11:59:41.932 7fa8a5028040 0 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable), process ceph-mon, pid 2304
2020-01-14 11:59:41.932 7fa8a5028040 0 pidfile_write: ignore empty --pid-file
2020-01-14 11:59:41.965 7fa8a5028040 0 load: jerasure load: lrc load: isa
2020-01-14 11:59:41.968 7fa8a5028040 1 leveldb: Recovering log #16526187
2020-01-14 11:59:41.974 7fa8a5028040 1 leveldb: Level-0 table #16526214: started
2020-01-14 11:59:41.992 7fa8a5028040 1 leveldb: Level-0 table #16526214: 2664212 bytes OK
2020-01-14 11:59:41.994 7fa8a5028040 1 leveldb: Delete type=3 #16525234
2020-01-14 11:59:41.995 7fa8a5028040 1 leveldb: Delete type=0 #16526187
2020-01-14 11:59:41.995 7fa8a5028040 -1 WARNING: 'mon addr' config option v2:188.184.86.25:6790/0 does not match monmap file continuing with monmap configuration
2020-01-14 11:59:41.996 7fa8a5028040 0 starting mon.cephdwight-mon-1633994557 rank 0 at public addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] at bind addrs [v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0] mon_data /var/lib/ceph/mon/ceph-cephdwight-mon-1633994557 fsid dd535a7e-4647-4bee-853d-f34112615f81
2020-01-14 11:59:41.997 7fa8a5028040 1 mon.cephdwight-mon-1633994557@-1(???) e43 preinit fsid dd535a7e-4647-4bee-853d-f34112615f81
2020-01-14 11:59:41.997 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).mds e70675 new map
2020-01-14 11:59:41.997 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).mds e70675 print_map e70675 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 1 Filesystem 'cephfs' (1) fs_name cephfs epoch 70673 flags 1e created 2016-11-02 13:48:49.499990 modified 2020-01-14 11:45:13.836504 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 min_compat_client -1 (unspecified) last_failure 0 last_failure_osd_epoch 293944 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=1024259360} failed damaged stopped 1,2 data_pools [173] metadata_pool 174 inline_data disabled balancer standby_count_wanted 1 1024259360: v1:137.138.122.17:6801/4076417850 'cephdwightmds2' mds.0.70670 up:active seq 8 Standby daemons: 1024263064: v1:188.184.95.158:6801/1464356353 'cephdwightmds1' mds.-1.0 up:standby seq 1 1024263091: v1:188.184.83.152:6801/7247617 'cephdwightmds0' mds.-1.0 up:standby seq 2
2020-01-14 11:59:41.999 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 432629239337189376, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 432629239337189376, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 3314933000854323200, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).osd e293946 crush map has features 432629239337189376, adjusting msgr requires
2020-01-14 11:59:41.999 7fa8a5028040 1 mon.cephdwight-mon-1633994557@-1(???).paxosservice(auth 623158..623227) refresh upgraded, format 0 -> 3
2020-01-14 11:59:42.012 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(probing) e43 my rank is now 0 (was -1)
2020-01-14 11:59:42.016 7fa88cab9700 0 log_channel(cluster) log [INF] : mon.cephdwight-mon-1633994557 calling monitor election
2020-01-14 11:59:42.016 7fa88cab9700 1 mon.cephdwight-mon-1633994557@0(electing).elector(8072) init, last seen epoch 8072
2020-01-14 11:59:42.018 7fa88cab9700 -1 mon.cephdwight-mon-1633994557@0(electing) e43 failed to get devid for : udev_device_new_from_subsystem_sysname failed on ''
2020-01-14 11:59:42.020 7fa88cab9700 0 log_channel(cluster) log [INF] : mon.cephdwight-mon-1633994557 is new leader, mons cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 in quorum (ranks 0,1,2)
2020-01-14 11:59:42.026 7fa88cab9700 0 log_channel(cluster) log [DBG] : monmap e43: 3 mons at {cephdwight-mon-1633994557=[v2:188.184.86.25:3300/0,v1:188.184.86.25:6790/0],cephdwight-mon-d8788e3256=[v2:188.185.66.208:3300/0,v1:188.185.66.208:6790/0],cephdwight-mon-f7df6839c6=[v2:188.184.94.56:3300/0,v1:188.184.94.56:6790/0]}
2020-01-14 11:59:42.026 7fa88cab9700 0 log_channel(cluster) log [DBG] : fsmap cephfs:1 {0=cephdwightmds2=up:active} 2 up:standby
2020-01-14 11:59:42.026 7fa88cab9700 0 log_channel(cluster) log [DBG] : osdmap e293946: 60 total, 60 up, 60 in
2020-01-14 11:59:42.033 7fa88cab9700 0 log_channel(cluster) log [DBG] : mgrmap e431: cephdwight-mon-1633994557(active, since 45m), standbys: cephdwight-mon-f7df6839c6, cephdwight-mon-d8788e3256
2020-01-14 11:59:42.033 7fa88cab9700 0 log_channel(cluster) log [INF] : Health check cleared: MON_MSGR2_NOT_ENABLED (was: 1 monitors have not enabled msgr2)
2020-01-14 11:59:42.034 7fa88cab9700 0 log_channel(cluster) log [INF] : Cluster is now healthy
2020-01-14 11:59:42.052 7fa893040700 0 log_channel(cluster) log [INF] : overall HEALTH_OK
2020-01-14 11:59:42.356 7fa88cab9700 0 log_channel(cluster) log [INF] : daemon mds.cephdwightmds2 restarted
2020-01-14 11:59:42.356 7fa88cab9700 1 mon.cephdwight-mon-1633994557@0(leader).mds e70675 fail_mds_gid 1024259360 mds.cephdwightmds2 role 0
2020-01-14 11:59:42.356 7fa88cab9700 1 mon.cephdwight-mon-1633994557@0(leader).osd e293946 do_prune osdmap full prune enabled
2020-01-14 11:59:42.369 7fa893040700 -1 mon.cephdwight-mon-1633994557@0(leader).osd e293946 register_cache_with_pcm not using rocksdb
2020-01-14 11:59:42.370 7fa893040700 1 mon.cephdwight-mon-1633994557@0(leader).osd e293947 e293947: 60 total, 60 up, 60 in
2020-01-14 11:59:42.372 7fa893040700 0 log_channel(cluster) log [DBG] : osdmap e293947: 60 total, 60 up, 60 in
2020-01-14 11:59:42.467 7fa88cab9700 0 mon.cephdwight-mon-1633994557@0(leader) e43 handle_command mon_command({"prefix": "df", "format": "json"} v 0) v1
2020-01-14 11:59:42.467 7fa88cab9700 0 log_channel(audit) log [DBG] : from='client.1024331298 128.142.167.20:0/4106550175' entity='client.castorcert' cmd=[{"prefix": "df", "format": "json"}]: dispatch
2020-01-14 11:59:43.104 7fa88f2be700 0 log_channel(cluster) log [WRN] : Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2020-01-14 11:59:43.104 7fa88f2be700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2020-01-14 11:59:43.120 7fa88a274700 1 leveldb: Compacting 1@0 + 3@1 files
2020-01-14 11:59:43.132 7fa893040700 0 mon.cephdwight-mon-1633994557@0(leader).mds e70676 new map
2020-01-14 11:59:43.132 7fa893040700 0 mon.cephdwight-mon-1633994557@0(leader).mds e70676 print_map e70676 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 1 Filesystem 'cephfs' (1) fs_name cephfs epoch 70676 flags 1e created 2016-11-02 13:48:49.499990 modified 2020-01-14 11:59:43.104834 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 min_compat_client -1 (unspecified) last_failure 0 last_failure_osd_epoch 293947 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {} failed 0 damaged stopped 1,2 data_pools [173] metadata_pool 174 inline_data disabled balancer standby_count_wanted 1 Standby daemons: 1024263064: v1:188.184.95.158:6801/1464356353 'cephdwightmds1' mds.-1.0 up:standby seq 1 1024263091: v1:188.184.83.152:6801/7247617 'cephdwightmds0' mds.-1.0 up:standby seq 2 1024303151: [v2:137.138.122.17:6800/2692603670,v1:137.138.122.17:6801/2692603670] 'cephdwightmds2' mds.-1.0 up:standby seq 1
2020-01-14 11:59:43.132 7fa88a274700 1 leveldb: Generated table #16526216: 130 keys, 2153003 bytes
2020-01-14 11:59:43.134 7fa893040700 0 log_channel(cluster) log [DBG] : mds.? [v2:137.138.122.17:6800/2692603670,v1:137.138.122.17:6801/2692603670] up:boot
2020-01-14 11:59:43.134 7fa893040700 1 mon.cephdwight-mon-1633994557@0(leader).mds e70676 taking over failed mds.0 with 1024263064/cephdwightmds1 v1:188.184.95.158:6801/1464356353
2020-01-14 11:59:43.134 7fa893040700 0 log_channel(cluster) log [INF] : Standby daemon mds.cephdwightmds1 assigned to filesystem cephfs as rank 0
2020-01-14 11:59:43.134 7fa893040700 0 log_channel(cluster) log [INF] : Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2020-01-14 11:59:43.136 7fa893040700 0 log_channel(cluster) log [DBG] : fsmap cephfs:0/1 3 up:standby, 1 failed
2020-01-14 11:59:43.144 7fa893040700 0 mon.cephdwight-mon-1633994557@0(leader).mds e70677 new map
2020-01-14 11:59:43.144 7fa893040700 0 mon.cephdwight-mon-1633994557@0(leader).mds e70677 print_map e70677 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 1 Filesystem 'cephfs' (1) fs_name cephfs epoch 70677 flags 1e created 2016-11-02 13:48:49.499990 modified 2020-01-14 11:59:43.135648 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 min_compat_client -1 (unspecified) last_failure 0 last_failure_osd_epoch 293947 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=1024263064} failed damaged stopped 1,2 data_pools [173] metadata_pool 174 inline_data disabled balancer standby_count_wanted 1 1024263064: v1:188.184.95.158:6801/1464356353 'cephdwightmds1' mds.0.70677 up:replay seq 1 Standby daemons: 1024263091: v1:188.184.83.152:6801/7247617 'cephdwightmds0' mds.-1.0 up:standby seq 2 1024303151: [v2:137.138.122.17:6800/2692603670,v1:137.138.122.17:6801/2692603670] 'cephdwightmds2' mds.-1.0 up:standby seq 1
2020-01-14 11:59:43.144 7fa88a274700 1 leveldb: Generated table #16526217: 93 keys, 2104982 bytes
2020-01-14 11:59:43.145 7fa893040700 0 log_channel(cluster) log [DBG] : fsmap cephfs:1/1 {0=cephdwightmds1=up:replay} 2 up:standby
2020-01-14 11:59:43.156 7fa88a274700 1 leveldb: Generated table #16526218: 88 keys, 2107771 bytes
2020-01-14 11:59:43.165 7fa88a274700 1 leveldb: Generated table #16526219: 76 keys, 1610808 bytes
2020-01-14 11:59:43.165 7fa88a274700 1 leveldb: Compacted 1@0 + 3@1 files => 7976564 bytes
2020-01-14 11:59:43.165 7fa88a274700 1 leveldb: compacted to: files[ 0 4 4 1 57 0 0 ]
2020-01-14 11:59:43.165 7fa88a274700 1 leveldb: Delete type=2 #16526190
2020-01-14 11:59:43.166 7fa88a274700 1 leveldb: Delete type=2 #16526191
2020-01-14 11:59:43.166 7fa88a274700 1 leveldb: Delete type=2 #16526192
2020-01-14 11:59:43.167 7fa88a274700 1 leveldb: Delete type=2 #16526214
2020-01-14 11:59:46.618 7fa88cab9700 0 mon.cephdwight-mon-1633994557@0(leader) e43 handle_command mon_command({"prefix": "df", "format": "json"} v 0) v1
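One way to line up the two views Dan describes is to pull the MDSMap epoch the MDS acted on out of its log and the epoch the mon printed out of the mon log. This is a hypothetical helper, run here against abbreviated copies of the log lines above:

```shell
# Abbreviated sample lines taken from the mds and mon logs in this ticket.
cat > /tmp/mds_excerpt.log <<'EOF'
2020-01-14 11:59:41.900 7fb7064aa700 1 mds.cephdwightmds2 Updating MDS map to version 70675 from mon.2
EOF
cat > /tmp/mon_excerpt.log <<'EOF'
2020-01-14 11:59:41.997 7fa8a5028040 0 mon.cephdwight-mon-1633994557@-1(???).mds e70675 print_map
EOF

# Epoch the MDS handled (the map it says removed it) ...
mds_epoch=$(grep -o 'Updating MDS map to version [0-9]*' /tmp/mds_excerpt.log | grep -o '[0-9]*$')
# ... versus the epoch the mon printed (where the MDS is still up:active).
mon_epoch=$(grep -o 'e[0-9]* print_map' /tmp/mon_excerpt.log | grep -o '[0-9]*')
echo "mds saw e${mds_epoch}, mon printed e${mon_epoch}"
```

Both extract e70675, which is exactly the inconsistency being reported: the same epoch looks different from the two daemons' perspectives.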
Updated by Patrick Donnelly over 4 years ago
That's indeed very odd. I looked through the code but didn't find a good reason why this would happen. It is interesting that the MonClient had just reconnected to the monitor but the global ID shouldn't/didn't change. So I'm currently stumped...
Updated by Patrick Donnelly over 4 years ago
- Status changed from Need More Info to New
- Priority changed from Normal to Low
Updated by Dan van der Ster over 4 years ago
We just updated our 2nd cluster to nautilus and saw the exact same mds respawn at the moment we enabled msgr2:
-33> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: handle_monmap mon_map magic: 0 v1
-32> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: got monmap 6 from mon.cephjim-mon-54686b1b98 (according to old e6)
-31> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: dump: epoch 6 fsid 4da6fd06-b069-49af-901f-c9513baabdbd last_changed 2020-01-30 10:46:48.787499 created 2017-11-09 14:53:29.757237 min_mon_release 14 (nautilus) 0: [v2:137.138.32.34:3300/0,v1:137.138.32.34:6789/0] mon.cephjim-mon-decd03f337 1: [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0] mon.cephjim-mon-54686b1b98 2: [v2:137.138.148.253:3300/0,v1:137.138.148.253:6789/0] mon.cephjim-mon-f92526a47a
-30> 2020-01-30 10:46:48.797 7f933fb53700 1 monclient: mon.1 has (v2) addrs [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0] but i'm connected to v1:137.138.76.42:6789/0, reconnecting
-29> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient: _reopen_session rank -1
-28> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): picked mon.cephjim-mon-decd03f337 con 0x5653db389c00 addr [v2:137.138.32.34:3300/0,v1:137.138.32.34:6789/0]
-27> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): picked mon.cephjim-mon-f92526a47a con 0x5653da8b3400 addr [v2:137.138.148.253:3300/0,v1:137.138.148.253:6789/0]
-26> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): picked mon.cephjim-mon-54686b1b98 con 0x5653f93d5c00 addr [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0]
-25> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): start opening mon connection
-24> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): start opening mon connection
-23> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): start opening mon connection
-22> 2020-01-30 10:46:48.797 7f933fb53700 10 monclient(hunting): _renew_subs
-21> 2020-01-30 10:46:48.804 7f933b34a700 2 mds.0.cache Memory usage: total 1199956, rss 764872, heap 332028, baseline 332028, 3332 / 271652 inodes have caps, 4735 caps, 0.0174304 caps per inode
-20> 2020-01-30 10:46:49.016 7f93425e2700 10 monclient(hunting): get_auth_request con 0x5653da8b3400 auth_method 0
-19> 2020-01-30 10:46:49.016 7f93425e2700 10 monclient(hunting): get_auth_request method 2 preferred_modes [1,2]
-18> 2020-01-30 10:46:49.016 7f93425e2700 10 monclient(hunting): _init_auth method 2
-17> 2020-01-30 10:46:49.025 7f93425e2700 10 monclient(hunting): handle_auth_reply_more payload 9
-16> 2020-01-30 10:46:49.025 7f93425e2700 10 monclient(hunting): handle_auth_reply_more payload_len 9
-15> 2020-01-30 10:46:49.025 7f93425e2700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
-14> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient(hunting): handle_auth_done global_id 76134616 payload 1166
-13> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _finish_hunting 0
-12> 2020-01-30 10:46:49.042 7f93425e2700 1 monclient: found mon.cephjim-mon-f92526a47a
-11> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _send_mon_message to mon.cephjim-mon-f92526a47a at v2:137.138.148.253:3300/0
-10> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _send_mon_message to mon.cephjim-mon-f92526a47a at v2:137.138.148.253:3300/0
-9> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _finish_auth 0
-8> 2020-01-30 10:46:49.042 7f93425e2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-01-30 10:46:19.043349)
-7> 2020-01-30 10:46:49.067 7f933fb53700 10 monclient: handle_monmap mon_map magic: 0 v1
-6> 2020-01-30 10:46:49.067 7f933fb53700 10 monclient: got monmap 6 from mon.cephjim-mon-f92526a47a (according to old e6)
-5> 2020-01-30 10:46:49.067 7f933fb53700 10 monclient: dump: epoch 6 fsid 4da6fd06-b069-49af-901f-c9513baabdbd last_changed 2020-01-30 10:46:48.787499 created 2017-11-09 14:53:29.757237 min_mon_release 14 (nautilus) 0: [v2:137.138.32.34:3300/0,v1:137.138.32.34:6789/0] mon.cephjim-mon-decd03f337 1: [v2:137.138.76.42:3300/0,v1:137.138.76.42:6789/0] mon.cephjim-mon-54686b1b98 2: [v2:137.138.148.253:3300/0,v1:137.138.148.253:6789/0] mon.cephjim-mon-f92526a47a
-4> 2020-01-30 10:46:49.096 7f933fb53700 10 monclient: handle_config config(1 keys) v1
-3> 2020-01-30 10:46:49.096 7f933fb53700 1 mds.hpc-be143 Updating MDS map to version 62877 from mon.2
-2> 2020-01-30 10:46:49.096 7f933fb53700 1 mds.hpc-be143 Map removed me (mds.-1 gid:76134616) from cluster due to lost contact; respawning
-1> 2020-01-30 10:46:49.096 7f933fb53700 1 mds.hpc-be143 respawn!
0> 2020-01-30 10:46:49.096 7f933e350700 4 set_mon_vals no callback set
Our upgrade procedure follows the doc:
- upgrade mons to v14.2.6
- upgrade mgrs to v14.2.6
- upgrade osds to v14.2.6
- upgrade mds (standbys stopped, only 1 active) to v14.2.6
- ceph osd require-osd-release nautilus
- ceph mon enable-msgr2 (at this point the active mds respawns)
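The last two steps of the procedure above are single commands; the respawn hit at the moment the second one took effect. A sketch (the `ceph mon dump` check at the end is an assumption about how one might verify, not part of the original procedure):

```shell
# Final steps of the nautilus upgrade; on this cluster the active MDS
# respawned the moment msgr2 was enabled.
ceph osd require-osd-release nautilus
ceph mon enable-msgr2

# Confirm that every mon now advertises a v2 address alongside v1.
ceph mon dump
```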