Bug #62489
opentesting: did not reconnect to MDS during up:reconnect
0%
Description
This is with the testing branch of the kernel client:
2023-08-09T06:17:03.518 INFO:tasks.workunit.client.0.smithi190.stdout:removed 'linux-2.6.33/arch/mips/include/asm/gpio.h' 2023-08-09T06:17:03.519 INFO:tasks.workunit.client.0.smithi190.stdout:removed 'linux-2.6.33/arch/mips/include/asm/txx9/smsc_fdc37m81x.h' 2023-08-09T06:17:03.519 INFO:tasks.workunit.client.0.smithi190.stdout:removed 'linux-2.6.33/arch/mips/include/asm/txx9/pci.h' 2023-08-09T06:17:03.534 DEBUG:teuthology.orchestra.run:got remote process result: 1 2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/rbtx4939.h': Permission denied 2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/rbtx4938.h': Permission denied 2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/dmac.h': Permission denied 2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/spi.h': Permission denied 2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/ndfmc.h': Permission denied 2023-08-09T06:17:03.536 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/tx3927.h': Permission denied 2023-08-09T06:17:03.536 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/tx4927.h': Permission denied
From: /teuthology/yuriw-2023-08-09_01:12:27-fs-wip-yuri5-testing-2023-08-08-0807-quincy-distro-default-smithi/7363646/teuthology.log
2023-08-09T06:16:59.925+0000 7f959edf3700 7 mds.0.server reconnect timed out, 1 clients have not reconnected in time 2023-08-09T06:16:59.925+0000 7f959edf3700 1 mds.0.server reconnect gives up on client.4670 172.21.15.190:0/1539903001 2023-08-09T06:16:59.925+0000 7f959edf3700 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi190:0 (4670), after waiting 45.5268 seconds during MDS startup 2023-08-09T06:16:59.925+0000 7f959edf3700 1 mds.0.231 Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001) 2023-08-09T06:16:59.925+0000 7f959edf3700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001)
From: /teuthology/yuriw-2023-08-09_01:12:27-fs-wip-yuri5-testing-2023-08-08-0807-quincy-distro-default-smithi/7363646/remote/smithi190/log/ceph-mds.a.log.gz
2023-08-09T06:15:03.444653+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start 2023-08-09T06:15:03.454648+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect success 2023-08-09T06:15:06.461667+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] check_new_map: mds0 recovery completed 2023-08-09T06:15:06.801694+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:06.808699+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:07.198654+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:07.205646+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:14.450656+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:14.457647+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:15.154673+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:15.161676+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:15.574667+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:15.582669+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:20.654669+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:20.654812+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:21.235690+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:21.235789+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:22.122658+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:22.122759+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:25.064811+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:25.064906+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:28.509731+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:28.509892+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:30.345701+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:30.345935+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:32.317652+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:32.317723+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:32.615717+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:32.615910+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:33.633708+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:33.633900+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:35.295696+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:35.295798+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:37.146683+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:37.146815+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:37.986682+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:37.986781+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:38.917746+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:38.917850+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:39.712669+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:39.712796+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:43.592528+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:43.600665+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:43.863653+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:43.870656+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:44.932573+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:44.932675+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:48.186689+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:15:48.193676+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:48.450656+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write 2023-08-09T06:15:48.457652+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:48.961677+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write 2023-08-09T06:15:48.968708+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:49.921672+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write 2023-08-09T06:15:49.928681+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:51.905682+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write 2023-08-09T06:15:51.912670+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:15:55.937653+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write 2023-08-09T06:15:55.944651+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:16:04.330510+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 session reset 2023-08-09T06:16:04.330607+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] mds_peer_reset: mds0 closed our session 2023-08-09T06:16:04.330652+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start 2023-08-09T06:16:04.351724+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect denied 2023-08-09T06:16:04.351865+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192920 2023-08-09T06:16:04.373983+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192922 2023-08-09T06:16:04.374060+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192925 2023-08-09T06:16:04.395601+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192927 2023-08-09T06:17:02.501817+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN) 2023-08-09T06:17:02.509058+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:17:03.210357+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 session reset 2023-08-09T06:17:03.210449+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] mds_peer_reset: mds0 closed our session 2023-08-09T06:17:03.210482+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start 2023-08-09T06:17:03.220684+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect denied 2023-08-09T06:17:03.230657+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state V2_SESSION_CONNECT) 2023-08-09T06:17:03.245781+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established 2023-08-09T06:17:03.510044+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 session blocklisted 2023-08-09T06:17:03.510165+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 rejected session
From: /teuthology/yuriw-2023-08-09_01:12:27-fs-wip-yuri5-testing-2023-08-08-0807-quincy-distro-default-smithi/7363646/remote/smithi190/syslog/kern.log.gz
Updated by Xiubo Li 9 months ago
The kclient just sent a reconnect request after the mds failovered or restarted and when the kclient received the session reset, not RECONNECT mdsmap:
2023-08-09T06:16:04.330510+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 session reset 2023-08-09T06:16:04.330607+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] mds_peer_reset: mds0 closed our session 2023-08-09T06:16:04.330652+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start
But the reconnect request was received during the mds was in up:replay state:
2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.0.0 update_log_config log_to_monitors true 2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.0.0 create_logger 2023-08-09T06:16:04.335+0000 7f95a05f6700 7 mds.0.server operator(): full = 0 epoch = 0 2023-08-09T06:16:04.335+0000 7f95a05f6700 4 mds.0.purge_queue operator(): data pool 4 not found in OSDMap 2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.0.server apply_blocklist: killed 0 2023-08-09T06:16:04.335+0000 7f95a05f6700 4 mds.0.0 apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0) 2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.a handle_mds_map: handling map as rank 0 2023-08-09T06:16:04.335+0000 7f95a05f6700 1 mds.0.231 handle_mds_map i am now mds.0.231 2023-08-09T06:16:04.335+0000 7f95a05f6700 1 mds.0.231 handle_mds_map state change up:standby --> up:replay 2023-08-09T06:16:04.335+0000 7f95a05f6700 5 mds.beacon.a set_want_state: up:standby -> up:replay 2023-08-09T06:16:04.335+0000 7f95a05f6700 1 mds.0.231 replay_start 2023-08-09T06:16:04.335+0000 7f95a05f6700 1 mds.0.231 waiting for osdmap 147 (which blocklists prior instance) ... 2023-08-09T06:16:04.338+0000 7f95a2dfb700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.190:6800/84327,v1:172.21.15.190:6801/84327] conn(0x555f83345400 msgr2=0x555f8257d180 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).operator() setting up a delay queue 2023-08-09T06:16:04.338+0000 7f95a2dfb700 1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.190:6800/84327,v1:172.21.15.190:6801/84327] conn(0x555f83345400 0x555f8257d180 crc :-1 s=READY pgs=294 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.0 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0 2023-08-09T06:16:04.338+0000 7f95a3dfd700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.193:6824/84630,v1:172.21.15.193:6825/84630] conn(0x555f83344000 msgr2=0x555f8257c680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).operator() setting up a delay queue 2023-08-09T06:16:04.338+0000 7f95a3dfd700 1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.193:6824/84630,v1:172.21.15.193:6825/84630] conn(0x555f83344000 0x555f8257c680 crc :-1 s=READY pgs=264 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.7 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0 2023-08-09T06:16:04.340+0000 7f95a3dfd700 1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> conn(0x555f83354000 0x555f8257d700 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept 2023-08-09T06:16:04.340+0000 7f95a3dfd700 1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> conn(0x555f83354000 0x555f8257d700 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=1 required=1 2023-08-09T06:16:04.341+0000 7f95a3dfd700 10 mds.a parse_caps: parsing auth_cap_str='allow' 2023-08-09T06:16:04.341+0000 7f95a05f6700 10 mds.a new session 0x555f824a0300 for client.4670 172.21.15.190:0/1539903001 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 10 mds.a parse_caps: parsing auth_cap_str='allow' 2023-08-09T06:16:04.341+0000 7f95a05f6700 10 mds.a ms_handle_accept 172.21.15.190:0/1539903001 con 0x555f83354000 session 0x555f824a0300 2023-08-09T06:16:04.341+0000 7f95a3dfd700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> 172.21.15.190:0/1539903001 conn(0x555f83354000 msgr2=0x555f8257d700 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).operator() setting up a delay queue 2023-08-09T06:16:04.341+0000 7f95a3dfd700 1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> 172.21.15.190:0/1539903001 conn(0x555f83354000 0x555f8257d700 crc :-1 s=READY pgs=1590 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=client.4670 client_cookie=7136c3bb06ae8991 server_cookie=f3d5a73cd05d2d1 in_seq=0 out_seq=0 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 1 ==== client_request(client.4670:192920 rmdir #0x10000009597/ip32 2023-08-09T06:15:48.151937+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 192+0+0 (crc 0 0 0) 0x555f831b1340 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 3 mds.0.server not active yet, waiting 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 2 ==== client_request(client.4670:192922 unlink #0x10000009597/gpio.h 2023-08-09T06:15:48.169936+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 194+0+0 (crc 0 0 0) 0x555f831b1600 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 3 mds.0.server not active yet, waiting 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 3 ==== client_request(client.4670:192925 unlink #0x100000097ee/smsc_fdc37m81x.h 2023-08-09T06:15:48.173936+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 204+0+0 (crc 0 0 0) 0x555f831b18c0 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 3 mds.0.server not active yet, waiting 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 4 ==== client_request(client.4670:192927 unlink #0x100000097ee/pci.h 2023-08-09T06:15:48.175936+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 193+0+0 (crc 0 0 0) 0x555f831b1b80 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 3 mds.0.server not active yet, waiting 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 5 ==== client_request(client.4670:192929 unlink #0x100000097ee/rbtx4939.h 2023-08-09T06:15:48.177936+0000 RETRY=1 caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 198+0+0 (crc 0 0 0) 0x555f8330c000 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 3 mds.0.server not active yet, waiting 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 6 ==== client_reconnect(14 caps 3 realms ) v3 ==== 0+0+1105 (crc 0 0 0) 0x555f8325eea0 con 0x555f83354000 2023-08-09T06:16:04.341+0000 7f95a05f6700 7 mds.0.server handle_client_reconnect client.4670 2023-08-09T06:16:04.341+0000 7f95a05f6700 20 mds.0.231 get_session have 0x555f824a0300 client.4670 172.21.15.190:0/1539903001 state closed 2023-08-09T06:16:04.341+0000 7f95a05f6700 0 mds.0.server ignoring msg from not-open sessionclient_reconnect(14 caps 3 realms ) v3 2023-08-09T06:16:04.341+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> 172.21.15.190:0/1539903001 -- client_session(close) v5 -- 0x555f82480a80 con 0x555f83354000
Just before the sessionmap being loaded and replayed and the mds just create a new session, which the state was STATE_CLOSED, and then rejected the session connection as above. And just after this the sessionmap journal logs was load and replayed, but it's too late:
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 MDSIOContextBase::complete: N12_GLOBAL__N_112C_IO_SM_LoadE 2023-08-09T06:16:04.447+0000 7f959a5ea700 10 MDSContext::complete: N12_GLOBAL__N_112C_IO_SM_LoadE 2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap _load_finish loaded version 62854 2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap _load_finish: omap load complete 2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap _load_finish: v 62854, 1 sessions 2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap dump 2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap client.4670 0x555f83366500 state open completed {182651=0x0} free_prealloc_inos [0x1000000f4cb~0x6e,0x1000000f560~0xfb,0x1000000f673~0x3d0] delegated_inos []
And then in kclient it just closed and then removed the session:
2023-08-09T06:16:04.351724+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect denied 2023-08-09T06:16:04.351865+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192920 2023-08-09T06:16:04.373983+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192922 2023-08-09T06:16:04.374060+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192925 2023-08-09T06:16:04.395601+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests: dropping unsafe request 192927 2023-08-09T06:17:02.501817+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
And then when the RECONNECT mdsmap came, the kclient will skip reconnecting it and then mds timedout waiting reconnection:
2023-08-09T06:16:57.688+0000 7f95a2dfb700 5 mds.beacon.a received beacon reply up:reconnect seq 16 rtt 0 2023-08-09T06:16:58.338+0000 7f959edf3700 20 mds.0.231 get_task_status 2023-08-09T06:16:58.338+0000 7f959edf3700 20 mds.0.231 schedule_update_timer_task 2023-08-09T06:16:58.370+0000 7f959c5ee700 2 mds.0.cache Memory usage: total 477004, rss 168768, heap 207104, baseline 182528, 0 / 24084 inodes have caps, 0 caps, 0 caps per inode 2023-08-09T06:16:58.370+0000 7f959c5ee700 10 mds.0.cache cache not ready for trimming 2023-08-09T06:16:58.370+0000 7f959c5ee700 10 mds.0.cache releasing free memory 2023-08-09T06:16:58.370+0000 7f959c5ee700 20 mds.0.cache upkeep thread waiting interval 1.000000000s 2023-08-09T06:16:59.370+0000 7f959c5ee700 2 mds.0.cache Memory usage: total 477004, rss 168768, heap 207104, baseline 182528, 0 / 24084 inodes have caps, 0 caps, 0 caps per inode 2023-08-09T06:16:59.370+0000 7f959c5ee700 10 mds.0.cache cache not ready for trimming 2023-08-09T06:16:59.370+0000 7f959c5ee700 20 mds.0.cache upkeep thread waiting interval 1.000000000s 2023-08-09T06:16:59.925+0000 7f959edf3700 7 mds.0.server reconnect timed out, 1 clients have not reconnected in time 2023-08-09T06:16:59.925+0000 7f959edf3700 1 mds.0.server reconnect gives up on client.4670 172.21.15.190:0/1539903001 2023-08-09T06:16:59.925+0000 7f959edf3700 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi190:0 (4670), after waiting 45.5268 seconds during MDS startup 2023-08-09T06:16:59.925+0000 7f959edf3700 1 mds.0.231 Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001) 2023-08-09T06:16:59.925+0000 7f959edf3700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001) 2023-08-09T06:16:59.925+0000 7f959edf3700 4 mds.0.231 Preparing blocklist command... (wait=0) 2023-08-09T06:16:59.925+0000 7f959edf3700 4 mds.0.231 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.190:0/1539903001"} 2023-08-09T06:16:59.925+0000 7f959edf3700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] -- mon_command({"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.190:0/1539903001"} v 0) v1 -- 0x555f83323200 con 0x555f831b3000 2023-08-09T06:16:59.925+0000 7f959edf3700 1 mds.0.server reconnect will complete once clients are evicted 2023-08-09T06:16:59.929+0000 7f959f5f4700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] -- log(2 entries from seq 1 at 2023-08-09T06:16:59.927048+0000) v1 -- 0x555f824808c0 con 0x555f831b3000 ... 2023-08-09T06:17:00.376+0000 7f95a05f6700 7 mds.0.server operator(): full = 0 epoch = 148 2023-08-09T06:17:00.376+0000 7f95a05f6700 10 mds.0.server kill_session 0x555f83366500 2023-08-09T06:17:00.376+0000 7f95a05f6700 10 mds.0.server journal_close_session : client.4670 172.21.15.190:0/1539903001 pending_prealloc_inos [] free_prealloc_inos [] delegated_inos [0x1000000f4cb~0x6e,0x1000000f560~0xfb,0x1000000f673~0x3d0] 2023-08-09T06:17:00.376+0000 7f95a05f6700 20 mds.0.sessionmap mark_projected s=0x555f83366500 name=client.4670 pv=62855 -> 62856 2023-08-09T06:17:00.376+0000 7f95a05f6700 10 mds.0.server apply_blocklist: killed 1 2023-08-09T06:17:00.376+0000 7f95a05f6700 4 mds.0.231 apply_blocklist: killed 1, blocklisted sessions (1 blocklist entries, 1) 2023-08-09T06:17:00.376+0000 7f95a05f6700 4 mds.0.231 set_osd_epoch_barrier: epoch=148 2023-08-09T06:17:00.376+0000 7f95a05f6700 1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] -- mon_subscribe({osdmap=149}) v3 -- 0x555f83265380 con 0x555f831b3000
To fix this in kclient we should skip reconnecting if the mds is not ready.