Project

General

Profile

Actions

Bug #62489

open

testing: did not reconnect to MDS during up:reconnect

Added by Patrick Donnelly 9 months ago. Updated 9 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

This is with the testing branch of the kernel client:

2023-08-09T06:17:03.518 INFO:tasks.workunit.client.0.smithi190.stdout:removed 'linux-2.6.33/arch/mips/include/asm/gpio.h'
2023-08-09T06:17:03.519 INFO:tasks.workunit.client.0.smithi190.stdout:removed 'linux-2.6.33/arch/mips/include/asm/txx9/smsc_fdc37m81x.h'
2023-08-09T06:17:03.519 INFO:tasks.workunit.client.0.smithi190.stdout:removed 'linux-2.6.33/arch/mips/include/asm/txx9/pci.h'
2023-08-09T06:17:03.534 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/rbtx4939.h': Permission denied
2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/rbtx4938.h': Permission denied
2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/dmac.h': Permission denied
2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/spi.h': Permission denied
2023-08-09T06:17:03.535 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/ndfmc.h': Permission denied
2023-08-09T06:17:03.536 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/tx3927.h': Permission denied
2023-08-09T06:17:03.536 INFO:tasks.workunit.client.0.smithi190.stderr:rm: cannot remove 'linux-2.6.33/arch/mips/include/asm/txx9/tx4927.h': Permission denied

From: /teuthology/yuriw-2023-08-09_01:12:27-fs-wip-yuri5-testing-2023-08-08-0807-quincy-distro-default-smithi/7363646/teuthology.log

2023-08-09T06:16:59.925+0000 7f959edf3700  7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2023-08-09T06:16:59.925+0000 7f959edf3700  1 mds.0.server reconnect gives up on client.4670 172.21.15.190:0/1539903001
2023-08-09T06:16:59.925+0000 7f959edf3700  0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi190:0 (4670), after waiting 45.5268 seconds during MDS startup
2023-08-09T06:16:59.925+0000 7f959edf3700  1 mds.0.231 Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001)
2023-08-09T06:16:59.925+0000 7f959edf3700  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001)

From: /teuthology/yuriw-2023-08-09_01:12:27-fs-wip-yuri5-testing-2023-08-08-0807-quincy-distro-default-smithi/7363646/remote/smithi190/log/ceph-mds.a.log.gz


2023-08-09T06:15:03.444653+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start
2023-08-09T06:15:03.454648+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect success
2023-08-09T06:15:06.461667+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] check_new_map: mds0 recovery completed
2023-08-09T06:15:06.801694+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:06.808699+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:07.198654+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:07.205646+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:14.450656+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:14.457647+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:15.154673+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:15.161676+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:15.574667+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:15.582669+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:20.654669+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:20.654812+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:21.235690+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:21.235789+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:22.122658+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:22.122759+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:25.064811+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:25.064906+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:28.509731+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:28.509892+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:30.345701+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:30.345935+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:32.317652+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:32.317723+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:32.615717+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:32.615910+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:33.633708+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:33.633900+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:35.295696+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:35.295798+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:37.146683+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:37.146815+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:37.986682+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:37.986781+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:38.917746+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:38.917850+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:39.712669+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:39.712796+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:43.592528+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:43.600665+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:43.863653+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:43.870656+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:44.932573+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:44.932675+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:48.186689+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:15:48.193676+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:48.450656+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write
2023-08-09T06:15:48.457652+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:48.961677+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write
2023-08-09T06:15:48.968708+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:49.921672+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write
2023-08-09T06:15:49.928681+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:51.905682+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write
2023-08-09T06:15:51.912670+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:15:55.937653+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket error on write
2023-08-09T06:15:55.944651+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:16:04.330510+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 session reset
2023-08-09T06:16:04.330607+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] mds_peer_reset: mds0 closed our session
2023-08-09T06:16:04.330652+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start
2023-08-09T06:16:04.351724+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect denied
2023-08-09T06:16:04.351865+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192920
2023-08-09T06:16:04.373983+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192922
2023-08-09T06:16:04.374060+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192925
2023-08-09T06:16:04.395601+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192927
2023-08-09T06:17:02.501817+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)
2023-08-09T06:17:02.509058+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:17:03.210357+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 session reset
2023-08-09T06:17:03.210449+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] mds_peer_reset: mds0 closed our session
2023-08-09T06:17:03.210482+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start
2023-08-09T06:17:03.220684+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect denied
2023-08-09T06:17:03.230657+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state V2_SESSION_CONNECT)
2023-08-09T06:17:03.245781+00:00 smithi190 kernel: libceph: mon2 (2)172.21.15.193:3301 session established
2023-08-09T06:17:03.510044+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 session blocklisted
2023-08-09T06:17:03.510165+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 rejected session

From: /teuthology/yuriw-2023-08-09_01:12:27-fs-wip-yuri5-testing-2023-08-08-0807-quincy-distro-default-smithi/7363646/remote/smithi190/syslog/kern.log.gz

Actions #1

Updated by Xiubo Li 9 months ago

  • Assignee set to Xiubo Li
Actions #2

Updated by Xiubo Li 9 months ago

The kclient just sent a reconnect request after the mds failovered or restarted and when the kclient received the session reset, not RECONNECT mdsmap:

2023-08-09T06:16:04.330510+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 session reset
2023-08-09T06:16:04.330607+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] mds_peer_reset: mds0 closed our session
2023-08-09T06:16:04.330652+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] send_mds_reconnect: mds0 reconnect start

But the reconnect request was received during the mds was in up:replay state:

2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.0.0 update_log_config log_to_monitors true
2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.0.0 create_logger
2023-08-09T06:16:04.335+0000 7f95a05f6700  7 mds.0.server operator(): full = 0 epoch = 0
2023-08-09T06:16:04.335+0000 7f95a05f6700  4 mds.0.purge_queue operator():  data pool 4 not found in OSDMap
2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.0.server apply_blocklist: killed 0
2023-08-09T06:16:04.335+0000 7f95a05f6700  4 mds.0.0 apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
2023-08-09T06:16:04.335+0000 7f95a05f6700 10 mds.a handle_mds_map: handling map as rank 0
2023-08-09T06:16:04.335+0000 7f95a05f6700  1 mds.0.231 handle_mds_map i am now mds.0.231
2023-08-09T06:16:04.335+0000 7f95a05f6700  1 mds.0.231 handle_mds_map state change up:standby --> up:replay
2023-08-09T06:16:04.335+0000 7f95a05f6700  5 mds.beacon.a set_want_state: up:standby -> up:replay
2023-08-09T06:16:04.335+0000 7f95a05f6700  1 mds.0.231 replay_start
2023-08-09T06:16:04.335+0000 7f95a05f6700  1 mds.0.231  waiting for osdmap 147 (which blocklists prior instance)

...

2023-08-09T06:16:04.338+0000 7f95a2dfb700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.190:6800/84327,v1:172.21.15.190:6801/84327] conn(0x555f83345400 msgr2=0x555f8257d180 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).operator() setting up a delay queue
2023-08-09T06:16:04.338+0000 7f95a2dfb700  1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.190:6800/84327,v1:172.21.15.190:6801/84327] conn(0x555f83345400 0x555f8257d180 crc :-1 s=READY pgs=294 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.0 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2023-08-09T06:16:04.338+0000 7f95a3dfd700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.193:6824/84630,v1:172.21.15.193:6825/84630] conn(0x555f83344000 msgr2=0x555f8257c680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).operator() setting up a delay queue
2023-08-09T06:16:04.338+0000 7f95a3dfd700  1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> [v2:172.21.15.193:6824/84630,v1:172.21.15.193:6825/84630] conn(0x555f83344000 0x555f8257c680 crc :-1 s=READY pgs=264 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.7 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2023-08-09T06:16:04.340+0000 7f95a3dfd700  1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >>  conn(0x555f83354000 0x555f8257d700 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2023-08-09T06:16:04.340+0000 7f95a3dfd700  1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >>  conn(0x555f83354000 0x555f8257d700 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=1 required=1
2023-08-09T06:16:04.341+0000 7f95a3dfd700 10 mds.a parse_caps: parsing auth_cap_str='allow'
2023-08-09T06:16:04.341+0000 7f95a05f6700 10 mds.a  new session 0x555f824a0300 for client.4670 172.21.15.190:0/1539903001 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700 10 mds.a parse_caps: parsing auth_cap_str='allow'
2023-08-09T06:16:04.341+0000 7f95a05f6700 10 mds.a ms_handle_accept 172.21.15.190:0/1539903001 con 0x555f83354000 session 0x555f824a0300
2023-08-09T06:16:04.341+0000 7f95a3dfd700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> 172.21.15.190:0/1539903001 conn(0x555f83354000 msgr2=0x555f8257d700 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).operator() setting up a delay queue
2023-08-09T06:16:04.341+0000 7f95a3dfd700  1 --2- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] >> 172.21.15.190:0/1539903001 conn(0x555f83354000 0x555f8257d700 crc :-1 s=READY pgs=1590 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=client.4670 client_cookie=7136c3bb06ae8991 server_cookie=f3d5a73cd05d2d1 in_seq=0 out_seq=0
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 1 ==== client_request(client.4670:192920 rmdir #0x10000009597/ip32 2023-08-09T06:15:48.151937+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 192+0+0 (crc 0 0 0) 0x555f831b1340 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700  3 mds.0.server not active yet, waiting
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 2 ==== client_request(client.4670:192922 unlink #0x10000009597/gpio.h 2023-08-09T06:15:48.169936+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 194+0+0 (crc 0 0 0) 0x555f831b1600 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700  3 mds.0.server not active yet, waiting
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 3 ==== client_request(client.4670:192925 unlink #0x100000097ee/smsc_fdc37m81x.h 2023-08-09T06:15:48.173936+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 204+0+0 (crc 0 0 0) 0x555f831b18c0 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700  3 mds.0.server not active yet, waiting
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 4 ==== client_request(client.4670:192927 unlink #0x100000097ee/pci.h 2023-08-09T06:15:48.175936+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 193+0+0 (crc 0 0 0) 0x555f831b1b80 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700  3 mds.0.server not active yet, waiting
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 5 ==== client_request(client.4670:192929 unlink #0x100000097ee/rbtx4939.h 2023-08-09T06:15:48.177936+0000 RETRY=1 caller_uid=1000, caller_gid=1275{6,36,1000,1275,}) v6 ==== 198+0+0 (crc 0 0 0) 0x555f8330c000 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700  3 mds.0.server not active yet, waiting
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] <== client.4670 172.21.15.190:0/1539903001 6 ==== client_reconnect(14 caps 3 realms ) v3 ==== 0+0+1105 (crc 0 0 0) 0x555f8325eea0 con 0x555f83354000
2023-08-09T06:16:04.341+0000 7f95a05f6700  7 mds.0.server handle_client_reconnect client.4670
2023-08-09T06:16:04.341+0000 7f95a05f6700 20 mds.0.231 get_session have 0x555f824a0300 client.4670 172.21.15.190:0/1539903001 state closed
2023-08-09T06:16:04.341+0000 7f95a05f6700  0 mds.0.server  ignoring msg from not-open sessionclient_reconnect(14 caps 3 realms ) v3
2023-08-09T06:16:04.341+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> 172.21.15.190:0/1539903001 -- client_session(close) v5 -- 0x555f82480a80 con 0x555f83354000

Just before the sessionmap being loaded and replayed and the mds just create a new session, which the state was STATE_CLOSED, and then rejected the session connection as above. And just after this the sessionmap journal logs was load and replayed, but it's too late:

2023-08-09T06:16:04.447+0000 7f959a5ea700 10 MDSIOContextBase::complete: N12_GLOBAL__N_112C_IO_SM_LoadE
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 MDSContext::complete: N12_GLOBAL__N_112C_IO_SM_LoadE
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap _load_finish loaded version 62854
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap _load_finish: omap load complete
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap _load_finish: v 62854, 1 sessions
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap dump
2023-08-09T06:16:04.447+0000 7f959a5ea700 10 mds.0.sessionmap client.4670 0x555f83366500 state open completed {182651=0x0} free_prealloc_inos [0x1000000f4cb~0x6e,0x1000000f560~0xfb,0x1000000f673~0x3d0] delegated_inos []

And then in kclient it just closed and then removed the session:

2023-08-09T06:16:04.351724+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] handle_session: mds0 reconnect denied
2023-08-09T06:16:04.351865+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192920
2023-08-09T06:16:04.373983+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192922
2023-08-09T06:16:04.374060+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192925
2023-08-09T06:16:04.395601+00:00 smithi190 kernel: ceph: [45034a4b-ab2e-4a3a-9945-7e70a3231d64 4670] cleanup_session_requests:  dropping unsafe request 192927
2023-08-09T06:17:02.501817+00:00 smithi190 kernel: libceph: mds0 (2)172.21.15.190:6834 socket closed (con state OPEN)

And then when the RECONNECT mdsmap came, the kclient will skip reconnecting it and then mds timedout waiting reconnection:

2023-08-09T06:16:57.688+0000 7f95a2dfb700  5 mds.beacon.a received beacon reply up:reconnect seq 16 rtt 0
2023-08-09T06:16:58.338+0000 7f959edf3700 20 mds.0.231 get_task_status
2023-08-09T06:16:58.338+0000 7f959edf3700 20 mds.0.231 schedule_update_timer_task
2023-08-09T06:16:58.370+0000 7f959c5ee700  2 mds.0.cache Memory usage:  total 477004, rss 168768, heap 207104, baseline 182528, 0 / 24084 inodes have caps, 0 caps, 0 caps per inode
2023-08-09T06:16:58.370+0000 7f959c5ee700 10 mds.0.cache cache not ready for trimming
2023-08-09T06:16:58.370+0000 7f959c5ee700 10 mds.0.cache releasing free memory
2023-08-09T06:16:58.370+0000 7f959c5ee700 20 mds.0.cache upkeep thread waiting interval 1.000000000s
2023-08-09T06:16:59.370+0000 7f959c5ee700  2 mds.0.cache Memory usage:  total 477004, rss 168768, heap 207104, baseline 182528, 0 / 24084 inodes have caps, 0 caps, 0 caps per inode
2023-08-09T06:16:59.370+0000 7f959c5ee700 10 mds.0.cache cache not ready for trimming
2023-08-09T06:16:59.370+0000 7f959c5ee700 20 mds.0.cache upkeep thread waiting interval 1.000000000s
2023-08-09T06:16:59.925+0000 7f959edf3700  7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2023-08-09T06:16:59.925+0000 7f959edf3700  1 mds.0.server reconnect gives up on client.4670 172.21.15.190:0/1539903001
2023-08-09T06:16:59.925+0000 7f959edf3700  0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi190:0 (4670), after waiting 45.5268 seconds during MDS startup
2023-08-09T06:16:59.925+0000 7f959edf3700  1 mds.0.231 Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001)
2023-08-09T06:16:59.925+0000 7f959edf3700  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4670 (172.21.15.190:0/1539903001)
2023-08-09T06:16:59.925+0000 7f959edf3700  4 mds.0.231 Preparing blocklist command... (wait=0)
2023-08-09T06:16:59.925+0000 7f959edf3700  4 mds.0.231 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.190:0/1539903001"}
2023-08-09T06:16:59.925+0000 7f959edf3700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] -- mon_command({"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.190:0/1539903001"} v 0) v1 -- 0x555f83323200 con 0x555f831b3000
2023-08-09T06:16:59.925+0000 7f959edf3700  1 mds.0.server reconnect will complete once clients are evicted
2023-08-09T06:16:59.929+0000 7f959f5f4700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] -- log(2 entries from seq 1 at 2023-08-09T06:16:59.927048+0000) v1 -- 0x555f824808c0 con 0x555f831b3000
...
2023-08-09T06:17:00.376+0000 7f95a05f6700  7 mds.0.server operator(): full = 0 epoch = 148
2023-08-09T06:17:00.376+0000 7f95a05f6700 10 mds.0.server kill_session 0x555f83366500
2023-08-09T06:17:00.376+0000 7f95a05f6700 10 mds.0.server journal_close_session : client.4670 172.21.15.190:0/1539903001 pending_prealloc_inos [] free_prealloc_inos [] delegated_inos [0x1000000f4cb~0x6e,0x1000000f560~0xfb,0x1000000f673~0x3d0]
2023-08-09T06:17:00.376+0000 7f95a05f6700 20 mds.0.sessionmap mark_projected s=0x555f83366500 name=client.4670 pv=62855 -> 62856
2023-08-09T06:17:00.376+0000 7f95a05f6700 10 mds.0.server apply_blocklist: killed 1
2023-08-09T06:17:00.376+0000 7f95a05f6700  4 mds.0.231 apply_blocklist: killed 1, blocklisted sessions (1 blocklist entries, 1)
2023-08-09T06:17:00.376+0000 7f95a05f6700  4 mds.0.231 set_osd_epoch_barrier: epoch=148
2023-08-09T06:17:00.376+0000 7f95a05f6700  1 -- [v2:172.21.15.190:6834/4117966589,v1:172.21.15.190:6835/4117966589] --> [v2:172.21.15.190:3300/0,v1:172.21.15.190:6789/0] -- mon_subscribe({osdmap=149}) v3 -- 0x555f83265380 con 0x555f831b3000

To fix this in kclient we should skip reconnecting if the mds is not ready.

Actions #3

Updated by Xiubo Li 9 months ago

  • Status changed from New to Fix Under Review
Actions

Also available in: Atom PDF