Bug #61843
openkclient: was evicted without responding for 300 seconds
Description
Venky reported it in another tracker days ago: https://tracker.ceph.com/issues/61394#note-11
The qa teuthology test failed due to:
failure_reason: '"2023-06-20T14:05:30.304943+0000 mds.c (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi202 (6105), after 305.619 seconds" in cluster log'
The kclient client.6105 was blocklisted.
From remote/smithi202/syslog/kern.log.gz, we can see that the session of client.6105 was closed and then rejected:
Jun 20 14:00:24 smithi202 kernel: libceph: mon0 (1)172.21.15.33:6789 session established
Jun 20 14:00:24 smithi202 kernel: libceph: client6105 fsid 054ad731-bfaf-4587-9674-7b36e070bce0
Jun 20 14:00:24 smithi202 kernel: ceph: test_dummy_encryption mode enabled
Jun 20 14:00:25 smithi202 sudo[185381]: ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/logrotate /etc/logrotate.d/ceph-test.conf
Jun 20 14:00:25 smithi202 sudo[185381]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
Jun 20 14:00:25 smithi202 sudo[185381]: pam_unix(sudo:session): session closed for user root
Jun 20 14:00:25 smithi202 systemd[1]: fstests-generic-460.scope: Deactivated successfully.
Jun 20 14:00:25 smithi202 systemd[1]: fstests-generic-460.scope: Consumed 1.351s CPU time.
Jun 20 14:00:25 smithi202 systemd[1]: tmp-tmp.NXtTPrQZ4vtest.mount: Deactivated successfully.
Jun 20 14:00:25 smithi202 systemd[1]: tmp-tmp.UR7IGvggwFscratch.mount: Deactivated successfully.
Jun 20 14:00:55 smithi202 sudo[185389]: ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/logrotate /etc/logrotate.d/ceph-test.conf
Jun 20 14:00:55 smithi202 sudo[185389]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
Jun 20 14:00:55 smithi202 sudo[185389]: pam_unix(sudo:session): session closed for user root
Jun 20 14:04:52 smithi202 sudo[185395]: ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/logrotate /etc/logrotate.d/ceph-test.conf
Jun 20 14:04:52 smithi202 sudo[185395]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
Jun 20 14:04:52 smithi202 sudo[185395]: pam_unix(sudo:session): session closed for user root
Jun 20 14:05:01 smithi202 CRON[185398]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 20 14:05:01 smithi202 CRON[185399]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 20 14:05:01 smithi202 CRON[185398]: pam_unix(cron:session): session closed for user root
Jun 20 14:05:30 smithi202 kernel: libceph: mds0 (1)172.21.15.33:6837 socket closed (con state OPEN)
Jun 20 14:05:30 smithi202 kernel: ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] check_session_state: mds0 hung
Jun 20 14:05:30 smithi202 kernel: libceph: mds0 (1)172.21.15.33:6837 session reset
Jun 20 14:05:30 smithi202 kernel: ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] mds_peer_reset: mds0 closed our session
Jun 20 14:05:30 smithi202 kernel: ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] send_mds_reconnect: mds0 reconnect start
Jun 20 14:05:30 smithi202 kernel: ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] handle_session: mds0 reconnect denied
Jun 20 14:05:30 smithi202 root[185434]: run xfstest generic/462
The xfstests-dev test case in question should be generic/460, and the xfstests-dev tool had successfully deactivated the scratch mount, but it seems the kclient didn't finish the unmounting in time.
From the teuthology.log we can see that the generic/460 succeeded, but it took 306 seconds:
2023-06-20T14:00:20.460 INFO:teuthology.orchestra.run.smithi202.stdout:generic/451 31s
2023-06-20T14:00:22.305 INFO:teuthology.orchestra.run.smithi202.stdout:generic/452 2s
2023-06-20T14:00:22.678 INFO:teuthology.orchestra.run.smithi202.stdout:generic/453 [not run] ceph does not allow unrestricted byte streams for names
2023-06-20T14:00:23.022 INFO:teuthology.orchestra.run.smithi202.stdout:generic/454 [not run] ceph does not allow unrestricted byte streams for names
2023-06-20T14:00:23.361 INFO:teuthology.orchestra.run.smithi202.stdout:generic/456 [not run] xfs_io falloc failed (old kernel/wrong fs?)
2023-06-20T14:00:24.217 INFO:teuthology.orchestra.run.smithi202.stdout:generic/458 [not run] Reflink not supported by scratch filesystem type: ceph
2023-06-20T14:00:25.160 DEBUG:teuthology.orchestra.run.smithi033:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:00:25.163 DEBUG:teuthology.orchestra.run.smithi045:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:00:25.165 DEBUG:teuthology.orchestra.run.smithi202:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:00:55.249 DEBUG:teuthology.orchestra.run.smithi033:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:00:55.251 DEBUG:teuthology.orchestra.run.smithi045:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:00:55.254 DEBUG:teuthology.orchestra.run.smithi202:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:00:55.303 INFO:tasks.ceph.osd.2.smithi033.stderr:2023-06-20T14:00:55.302+0000 7fc573687640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47901) UID: 0
2023-06-20T14:00:55.303 INFO:tasks.ceph.osd.0.smithi033.stderr:2023-06-20T14:00:55.302+0000 7ff9a4b52640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47901) UID: 0
2023-06-20T14:00:55.304 INFO:tasks.ceph.osd.3.smithi033.stderr:2023-06-20T14:00:55.302+0000 7fcc53edf640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47901) UID: 0
2023-06-20T14:00:55.305 INFO:tasks.ceph.osd.1.smithi033.stderr:2023-06-20T14:00:55.302+0000 7fc2e95e0640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47901) UID: 0
2023-06-20T14:04:52.099 DEBUG:teuthology.orchestra.run.smithi033:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:04:52.102 DEBUG:teuthology.orchestra.run.smithi045:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:04:52.104 DEBUG:teuthology.orchestra.run.smithi202:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-06-20T14:04:52.116 INFO:tasks.ceph.osd.6.smithi045.stderr:2023-06-20T14:04:52.114+0000 7f9fa0df4640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 43452) UID: 0
2023-06-20T14:04:52.116 INFO:tasks.ceph.osd.7.smithi045.stderr:2023-06-20T14:04:52.114+0000 7f9307c81640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 43452) UID: 0
2023-06-20T14:04:52.117 INFO:tasks.ceph.osd.5.smithi045.stderr:2023-06-20T14:04:52.114+0000 7f5dc749b640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 43452) UID: 0
2023-06-20T14:04:52.117 INFO:tasks.ceph.osd.4.smithi045.stderr:2023-06-20T14:04:52.114+0000 7f7008e46640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 43452) UID: 0
2023-06-20T14:04:52.128 INFO:tasks.ceph.osd.1.smithi033.stderr:2023-06-20T14:04:52.127+0000 7fc2e95e0640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47916) UID: 0
2023-06-20T14:04:52.129 INFO:tasks.ceph.osd.0.smithi033.stderr:2023-06-20T14:04:52.127+0000 7ff9a4b52640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47916) UID: 0
2023-06-20T14:04:52.129 INFO:tasks.ceph.osd.3.smithi033.stderr:2023-06-20T14:04:52.127+0000 7fcc53edf640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47916) UID: 0
2023-06-20T14:04:52.130 INFO:tasks.ceph.osd.2.smithi033.stderr:2023-06-20T14:04:52.127+0000 7fc573687640 -1 received signal: Hangup from killall ceph-osd -1 (PID: 47916) UID: 0
2023-06-20T14:05:30.945 INFO:teuthology.orchestra.run.smithi202.stdout:generic/460 306s
2023-06-20T14:05:31.476 INFO:teuthology.orchestra.run.smithi202.stdout:generic/462 [not run] mount 172.21.15.33:6789:/scratch with dax failed
2023-06-20T14:05:31.908 INFO:teuthology.orchestra.run.smithi202.stdout:generic/463 [not run] Reflink not supported by test filesystem type: ceph
There are no kclient logs, so we can't tell directly from the logs what happened on the client side.
While from the remote/smithi033/log/ceph-mds.c.log.4.gz logs, we can see that just after the scratch mount was deactivated the MDS didn't receive any ping or request from client.6105 for around 300 seconds:
2023-06-20T14:00:25.931+0000 7fa2d6058640 1 -- [v2:172.21.15.33:6834/1215596373,v1:172.21.15.33:6837/1215596373] <== client.6105 v1:172.21.15.202:0/2228992724 20 ==== client_caps(update ino 0x1000000b7f6 1 seq 15 tid 16 caps=pAsxXsxFsxcrwb dirty=Fw wanted=pAsxXsxFsxcrwb follows 1 size 1073741824/0 mtime 2023-06-20T14:00:25.703292+0000 ctime 2023-06-20T14:00:25.703292+0000 change_attr 1024 xattrs(v=4171204432 l=0)) v12 ==== 316+0+0 (unknown 3999766560 0 0) 0x5606c4a66380 con 0x5606bd6fd800
...
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c new session 0x5606c10a6500 for client.6105 v1:172.21.15.202:0/2228992724 con 0x5606bd6fd800
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c parse_caps: parsing auth_cap_str='allow *'
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c ms_handle_accept v1:172.21.15.202:0/2228992724 con 0x5606bd6fd800 session 0x5606c10a6500
2023-06-20T14:05:30.842+0000 7fa2d6058640 1 -- [v2:172.21.15.33:6834/1215596373,v1:172.21.15.33:6837/1215596373] <== client.6105 v1:172.21.15.202:0/2228992724 1 ==== client_reconnect(1 caps 1 realms ) v3 ==== 0+0+98 (unknown 0 0 3797099533) 0x5606bd4aaea0 con 0x5606bd6fd800
While reading the kclient code I found that when unmounting it sets mdsc->stopping to CEPH_MDSC_STOPPING_BEGIN (and later CEPH_MDSC_STOPPING_FLUSHED), after which the tick thread is stopped. This should be why mds.0 didn't receive any renew caps request for 300 seconds.
Just after setting mdsc->stopping, the client tries to flush the dirty buffers to RADOS before closing all the MDS sessions. That means if RADOS can't respond in time, the kclient stops pinging the MDS while keeping the sessions open.
It's a bug in the kclient and should be fixed.
Updated by Xiubo Li 11 months ago
Flushing the dirty buffers may take a long time if RADOS is overloaded or there is a network issue. So we should keep pinging the MDSs to stay alive, else the MDS will blocklist us.
The following is one case where, running the ffsb stress test, the dirty-buffer flush blocked for more than two minutes on my VM, on which the disk space is a bit low:
[xiubli@ceph modules]$ sudo cat /proc/kmsg
<4>[54094.501840] hrtimer: interrupt took 2449200 ns
<3>[55176.534510] INFO: task ffsb:13911 blocked for more than 122 seconds.
<3>[55176.534521] Tainted: G W 6.4.0+ #8
<3>[55176.534526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[55176.534529] task:ffsb state:D stack:0 pid:13911 ppid:10993 flags:0x00004002
<6>[55176.534537] Call Trace:
<6>[55176.534541] <TASK>
<6>[55176.534547] __schedule+0x50e/0xde0
<6>[55176.534558] ? __pfx___schedule+0x10/0x10
<6>[55176.534565] ? rcu_is_watching+0x1f/0x50
<6>[55176.534572] ? lock_acquire+0xc7/0x3e0
<6>[55176.534579] ? __pfx_lock_acquire+0x10/0x10
<6>[55176.534585] ? do_raw_spin_trylock+0xd0/0x120
<6>[55176.534592] ? __pfx_do_raw_spin_trylock+0x10/0x10
<6>[55176.534598] ? rcu_is_watching+0x1f/0x50
<6>[55176.534605] schedule+0x8e/0x120
<6>[55176.534612] io_schedule+0x6d/0xa0
<6>[55176.534617] folio_wait_bit_common+0x243/0x590
<6>[55176.534635] ? __pfx_folio_wait_bit_common+0x10/0x10
<6>[55176.534641] ? filemap_get_folios_tag+0x4a1/0x5b0
<6>[55176.534649] ? __pfx_wake_page_function+0x10/0x10
<6>[55176.534656] ? rcu_is_watching+0x1f/0x50
<6>[55176.534662] ? lockdep_hardirqs_on_prepare+0xc/0x60
<6>[55176.534669] folio_wait_writeback+0x1b/0x110
<6>[55176.534677] __filemap_fdatawait_range+0xcc/0x170
<6>[55176.534683] ? __pfx___filemap_fdatawait_range+0x10/0x10
<6>[55176.534689] ? wait_sb_inodes+0xb9/0x520
<6>[55176.534696] ? rcu_is_watching+0x1f/0x50
<6>[55176.534701] ? lock_acquire+0xc7/0x3e0
<6>[55176.534710] ? __list_del_entry_valid+0x66/0x100
<6>[55176.534722] filemap_fdatawait_keep_errors+0x1a/0x30
<6>[55176.534729] wait_sb_inodes+0x2de/0x520
<6>[55176.534736] ? __pfx_wait_sb_inodes+0x10/0x10
<6>[55176.534741] ? __pfx_autoremove_wake_function+0x10/0x10
<6>[55176.534749] ? __up_write+0x184/0x2d0
<6>[55176.534757] sync_inodes_sb+0x14d/0x1a0
<6>[55176.534764] ? __pfx_sync_inodes_sb+0x10/0x10
<6>[55176.534770] ? rcu_is_watching+0x1f/0x50
<6>[55176.534780] ? __pfx_sync_inodes_one_sb+0x10/0x10
<6>[55176.534785] iterate_supers+0xbe/0x140
<6>[55176.534793] ksys_sync+0x84/0x110
<6>[55176.534799] ? __pfx_ksys_sync+0x10/0x10
<6>[55176.534804] ? ktime_get_coarse_real_ts64+0x100/0x110
<6>[55176.534812] ? syscall_trace_enter.constprop.0+0x153/0x1d0
<6>[55176.534819] __do_sys_sync+0xa/0x20
<6>[55176.534824] do_syscall_64+0x5c/0x90
<6>[55176.534830] ? rcu_is_watching+0x1f/0x50
<6>[55176.534836] ? trace_hardirqs_on_prepare+0xc9/0xe0
<6>[55176.534841] ? lockdep_hardirqs_on_prepare+0xc/0x60
<6>[55176.534847] ? do_syscall_64+0x69/0x90
<6>[55176.534852] ? rcu_is_watching+0x1f/0x50
<6>[55176.534858] ? trace_hardirqs_off+0x61/0xe0
<6>[55176.534862] ? rcu_is_watching+0x1f/0x50
<6>[55176.534868] ? trace_hardirqs_on_prepare+0xc9/0xe0
<6>[55176.534873] entry_SYSCALL_64_after_hwframe+0x72/0xdc
<6>[55176.534880] RIP: 0033:0x7fd0b5c3ed5b
<6>[55176.534885] RSP: 002b:00007ffc05a3fe38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
<6>[55176.534892] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd0b5c3ed5b
<6>[55176.534896] RDX: 000000000051b722 RSI: 00000000649cf749 RDI: 0000000000000001
<6>[55176.534900] RBP: 00007ffc05a40460 R08: 0000000000000000 R09: 00000000ffffffff
<6>[55176.534904] R10: 00007ffc05a3fe40 R11: 0000000000000246 R12: 00000000012599a8
<6>[55176.534907] R13: 00007ffc05a3ffc0 R14: 0000000001259970 R15: 0000000000000001
<6>[55176.534917] </TASK>
<4>[55176.534934] INFO: lockdep is turned off.
<3>[55299.414106] INFO: task ffsb:13911 blocked for more than 245 seconds.
<3>[55299.414117] Tainted: G W 6.4.0+ #8
<3>[55299.414122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[55299.414125] task:ffsb state:D stack:0 pid:13911 ppid:10993 flags:0x00004002
<6>[55299.414132] Call Trace:
<6>[55299.414136] <TASK>
<6>[55299.414141] __schedule+0x50e/0xde0
<6>[55299.414153] ? __pfx___schedule+0x10/0x10
<6>[55299.414158] ? rcu_is_watching+0x1f/0x50
<6>[55299.414165] ? lock_acquire+0xc7/0x3e0
<6>[55299.414172] ? __pfx_lock_acquire+0x10/0x10
<6>[55299.414179] ? do_raw_spin_trylock+0xd0/0x120
<6>[55299.414185] ? __pfx_do_raw_spin_trylock+0x10/0x10
<6>[55299.414191] ? rcu_is_watching+0x1f/0x50
<6>[55299.414198] schedule+0x8e/0x120
<6>[55299.414206] io_schedule+0x6d/0xa0
<6>[55299.414215] folio_wait_bit_common+0x243/0x590
<6>[55299.414233] ? __pfx_folio_wait_bit_common+0x10/0x10
<6>[55299.414242] ? filemap_get_folios_tag+0x4a1/0x5b0
<6>[55299.414257] ? __pfx_wake_page_function+0x10/0x10
<6>[55299.414269] ? rcu_is_watching+0x1f/0x50
<6>[55299.414278] ? lockdep_hardirqs_on_prepare+0xc/0x60
<6>[55299.414290] folio_wait_writeback+0x1b/0x110
<6>[55299.414300] __filemap_fdatawait_range+0xcc/0x170
<6>[55299.414310] ? __pfx___filemap_fdatawait_range+0x10/0x10
<6>[55299.414316] ? wait_sb_inodes+0xb9/0x520
<6>[55299.414322] ? rcu_is_watching+0x1f/0x50
<6>[55299.414328] ? lock_acquire+0xc7/0x3e0
<6>[55299.414337] ? __list_del_entry_valid+0x66/0x100
<6>[55299.414345] filemap_fdatawait_keep_errors+0x1a/0x30
<6>[55299.414352] wait_sb_inodes+0x2de/0x520
<6>[55299.414359] ? __pfx_wait_sb_inodes+0x10/0x10
<6>[55299.414368] ? __pfx_autoremove_wake_function+0x10/0x10
<6>[55299.414381] ? __up_write+0x184/0x2d0
<6>[55299.414391] sync_inodes_sb+0x14d/0x1a0
<6>[55299.414400] ? __pfx_sync_inodes_sb+0x10/0x10
<6>[55299.414408] ? rcu_is_watching+0x1f/0x50
<6>[55299.414418] ? __pfx_sync_inodes_one_sb+0x10/0x10
<6>[55299.414423] iterate_supers+0xbe/0x140
<6>[55299.414433] ksys_sync+0x84/0x110
<6>[55299.414442] ? __pfx_ksys_sync+0x10/0x10
<6>[55299.414450] ? ktime_get_coarse_real_ts64+0x100/0x110
<6>[55299.414462] ? syscall_trace_enter.constprop.0+0x153/0x1d0
<6>[55299.414472] __do_sys_sync+0xa/0x20
<6>[55299.414479] do_syscall_64+0x5c/0x90
<6>[55299.414484] ? rcu_is_watching+0x1f/0x50
<6>[55299.414490] ? trace_hardirqs_on_prepare+0xc9/0xe0
<6>[55299.414495] ? lockdep_hardirqs_on_prepare+0xc/0x60
<6>[55299.414501] ? do_syscall_64+0x69/0x90
<6>[55299.414506] ? rcu_is_watching+0x1f/0x50
<6>[55299.414511] ? trace_hardirqs_off+0x61/0xe0
<6>[55299.414516] ? rcu_is_watching+0x1f/0x50
<6>[55299.414521] ? trace_hardirqs_on_prepare+0xc9/0xe0
<6>[55299.414526] entry_SYSCALL_64_after_hwframe+0x72/0xdc
<6>[55299.414533] RIP: 0033:0x7fd0b5c3ed5b
<6>[55299.414538] RSP: 002b:00007ffc05a3fe38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
<6>[55299.414546] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd0b5c3ed5b
<6>[55299.414551] RDX: 000000000051b722 RSI: 00000000649cf749 RDI: 0000000000000001
<6>[55299.414554] RBP: 00007ffc05a40460 R08: 0000000000000000 R09: 00000000ffffffff
<6>[55299.414558] R10: 00007ffc05a3fe40 R11: 0000000000000246 R12: 00000000012599a8
<6>[55299.414597] R13: 00007ffc05a3ffc0 R14: 0000000001259970 R15: 0000000000000001
<6>[55299.414612] </TASK>
<4>[55299.414623] INFO: lockdep is turned off.