Project

General

Profile

Actions

Bug #61591

closed

[RHEL stock] kclient appears to wrongly believe it has been blocklisted

Added by Patrick Donnelly 12 months ago. Updated 12 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2023-05-28T19:57:05.892 DEBUG:teuthology.orchestra.run.smithi062:> (cd /home/ubuntu/cephtest/mnt.0 && exec bash -c 'sudo dd if=/sys/kernel/debug/ceph/7e93fa48-fd90-11ed-9b1d-001a4aab830c.client15495/status')
2023-05-28T19:57:05.973 INFO:teuthology.orchestra.run.smithi062.stdout:instance: client.15495 (3)192.168.0.1:0/3891638898
2023-05-28T19:57:05.973 INFO:teuthology.orchestra.run.smithi062.stdout:blocklisted: false
...
2023-05-28T20:16:04.912 DEBUG:teuthology.orchestra.run.smithi062:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd blocklist ls
...
2023-05-28T20:16:05.653 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/3903573374 2023-05-29T19:57:15.421701+0000
2023-05-28T20:16:05.653 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/3810945376 2023-05-29T19:57:15.421701+0000
2023-05-28T20:16:05.653 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/532471957 2023-05-29T19:57:15.421701+0000
2023-05-28T20:16:05.654 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/3996750006 2023-05-29T19:49:17.588828+0000
2023-05-28T20:16:05.654 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:6800/1518238550 2023-05-29T19:49:17.588828+0000
2023-05-28T20:16:05.654 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/325104692 2023-05-29T19:49:17.588828+0000
2023-05-28T20:16:05.655 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:6800/43495430 2023-05-29T19:48:51.176696+0000
2023-05-28T20:16:05.655 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:6800/2478711060 2023-05-29T19:57:15.421701+0000
2023-05-28T20:16:05.655 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/4180666335 2023-05-29T19:48:51.176696+0000
2023-05-28T20:16:05.656 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:6801/2478711060 2023-05-29T19:57:15.421701+0000
2023-05-28T20:16:05.656 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/3239137281 2023-05-29T19:48:51.176696+0000
2023-05-28T20:16:05.656 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:6801/43495430 2023-05-29T19:48:51.176696+0000
2023-05-28T20:16:05.656 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/3171002779 2023-05-29T19:57:15.421701+0000
2023-05-28T20:16:05.657 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:6801/1518238550 2023-05-29T19:49:17.588828+0000
2023-05-28T20:16:05.657 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/781602504 2023-05-29T19:48:51.176696+0000
2023-05-28T20:16:05.657 INFO:teuthology.orchestra.run.smithi062.stdout:172.21.15.62:0/4265251340 2023-05-29T19:49:17.588828+0000
...
2023-05-28T20:16:05.676 INFO:teuthology.orchestra.run.smithi062.stderr:listed 16 entries
2023-05-28T20:16:05.700 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2023-05-28T20:16:05.701 INFO:teuthology.orchestra.run:Running command with timeout 300
2023-05-28T20:16:05.701 DEBUG:teuthology.orchestra.run.smithi062:> sudo umount /home/ubuntu/cephtest/mnt.0
...
2023-05-28T20:21:05.754 ERROR:teuthology:Uncaught exception (Hub)
...
2023-05-28T20:21:09.947 INFO:journalctl@ceph.mon.a.smithi062.stdout:May 28 20:21:09 smithi062 ceph-mon[94025]: evicting unresponsive client smithi062:0 (15495), after 361.407 seconds
2023-05-28T20:21:09.947 INFO:journalctl@ceph.mon.a.smithi062.stdout:May 28 20:21:09 smithi062 ceph-mon[94025]: Evicting (and blocklisting) client session 15495 (v1:192.168.0.1:0/3891638898)
2023-05-28T20:21:09.947 INFO:journalctl@ceph.mon.a.smithi062.stdout:May 28 20:21:09 smithi062 ceph-mon[94025]: from='mds.? [v2:172.21.15.176:6838/3335453750,v1:172.21.15.176:6839/3335453750]' entity='mds.i' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]: dispatch
2023-05-28T20:21:09.947 INFO:journalctl@ceph.mon.a.smithi062.stdout:May 28 20:21:09 smithi062 ceph-mon[94025]: from='mds.? ' entity='mds.i' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]: dispatch
2023-05-28T20:21:10.036 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:09 smithi157 ceph-mon[95431]: pgmap v764: 129 pgs: 129 active+clean; 1.6 GiB data, 16 GiB used, 1.0 TiB / 1.0 TiB avail; 16 KiB/s rd, 53 KiB/s wr, 27 op/s
2023-05-28T20:21:10.037 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:09 smithi157 ceph-mon[95431]: evicting unresponsive client smithi062:0 (15495), after 361.407 seconds
2023-05-28T20:21:10.037 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:09 smithi157 ceph-mon[95431]: Evicting (and blocklisting) client session 15495 (v1:192.168.0.1:0/3891638898)
2023-05-28T20:21:10.037 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:09 smithi157 ceph-mon[95431]: from='mds.? [v2:172.21.15.176:6838/3335453750,v1:172.21.15.176:6839/3335453750]' entity='mds.i' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]: dispatch
2023-05-28T20:21:10.038 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:09 smithi157 ceph-mon[95431]: from='mds.? ' entity='mds.i' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]: dispatch
2023-05-28T20:21:10.117 INFO:journalctl@ceph.mon.c.smithi176.stdout:May 28 20:21:09 smithi176 ceph-mon[95827]: pgmap v764: 129 pgs: 129 active+clean; 1.6 GiB data, 16 GiB used, 1.0 TiB / 1.0 TiB avail; 16 KiB/s rd, 53 KiB/s wr, 27 op/s
2023-05-28T20:21:10.117 INFO:journalctl@ceph.mon.c.smithi176.stdout:May 28 20:21:09 smithi176 ceph-mon[95827]: evicting unresponsive client smithi062:0 (15495), after 361.407 seconds
2023-05-28T20:21:10.117 INFO:journalctl@ceph.mon.c.smithi176.stdout:May 28 20:21:09 smithi176 ceph-mon[95827]: Evicting (and blocklisting) client session 15495 (v1:192.168.0.1:0/3891638898)
2023-05-28T20:21:10.118 INFO:journalctl@ceph.mon.c.smithi176.stdout:May 28 20:21:09 smithi176 ceph-mon[95827]: from='mds.? [v2:172.21.15.176:6838/3335453750,v1:172.21.15.176:6839/3335453750]' entity='mds.i' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]: dispatch
2023-05-28T20:21:10.118 INFO:journalctl@ceph.mon.c.smithi176.stdout:May 28 20:21:09 smithi176 ceph-mon[95827]: from='mds.? ' entity='mds.i' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]: dispatch
2023-05-28T20:21:10.946 INFO:journalctl@ceph.mon.a.smithi062.stdout:May 28 20:21:10 smithi062 ceph-mon[94025]: from='mds.? ' entity='mds.i' cmd='[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]': finished
2023-05-28T20:21:10.947 INFO:journalctl@ceph.mon.a.smithi062.stdout:May 28 20:21:10 smithi062 ceph-mon[94025]: osdmap e135: 12 total, 12 up, 12 in
2023-05-28T20:21:11.036 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:10 smithi157 ceph-mon[95431]: from='mds.? ' entity='mds.i' cmd='[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]': finished
2023-05-28T20:21:11.036 INFO:journalctl@ceph.mon.b.smithi157.stdout:May 28 20:21:10 smithi157 ceph-mon[95431]: osdmap e135: 12 total, 12 up, 12 in
2023-05-28T20:21:11.117 INFO:journalctl@ceph.mon.c.smithi176.stdout:May 28 20:21:10 smithi176 ceph-mon[95827]: from='mds.? ' entity='mds.i' cmd='[{"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/3891638898"}]': finished

From: /teuthology/yuriw-2023-05-28_14:46:14-fs-reef-release-distro-default-smithi/7288964/teuthology.log

2023-05-28T19:57:05.499690+00:00 smithi062 kernel: ceph: loaded (mds proto 32)
2023-05-28T19:57:05.516681+00:00 smithi062 kernel: ceph: device name is missing path (no : separator in 0@7e93fa48-fd90-11ed-9b1d-001a4aab830c.cephfs=/volumes/_nogroup/sv_1/73b32d9a-d1cb-4d38-ac72-222cea3557ab)
2023-05-28T19:57:05.527684+00:00 smithi062 kernel: libceph: mon0 (1)172.21.15.62:6789 session established
2023-05-28T19:57:05.527774+00:00 smithi062 kernel: libceph: client15495 fsid 7e93fa48-fd90-11ed-9b1d-001a4aab830c
2023-05-28T19:57:05.543697+00:00 smithi062 kernel: ceph: mds3 session blocklisted
2023-05-28T19:57:05.561687+00:00 smithi062 kernel: ceph: mds0 session blocklisted
2023-05-28T19:57:17.951686+00:00 smithi062 kernel: ceph: mds1 session blocklisted
2023-05-28T19:57:17.975685+00:00 smithi062 kernel: ceph: mds4 session blocklisted
2023-05-28T19:58:01.123723+00:00 smithi062 kernel: ceph: mds2 session blocklisted
2023-05-28T20:19:09.319869+00:00 smithi062 kernel: INFO: task umount:154689 blocked for more than 120 seconds.
2023-05-28T20:19:09.320011+00:00 smithi062 kernel:      Not tainted 4.18.0-372.9.1.el8.x86_64 #1
2023-05-28T20:19:09.320054+00:00 smithi062 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-28T20:19:09.336919+00:00 smithi062 kernel: task:umount          state:D stack:    0 pid:154689 ppid:154669 flags:0x00004080
2023-05-28T20:19:09.336980+00:00 smithi062 kernel: Call Trace:
2023-05-28T20:19:09.343691+00:00 smithi062 kernel: __schedule+0x2d1/0x830
2023-05-28T20:19:09.343742+00:00 smithi062 kernel: schedule+0x35/0xa0
2023-05-28T20:19:09.352093+00:00 smithi062 kernel: ceph_mdsc_sync+0x2f1/0x350 [ceph]
2023-05-28T20:19:09.352144+00:00 smithi062 kernel: ? finish_wait+0x80/0x80
2023-05-28T20:19:09.356071+00:00 smithi062 kernel: ceph_sync_fs+0x2f/0xb0 [ceph]
2023-05-28T20:19:09.364710+00:00 smithi062 kernel: sync_filesystem+0x71/0x90
2023-05-28T20:19:09.364761+00:00 smithi062 kernel: generic_shutdown_super+0x22/0x100
2023-05-28T20:19:09.369549+00:00 smithi062 kernel: kill_anon_super+0x14/0x30
2023-05-28T20:19:09.373670+00:00 smithi062 kernel: ceph_kill_sb+0x39/0x70 [ceph]
2023-05-28T20:19:09.382918+00:00 smithi062 kernel: deactivate_locked_super+0x34/0x70
2023-05-28T20:19:09.382971+00:00 smithi062 kernel: cleanup_mnt+0x3b/0x70
2023-05-28T20:19:09.390667+00:00 smithi062 kernel: task_work_run+0x8a/0xb0
2023-05-28T20:19:09.390714+00:00 smithi062 kernel: exit_to_usermode_loop+0xeb/0xf0
2023-05-28T20:19:09.395294+00:00 smithi062 kernel: do_syscall_64+0x198/0x1a0
2023-05-28T20:19:09.399409+00:00 smithi062 kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
2023-05-28T20:19:09.408761+00:00 smithi062 kernel: RIP: 0033:0x7f82e4991dab
2023-05-28T20:19:09.408807+00:00 smithi062 kernel: Code: Unable to access opcode bytes at RIP 0x7f82e4991d81.
2023-05-28T20:19:09.415675+00:00 smithi062 kernel: RSP: 002b:00007ffe0ada7a78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
2023-05-28T20:19:09.423626+00:00 smithi062 kernel: RAX: 0000000000000000 RBX: 00005603ed4ce5d0 RCX: 00007f82e4991dab
2023-05-28T20:19:09.431142+00:00 smithi062 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005603ed4ce7b0
2023-05-28T20:19:09.438654+00:00 smithi062 kernel: RBP: 0000000000000000 R08: 00005603ed4ce7e0 R09: 00007f82e4ad8820
2023-05-28T20:19:09.446163+00:00 smithi062 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005603ed4ce7b0
2023-05-28T20:19:09.453681+00:00 smithi062 kernel: R13: 00007f82e5802184 R14: 0000000000000000 R15: 00000000ffffffff
2023-05-28T20:21:09.680697+00:00 smithi062 kernel: libceph: mds1 (1)172.21.15.176:6839 socket closed (con state OPEN)
2023-05-28T20:21:09.680835+00:00 smithi062 kernel: libceph: mds0 (1)172.21.15.157:6839 socket closed (con state OPEN)
2023-05-28T20:21:09.680871+00:00 smithi062 kernel: libceph: mds3 (1)172.21.15.176:6837 socket closed (con state OPEN)
2023-05-28T20:21:09.802654+00:00 smithi062 kernel: libceph: mds2 (1)172.21.15.157:6837 socket closed (con state OPEN)
2023-05-28T20:21:09.871354+00:00 smithi062 kernel: libceph: mds4 (1)172.21.15.176:6835 socket closed (con state OPEN)

From: /teuthology/yuriw-2023-05-28_14:46:14-fs-reef-release-distro-default-smithi/7288964/remote/smithi062/syslog/kern.log.gz

It was not actually blocklisted at the time ~19:57. This happened much later after umount timed out (~5 seconds) and the MDS noted the mount was unresponsive for 5 minutes.

First signs of trouble are:

./remote/smithi157/log/7e93fa48-fd90-11ed-9b1d-001a4aab830c/ceph-mds.b.log-20230529.gz:2023-05-28T19:57:05.826+0000 7faf32c58700 20 mds.0.server laggiest active session is client.15495 v1:192.168.0.1:0/3891638898 and renewed caps recently (0.278995s ago)
Actions

Also available in: Atom PDF