CephFS - Bug #54106: kclient: hang during workunit cleanup
https://tracker.ceph.com/issues/54106

Updated by Patrick Donnelly (pdonnell@redhat.com) on 2022-02-01T18:34:38Z:
- Related to Bug #54107: kclient: hang during umount (added)

Updated by Patrick Donnelly (pdonnell@redhat.com) on 2022-02-01T18:34:51Z:
- Related to Bug #44100: cephfs rsync kworker high load. (added)

Updated by Patrick Donnelly (pdonnell@redhat.com) on 2022-02-01T18:35:02Z:
May be related to #44100, per Jeff.

Updated by Jeff Layton (jlayton@redhat.com) on 2022-02-04T20:25:31Z:
First, note that this is a RHEL 8.4 kernel (4.18.0-305.el8), which is about a year old. Some of this code has seen significant changes since then.
So, about 7 minutes passed between the mount and when the warning popped:
<pre>
2022-01-29T02:25:02.949137+00:00 smithi084 kernel: libceph: mon1 (2)172.21.15.112:3300 session established
2022-01-29T02:25:02.949852+00:00 smithi084 kernel: libceph: client15195 fsid 0e36bce6-80a9-11ec-8c35-001a4aab830c
2022-01-29T02:32:50.009527+00:00 smithi084 kernel: WARNING: CPU: 2 PID: 92503 at fs/ceph/caps.c:759 ceph_add_cap+0x638/0x650 [ceph]
</pre>
The warning comes from this check in ceph_add_cap:
<pre>
	if (flags & CEPH_CAP_FLAG_AUTH) {
		...
	} else {
		WARN_ON(ci->i_auth_cap == cap);
	}
</pre>
So, while processing a trace, we got issued a cap -- probably one we already had. The new cap didn't have the AUTH flag set, but the ci->i_auth_cap pointer still pointed at it, so that MDS must have been the auth MDS at one time.
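For reference, here is a simplified sketch of how that branch maintains ci->i_auth_cap, modeled on the upstream ceph_add_cap() (the exact RHEL 8.4 code may differ; CEPH_CAP_FLAG_AUTH, ceph_seq_cmp() and the field names are real, the trimming is mine):
<pre>
	/* Sketch, not verbatim kernel code: ci->i_auth_cap is only ever
	 * assigned from the AUTH branch, so a (re)issue of the same cap
	 * without CEPH_CAP_FLAG_AUTH trips the WARN_ON below. */
	if (flags & CEPH_CAP_FLAG_AUTH) {
		/* take (or keep) this cap as the auth cap if its migration
		 * sequence is at least as new as the current auth cap's */
		if (!ci->i_auth_cap ||
		    ceph_seq_cmp(ci->i_auth_cap->mseq, mseq) < 0)
			ci->i_auth_cap = cap;
	} else {
		/* non-auth grant for the cap we still think is auth:
		 * the auth MDS changed without the pointer being cleared */
		WARN_ON(ci->i_auth_cap == cap);
	}
</pre>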
There's this in the log too, which is a similar issue regarding the auth flag changing:
<pre>
2022-01-29T02:32:50.961980+00:00 smithi084 kernel: WARNING: CPU: 6 PID: 67864 at fs/ceph/caps.c:3315 handle_cap_grant+0xb9c/0xbb0 [ceph]
</pre>
After that there are a bunch of hung task warnings, mostly from tasks stuck waiting on the snap_rwsem:
<pre>
2022-01-29T02:36:24.325780+00:00 smithi084 kernel: INFO: task kworker/1:5:33032 blocked for more than 120 seconds.
2022-01-29T02:36:24.325882+00:00 smithi084 kernel: Tainted: G W --------- - - 4.18.0-305.el8.x86_64 #1
2022-01-29T02:36:24.325912+00:00 smithi084 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-29T02:36:24.333977+00:00 smithi084 kernel: kworker/1:5 D 0 33032 2 0x80004080
2022-01-29T02:36:24.339831+00:00 smithi084 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
2022-01-29T02:36:24.345703+00:00 smithi084 kernel: Call Trace:
2022-01-29T02:36:24.348537+00:00 smithi084 kernel: __schedule+0x2c4/0x700
2022-01-29T02:36:24.352404+00:00 smithi084 kernel: schedule+0x38/0xa0
2022-01-29T02:36:24.355890+00:00 smithi084 kernel: rwsem_down_read_slowpath+0x360/0x3d0
2022-01-29T02:36:24.361008+00:00 smithi084 kernel: ceph_check_caps+0x38a/0xa60 [ceph]
2022-01-29T02:36:24.365886+00:00 smithi084 kernel: ? ktime_get_with_offset+0x4d/0xc0
2022-01-29T02:36:24.370679+00:00 smithi084 kernel: ? sg_init_table+0x11/0x30
2022-01-29T02:36:24.374768+00:00 smithi084 kernel: ? scatterwalk_map_and_copy+0x4e/0x70
2022-01-29T02:36:24.379798+00:00 smithi084 kernel: ? __cap_is_valid+0x1c/0xa0 [ceph]
2022-01-29T02:36:24.384578+00:00 smithi084 kernel: ? __ceph_caps_issued+0x6f/0xd0 [ceph]
2022-01-29T02:36:24.389711+00:00 smithi084 kernel: handle_cap_grant+0x977/0xbb0 [ceph]
2022-01-29T02:36:24.394670+00:00 smithi084 kernel: ceph_handle_caps+0xce4/0x1820 [ceph]
2022-01-29T02:36:24.399708+00:00 smithi084 kernel: mds_dispatch+0x136/0xbe0 [ceph]
2022-01-29T02:36:24.404306+00:00 smithi084 kernel: ? setup_message_sgs+0x208/0x360 [libceph]
2022-01-29T02:36:24.409775+00:00 smithi084 kernel: ? sg_miter_start+0x60/0x60
2022-01-29T02:36:24.413910+00:00 smithi084 kernel: ceph_con_process_message+0x79/0x140 [libceph]
2022-01-29T02:36:24.419719+00:00 smithi084 kernel: ? sg_miter_start+0x60/0x60
2022-01-29T02:36:24.423900+00:00 smithi084 kernel: process_message+0xe/0x60 [libceph]
2022-01-29T02:36:24.428755+00:00 smithi084 kernel: ceph_con_v2_try_read+0xc3f/0x1010 [libceph]
2022-01-29T02:36:24.434400+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70
2022-01-29T02:36:24.438657+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70
2022-01-29T02:36:24.442894+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70
2022-01-29T02:36:24.447128+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70
2022-01-29T02:36:24.451305+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70
2022-01-29T02:36:24.455550+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70
2022-01-29T02:36:24.459746+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70
2022-01-29T02:36:24.463939+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70
2022-01-29T02:36:24.468137+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70
2022-01-29T02:36:24.472350+00:00 smithi084 kernel: ceph_con_workfn+0x18f/0x690 [libceph]
2022-01-29T02:36:24.477440+00:00 smithi084 kernel: process_one_work+0x1a7/0x360
2022-01-29T02:36:24.481737+00:00 smithi084 kernel: worker_thread+0x30/0x390
2022-01-29T02:36:24.485684+00:00 smithi084 kernel: ? create_worker+0x1a0/0x1a0
2022-01-29T02:36:24.489895+00:00 smithi084 kernel: kthread+0x116/0x130
2022-01-29T02:36:24.493449+00:00 smithi084 kernel: ? kthread_flush_work_fn+0x10/0x10
2022-01-29T02:36:24.498187+00:00 smithi084 kernel: ret_from_fork+0x35/0x40
</pre>
Note that these are rwsem read attempts (rwsem_down_read_slowpath in the trace), which implies that something was holding the semaphore for write and not releasing it.
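To make the blocking relationship explicit, here's a minimal, hypothetical sketch using the real kernel rwsem API (snap_rwsem lives in struct ceph_mds_client; the two functions below are illustrative, not actual kclient code):
<pre>
#include &lt;linux/rwsem.h&gt;

static DECLARE_RWSEM(example_snap_rwsem);	/* stand-in for mdsc->snap_rwsem */

/* Readers, like the cap-handling paths in the trace above, go to
 * sleep in rwsem_down_read_slowpath() while a writer holds the sem. */
static void reader_path(void)
{
	down_read(&example_snap_rwsem);
	/* ... look at snap realm state ... */
	up_read(&example_snap_rwsem);
}

/* If a writer stalls between down_write() and up_write() -- e.g. it
 * is itself stuck on another lock -- every reader queues up behind
 * it, producing the pile-up of D-state kworkers seen in the log. */
static void writer_path(void)
{
	down_write(&example_snap_rwsem);
	/* ... update snap realms ... */
	up_write(&example_snap_rwsem);
}
</pre>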
I'll also note that this is in there too:
<pre>
2022-01-29T02:37:48.218123+00:00 smithi084 kernel: ceph: mds0 hung
</pre>
...but that state gets set under the session->s_mutex, so it's likely just fallout from all of the hung tasks.
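For context, a rough sketch of where that message comes from (modeled on the periodic session check in fs/ceph/mds_client.c; the exact shape in this kernel may differ):
<pre>
	/* Sketch of the hung-session check run from the mdsc delayed work,
	 * with the session mutex held. If the s_mutex holders are themselves
	 * blocked, this transition is a downstream symptom, not the cause. */
	if (s->s_state == CEPH_MDS_SESSION_OPEN &&
	    s->s_ttl && time_after(jiffies, s->s_ttl)) {
		s->s_state = CEPH_MDS_SESSION_HUNG;
		pr_info("mds%d hung\n", s->s_mds);
	}
</pre>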
Beyond that, I can't tell much. I don't suppose we have a vmcore from this, do we?

Updated by Jeff Layton (jlayton@redhat.com) on 2022-06-28T16:56:22Z:
- Assignee changed from Jeff Layton to Xiubo Li

Updated by Xiubo Li (xiubli@redhat.com) on 2022-07-04T12:39:18Z:
- Status changed from New to Duplicate
- Parent task set to #55857

This is a duplicate of https://tracker.ceph.com/issues/55857.