Ceph Linux kernel client - Bug #55823: NETDEV WATCHDOG: enp3s0f1 (ixgbe): transmit queue 3 timed out
https://tracker.ceph.com/issues/55823

Update by Xiubo Li <xiubli@redhat.com> on 2022-06-02T10:19:41Z (https://tracker.ceph.com/issues/55823?journal_id=217052)
<p>Another failure: <a class="external" href="https://pulpito.ceph.com/vshankar-2022-05-31_02:47:51-fs-wip-vshankar-fscrypt-20220530-091336-testing-default-smithi/6853742/">https://pulpito.ceph.com/vshankar-2022-05-31_02:47:51-fs-wip-vshankar-fscrypt-20220530-091336-testing-default-smithi/6853742/</a></p>

Update by Xiubo Li <xiubli@redhat.com> on 2022-06-02T10:20:09Z (https://tracker.ceph.com/issues/55823?journal_id=217053)
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul>

Update by Jeff Layton <jlayton@redhat.com> on 2022-06-02T21:07:20Z (https://tracker.ceph.com/issues/55823?journal_id=217150)
<p>2 of the 3 failures on this test run were due to this problem:</p>
<p><a class="external" href="https://pulpito.ceph.com/jlayton-2022-06-02_19:51:44-fs-wip-jlayton-qa-custom-kernel-wip-fscrypt-default-smithi/">https://pulpito.ceph.com/jlayton-2022-06-02_19:51:44-fs-wip-jlayton-qa-custom-kernel-wip-fscrypt-default-smithi/</a></p>
<p>The symptoms seem to be pretty consistent. First, the tx queue times out and the adapter is reset. Then we see socket errors on writes, and soon afterward we start seeing softlockups.</p>
<p>One hypothesis is that the adapter reset doesn't actually clear the problem: we can no longer transmit the cap flushes, so we end up spinning in flush_dirty_session_caps.</p>
<p>I think we probably need to engage the upstream ixgbe developers and see whether they've seen this issue and can offer any advice.</p>
<p>While this case seems to involve some sort of problem transmitting requests on the socket, flush_dirty_session_caps looks problematic, since it appears to involve an unbounded amount of work. Can we end up dirtying caps faster than we can clean them, and so never fully empty the list? I think we may need some way to bound what gets done in that function; one possibility is sketched below.</p>
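<p>For illustration only, here is a minimal sketch (hypothetical, not something the client does today) of one way to bound a single pass: splice the dirty list onto a private batch list up front, so inodes dirtied while we're flushing wait for the next pass instead of extending this one:</p>
<pre>
/*
 * Hypothetical sketch, not actual kernel code: bound one flush pass to
 * the entries that were dirty when the pass started. Inodes dirtied
 * while we are flushing land on the (now empty) s_cap_dirty list and
 * get handled by a later pass instead of extending this one.
 */
static void flush_dirty_session_caps_bounded(struct ceph_mds_session *s)
{
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_inode_info *ci;
	struct inode *inode;
	LIST_HEAD(batch);

	spin_lock(&mdsc->cap_dirty_lock);
	/* Detach everything that is dirty right now onto a private list. */
	list_splice_init(&s->s_cap_dirty, &batch);
	while (!list_empty(&batch)) {
		ci = list_first_entry(&batch, struct ceph_inode_info,
				      i_dirty_item);
		inode = &ci->vfs_inode;
		ihold(inode);
		spin_unlock(&mdsc->cap_dirty_lock);
		/* Expected to dequeue ci->i_dirty_item as a side effect. */
		ceph_check_caps(ci, CHECK_CAPS_FLUSH, NULL);
		iput(inode);
		spin_lock(&mdsc->cap_dirty_lock);
	}
	spin_unlock(&mdsc->cap_dirty_lock);
}
</pre>
<p>Note that this only bounds the growing-list case. It doesn't help if ceph_check_caps() fails to dequeue an entry at all, which turns out to be the other failure mode discussed below.</p>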
Update by Xiubo Li <xiubli@redhat.com> on 2022-06-06T05:51:37Z (https://tracker.ceph.com/issues/55823?journal_id=217225)

<p>Jeff Layton wrote:</p>
<blockquote>
<p>[...] flush_dirty_session_caps looks problematic, since it appears to involve an unbounded amount of work. Can we end up dirtying caps faster than we can clean them, and so never fully empty the list? I think we may need some way to bound what gets done in that function.</p>
</blockquote>
<p>I can't see where the problem is in <strong><em>flush_dirty_session_caps()</em></strong>:</p>
<pre>
static void flush_dirty_session_caps(struct ceph_mds_session *s)
{
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_inode_info *ci;
	struct inode *inode;

	dout("flush_dirty_caps\n");
	spin_lock(&mdsc->cap_dirty_lock);
	while (!list_empty(&s->s_cap_dirty)) {
		ci = list_first_entry(&s->s_cap_dirty, struct ceph_inode_info,
				      i_dirty_item);
		inode = &ci->vfs_inode;
		ihold(inode);
		dout("flush_dirty_caps %llx.%llx\n", ceph_vinop(inode));
		spin_unlock(&mdsc->cap_dirty_lock);
		ceph_check_caps(ci, CHECK_CAPS_FLUSH, NULL);
		iput(inode);
		spin_lock(&mdsc->cap_dirty_lock);
	}
	spin_unlock(&mdsc->cap_dirty_lock);
	dout("flush_dirty_caps done\n");
}
</pre>
<p>From the code, <strong><em>mdsc->cap_dirty_lock</em></strong> is released before each inode is processed. And each inode shouldn't have too many caps, so flushing one shouldn't take long, unless something is buggy and keeps retrying in <strong><em>ceph_check_caps()</em></strong> while holding <strong><em>ci->i_ceph_lock</em></strong>?</p>

Update by Jeff Layton <jlayton@redhat.com> on 2022-06-06T14:36:32Z (https://tracker.ceph.com/issues/55823?journal_id=217273)
<p>That function takes the first entry of the s_cap_dirty list and then calls ceph_check_caps on it. It's implicitly expecting that the entry will be dequeued from s_cap_dirty, but I don't see where that's guaranteed to occur. There are a lot of special cases in that function that can result in no cap message being sent and the inode never being marked as flushing.</p>
<p>For instance, we have this in ceph_check_caps:</p>
<pre>
	spin_lock(&ci->i_ceph_lock);
	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
		/* Don't send messages until we get async create reply */
		spin_unlock(&ci->i_ceph_lock);
		ceph_put_mds_session(session);
		return;
	}
</pre>
<p>If an async create reply gets delayed for a long time (as we might see when there are network issues), then this could cause flush_dirty_session_caps to loop continually over the same inode. There are other potential cases that could leave an entry on s_cap_dirty as well.</p>
<p>What might be best at this point is to add some extra logging in ceph_check_caps for the cases where CHECK_CAPS_FLUSH is set and it returns without dequeueing the cap. That may help us confirm the problem we're seeing (and may eventually suggest a fix); a rough sketch is below.</p>
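<p>A rough sketch of that instrumentation (hypothetical, shown only for the CEPH_I_ASYNC_CREATE early return above; the other early returns in ceph_check_caps would get the same treatment):</p>
<pre>
	spin_lock(&ci->i_ceph_lock);
	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
		/* Hypothetical instrumentation: note when a requested
		 * flush bails out without dequeueing the dirty cap. */
		if (flags & CHECK_CAPS_FLUSH)
			pr_warn_ratelimited("ceph: %llx.%llx flush skipped: waiting on async create\n",
					    ceph_vinop(inode));
		/* Don't send messages until we get async create reply */
		spin_unlock(&ci->i_ceph_lock);
		ceph_put_mds_session(session);
		return;
	}
</pre>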
<p>A longer-term solution probably involves a redesign of how caps are managed in the client. The current code for this is completely ad hoc and rife with weird emergent behaviors.</p>

Update by Jeff Layton <jlayton@redhat.com> on 2022-06-06T19:22:43Z (https://tracker.ceph.com/issues/55823?journal_id=217301)
<p>I'm testing this patch now, which I think might be at least part of the solution here. If we're spinning because we're waiting on async create replies, then this prevents that:</p>
<pre>
From 445e9c77524879060120b53c84b129da34405762 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@kernel.org>
Date: Mon, 6 Jun 2022 11:07:59 -0400
Subject: [PATCH] ceph: wait on async create before checking caps

Currently, we'll call ceph_check_caps, but if we're still waiting on the
reply, we'll end up spinning around on the same inode. Wait for the
async create reply before we try to flush caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 0a48bf829671..5ecfff4b37c9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -4389,6 +4389,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
 		ihold(inode);
 		dout("flush_dirty_caps %llx.%llx\n", ceph_vinop(inode));
 		spin_unlock(&mdsc->cap_dirty_lock);
+		ceph_wait_on_async_create(inode);
 		ceph_check_caps(ci, CHECK_CAPS_FLUSH, NULL);
 		iput(inode);
 		spin_lock(&mdsc->cap_dirty_lock);
--
2.36.1
</pre>
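<p>For context, ceph_wait_on_async_create() is an existing inline helper in fs/ceph/super.h; it looks roughly like this (paraphrased, so treat the details as approximate):</p>
<pre>
/* Roughly the fs/ceph/super.h helper: sleep (killably) until the
 * async create reply clears CEPH_ASYNC_CREATE_BIT in i_ceph_flags. */
static inline int ceph_wait_on_async_create(struct inode *inode)
{
	struct ceph_inode_info *ci = ceph_inode(inode);

	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
			   TASK_KILLABLE);
}
</pre>
<p>Assuming that shape, the one-liner turns the busy loop into a sleep until the create reply arrives (or the task is killed).</p>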
<p>I'll post it upstream if it looks ok. My first run with it ran into teuthology infrastructure issues (a lot of dead jobs).</p>