Bug #16186

kclient: drops requests without poking system calls on reconnect

Added by Greg Farnum about 3 years ago. Updated about 3 years ago.

Status: Duplicate
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 06/08/2016
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): kceph
Labels (FS):
Pull request ID:

Description

If I'm understanding the way things currently work:
*) kernel client loses network connection
*) MDS times out kernel client
*) kernel client reconnects, resets session
*) throws out outstanding MDS/OSD requests
*) does NOT do anything with the system calls those requests correspond to

Can we do something with the system calls by returning an error code? We see this on teuthology pretty frequently; sometimes they're writes, which will still break, but often they are reads, and simply terminating the process will let our CI as a whole move on with its life and avoid a node reboot.


Related issues

Related to Linux kernel client - Bug #15255: kclient: lost filesystem operations (on mds disconnect?) Resolved 03/23/2016

History

#1 Updated by Zheng Yan about 3 years ago

The client only drops unsafe MDS requests after a session reset. It also tries re-sending outstanding requests.

Please try using a kernel compiled from the wip-cephfs branch. I think it does not have the issues you mentioned.

#2 Updated by Greg Farnum about 3 years ago

  • Assignee changed from Zheng Yan to Sage Weil

I'm concerned to hear that; I thought those patches had been zapped from the queue. If we disconnect and reconnect, resending old requests and pretending they're new ones still violates our consistency guarantees. I'd naively much prefer that we just return an error code on all the system calls in progress. Sage, any thoughts?

#3 Updated by Greg Farnum about 3 years ago

  • Related to Bug #15255: kclient: lost filesystem operations (on mds disconnect?) added

#4 Updated by Sage Weil about 3 years ago

I think it is working the way it is supposed to work.

We skip unsafe requests because the mds already got them and there is no need to resend. I think the idea is that the fact that we were timed out implies that they were eventually persisted? Not sure here... Zheng?

The rest of our outstanding requests we resend/replay because either

1) the MDS already committed them before and it will just send us an ack, or
2) the MDS didn't see or commit them, so the operations restart from scratch. If they restart, we may get a different result than we might have had we not been disconnected, but that is irrelevant since it never happened that way.

Either way, the initiating syscalls just block until we get a reply from the mds (because it did it long ago, because it did it just now, or an error that it tried just now and it failed for whatever reason).

I'm not sure how this relates to syscalls not getting woken up...
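
As a rough illustration of the two replay cases described above, here is a toy Python model. Everything in it is hypothetical (the `ToyMDS` class, the `tid` keys, the string results) and it is not the real MDS or kclient code. It only shows that a resent request is either acked from an earlier commit or executed from scratch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMDS:
    """Hypothetical stand-in for an MDS; not the real Ceph implementation."""
    committed: dict = field(default_factory=dict)  # tid -> result of a committed op
    executions: int = 0  # how many times an op actually ran

    def handle(self, tid, op):
        # Case 1: already committed before the disconnect, so just ack it.
        if tid in self.committed:
            return self.committed[tid]
        # Case 2: never seen or committed, so run it from scratch now.
        self.executions += 1
        result = op()
        self.committed[tid] = result
        return result


def replay_after_reconnect(mds, outstanding):
    """Resend every outstanding request and collect the replies by tid."""
    return {tid: mds.handle(tid, op) for tid, op in outstanding.items()}


mds = ToyMDS()
mds.handle(1, lambda: "mkdir ok")  # committed before the disconnect
outstanding = {1: lambda: "mkdir ok", 2: lambda: "unlink ok"}
replies = replay_after_reconnect(mds, outstanding)
```

In this sketch the blocked caller gets a reply in both cases; request 1 is not executed a second time, while request 2 runs for the first time during replay.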

#5 Updated by Greg Farnum about 3 years ago

But if we restart requests from scratch, we're dramatically re-ordering them. We can seemingly send files back in time by flushing out setattr requests or flushing caps, for instance — maybe somebody else got locks and modified them, but now we send a (very outdated) request which changes them?

Similarly, if we're sending out OSD requests for file data.

But maybe I'm misunderstanding what the code actually does from the discussions here (I haven't actually looked at it).
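
The reordering concern can be shown with a toy Python model. The names (`ToyInode`, `apply_blind`, `apply_guarded`, `seen_version`) are made up for illustration, and the version check is only a contrast case, not something proposed in this ticket or present in Ceph as far as this thread says.

```python
class ToyInode:
    """Hypothetical inode holding one attribute plus a change counter."""

    def __init__(self):
        self.mode = 0o644
        self.version = 0

    def setattr(self, mode):
        self.mode = mode
        self.version += 1


def apply_blind(inode, request):
    # Replays the request as if it were new: no check against current state.
    inode.setattr(request["mode"])
    return True


def apply_guarded(inode, request):
    # Refuses a request that was issued against an older inode version.
    if request["seen_version"] != inode.version:
        return False
    inode.setattr(request["mode"])
    return True


def scenario(apply):
    inode = ToyInode()
    # Client A queues a setattr, then loses its connection.
    stale = {"mode": 0o600, "seen_version": inode.version}
    # Client B changes the file while A is disconnected.
    inode.setattr(0o444)
    # Client A reconnects and resends the old request.
    applied = apply(inode, stale)
    return applied, inode.mode
```

With `apply_blind` the stale request overwrites client B's newer change; with `apply_guarded` it is rejected and B's change survives.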

#6 Updated by Jeff Layton about 3 years ago

If the mds has torn down the client's session, then I don't see what can reasonably be done other than to return an error (-EIO or something) and wake up the waiting tasks. At that point you don't really have any way to know what the state of the outstanding requests was, do you?
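
A minimal userspace sketch of that idea in Python, assuming each blocked syscall waits on one pending request. `PendingRequest`, `fail_outstanding`, and `blocked_syscall` are hypothetical names; the kernel client would use its own wait queues and request structures rather than anything like this.

```python
import errno
import threading


class PendingRequest:
    """Hypothetical in-flight request that one blocked caller waits on."""

    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def complete(self, result):
        self.result = result
        self._done.set()  # wake the waiter

    def wait(self, timeout=None):
        # Returns the result, or None if the wait timed out.
        if not self._done.wait(timeout):
            return None
        return self.result


def fail_outstanding(pending, err=-errno.EIO):
    """On session teardown, complete every pending request with an error."""
    for req in pending:
        req.complete(err)


def blocked_syscall(req, out):
    # Stands in for a task sleeping inside a system call.
    out.append(req.wait(timeout=5))


pending = [PendingRequest() for _ in range(3)]
results = []
threads = [threading.Thread(target=blocked_syscall, args=(r, results))
           for r in pending]
for t in threads:
    t.start()
fail_outstanding(pending)  # stands in for "the MDS tore down our session"
for t in threads:
    t.join()
```

Every waiter returns with `-EIO` instead of sleeping forever, which is the behavior the ticket asks for.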

#7 Updated by Jeff Layton about 3 years ago

  • Assignee changed from Sage Weil to Jeff Layton

#8 Updated by Greg Farnum about 3 years ago

Well, if we have unsafe requests the MDS will in fact have committed them (assuming the MDS didn't crash or something prior to the data getting persisted to disk); we can try and identify that case. Not sure if it's worth it though, especially in a first pass.

#9 Updated by Jeff Layton about 3 years ago

I don't suppose we have a way to reproduce this, do we? Maybe drive a lot of MDS ops and continually stop and restart the MDS? Ahh...nm -- should be able to simulate this with a network partition...

#10 Updated by Jeff Layton about 3 years ago

Ok, I tried reproducing this by issuing a stat() while outbound traffic from the client was blocked (on a v4.7-rc4 kernel). I made sure to wait longer than the lease time before unblocking the traffic. The client did reestablish its session, and the stat call proceeded as expected.

I'll give a go to reproducing this with write() calls as well -- maybe the page I/O path is more susceptible to disconnection issues? It's also possible that some of Zheng's latest patches have had an effect here.

#11 Updated by Zheng Yan about 3 years ago

There is a 'ceph daemon mds.xxx session evict' command, which makes the MDS close a client session. (Use 'ceph daemon mds.xxx session ls' to list all client sessions.)

#12 Updated by Jeff Layton about 3 years ago

Ok, the mds session evict command definitely did the trick. Once I issued that (while running a fio test in another shell), the client quickly ground to a halt, and this output showed up in the ring buffer:


[17516.354395] libceph: client4123 fsid e405bf17-b326-446b-ad73-737a0104442b
[17516.355032] libceph: mon0 192.168.1.3:6789 session established
[17765.993341] libceph: mds0 192.168.1.3:6812 socket closed (con state OPEN)
[17766.249773] libceph: mds0 192.168.1.3:6812 connection reset
[17766.249808] libceph: reset on mds0
[17766.249809] ceph: mds0 closed our session
[17766.249810] ceph: mds0 reconnect start
[17766.250896] ceph: mds0 reconnect denied
[17766.250901] ceph:  dropping dirty Fw state for ffff8800a8480370 1099511627779
[17766.250902] ceph:  dropping dirty Fw state for ffff8800a8483c00 1099511627777
[17766.250903] ceph:  dropping dirty Fw state for ffff8800a8482560 1099511627778
[17766.250904] ceph:  dropping dirty Fw state for ffff8800a8486ee8 1099511627776
[17766.260100] ceph: __mark_dirty_caps ffff8800a8480370 10000000003 mask Fw, but no auth cap (session was closed?)
[17766.261105] ceph: __mark_dirty_caps ffff8800a8483c00 10000000001 mask Fw, but no auth cap (session was closed?)
[17766.261132] ceph: __mark_dirty_caps ffff8800a8482560 10000000002 mask Fw, but no auth cap (session was closed?)
[17766.261262] ceph: __mark_dirty_caps ffff8800a8486ee8 10000000000 mask Fw, but no auth cap (session was closed?)
[17766.264445] libceph: mds0 192.168.1.3:6812 socket closed (con state NEGOTIATING)

I'll start looking at how to improve that behavior.
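
When reading output like the above, it can help to pair each "dropping dirty" line with its later "__mark_dirty_caps" line. The Python sketch below does that by inode pointer. It assumes, based only on the values in this log, that the trailing number on the "dropping" line is the inode number in decimal and the second field on the "__mark_dirty_caps" line is the same inode number in hex; the regexes and the `correlate` helper are illustrative.

```python
import re

# Patterns written against the message shapes seen in the log above.
DROP_RE = re.compile(r"dropping dirty (\w+) state for ([0-9a-f]+) (\d+)")
MARK_RE = re.compile(r"__mark_dirty_caps ([0-9a-f]+) ([0-9a-f]+) mask (\w+)")


def correlate(log_text):
    """Map inode pointer -> (ino from 'dropping', ino from '__mark_dirty_caps')."""
    dropped = {ptr: int(ino) for _, ptr, ino in DROP_RE.findall(log_text)}
    marked = {ptr: int(ino, 16) for ptr, ino, _ in MARK_RE.findall(log_text)}
    return {ptr: (dropped[ptr], marked.get(ptr)) for ptr in dropped}


sample = """\
[17766.250901] ceph:  dropping dirty Fw state for ffff8800a8480370 1099511627779
[17766.260100] ceph: __mark_dirty_caps ffff8800a8480370 10000000003 mask Fw, but no auth cap (session was closed?)
"""
pairs = correlate(sample)
```

For the sample lines the two numbers agree (1099511627779 is 0x10000000003), which suggests the same inode had its dirty Fw state dropped and was then dirtied again after the session was closed.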

#13 Updated by Jeff Layton about 3 years ago

The fio threads at this point are all sitting in ceph_get_caps:


[jlayton@cephclnt ~]$ cat /proc/1169/stack
[<ffffffffa048b25d>] ceph_get_caps+0x25d/0x380 [ceph]
[<ffffffffa047c42c>] ceph_write_iter+0x2ec/0xc60 [ceph]
[<ffffffff81241c3b>] __vfs_write+0xcb/0x120
[<ffffffff812424b2>] vfs_write+0xa2/0x190
[<ffffffff812433e5>] SyS_write+0x55/0xc0
[<ffffffff817cecee>] entry_SYSCALL_64_fastpath+0x12/0x6d
[<ffffffffffffffff>] 0xffffffffffffffff


Which is:

(gdb) list *(ceph_get_caps+0x25d)
0x1928d is in ceph_get_caps (fs/ceph/caps.c:2489).
2484                if (err == -EAGAIN)
2485                    continue;
2486                if (err < 0)
2487                    return err;
2488            } else {
2489                ret = wait_event_interruptible(ci->i_cap_wq,
2490                        try_get_cap_refs(ci, need, want, endoff,
2491                                 true, &_got, &err));
2492                if (err == -EAGAIN)
2493                    continue;

#14 Updated by Jeff Layton about 3 years ago

Ahh, the reason I could reproduce this yesterday is because the client box was running a v4.5 kernel. With a v4.7-rc5 kernel, it doesn't get stuck in this situation. That said, I don't see any errors either, so I'm not 100% clear on whether that's safe. I think I need to check the behavior when doing these instructions to fully evict the client and make sure that it still works:

http://docs.ceph.com/docs/master/cephfs/eviction/

#15 Updated by Greg Farnum about 3 years ago

  • Component(FS) kceph added

#16 Updated by Greg Farnum about 3 years ago

  • Category changed from 53 to Correctness/Safety

#17 Updated by Jeff Layton about 3 years ago

I'm going to go ahead and close this out, and pursue the follow-up work in tracker #15255.

#18 Updated by Jeff Layton about 3 years ago

  • Status changed from New to Duplicate
