Project

General

Profile

Bug #15255

kclient: lost filesystem operations (on mds disconnect?)

Added by Greg Farnum over 2 years ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
03/23/2016
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

We saw this on our sepia teuthology kernel client mount. I saved the dmesg output at teuthology:~ubuntu/322_kernel_hang.log, but we don't have MDS logging turned on so there was nothing useful there. :(
Looking at the debugfs ceph output, there were no outstanding mdsc requests. The MDS didn't have any either. The MDS was not aware of any issues. But access to many files (not all of them!) was hanging. New file creation worked. We eventually had to reboot the box; a lazy reboot did not repair all issues.

Judging by the caps going stale and then a swarm of OSD disconnects, it looks like a timeout occurred (...for unknown reasons) and then the client got blacklisted (starting at Tue Mar 22 08:00:10 2016). Then we start getting backtraces

[Tue Mar 22 08:02:38 2016] INFO: task teuthology:27943 blocked for more than 120 seconds.
[Tue Mar 22 08:02:38 2016]       Tainted: G        W   E   4.2.0-rc8-ceph-00011-g4a18ede #1
[Tue Mar 22 08:02:38 2016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Mar 22 08:02:38 2016] teuthology      D ffff88036d78cdd8     0 27943   2902 0x00000000
[Tue Mar 22 08:02:38 2016]  ffff8802c8bbf938 0000000000000092 ffff8806a3f817e8 0000000000000042
[Tue Mar 22 08:02:38 2016]  ffffffff81c13580 ffff88036d78c880 ffff88081b68d098 ffff8802c8bc0000
[Tue Mar 22 08:02:38 2016]  7fffffffffffffff 7fffffffffffffff 0000000000000002 ffffffff817bd3d0
[Tue Mar 22 08:02:38 2016] Call Trace:
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bd3d0>] ? bit_wait_io_timeout+0x60/0x60
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bcc6d>] schedule+0x6d/0x90
[Tue Mar 22 08:02:38 2016]  [<ffffffff817c0a6f>] schedule_timeout+0x3f/0x350
[Tue Mar 22 08:02:38 2016]  [<ffffffff81025139>] ? read_tsc+0x9/0x10
[Tue Mar 22 08:02:38 2016]  [<ffffffff81106764>] ? ktime_get+0xc4/0x110
[Tue Mar 22 08:02:38 2016]  [<ffffffff81152a13>] ? __delayacct_blkio_start+0x23/0x30
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bd3d0>] ? bit_wait_io_timeout+0x60/0x60
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bc0e4>] io_schedule_timeout+0xa4/0x110
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bd418>] bit_wait_io+0x48/0x50
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bcf8b>] __wait_on_bit_lock+0x4b/0xa0
[Tue Mar 22 08:02:38 2016]  [<ffffffff8119ed3e>] __lock_page+0xae/0xb0
[Tue Mar 22 08:02:38 2016]  [<ffffffff810ca690>] ? wake_atomic_t_function+0x30/0x30
[Tue Mar 22 08:02:38 2016]  [<ffffffff811a062a>] pagecache_get_page+0x9a/0x1e0
[Tue Mar 22 08:02:38 2016]  [<ffffffff811a079e>] grab_cache_page_write_begin+0x2e/0x50
[Tue Mar 22 08:02:38 2016]  [<ffffffffc046dc43>] ceph_write_begin+0x43/0x100 [ceph]
[Tue Mar 22 08:02:38 2016]  [<ffffffff8119e92f>] generic_perform_write+0xaf/0x1c0
[Tue Mar 22 08:02:38 2016]  [<ffffffffc0467f32>] ceph_write_iter+0x9d2/0xc80 [ceph]
[Tue Mar 22 08:02:38 2016]  [<ffffffff81231025>] ? putname+0x55/0x60
[Tue Mar 22 08:02:38 2016]  [<ffffffff812220e6>] ? vfs_write+0xb6/0x190
[Tue Mar 22 08:02:38 2016]  [<ffffffff81221256>] __vfs_write+0xa6/0xd0
[Tue Mar 22 08:02:38 2016]  [<ffffffff812220f7>] vfs_write+0xc7/0x190
[Tue Mar 22 08:02:38 2016]  [<ffffffff81222402>] SyS_write+0x52/0xb0
[Tue Mar 22 08:02:38 2016]  [<ffffffff817c26f2>] entry_SYSCALL_64_fastpath+0x16/0x7a

The pagecache_get_page frame makes it look like this isn't actually the Ceph client getting stuck, but it comes just about exactly 2 minutes after caps go stale, so I presume it's actually blocked on caps?

...okay, I missed this yesterday:

[Tue Mar 22 08:12:34 2016] ceph: mds0 closed our session
[Tue Mar 22 08:12:34 2016] ceph: mds0 reconnect start
[Tue Mar 22 08:12:36 2016] ceph: mds0 reconnect denied
[Tue Mar 22 08:12:36 2016] ceph:  dropping unsafe request 25351907
[Tue Mar 22 08:12:36 2016] ceph:  dropping dirty+flushing Fw state for ffff8802db23db08 1099864856894

So, this seems like a bit late to be starting a reconnect, but anything after this point (eg, the hung file accesses) are probably by deliberate choice, right? So the question becomes how the kernel client got so stuck to begin with.


Related issues

Related to fs - Bug #16186: kclient: drops requests without poking system calls on reconnect Duplicate 06/08/2016

History

#1 Updated by Greg Farnum over 2 years ago

  • Assignee set to Zheng Yan

Zheng, can you please take a look and see if there's anything we can do with this? :)

#2 Updated by Zheng Yan over 2 years ago

The backtrace show some pages were in locked state. This seems like OSD requests hang. (pages get unlocked in osd request's complete callback)

#3 Updated by Greg Farnum over 2 years ago

The osdc also reported no outstanding requests.

If we see this again, Zheng has asked us to get backtraces on all the threads — "#echo 1 > /proc/sys/kernel/sysrq" to enable and then "#echo t > /proc/sys/kernel/sysrq" (and then turn it off with 0).

#4 Updated by Greg Farnum over 2 years ago

Checking out the ceph log from that day, there's nothing going on except some scrubbing and the usual pg map/throughput updates. We do record the client's session getting closed

2016-03-22 08:14:07.797397 mds.0 172.21.4.136:6814/2176 723 : cluster [INF] closing stale session client.23608350 172.21.9.23:0/2896907583 after 302.151630

There are some slow op warnings, but they're ~1.5 hours earlier and only on the MDS. The last is

2016-03-22 06:51:07.813166 osd.47 172.21.7.110:6806/8627 44238 : cluster [WRN] slow request 30.149121 seconds old, received at 2016-03-22 06:50:37.663351: osd_op(mds.0.211:45481777 100150d6a67.00000000 [tmap2omap 1~0,omap-set-header 0~226,omap-set-vals 0~873] 1.e76cd691 ondisk+write+known_if_redirected+full_force e686038) currently commit_sent

#5 Updated by Zheng Yan over 2 years ago

I try cutting network and waiting until MDS cleanup client session, I found one process was stuck at ceph_get_caps(). This can explain this issue too. (ceph_get_caps() is interruptable)

[ 1079.045266] ffsb            S ffff880079bba400 13344  2529   2386 0x00000080
[ 1079.045936]  ffff880079afbba8 0000000000000092 ffff880079f41000 0000000000000000
[ 1079.046886]  ffff880079b88000 ffff880077cac140 0000100079afbc00 ffff880079afc000
[ 1079.047837]  00000000023e3000 0000000000000000 ffff880079afbc2c 0000000000001000
[ 1079.048781] Call Trace:
[ 1079.049171]  [<ffffffff8191b9ec>] schedule+0x3c/0x90
[ 1079.049706]  [<ffffffffa0061f73>] ceph_get_caps+0x203/0x2d0 [ceph]
[ 1079.050326]  [<ffffffff8109b710>] ? prepare_to_wait_event+0xf0/0xf0
[ 1079.050930]  [<ffffffffa0053ec3>] ceph_write_iter+0x283/0xef0 [ceph]
[ 1079.051554]  [<ffffffff810a24c3>] ? __lock_acquire+0xc03/0x19a0
[ 1079.052150]  [<ffffffff810a24c3>] ? __lock_acquire+0xc03/0x19a0
[ 1079.052732]  [<ffffffff813429d3>] ? __this_cpu_preempt_check+0x13/0x20
[ 1079.053366]  [<ffffffff813494e5>] ? __percpu_counter_add+0x85/0xb0
[ 1079.053965]  [<ffffffff811a2095>] __vfs_write+0xa5/0xe0
[ 1079.054518]  [<ffffffff811a2781>] vfs_write+0xa1/0x1a0
[ 1079.055067]  [<ffffffff811a33a4>] SyS_write+0x44/0xa0
[ 1079.055598]  [<ffffffff81921bd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

#7 Updated by Greg Farnum over 2 years ago

This happened again yesterday. It kind of seems like there's more going on that leads to the failed reconnect. :/

Apr  7 04:09:14 teuthology kernel: [1271710.750915] libceph: osd49 down
Apr  7 04:09:19 teuthology kernel: [1271715.757087] libceph: osd13 down
Apr  7 04:09:19 teuthology kernel: [1271715.760814] libceph: osd9 down
Apr  7 04:09:19 teuthology kernel: [1271715.764172] libceph: osd12 down
Apr  7 04:09:29 teuthology kernel: [1271725.766704] libceph: osd18 down
Apr  7 04:09:32 teuthology kernel: [1271728.087099] libceph: osd9 up
Apr  7 04:09:32 teuthology kernel: [1271728.090273] libceph: osd12 up
Apr  7 04:09:32 teuthology kernel: [1271728.093449] libceph: osd13 up
Apr  7 04:09:32 teuthology kernel: [1271728.096703] libceph: osd49 up
Apr  7 04:09:34 teuthology kernel: [1271730.772925] libceph: osd18 up
Apr  7 04:09:54 teuthology kernel: [1271750.648419] ceph: mds0 caps stale
Apr  7 04:10:14 teuthology kernel: [1271770.637905] ceph: mds0 caps stale
Apr  7 04:10:17 teuthology kernel: [1271773.518794] ceph: mds0 reconnect start
Apr  7 04:10:17 teuthology kernel: [1271773.766181] ceph: mds0 reconnect success
Apr  7 04:10:54 teuthology kernel: [1271810.358872] ceph: mds0 recovery completed
Apr  7 06:02:04 teuthology kernel: [1278476.794716] ceph: build_path did not end path lookup where expected, namelen is 149, pos is 0
Apr  7 09:01:43 teuthology kernel: [1289250.522398] libceph: osd48 down
Apr  7 09:01:44 teuthology kernel: [1289250.661701] libceph: osd32 172.21.7.134:6827 socket closed (con state OPEN)
Apr  7 09:01:52 teuthology kernel: [1289259.095179] libceph: osd48 up
Apr  7 09:01:54 teuthology kernel: [1289260.639732] libceph: osd111 172.21.4.114:6800 socket closed (con state OPEN)
Apr  7 09:11:47 teuthology kernel: [1289854.187903] libceph: osd70 172.21.9.116:6800 socket closed (con state OPEN)
Apr  7 12:56:15 teuthology kernel: [1303315.055906] libceph: osd1 down
Apr  7 12:56:16 teuthology kernel: [1303315.619872] libceph: osd123 172.21.7.110:6825 socket closed (con state OPEN)
Apr  7 12:56:16 teuthology kernel: [1303315.762598] libceph: osd18 172.21.5.114:6810 socket closed (con state OPEN)
Apr  7 12:56:25 teuthology kernel: [1303324.540286] libceph: osd10 down
Apr  7 12:56:25 teuthology kernel: [1303324.543708] libceph: osd26 down
Apr  7 12:56:29 teuthology kernel: [1303328.166294] libceph: osd121 172.21.6.108:6804 socket closed (con state OPEN)
Apr  7 12:56:36 teuthology kernel: [1303336.046999] libceph: osd22 down
Apr  7 12:56:36 teuthology kernel: [1303336.050397] libceph: osd23 down
Apr  7 12:56:36 teuthology kernel: [1303336.053766] libceph: osd25 down
Apr  7 12:56:37 teuthology kernel: [1303336.057170] libceph: osd27 down
Apr  7 12:56:44 teuthology kernel: [1303343.767991] ceph: mds0 caps stale
Apr  7 12:57:04 teuthology kernel: [1303363.757181] ceph: mds0 caps stale
Apr  7 13:00:49 teuthology kernel: [1303588.628956] ceph: mds0 hung
Apr  7 13:17:16 teuthology kernel: [1304575.333713] libceph: mds0 172.21.6.130:6801 socket closed (con state OPEN)
Apr  7 13:17:20 teuthology kernel: [1304579.091775] libceph: mds0 172.21.6.130:6801 socket closed (con state CONNECTING)
Apr  7 13:17:23 teuthology kernel: [1304582.090307] libceph: mds0 172.21.6.130:6801 socket closed (con state CONNECTING)
Apr  7 13:17:26 teuthology kernel: [1304584.544826] libceph: mon2 172.21.6.130:6789 socket closed (con state OPEN)
Apr  7 13:17:26 teuthology kernel: [1304584.552012] libceph: mon2 172.21.6.130:6789 session lost, hunting for new mon
Apr  7 13:17:26 teuthology kernel: [1304584.588615] ceph: mds0 reconnect start
Apr  7 13:17:27 teuthology kernel: [1304585.741860] libceph: mon0 172.21.4.136:6789 session established
Apr  7 13:17:27 teuthology kernel: [1304585.759056] ceph: mds0 reconnect denied
Apr  7 13:17:27 teuthology kernel: [1304585.763191] ceph:  dropping unsafe request 15878863
Apr  7 13:17:27 teuthology kernel: [1304585.768290] ceph:  dropping unsafe request 15878864
Apr  7 13:17:27 teuthology kernel: [1304585.777933] ceph:  dropping dirty+flushing Fw state for ffff88003a31c9d8 1099872537254
Apr  7 13:17:27 teuthology kernel: [1304585.787524] ceph:  dropping dirty+flushing Fw state for ffff88003a31d270 1099872537255
Apr  7 13:17:27 teuthology kernel: [1304585.797358] ceph:  dropping dirty+flushing Fw state for ffff88003a31e3a0 1099872537257
Apr  7 13:17:27 teuthology kernel: [1304585.805665] ceph:  dropping dirty+flushing Fw state for ffff88003a31ec38 1099872537258
Apr  7 13:17:27 teuthology kernel: [1304585.813959] ceph:  dropping dirty+flushing Fw state for ffff880126b0db08 1099872537259
Apr  7 13:17:27 teuthology kernel: [1304585.822134] ceph:  dropping dirty+flushing Fxw state for ffff88003a31db08 1099872537256
Apr  7 13:17:27 teuthology kernel: [1304585.830373] ceph:  dropping dirty+flushing Fw state for ffff880126b0a778 1099872537262
Apr  7 13:17:27 teuthology kernel: [1304585.838532] ceph:  dropping dirty+flushing Fw state for ffff880126b0b010 1099872537263
Apr  7 13:17:27 teuthology kernel: [1304585.846695] ceph:  dropping dirty+flushing Fw state for ffff880126b0c140 1099872537265
Apr  7 13:17:27 teuthology kernel: [1304585.854874] ceph:  dropping dirty+flushing Fxw state for ffff88043faf9ee0 1099872537271
Apr  7 13:17:27 teuthology kernel: [1304585.867344] libceph: mds0 172.21.8.114:6800 socket closed (con state OPEN)
Apr  7 13:17:28 teuthology kernel: [1304587.081284] libceph: mds0 172.21.8.114:6800 connection reset
Apr  7 13:17:28 teuthology kernel: [1304587.087182] libceph: reset on mds0
Apr  7 13:17:28 teuthology kernel: [1304587.090802] ceph: mds0 closed our session
Apr  7 13:17:28 teuthology kernel: [1304587.095077] ceph: mds0 reconnect start

#8 Updated by Greg Farnum over 2 years ago

Zheng, does that patch mean that the client no longer drops any dirty state? Or does it keep the dirty page cache and potentially write stale data out, if somebody else updated it in the meantime?
Or do we have some way of identifying that nobody has touched the files since we did? That might be something we end up needing to do.... :/

#9 Updated by Greg Farnum over 2 years ago

  • Status changed from New to Feedback
  • Assignee changed from Zheng Yan to Sage Weil

Sage, I think we discussed this a day or two ago in standup but I don't think you've looked at it yet. Please comment! :)

#10 Updated by Zheng Yan over 2 years ago

Greg Farnum wrote:

Zheng, does that patch mean that the client no longer drops any dirty state? Or does it keep the dirty page cache and potentially write stale data out, if somebody else updated it in the meantime?
Or do we have some way of identifying that nobody has touched the files since we did? That might be something we end up needing to do.... :/

client drops dirty metadata, but it keeps old dirty data and will write them out.

#11 Updated by Greg Farnum over 2 years ago

I don't think that's acceptable unless we have a way of verifying nobody else has modified (or maybe even just touched?) the files since we lost our caps. Or at least, in the past we've decided it wasn't acceptable. Perhaps we'd like to change our minds after experiencing the trauma. ;)
(I don't want to change my mind.)

But if we can detect that nobody has touched them, we should be okay flushing out our metadata updates as well!

#12 Updated by Greg Farnum over 2 years ago

  • Related to Bug #16186: kclient: drops requests without poking system calls on reconnect added

#13 Updated by Sage Weil over 2 years ago

This type of race has never been addressed:

- client A has dirty writeback data
- client A is disconnected
- mds times out caps, lets client B do stuff to the file
- client A reconnects

I could never figure out what it should do in that case. We could check to see if there has been no change, and if so, flush out our dirty data.. that optimistically might work sometimes and at least won't clobber newer stuff, but it will screw up the timeline from client B's perspective in that client A overwrote X with Y before it was kicked out, and client B read X after A was kicked out, and at the end of all this the file will contain Y.

The most correct thing is probably to throw out the dirty data (I think that's what we do now?) and make sure that the next fsync() on client A returns an EIO?

#14 Updated by Greg Farnum over 2 years ago

Yeah, I think we should be throwing it out as well (I don't know if that actually happens or not). If we wanted to get really clever, we could try to detect if we're in a scenario where there's a client B or not, by eg tagging inodes as "stale caps for A" and clearing them on B's access, or otherwise letting A have the caps back. That at least narrows the unpleasant failure space, at the cost of yet more complexity. ;)

But separately, we seem to have a bug where processes on client A invoke a syscall to read or write data, and just don't get any response back at all after the reconnect. That's what #16186 is about.

#15 Updated by John Spray over 2 years ago

  • Assignee changed from Sage Weil to Jeff Layton

#16 Updated by Jeff Layton over 2 years ago

Sage Weil wrote:

The most correct thing is probably to throw out the dirty data (I think that's what we do now?) and make sure that the next fsync() on client A returns an EIO?

Yeah, given the strict semantics that ceph caps seem to imply, that's really the only choice that I can see. Once the client figures out that its session has been revoked, I think it has to go through any open files and invalidate their pages. If any of the files have dirty pages, then I think we basically need to SetPageError on them and then call mapping_set_error on the mapping.

Eventually the struct files will get closed somehow (maybe the program exits with an error), the inode will get evicted from the cache and then things can start anew. I'll go through the code and see if I can figure out how that works now and whether we need to change anything there.

Oh, and to be clear -- I don't see lost operations with the latest kernels. I think Zheng's recent patches have cleaned this up significantly.

#17 Updated by Greg Farnum over 2 years ago

  • Component(FS) kceph added

#18 Updated by Greg Farnum over 2 years ago

  • Category changed from 53 to Correctness/Safety

#19 Updated by Jeff Layton over 2 years ago

I started looking at this today. Some notes:

My understanding of how this works is that when the client goes unresponsive for a while, the MDS will destroy its session. At that point, the client will eventually get a CEPH_SESSION_STALE message back from the server. That code does this:

        case CEPH_SESSION_STALE:                                                
                pr_info("mds%d caps went stale, renewing\n",                    
                        session->s_mds);                                        
                spin_lock(&session->s_gen_ttl_lock);                            
                session->s_cap_gen++;                                           
                session->s_cap_ttl = jiffies - 1;                               
                spin_unlock(&session->s_gen_ttl_lock);                          
                send_renew_caps(mdsc, session);                                 
                break;                                                          

So it looks like when the session ends up stale, that we try to regain our caps and soldier on. I don't think that's really what we want. It seems like we ought to be calling remove_session_caps in this situation. though I guess it's possible that the MDS will refuse to renew the caps in this case, and the client will be forced to close the session?

Which means another dilemma...how do we distinguish between the eviction of a client due to a network partition, and the case where the server rebooted? In either case, doesn't the session end up "stale" once the client tries to talk to it again?

I think I need to sort out these scenarios and document them...

#20 Updated by Greg Farnum over 2 years ago

Not sure how far along you got, Jeff, but you should be able to identify which scenario you're in based on the MDSMap.
I think there is a short period of your session being stale where the MDS will let you back in as a sort of warning period, but basically, if the MDS is in reconnect you can rejoin; otherwise you're going to get knocked off. I believe the userspace client does the right stuff here so you might want to check its logic for reference. :)

#21 Updated by Patrick Donnelly 8 months ago

  • Target version set to v14.0.0

#22 Updated by Patrick Donnelly 3 months ago

  • Status changed from Feedback to New

#23 Updated by Jeff Layton about 2 months ago

  • Project changed from fs to Linux kernel client
  • Category deleted (Correctness/Safety)

Also available in: Atom PDF