Bug #40706

races in ceph_uninline_data

Added by Jeff Layton over 4 years ago. Updated over 4 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I've been staring at ceph_uninline_data() for a while now, and I think I've convinced myself that there are potential races here.

That function takes the i_ceph_lock and grabs the inline version. If the data still appears to be inlined, it then locks the page and starts writing the data out to the OSDs. Once that's done, it unlocks the page. The caller then eventually sets the inline version to CEPH_INLINE_NONE.

I think two tasks in the kernel could race and end up serializing on the page lock in there. The first task is in ceph_write_iter() doing an O_DIRECT write: it finishes uninlining the data and then does the write directly to the OSD. The second task then comes in before the inline_version is changed and does the uninlining again, overwriting the data written by the first task with stale data from the pagecache.
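
Roughly, the interleaving I'm worried about looks like this (simplified sketch, not the actual code paths):

/*
 * Task A (O_DIRECT write)                  Task B (another writer)
 * -----------------------                  -----------------------
 * ceph_uninline_data()
 *   i_inline_version != CEPH_INLINE_NONE
 *   lock page, write pagecache data
 *     to the OSDs, unlock page
 * write new data directly to the OSDs
 *                                          ceph_uninline_data()
 *                                            i_inline_version still not NONE
 *                                            lock page, write the now-stale
 *                                              pagecache data to the OSDs
 * i_inline_version = CEPH_INLINE_NONE (too late)
 */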

My guess is that it was done this way to try to minimize taking and dropping the i_ceph_lock, but I don't think it's safe.


Related issues

Related to CephFS - Feature #41311: deprecate CephFS inline_data support (Resolved)

History

#1 Updated by Jeff Layton over 4 years ago

I think we need to make ceph_uninline_data idempotent, such that if it gets called a second time it becomes a no-op.
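
Hand-wavy sketch of what I mean: have the function set the version itself once the data is safely on the OSDs, so a second caller hits the early return. This glosses over which lock actually serializes the two halves; ci here is ceph_inode(inode).

spin_lock(&ci->i_ceph_lock);
if (ci->i_inline_version == CEPH_INLINE_NONE) {
        /* someone already uninlined this inode -- nothing to do */
        spin_unlock(&ci->i_ceph_lock);
        return 0;
}
spin_unlock(&ci->i_ceph_lock);

/* ... write the inline data out to the OSDs ... */

/* flip the version so any later caller hits the no-op case above */
spin_lock(&ci->i_ceph_lock);
ci->i_inline_version = CEPH_INLINE_NONE;
spin_unlock(&ci->i_ceph_lock);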

#2 Updated by Jeff Layton over 4 years ago

  • Subject changed from potential races in ceph_uninline_data to races in ceph_uninline_data

#3 Updated by Jeff Layton over 4 years ago

My initial thinking was to use the page lock of the first page in the cache to serialize the callers, since we take that already anyway. However, in some cases we pass this function page pointers that are not in the cache, or let it allocate a page on its own to do the uninlining. That means racing tasks could end up operating on different pages, so that scheme may not be possible.

I really don't want to add another mutex to the inode, so I'm considering serializing the callers on the i_truncate_mutex. We'd typically only need to take it on the slow path anyway, so that may be acceptable.
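
Something along these lines (rough sketch; the prototype is simplified and __ceph_do_uninline_data() is just a made-up name for the existing body of the function):

/*
 * Sketch only. Real code would also want i_ceph_lock around the
 * version check/update.
 */
int ceph_uninline_data(struct inode *inode, struct page *locked_page)
{
        struct ceph_inode_info *ci = ceph_inode(inode);
        int err = 0;

        mutex_lock(&ci->i_truncate_mutex);
        if (ci->i_inline_version != CEPH_INLINE_NONE) {
                err = __ceph_do_uninline_data(inode, locked_page);
                if (!err)
                        ci->i_inline_version = CEPH_INLINE_NONE;
        }
        mutex_unlock(&ci->i_truncate_mutex);
        return err;
}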

#4 Updated by Zheng Yan over 4 years ago

For ceph_write_iter(), the uninlining is protected by the i_rwsem. But in ceph_page_mkwrite(), the i_rwsem is not held.

#5 Updated by Jeff Layton over 4 years ago

Not exactly.

After the uninlining is done, the i_inline_version is not updated until much later, after the i_rwsem is dropped. Thus, you can have one task in write_iter do the uninlining and then drop the i_rwsem and do the write, and then have another task do the uninlining again, overwriting the write done by the first task.

I think we have to ensure that some sort of lock is held during the uninlining and that the i_inline_version is updated before we drop it. Then, racing tasks can re-check the version after they get the lock but before they do the uninlining.
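
In other words, the ordering needs to change from something like the left column to the right (rough sketch, not actual code):

/*
 * Current (racy):                          Proposed:
 *
 *   take i_rwsem                             take serializing lock
 *   uninline data to OSDs                    re-check i_inline_version
 *   drop i_rwsem                             uninline data to OSDs
 *   write new data to OSDs                   i_inline_version = CEPH_INLINE_NONE
 *   ...much later...                         drop serializing lock
 *   i_inline_version = CEPH_INLINE_NONE      write new data to OSDs
 */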

#7 Updated by Jeff Layton over 4 years ago

My latest version of this set seems to have uncovered an ABBA deadlock in the inline write handling code: ceph_fill_inline_data takes the s_mutex and then the page lock, but ceph_page_mkwrite ends up taking them in the reverse order. We'll need to fix this up in the context of this set.
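
The lock ordering problem, roughly:

/*
 * ceph_fill_inline_data path:          ceph_page_mkwrite path:
 *   mutex_lock(&session->s_mutex)        lock_page(page)
 *   lock_page(page)                      ... eventually takes s_mutex
 *
 * Two tasks hitting these paths concurrently can deadlock (ABBA).
 */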

#8 Updated by Jeff Layton over 4 years ago

Looking over the code some more, I think the main issue is that we have so much of this code running under the session->s_mutex. The code just says:

        struct mutex      s_mutex;    /* serialize session messages */          

...but it's held even when we're not dealing with message ordering. It's not 100% clear what this lock protects.

I think we need to look at reducing the footprint of the s_mutex. If we can get to where ceph_fill_inline_data runs without it, then that would fix this deadlock (and probably have performance benefits as well).

#9 Updated by Jeff Layton over 4 years ago

More breadcrumbs -- from handle_reply():

/*
 * Handle mds reply.
 *
 * We take the session mutex and parse and process the reply immediately.
 * This preserves the logical ordering of replies, capabilities, etc., sent
 * by the MDS as they are applied to our local cache.
 */

...I think this rationale is somewhat bogus though.

There is a 1:1 relationship between a ceph_connection and a ceph_mds_session. The dispatcher for the connection runs in workqueue context, so only a single dispatch() call for a given connection will be running at a time. The reply handlers are therefore already serialized by virtue of running under the workqueue, and I think the s_mutex is superfluous here.
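
For illustration, a generic sketch of that pattern (made-up names, not the actual messenger code): the connection's receive work is a single work item, and the workqueue never runs the same work item concurrently with itself, so anything called from it is serialized per connection.

struct my_connection {
        struct work_struct rx_work;     /* one work item per connection */
};

static void handle_incoming(struct my_connection *con);  /* hypothetical dispatch */

static void rx_workfn(struct work_struct *work)
{
        struct my_connection *con =
                container_of(work, struct my_connection, rx_work);

        /*
         * The workqueue never runs a given work item concurrently on
         * multiple CPUs, so this handler -- and anything it calls,
         * e.g. a dispatch()/handle_reply() equivalent -- is already
         * serialized with respect to this connection.
         */
        handle_incoming(con);
}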

#10 Updated by Jeff Layton over 4 years ago

  • Project changed from Linux kernel client to CephFS
  • Subject changed from races in ceph_uninline_data to deprecate inline_data support

The current plan is to just deprecate inline_data support. While it's an interesting feature, it's a fairly significant maintenance burden, particularly in the kclient.

I've posted a PR to add an extra hurdle to anyone trying to enable this on new filesystems:

https://github.com/ceph/ceph/pull/29824

Closing this as a duplicate of #41311.

#11 Updated by Jeff Layton over 4 years ago

  • Project changed from CephFS to Linux kernel client
  • Subject changed from deprecate inline_data support to races in ceph_uninline_data

#12 Updated by Jeff Layton over 4 years ago

  • Status changed from New to Duplicate

#13 Updated by Jeff Layton over 4 years ago

  • Related to Feature #41311: deprecate CephFS inline_data support added

#14 Updated by Jeff Layton over 4 years ago

  • Status changed from Duplicate to Won't Fix
