Project

General

Profile

Actions

Bug #17564

open

close race window when handling writes on a file descriptor opened with O_APPEND

Added by Jeff Layton over 7 years ago. Updated about 5 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This comment is in _write() in the userland client code:

  // use/adjust fd pos?
  if (offset < 0) {
    lock_fh_pos(f);
    /*
     * FIXME: this is racy in that we may block _after_ this point waiting for caps, and size may
     * change out from under us.
     */
    if (f->flags & O_APPEND) {
      int r = _lseek(f, 0, SEEK_END);
      if (r < 0) {
        unlock_fh_pos(f);
        return r;
      }
    }

I think we can close this race window by ensuring that when we're doing a write to an O_APPEND file descriptor, that we request the caps we'll need to fetch the size and then just fetch that size out of the Inode directly before repositioning the offset.

Actions #1

Updated by Zheng Yan over 7 years ago

Not that easy to do. MDS does not always issue caps what client wants. In this case, client wants Fsw, but MDS may only give client Fw.

Actions #2

Updated by Jeff Layton over 7 years ago

For the record -- I really don't care for O_APPEND semantics, but...

_write() does this currently:

  int r = get_caps(in, CEPH_CAP_FILE_WR, CEPH_CAP_FILE_BUFFER, &have, endoff);                                                                

What I'd propose is that when we're doing an write to eof on an O_APPEND fd, that we request CEPH_CAP_FILE_WR|CEPH_CAP_FILE_SHARED, and then just fail the write if we can't get Fsw. If it succeeds, we do the lseek to EOF (maybe open-coding it to avoid losing mutex) and then issue the actual write to the correct offset.

That at least allows you to know that the size hasn't changed since you acquired caps, at least up to the point where the client_lock is released after issuing the write.

Actions #3

Updated by Greg Farnum over 7 years ago

Are we allowed to fail writes like that? :/

It doesn't actually close the race in all cases, but for sane use it's probably safe to do something like grab the size as we do here, and send the op off with a rados assert that the final object is the expected size, and do a retry loop if we fail. (That doesn't cover cases where somebody writes to a larger different object.)

Actions #4

Updated by Jeff Layton over 7 years ago

Good point. Yeah, a retry loop may be the best we can do in that case. I'll have to read up on rados asserts to make sure I understand what you're proposing there though.

Actions #5

Updated by Patrick Donnelly about 5 years ago

  • Assignee deleted (Jeff Layton)
Actions

Also available in: Atom PDF