non-idempotent transactions (clone) under ext3 may not replay correct result
The writeahead journaling will restore the store to a known state regardless of which operations have committed, but only if the transactions are idempotent. i.e. can be repeated and still end up at a known result. This is true of all the normal operations (remove, write, truncate, setxattr, etc., but not for clone and clone_range. E.g.,
clone foo_head -> foo_2
will put the old head content in _2, but if it is replayed _2 will contain the new _head content. Meh!
How to fix? We could mark which transaction are non-idempotent, and sync the store before applying them.. that's expensive, but possibly the price you pay for not using btrfs! :)
filestore: sync after non-idempotent operations
This is a big hammer to fix journal replay on non-btrfs fs backends (extN,
xfs, whatever). The problem is that it is not safe to replay some journal
operations more than once, notably things like CLONE whose source data
may be changed by subsequent operations.
The simple fix is to initiate a full commit after any non-idempotent
operations prior to any subsequent operation within the same Sequencer.
This is done by calling trigger_commit() in _do_transactions(), which means
any potentially dependent operation that follows will get blocked because
a commit is about to start.
I made trigger_commit() a bit more robust to callers who are not holding
an open_ops ref to also succeeding if the given op_seq is already
committing. For the current caller, that can't happen.
There are probably better performing solutions, but this one is at least
#8 Updated by Sage Weil over 7 years ago
Tommi Virtanen wrote:
Isn't the idempotency in that case "clone foo_head -> foo_2 IFF foo_2 does not exist" ?
That's almost enough for clone() (if we add O_EXCL and whitelist EEXIST for non-btrfs). It wouldn't catch something like
1 clone A->B
2 modify A
3 delete B
<replay from 1>
That trick also wouldn't work for clone_range(), which doesn't create a file.
It may be that we need to make transactions idempotent at a higher level, but it'd be dependent on what you clone to, and whether it is ever modified/removed... it'd be dependent on the particular, though, and hard to analyze/verify.
#10 Updated by Sage Weil over 7 years ago
I think the simplest solution would be:
- for all operations, set an xattr with the last op_seq to write to that file.
- for any operation that is potentially non-idempotent, fsync(2) after doing it.
- on replay, verify the xattr isn't == or newer to avoid re-doing the operation.
Those operations would be:
- create collection
- clone range
We need to set the attr on all operations to avoid something like
1 truncate B
2 clone A->B
3 modify A
<skip 2 due to xattr>
#16 Updated by Sage Weil about 7 years ago
Update: the current first pass plan is to initiate a FileStore sync after any non-idempotent operation. This updates commit_op_seq on disk and ensures that it won't be replayed.
It's also heavyweight as it calls sync(2). So it's a big hammer, but at least it's correct.
We can set a non-idempotent bool in the do_transaction method on CLONE or anything similar and then do the commit at the end (before any other operations occur under the