Bug #7335

closed

librbd does not raise "Object Not Found", instead returning NUL bytes

Added by Florian Haas about 10 years ago. Updated about 10 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
librbd
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Test case attached. test_remove_object fails. Prose description:

  • Create RBD image
  • Remove RADOS object belonging to that image
  • Read back image
  • Instead of raising "Object Not Found", rbd substitutes zeroes

Files

rbdtest.py (2.17 KB) Florian Haas, 02/04/2014 09:32 AM
Actions #1

Updated by Florian Haas about 10 years ago

  • Description updated
Actions #2

Updated by Greg Farnum about 10 years ago

At present this is expected behavior. In order to raise an error we would need to know that the object was supposed to be present, and since RBD uses thin allocation and does not keep any kind of object index, we don't know if objects should be present or not. Instead we rely on RADOS for data integrity. Do you have some reason to believe that
1) we can't rely on RADOS for that, and
2) returning an error is better than not? (I guess that a real disk would keep attempting to read until it hits a timeout and reports that?)
(I've pushed for an index in the past so that it's cheaper to remove images and figure out where we need to go to look up data when they're layered, but it would not be a trivial change at this point.)

Actions #3

Updated by Florian Haas about 10 years ago

Greg Farnum wrote:

At present this is expected behavior. In order to raise an error we would need to know that the object was supposed to be present, and since RBD uses thin allocation and does not keep any kind of object index, we don't know if objects should be present or not. Instead we rely on RADOS for data integrity. Do you have some reason to believe that
1) we can't rely on RADOS for that, and

The "OSDs zapping xattrs when they get too large" issue (is there a Redmine bug for that?) has somewhat shaken my confidence in RADOS being the only line of defense safeguarding my data. Having an additional layer ensuring that my data is still what I think it is would certainly be nice.

Surely librbd calls map to underlying librados calls, and RBD addressing an object that simply isn't there sounds like a condition that can be caught and handled more elegantly than returning NULs. Right?

2) returning an error is better than not? (I guess that a real disk would keep attempting to read until it hits a timeout and reports that?)

Yes, you are obviously correct that any "Object Not Found" error should be raised only after a timeout expires -- but as far as I recall, librados already does block on objects being unavailable because they are unfound or they are part of a stuck PG.

Actions #4

Updated by Greg Farnum about 10 years ago

Nonexistent objects are defined as zeros in RBD. We'd have to distinguish between deliberately nonexistent and lost within RBD itself to do what you're asking (thus, the index I referenced).
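
The distinction can be sketched with a toy in-memory model. This is not librbd code: `ToyImage`, `ObjectLost`, and the `written` set are hypothetical names, and `written` stands in for the object index that RBD does not keep. Without the index, a missing object can only be read as a hole. With it, a read can tell a lost object from one that was never written or was deliberately discarded.

```python
class ObjectLost(Exception):
    """Raised when the index says an object was written but the store lacks it."""


class ToyImage:
    """Toy model of a thin-provisioned image split into fixed-size objects."""

    def __init__(self, num_objects, object_size=4, use_index=False):
        self.num_objects = num_objects
        self.object_size = object_size
        self.store = {}        # object number -> bytes; stands in for RADOS
        self.use_index = use_index
        self.written = set()   # hypothetical index of objects that should exist

    def _check(self, objno):
        if not 0 <= objno < self.num_objects:
            raise IndexError(f"object {objno} is outside the image")

    def write_object(self, objno, data):
        self._check(objno)
        if len(data) > self.object_size:
            raise ValueError("data is larger than one object")
        self.store[objno] = bytes(data)
        self.written.add(objno)

    def discard_object(self, objno):
        # Deliberate removal (TRIM/DISCARD): drop the data and the index entry.
        self._check(objno)
        self.store.pop(objno, None)
        self.written.discard(objno)

    def read_object(self, objno):
        self._check(objno)
        data = self.store.get(objno)
        if data is None:
            if self.use_index and objno in self.written:
                # Only the index makes "lost" distinguishable from "hole".
                raise ObjectLost(objno)
            return b"\0" * self.object_size
        # A short object is padded with zeros up to the object size.
        return data.ljust(self.object_size, b"\0")


# Without an index, an object deleted behind the image's back reads as zeros.
plain = ToyImage(num_objects=4)
plain.write_object(1, b"data")
del plain.store[1]
zeros = plain.read_object(1)
```

The cost of the indexed variant is that every first write and every discard has to update `written`, which is the extra bookkeeping this thread is weighing.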

Actions #5

Updated by Florian Haas about 10 years ago

Nonexistent objects are defined as zeros in RBD.

Erm, OK. I take it that this is also how TRIM/DISCARD is implemented, you just drop the object?

We'd have to distinguish between deliberately nonexistent and lost within RBD itself to do what you're asking (thus, the index I referenced).

Probably. Now, in the present state of things, under which of the below-listed circumstances does RBD return NUL bytes?

  • Object does not exist in RADOS (confirmed)
  • Object is part of a stuck PG
  • Object is unreadable in all OSDs (missing user.ceph.* xattrs, for example)
  • Object is unfound

I think it would be counterintuitive if rbd blocked on the latter three, but silently returned NUL on the first one. However, it would be utterly frightening if RBD returned NULs for all four.

Actions #6

Updated by Greg Farnum about 10 years ago

Just the first; as you say, it's how we do TRIM and that implementation is very common behavior among storage systems.

Actions #7

Updated by Florian Haas about 10 years ago

OK. Now the recent rgw/xattr bug has shown that it is possible for a Ceph application to mess up objects so badly that manual recovery is the only option (I believe you referred to it as a "hack job" in #ceph). What if at some point an OSD/RADOS/rbd bug just zaps objects or renders their corresponding filestore files unreadable? In that case I'd really like my VMs (most people's primary use case for RBD) to grind to a halt, rather than believing everything is hunky-dory while being fed zeroes. Agree/disagree?

Actions #8

Updated by Greg Farnum about 10 years ago

There's a big difference between what happens if the objects get zapped versus rendered unreadable — if unreadable, the read hangs. If it gets zapped, I honestly have no idea what the "desired behavior" would be, but I suspect you're right that it's some kind of hang or error code. I'm just letting you know that the solution would require an RBD storage architecture change, and I don't know that anybody's going to prioritize this (unlikely) scenario highly enough for that to happen.
(In particular, if RADOS randomly deletes objects we're screwed by a lot more than VMs getting back erroneously-zeroed data. It's just as easy to construct a hypothetical scenario where RBD erroneously deletes the object and marks it as correctly deleted as it is to construct one where it deletes blocks incorrectly.)

Actions #9

Updated by Dan Mick about 10 years ago

Another way to state this: rbd images are defined such that if an object doesn't exist within the defined size of the image, that represents a "hole" in the image that reads as zeroes. (The same effect occurs when an object exists but is shorter than the block size of the image.) This allows sparse provisioning and really is something you want.
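
As a rough illustration of that definition, here is a toy read path. It is not the librbd implementation; `read_image` and its parameters are hypothetical, and `objects` is a plain dict standing in for the RADOS objects backing one image. Missing objects and the tail of short objects both come back as zero bytes.

```python
def read_image(objects, image_size, object_size, offset, length):
    """Read `length` bytes at `offset` from a toy image.

    `objects` maps object number -> the bytes actually stored for it.
    Reads are clipped at `image_size`.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    end = min(offset + length, image_size)
    out = bytearray()
    pos = offset
    while pos < end:
        # Map the image offset to an object number and an offset inside it.
        objno, obj_off = divmod(pos, object_size)
        chunk_len = min(object_size - obj_off, end - pos)
        stored = objects.get(objno, b"")   # missing object -> empty -> hole
        chunk = stored[obj_off:obj_off + chunk_len]
        out += chunk.ljust(chunk_len, b"\0")   # pad holes and short objects
        pos += chunk_len
    return bytes(out)


# Object 1 is missing entirely; object 2 is shorter than the object size.
objects = {0: b"AAAA", 2: b"CC"}
whole = read_image(objects, image_size=12, object_size=4, offset=0, length=12)
```

Nothing in `objects` records whether object 1 was never written, was discarded, or was removed by something else, which is why the read cannot raise "Object Not Found" in this model.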

But, note, this is also no different from normal files in a filesystem: if some external agent comes along and deletes the middle of a backing-store file, zeroes will be read from that file and no history that the data was not zeroes will be available.
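
The filesystem side of that comparison can be seen with an ordinary file. This is a minimal sketch: the function name and file name are made up for illustration, and it only shows a gap that was never written, not the act of deleting existing data (hole punching is platform-specific and left out).

```python
import os
import tempfile


def read_file_with_gap(gap_size=1 << 16):
    """Write 'head', skip `gap_size` bytes without writing, write 'tail',
    then read the whole file back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backing-store")
        with open(path, "wb") as f:
            f.write(b"head")
            f.seek(gap_size, os.SEEK_CUR)   # move forward without writing
            f.write(b"tail")
        with open(path, "rb") as f:
            return f.read()


data = read_file_with_gap()
```

The bytes between `head` and `tail` read back as zeros, and the reader has no way to tell that they were never written rather than written as zeros.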

Actions #10

Updated by Florian Haas about 10 years ago

Dan Mick wrote:

Another way to state this: rbd images are defined such that if an object doesn't exist within the defined size of the image, that represents a "hole" in the image that reads as zeroes. (The same effect occurs when an object exists but is shorter than the block size of the image.) This allows sparse provisioning and really is something you want.

Well, I see the problem with the index that Greg was mentioning -- it seems that that could cause some pretty nasty contention when you have a VM doing random I/O all over the volume, and then needing to update some central index.

But, note, this is also no different from normal files in a filesystem: if some external agent comes along and deletes the middle of a backing-store file, zeroes will be read from that file and no history that the data was not zeroes will be available.

That's fine, though I must admit I'm a bit allergic to this type of reasoning... when your technology wants to be better than X, then saying "X is no better than this" is a bit weak. And I suppose Ceph does want to be better than conventional filesystem-based storage. (I have the same issues when in an OpenStack discussion someone says "but AWS doesn't do this either, why should we?" But I digress and that's a philosophical discussion that isn't to be had in an issue report.)

One thing that would be nice, of course, would be that if something held an rbd lock, even a RADOS client couldn't delete or update an object associated with an RBD image without first pre-empting the lock. But the only way I would see that happening would be librbd enumerating all the objects associated with a volume when said volume is locked, and then doing rados::cls::lock::lock() on all of them, rather than just the header object, which I believe is what it does now. That will probably make the locking call significantly slower, but could be acceptable because it only needs to happen once, when the image is opened/locked. Thoughts?

Actions #11

Updated by Greg Farnum about 10 years ago

Locking in RADOS is strictly voluntary; making it a mandatory thing would require keeping global knowledge about which clients are alive, among a host of other issues. And even if it weren't, that wouldn't solve your problem, since you could remove RBD images with impunity while the machine was shut down.
We'll definitely keep this concern in mind next time we're revising RBD, but we can't do much until then and the specific details here (requiring either a malicious administrator or a very serious software bug) are of the sort that can always break your storage, no matter how many precautions you take.

Actions #12

Updated by Florian Haas about 10 years ago

Locking in RADOS is strictly voluntary; making it a mandatory thing would require keeping global knowledge about which clients are alive, among a host of other issues.

Yeah, I don't think anything in Linux ever got mandatory locking right. I was referring to advisory locking and just out of curiosity tried to hack up some Python code to confirm that at least the header object is properly locked when the image is locked -- that's how I found out about #7340. :)

And even if it weren't, that wouldn't solve your problem, since you could remove RBD images with impunity while the machine was shut down.

Removing a whole RBD image is quite a different animal compared to ripping objects out of an image, but I see your point and the same considerations apply to the latter.

Actions #13

Updated by Sage Weil about 10 years ago

  • Status changed from New to Won't Fix