Bug #23194
closed
librados client is sending bad omap value just before program exits
Added by Jeff Layton about 6 years ago.
Updated about 6 years ago.
Component(RADOS):
librados
Description
I've been tracking down a problem in nfs-ganesha where an omap value in an object ends up truncated. It doesn't always happen, but I can make it occur pretty consistently when I run ganesha in a docker container. Attached is a packet capture:
Frame 153: client (172.17.0.2) sends the correct 63 byte value to the OSD
Frame 201: OSD sends value back to the client
Frame 202: client sends same value back to OSD but only 29 bytes of it
Soon after, the program (and container) dies. I don't think ganesha is driving an update to the omap value at that point, so I'm assuming that this is some internal librados thread doing it. Maybe it's catching a signal or something? In any case, I'm not sure of the significance of the 29 bytes, but it quite consistently truncates it to that length.
Files
Ahh, the object name is 29 bytes in this case, so maybe there is some confusion about lengths down in the code that is sending this request?
- Subject changed from librados client is sending bad omap value to librados client is sending bad omap value just before program exits
- Severity changed from 3 - minor to 2 - major
I do have the ability to collect client logs within the container, and can turn up debugging in there if it'll help.
Frame 201:
Object: rec-00000000:0000000000000017
Key: 6528071705456279553
Value: ::ffff:192.168.1.243-(37:Linux NFSv4.2 nfsclnt.poochiereds.net)
Frame 202:
Object: rec-00000000:0000000000000017
Key: 6528071705456279553
Value: ::ffff:192.168.1.243-(37:Linu
rados_kv_get does look hinky, but I don't think we're calling into it here. We're basically doing a rados_kv_put into the object, early on, around frame 153, when the only client is connecting to the server.
We don't touch the omap again afterward as we don't have any reason to touch it again after that point. Something however is issuing another call to the OSD to store a truncated version of the string at close to program exit (23s after the store).
I'll plan to poke at it some more tomorrow. Maybe I can come up with a more self-contained reproducer.
... there was a "omap get" right before the store and the values stored where the (truncated) values that were just retrieved. That's why it looks very odd -- regardless, that's a bug that should be addressed.
- Status changed from New to Rejected
Thanks Jason. You were absolutely right -- the omap get/put at exit is being driven by ganesha. I had missed that before in debugging, but it looks to be the result of a different bug, that I'll be chasing down today.
I've queued up a patch to fix up rados_kv_get too, as you point out. It turns out that that problem is not what's causing this, but while I'm in here I'll go ahead and fix it:
https://review.gerrithub.io/402289
Also available in: Atom
PDF