Bug #23194

closed

librados client is sending bad omap value just before program exits

Added by Jeff Layton about 6 years ago. Updated about 6 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
librados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've been tracking down a problem in nfs-ganesha where an omap value in an object ends up truncated. It doesn't always happen, but I can make it occur pretty consistently when I run ganesha in a docker container. Attached is a packet capture:

Frame 153: client (172.17.0.2) sends the correct 63 byte value to the OSD
Frame 201: OSD sends value back to the client
Frame 202: client sends same value back to OSD but only 29 bytes of it

Soon after, the program (and container) dies. I don't think ganesha is driving an update to the omap value at that point, so I'm assuming that this is some internal librados thread doing it. Maybe it's catching a signal or something? In any case, I'm not sure of the significance of the 29 bytes, but it quite consistently truncates it to that length.


Files

ganesha-ceph.pcapng.gz (12.8 KB): capture of traffic between ganesha and vstart cluster. Jeff Layton, 03/01/2018 08:49 PM

Actions #1

Updated by Jeff Layton about 6 years ago

Ahh, the object name is 29 bytes in this case, so maybe there is some confusion about lengths down in the code that is sending this request?

Actions #2

Updated by Jeff Layton about 6 years ago

  • Subject changed from librados client is sending bad omap value to librados client is sending bad omap value just before program exits
  • Severity changed from 3 - minor to 2 - major

I do have the ability to collect client logs within the container, and can turn up debugging in there if it'll help.

Actions #3

Updated by Jason Dillaman about 6 years ago

Frame 201:
Object: rec-00000000:0000000000000017
Key: 6528071705456279553
Value: ::ffff:192.168.1.243-(37:Linux NFSv4.2 nfsclnt.poochiereds.net)

Frame 202:
Object: rec-00000000:0000000000000017
Key: 6528071705456279553
Value: ::ffff:192.168.1.243-(37:Linu
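
For anyone double-checking the lengths mentioned earlier in the thread, here is a minimal sketch that measures the strings decoded from frames 201 and 202. The strings are copied from the decode above; the helper names (`is_prefix`, `truncation_matches_name_length`) are made up for illustration and are not part of librados or nfs-ganesha.

```c
#include <stddef.h>
#include <string.h>

/* Strings copied verbatim from the decoded frames 201 and 202. */
const char *const object_name = "rec-00000000:0000000000000017";
const char *const full_value =
    "::ffff:192.168.1.243-(37:Linux NFSv4.2 nfsclnt.poochiereds.net)";
const char *const truncated_value = "::ffff:192.168.1.243-(37:Linu";

/* Hypothetical helper: returns 1 if `shorter` is a leading part of `longer`. */
int is_prefix(const char *shorter, const char *longer)
{
    size_t n = strlen(shorter);

    return strlen(longer) >= n && memcmp(shorter, longer, n) == 0;
}

/*
 * Hypothetical helper: returns 1 if the value from frame 202 is a prefix
 * of the value from frame 201 and is exactly as long as the object name.
 */
int truncation_matches_name_length(void)
{
    return is_prefix(truncated_value, full_value) &&
           strlen(truncated_value) == strlen(object_name);
}
```

Calling `strlen()` on the three strings gives 29 for the object name, 63 for the full value and 29 for the truncated value, which matches the numbers reported in the description and in comment #1.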

Actions #4

Updated by Jason Dillaman about 6 years ago

I don't know what nfs-ganesha code to look at, but this [1] looks very suspect to me since you are returning a pointer to a string whose memory you don't own and will most likely be freed before leaving that function.

[1] https://github.com/nfs-ganesha/nfs-ganesha/blob/981cb36abf4f64a478201a90d961798e75f14bb4/src/SAL/recovery/recovery_rados_kv.c#L195
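
To illustrate the ownership concern in general terms, here is a minimal sketch of the safer pattern: copy the fetched bytes into memory the caller owns before the buffer that holds them is released. This is not the ganesha code and does not use the librados API; `struct result_buf`, `result_fetch`, `result_release` and `get_value_copy` are hypothetical stand-ins.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a result buffer owned by a lower layer. */
struct result_buf {
    char *data;   /* fetched bytes, NOT NUL-terminated */
    size_t len;   /* number of valid bytes in data */
};

/* Simulates a fetch by filling buf with a heap copy of src. */
int result_fetch(struct result_buf *buf, const char *src)
{
    buf->len = strlen(src);
    /* One spare byte so an empty value still gets a non-NULL buffer. */
    buf->data = malloc(buf->len + 1);
    if (buf->data == NULL)
        return -1;
    memcpy(buf->data, src, buf->len);
    return 0;
}

/* Frees the buffer; any pointer into buf->data dangles after this. */
void result_release(struct result_buf *buf)
{
    free(buf->data);
    buf->data = NULL;
    buf->len = 0;
}

/*
 * Returns a NUL-terminated copy of the fetched value that the caller owns
 * and must free(), or NULL on failure. The unsafe variant would hand back
 * buf.data itself and then release buf before returning.
 */
char *get_value_copy(const char *src)
{
    struct result_buf buf;
    char *out;

    if (result_fetch(&buf, src) != 0)
        return NULL;

    out = malloc(buf.len + 1);
    if (out != NULL) {
        memcpy(out, buf.data, buf.len);
        out[buf.len] = '\0';
    }
    result_release(&buf);
    return out;
}
```

The copy uses the explicit length rather than relying on a terminator in the source buffer, so the caller's string cannot be shortened or lengthened by whatever happens to follow the fetched bytes in memory.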

Actions #5

Updated by Jeff Layton about 6 years ago

rados_kv_get does look hinky, but I don't think we're calling into it here. We're basically doing a rados_kv_put into the object, early on, around frame 153, when the only client is connecting to the server.

We have no reason to touch the omap again after that point, and we don't. Something, however, is issuing another call to the OSD to store a truncated version of the string close to program exit (23s after the store).

I'll plan to poke at it some more tomorrow. Maybe I can come up with a more self-contained reproducer.

Actions #6

Updated by Jason Dillaman about 6 years ago

... there was an "omap get" right before the store, and the values stored were the (truncated) values that had just been retrieved. That's why it looks very odd -- regardless, that's a bug that should be addressed.

Actions #7

Updated by Jeff Layton about 6 years ago

  • Status changed from New to Rejected

Thanks Jason. You were absolutely right -- the omap get/put at exit is being driven by ganesha. I had missed that before in debugging, but it looks to be the result of a different bug that I'll be chasing down today.

I've queued up a patch to fix up rados_kv_get too, as you point out. It turns out that that problem is not what's causing this, but while I'm in here I'll go ahead and fix it:

https://review.gerrithub.io/402289
