Bug #13032: client nonce collision due to unshared pid namespaces - Ceph - Ceph

Actions

Copy link

Bug #13032

closed

client nonce collision due to unshared pid namespaces

Added by Kjetil Joergensen over 8 years ago. Updated over 8 years ago.

Status:

Resolved

Priority:

High

Assignee:

Josh Durgin

Category:

librados

Target version:

% Done:

Source:

other

Tags:

Backport:

firefly, hammer

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

We have a process which concurrently ends up starting 3 jobs concurrently which among other things do a mix of rbd list / rbd map / rbd unmap / rbd snap create (+ operations on said rbd images), which inconsistently will end up with some invocations of rbd (i.e. list) not completing, and there's a rather rapid stream of connections from the rbd client to the osd holding the rbd directory, with the osd rather quickly closing it again. (On occasion, this also ends up having other side-effects such as the conntrack table filling up on the osd host, or the osd itself failing to create a new thread, fails the assert and aborts, likely due to the churn in connections, and potentially also starving other requests for i.e. the rbd directory to suffer as well).

Usually; the requests do seem to involve call rbd.dir_list.

I have unfortunately yet to make a reduced test-case which reproduces this, which doesn't require a whole lot of additional scaffolding.

This is on ceph hammer, osd v0.94.3 client v0.94.3 (and v0.94.2)

Hosts are running ubnutu 14.04.3 with kernel 3.19.0-26-generic.

I've attaced one second worth of rather verbose logs from the osd and from the client side (although; it might only be one of the multiple clients on that host).

The slightly larger variants of the logs can be found here: http://www.pvv.ntnu.no/~kjetijor/ceph_logs/ (Where *.smallish.bz2 is a 10s interval from both sides).

I'm attaching log-files, one from the client and one from the osd (with among others; debug ms = 20).

Example of the logs in the default log-level:

2015-09-10 17:51:57.954112 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x29bcbb80).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954497 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d57a20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954569 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x27fc5340).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954937 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d2e5000 sd=651 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51860).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955012 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d636000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a512e0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955352 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1e74f000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x1ef4f8c0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955389 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x2975f9c0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955778 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x1eefc520).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955834 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x26cde000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x1ef4edc0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956224 7fa7af445700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1eae5000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x26014c00).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956260 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1ddcf000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x26013340).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956625 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x26cde000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x26013fa0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956677 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x26012420).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956982 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1e74f000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x26015700).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957124 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d58c00).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957501 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d2e5000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d59b20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957549 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x26cde000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51440).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957926 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x2975e100).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957966 7fa7af445700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1ddcf000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51b20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958335 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=651 :6830 s=0 pgs=0 cs=0 l=1 c=0x295dc680).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958435 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x295dd2e0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958514 7fa7fedf6700  0 -- 10.225.10.102:6830/14634 submit_message osd_op_reply(2 rbd_directory [call rbd.dir_list] v0'0 uv714361 ondisk = 0) v6 remote, 10.225.17.6:0/1000012, failed lossy con, dropping message 0xfbe3600

Files

Download all files

rbd_client_verbose.evensmaller.log.bz2 (210 KB) rbd_client_verbose.evensmaller.log.bz2		Kjetil Joergensen, 09/11/2015 02:03 AM
ceph-osd.119.evensmaller.log.bz2 (622 KB) ceph-osd.119.evensmaller.log.bz2		Kjetil Joergensen, 09/11/2015 02:03 AM
strace_attached.out.bz2 (558 KB) strace_attached.out.bz2		Kjetil Joergensen, 09/11/2015 05:58 PM

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Josh Durgin over 8 years ago

This looks like the client is running into the fd limit, preventing it from opening a socket to receive a reply, and retrying, manifesting as a hang. Does it still occur when you raise it with sysctl or ulimit -n?

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

Josh Durgin wrote:

This looks like the client is running into the fd limit, preventing it from opening a socket to receive a reply, and retrying, manifesting as a hang. Does it still occur when you raise it with sysctl or ulimit -n?

This should be running with RLIMIT_NOFILE hard/soft 102400. I'd have been more inclined to agree at 1024, but a 100k fd's for rbd list seems like it should be reasonable headroom.

Under "normal" circumstances judging by strace -f rbd list; it never went beyond fd number 4 for any of the threads.

I'll try to run under RLIMIT_NOFILE with hard/soft RLIM_INFINITY.

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

File strace_attached.out.bz2 strace_attached.out.bz2 added

I don't think this is regarding fd's, while I didn't run it under RLIMIT_NOFILE/RLIM_INFINITY, judging by attaching strace to one of the ones that's in this state, it never went beyond fd number 5 (judging by calls to socket).

Attaching a small-ish sample of an strace of rbd list, which sadly doesn't include how we got there as it were attached while it were running.

In terms of what information you'd like gathered; where would my efforts be best spent ? (Rather than collecting random bits of information I think might be useful).

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

One other potential contributing factor, multiple invocations of rbd list on the same host running within different mount/ipc namespaces.

I'll see if I can reproduce with namespaces; and probably also see if we can lift the rbd list/rm interactions out of the namespaces (if that works; it'd solve my immediate problems).

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

I'm guessing namespaces aren't entirely supported yet.

Essentially; my reduced breaking testcase.

Makefile with the following; then make -j 10. On occasion, it fails miserably, sometimes it works as expected.

all: t1 t2 t3 t4 t5 t6 t1: sudo unshare --ipc --pid --fork -- rbd list >/dev/null t2: sudo unshare --ipc --pid --fork -- rbd list >/dev/null t3: sudo unshare --ipc --pid --fork -- rbd list >/dev/null t4: sudo unshare --ipc --pid --fork -- rbd list >/dev/null t5: sudo unshare --ipc --pid --fork -- rbd list >/dev/null t6: sudo unshare --ipc --pid --fork -- rbd list >/dev/null <pre>

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

You can strike "--ipc" as well, concurrent invocations on the same machine of "unshare --pid --fork -- rbd list" triggers this. I guess that some combination of operation/pid and/or ip-address is used to uniquely identify the client.

Anyway; I'll try to lift the rbd cli invocations out of the namespaces as they don't need to be there.

Actions

Copy link

Updated by Josh Durgin over 8 years ago

Yes, that would do it: https://github.com/ceph/ceph/blob/da96a89033590277460aef1c80f385bd93d625e1/src/librados/RadosClient.cc#L209 The nonce used to identify the client is based on the pid and per-process state. Is there a good way to incorporate namespace information in that (I'm not sure there is any), or would adding a random number to that be better?

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

It's hacky as hell (and quite linux specific); but I were about to readlink("/proc/self/ns/pid", .., ..) and extract the number from "pid:[this number]" and add that to the nonce. Under the assumption that it's unique, and that I'm statistically very unlikely to end up with another collision.

Actions

Copy link

Updated by Kjetil Joergensen over 8 years ago

Sadly; adding a random number would probably be better than using /proc/self/ns/pid, it seems to increment when I make a new pid namespace. So I suspect we'll have a rather small delta in /proc/self/ns/pid, combined with the pid numbers inside the namespaces is likely going to be skewed towards the lower end, again heading in the direction of a higher probability of collision.

Actions

Copy link

#10

Updated by Josh Durgin over 8 years ago

Subject changed from rbd list / map ends up flooding osd with requests to client nonce collision due to unshared pid namespaces

Yeah, looks like we should just randomize it, as libcephfs does already: https://github.com/ceph/ceph/blob/f6bf6c2a969b3d79179d1f14375ed9dfa3fd49ea/src/libcephfs.cc#L293

I'm thinking we may want to just put the nonce randomization in Messenger::create() to avoid this sort of confusion, as ceph-fuse also initializes it based on pid, as do several other places. For daemons it doesn't matter much since they have different ids, but for client side things I don't see a reason not to randomize the nonce.

Anyone have a reason not to always randomize it?

Actions

Copy link

#11

Updated by Greg Farnum over 8 years ago

I'd rather do incremental changes than muck around with the messenger interface internals to fix a bug we only see in one case. The monitors always have nonce 0 and I can't think of anything off the top of my head that would break if they didn't, but I don't want to find out.

The idea of changing it for clients ought to be okay, but we'll need to check it doesn't break the client metadata stuff the MDS maintains (I don't think it should, just spin up a vstart cluster and look).

Actions

Copy link

#12