Bug #13032

client nonce collision due to unshared pid namespaces

Added by Kjetil Joergensen about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: librados
Target version: -
Start date: 09/11/2015
Due date:
% Done: 0%
Source: other
Tags:
Backport: firefly, hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

We have a process which ends up starting 3 jobs concurrently. Among other things, these do a mix of rbd list / rbd map / rbd unmap / rbd snap create (plus operations on said rbd images). Inconsistently, some invocations of rbd (e.g. list) never complete, and there's a rather rapid stream of connections from the rbd client to the osd holding the rbd directory, with the osd rather quickly closing each one again. (On occasion, this also has other side-effects: the conntrack table fills up on the osd host, or the osd itself fails to create a new thread, fails the assert and aborts, likely due to the churn in connections. It potentially also starves other requests for e.g. the rbd directory.)

Usually, the requests seem to involve call rbd.dir_list.

Unfortunately, I have yet to make a reduced test-case that reproduces this without a whole lot of additional scaffolding.

This is on ceph hammer, osd v0.94.3 client v0.94.3 (and v0.94.2)

Hosts are running Ubuntu 14.04.3 with kernel 3.19.0-26-generic.

I've attached one second's worth of rather verbose logs from the osd and from the client side (although it might only be one of the multiple clients on that host).

The slightly larger variants of the logs can be found here: http://www.pvv.ntnu.no/~kjetijor/ceph_logs/ (where *.smallish.bz2 is a 10s interval from both sides).

I'm attaching log-files, one from the client and one from the osd (with, among others, debug ms = 20).

Example of the logs in the default log-level:

2015-09-10 17:51:57.954112 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x29bcbb80).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954497 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d57a20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954569 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x27fc5340).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954937 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d2e5000 sd=651 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51860).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955012 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d636000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a512e0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955352 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1e74f000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x1ef4f8c0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955389 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x2975f9c0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955778 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x1eefc520).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955834 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x26cde000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x1ef4edc0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956224 7fa7af445700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1eae5000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x26014c00).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956260 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1ddcf000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x26013340).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956625 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x26cde000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x26013fa0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956677 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x26012420).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956982 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1e74f000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x26015700).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957124 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d58c00).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957501 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d2e5000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d59b20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957549 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x26cde000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51440).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957926 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x2975e100).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957966 7fa7af445700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1ddcf000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51b20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958335 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=651 :6830 s=0 pgs=0 cs=0 l=1 c=0x295dc680).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958435 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x295dd2e0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958514 7fa7fedf6700  0 -- 10.225.10.102:6830/14634 submit_message osd_op_reply(2 rbd_directory [call rbd.dir_list] v0'0 uv714361 ondisk = 0) v6 remote, 10.225.17.6:0/1000012, failed lossy con, dropping message 0xfbe3600
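
To see which client addresses keep replacing each other's channel, the "accept replacing existing" lines can be tallied per peer. This is only a rough sketch: the regular expression is written against the line layout shown in the excerpt above, and the function name count_replacements is made up for illustration.

```python
import re
from collections import Counter

# Matches the peer address (ip:port/nonce) in lines of the form
#   "... >> 10.225.17.6:0/1000006 pipe(...).accept replacing existing (lossy) channel ..."
# The pattern is an assumption based on the excerpt above, not a general Ceph log grammar.
PEER_RE = re.compile(r">> (\S+) pipe\(.*\)\.accept replacing existing")


def count_replacements(lines):
    """Return a Counter mapping peer address -> number of channel replacements."""
    counts = Counter()
    for line in lines:
        match = PEER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A handful of peers with very high counts within a single second suggests that several client processes are presenting the same ip:port/nonce identity to the osd.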

rbd_client_verbose.evensmaller.log.bz2 (210 KB) Kjetil Joergensen, 09/11/2015 02:03 AM

ceph-osd.119.evensmaller.log.bz2 (622 KB) Kjetil Joergensen, 09/11/2015 02:03 AM

strace_attached.out.bz2 (558 KB) Kjetil Joergensen, 09/11/2015 05:58 PM


Related issues

Copied to Ceph - Backport #13244: client nonce collision due to unshared pid namespaces Resolved 09/11/2015
Copied to Ceph - Backport #13245: client nonce collision due to unshared pid namespaces Resolved

Associated revisions

Revision a3a8c85b (diff)
Added by Josh Durgin about 3 years ago

use simplifed messenger constructor for clients

This is all mechanical except the calculation of the nonce, which is
now always randomized for clients.

Fixes: #13032
Signed-off-by: Josh Durgin <>

Revision c85d0638 (diff)
Added by Josh Durgin about 3 years ago

use simplifed messenger constructor for clients

This is all mechanical except the calculation of the nonce, which is
now always randomized for clients.

Fixes: #13032
Signed-off-by: Josh Durgin <>
(cherry picked from commit a3a8c85b79afef67681c32c57b591c0e0a87a349)

Conflicts:
src/ceph_fuse.cc
src/ceph_syn.cc
src/libcephfs.cc
src/librados/RadosClient.cc
src/mds/MDSUtility.cc
src/mon/MonClient.cc
src/test/mon/test_mon_workloadgen.cc
- different arguments to Messenger::create() in firefly

Revision 8610de81 (diff)
Added by Josh Durgin about 3 years ago

use simplifed messenger constructor for clients

This is all mechanical except the calculation of the nonce, which is
now always randomized for clients.

Fixes: #13032
Signed-off-by: Josh Durgin <>
(cherry picked from commit a3a8c85b79afef67681c32c57b591c0e0a87a349)

History

#1 Updated by Josh Durgin about 3 years ago

This looks like the client is running into the fd limit, preventing it from opening a socket to receive a reply, and retrying, manifesting as a hang. Does it still occur when you raise it with sysctl or ulimit -n?

#2 Updated by Kjetil Joergensen about 3 years ago

Josh Durgin wrote:

This looks like the client is running into the fd limit, preventing it from opening a socket to receive a reply, and retrying, manifesting as a hang. Does it still occur when you raise it with sysctl or ulimit -n?

This should be running with RLIMIT_NOFILE hard/soft 102400. I'd have been more inclined to agree at 1024, but 100k fds for rbd list seems like it should be reasonable headroom.

Under "normal" circumstances judging by strace -f rbd list; it never went beyond fd number 4 for any of the threads.

I'll try to run under RLIMIT_NOFILE with hard/soft RLIM_INFINITY.
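
For anyone wanting to double-check the fd theory from inside a process, something along these lines reports the limits and the current number of open descriptors. It is a sketch only: fd_usage is a hypothetical helper name, and the /proc/self/fd lookup is Linux-specific.

```python
import os
import resource


def fd_usage():
    """Return (soft_limit, hard_limit, open_fds) for the current process.

    open_fds is None when /proc/self/fd is not available (non-Linux hosts).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Each entry in /proc/self/fd is one open descriptor. The count
        # includes the descriptor opened by listdir itself, so it may be
        # off by one.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return soft, hard, open_fds


if __name__ == "__main__":
    print(fd_usage())
```

If open_fds stays in the single digits while the soft limit is in the tens of thousands, fd exhaustion is an unlikely explanation.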

#3 Updated by Kjetil Joergensen about 3 years ago

I don't think this is about fds. While I didn't run it under RLIMIT_NOFILE/RLIM_INFINITY, judging by attaching strace to one of the processes that's in this state, it never went beyond fd number 5 (judging by calls to socket).

Attaching a small-ish sample of an strace of rbd list, which sadly doesn't include how we got there, as it was attached while the process was already running.

In terms of what information you'd like gathered: where would my efforts be best spent? (Rather than collecting random bits of information I think might be useful.)

#4 Updated by Kjetil Joergensen about 3 years ago

One other potential contributing factor: multiple invocations of rbd list on the same host are running within different mount/ipc namespaces.

I'll see if I can reproduce with namespaces; and probably also see if we can lift the rbd list/rm interactions out of the namespaces (if that works; it'd solve my immediate problems).

#5 Updated by Kjetil Joergensen about 3 years ago

I'm guessing namespaces aren't entirely supported yet.

Essentially, here is my reduced breaking test-case.

Put the following in a Makefile, then run make -j 10. On occasion it fails miserably; sometimes it works as expected.

all: t1 t2 t3 t4 t5 t6

t1:
        sudo unshare --ipc --pid --fork -- rbd list >/dev/null

t2:
        sudo unshare --ipc --pid --fork -- rbd list >/dev/null

t3:
        sudo unshare --ipc --pid --fork -- rbd list >/dev/null

t4:
        sudo unshare --ipc --pid --fork -- rbd list >/dev/null

t5:
        sudo unshare --ipc --pid --fork -- rbd list >/dev/null

t6:
        sudo unshare --ipc --pid --fork -- rbd list >/dev/null

#6 Updated by Kjetil Joergensen about 3 years ago

You can strike "--ipc" as well: concurrent invocations on the same machine of "unshare --pid --fork -- rbd list" trigger this. I guess that some combination of operation/pid and/or ip-address is used to uniquely identify the client.

Anyway; I'll try to lift the rbd cli invocations out of the namespaces as they don't need to be there.

#7 Updated by Josh Durgin about 3 years ago

Yes, that would do it: https://github.com/ceph/ceph/blob/da96a89033590277460aef1c80f385bd93d625e1/src/librados/RadosClient.cc#L209 The nonce used to identify the client is based on the pid and per-process state. Is there a good way to incorporate namespace information in that (I'm not sure there is any), or would adding a random number to that be better?
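
To make the collision concrete, here is a small illustration. The nonce formula is an assumption made up for this sketch (pid plus a per-process instance counter scaled by 1000000, which happens to be consistent with the 1000006 and 1000012 values in the log above); it is not quoted from RadosClient.cc. Both function names are hypothetical.

```python
def pid_based_nonce(pid, instance=1):
    # ASSUMPTION: illustrative formula only, not the actual Ceph code.
    return pid + 1_000_000 * instance


def client_addr(ip, nonce):
    # Mirrors the "ip:port/nonce" layout seen in the osd log, with port 0.
    return f"{ip}:0/{nonce}"


# Pid numbering restarts in every new pid namespace, so two rbd processes
# started via "unshare --pid --fork" on the same host can easily see the
# same small pid from the inside.
first = client_addr("10.225.17.6", pid_based_nonce(pid=6))
second = client_addr("10.225.17.6", pid_based_nonce(pid=6))
```

With identical addresses, the osd cannot tell the two clients apart, which matches the "accept replacing existing (lossy) channel" messages: each new connection displaces the other client's connection.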

#8 Updated by Kjetil Joergensen about 3 years ago

It's hacky as hell (and quite linux specific), but I was about to readlink("/proc/self/ns/pid", .., ..), extract the number from "pid:[this number]" and add that to the nonce, under the assumption that it's unique and that I'm statistically very unlikely to end up with another collision.
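
A minimal sketch of that idea, in Python rather than C++ for brevity. The helper names parse_ns_link and pid_namespace_id are hypothetical; the only thing taken from the comment above is the "pid:[number]" format of the link target.

```python
import os
import re


def parse_ns_link(target):
    """Extract the integer from a namespace link target like 'pid:[4026531836]'."""
    match = re.fullmatch(r"pid:\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link target: {target!r}")
    return int(match.group(1))


def pid_namespace_id(path="/proc/self/ns/pid"):
    """Return the pid namespace identifier of the calling process (Linux only)."""
    return parse_ns_link(os.readlink(path))
```

Two processes in the same pid namespace get the same value from pid_namespace_id(), while processes in different pid namespaces get different values.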

#9 Updated by Kjetil Joergensen about 3 years ago

Sadly, adding a random number would probably be better than using /proc/self/ns/pid, as the number seems to simply increment when I make a new pid namespace. So I suspect we'll have a rather small delta in /proc/self/ns/pid, and the pid numbers inside the namespaces are likely to be skewed towards the lower end, which again points towards a higher probability of collision.

#10 Updated by Josh Durgin about 3 years ago

  • Subject changed from "rbd list / map ends up flooding osd with requests" to "client nonce collision due to unshared pid namespaces"

Yeah, looks like we should just randomize it, as libcephfs does already: https://github.com/ceph/ceph/blob/f6bf6c2a969b3d79179d1f14375ed9dfa3fd49ea/src/libcephfs.cc#L293

I'm thinking we may want to just put the nonce randomization in Messenger::create() to avoid this sort of confusion, as ceph-fuse also initializes it based on pid, as do several other places. For daemons it doesn't matter much since they have different ids, but for client side things I don't see a reason not to randomize the nonce.

Anyone have a reason not to always randomize it?
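
For a rough sense of why randomizing is enough, the sketch below draws a random nonce and estimates the chance that any two of n clients pick the same one, using the usual birthday approximation. The 64-bit width is an assumption made for illustration, and both function names are hypothetical.

```python
import math
import secrets

NONCE_BITS = 64  # ASSUMPTION: nonce width chosen for illustration


def random_client_nonce(bits=NONCE_BITS):
    """Return a uniformly random nonce in [0, 2**bits)."""
    return secrets.randbits(bits)


def collision_probability(n_clients, bits=NONCE_BITS):
    """Birthday approximation: P(collision) ~= 1 - exp(-n(n-1) / 2**(bits+1))."""
    exponent = n_clients * (n_clients - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-exponent)
```

Even with a thousand concurrent clients on one host, collision_probability(1000) is far below one in a trillion, whereas pid-derived values inside fresh pid namespaces cluster around a handful of small numbers.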

#11 Updated by Greg Farnum about 3 years ago

I'd rather do incremental changes than muck around with the messenger interface internals to fix a bug we only see in one case. The monitors always have nonce 0 and I can't think of anything off the top of my head that would break if they didn't, but I don't want to find out.

The idea of changing it for clients ought to be okay, but we'll need to check it doesn't break the client metadata stuff the MDS maintains (I don't think it should, just spin up a vstart cluster and look).

#12 Updated by Josh Durgin about 3 years ago

  • Status changed from New to Need Review
  • Assignee set to Josh Durgin

#13 Updated by Josh Durgin about 3 years ago

  • Category set to librados
  • Priority changed from Normal to High
  • Backport set to firefly, hammer

#14 Updated by Sage Weil about 3 years ago

  • Status changed from Need Review to Pending Backport

#15 Updated by Loic Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved
