Bug #13032

closed

client nonce collision due to unshared pid namespaces

Added by Kjetil Joergensen over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: librados
Target version: -
% Done: 0%
Source: other
Tags:
Backport: firefly, hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a process which starts 3 jobs concurrently. Among other things, these jobs do a mix of rbd list / rbd map / rbd unmap / rbd snap create (+ operations on said rbd images). Inconsistently, some invocations of rbd (e.g. list) do not complete, and there is a rather rapid stream of connections from the rbd client to the osd holding the rbd directory, with the osd rather quickly closing each one again. On occasion this also has other side effects: the conntrack table fills up on the osd host, or the osd itself fails to create a new thread, fails the assert and aborts, likely due to the churn in connections. Other requests, e.g. for the rbd directory, potentially suffer as well.

Usually, the requests seem to involve call rbd.dir_list.

Unfortunately, I have yet to produce a reduced test case that reproduces this without a whole lot of additional scaffolding.

This is on ceph hammer, osd v0.94.3, client v0.94.3 (and v0.94.2).

Hosts are running Ubuntu 14.04.3 with kernel 3.19.0-26-generic.

I've attached one second's worth of rather verbose logs from the osd and from the client side (although it might only be one of the multiple clients on that host).

Slightly larger variants of the logs can be found here: http://www.pvv.ntnu.no/~kjetijor/ceph_logs/ (where *.smallish.bz2 is a 10s interval from both sides).

I'm attaching two log files, one from the client and one from the osd (with, among others, debug ms = 20).

Example of the logs at the default log level:

2015-09-10 17:51:57.954112 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x29bcbb80).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954497 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d57a20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954569 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x27fc5340).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.954937 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d2e5000 sd=651 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51860).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955012 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d636000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a512e0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955352 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1e74f000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x1ef4f8c0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955389 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x2975f9c0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955778 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x1eefc520).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.955834 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x26cde000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x1ef4edc0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956224 7fa7af445700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1eae5000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x26014c00).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956260 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1ddcf000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x26013340).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956625 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x26cde000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x26013fa0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956677 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x26012420).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.956982 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1e74f000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x26015700).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957124 7fa7ae839700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d58c00).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957501 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d2e5000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d59b20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957549 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x26cde000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51440).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957926 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x2975e100).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.957966 7fa7af445700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1ddcf000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x20a51b20).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958335 7fa7e3dd3700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=651 :6830 s=0 pgs=0 cs=0 l=1 c=0x295dc680).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958435 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=638 :6830 s=0 pgs=0 cs=0 l=1 c=0x295dd2e0).accept replacing existing (lossy) channel (new one lossy=1)
2015-09-10 17:51:57.958514 7fa7fedf6700  0 -- 10.225.10.102:6830/14634 submit_message osd_op_reply(2 rbd_directory [call rbd.dir_list] v0'0 uv714361 ondisk = 0) v6 remote, 10.225.17.6:0/1000012, failed lossy con, dropping message 0xfbe3600
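
To see which client addresses are involved in the reconnect churn, the "accept replacing existing (lossy) channel" lines can be tallied per peer address. The following is only a minimal sketch and not part of the original report: the function name `count_replacements`, the regular expression and the `SAMPLE` list are illustrative assumptions, based solely on the line format shown in the excerpt above (the sample lines are copied from it).

```python
import re
from collections import Counter

# Assumption: the peer address is the token following ">>" on lines that
# report a replaced lossy channel, as in the excerpt above.
ACCEPT_RE = re.compile(
    r">> (?P<peer>\S+) pipe\(.*\)\.accept replacing existing \(lossy\) channel"
)


def count_replacements(lines):
    """Count 'accept replacing existing (lossy) channel' events per peer address."""
    counts = Counter()
    for line in lines:
        match = ACCEPT_RE.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


# A few lines copied verbatim from the excerpt, used here as sample input.
SAMPLE = [
    "2015-09-10 17:51:57.954112 7fa7ee674700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1efb5000 sd=649 :6830 s=0 pgs=0 cs=0 l=1 c=0x29bcbb80).accept replacing existing (lossy) channel (new one lossy=1)",
    "2015-09-10 17:51:57.954497 7fa7afd4e700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000012 pipe(0x1e74f000 sd=237 :6830 s=0 pgs=0 cs=0 l=1 c=0x21d57a20).accept replacing existing (lossy) channel (new one lossy=1)",
    "2015-09-10 17:51:57.954569 7fa7b3373700  0 -- 10.225.10.102:6830/14634 >> 10.225.17.6:0/1000006 pipe(0x1d767000 sd=647 :6830 s=0 pgs=0 cs=0 l=1 c=0x27fc5340).accept replacing existing (lossy) channel (new one lossy=1)",
    "2015-09-10 17:51:57.958514 7fa7fedf6700  0 -- 10.225.10.102:6830/14634 submit_message osd_op_reply(2 rbd_directory [call rbd.dir_list] v0'0 uv714361 ondisk = 0) v6 remote, 10.225.17.6:0/1000012, failed lossy con, dropping message 0xfbe3600",
]

if __name__ == "__main__":
    # Print each peer address with the number of replaced channels seen for it.
    for peer, n in count_replacements(SAMPLE).most_common():
        print(peer, n)
```

In the excerpt above, every replaced channel belongs to one of just two peer addresses, 10.225.17.6:0/1000006 and 10.225.17.6:0/1000012. Running the same tally over the full attached osd log would show whether that pattern holds across the whole capture.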


Files

rbd_client_verbose.evensmaller.log.bz2 (210 KB), Kjetil Joergensen, 09/11/2015 02:03 AM
ceph-osd.119.evensmaller.log.bz2 (622 KB), Kjetil Joergensen, 09/11/2015 02:03 AM
strace_attached.out.bz2 (558 KB), Kjetil Joergensen, 09/11/2015 05:58 PM

Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #13244: client nonce collision due to unshared pid namespaces, Resolved, Josh Durgin, 09/11/2015
Copied to Ceph - Backport #13245: client nonce collision due to unshared pid namespaces, Resolved, Loïc Dachary