Bug #41036
concurrent "rbd unmap" failures due to udev

Added by Ilya Dryomov over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Backport: luminous,mimic,nautilus
Regression: No
Severity: 3 - minor

Description

Unmapping 200 images concurrently leaves behind 10-20 mappings:

# map-200.sh
OK
# rbd showmapped | wc -l
201
# for ((i = 0; i < 200; i++)); do rbd unmap /dev/rbd$i & done
rbd: '/dev/rbd4' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd70' is not an rbd device
rbd: '/dev/rbd106' is not an rbd device
rbd: unmap failed: rbd: unmap failed: (22) Invalid argument
(22) Invalid argument
rbd: '/dev/rbd136' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: '/dev/rbd167' is not an rbd device
rbd: '/dev/rbd138' is not an rbd device
rbd: unmap failed: rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (22) Invalid argument(19) No such device
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd160' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: '/dev/rbd163' is not an rbd device
rbd: unmap failed: (19) No such device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd173' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd181' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device

# rbd showmapped 
id  pool namespace image  snap device      
106 rbd            img106 -    /dev/rbd106 
136 rbd            img137 -    /dev/rbd136 
138 rbd            img140 -    /dev/rbd138 
140 rbd            img141 -    /dev/rbd140 
160 rbd            img158 -    /dev/rbd160 
162 rbd            img165 -    /dev/rbd162 
163 rbd            img162 -    /dev/rbd163 
167 rbd            img168 -    /dev/rbd167 
173 rbd            img173 -    /dev/rbd173 
177 rbd            img176 -    /dev/rbd177 
181 rbd            img183 -    /dev/rbd181 
184 rbd            img187 -    /dev/rbd184 
187 rbd            img184 -    /dev/rbd187 
188 rbd            img186 -    /dev/rbd188 
189 rbd            img189 -    /dev/rbd189 
4   rbd            img5   -    /dev/rbd4   
70  rbd            img70  -    /dev/rbd70  
83  rbd            img82  -    /dev/rbd83  
93  rbd            img96  -    /dev/rbd93

This is because udev_enumerate_scan_devices() called from devno_to_krbd_id() sporadically fails with either ENODEV or ENOENT. Under normal circumstances devno_to_krbd_id() returns ENOENT explicitly, which the caller treats as "not an rbd device" and translates to EINVAL.

Looking at strace output, the filtering code inside libudev does find the right device and continues on. udev_enumerate_scan_devices() fails later.
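Since the enumeration failure is transient, one conceivable userspace workaround (not the actual fix for this ticket — the `retry` helper below is purely illustrative) is to retry the failing operation a few times before giving up:

```shell
# Hypothetical sketch: retry a command a fixed number of times, treating
# any failure as potentially transient -- e.g. an "rbd unmap" that fails
# only because libudev enumeration sporadically errored out.
retry() {
    tries=$1
    shift
    i=1
    until "$@"; do
        [ "$i" -ge "$tries" ] && return 1
        i=$((i + 1))
        sleep 1   # brief back-off between attempts
    done
    return 0
}

# usage (hypothetical): retry 5 rbd unmap /dev/rbd4
```

This only papers over the symptom from the caller's side; the sporadic ENODEV/ENOENT from udev_enumerate_scan_devices() itself still needs to be addressed in libudev or worked around in the rbd tool.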


Related issues 5 (0 open, 5 closed)

Related to rbd - Bug #41404: [rbd] rbd map hangs up infinitely after osd down (Resolved, Ilya Dryomov, 08/23/2019)
Related to Ceph - Fix #42523: backport "common/thread: Fix race condition in make_named_thread" to mimic and nautilus (Resolved, Ilya Dryomov, 10/29/2019)
Copied to rbd - Backport #42524: nautilus: concurrent "rbd unmap" failures due to udev (Resolved, Nathan Cutler)
Copied to rbd - Backport #42526: mimic: concurrent "rbd unmap" failures due to udev (Resolved, Ilya Dryomov)
Copied to rbd - Backport #42527: luminous: concurrent "rbd unmap" failures due to udev (Resolved, Ilya Dryomov)
#1

Updated by Ilya Dryomov over 4 years ago

# cat map-200.sh
for ((i = 1; i <= 200; i++)); do
    rbd map img$i &
done
wait
echo OK

turns out to be a very good reproducer for #39089 if run from a VM with 1 CPU.

#2

Updated by Jason Dillaman over 4 years ago

Ilya Dryomov wrote:

[...]

turns out to be a very good reproducer for #39089 if run from a VM with 1 CPU.

Does this mean you are able to reproduce the 'rbd map' hangs now?

#3

Updated by Ilya Dryomov over 4 years ago

Yes, very easily on a build without https://github.com/ceph/ceph/pull/27339. Somewhat strangely, though, in one instance that I examined the events came in the "right" order, but it could be that udev can deliver different permutations to different listeners...

I have yet to do an extended run on a build with https://github.com/ceph/ceph/pull/27339 to verify this fix. I will also work on adding this test to krbd suite.

#4

Updated by Ilya Dryomov over 4 years ago

May have hit this with fsx:

http://qa-proxy.ceph.com/teuthology/teuthology-2019-09-07_03:20:02-krbd-master-testing-basic-smithi/4285348/teuthology.log

2019-09-16T22:24:36.212 INFO:teuthology.orchestra.run.smithi079.stdout:checking clone #43, image image_client.2-clone43 against file /home/ubuntu/cephtest/archive/fsx-image_client.2-parent44
2019-09-16T22:24:36.217 INFO:teuthology.orchestra.run.smithi079.stdout:krbd_unmap(/dev/rbd3) failed
2019-09-16T22:24:36.218 INFO:teuthology.orchestra.run.smithi079.stdout:check_clone: ops->close: No such device
2019-09-16T22:24:36.234 DEBUG:teuthology.orchestra.run:got remote process result: 174

#5

Updated by Ilya Dryomov over 4 years ago

  • Assignee set to Ilya Dryomov

Filed a bug against systemd (libudev): https://github.com/systemd/systemd/issues/13814.

#6

Updated by Ilya Dryomov over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31023
#7

Updated by Ilya Dryomov over 4 years ago

  • Related to Bug #41404: [rbd] rbd map hangs up infinitely after osd down added
#8

Updated by Ilya Dryomov over 4 years ago

  • Backport set to luminous,mimic,nautilus
#9

Updated by Nathan Cutler over 4 years ago

Backports could be done together with #41404

#10

Updated by Ilya Dryomov over 4 years ago

  • Related to Fix #42523: backport "common/thread: Fix race condition in make_named_thread" to mimic and nautilus added
#11

Updated by Nathan Cutler over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
#12

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42524: nautilus: concurrent "rbd unmap" failures due to udev added
#13

Updated by Ilya Dryomov over 4 years ago

For mimic and nautilus, this backport should be staged together with #42523 (ideally after it) to avoid sporadic test failures.

For luminous, this will be a manual backport because make_named_thread() was added in mimic.

#14

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42526: mimic: concurrent "rbd unmap" failures due to udev added
#15

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42527: luminous: concurrent "rbd unmap" failures due to udev added
#16

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
