Bug #41036: concurrent "rbd unmap" failures due to udev
Status: Closed
Description
Unmapping 200 images concurrently leaves behind 10-20 mappings:
# map-200.sh
OK
# rbd showmapped | wc -l
201
# for ((i = 0; i < 200; i++)); do rbd unmap /dev/rbd$i & done
rbd: '/dev/rbd4' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd70' is not an rbd device
rbd: '/dev/rbd106' is not an rbd device
rbd: unmap failed: rbd: unmap failed: (22) Invalid argument (22) Invalid argument
rbd: '/dev/rbd136' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: '/dev/rbd167' is not an rbd device
rbd: '/dev/rbd138' is not an rbd device
rbd: unmap failed: rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (22) Invalid argument(19) No such device
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd160' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: '/dev/rbd163' is not an rbd device
rbd: unmap failed: (19) No such device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd173' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
rbd: unmap failed: (19) No such device
rbd: '/dev/rbd181' is not an rbd device
rbd: unmap failed: (22) Invalid argument
rbd: unmap failed: (19) No such device
# rbd showmapped
id   pool  namespace  image   snap  device
106  rbd              img106  -     /dev/rbd106
136  rbd              img137  -     /dev/rbd136
138  rbd              img140  -     /dev/rbd138
140  rbd              img141  -     /dev/rbd140
160  rbd              img158  -     /dev/rbd160
162  rbd              img165  -     /dev/rbd162
163  rbd              img162  -     /dev/rbd163
167  rbd              img168  -     /dev/rbd167
173  rbd              img173  -     /dev/rbd173
177  rbd              img176  -     /dev/rbd177
181  rbd              img183  -     /dev/rbd181
184  rbd              img187  -     /dev/rbd184
187  rbd              img184  -     /dev/rbd187
188  rbd              img186  -     /dev/rbd188
189  rbd              img189  -     /dev/rbd189
4    rbd              img5    -     /dev/rbd4
70   rbd              img70   -     /dev/rbd70
83   rbd              img82   -     /dev/rbd83
93   rbd              img96   -     /dev/rbd93
This is because udev_enumerate_scan_devices(), called from devno_to_krbd_id(), sporadically fails with either ENODEV or ENOENT. Under normal circumstances, devno_to_krbd_id() returns ENOENT explicitly, which the caller treats as "not an rbd device" and translates to EINVAL.
Looking at strace output, the filtering code inside libudev does find the right device and continues on. udev_enumerate_scan_devices() fails later.
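Since the enumeration failure is sporadic rather than persistent, a client-side stopgap is simply to retry the flaky operation a few times. A minimal sketch in POSIX shell, assuming a generic retry wrapper (the retry/flaky names, attempt count, and delay are illustrative, not from this tracker; in practice the retried command would be "rbd unmap /dev/rbdN"):

```shell
#!/bin/sh
# Hypothetical workaround sketch: retry a command up to N times with a
# short delay between attempts, returning success as soon as it passes.
retry() {
    attempts=$1; shift
    i=1
    while true; do
        "$@" && return 0                     # command succeeded
        [ "$i" -ge "$attempts" ] && return 1 # out of attempts
        i=$((i + 1))
        sleep 1
    done
}

# Demo with a stand-in command that fails twice, then succeeds,
# mimicking a transiently failing "rbd unmap".
state=$(mktemp)
echo 0 > "$state"
flaky() {
    n=$(($(cat "$state") + 1))
    echo "$n" > "$state"
    [ "$n" -ge 3 ]
}

retry 5 flaky && echo "succeeded after $(cat "$state") attempts"
rm -f "$state"
```

This only papers over the libudev race; the actual fix went in on the Ceph side (see the pull request below).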
Updated by Ilya Dryomov over 4 years ago
# cat map-200.sh
for ((i = 1; i <= 200; i++)); do
    rbd map img$i &
done
wait
echo OK
turns out to be a very good reproducer for #39089 if run from a VM with 1 CPU.
Updated by Jason Dillaman over 4 years ago
Ilya Dryomov wrote:
[...]
turns out to be a very good reproducer for #39089 if run from a VM with 1 CPU.
Does this mean you are able to reproduce the 'rbd map' hangs now?
Updated by Ilya Dryomov over 4 years ago
Yes, very easily on a build without https://github.com/ceph/ceph/pull/27339. Somewhat strangely, though, in the one instance I examined, the events arrived in the "right" order; it could be that udev can deliver different permutations to different listeners...
I have yet to do an extended run on a build with https://github.com/ceph/ceph/pull/27339 to verify this fix. I will also work on adding this test to krbd suite.
Updated by Ilya Dryomov over 4 years ago
May have hit this with fsx:
2019-09-16T22:24:36.212 INFO:teuthology.orchestra.run.smithi079.stdout:checking clone #43, image image_client.2-clone43 against file /home/ubuntu/cephtest/archive/fsx-image_client.2-parent44
2019-09-16T22:24:36.217 INFO:teuthology.orchestra.run.smithi079.stdout:krbd_unmap(/dev/rbd3) failed
2019-09-16T22:24:36.218 INFO:teuthology.orchestra.run.smithi079.stdout:check_clone: ops->close: No such device
2019-09-16T22:24:36.234 DEBUG:teuthology.orchestra.run:got remote process result: 174
Updated by Ilya Dryomov over 4 years ago
- Assignee set to Ilya Dryomov
Filed a bug against systemd (libudev): https://github.com/systemd/systemd/issues/13814.
Updated by Ilya Dryomov over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 31023
Updated by Ilya Dryomov over 4 years ago
- Related to Bug #41404: [rbd] rbd map hangs up infinitely after osd down added
Updated by Ilya Dryomov over 4 years ago
- Backport set to luminous,mimic,nautilus
Updated by Nathan Cutler over 4 years ago
The backports could be done together with those for #41404.
Updated by Ilya Dryomov over 4 years ago
- Related to Fix #42523: backport "common/thread: Fix race condition in make_named_thread" to mimic and nautilus added
Updated by Nathan Cutler over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42524: nautilus: concurrent "rbd unmap" failures due to udev added
Updated by Ilya Dryomov over 4 years ago
For mimic and nautilus, this backport should be staged together with #42523 (ideally after it) to avoid sporadic test failures.
For luminous, this will be a manual backport because make_named_thread() was added in mimic.
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42526: mimic: concurrent "rbd unmap" failures due to udev added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42527: luminous: concurrent "rbd unmap" failures due to udev added
Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".