Bug #41404: [rbd] rbd map hangs up infinitely after osd down - rbd - Ceph

Actions

Copy link

Bug #41404

closed

[rbd] rbd map hangs up infinitely after osd down

Added by fan chen over 4 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Ilya Dryomov

Target version:

% Done:

Source:

Tags:

Backport:

luminous,mimic,nautilus

Regression:

Yes

Severity:

1 - critical

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

30965

Crash signature (v1):

Crash signature (v2):

Description

this issue could be 100% reproduced, after down several osd services for a long time, it would cause the "rbd map" command hanging up. after doing investigation, we found the steps of reproduction:
1. down serveral osd services
2. rbd map would hangup at:
[<ffffffffc05cc571>] __rbd_obj_request_wait.constprop.55+0x31/0xe0 [rbd]
[<ffffffffc05cd1fb>] rbd_obj_method_sync.constprop.54+0x1ab/0x250 [rbd]
[<ffffffffc05d1965>] rbd_dev_image_id+0xe5/0x280 [rbd]
[<ffffffffc05d36ba>] do_rbd_add.isra.47+0x4ba/0xe50 [rbd]
[<ffffffffc05d4094>] rbd_add+0x24/0x30 [rbd]
[<ffffffff8143d3e7>] bus_attr_store+0x27/0x30
[<ffffffff81280046>] sysfs_write_file+0xc6/0x140
[<ffffffff81200a9d>] vfs_write+0xbd/0x1e0
[<ffffffff812018af>] SyS_write+0x7f/0xe0
[<ffffffff816b50c9>] system_call_fastpath+0x16/0x1b
3. then up the osd services
4. /dev/rbdX created
5. but rbd map hanged up at poll() in wait_for_udev_add

we found the root reason:
after down osd services, it would cause the io hang up in step 2, but the udev monitor fd have already setup and enabled, so during this period of osd down, the monitor fd could also receive messages from systemd udevd service, in this situation, the monitor fd socket buffer may be full because of rbd_add hanging up. event if we recover the osd services and the sysfs_write_rbd_add returned, the udevd netlink event may also be dropped because of socket buffer is full(see netlink manpage). so here the rbd map may never be able to receive the rbd and block events, then cause hanging up.

so in order to solve this issue, can we increase the socket buffer size, by default on some distros, such as centos, the buffer size
is 208k, which is too small?

or fork an individual thread to receive udevd events beside the main thread?

Related issues 5 (0 open — 5 closed)

Actions

Copy link

Updated by fan chen over 4 years ago

ping...

Does anyone notice this case?

Actions

Copy link

Updated by Ilya Dryomov over 4 years ago

What is the output of "rbd --version"? This might be http://tracker.ceph.com/issues/39089.

Actions

Copy link

Updated by Ilya Dryomov over 4 years ago

Assignee set to Ilya Dryomov

Actions

Copy link

Updated by fan chen over 4 years ago

Ilya Dryomov wrote:

What is the output of "rbd --version"? This might be http://tracker.ceph.com/issues/39089.

hi Ilya, the rbd verison is ceph-12.2.12, I notice that the latest ceph code also don't make any changes about this segments.

I think the reason is not same as the issues/39089, and I have already explained the reason, so here I want to show some codes to help explanation:
if some osds down for a long enough time, it will cause the do_map() function hang at line 272:

 256   mon = udev_monitor_new_from_netlink(udev, "udev");
 257   if (!mon)
 258     return -ENOMEM;
 259 
 260   r = udev_monitor_filter_add_match_subsystem_devtype(mon, "rbd", NULL);
 261   if (r < 0)
 262     goto out_mon;
 263 
 264   r = udev_monitor_filter_add_match_subsystem_devtype(mon, "block", "disk");
 265   if (r < 0)
 266     goto out_mon;
 267 
 268   r = udev_monitor_enable_receiving(mon);
 269   if (r < 0)
 270     goto out_mon;
 271 
 272   r = sysfs_write_rbd_add(buf);
 273   if (r < 0) {
 274     cerr << "rbd: sysfs write failed" << std::endl;
 275     goto out_mon;
 276   }
 277 
 278   r = wait_for_udev_add(mon, pool, image, snap, pname);
 279   if (r < 0) {
 280     cerr << "rbd: wait failed" << std::endl;
 281     goto out_mon;

but at that time, seeing at line 268, the mon fd has been opened and enabled, so the socket buffer can be filled up by other udev events, then after osds up, the
wait_for_udev_add() function would never be able to wait for the events that the do_map expected.

so any ideas?

Actions

Copy link

Updated by Ilya Dryomov over 4 years ago

Did you confirm that the socket actually fills up? It seems highly unlikely to me, because it would take a ton of udev events to do that.

Actions

Copy link

Updated by Ilya Dryomov over 4 years ago

fan chen wrote:

hi Ilya, the rbd verison is ceph-12.2.12, I notice that the latest ceph code also don't make any changes about this segments.

http://tracker.ceph.com/issues/39089 is not fixed in luminous (yet). Can you try reproducing with https://github.com/ceph/ceph/commit/42dd1eae630f31a628051049e4007175561f1939 backported to 12.2.12 or alternatively with "rbd" from nautilus?

Actions

Copy link

Updated by fan chen over 4 years ago

Ilya Dryomov wrote:

Did you confirm that the socket actually fills up? It seems highly unlikely to me, because it would take a ton of udev events to do that.

I confirmed, I have done the test:
case 1:
rebuild code with:

index 4f8b671b42..730f5e7f7f 100644
--- a/src/krbd.cc
+++ b/src/krbd.cc
@@ -200,6 +200,10 @@ static int wait_for_udev_add(struct udev_monitor *mon, const char *pool,
     if (!dev)
       continue;

+    cerr << "DEBUG: get udev event: action:'" << udev_device_get_action(dev)
+         << "' subsystem:'" << udev_device_get_subsystem(dev)
+         << "'"<< std::endl;
+
     if (strcmp(udev_device_get_action(dev), "add") != 0)
       goto next;

@@ -210,6 +214,9 @@ static int wait_for_udev_add(struct udev_monitor *mon, const char *pool,
         const char *this_snap = udev_device_get_sysattr_value(dev,
                                                               "current_snap");

+        cerr << "DEBUG: read pool:'" << this_pool << "' name:'" << this_image
+             << "' current_snap:'" << this_snap << "'" << std::endl;
+
         if (this_pool && strcmp(this_pool, pool) == 0 &&
             this_image && strcmp(this_image, image) == 0 &&
             this_snap && strcmp(this_snap, snap) == 0) {

do map and unmap in a loop:
while((1)); do ./usr/bin/rbd map rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008; ./usr/bin/rbd unmap kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008; done

then down the osds and outputs:
DEBUG: get udev event: action:'add' subsystem:'rbd'
DEBUG: read pool:'rbd' name:'kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008' current_snap:'-'
DEBUG: get udev event: action:'add' subsystem:'block'
/dev/rbd4

[during osds down, the rbd map hangs, there are no output messages]

[after osds up, the output immediately shows unable to receive message, this is a hint of fulled buffer.]
unable to receive message
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'
DEBUG: get udev event: action:'change' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'

and then we can't receive the expected events anymore. though we didn't try to get the data length in the buffer directly, but we also used "udevadm montior" to get the events.

KERNEL[1810625.824541] add /devices/rbd/4 (rbd)
ACTION=add
DEVPATH=/devices/rbd/4
DEVTYPE=rbd
SEQNUM=1754193
SUBSYSTEM=rbd

UDEV [1810625.827186] add /devices/rbd/4 (rbd)
ACTION=add
DEVPATH=/devices/rbd/4
DEVTYPE=rbd
SEQNUM=1754193
SUBSYSTEM=rbd
USEC_INITIALIZED=10625824778

KERNEL[1810625.829706] add /devices/virtual/block/rbd4 (block)
ACTION=add
DEVNAME=/dev/rbd4
DEVPATH=/devices/virtual/block/rbd4
DEVTYPE=disk
MAJOR=246
MINOR=0
SEQNUM=1754195
SUBSYSTEM=block

UDEV [1810626.295368] add /devices/virtual/block/rbd4 (block)
ACTION=add
DEVLINKS=/dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
DEVNAME=/dev/rbd4
DEVPATH=/devices/virtual/block/rbd4
DEVTYPE=disk
MAJOR=246
MINOR=0
MPATH_SBIN_PATH=/sbin
SEQNUM=1754195
SUBSYSTEM=block
TAGS=:systemd:
USEC_INITIALIZED=625829737

and udevadm info:
[root@node-3 ~]# udevadm info /dev/rbd4
calling: info
P: /devices/virtual/block/rbd4
N: rbd4
S: rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
E: DEVLINKS=/dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
E: DEVNAME=/dev/rbd4
E: DEVPATH=/devices/virtual/block/rbd4
E: DEVTYPE=disk
E: MAJOR=246
E: MINOR=0
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=10625829737

[root@node-3 ~]# ls -l /dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
lrwxrwxrwx 1 root root 10 Aug 21 09:57 /dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008 -> ../../rbd4

we see that all events in udevd side haven't any problem, just the "rbd map" can't receive the events.

Actions

Copy link

Updated by fan chen over 4 years ago

Ilya Dryomov wrote:

fan chen wrote:

hi Ilya, the rbd verison is ceph-12.2.12, I notice that the latest ceph code also don't make any changes about this segments.

http://tracker.ceph.com/issues/39089 is not fixed in luminous (yet). Can you try reproducing with https://github.com/ceph/ceph/commit/42dd1eae630f31a628051049e4007175561f1939 backported to 12.2.12 or alternatively with "rbd" from nautilus?

thanks for your suggestion, we have tested this path yet, the rbd map can return when poll timeout, but the client(rbd map) we used in k8s
checked the return code was ETIMEOUT, and thought it was also not success. so the client treated that as failure too.

maybe the client can check that the udev devpath whether be succeed to create when "rbd map" return etimeout. do you have any suggestion?

Actions

Copy link

Updated by Ilya Dryomov over 4 years ago

fan chen wrote:

[during osds down, the rbd map hangs, there are no output messages]

[after osds up, the output immediately shows unable to receive message, this is a hint of fulled buffer.]
unable to receive message
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'
DEBUG: get udev event: action:'change' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'

and then we can't receive the expected events anymore. though we didn't try to get the data length in the buffer directly, but we also used "udevadm montior" to get the events.

I see, you didn't mention "unable to receive message" error message from udev in the original post. It could indeed be ENOBUFS, udev doesn't return it and doesn't print it:

         buflen = recvmsg(udev_monitor->sock, &smsg, 0);
         if (buflen < 0) {
                 if (errno != EINTR)
                         log_debug("unable to receive message");
                 return NULL;

If it is ENOBUFS, I wonder how many events does it take to fill up the buffer and where they are coming from. Do you have something else generating rbd events besides that while((1)) loop?

Actions

Copy link

#10

Updated by fan chen over 4 years ago

Ilya Dryomov wrote:

fan chen wrote:

[during osds down, the rbd map hangs, there are no output messages]

[after osds up, the output immediately shows unable to receive message, this is a hint of fulled buffer.]
unable to receive message
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'
DEBUG: get udev event: action:'change' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'

and then we can't receive the expected events anymore. though we didn't try to get the data length in the buffer directly, but we also used "udevadm montior" to get the events.

I see, you didn't mention "unable to receive message" error message from udev in the original post. It could indeed be ENOBUFS, udev doesn't return it and doesn't print it:

[...]

If it is ENOBUFS, I wonder how many events does it take to fill up the buffer and where they are coming from. Do you have something else generating rbd events besides that while((1)) loop?

yes, there were usually generating lots of events in our k8s platform, not just rbd events, the udev filter should not filter rbd subsystem before recive mssages. on our enviorment the docker used the devicemapper storage driver as storage backend, so there were lots of udev events like "dm-x" add/remove/change were generated, I didn't count the total number of events, but I have tested that if I down the osds for more than 15 minites, hangup would appear.

I also noticed that in systemd source code, the udevd change the buffer size to 128 * 1024 * 1024 at https://github.com/systemd/systemd/blob/master/src/udev/udevd.c:1602.

Actions

Copy link

#11

Updated by Ilya Dryomov over 4 years ago

Status changed from New to 12

Yeah, I realized right after I posted that these events don't all have to be rbd events, since the filtering is done on client side, not in udevd.

I don't think increasing the buffer size is the right solution. We are also considering removing the timeout you mentioned in https://tracker.ceph.com/issues/41404#note-8, so we don't want to be relying on it either.

Actions

Copy link

#12

Updated by fan chen over 4 years ago

Ilya Dryomov wrote:

Yeah, I realized right after I posted that these events don't all have to be rbd events, since the filtering is done on client side, not in udevd.

I don't think increasing the buffer size is the right solution. We are also considering removing the timeout you mentioned in https://tracker.ceph.com/issues/41404#note-8, so we don't want to be relying on it either.

I think it would incur another problem if we didn't have a timeout. Imagining that after krbd had created the block device and had sent out the uevent, the event worker hanged up when executing related rules in udevd, this was possible because the rules might need to access the block device, eventually the udeved would kill the worker after the event timeout, there would be no events really were sent out.

Actions

Copy link

#13

Updated by Jason Dillaman over 4 years ago

Am I correct in presuming that we can just tweak the krbd.c code to run the udev monitoring code in its own thread? It would launch the thread prior to attempting to map the block device and it would join w/ it to retrieve the block path or error after writing to the sysfs file.

Actions

Copy link

#14

Updated by fan chen over 4 years ago

Jason Dillaman wrote:

Am I correct in presuming that we can just tweak the krbd.c code to run the udev monitoring code in its own thread? It would launch the thread prior to attempting to map the block device and it would join w/ it to retrieve the block path or error after writing to the sysfs file.

I think it is correct, I metioned that method in the beginning of the issue.

Actions

Copy link

#15

Updated by Ilya Dryomov over 4 years ago

Status changed from 12 to In Progress

sysfs doesn't support O_DIRECT, so a separate thread seems like the only option (although I would probably use it for the blocking write and keep processing the udev events in the main thread instead of the other way around).

I'll try to post a patch in the next couple of days.

Actions

Copy link

#16

Updated by Jason Dillaman over 4 years ago

Backport set to luminous,mimic,nautilus

Actions

Copy link

#17

Updated by Ilya Dryomov over 4 years ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 30965

Actions

Copy link

#18

Updated by fan chen over 4 years ago

I have tested it, it works for me, thanks

Actions

Copy link

#19

Updated by Ilya Dryomov over 4 years ago

Status changed from Fix Under Review to Pending Backport

Actions

Copy link

#20

Updated by Ilya Dryomov over 4 years ago

Related to Bug #41036: concurrent "rbd unmap" failures due to udev added

Actions

Copy link

#21

Updated by Nathan Cutler over 4 years ago

Copied to Backport #42425: luminous: [rbd] rbd map hangs up infinitely after osd down added

Actions

Copy link

#22

Updated by Nathan Cutler over 4 years ago

Copied to Backport #42426: mimic: [rbd] rbd map hangs up infinitely after osd down added

Actions

Copy link

#23

Updated by Nathan Cutler over 4 years ago

Copied to Backport #42427: nautilus: [rbd] rbd map hangs up infinitely after osd down added

Actions

Copy link

#24

Updated by Ilya Dryomov over 4 years ago

Related to Fix #42523: backport "common/thread: Fix race condition in make_named_thread" to mimic and nautilus added

Actions

Copy link

#25

Updated by fan chen over 4 years ago

I have tested the patchset of "Backport #42425" on my environment, problem solved.
Thanks

Actions

Copy link

#26

Updated by Nathan Cutler over 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » rbd

Custom queries

Bug #41404

[rbd] rbd map hangs up infinitely after osd down

Updated by fan chen over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by fan chen over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by fan chen over 4 years ago

Updated by fan chen over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by fan chen over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by fan chen over 4 years ago

Updated by Jason Dillaman over 4 years ago

Updated by fan chen over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by Jason Dillaman over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by fan chen over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Ilya Dryomov over 4 years ago

Updated by fan chen over 4 years ago

Updated by Nathan Cutler over 4 years ago