Project

General

Profile

Bug #41404

[rbd] rbd map hangs up infinitely after osd down

Added by fan chen almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous,mimic,nautilus
Regression:
Yes
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

this issue could be 100% reproduced, after down several osd services for a long time, it would cause the "rbd map" command hanging up. after doing investigation, we found the steps of reproduction:
1. down serveral osd services
2. rbd map would hangup at:
[<ffffffffc05cc571>] __rbd_obj_request_wait.constprop.55+0x31/0xe0 [rbd]
[<ffffffffc05cd1fb>] rbd_obj_method_sync.constprop.54+0x1ab/0x250 [rbd]
[<ffffffffc05d1965>] rbd_dev_image_id+0xe5/0x280 [rbd]
[<ffffffffc05d36ba>] do_rbd_add.isra.47+0x4ba/0xe50 [rbd]
[<ffffffffc05d4094>] rbd_add+0x24/0x30 [rbd]
[<ffffffff8143d3e7>] bus_attr_store+0x27/0x30
[<ffffffff81280046>] sysfs_write_file+0xc6/0x140
[<ffffffff81200a9d>] vfs_write+0xbd/0x1e0
[<ffffffff812018af>] SyS_write+0x7f/0xe0
[<ffffffff816b50c9>] system_call_fastpath+0x16/0x1b
3. then up the osd services
4. /dev/rbdX created
5. but rbd map hanged up at poll() in wait_for_udev_add

we found the root reason:
after down osd services, it would cause the io hang up in step 2, but the udev monitor fd have already setup and enabled, so during this period of osd down, the monitor fd could also receive messages from systemd udevd service, in this situation, the monitor fd socket buffer may be full because of rbd_add hanging up. event if we recover the osd services and the sysfs_write_rbd_add returned, the udevd netlink event may also be dropped because of socket buffer is full(see netlink manpage). so here the rbd map may never be able to receive the rbd and block events, then cause hanging up.

so in order to solve this issue, can we increase the socket buffer size, by default on some distros, such as centos, the buffer size
is 208k, which is too small?

or fork an individual thread to receive udevd events beside the main thread?


Related issues

Related to rbd - Bug #41036: concurrent "rbd unmap" failures due to udev Resolved 07/31/2019
Related to Ceph - Fix #42523: backport "common/thread: Fix race condition in make_named_thread" to mimic and nautilus Resolved 10/29/2019
Copied to rbd - Backport #42425: luminous: [rbd] rbd map hangs up infinitely after osd down Resolved
Copied to rbd - Backport #42426: mimic: [rbd] rbd map hangs up infinitely after osd down Resolved
Copied to rbd - Backport #42427: nautilus: [rbd] rbd map hangs up infinitely after osd down Resolved

History

#1 Updated by fan chen almost 2 years ago

ping...

Does anyone notice this case?

#2 Updated by Ilya Dryomov almost 2 years ago

What is the output of "rbd --version"? This might be http://tracker.ceph.com/issues/39089.

#3 Updated by Ilya Dryomov almost 2 years ago

  • Assignee set to Ilya Dryomov

#4 Updated by fan chen almost 2 years ago

Ilya Dryomov wrote:

What is the output of "rbd --version"? This might be http://tracker.ceph.com/issues/39089.

hi Ilya, the rbd verison is ceph-12.2.12, I notice that the latest ceph code also don't make any changes about this segments.

I think the reason is not same as the issues/39089, and I have already explained the reason, so here I want to show some codes to help explanation:
if some osds down for a long enough time, it will cause the do_map() function hang at line 272:

 256   mon = udev_monitor_new_from_netlink(udev, "udev");
 257   if (!mon)
 258     return -ENOMEM;
 259 
 260   r = udev_monitor_filter_add_match_subsystem_devtype(mon, "rbd", NULL);
 261   if (r < 0)
 262     goto out_mon;
 263 
 264   r = udev_monitor_filter_add_match_subsystem_devtype(mon, "block", "disk");
 265   if (r < 0)
 266     goto out_mon;
 267 
 268   r = udev_monitor_enable_receiving(mon);
 269   if (r < 0)
 270     goto out_mon;
 271 
 272   r = sysfs_write_rbd_add(buf);
 273   if (r < 0) {
 274     cerr << "rbd: sysfs write failed" << std::endl;
 275     goto out_mon;
 276   }
 277 
 278   r = wait_for_udev_add(mon, pool, image, snap, pname);
 279   if (r < 0) {
 280     cerr << "rbd: wait failed" << std::endl;
 281     goto out_mon;

but at that time, seeing at line 268, the mon fd has been opened and enabled, so the socket buffer can be filled up by other udev events, then after osds up, the
wait_for_udev_add() function would never be able to wait for the events that the do_map expected.

so any ideas?

#5 Updated by Ilya Dryomov almost 2 years ago

Did you confirm that the socket actually fills up? It seems highly unlikely to me, because it would take a ton of udev events to do that.

#6 Updated by Ilya Dryomov almost 2 years ago

fan chen wrote:

hi Ilya, the rbd verison is ceph-12.2.12, I notice that the latest ceph code also don't make any changes about this segments.

http://tracker.ceph.com/issues/39089 is not fixed in luminous (yet). Can you try reproducing with https://github.com/ceph/ceph/commit/42dd1eae630f31a628051049e4007175561f1939 backported to 12.2.12 or alternatively with "rbd" from nautilus?

#7 Updated by fan chen almost 2 years ago

Ilya Dryomov wrote:

Did you confirm that the socket actually fills up? It seems highly unlikely to me, because it would take a ton of udev events to do that.

I confirmed, I have done the test:
case 1:
rebuild code with:

index 4f8b671b42..730f5e7f7f 100644
--- a/src/krbd.cc
+++ b/src/krbd.cc
@@ -200,6 +200,10 @@ static int wait_for_udev_add(struct udev_monitor *mon, const char *pool,
     if (!dev)
       continue;

+    cerr << "DEBUG: get udev event: action:'" << udev_device_get_action(dev)
+         << "' subsystem:'" << udev_device_get_subsystem(dev)
+         << "'"<< std::endl;
+
     if (strcmp(udev_device_get_action(dev), "add") != 0)
       goto next;

@@ -210,6 +214,9 @@ static int wait_for_udev_add(struct udev_monitor *mon, const char *pool,
         const char *this_snap = udev_device_get_sysattr_value(dev,
                                                               "current_snap");

+        cerr << "DEBUG: read pool:'" << this_pool << "' name:'" << this_image
+             << "' current_snap:'" << this_snap << "'" << std::endl;
+
         if (this_pool && strcmp(this_pool, pool) == 0 &&
             this_image && strcmp(this_image, image) == 0 &&
             this_snap && strcmp(this_snap, snap) == 0) {

do map and unmap in a loop:
while((1)); do ./usr/bin/rbd map rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008; ./usr/bin/rbd unmap kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008; done

then down the osds and outputs:
DEBUG: get udev event: action:'add' subsystem:'rbd'
DEBUG: read pool:'rbd' name:'kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008' current_snap:'-'
DEBUG: get udev event: action:'add' subsystem:'block'
/dev/rbd4

[during osds down, the rbd map hangs, there are no output messages]

[after osds up, the output immediately shows unable to receive message, this is a hint of fulled buffer.]
unable to receive message
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'
DEBUG: get udev event: action:'change' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'

and then we can't receive the expected events anymore. though we didn't try to get the data length in the buffer directly, but we also used "udevadm montior" to get the events.

KERNEL[1810625.824541] add /devices/rbd/4 (rbd)
ACTION=add
DEVPATH=/devices/rbd/4
DEVTYPE=rbd
SEQNUM=1754193
SUBSYSTEM=rbd

UDEV [1810625.827186] add /devices/rbd/4 (rbd)
ACTION=add
DEVPATH=/devices/rbd/4
DEVTYPE=rbd
SEQNUM=1754193
SUBSYSTEM=rbd
USEC_INITIALIZED=10625824778

KERNEL[1810625.829706] add /devices/virtual/block/rbd4 (block)
ACTION=add
DEVNAME=/dev/rbd4
DEVPATH=/devices/virtual/block/rbd4
DEVTYPE=disk
MAJOR=246
MINOR=0
SEQNUM=1754195
SUBSYSTEM=block

UDEV [1810626.295368] add /devices/virtual/block/rbd4 (block)
ACTION=add
DEVLINKS=/dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
DEVNAME=/dev/rbd4
DEVPATH=/devices/virtual/block/rbd4
DEVTYPE=disk
MAJOR=246
MINOR=0
MPATH_SBIN_PATH=/sbin
SEQNUM=1754195
SUBSYSTEM=block
TAGS=:systemd:
USEC_INITIALIZED=625829737

and udevadm info:
[root@node-3 ~]# udevadm info /dev/rbd4
calling: info
P: /devices/virtual/block/rbd4
N: rbd4
S: rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
E: DEVLINKS=/dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
E: DEVNAME=/dev/rbd4
E: DEVPATH=/devices/virtual/block/rbd4
E: DEVTYPE=disk
E: MAJOR=246
E: MINOR=0
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=10625829737

[root@node-3 ~]# ls -l /dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008
lrwxrwxrwx 1 root root 10 Aug 21 09:57 /dev/rbd/rbd/kubernetes-dynamic-pvc-FFFFFFFF-FFFF-FFFF-FFFF-0a580ae80008 -> ../../rbd4

we see that all events in udevd side haven't any problem, just the "rbd map" can't receive the events.

#8 Updated by fan chen almost 2 years ago

Ilya Dryomov wrote:

fan chen wrote:

hi Ilya, the rbd verison is ceph-12.2.12, I notice that the latest ceph code also don't make any changes about this segments.

http://tracker.ceph.com/issues/39089 is not fixed in luminous (yet). Can you try reproducing with https://github.com/ceph/ceph/commit/42dd1eae630f31a628051049e4007175561f1939 backported to 12.2.12 or alternatively with "rbd" from nautilus?

thanks for your suggestion, we have tested this path yet, the rbd map can return when poll timeout, but the client(rbd map) we used in k8s
checked the return code was ETIMEOUT, and thought it was also not success. so the client treated that as failure too.

maybe the client can check that the udev devpath whether be succeed to create when "rbd map" return etimeout. do you have any suggestion?

#9 Updated by Ilya Dryomov almost 2 years ago

fan chen wrote:

[during osds down, the rbd map hangs, there are no output messages]

[after osds up, the output immediately shows unable to receive message, this is a hint of fulled buffer.]
unable to receive message
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'
DEBUG: get udev event: action:'change' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'

and then we can't receive the expected events anymore. though we didn't try to get the data length in the buffer directly, but we also used "udevadm montior" to get the events.

I see, you didn't mention "unable to receive message" error message from udev in the original post. It could indeed be ENOBUFS, udev doesn't return it and doesn't print it:

         buflen = recvmsg(udev_monitor->sock, &smsg, 0);
         if (buflen < 0) {
                 if (errno != EINTR)
                         log_debug("unable to receive message");
                 return NULL;

If it is ENOBUFS, I wonder how many events does it take to fill up the buffer and where they are coming from. Do you have something else generating rbd events besides that while((1)) loop?

#10 Updated by fan chen almost 2 years ago

Ilya Dryomov wrote:

fan chen wrote:

[during osds down, the rbd map hangs, there are no output messages]

[after osds up, the output immediately shows unable to receive message, this is a hint of fulled buffer.]
unable to receive message
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'
DEBUG: get udev event: action:'change' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'remove' subsystem:'block'
DEBUG: get udev event: action:'add' subsystem:'block'

and then we can't receive the expected events anymore. though we didn't try to get the data length in the buffer directly, but we also used "udevadm montior" to get the events.

I see, you didn't mention "unable to receive message" error message from udev in the original post. It could indeed be ENOBUFS, udev doesn't return it and doesn't print it:

[...]

If it is ENOBUFS, I wonder how many events does it take to fill up the buffer and where they are coming from. Do you have something else generating rbd events besides that while((1)) loop?

yes, there were usually generating lots of events in our k8s platform, not just rbd events, the udev filter should not filter rbd subsystem before recive mssages. on our enviorment the docker used the devicemapper storage driver as storage backend, so there were lots of udev events like "dm-x" add/remove/change were generated, I didn't count the total number of events, but I have tested that if I down the osds for more than 15 minites, hangup would appear.

I also noticed that in systemd source code, the udevd change the buffer size to 128 * 1024 * 1024 at https://github.com/systemd/systemd/blob/master/src/udev/udevd.c:1602.

#11 Updated by Ilya Dryomov almost 2 years ago

  • Status changed from New to 12

Yeah, I realized right after I posted that these events don't all have to be rbd events, since the filtering is done on client side, not in udevd.

I don't think increasing the buffer size is the right solution. We are also considering removing the timeout you mentioned in https://tracker.ceph.com/issues/41404#note-8, so we don't want to be relying on it either.

#12 Updated by fan chen almost 2 years ago

Ilya Dryomov wrote:

Yeah, I realized right after I posted that these events don't all have to be rbd events, since the filtering is done on client side, not in udevd.

I don't think increasing the buffer size is the right solution. We are also considering removing the timeout you mentioned in https://tracker.ceph.com/issues/41404#note-8, so we don't want to be relying on it either.

I think it would incur another problem if we didn't have a timeout. Imagining that after krbd had created the block device and had sent out the uevent, the event worker hanged up when executing related rules in udevd, this was possible because the rules might need to access the block device, eventually the udeved would kill the worker after the event timeout, there would be no events really were sent out.

#13 Updated by Jason Dillaman almost 2 years ago

Am I correct in presuming that we can just tweak the krbd.c code to run the udev monitoring code in its own thread? It would launch the thread prior to attempting to map the block device and it would join w/ it to retrieve the block path or error after writing to the sysfs file.

#14 Updated by fan chen almost 2 years ago

Jason Dillaman wrote:

Am I correct in presuming that we can just tweak the krbd.c code to run the udev monitoring code in its own thread? It would launch the thread prior to attempting to map the block device and it would join w/ it to retrieve the block path or error after writing to the sysfs file.

I think it is correct, I metioned that method in the beginning of the issue.

#15 Updated by Ilya Dryomov almost 2 years ago

  • Status changed from 12 to In Progress

sysfs doesn't support O_DIRECT, so a separate thread seems like the only option (although I would probably use it for the blocking write and keep processing the udev events in the main thread instead of the other way around).

I'll try to post a patch in the next couple of days.

#16 Updated by Jason Dillaman almost 2 years ago

  • Backport set to luminous,mimic,nautilus

#17 Updated by Ilya Dryomov almost 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 30965

#18 Updated by fan chen almost 2 years ago

I have tested it, it works for me, thanks

#19 Updated by Ilya Dryomov almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport

#20 Updated by Ilya Dryomov almost 2 years ago

  • Related to Bug #41036: concurrent "rbd unmap" failures due to udev added

#21 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42425: luminous: [rbd] rbd map hangs up infinitely after osd down added

#22 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42426: mimic: [rbd] rbd map hangs up infinitely after osd down added

#23 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42427: nautilus: [rbd] rbd map hangs up infinitely after osd down added

#24 Updated by Ilya Dryomov almost 2 years ago

  • Related to Fix #42523: backport "common/thread: Fix race condition in make_named_thread" to mimic and nautilus added

#25 Updated by fan chen over 1 year ago

I have tested the patchset of "Backport #42425" on my environment, problem solved.
Thanks

#26 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF