Project

General

Profile

Bug #23246

[OSD bug] KernelDevice.cc: 539: FAILED assert(r == 0)

Added by Christoffer Lilja about 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Target version:
-
Start date:
03/06/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:

Description

Hi,

Got a bug at one of my OSD's, see a snippet below. Full logfile also attached.
I've tested with "dev_aio_max_queue_depth = 4096" as some gave as a workaround, but it didn't work.

I'll be glad to assist with more information when needed.

2018-03-06 15:50:35.045499 7f34b893ce00 1 freelist init
2018-03-06 15:50:35.145876 7f34b893ce00 1 bluestore(/var/lib/ceph/osd/ceph-6) _open_alloc opening allocation metadata
2018-03-06 15:51:41.962675 7f34b893ce00 1 bluestore(/var/lib/ceph/osd/ceph-6) _open_alloc loaded 718 G in 1331688 extents
2018-03-06 15:51:51.199593 7f34b893ce00 -1 bdev(0x563e2b634fc0 /var/lib/ceph/osd/ceph-6/block) aio_submit retries 16
2018-03-06 15:51:51.199801 7f34b893ce00 -1 bdev(0x563e2b634fc0 /var/lib/ceph/osd/ceph-6/block) aio submit got (11) Resource temporarily unavailable
2018-03-06 15:51:51.207950 7f34b893ce00 -1 /build/ceph-12.2.4/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7f34b893ce00 time 2018-03-06 15:51:51.199884
/build/ceph-12.2.4/src/os/bluestore/KernelDevice.cc: 539: FAILED assert(r == 0)

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x563e21d76872]
2: (KernelDevice::aio_submit(IOContext*)+0x6ed) [0x563e21d1088d]
3: (BlueStore::_deferred_submit_unlock(BlueStore::OpSequencer*)+0x779) [0x563e21bc5e49]
4: (BlueStore::deferred_try_submit()+0x7b1) [0x563e21bc6d71]
5: (BlueStore::_osr_drain_all()+0x31d) [0x563e21bd0e0d]
6: (BlueStore::_deferred_replay()+0x4ad) [0x563e21c2b40d]
7: (BlueStore::_mount(bool)+0x437) [0x563e21c39b77]
8: (OSD::init()+0x3e2) [0x563e217a0ed2]
9: (main()+0x2f07) [0x563e216b1f17]
10: (__libc_start_main()+0xf0) [0x7f34b5da5830]
11: (_start()+0x29) [0x563e2173d6b9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ceph-osd.6.log.gz (722 KB) Christoffer Lilja, 03/06/2018 03:05 PM


Related issues

Copied to bluestore - Backport #23700: luminous: osd: KernelDevice.cc: 539: FAILED assert(r == 0) Resolved

History

#1 Updated by Radoslaw Zarzynski about 1 year ago

Yeah, looks like io_submit() was constantly returning EAGAIN and the number of retries (16) has been exhausted. Let's verify how many ops were pending. There is a debug:

void KernelDevice::aio_submit(IOContext *ioc)
{
  dout(20) << __func__ << " ioc " << ioc
           << " pending " << ioc->num_pending.load()
           << " running " << ioc->num_running.load()
           << dendl;

However, it will be visible on much higher level. Could you please set debug_bluestore=20 and provide a log?

Also, is your dmesg clean?

#2 Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to Need More Info

#3 Updated by Christoffer Lilja about 1 year ago

Hi,

Here comes a logfile with debug info, it's pretty big so I share it through Google Drive:
https://drive.google.com/file/d/183UepSeqvf9H5v7bSPlnkN5l2MPpmkOz/view?usp=sharing

dmesg is clean, no disk crash or such.

#4 Updated by Christoffer Lilja about 1 year ago

Now I remember that the sata controller that this drive was connected to had a glitch and took down the drives for a minute.
Maybe this bug report is to any use anyway?

#5 Updated by Radoslaw Zarzynski about 1 year ago

Could be that the kernel was just reasonably rejecting the requests because of HW issue.
I would need even more logs than last time (sorry). If it still happens, could please provide osd.log with debug_bluestore=20 and debug_bdev=20?

#6 Updated by Christoffer Lilja about 1 year ago

It still occurs and I'll come back with the logs asap.

#7 Updated by Christoffer Lilja about 1 year ago

Here you have a osd log with debug_bluestore=20 and debug_bdev=20:
https://drive.google.com/open?id=11oW6yAG0M6rdMSzx_GWqxVGk9w58n9MR

#8 Updated by Radoslaw Zarzynski about 1 year ago

Thanks for taking the log once again! It looks the IOContext carried really huge number of operations:

2018-03-07 13:18:55.408740 7fb8b8b49e00  5 bdev(0x55d52ff32fc0 /var/lib/ceph/osd/ceph-6/block) aio_write 0xd60aeb0000~2000 aio 0x55d54510f550
2018-03-07 13:18:55.408744 7fb8b8b49e00 20 bdev(0x55d52ff32fc0 /var/lib/ceph/osd/ceph-6/block) aio_submit ioc 0x55d52ff38fd8 pending 4641 running 0

Try to give even higher values to dev_aio_max_queue_depth.

#9 Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from Need More Info to Verified

#10 Updated by Christoffer Lilja about 1 year ago

As I mentioned in the initial bug report :-)
"I've tested with "dev_aio_max_queue_depth = 4096" as some gave as a workaround, but it didn't work."

Or should I try to increase it even more?

#11 Updated by Radoslaw Zarzynski about 1 year ago

Yeah, saw that. However, your IOContext is so large it exceeds even that. :-)
I would try bigger values just as a workaround for this specific situation.

The ultimate solution will be to implement batch division to never deplete the number of iocb's we requested during io_setup(). However, it will take some time to make a patch and have it landed in new minor.

#12 Updated by Christoffer Lilja about 1 year ago

Hi,

Thank you for looking into this.
I'd set dev_aio_max_queue_depth higher and increased it until I reached 68719476736, but it still crashes with the same error.

#13 Updated by Radoslaw Zarzynski about 1 year ago

Hmm, that's surprising. Setting extreme values should cause EINVAL during io_setup because exceeding system limit (see /proc/sys/fs/aio-max-nr).

Now I see there is a typo in the option's name. It should be bdev_aio_max_queue_depth. If the correct one was applied before, please provide OSD log (the same debug levels) and the value of /proc/sys/fs/aio-max-nr.

#14 Updated by Christoffer Lilja about 1 year ago

Ah, yeah I see that. Oops.
Corrected and I did set bdev_aio_max_queue_depth = 65536 aaand all OSD's are up!

#15 Updated by Sage Weil about 1 year ago

  • Status changed from Verified to In Progress
  • Assignee set to Radoslaw Zarzynski

#16 Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from In Progress to Need Review
  • Backport set to luminous

#17 Updated by Kefu Chai 12 months ago

  • Status changed from Need Review to Pending Backport

#18 Updated by Kefu Chai 12 months ago

  • Copied to Backport #23700: luminous: osd: KernelDevice.cc: 539: FAILED assert(r == 0) added

#19 Updated by Nathan Cutler 10 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF