Bug #20940

closed

IO stall/hang with ceph 10.2.7 on Arch Linux

Added by Jamin Collins over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On Arch Linux, starting with ceph 10.2.5, I noticed that VMs using ceph/rbd-backed volumes experience a complete I/O stall when attempting to access the volume.

I've replicated the hang with ceph 10.2.6 and 10.2.7. The later 10.2.x releases will not build on Arch due to internal compiler errors (ICEs).

I've also found that a development build from the 12.1.x series (git reference g171104cb93) does not exhibit the IO hang.

I've also filed an Arch Linux bug report for this: https://bugs.archlinux.org/task/55044


Files

qemu-guest-1215.log.lz4 (13.2 KB) Jamin Collins, 08/07/2017 09:16 PM
qemu-guest-1094.log.lz4 (20.2 KB) Jamin Collins, 08/08/2017 12:46 AM
gdb-thread-apply-all-bt.log (19.7 KB) Jamin Collins, 08/08/2017 04:04 PM
Actions #1

Updated by Jason Dillaman over 6 years ago

  • Status changed from New to Need More Info

@Jamin: nothing unusual in that log from a librbd perspective -- it just looks like write requests were sent out but never completed. First guess would be file descriptor limits. Second would be perhaps some type of cephx permission issue (since it complained about loading the keyring). Otherwise, I would re-run w/ "debug objecter = 20" included as well.
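
One way to check the file descriptor guess is to compare the number of descriptors the QEMU process has open against its soft limit. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names are hypothetical and not part of ceph or QEMU.

```python
import os
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' limits from /proc/<pid>/limits text.

    A limit of 'unlimited' is reported as None.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Fields are: Max, open, files, <soft>, <hard>, <units>
            soft, hard = fields[3], fields[4]
            return (None if soft == "unlimited" else int(soft),
                    None if hard == "unlimited" else int(hard))
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a running process (Linux only)."""
    proc = Path("/proc") / str(pid)
    soft, _hard = parse_max_open_files((proc / "limits").read_text())
    open_fds = len(os.listdir(proc / "fd"))
    return open_fds, soft
```

If the open count sits at or near the soft limit while the VM is stalled, descriptor exhaustion becomes a plausible cause; a count far below the limit makes it unlikely.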

When you tried different versions, are you only changing the librbd/librados version or are you changing the OSDs as well?

Actions #2

Updated by Jamin Collins over 6 years ago

The only change being made is the ceph package version on the VM host.

I believe the cephx permissions can be ruled out, as the VM definition is not changed between tests: same secret, etc.

All the storage nodes (OSD hosts) are left completely unchanged between tests.

Actions #3

Updated by Jason Dillaman over 6 years ago

One of the threads appears to hang:

2017-08-08 00:39:35.456651 7f51d3a74700 10 client.57056317.objecter ms_dispatch 0x7f51f2cbb000 osd_op_reply(24 rbd_header.669bcb238e1f29 [watch ping cookie 139989425823744] v0'0 uv79776 ondisk = 0) v7

Any chance you can install the debug packages, re-run, attach gdb to the process, and run "thread apply all bt" once you experience the deadlock?
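
The request above can be scripted so the backtraces land in a file that can be attached to the ticket. This is a rough sketch, assuming gdb is installed and the caller has permission to attach to the process; the helper names and the output path are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a non-interactive gdb invocation that dumps every thread's backtrace."""
    # -p attaches to the running process, -batch exits after the commands run,
    # -ex supplies the command that would otherwise be typed at the gdb prompt.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtraces(pid, out_path):
    """Run gdb against `pid` and write its output to `out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(gdb_backtrace_command(pid),
                       stdout=out, stderr=subprocess.STDOUT, check=True)
```

For example, `capture_backtraces(qemu_pid, "gdb-thread-apply-all-bt.log")` would be run once the guest has stalled, where `qemu_pid` is the PID of the affected QEMU process.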

Actions #4

Updated by Jamin Collins over 6 years ago

I'm assuming you wanted me to attach to the qemu process.

Actions #5

Updated by Jason Dillaman over 6 years ago

@Jamin: it looks like you have lots of threads hung inside jemalloc attempting to allocate memory. This thread in particular is in a location where the last set of logs indicated things just stopped:

Thread 25 (Thread 0x7fdf5e37e700 (LWP 20522)):
#0  0x00007fe00e56e54c in __lll_lock_wait () at /usr/lib/libpthread.so.0
#1  0x00007fe00e567905 in pthread_mutex_lock () at /usr/lib/libpthread.so.0
#2  0x00007fe00f0fc810 in  () at /usr/lib/libjemalloc.so.2
#3  0x00007fe00f0c52f9 in  () at /usr/lib/libjemalloc.so.2
#4  0x00007fe00f107ea5 in  () at /usr/lib/libjemalloc.so.2
#5  0x00007fe00f0bb2e0 in malloc () at /usr/lib/libjemalloc.so.2
#6  0x00007fe00f10c933 in  () at /usr/lib/libjemalloc.so.2
#7  0x00007fdfed4dd956 in ceph::log::Log::create_entry(int, int, unsigned long*) (this=<optimized out>, level=level@entry=15, subsys=subsys@entry=14, expected_size=expected_size@entry=0x7fdfed7f7560 <Objecter::_finish_op(Objecter::Op*, int)::_log_exp_length>)
    at log/Log.cc:254
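
With many threads in the dump, it can help to list which ones show the same pattern as Thread 25, that is, waiting on a lock with jemalloc frames on the stack. The sketch below is a simple text heuristic over "thread apply all bt" output, not a gdb API; the function name is hypothetical.

```python
import re

# Matches gdb thread headers such as "Thread 25 (Thread 0x7fdf5e37e700 (LWP 20522)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(")


def threads_blocked_in_jemalloc(backtrace_text):
    """Return ids of threads whose top frame is __lll_lock_wait and whose
    stack contains at least one frame from libjemalloc."""
    blocked = []
    thread_id, frames = None, []

    def flush():
        # Evaluate the thread collected so far, if any.
        if thread_id is not None and frames:
            waiting = "__lll_lock_wait" in frames[0]
            in_jemalloc = any("libjemalloc" in frame for frame in frames)
            if waiting and in_jemalloc:
                blocked.append(thread_id)

    for line in backtrace_text.splitlines():
        match = THREAD_RE.match(line)
        if match:
            flush()
            thread_id, frames = int(match.group(1)), []
        elif line.startswith("#"):
            frames.append(line)
    flush()
    return blocked
```

Continuation lines such as the indented "at log/Log.cc:254" are ignored, since only lines starting with "#" are treated as frames.
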
Actions #6

Updated by Jamin Collins over 6 years ago

That's odd; memory is not an issue on this host. It has 32G of RAM with nothing else currently running on it.

$ free -m
total used free shared buff/cache available
Mem: 31998 2289 8271 1 21437 29270
Swap: 0 0 0

The above is with the hung VM (from the gdb backtrace) still running.

Actions #7

Updated by Jamin Collins over 6 years ago

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31998        2289        8271           1       21437       29270
Swap:             0           0           0

One more time with formatting (sorry about that).

Actions #8

Updated by Jason Dillaman over 6 years ago

  • Status changed from Need More Info to Resolved

Appears to be some sort of jemalloc / glibc issue on Arch -- running QEMU built under glibc malloc does not result in IO hanging.

Actions #9

Updated by Jamin Collins over 6 years ago

@Jason Borden indicated that he doesn't normally see ceph/rbd running under jemalloc. Digging into Arch's qemu package, I found they explicitly enable jemalloc. Rebuilding Arch's qemu package without jemalloc appears to resolve this IO hang.
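
A quick way to see whether a given QEMU binary is dynamically linked against jemalloc is to inspect its `ldd` output. This is a minimal sketch with hypothetical function names; the binary path in the usage note is an assumption, and a negative result is not conclusive, since an allocator can also be statically linked or preloaded.

```python
import subprocess


def links_jemalloc(ldd_output):
    """Return True if any line of `ldd` output names libjemalloc."""
    return any("libjemalloc" in line for line in ldd_output.splitlines())


def binary_links_jemalloc(path):
    """Run `ldd` on the binary at `path` and check for a jemalloc dependency."""
    result = subprocess.run(["ldd", path],
                            capture_output=True, text=True, check=True)
    return links_jemalloc(result.stdout)
```

For example, `binary_links_jemalloc("/usr/bin/qemu-system-x86_64")` could be compared before and after rebuilding the package, assuming that is where the QEMU binary is installed.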

Resolving ticket.
