Bug #7125: Assertion failure in rbd_img_obj_callback() - rbd - Ceph

Bug #7125

closed

Assertion failure in rbd_img_obj_callback()

Added by Eric Eastman over 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assignee: Ilya Dryomov
Source: Community (user)
Severity: 3 - minor

Description

My system hung while stress testing an RBD-backed XFS file system. After power cycling the system, the last error message in /var/log/kern.log before the reboot messages was:

Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] Assertion failure in rbd_img_obj_callback() at line 2127:
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689]
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] rbd_assert(img_request != NULL);
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689]
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022758] ------------[ cut here ]------------
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022768] kernel BUG at /home/apw/COD/linux/drivers/block/rbd.c:2127!
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022779] invalid opcode: 0000 [#1] SMP
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022796] Modules linked in: xfs zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) target_core_mod configfs nfsv3 x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich microcode psmouse serio_raw sb_edac edac_core joydev hpwdt hpilo lpc_ich ipmi_si nfsd ipmi_msghandler ioatdma dca auth_rpcgss acpi_power_meter mac_hid rbd nfs_acl nfs libceph libcrc32c lockd sunrpc lp fscache parport hid_generic qla2xxx usbhid tg3 hid scsi_transport_fc ptp be2net hpsa pps_core scsi_tgt
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.023120] CPU: 0 PID: 1794 Comm: kworker/0:4 Tainted: PF O 3.12.1-031201-generic #201311201654
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.023147] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/14/2012
Jan 9 14:28:58 ks
Jan 9 14:47:16 ks2-p1 kernel: imklog 5.8.11, log source = /proc/kmsg started.
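
For anyone checking their own kern.log for the same failure, the assertion banner above can be picked out mechanically. The following is only a minimal sketch: the regular expression is modelled on the two log lines quoted in this report, and the helper name `find_assertions` is made up for illustration, not part of any Ceph or kernel tooling.

```python
import re

# Pattern modelled on the rbd_assert() banner quoted in this report.
# Assumption: other kernels print the banner in the same shape.
ASSERT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+Assertion failure in "
    r"(?P<func>\w+)\(\) at line (?P<line>\d+):"
)


def find_assertions(lines):
    """Return (uptime_seconds, function, source_line) for each banner found."""
    hits = []
    for text in lines:
        match = ASSERT_RE.search(text)
        if match:
            hits.append(
                (float(match.group("uptime")),
                 match.group("func"),
                 int(match.group("line")))
            )
    return hits


# Two lines copied from the log excerpt in this report.
sample = [
    "Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] Assertion failure in rbd_img_obj_callback() at line 2127:",
    "Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] rbd_assert(img_request != NULL);",
]
print(find_assertions(sample))
```

To scan a whole file, the same function can be fed an open file object, for example `find_assertions(open("/var/log/kern.log"))`.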

The kernel was:
Linux version 3.12.1-031201-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201311201654 SMP Wed Nov 20 21:54:49 UTC 2013

modinfo rbd
filename: /lib/modules/3.12.1-031201-generic/kernel/drivers/block/rbd.ko
license: GPL
author: Jeff Garzik <jeff@garzik.org>
description: rados block device
author: Yehuda Sadeh <yehuda@hq.newdream.net>
author: Sage Weil <sage@newdream.net>
author: Alex Elder <elder@inktank.com>
srcversion: A993220E8E5D714D1F1429C
depends: libceph
intree: Y
vermagic: 3.12.1-031201-generic SMP mod_unload modversions

Files

kern.log.1 (272 KB)

Eric Eastman, 01/12/2014 10:39 PM

Related issues 1 (0 open — 1 closed)


Updated by Ian Colle over 10 years ago

Assignee set to Ilya Dryomov


Updated by Ilya Dryomov over 10 years ago

Status changed from New to Need More Info

Hi Eric,

Is it reproducible?

What kind of stress testing were you doing? Can you share a script or
at least describe it in more detail?

It would help if you could describe your setup: the size of the RBD
image, mkfs.xfs parameters, were there rbd snapshots or anything else
involved, etc.

Judging from the timestamps, kern.log shouldn't be long. Can you attach
it in its entirety?


Updated by Eric Eastman over 10 years ago

File kern.log.1 added

Hi Ilya,

So far I have not reproduced the problem.

Ceph cluster info:
ceph --version
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

There are 180 OSDs on 6 OSD nodes, using a mix of XFS and BTRFS file systems for the OSDs. There are 3 separate monitors. The OSDs have both a frontend and a backend 10Gb Ethernet network, and the client is connected at 10Gb.

From my notes of the setup:

ceph osd pool create iscsi 5000 5000
rbd create iscsi/iscsi-00 --size 10000000 --image-format 2

rbd -p iscsi ls -l
NAME SIZE PARENT FMT PROT LOCK
iscsi-00 9765G 2
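
The relationship between the `--size 10000000` argument and the `9765G` shown in the listing can be checked with a little arithmetic. This is a sketch under two assumptions that are not stated in the report: that `--size` is interpreted in megabytes (MiB), and that the image uses the default 4 MiB object size (which matches the 4194304-byte stripe unit that mkfs.xfs reports further down).

```python
MIB = 1024 ** 2

size_mib = 10_000_000        # value passed to `rbd create --size`
object_size = 4 * MIB        # assumed default RBD object size (not stated in the report)

size_gib = size_mib // 1024  # whole GiB, as shown in the `rbd ls -l` SIZE column
size_bytes = size_mib * MIB
num_objects = size_bytes // object_size  # objects backing a fully written image

print(size_gib)      # 9765
print(num_objects)   # 2500000
```

Under those assumptions each 10000000 MB image is backed by up to 2.5 million RADOS objects once fully written.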

On the client:
rbd -p iscsi map iscsi-00

parted -s /dev/rbd/iscsi/iscsi-00 mklabel gpt mkpart primary -- 8192s '-1'

time mkfs.xfs /dev/rbd/iscsi/iscsi-00-part1
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd/iscsi/iscsi-00-part1 isize=256    agcount=33, agsize=79998976 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=2559998971, imaxpct=5
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real 2m38.826s
user 0m0.020s
sys 0m0.576s
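
The numbers in the mkfs.xfs output above are internally consistent, which can be verified with a short calculation. This sketch uses only values printed in the output and in the parted command; the variable names are just for illustration.

```python
BLOCK = 4096                    # bsize from the mkfs.xfs output

data_blocks = 2_559_998_971     # blocks= on the data line
sunit_blocks = 1024             # data sunit, in file system blocks
log_sunit_blocks = 8            # log sunit, in file system blocks
agcount, agsize = 33, 79_998_976

fs_bytes = data_blocks * BLOCK
stripe_unit_bytes = sunit_blocks * BLOCK       # data stripe unit in bytes
log_sunit_bytes = log_sunit_blocks * BLOCK     # log stripe unit in bytes
partition_offset = 8192 * 512                  # partition start: 8192 sectors of 512 bytes

# The first 32 allocation groups are full size; the last one holds the remainder.
last_ag_blocks = data_blocks - (agcount - 1) * agsize

print(stripe_unit_bytes)   # 4194304, the "log stripe unit" mkfs complained about
print(log_sunit_bytes)     # 32768, i.e. the adjusted 32KiB
print(partition_offset)    # 4194304, so the partition starts on a stripe-unit boundary
print(last_ag_blocks)
```

The data stripe unit of 4194304 bytes equals the partition start offset, so the file system is aligned to the stripe unit reported for the RBD device.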

mount /dev/rbd/iscsi/iscsi-00-part1 /XFS00

I had filled about 92% of the XFS file system with files ranging from 1 byte to 1GB, arranged in a dated tree
of year, month, date, and hour directories. Each leaf directory held 100 data files and 1 md5sum file. I had 10
processes walking this tree and verifying the md5sum files, so the activity from those 10 processes was all reads.
If you need the scripts, I can provide them.
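
The actual test scripts are not included in this report. As a rough idea of what such a read-only verification workload might look like, here is a minimal sketch. Everything in it is an assumption: the manifest file name `md5sum.txt`, the standard `md5sum` line format ("digest  filename"), and the use of a 10-worker process pool rather than 10 independent walkers.

```python
import hashlib
import os
import sys
from multiprocessing import Pool

MD5_FILE = "md5sum.txt"  # hypothetical manifest name; the report does not give it


def md5_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so 1GB files are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_dir(leaf):
    """Return the paths in one leaf directory whose digest differs from the manifest."""
    bad = []
    with open(os.path.join(leaf, MD5_FILE)) as manifest:
        for line in manifest:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
            path = os.path.join(leaf, name)
            if md5_of(path) != expected:
                bad.append(path)
    return bad


def leaf_dirs(root):
    """Yield every directory under root that contains a manifest file."""
    for dirpath, _subdirs, files in os.walk(root):
        if MD5_FILE in files:
            yield dirpath


def verify_tree(root, workers=10):
    """Verify all leaf directories using a pool of worker processes."""
    with Pool(workers) as pool:
        results = pool.map(verify_dir, list(leaf_dirs(root)))
    return [path for bad in results for path in bad]


if __name__ == "__main__" and len(sys.argv) == 2:
    mismatches = verify_tree(sys.argv[1])
    print(f"{len(mismatches)} mismatched files")
```

Run against the mount point (for example `/XFS00`), this generates only read traffic on the RBD device, which is the kind of load described above.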

While that was happening, I created 3 more rbd images with:

rbd create iscsi/iscsi-01 --size 10000000 --image-format 2
rbd create iscsi/iscsi-02 --size 10000000 --image-format 2
rbd create iscsi/iscsi-03 --size 10000000 --image-format 2

I then mapped them on the client, partitioned them, and was creating a new XFS file system on iscsi-02-part1 when it hung. I was using mkfs.xfs with no options, as above.

The kern.log file for the last two boots is attached.

Let me know if you have any more questions.

My lab took a power hit this weekend, so I am still trying to get the cluster back online to do more testing.

Eric
