Bug #7125

closed

Assertion failure in rbd_img_obj_callback()

Added by Eric Eastman over 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

My system hung while stress testing an RBD backed XFS file system. After power cycling the system the error message in /var/log/kern.log just before the reboot messages was:

Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] Assertion failure in rbd_img_obj_callback() at line 2127:
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689]
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] rbd_assert(img_request != NULL);
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689]
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022758] ------------[ cut here ]------------
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022768] kernel BUG at /home/apw/COD/linux/drivers/block/rbd.c:2127!
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022779] invalid opcode: 0000 [#1] SMP
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022796] Modules linked in: xfs zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) target_core_mod configfs nfsv3 x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich microcode psmouse serio_raw sb_edac edac_core joydev hpwdt hpilo lpc_ich ipmi_si nfsd ipmi_msghandler ioatdma dca auth_rpcgss acpi_power_meter mac_hid rbd nfs_acl nfs libceph libcrc32c lockd sunrpc lp fscache parport hid_generic qla2xxx usbhid tg3 hid scsi_transport_fc ptp be2net hpsa pps_core scsi_tgt
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.023120] CPU: 0 PID: 1794 Comm: kworker/0:4 Tainted: PF O 3.12.1-031201-generic #201311201654
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.023147] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/14/2012
Jan 9 14:28:58 ksJan 9 14:47:16 ks2-p1 kernel: imklog 5.8.11, log source = /proc/kmsg started.

The kernel was:
Linux version 3.12.1-031201-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201311201654 SMP Wed Nov 20 21:54:49 UTC 2013

modinfo rbd
filename: /lib/modules/3.12.1-031201-generic/kernel/drivers/block/rbd.ko
license: GPL
author: Jeff Garzik <>
description: rados block device
author: Yehuda Sadeh <>
author: Sage Weil <>
author: Alex Elder <>
srcversion: A993220E8E5D714D1F1429C
depends: libceph
intree: Y
vermagic: 3.12.1-031201-generic SMP mod_unload modversions


Files

kern.log.1 (272 KB), Eric Eastman, 01/12/2014 10:39 PM

Related issues 1 (0 open, 1 closed)

Is duplicate of rbd - Bug #5876: Assertion failure in rbd_img_obj_callback() : rbd_assert(which >= img_request->next_completion); (Resolved, Ilya Dryomov, 08/05/2013)

Actions #1

Updated by Ian Colle over 10 years ago

  • Assignee set to Ilya Dryomov
Actions #2

Updated by Ilya Dryomov over 10 years ago

  • Status changed from New to Need More Info

Hi Eric,

Is it reproducible?

What kind of stress testing were you doing? Can you share a script or
at least describe it in more detail?

It would help if you could describe your setup: the size of the RBD
image, mkfs.xfs parameters, were there rbd snapshots or anything else
involved, etc.

Judging from the timestamps, kern.log shouldn't be long, can you attach
it in its entirety?

Actions #3

Updated by Eric Eastman over 10 years ago

Hi Ilya,

So far I have not reproduced the problem.

Ceph cluster info:
ceph --version
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

There are 180 OSDs, on 6 OSD nodes, using a mix of XFS and BTRFS file systems for the OSDs. There are 3 separate monitors. There is both a frontend and backend 10Gb Ethernet network for the OSDs and the client is connected with 10Gb.

From my notes of the setup:

ceph osd pool create iscsi 5000 5000
rbd create iscsi/iscsi-00 --size 10000000 --image-format 2

rbd -p iscsi ls -l
NAME      SIZE   PARENT  FMT  PROT  LOCK
iscsi-00  9765G          2

On the client:
rbd -p iscsi map iscsi-00

parted -s /dev/rbd/iscsi/iscsi-00 mklabel gpt mkpart primary -- 8192s '-1'

time mkfs.xfs /dev/rbd/iscsi/iscsi-00-part1
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd/iscsi/iscsi-00-part1 isize=256    agcount=33, agsize=79998976 blks
         =                              sectsz=512   attr=2, projid32bit=0
data     =                              bsize=4096   blocks=2559998971, imaxpct=5
         =                              sunit=1024   swidth=1024 blks
naming   =version 2                     bsize=4096   ascii-ci=0
log      =internal log                  bsize=4096   blocks=521728, version=2
         =                              sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                          extsz=4096   blocks=0, rtextents=0

real 2m38.826s
user 0m0.020s
sys 0m0.576s

mount /dev/rbd/iscsi/iscsi-00-part1 /XFS00

I had filled about 92% of the XFS file system with files from 1 byte to 1GB, in a dated tree structure, with
year, month, date, hour directories. In each of the leaf nodes I had 100 data files and 1 md5sum file. I had 10
processes walking this tree verifying the md5sum files, so all read activity on the 10 processes. If you need the
scripts I can provide them.
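
For anyone trying to reproduce a similar read load without the original scripts, here is a rough sketch of the verification pass described above. It is not the script used in this report. It assumes each leaf directory holds a checksum file in md5sum format under the hypothetical name `md5sums.txt`, and it uses 10 worker threads rather than 10 separate processes to keep the example short. The names `md5_of`, `verify_leaf` and `verify_tree` are made up for illustration.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical name of the per-directory checksum file (md5sum format).
MD5_FILE = "md5sums.txt"


def md5_of(path, chunk=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_leaf(directory):
    """Return the names of files in one leaf directory that fail verification."""
    bad = []
    with open(os.path.join(directory, MD5_FILE)) as f:
        for line in f:
            if not line.strip():
                continue
            expected, name = line.split(None, 1)
            name = name.strip()
            if name.startswith("*"):  # md5sum marks binary mode with '*'
                name = name[1:]
            if md5_of(os.path.join(directory, name)) != expected:
                bad.append(name)
    return bad


def verify_tree(root, workers=10):
    """Walk the tree and verify every leaf directory using a pool of workers."""
    leaves = [d for d, _, files in os.walk(root) if MD5_FILE in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(verify_leaf, leaves)
        return {leaf: bad for leaf, bad in zip(leaves, results) if bad}
```

Calling `verify_tree("/XFS00")` would return a dictionary mapping each leaf directory to the files that failed; an empty dictionary means every checksum matched.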

While that was happening, I created 3 more rbd images with:

rbd create iscsi/iscsi-01 --size 10000000 --image-format 2
rbd create iscsi/iscsi-02 --size 10000000 --image-format 2
rbd create iscsi/iscsi-03 --size 10000000 --image-format 2

I then mapped them on the client, partitioned them, and was creating a new XFS file system on iscsi-02-part1 when it hung. I was using mkfs.xfs with no options as above.

The kern.log file for the last two boots is attached.

Let me know if you have any more questions.

My lab took a power hit this weekend, so I am still trying to get the cluster back on line to do more testing.

Eric

Actions #4

Updated by Ilya Dryomov over 10 years ago

Thanks Eric, I'll try to reproduce it here on a smaller scale this week.

Actions #6

Updated by Sage Weil about 10 years ago

  • Project changed from Ceph to rbd
Actions #7

Updated by Ilya Dryomov almost 10 years ago

  • Status changed from Need More Info to Resolved

Should be fixed by commit 0f2d5be792b0 ("rbd: use reference counts for image requests"), which went into 3.16-rc1.
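
As a rough illustration of the idea named in the commit title, here is a toy model in Python. It is not the kernel patch and does not reproduce the rbd code. It assumes (this is not confirmed in the ticket) that the assertion fired because an image request was released while an object request callback still needed it. The classes `ImageRequest` and `ObjectRequest` and the helpers `get`, `put` and `run_demo` are hypothetical names chosen for this sketch.

```python
import threading


class ImageRequest:
    """Toy stand-in for an image request shared by several object requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = 1           # reference held by the creator
        self.destroyed = False

    def get(self):
        """Take an extra reference; only valid while the request is alive."""
        with self._lock:
            assert not self.destroyed, "reference taken after destruction"
            self._refs += 1
        return self

    def put(self):
        """Drop a reference; the request is torn down when the last one goes."""
        with self._lock:
            self._refs -= 1
            last = self._refs == 0
            if last:
                self.destroyed = True
        return last


class ObjectRequest:
    """Toy object request that pins its image request with its own reference."""

    def __init__(self, img_request):
        self.img_request = img_request.get()
        self.saw_live_image = False

    def callback(self):
        img = self.img_request
        # The reference taken in __init__ guarantees this holds.
        self.saw_live_image = img is not None and not img.destroyed
        self.img_request = None
        img.put()                # drop our reference only when we are done


def run_demo(n_objects=8):
    img = ImageRequest()
    objs = [ObjectRequest(img) for _ in range(n_objects)]
    img.put()                    # creator lets go; object requests keep it alive
    threads = [threading.Thread(target=o.callback) for o in objs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return img, objs
```

In this model the image request is destroyed exactly when the last object request callback drops its reference, so no callback can observe a missing or destroyed image request, whatever order the callbacks run in.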
