Bug #7125
Assertion failure in rbd_img_obj_callback()
Description
My system hung while stress testing an RBD-backed XFS file system. After power cycling the system, the last error message in /var/log/kern.log before the reboot messages was:
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] Assertion failure in rbd_img_obj_callback() at line 2127:
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689]
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689] rbd_assert(img_request != NULL);
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022689]
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022758] ------------[ cut here ]------------
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022768] kernel BUG at /home/apw/COD/linux/drivers/block/rbd.c:2127!
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022779] invalid opcode: 0000 [#1] SMP
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.022796] Modules linked in: xfs zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) target_core_mod configfs nfsv3 x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich microcode psmouse serio_raw sb_edac edac_core joydev hpwdt hpilo lpc_ich ipmi_si nfsd ipmi_msghandler ioatdma dca auth_rpcgss acpi_power_meter mac_hid rbd nfs_acl nfs libceph libcrc32c lockd sunrpc lp fscache parport hid_generic qla2xxx usbhid tg3 hid scsi_transport_fc ptp be2net hpsa pps_core scsi_tgt
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.023120] CPU: 0 PID: 1794 Comm: kworker/0:4 Tainted: PF O 3.12.1-031201-generic #201311201654
Jan 9 14:28:58 ks2-p1 kernel: [ 3467.023147] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/14/2012
Jan 9 14:28:58 ksJan 9 14:47:16 ks2-p1 kernel: imklog 5.8.11, log source = /proc/kmsg started.
The kernel was:
Linux version 3.12.1-031201-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201311201654 SMP Wed Nov 20 21:54:49 UTC 2013
modinfo rbd
filename: /lib/modules/3.12.1-031201-generic/kernel/drivers/block/rbd.ko
license: GPL
author: Jeff Garzik <jeff@garzik.org>
description: rados block device
author: Yehuda Sadeh <yehuda@hq.newdream.net>
author: Sage Weil <sage@newdream.net>
author: Alex Elder <elder@inktank.com>
srcversion: A993220E8E5D714D1F1429C
depends: libceph
intree: Y
vermagic: 3.12.1-031201-generic SMP mod_unload modversions
History
#1 Updated by Ian Colle over 9 years ago
- Assignee set to Ilya Dryomov
#2 Updated by Ilya Dryomov over 9 years ago
- Status changed from New to Need More Info
Hi Eric,
Is it reproducible?
What kind of stress testing were you doing? Can you share a script or
at least describe it in more detail?
It would help if you could describe your setup: the size of the RBD
image, mkfs.xfs parameters, were there rbd snapshots or anything else
involved, etc.
Judging from the timestamps, kern.log shouldn't be long. Can you attach
it in its entirety?
#3 Updated by Eric Eastman over 9 years ago
- File kern.log.1 added
Hi Ilya,
So far I have not reproduced the problem.
Ceph cluster info:
ceph --version
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
There are 180 OSDs, on 6 OSD nodes, using a mix of XFS and BTRFS file systems for the OSDs. There are 3 separate monitors. There is both a frontend and backend 10Gb Ethernet network for the OSDs and the client is connected with 10Gb.
From my notes of the setup:
ceph osd pool create iscsi 5000 5000
rbd create iscsi/iscsi-00 --size 10000000 --image-format 2
rbd -p iscsi ls -l
NAME SIZE PARENT FMT PROT LOCK
iscsi-00 9765G 2
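As a quick sanity check of the listing above, the 9765G figure is consistent with the --size argument being interpreted in MiB (an assumption that matches this output; variable names are illustrative only):

```python
# Sanity check: does "--size 10000000" line up with the "9765G" shown by "rbd ls -l"?
# Assumption: --size is given in MiB and the listing rounds down to whole GiB.
MIB = 1024 ** 2
GIB = 1024 ** 3

size_arg = 10_000_000            # value passed to "rbd create --size"
image_bytes = size_arg * MIB     # 10,485,760,000,000 bytes
image_gib = image_bytes // GIB   # whole GiB, rounded down

print(f"{image_bytes} bytes = {image_gib}G")
```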
On the client:
rbd -p iscsi map iscsi-00
parted -s /dev/rbd/iscsi/iscsi-00 mklabel gpt mkpart primary - 8192s '-1'
time mkfs.xfs /dev/rbd/iscsi/iscsi-00-part1
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd/iscsi/iscsi-00-part1 isize=256 agcount=33, agsize=79998976 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=2559998971, imaxpct=5
= sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
real 2m38.826s
user 0m0.020s
sys 0m0.576s
mount /dev/rbd/iscsi/iscsi-00-part1 /XFS00
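The numbers in the mkfs.xfs output above can be cross-checked with a little arithmetic. This is only a sketch using values copied from the output; the variable names are made up for illustration:

```python
# Cross-check of the mkfs.xfs geometry reported above.
# All constants are copied from the command output; nothing is measured here.
BLOCK_SIZE = 4096                 # bsize
SECTOR_SIZE = 512                 # sectsz

data_blocks = 2559998971          # data blocks=
sunit_blocks = 1024               # data sunit= (in file system blocks)
agcount, agsize = 33, 79998976    # allocation groups and blocks per group

# The data stripe unit in bytes equals the "log stripe unit" mkfs.xfs complained about.
stripe_unit_bytes = sunit_blocks * BLOCK_SIZE

# The last allocation group is whatever is left after 32 full-size groups.
last_ag_blocks = data_blocks - (agcount - 1) * agsize

# File system size versus the image size (10000000 MiB) and the 8192s partition start.
fs_bytes = data_blocks * BLOCK_SIZE
image_bytes = 10_000_000 * 1024 ** 2
partition_start_bytes = 8192 * SECTOR_SIZE
unused_bytes = image_bytes - fs_bytes

print(stripe_unit_bytes, last_ag_blocks, fs_bytes, unused_bytes - partition_start_bytes)
```

The stripe unit works out to 4194304 bytes (4 MiB), which is why mkfs.xfs reduced the log stripe unit to 32KiB.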
I had filled about 92% of the XFS file system with files from 1 byte to 1GB in a dated tree structure, with
year, month, date, and hour directories. Each leaf directory held 100 data files and 1 md5sum file. I had 10
processes walking this tree and verifying the md5sum files, so those 10 processes generated only read activity. If you need the
scripts, I can provide them.
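For readers who want to approximate the read workload described above, here is a minimal sketch. It is not the script used in this report. It assumes each leaf directory contains a checksum file in the usual md5sum output format; the file name md5sums.txt is hypothetical, and it uses 10 worker threads rather than 10 separate processes.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHECKSUM_FILE = "md5sums.txt"  # hypothetical name; the report does not give one


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dir(dirpath):
    """Check every entry in one directory's checksum file; return mismatching paths."""
    bad = []
    with open(os.path.join(dirpath, CHECKSUM_FILE)) as f:
        for line in f:
            if not line.strip():
                continue
            # md5sum lines look like "<hex digest>  <name>" or "<hex digest> *<name>"
            expected, name = line.rstrip("\n").split(None, 1)
            path = os.path.join(dirpath, name.lstrip("*"))
            if not os.path.isfile(path) or md5_of(path) != expected:
                bad.append(path)
    return bad


def verify_tree(root, workers=10):
    """Walk the tree and verify every directory that has a checksum file."""
    leaves = [d for d, _, files in os.walk(root) if CHECKSUM_FILE in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(verify_dir, leaves))
    return sorted(p for bad in results for p in bad)
```

Calling verify_tree("/XFS00") would return a list of files whose contents no longer match their recorded checksum, or an empty list if everything verifies.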
While that was happening, I created 3 more rbd images with:
rbd create iscsi/iscsi-01 --size 10000000 --image-format 2
rbd create iscsi/iscsi-02 --size 10000000 --image-format 2
rbd create iscsi/iscsi-03 --size 10000000 --image-format 2
I then mapped them on the client, partitioned them, and was creating a new XFS file system on iscsi-02-part1 when it hung. I was using mkfs.xfs with no options as above.
The kern.log file for the last two boots is attached.
Let me know if you have any more questions.
My lab took a power hit this weekend, so I am still trying to get the cluster back online to do more testing.
Eric
#4 Updated by Ilya Dryomov over 9 years ago
Thanks Eric, I'll try to reproduce it here on a smaller scale this week.
#5 Updated by Ilya Dryomov over 9 years ago
#6 Updated by Sage Weil over 9 years ago
- Project changed from Ceph to rbd
#7 Updated by Ilya Dryomov over 9 years ago
- Status changed from Need More Info to Resolved
Should be fixed by commit 0f2d5be792b0 ("rbd: use reference counts for image requests"), which went into 3.16-rc1.
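The general technique named in the commit title, reference counting a parent object that several child completions point back to, can be illustrated with a toy model. This is not the kernel code and makes no claim about how the actual patch is written; every class and method name below is hypothetical.

```python
class ImageRequest:
    """Toy parent request that is torn down only when its last reference is dropped."""

    def __init__(self):
        self.refs = 1            # reference held by the creator
        self.destroyed = False

    def get(self):
        assert not self.destroyed, "taking a reference on a destroyed request"
        self.refs += 1
        return self

    def put(self):
        assert self.refs > 0, "reference count underflow"
        self.refs -= 1
        if self.refs == 0:
            self.destroyed = True   # stands in for freeing the request


class ObjectRequest:
    """Toy child request that pins its parent until its own callback has finished."""

    def __init__(self, img_request):
        self.img_request = img_request.get()

    def callback(self):
        img_request = self.img_request
        # Analogous in spirit to the failed assertion: the parent must still exist here.
        assert img_request is not None and not img_request.destroyed
        # ... completion work that touches img_request would go here ...
        self.img_request = None
        img_request.put()           # drop this child's reference last


def demo():
    img = ImageRequest()
    children = [ObjectRequest(img) for _ in range(3)]
    img.put()                       # creator is done; the children keep it alive
    states = []
    for child in children:
        child.callback()
        states.append(img.destroyed)
    return states


print(demo())
```

In this model the parent stays valid for every callback and is marked destroyed only after the last child drops its reference.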