Bug #5418
kceph: crash in remove_session_caps
Description
<6>[27710.014724] libceph: loaded (mon/osd proto 15/24)
<6>[27710.100140] ceph: loaded (mds proto 32)
<6>[27710.110299] libceph: client4103 fsid e14625c7-3a58-4167-bfab-520c922939eb
<6>[27710.119943] libceph: mon1 10.214.133.30:6790 session established
[6]kdb>
[6]kdb> bt
Stack traceback for pid 8545
0xffff880225dc3f20 8545 2 1 6 R 0xffff880225dc43a8 *kworker/6:2
 ffff880224a8fae8 0000000000000018 ffffffffa07b2d53 ffff88010007d800
 ffff88020ce62f68 ffff88010007d800 ffff880224cd2800 ffff880224a8fc08
 ffffffffa07b81bf ffffffffffffffff ffff880224a8ffd8 ffffffffffffffff
Call Trace:
 [<ffffffffa07b2d53>] ? remove_session_caps+0x33/0x140 [ceph]
 [<ffffffffa07b81bf>] ? dispatch+0x7ff/0x1740 [ceph]
 [<ffffffff81510b06>] ? kernel_recvmsg+0x46/0x60
 [<ffffffffa0762e38>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
 [<ffffffff810a309d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa07661f8>] ? con_work+0x1948/0x2d50 [libceph]
 [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff81637b5c>] ? retint_restore_args+0xe/0xe
 [<ffffffff810605bc>] ? worker_thread+0x11c/0x370
 [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
[6]kdb> rd
ax: 0000000000000000 bx: ffff88010007d800 cx: 0000000000003332
dx: ffffffffa07b1d64 si: ffffffffa07b1d64 di: ffff88010007de20
bp: ffff880224a8fb08 sp: ffff880224a8fae8 r8: 0000000000000002
r9: 0000000000000001 r10: 0000000000000000 r11: 0000000000000000
r12: ffff88010007d800 r13: ffff880224cd2800 r14: ffff88020c02dfa0
r15: 0000000000000003 ip: ffffffffa07b2e54 flags: 00010202 cs: 00000010
ss: 00000018 ds: 00000018 es: 00000018 fs: 00000018 gs: 00000018
test was
ubuntu@teuthology:/a/teuthology-2013-06-21_01:01:00-kernel-master-testing-basic/41775$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 4bf5b732cd8869276e87d4bbc4f261ee9e0c6a4c
  install:
    ceph:
      sha1: 4bf5b732cd8869276e87d4bbc4f261ee9e0c6a4c
  s3tests:
    branch: master
  workunit:
    sha1: 4bf5b732cd8869276e87d4bbc4f261ee9e0c6a4c
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/fsync-tester.sh
History
#2 Updated by Sage Weil almost 11 years ago
- Priority changed from High to Urgent
ubuntu@teuthology:/a/teuthology-2013-06-25_01:00:47-kernel-next-testing-basic/45603
#3 Updated by Zheng Yan almost 11 years ago
I still can't figure out the cause of the crash. Is it an infinite loop in iterate_session_caps(), BUG_ON(session->s_nr_caps > 0), or BUG_ON(!list_empty(&session->s_cap_flushing))? Please upload ceph.ko.
#4 Updated by Sage Weil almost 11 years ago
Zheng Yan wrote:
I still can't figure out the cause of the crash. Is it an infinite loop in iterate_session_caps(), BUG_ON(session->s_nr_caps > 0), or BUG_ON(!list_empty(&session->s_cap_flushing))?
bah, i rebooted the machine. next time i'll gather more info from kdb. the dump.txt is above, but remove_session_caps+0x33 doesn't line up with the current kernel builds
#5 Updated by Sage Weil over 10 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-07-12_01:01:16-kernel-master-testing-basic/63639$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 365b57b1317524bb0cdd15859a224ba1ab58d1d7
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: cf8f16d7433b86b0bdfc192f719f3029f04996a6
  install:
    ceph:
      sha1: cf8f16d7433b86b0bdfc192f719f3029f04996a6
  s3tests:
    branch: master
  workunit:
    sha1: cf8f16d7433b86b0bdfc192f719f3029f04996a6
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/fsync-tester.sh
#6 Updated by Sage Weil over 10 years ago
- Status changed from 12 to Need More Info
#7 Updated by Sage Weil over 10 years ago
dump attached
i'll leave this box in kdb in case more information is needed
#8 Updated by Zheng Yan over 10 years ago
I need to know which line caused the crash. looks like it was triggered by one of the BUG_ONs in remove_session_caps. but I don't see any BUG_ON kernel message, so I'm confused.
#9 Updated by Sage Weil over 10 years ago
- File objdump.txt added
#10 Updated by Sage Weil over 10 years ago
Zheng Yan wrote:
I need to know which line caused the crash. looks like it was triggered by one of the BUG_ONs in remove_session_caps. but I don't see any BUG_ON kernel message, so I'm confused.
yeah strangely there isn't one. remove_session_caps+0x33/0x140 also isn't an exact match.. it'd be 0x19f93, but
static int remove_session_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
{
   19f74: 48 89 5d d8             mov %rbx,-0x28(%rbp)
   19f78: 4c 89 65 e0             mov %r12,-0x20(%rbp)
   19f7c: 48 89 fb                mov %rdi,%rbx
   19f7f: 4c 89 6d e8             mov %r13,-0x18(%rbp)
   19f83: 4c 89 75 f0             mov %r14,-0x10(%rbp)
   19f87: 49 89 f4                mov %rsi,%r12
   19f8a: 4c 89 7d f8             mov %r15,-0x8(%rbp)
	struct ceph_inode_info *ci = ceph_inode(inode);
	int drop = 0;

	dout("removing cap %p, ci is %p, inode is %p\n",
   19f8e: 0f 85 f9 01 00 00       jne 1a18d <remove_session_caps_cb+0x22d>
	raw_spin_lock_init(&(_lock)->rlock); \
} while (0)

static inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
   19f94: 4c 8d ab b0 fb ff ff    lea -0x450(%rbx),%r13
   19f9b: 4c 89 ef                mov %r13,%rdi
   19f9e: e8 00 00 00 00          callq 19fa3 <remove_session_caps_cb+0x43>
			19f9f: R_X86_64_PC32 _raw_spin_lock-0x4
	     cap, ci, &ci->vfs_inode);
	spin_lock(&ci->i_ceph_lock);
full objdump is attached
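The offset arithmetic discussed in this comment can be sketched in C. This is only an illustration: `func_start` and `func_offset` are hypothetical helper names, the addresses are copied from the dumps in this ticket, and it assumes the `?`-marked backtrace entry and the `ip` value from the first kdb dump refer to the same function.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for reading "addr <sym+0xoff>" pairs from a
 * backtrace or from objdump branch annotations.
 */

/* Start address of a function, given an address inside it and the
 * offset printed next to the symbol name. */
static uint64_t func_start(uint64_t addr, uint64_t off)
{
	return addr - off;
}

/* Offset of an address relative to a known function start. */
static uint64_t func_offset(uint64_t addr, uint64_t start)
{
	return addr - start;
}

/*
 * Values from this ticket:
 *   objdump:  jne   1a18d <remove_session_caps_cb+0x22d>
 *             callq 19fa3 <remove_session_caps_cb+0x43>
 *   kdb bt:   [<ffffffffa07b2d53>] ? remove_session_caps+0x33/0x140
 *   kdb rd:   ip: ffffffffa07b2e54
 */
```

With those values, `func_start(0x1a18d, 0x22d)` and `func_start(0x19fa3, 0x43)` both give 0x19f60 as the start of remove_session_caps_cb in the objdump, so +0x33 lands on 0x19f93. That address falls inside the 6-byte jne at 0x19f8e rather than on an instruction boundary. For the first kdb dump, `func_start(0xffffffffa07b2d53, 0x33)` gives 0xffffffffa07b2d20, and `func_offset(0xffffffffa07b2e54, 0xffffffffa07b2d20)` gives 0x134, which is within the 0x140 bytes reported for remove_session_caps.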
#11 Updated by Sage Weil over 10 years ago
registers:
[0]kdb> rd
ax: 0000000000000000 bx: ffff88022310f800 cx: 0000000000003332
dx: ffffffffa079bcf4 si: ffffffffa079bcf4 di: ffff88022310fe20
bp: ffff88020da1bb08 sp: ffff88020da1bae8 r8: 0000000000000002
r9: 0000000000000001 r10: 0000000000000000 r11: 0000000000000000
r12: ffff88022310f800 r13: ffff88017a887000 r14: ffff88020bd41560
r15: 0000000000000003 ip: ffffffffa079cde4 flags: 00010202 cs: 00000010
ss: 00000018 ds: 00000018 es: 00000018 fs: 00000018 gs: 00000018
#12 Updated by Zheng Yan over 10 years ago
- File 0001-ceph-fix-freeing-inode-vs-removing-sessioncaps-race.patch added
I think BUG_ON(session->s_nr_caps > 0) caused the crash (it looks like kdb traps the undefined instruction and prevents the BUG_ON message from showing). One possible explanation for "session->s_nr_caps > 0" is that iterate_session_caps() skipped some I_FREEING/I_WILL_FREE inodes. Please try the attached patch.
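The hypothesis above can be illustrated with a toy model in C. This is not the kernel code and not the attached patch: `toy_cap`, `toy_session` and `toy_remove_session_caps` are hypothetical stand-ins, and the sketch only assumes that a cap whose inode is being freed gets skipped by the iteration while still being counted in `s_nr_caps`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, not the kernel's structures. */
struct toy_cap {
	bool inode_freeing;	/* models I_FREEING/I_WILL_FREE on the cap's inode */
	bool removed;		/* set once the callback has dropped this cap */
};

struct toy_session {
	struct toy_cap *caps;
	size_t n;		/* number of entries in caps[] */
	int s_nr_caps;		/* caps still accounted to the session */
};

/*
 * Walk every cap of the session and remove it, but skip caps whose
 * inode is being freed (the assumed behaviour under discussion).
 * Skipped caps stay counted in s_nr_caps.
 */
static void toy_remove_session_caps(struct toy_session *s)
{
	for (size_t i = 0; i < s->n; i++) {
		struct toy_cap *cap = &s->caps[i];

		if (cap->inode_freeing)
			continue;	/* skipped: never removed, never uncounted */
		cap->removed = true;
		s->s_nr_caps--;
	}
	/* The real function checks BUG_ON(session->s_nr_caps > 0) here. */
}
```

In this model, a session with three caps of which one belongs to a freeing inode ends with `s_nr_caps == 1` after the walk, which is the condition the BUG_ON would trip on.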
#13 Updated by Zheng Yan over 10 years ago
- File deleted (0001-ceph-fix-freeing-inode-vs-removing-sessioncaps-race.patch)
#14 Updated by Zheng Yan over 10 years ago
#15 Updated by Sage Weil over 10 years ago
- Priority changed from Urgent to High
#16 Updated by Zheng Yan over 10 years ago
- Status changed from 12 to 7
#17 Updated by Sage Weil over 10 years ago
- Status changed from 7 to Resolved
#18 Updated by Greg Farnum over 7 years ago
- Component(FS) kceph added