Bug #5418

kceph: crash in remove_session_caps

Added by Sage Weil almost 11 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

<6>[27710.014724] libceph: loaded (mon/osd proto 15/24)
<6>[27710.100140] ceph: loaded (mds proto 32)
<6>[27710.110299] libceph: client4103 fsid e14625c7-3a58-4167-bfab-520c922939eb
<6>[27710.119943] libceph: mon1 10.214.133.30:6790 session established
[6]kdb>                      
[6]kdb> bt
Stack traceback for pid 8545
0xffff880225dc3f20     8545        2  1    6   R  0xffff880225dc43a8 *kworker/6:2
 ffff880224a8fae8 0000000000000018 ffffffffa07b2d53 ffff88010007d800
 ffff88020ce62f68 ffff88010007d800 ffff880224cd2800 ffff880224a8fc08
 ffffffffa07b81bf ffffffffffffffff ffff880224a8ffd8 ffffffffffffffff
Call Trace:
 [<ffffffffa07b2d53>] ? remove_session_caps+0x33/0x140 [ceph]
 [<ffffffffa07b81bf>] ? dispatch+0x7ff/0x1740 [ceph]
 [<ffffffff81510b06>] ? kernel_recvmsg+0x46/0x60
 [<ffffffffa0762e38>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
 [<ffffffff810a309d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa07661f8>] ? con_work+0x1948/0x2d50 [libceph]
 [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff81637b5c>] ? retint_restore_args+0xe/0xe
 [<ffffffff810605bc>] ? worker_thread+0x11c/0x370
 [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0

[6]kdb> rd
ax: 0000000000000000  bx: ffff88010007d800  cx: 0000000000003332
dx: ffffffffa07b1d64  si: ffffffffa07b1d64  di: ffff88010007de20
bp: ffff880224a8fb08  sp: ffff880224a8fae8  r8: 0000000000000002
r9: 0000000000000001  r10: 0000000000000000  r11: 0000000000000000
r12: ffff88010007d800  r13: ffff880224cd2800  r14: ffff88020c02dfa0
r15: 0000000000000003  ip: ffffffffa07b2e54  flags: 00010202  cs: 00000010
ss: 00000018  ds: 00000018  es: 00000018  fs: 00000018  gs: 00000018
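
For reference, the position of the faulting ip inside remove_session_caps can be worked out from the call trace and register dump above. This is only a minimal sketch in C: it assumes the '?'-marked call-trace entry (remove_session_caps+0x33/0x140 at ffffffffa07b2d53) can be trusted, and the helper names are hypothetical, not kernel or kdb functions.

```c
#include <stdint.h>

/* Start address of a symbol, given a "symbol+off" trace entry at addr. */
uint64_t symbol_base(uint64_t addr, uint64_t off)
{
    return addr - off;
}

/* Offset of ip inside the symbol that starts at base. */
uint64_t offset_in_symbol(uint64_t ip, uint64_t base)
{
    return ip - base;
}
```

With addr = 0xffffffffa07b2d53, off = 0x33 and ip = 0xffffffffa07b2e54, the base comes out as 0xffffffffa07b2d20 and ip sits at remove_session_caps+0x134, which is inside the 0x140-byte function and close to its end.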

test was

ubuntu@teuthology:/a/teuthology-2013-06-21_01:01:00-kernel-master-testing-basic/41775$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 4bf5b732cd8869276e87d4bbc4f261ee9e0c6a4c
  install:
    ceph:
      sha1: 4bf5b732cd8869276e87d4bbc4f261ee9e0c6a4c
  s3tests:
    branch: master
  workunit:
    sha1: 4bf5b732cd8869276e87d4bbc4f261ee9e0c6a4c
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/fsync-tester.sh

dump.txt (85.4 KB) Sage Weil, 06/21/2013 12:02 PM

dump.txt (88.8 KB) Sage Weil, 07/22/2013 09:04 AM

objdump.txt (3.22 MB) Sage Weil, 07/22/2013 01:46 PM

0001-ceph-fix-freeing-inode-vs-removing-session-caps-race.patch (3.19 KB) Zheng Yan, 07/23/2013 11:34 PM

History

#1 Updated by Sage Weil almost 11 years ago

kdb dumpall attached

#2 Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent

ubuntu@teuthology:/a/teuthology-2013-06-25_01:00:47-kernel-next-testing-basic/45603

#3 Updated by Zheng Yan almost 11 years ago

I still haven't figured out the cause of the crash. Is it an infinite loop in iterate_session_caps(), BUG_ON(session->s_nr_caps > 0), or BUG_ON(!list_empty(&session->s_cap_flushing))? Please upload ceph.ko.

#4 Updated by Sage Weil almost 11 years ago

Zheng Yan wrote:

I still haven't figured out the cause of the crash. Is it an infinite loop in iterate_session_caps(), BUG_ON(session->s_nr_caps > 0), or BUG_ON(!list_empty(&session->s_cap_flushing))?

Bah, I rebooted the machine. Next time I'll gather more info from kdb. The dump.txt is attached above, but remove_session_caps+0x33 doesn't line up with the current kernel builds.

#5 Updated by Sage Weil over 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-07-12_01:01:16-kernel-master-testing-basic/63639$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 365b57b1317524bb0cdd15859a224ba1ab58d1d7
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: cf8f16d7433b86b0bdfc192f719f3029f04996a6
  install:
    ceph:
      sha1: cf8f16d7433b86b0bdfc192f719f3029f04996a6
  s3tests:
    branch: master
  workunit:
    sha1: cf8f16d7433b86b0bdfc192f719f3029f04996a6
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/fsync-tester.sh

#6 Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Need More Info

#7 Updated by Sage Weil over 10 years ago

  • File dump.txt added
  • Status changed from Need More Info to 12
  • Assignee set to Zheng Yan

dump attached

i'll leave this box in kdb in case more information is needed

#8 Updated by Zheng Yan over 10 years ago

I need to know which line caused the crash. It looks like it was triggered by one of the BUG_ONs in remove_session_caps, but I don't see any BUG_ON kernel message, so I'm confused.

#9 Updated by Sage Weil over 10 years ago

#10 Updated by Sage Weil over 10 years ago

Zheng Yan wrote:

I need to know which line caused the crash. It looks like it was triggered by one of the BUG_ONs in remove_session_caps, but I don't see any BUG_ON kernel message, so I'm confused.

Yeah, strangely there isn't one. remove_session_caps+0x33/0x140 also isn't an exact match. It would be 0x19f93, but this is what the objdump shows around that address:

static int remove_session_caps_cb(struct inode *inode, struct ceph_cap *cap,
                                  void *arg)
{
   19f74:       48 89 5d d8             mov    %rbx,-0x28(%rbp)
   19f78:       4c 89 65 e0             mov    %r12,-0x20(%rbp)
   19f7c:       48 89 fb                mov    %rdi,%rbx
   19f7f:       4c 89 6d e8             mov    %r13,-0x18(%rbp)
   19f83:       4c 89 75 f0             mov    %r14,-0x10(%rbp)
   19f87:       49 89 f4                mov    %rsi,%r12
   19f8a:       4c 89 7d f8             mov    %r15,-0x8(%rbp)
        struct ceph_inode_info *ci = ceph_inode(inode);
        int drop = 0;

        dout("removing cap %p, ci is %p, inode is %p\n",
   19f8e:       0f 85 f9 01 00 00       jne    1a18d <remove_session_caps_cb+0x22d>
        raw_spin_lock_init(&(_lock)->rlock);            \
} while (0)

static inline void spin_lock(spinlock_t *lock)
{
        raw_spin_lock(&lock->rlock);
   19f94:       4c 8d ab b0 fb ff ff    lea    -0x450(%rbx),%r13
   19f9b:       4c 89 ef                mov    %r13,%rdi
   19f9e:       e8 00 00 00 00          callq  19fa3 <remove_session_caps_cb+0x43>
                        19f9f: R_X86_64_PC32    _raw_spin_lock-0x4
             cap, ci, &ci->vfs_inode);
        spin_lock(&ci->i_ceph_lock);

full objdump is attached

#11 Updated by Sage Weil over 10 years ago

registers:

[0]kdb> rd
ax: 0000000000000000  bx: ffff88022310f800  cx: 0000000000003332
dx: ffffffffa079bcf4  si: ffffffffa079bcf4  di: ffff88022310fe20
bp: ffff88020da1bb08  sp: ffff88020da1bae8  r8: 0000000000000002
r9: 0000000000000001  r10: 0000000000000000  r11: 0000000000000000
r12: ffff88022310f800  r13: ffff88017a887000  r14: ffff88020bd41560
r15: 0000000000000003  ip: ffffffffa079cde4  flags: 00010202  cs: 00000010
ss: 00000018  ds: 00000018  es: 00000018  fs: 00000018  gs: 00000018

#12 Updated by Zheng Yan over 10 years ago

  • File 0001-ceph-fix-freeing-inode-vs-removing-sessioncaps-race.patch added

I think BUG_ON(session->s_nr_caps > 0) caused the crash. (It looks like kdb traps the undefined instruction and prevents the BUG_ON message from showing.) One possible explanation for "session->s_nr_caps > 0" is that iterate_session_caps() skipped some I_FREEING/I_WILL_FREE inodes. Please try the attached patch.
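
The suspected failure mode can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code and not the attached patch: the toy_* types and functions are hypothetical stand-ins, and it assumes the iteration skips any cap whose inode can no longer be grabbed because it is being freed.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CAPS 16

/* Toy stand-ins for the kernel objects; NOT the real struct layouts. */
struct toy_inode { bool freeing; };   /* models I_FREEING/I_WILL_FREE */
struct toy_cap { struct toy_inode *inode; bool present; };
struct toy_session {
    struct toy_cap caps[MAX_CAPS];
    int nr_slots;    /* slots ever used in caps[] */
    int s_nr_caps;   /* caps currently held by the session */
};

/* Models taking an inode reference: fails for an inode being freed. */
static bool toy_igrab(const struct toy_inode *inode)
{
    return !inode->freeing;
}

void toy_session_add_cap(struct toy_session *s, struct toy_inode *inode)
{
    assert(s->nr_slots < MAX_CAPS);
    s->caps[s->nr_slots].inode = inode;
    s->caps[s->nr_slots].present = true;
    s->nr_slots++;
    s->s_nr_caps++;
}

/*
 * Walk the session's caps and remove each one, skipping caps whose
 * inode cannot be grabbed. Returns the number of caps left behind;
 * in the scenario described above, a nonzero result is the condition
 * that BUG_ON(session->s_nr_caps > 0) would trip on.
 */
int toy_remove_session_caps(struct toy_session *s)
{
    for (int i = 0; i < s->nr_slots; i++) {
        struct toy_cap *cap = &s->caps[i];

        if (!cap->present)
            continue;
        if (!toy_igrab(cap->inode))
            continue;   /* inode is being freed: cap is skipped */
        cap->present = false;
        s->s_nr_caps--;
    }
    return s->s_nr_caps;
}
```

In this model, adding caps for two live inodes and one freeing inode and then calling toy_remove_session_caps() leaves one cap behind, which corresponds to the "session->s_nr_caps > 0" state described above.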

#13 Updated by Zheng Yan over 10 years ago

  • File deleted (0001-ceph-fix-freeing-inode-vs-removing-sessioncaps-race.patch)

#15 Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to High

#16 Updated by Zheng Yan over 10 years ago

  • Status changed from 12 to 7

#17 Updated by Sage Weil over 10 years ago

  • Status changed from 7 to Resolved

#18 Updated by Greg Farnum over 7 years ago

  • Component(FS) kceph added
