Bug #5429

libceph: rcu stall, null deref in osd_reset->__reset_osd->__remove_osd

Added by Sage Weil almost 11 years ago. Updated about 9 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: rbd
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

<1>[19828.585548] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[19828.593437] IP: [<ffffffff813185cb>] rb_erase+0x1bb/0x370
<4>[19828.598865] PGD 0 
<4>[19828.600899] Oops: 0002 [#1] SMP 
[dumpcommon]kdb>   -bt

Stack traceback for pid 29967
0xffff88020dd03f20    29967        2  1    4   R  0xffff88020dd043a8 *kworker/4:1
 ffff88020b257b48 0000000000000018 0000000000000000 ffff88020b257b68
 ffffffffa05487bc ffff8802204e4000 ffff880224ec7950 ffff88020b257b98
 ffffffffa0548abf ffff8802204e4030 ffff880224ec7950 0000000000000000
Call Trace:
 [<ffffffffa05487bc>] ? __remove_osd+0x3c/0xa0 [libceph]
 [<ffffffffa0548abf>] ? __reset_osd+0x12f/0x170 [libceph]
 [<ffffffffa054a6de>] ? osd_reset+0x7e/0x2b0 [libceph]
 [<ffffffffa0541e21>] ? con_work+0x571/0x2d50 [libceph]
 [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff810605bc>] ? worker_thread+0x11c/0x370
 [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
 [<ffffffff8106727a>] ? kthread+0xea/0xf0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
 [<ffffffff8163ff9c>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
[dumpall]kdb>   -bta


but preceded 7 seconds earlier by:
<4>[19778.015116] libceph: osd0 10.214.132.16:6801 socket closed (con state CONNECTING)
<3>[19799.355399] INFO: rcu_sched self-detected stall on CPU { 6}  (t=2100 jiffies g=245350 c=245349 q=2640)
<4>[19799.364789] CPU: 6 PID: 19284 Comm: kworker/6:2 Tainted: G        W    3.10.0-rc6-ceph-00091-g2dd322b #1
<4>[19799.374303] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
<3>[19799.375424] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=2102 jiffies, g=245350, c=245349, q=2640)
<6>[19799.375425] Task dump for CPU 6:
<6>[19799.375429] kworker/6:2     R  running task        0 19284      2 0x00000000
<6>[19799.375442] Workqueue: ceph-msgr con_work [libceph]
<4>[19799.375445]  ffff880125ff7de8 ffffffff8105f3da ffffffff8105f36f ffff8802272d3a00
<4>[19799.375447]  0000000000000000 00000006272d2f98 ffff880125ff7fd8 ffff8802272d2f80
<4>[19799.375450]  ffffffffa05666d0 0000000000000000 0000000000000000 ffffffffa055940e
<4>[19799.375451] Call Trace:
<4>[19799.375457]  [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
<4>[19799.375459]  [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
<4>[19799.375462]  [<ffffffff810605bc>] worker_thread+0x11c/0x370
<4>[19799.375464]  [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
<4>[19799.375468]  [<ffffffff8106727a>] kthread+0xea/0xf0
<4>[19799.375471]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<4>[19799.375476]  [<ffffffff8163ff9c>] ret_from_fork+0x7c/0xb0
<4>[19799.375478]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<4>[19799.480150] Workqueue: ceph-msgr con_work [libceph]
<4>[19799.485054]  ffffffff81c4ca00 ffff8802272c3db8 ffffffff81630b82 ffff8802272c3e38
<4>[19799.492523]  ffffffff810e285a 0000000000000006 ffff8802272cd4e0 ffff8802272c3de8
<4>[19799.499995]  ffffffff810e644c 0000000000000086 0000000000000001 0000000000000006
<4>[19799.507469] Call Trace:
<4>[19799.509929]  <IRQ>  [<ffffffff81630b82>] dump_stack+0x19/0x1b
<4>[19799.515718]  [<ffffffff810e285a>] rcu_check_callbacks+0x21a/0x710
<4>[19799.521831]  [<ffffffff810e644c>] ? acct_account_cputime+0x1c/0x20
<4>[19799.528034]  [<ffffffff81050f68>] update_process_times+0x48/0x80
<4>[19799.534062]  [<ffffffff8109b616>] tick_sched_handle.isra.10+0x36/0x50
<4>[19799.540524]  [<ffffffff8109b71c>] tick_sched_timer+0x4c/0x80
<4>[19799.546203]  [<ffffffff8106a841>] __run_hrtimer+0x81/0x1e0
<4>[19799.551709]  [<ffffffff8109b6d0>] ? tick_nohz_handler+0xa0/0xa0
<4>[19799.557647]  [<ffffffff8106b147>] hrtimer_interrupt+0x107/0x260
<4>[19799.563588]  [<ffffffff81641b69>] smp_apic_timer_interrupt+0x69/0x99
<4>[19799.569964]  [<ffffffff81640caf>] apic_timer_interrupt+0x6f/0x80
<4>[19799.575987]  <EOI>  [<ffffffff8112edec>] ? shrink_inactive_list+0x18c/0x400
<4>[19799.582995]  [<ffffffff81637590>] ? _raw_spin_unlock_irq+0x30/0x40
<4>[19799.589195]  [<ffffffff81637595>] ? _raw_spin_unlock_irq+0x35/0x40
<4>[19799.595397]  [<ffffffff8112edec>] shrink_inactive_list+0x18c/0x400
<4>[19799.601595]  [<ffffffff8112f66d>] shrink_lruvec+0x2cd/0x4d0
<4>[19799.607187]  [<ffffffff8119855b>] ? bdi_queue_work+0x8b/0xf0
<4>[19799.612869]  [<ffffffff8112fc1c>] do_try_to_free_pages+0x11c/0x3a0
<4>[19799.619068]  [<ffffffff81130066>] try_to_free_pages+0xd6/0x1b0
<4>[19799.624922]  [<ffffffff811375b0>] ? next_zone+0x30/0x40
<4>[19799.630165]  [<ffffffff81125406>] __alloc_pages_nodemask+0x596/0x8f0
<4>[19799.636541]  [<ffffffff8115bb1a>] alloc_pages_current+0xba/0x170
<4>[19799.642569]  [<ffffffff81516d3e>] sk_page_frag_refill+0x7e/0x130
<4>[19799.648593]  [<ffffffff8156f5a5>] tcp_sendmsg+0x305/0xe50
<4>[19799.654010]  [<ffffffff8159af99>] inet_sendmsg+0xb9/0xf0
<4>[19799.659339]  [<ffffffff8159aee5>] ? inet_sendmsg+0x5/0xf0
<4>[19799.664760]  [<ffffffff81510de2>] sock_sendmsg+0xc2/0xe0
<4>[19799.670090]  [<ffffffff812ee35b>] ? chksum_update+0x1b/0x30
<4>[19799.675686]  [<ffffffff812ea1e8>] ? crypto_shash_update+0x18/0x30
<4>[19799.681814]  [<ffffffffa0000056>] ? crc32c+0x56/0x7c [libcrc32c]
<4>[19799.687842]  [<ffffffff81510e40>] kernel_sendmsg+0x40/0x60
<4>[19799.693353]  [<ffffffffa05424d8>] con_work+0xc28/0x2d50 [libceph]
<4>[19799.699468]  [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
<4>[19799.705145]  [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
<4>[19799.711257]  [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
<4>[19799.717371]  [<ffffffff8105f3da>] process_one_work+0x1da/0x540
<4>[19799.723220]  [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
<4>[19799.729247]  [<ffffffff810605bc>] worker_thread+0x11c/0x370
<4>[19799.734840]  [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
<4>[19799.741388]  [<ffffffff8106727a>] kthread+0xea/0xf0
<4>[19799.746283]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<4>[19799.752658]  [<ffffffff8163ff9c>] ret_from_fork+0x7c/0xb0
<4>[19799.758076]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<3>[19820.788280] INFO: rcu_sched self-detected stall on CPU
<3>[19820.788286] INFO: rcu_sched self-detected stall on CP
<4>[19820.788287]  

The job was:
ubuntu@teuthology:/a/teuthology-2013-06-22_01:00:51-kernel-next-testing-basic/42857$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        filestore flush min: 0
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  install:
    ceph:
      sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  s3tests:
    branch: next
  workunit:
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh


Files

dump2.txt (141 KB), Sage Weil, 06/23/2013 10:23 AM
dump3.txt (144 KB), Sage Weil, 06/28/2013 10:51 AM
Actions #1

Updated by Sage Weil almost 11 years ago

leaving plana56 in kdb

Actions #2

Updated by Ian Colle almost 11 years ago

  • Assignee set to Josh Durgin
Actions #3

Updated by Sage Weil almost 11 years ago

Hit this again: ubuntu@teuthology:/a/teuthology-2013-06-28_01:01:07-kernel-master-testing-basic/48683

Actions #4

Updated by Sage Weil almost 11 years ago

plana72 still sitting in kdb.

Actions #5

Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Sage Weil over 10 years ago

Hit this again: ubuntu@teuthology:/a/teuthology-2013-08-14_01:01:26-kcephfs-next-testing-basic-plana/106215

it was here:

static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
    a78d:       48 89 e5                mov    %rsp,%rbp
    a790:       41 54                   push   %r12
    a792:       49 89 fc                mov    %rdi,%r12
    a795:       53                      push   %rbx
    a796:       48 89 f3                mov    %rsi,%rbx
        dout("__remove_osd %p\n", osd);
    a799:       75 61                   jne    a7fc <__remove_osd+0x7c>
        BUG_ON(!list_empty(&osd->o_requests));
    a79b:       48 8d 83 38 05 00 00    lea    0x538(%rbx),%rax
    a7a2:       48 39 83 38 05 00 00    cmp    %rax,0x538(%rbx)
    a7a9:       75 6b                   jne    a816 <__remove_osd+0x96>
        rb_erase(&osd->o_node, &osdc->osds);
    a7ab:       49 8d b4 24 60 01 00    lea    0x160(%r12),%rsi
    a7b2:       00 
    a7b3:       48 8d 7b 18             lea    0x18(%rbx),%rdi
    a7b7:       e8 00 00 00 00          callq  a7bc <__remove_osd+0x3c>
                        a7b8: R_X86_64_PC32     rb_erase+0xfffffffffffffffc
 * in an undefined state.
 */
#ifndef CONFIG_DEBUG_LIST
static inline void __list_del_entry(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
    a7bc:       48 8b 8b 58 05 00 00    mov    0x558(%rbx),%rcx
^^^^^^^^^^^^^^^^^
    a7c3:       48 8b 93 60 05 00 00    mov    0x560(%rbx),%rdx
        list_del_init(&osd->o_osd_lru);
    a7ca:       48 8d 83 58 05 00 00    lea    0x558(%rbx),%rax
        ceph_con_close(&osd->o_con);
    a7d1:       48 8d 7b 30             lea    0x30(%rbx),%rdi
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
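
Pieced together, the source lines interleaved in the objdump excerpt above correspond to roughly this function (a reconstruction of the ~3.10-era net/ceph/osd_client.c; the excerpt's tail is cut off, so later statements are omitted):

static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
	dout("__remove_osd %p\n", osd);
	BUG_ON(!list_empty(&osd->o_requests));
	rb_erase(&osd->o_node, &osdc->osds);	/* faulting call; its return address
						   is the __remove_osd+0x3c frame in
						   the backtraces above */
	list_del_init(&osd->o_osd_lru);		/* the 0x558(%rbx) load marked ^^^ */
	ceph_con_close(&osd->o_con);
	/* ... */
}

The oops itself is inside rb_erase() (rb_erase+0x1bb, NULL write), i.e. the rbtree rebalance walked a NULL parent/child pointer, consistent with &osd->o_node having already been erased from &osdc->osds.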

Actions #7

Updated by Sage Weil over 10 years ago

<6>[17485.734714]  rbd1: unknown partition table
<4>[17485.735740] libceph: mon2 10.214.132.4:6790 socket closed (con state OPEN)
<6>[17485.735759] libceph: mon2 10.214.132.4:6790 session lost, hunting for new mon
<6>[17485.737794] libceph: mon2 10.214.132.4:6790 session established
<6>[17485.759013] rbd: rbd1: added with size 0x40000000
<4>[17485.858921] libceph: osd2 10.214.132.4:6808 socket closed (con state OPEN)
<6>[17485.966118] libceph: client4411 fsid 1af5918f-c950-454f-9769-f3b857fac855
<6>[17485.974997] libceph: mon1 10.214.132.38:6789 session established
<6>[17486.017942]  rbd1: unknown partition table
<6>[17486.022331] rbd: rbd1: added with size 0x40000000
<4>[17486.199275] libceph: osd3 10.214.132.38:6809 socket closed (con state OPEN)
<6>[17486.233630] libceph: client4445 fsid 1af5918f-c950-454f-9769-f3b857fac855
<6>[17486.242439] libceph: mon2 10.214.132.4:6790 session established
<6>[17486.277232]  rbd1: unknown partition table
<6>[17486.281544] rbd: rbd1: added with size 0x40000000
<4>[17486.331466] libceph: osd2 10.214.132.4:6808 socket closed (con state OPEN)
<4>[17486.381566] libceph: osd2 10.214.132.4:6808 socket closed (con state OPEN)
...
[2]kdb> bt
Stack traceback for pid 25803
0xffff8802238dbf20    25803    13496  1    2   R  0xffff8802238dc3a8 *rbd
 ffff88020d87bd68 0000000000000018 ffff8801b5ff7950 ffff88020d87bd88
 ffffffffa05fe7bc ffff8801b5ff7950 ffff8801b5ff7ab0 ffff88020d87bdb8
 ffffffffa0602d24 ffff88012c8f16c0 ffff8801b5ff7000 ffff88012c8f16c0
Call Trace:
 [<ffffffffa05fe7bc>] ? __remove_osd+0x3c/0xa0 [libceph]
 [<ffffffffa0602d24>] ? ceph_osdc_stop+0xa4/0x110 [libceph]
 [<ffffffffa05f4790>] ? ceph_destroy_client+0x30/0xa0 [libceph]
 [<ffffffffa022fb41>] ? rbd_client_release+0x71/0xb0 [rbd]
 [<ffffffffa0230798>] ? rbd_put_client+0x28/0x30 [rbd]
 [<ffffffffa02307ba>] ? rbd_dev_destroy+0x1a/0x40 [rbd]
 [<ffffffffa023083b>] ? rbd_dev_image_release+0x5b/0x70 [rbd]
 [<ffffffffa0231095>] ? rbd_remove+0x155/0x180 [rbd]
 [<ffffffff81407187>] ? bus_attr_store+0x27/0x30
 [<ffffffff811f2d66>] ? sysfs_write_file+0xe6/0x170
 [<ffffffff8117feae>] ? vfs_write+0xce/0x200
 [<ffffffff8119cf0c>] ? fget_light+0x3c/0x130
 [<ffffffff811803b5>] ? SyS_write+0x55/0xa0
 [<ffffffff81653782>] ? system_call_fastpath+0x16/0x1b

ubuntu@teuthology:/a/teuthology-2013-09-02_01:01:32-krbd-master-testing-basic-plana/17253$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 263cbbcaf605e359a46e30889595d82629f82080
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 1c5e58a85ef7f26b2c617ecb6c08de5632bb0fe3
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 1c5e58a85ef7f26b2c617ecb6c08de5632bb0fe3
  s3tests:
    branch: master
  workunit:
    sha1: 1c5e58a85ef7f26b2c617ecb6c08de5632bb0fe3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- workunit:
    clients:
      all:
      - rbd/map-unmap.sh
teuthology_branch: master
ubuntu@teuthology:/a/teuthology-2013-09-02_01:01:32-krbd-master-testing-basic-plana/17253$ 
Actions #8

Updated by Sage Weil over 10 years ago

  • Status changed from New to Duplicate

I think/hope this is a duplicate of the async notify racing with shutdown issue.

Actions #9

Updated by Josh Durgin over 9 years ago

  • Status changed from Duplicate to 12

Got reports of the 2nd trace (http://tracker.ceph.com/issues/5429#note-7) occurring on a kernel with the notify fixes.

Actions #10

Updated by Josh Durgin over 9 years ago

  • Project changed from rbd to Linux kernel client
  • Category set to rbd
  • Assignee deleted (Josh Durgin)
Actions #11

Updated by Ilya Dryomov over 9 years ago

  • Assignee set to Ilya Dryomov

I bet there is another trace of this somewhere: no rcu stall, just a plain NULL deref in rb_erase(). Will try to investigate.

Actions #12

Updated by JuanJose Galvez over 9 years ago

Is there anything that should be gathered from the cluster currently displaying this issue that could help?

Actions #13

Updated by Ilya Dryomov over 9 years ago

If it's crashed again, a full dmesg and a tail (say, the last 5-10 minutes before the crash) of the osd/messenger logs would help.

Actions #14

Updated by Ilya Dryomov over 9 years ago

And if it hasn't, the same (or at least a full dmesg) from the previous crash won't hurt, if you still have it around.

Actions #15

Updated by Ilya Dryomov over 9 years ago

  • Status changed from 12 to Resolved

What Josh got a report of was not the trace referenced above, but the
following (pulled from the vmcore):

[197102.902802] ------------[ cut here ]------------
[197102.903670] kernel BUG at /builddir/build/BUILD/ceph-3.10-dc9ac62/net/ceph//osd_client.c:1003!
[197102.904553] invalid opcode: 0000 [#1] SMP
[197102.905393] Modules linked in: fuse btrfs zlib_deflate raid6_pq xor vfat msdos fat xfs bridge stp llc xt_nat xt_REDIRECT rbd(OF) libceph(OF) ip6table_filter ip6_tables sg openvswitch vxlan ip_tunnel gre libcrc32c ipt_REJECT xt_comment xt_conntrack xt_multiport iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables iTCO_wdt iTCO_vendor_support ipmi_devintf coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nfsd aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core shpchp wmi ipmi_si ipmi_msghandler mperf acpi_power_meter auth_rpcgss nfs_acl lockd sunrpc binfmt_misc dm_multipath ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect
[197102.911118] sysimgblt i2c_algo_bit drm_kms_helper ttm drm i2c_core enic megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[197102.913242] CPU: 4 PID: 18929 Comm: rbd Tainted: GF O-------------- 3.10.0-123.8.1.el7.x86_64 #1
[197102.914487] Hardware name: Cisco Systems Inc UCSB-B200-M3/UCSB-B200-M3, BIOS B200M3.2.2.2.0.042820141643 04/28/2014
[197102.915959] task: ffff882f75fe38e0 ti: ffff882f60e24000 task.ti: ffff882f60e24000
[197102.917434] RIP: 0010:[<ffffffffa0448dc9>] [<ffffffffa0448dc9>] __remove_osd+0x89/0x90 [libceph]
[197102.918961] RSP: 0018:ffff882f60e25da0 EFLAGS: 00010206
[197102.920408] RAX: ffff885ea5043ca0 RBX: ffff885ea5043800 RCX: 0000000180190011
[197102.921488] RDX: 0000000000000000 RSI: ffff885ea5043800 RDI: ffff880036837768
[197102.922561] RBP: ffff882f60e25db0 R08: ffff882de51caf80 R09: 0000000180190011
[197102.923636] R10: ffffffff814b65af R11: ffffea00b7947200 R12: ffff880036837768
[197102.924709] R13: ffff8800368377c0 R14: 0000000000000000 R15: 0000000000000000
[197102.925782] FS: 00007fae849447c0(0000) GS:ffff882fbfc80000(0000) knlGS:0000000000000000
[197102.926863] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[197102.927936] CR2: 00007f3af4745f20 CR3: 0000002f61bdf000 CR4: 00000000001407e0
[197102.929038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[197102.930075] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[197102.931133] Stack:
[197102.932198] ffff880036837768 ffff8800368377e8 ffff882f60e25dd8 ffffffffa044d2a4
[197102.933306] ffff880036837000 ffff885ecd83ef00 0000000000000001 ffff882f60e25df0
[197102.934390] ffffffffa043e82c ffff885ecd83ef08 ffff882f60e25e10 ffffffffa048d3e6
[197102.935474] Call Trace:
[197102.936547] [<ffffffffa044d2a4>] ceph_osdc_stop+0x94/0x100 [libceph]
[197102.937633] [<ffffffffa043e82c>] ceph_destroy_client+0x2c/0xa0 [libceph]
[197102.938709] [<ffffffffa048d3e6>] rbd_client_release+0x46/0x80 [rbd]
[197102.939809] [<ffffffffa048e705>] rbd_dev_destroy+0x65/0x70 [rbd]
[197102.940875] [<ffffffffa048e9a7>] rbd_dev_image_release+0x57/0x60 [rbd]
[197102.941946] [<ffffffffa048fe43>] do_rbd_remove.isra.33+0x163/0x1f0 [rbd]
[197102.943050] [<ffffffffa048ff14>] rbd_remove+0x24/0x30 [rbd]
[197102.944110] [<ffffffff813b41a7>] bus_attr_store+0x27/0x30
[197102.945166] [<ffffffff81225286>] sysfs_write_file+0xc6/0x140
[197102.946232] [<ffffffff811af6dd>] vfs_write+0xbd/0x1e0
[197102.947346] [<ffffffff811b0128>] SyS_write+0x58/0xb0
[197102.948420] [<ffffffff815f2a59>] system_call_fastpath+0x16/0x1b
[197102.949473] Code: 2e 97 ff ff 48 89 df e8 06 ff ff ff 5b 41 5c 5d c3 48 89 f2 48 c7 c7 f8 7f 46 a0 48 c7 c6 be c6 45 a0 31 c0 e8 b9 fe e8 e0 eb 92 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 55 f6 05 8d f2 01 00 04 48
[197102.951679] RIP [<ffffffffa0448dc9>] __remove_osd+0x89/0x90 [libceph]
[197102.952791] RSP <ffff882f60e25da0>

This is a

BUG_ON(!list_empty(&osd->o_requests));

in __remove_osd() in our original rhel7 kmod (dc9ac62e1e1a, rhel7
branch @ github).

The vmcore showed that o_requests had a single entry on it, which turned
out to be a lingering request that had been requeued due to a connection
reset and half resent. The request structures were completely messed up
because rbd unmap had unregistered the requeued request with
__unregister_linger_request(). This (request cancellation) was fixed
upstream a while ago, and the fixes are also in the updated kmod.
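
To make the failure mode concrete, a sketch in terms of the 3.10-era osd_client names (an illustration of the description above, not the actual patch):

/* After the connection reset, the lingering request is requeued for
 * resend, which links it back onto osd->o_requests via req->r_osd_item. */

/* rbd unmap then tore it down through the linger-only path: */
__unregister_linger_request(osdc, req);	/* unlinks the linger bookkeeping only */

/* req->r_osd_item is still on osd->o_requests, so a later __remove_osd()
 * trips BUG_ON(!list_empty(&osd->o_requests)).  The upstream cancellation
 * fixes also take the requeued request off o_requests, i.e. effectively: */
__unregister_request(osdc, req);	/* unlinks r_osd_item as well */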

Actions #16

Updated by Markus Blank-Burian about 9 years ago

During some tests, I stumbled upon this bug in rb_erase(), triggered via osd_reset() -> __reset_osd() -> __remove_osd(). However, I was not using rbd but cephfs, with kernel v3.14.28 plus the patches mentioned in #10449 and #10450. The bug was triggered by restarting all OSDs of our cluster simultaneously.

Actions #17

Updated by Ilya Dryomov about 9 years ago

This ticket has it mixed up with another issue; we are tracking the rb_erase() crash in #8087.
I'll post your comment and reply there.
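
For anyone landing here later: the rb_erase() crash class tracked in #8087 comes from __remove_osd() being reachable twice for the same osd, and the upstream fix made removal idempotent, roughly along these lines (a sketch of the fix's shape, inferred from the #8087 discussion; see that ticket for the authoritative patch):

static void remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
	/* rb_erase() on a node that is no longer in the tree walks stale
	 * parent/child pointers and can deref NULL, so guard the erase */
	if (!RB_EMPTY_NODE(&osd->o_node)) {
		rb_erase(&osd->o_node, &osdc->osds);
		RB_CLEAR_NODE(&osd->o_node);	/* mark removed for later callers */
	}
}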
