Bug #5429

libceph: rcu stall, null deref in osd_reset->__reset_osd->__remove_osd

Added by Sage Weil almost 11 years ago. Updated about 9 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: rbd
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

<1>[19828.585548] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[19828.593437] IP: [<ffffffff813185cb>] rb_erase+0x1bb/0x370
<4>[19828.598865] PGD 0 
<4>[19828.600899] Oops: 0002 [#1] SMP 
[dumpcommon]kdb>   -bt

Stack traceback for pid 29967
0xffff88020dd03f20    29967        2  1    4   R  0xffff88020dd043a8 *kworker/4:1
 ffff88020b257b48 0000000000000018 0000000000000000 ffff88020b257b68
 ffffffffa05487bc ffff8802204e4000 ffff880224ec7950 ffff88020b257b98
 ffffffffa0548abf ffff8802204e4030 ffff880224ec7950 0000000000000000
Call Trace:
 [<ffffffffa05487bc>] ? __remove_osd+0x3c/0xa0 [libceph]
 [<ffffffffa0548abf>] ? __reset_osd+0x12f/0x170 [libceph]
 [<ffffffffa054a6de>] ? osd_reset+0x7e/0x2b0 [libceph]
 [<ffffffffa0541e21>] ? con_work+0x571/0x2d50 [libceph]
 [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff810605bc>] ? worker_thread+0x11c/0x370
 [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
 [<ffffffff8106727a>] ? kthread+0xea/0xf0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
 [<ffffffff8163ff9c>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
[dumpall]kdb>   -bta


but preceded 7 seconds earlier by:
<4>[19778.015116] libceph: osd0 10.214.132.16:6801 socket closed (con state CONNECTING)
<3>[19799.355399] INFO: rcu_sched self-detected stall on CPU { 6}  (t=2100 jiffies g=245350 c=245349 q=2640)
<4>[19799.364789] CPU: 6 PID: 19284 Comm: kworker/6:2 Tainted: G        W    3.10.0-rc6-ceph-00091-g2dd322b #1
<4>[19799.374303] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
<3>[19799.375424] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=2102 jiffies, g=245350, c=245349, q=2640)
<6>[19799.375425] Task dump for CPU 6:
<6>[19799.375429] kworker/6:2     R  running task        0 19284      2 0x00000000
<6>[19799.375442] Workqueue: ceph-msgr con_work [libceph]
<4>[19799.375445]  ffff880125ff7de8 ffffffff8105f3da ffffffff8105f36f ffff8802272d3a00
<4>[19799.375447]  0000000000000000 00000006272d2f98 ffff880125ff7fd8 ffff8802272d2f80
<4>[19799.375450]  ffffffffa05666d0 0000000000000000 0000000000000000 ffffffffa055940e
<4>[19799.375451] Call Trace:
<4>[19799.375457]  [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
<4>[19799.375459]  [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
<4>[19799.375462]  [<ffffffff810605bc>] worker_thread+0x11c/0x370
<4>[19799.375464]  [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
<4>[19799.375468]  [<ffffffff8106727a>] kthread+0xea/0xf0
<4>[19799.375471]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<4>[19799.375476]  [<ffffffff8163ff9c>] ret_from_fork+0x7c/0xb0
<4>[19799.375478]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<4>[19799.480150] Workqueue: ceph-msgr con_work [libceph]
<4>[19799.485054]  ffffffff81c4ca00 ffff8802272c3db8 ffffffff81630b82 ffff8802272c3e38
<4>[19799.492523]  ffffffff810e285a 0000000000000006 ffff8802272cd4e0 ffff8802272c3de8
<4>[19799.499995]  ffffffff810e644c 0000000000000086 0000000000000001 0000000000000006
<4>[19799.507469] Call Trace:
<4>[19799.509929]  <IRQ>  [<ffffffff81630b82>] dump_stack+0x19/0x1b
<4>[19799.515718]  [<ffffffff810e285a>] rcu_check_callbacks+0x21a/0x710
<4>[19799.521831]  [<ffffffff810e644c>] ? acct_account_cputime+0x1c/0x20
<4>[19799.528034]  [<ffffffff81050f68>] update_process_times+0x48/0x80
<4>[19799.534062]  [<ffffffff8109b616>] tick_sched_handle.isra.10+0x36/0x50
<4>[19799.540524]  [<ffffffff8109b71c>] tick_sched_timer+0x4c/0x80
<4>[19799.546203]  [<ffffffff8106a841>] __run_hrtimer+0x81/0x1e0
<4>[19799.551709]  [<ffffffff8109b6d0>] ? tick_nohz_handler+0xa0/0xa0
<4>[19799.557647]  [<ffffffff8106b147>] hrtimer_interrupt+0x107/0x260
<4>[19799.563588]  [<ffffffff81641b69>] smp_apic_timer_interrupt+0x69/0x99
<4>[19799.569964]  [<ffffffff81640caf>] apic_timer_interrupt+0x6f/0x80
<4>[19799.575987]  <EOI>  [<ffffffff8112edec>] ? shrink_inactive_list+0x18c/0x400
<4>[19799.582995]  [<ffffffff81637590>] ? _raw_spin_unlock_irq+0x30/0x40
<4>[19799.589195]  [<ffffffff81637595>] ? _raw_spin_unlock_irq+0x35/0x40
<4>[19799.595397]  [<ffffffff8112edec>] shrink_inactive_list+0x18c/0x400
<4>[19799.601595]  [<ffffffff8112f66d>] shrink_lruvec+0x2cd/0x4d0
<4>[19799.607187]  [<ffffffff8119855b>] ? bdi_queue_work+0x8b/0xf0
<4>[19799.612869]  [<ffffffff8112fc1c>] do_try_to_free_pages+0x11c/0x3a0
<4>[19799.619068]  [<ffffffff81130066>] try_to_free_pages+0xd6/0x1b0
<4>[19799.624922]  [<ffffffff811375b0>] ? next_zone+0x30/0x40
<4>[19799.630165]  [<ffffffff81125406>] __alloc_pages_nodemask+0x596/0x8f0
<4>[19799.636541]  [<ffffffff8115bb1a>] alloc_pages_current+0xba/0x170
<4>[19799.642569]  [<ffffffff81516d3e>] sk_page_frag_refill+0x7e/0x130
<4>[19799.648593]  [<ffffffff8156f5a5>] tcp_sendmsg+0x305/0xe50
<4>[19799.654010]  [<ffffffff8159af99>] inet_sendmsg+0xb9/0xf0
<4>[19799.659339]  [<ffffffff8159aee5>] ? inet_sendmsg+0x5/0xf0
<4>[19799.664760]  [<ffffffff81510de2>] sock_sendmsg+0xc2/0xe0
<4>[19799.670090]  [<ffffffff812ee35b>] ? chksum_update+0x1b/0x30
<4>[19799.675686]  [<ffffffff812ea1e8>] ? crypto_shash_update+0x18/0x30
<4>[19799.681814]  [<ffffffffa0000056>] ? crc32c+0x56/0x7c [libcrc32c]
<4>[19799.687842]  [<ffffffff81510e40>] kernel_sendmsg+0x40/0x60
<4>[19799.693353]  [<ffffffffa05424d8>] con_work+0xc28/0x2d50 [libceph]
<4>[19799.699468]  [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
<4>[19799.705145]  [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
<4>[19799.711257]  [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
<4>[19799.717371]  [<ffffffff8105f3da>] process_one_work+0x1da/0x540
<4>[19799.723220]  [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
<4>[19799.729247]  [<ffffffff810605bc>] worker_thread+0x11c/0x370
<4>[19799.734840]  [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
<4>[19799.741388]  [<ffffffff8106727a>] kthread+0xea/0xf0
<4>[19799.746283]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<4>[19799.752658]  [<ffffffff8163ff9c>] ret_from_fork+0x7c/0xb0
<4>[19799.758076]  [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
<3>[19820.788280] INFO: rcu_sched self-detected stall on CPU
<3>[19820.788286] INFO: rcu_sched self-detected stall on CP
<4>[19820.788287]  

The job was:
ubuntu@teuthology:/a/teuthology-2013-06-22_01:00:51-kernel-next-testing-basic/42857$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        filestore flush min: 0
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  install:
    ceph:
      sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  s3tests:
    branch: next
  workunit:
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh


Files

dump2.txt (141 KB), Sage Weil, 06/23/2013 10:23 AM
dump3.txt (144 KB), Sage Weil, 06/28/2013 10:51 AM
Actions #1

Updated by Sage Weil almost 11 years ago

leaving plana56 in kdb

Actions #2

Updated by Ian Colle almost 11 years ago

  • Assignee set to Josh Durgin
Actions #3

Updated by Sage Weil almost 11 years ago

Hit this again: ubuntu@teuthology:/a/teuthology-2013-06-28_01:01:07-kernel-master-testing-basic/48683

Actions #4

Updated by Sage Weil almost 11 years ago

plana72 still sitting in kdb.

Actions #5

Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Sage Weil over 10 years ago

Hit this again: ubuntu@teuthology:/a/teuthology-2013-08-14_01:01:26-kcephfs-next-testing-basic-plana/106215

it was here:

static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
    a78d:       48 89 e5                mov    %rsp,%rbp
    a790:       41 54                   push   %r12
    a792:       49 89 fc                mov    %rdi,%r12
    a795:       53                      push   %rbx
    a796:       48 89 f3                mov    %rsi,%rbx
        dout("__remove_osd %p\n", osd);
    a799:       75 61                   jne    a7fc <__remove_osd+0x7c>
        BUG_ON(!list_empty(&osd->o_requests));
    a79b:       48 8d 83 38 05 00 00    lea    0x538(%rbx),%rax
    a7a2:       48 39 83 38 05 00 00    cmp    %rax,0x538(%rbx)
    a7a9:       75 6b                   jne    a816 <__remove_osd+0x96>
        rb_erase(&osd->o_node, &osdc->osds);
    a7ab:       49 8d b4 24 60 01 00    lea    0x160(%r12),%rsi
    a7b2:       00 
    a7b3:       48 8d 7b 18             lea    0x18(%rbx),%rdi
    a7b7:       e8 00 00 00 00          callq  a7bc <__remove_osd+0x3c>
                        a7b8: R_X86_64_PC32     rb_erase+0xfffffffffffffffc
 * in an undefined state.
 */
#ifndef CONFIG_DEBUG_LIST
static inline void __list_del_entry(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
    a7bc:       48 8b 8b 58 05 00 00    mov    0x558(%rbx),%rcx
^^^^^^^^^^^^^^^^^
    a7c3:       48 8b 93 60 05 00 00    mov    0x560(%rbx),%rdx
        list_del_init(&osd->o_osd_lru);
    a7ca:       48 8d 83 58 05 00 00    lea    0x558(%rbx),%rax
        ceph_con_close(&osd->o_con);
    a7d1:       48 8d 7b 30             lea    0x30(%rbx),%rdi
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
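
Pieced together, the source lines interleaved in the objdump excerpt above correspond to roughly this function (a reconstruction of the ~3.10-era net/ceph/osd_client.c; the excerpt's tail is cut off, so later statements are omitted):

static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
	dout("__remove_osd %p\n", osd);
	BUG_ON(!list_empty(&osd->o_requests));
	rb_erase(&osd->o_node, &osdc->osds);	/* faulting call; its return address
						   is the __remove_osd+0x3c frame in
						   the backtraces above */
	list_del_init(&osd->o_osd_lru);		/* the 0x558(%rbx) load marked ^^^ */
	ceph_con_close(&osd->o_con);
	/* ... */
}

The oops itself is inside rb_erase() (rb_erase+0x1bb, NULL write), i.e. the rbtree rebalance walked a NULL parent/child pointer, consistent with &osd->o_node having already been erased from &osdc->osds.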

Actions #7

Updated by Sage Weil over 10 years ago

<6>[17485.734714]  rbd1: unknown partition table
<4>[17485.735740] libceph: mon2 10.214.132.4:6790 socket closed (con state OPEN)
<6>[17485.735759] libceph: mon2 10.214.132.4:6790 session lost, hunting for new mon
<6>[17485.737794] libceph: mon2 10.214.132.4:6790 session established
<6>[17485.759013] rbd: rbd1: added with size 0x40000000
<4>[17485.858921] libceph: osd2 10.214.132.4:6808 socket closed (con state OPEN)
<6>[17485.966118] libceph: client4411 fsid 1af5918f-c950-454f-9769-f3b857fac855
<6>[17485.974997] libceph: mon1 10.214.132.38:6789 session established
<6>[17486.017942]  rbd1: unknown partition table
<6>[17486.022331] rbd: rbd1: added with size 0x40000000
<4>[17486.199275] libceph: osd3 10.214.132.38:6809 socket closed (con state OPEN)
<6>[17486.233630] libceph: client4445 fsid 1af5918f-c950-454f-9769-f3b857fac855
<6>[17486.242439] libceph: mon2 10.214.132.4:6790 session established
<6>[17486.277232]  rbd1: unknown partition table
<6>[17486.281544] rbd: rbd1: added with size 0x40000000
<4>[17486.331466] libceph: osd2 10.214.132.4:6808 socket closed (con state OPEN)
<4>[17486.381566] libceph: osd2 10.214.132.4:6808 socket closed (con state OPEN)
...
[2]kdb> bt
Stack traceback for pid 25803
0xffff8802238dbf20    25803    13496  1    2   R  0xffff8802238dc3a8 *rbd
 ffff88020d87bd68 0000000000000018 ffff8801b5ff7950 ffff88020d87bd88
 ffffffffa05fe7bc ffff8801b5ff7950 ffff8801b5ff7ab0 ffff88020d87bdb8
 ffffffffa0602d24 ffff88012c8f16c0 ffff8801b5ff7000 ffff88012c8f16c0
Call Trace:
 [<ffffffffa05fe7bc>] ? __remove_osd+0x3c/0xa0 [libceph]
 [<ffffffffa0602d24>] ? ceph_osdc_stop+0xa4/0x110 [libceph]
 [<ffffffffa05f4790>] ? ceph_destroy_client+0x30/0xa0 [libceph]
 [<ffffffffa022fb41>] ? rbd_client_release+0x71/0xb0 [rbd]
 [<ffffffffa0230798>] ? rbd_put_client+0x28/0x30 [rbd]
 [<ffffffffa02307ba>] ? rbd_dev_destroy+0x1a/0x40 [rbd]
 [<ffffffffa023083b>] ? rbd_dev_image_release+0x5b/0x70 [rbd]
 [<ffffffffa0231095>] ? rbd_remove+0x155/0x180 [rbd]
 [<ffffffff81407187>] ? bus_attr_store+0x27/0x30
 [<ffffffff811f2d66>] ? sysfs_write_file+0xe6/0x170
 [<ffffffff8117feae>] ? vfs_write+0xce/0x200
 [<ffffffff8119cf0c>] ? fget_light+0x3c/0x130
 [<ffffffff811803b5>] ? SyS_write+0x55/0xa0
 [<ffffffff81653782>] ? system_call_fastpath+0x16/0x1b

ubuntu@teuthology:/a/teuthology-2013-09-02_01:01:32-krbd-master-testing-basic-plana/17253$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 263cbbcaf605e359a46e30889595d82629f82080
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 1c5e58a85ef7f26b2c617ecb6c08de5632bb0fe3
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 1c5e58a85ef7f26b2c617ecb6c08de5632bb0fe3
  s3tests:
    branch: master
  workunit:
    sha1: 1c5e58a85ef7f26b2c617ecb6c08de5632bb0fe3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- workunit:
    clients:
      all:
      - rbd/map-unmap.sh
teuthology_branch: master
ubuntu@teuthology:/a/teuthology-2013-09-02_01:01:32-krbd-master-testing-basic-plana/17253$ 
Actions #8

Updated by Sage Weil over 10 years ago

  • Status changed from New to Duplicate

I think/hope this is a duplicate of the async notify racing with shutdown issue.

Actions #9

Updated by Josh Durgin over 9 years ago

  • Status changed from Duplicate to 12

Got reports of the 2nd trace (http://tracker.ceph.com/issues/5429#note-7) occurring on a kernel with the notify fixes.

Actions #10

Updated by Josh Durgin over 9 years ago

  • Project changed from rbd to Linux kernel client
  • Category set to rbd
  • Assignee deleted (Josh Durgin)
Actions #11

Updated by Ilya Dryomov over 9 years ago

  • Assignee set to Ilya Dryomov

I bet there is another trace of this somewhere: no rcu stall, just a plain NULL deref in rb_erase(). Will try to investigate.

Actions #12

Updated by JuanJose Galvez over 9 years ago

Is there anything that should be gathered from the cluster currently displaying this issue that could help?

Actions #13

Updated by Ilya Dryomov over 9 years ago

If it's crashed again, a full dmesg and a tail (say, the last 5-10 minutes before the crash) of the osd/messenger logs would help.

Actions #14

Updated by Ilya Dryomov over 9 years ago

And if it hasn't, the same (or at least a full dmesg) from the previous crash won't hurt, if you still have it around.

Actions #15

Updated by Ilya Dryomov over 9 years ago

  • Status changed from 12 to Resolved

What Josh got a report of was not the trace referenced above, but the
following (pulled from the vmcore):

[197102.902802] ------------[ cut here ]------------
[197102.903670] kernel BUG at /builddir/build/BUILD/ceph-3.10-dc9ac62/net/ceph//osd_client.c:1003!
[197102.904553] invalid opcode: 0000 [#1] SMP
[197102.905393] Modules linked in: fuse btrfs zlib_deflate raid6_pq xor vfat msdos fat xfs bridge stp llc xt_nat xt_REDIRECT rbd(OF) libceph(OF) ip6table_filter ip6_tables sg openvswitch vxlan ip_tunnel gre libcrc32c ipt_REJECT xt_comment xt_conntrack xt_multiport iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables iTCO_wdt iTCO_vendor_support ipmi_devintf coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nfsd aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core shpchp wmi ipmi_si ipmi_msghandler mperf acpi_power_meter auth_rpcgss nfs_acl lockd sunrpc binfmt_misc dm_multipath ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect
[197102.911118] sysimgblt i2c_algo_bit drm_kms_helper ttm drm i2c_core enic megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[197102.913242] CPU: 4 PID: 18929 Comm: rbd Tainted: GF O-------------- 3.10.0-123.8.1.el7.x86_64 #1
[197102.914487] Hardware name: Cisco Systems Inc UCSB-B200-M3/UCSB-B200-M3, BIOS B200M3.2.2.2.0.042820141643 04/28/2014
[197102.915959] task: ffff882f75fe38e0 ti: ffff882f60e24000 task.ti: ffff882f60e24000
[197102.917434] RIP: 0010:[<ffffffffa0448dc9>] [<ffffffffa0448dc9>] __remove_osd+0x89/0x90 [libceph]
[197102.918961] RSP: 0018:ffff882f60e25da0 EFLAGS: 00010206
[197102.920408] RAX: ffff885ea5043ca0 RBX: ffff885ea5043800 RCX: 0000000180190011
[197102.921488] RDX: 0000000000000000 RSI: ffff885ea5043800 RDI: ffff880036837768
[197102.922561] RBP: ffff882f60e25db0 R08: ffff882de51caf80 R09: 0000000180190011
[197102.923636] R10: ffffffff814b65af R11: ffffea00b7947200 R12: ffff880036837768
[197102.924709] R13: ffff8800368377c0 R14: 0000000000000000 R15: 0000000000000000
[197102.925782] FS: 00007fae849447c0(0000) GS:ffff882fbfc80000(0000) knlGS:0000000000000000
[197102.926863] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[197102.927936] CR2: 00007f3af4745f20 CR3: 0000002f61bdf000 CR4: 00000000001407e0
[197102.929038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[197102.930075] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[197102.931133] Stack:
[197102.932198] ffff880036837768 ffff8800368377e8 ffff882f60e25dd8 ffffffffa044d2a4
[197102.933306] ffff880036837000 ffff885ecd83ef00 0000000000000001 ffff882f60e25df0
[197102.934390] ffffffffa043e82c ffff885ecd83ef08 ffff882f60e25e10 ffffffffa048d3e6
[197102.935474] Call Trace:
[197102.936547] [<ffffffffa044d2a4>] ceph_osdc_stop+0x94/0x100 [libceph]
[197102.937633] [<ffffffffa043e82c>] ceph_destroy_client+0x2c/0xa0 [libceph]
[197102.938709] [<ffffffffa048d3e6>] rbd_client_release+0x46/0x80 [rbd]
[197102.939809] [<ffffffffa048e705>] rbd_dev_destroy+0x65/0x70 [rbd]
[197102.940875] [<ffffffffa048e9a7>] rbd_dev_image_release+0x57/0x60 [rbd]
[197102.941946] [<ffffffffa048fe43>] do_rbd_remove.isra.33+0x163/0x1f0 [rbd]
[197102.943050] [<ffffffffa048ff14>] rbd_remove+0x24/0x30 [rbd]
[197102.944110] [<ffffffff813b41a7>] bus_attr_store+0x27/0x30
[197102.945166] [<ffffffff81225286>] sysfs_write_file+0xc6/0x140
[197102.946232] [<ffffffff811af6dd>] vfs_write+0xbd/0x1e0
[197102.947346] [<ffffffff811b0128>] SyS_write+0x58/0xb0
[197102.948420] [<ffffffff815f2a59>] system_call_fastpath+0x16/0x1b
[197102.949473] Code: 2e 97 ff ff 48 89 df e8 06 ff ff ff 5b 41 5c 5d c3 48 89 f2 48 c7 c7 f8 7f 46 a0 48 c7 c6 be c6 45 a0 31 c0 e8 b9 fe e8 e0 eb 92 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 55 f6 05 8d f2 01 00 04 48
[197102.951679] RIP [<ffffffffa0448dc9>] __remove_osd+0x89/0x90 [libceph]
[197102.952791] RSP <ffff882f60e25da0>

This is a

BUG_ON(!list_empty(&osd->o_requests));

in __remove_osd() in our original rhel7 kmod (dc9ac62e1e1a, rhel7
branch @ github).

The vmcore showed that o_requests had a single entry on it, which turned
out to be a lingering request that had been requeued due to a connection
reset and half resent. The request structures were completely messed up
because rbd unmap had unregistered the requeued request with
__unregister_linger_request(). This (request cancellation) was fixed
upstream a while ago, and the fixes are also in the updated kmod.
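
To make the failure mode concrete, a sketch in terms of the 3.10-era osd_client names (an illustration of the description above, not the actual patch):

/* After the connection reset, the lingering request is requeued for
 * resend, which links it back onto osd->o_requests via req->r_osd_item. */

/* rbd unmap then tore it down through the linger-only path: */
__unregister_linger_request(osdc, req);	/* unlinks the linger bookkeeping only */

/* req->r_osd_item is still on osd->o_requests, so a later __remove_osd()
 * trips BUG_ON(!list_empty(&osd->o_requests)).  The upstream cancellation
 * fixes also take the requeued request off o_requests, i.e. effectively: */
__unregister_request(osdc, req);	/* unlinks r_osd_item as well */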

Actions #16

Updated by Markus Blank-Burian about 9 years ago

During some tests, I stumbled upon this bug in rb_erase(), triggered via osd_reset() -> __reset_osd() -> __remove_osd(). However, I was not using rbd but cephfs, with kernel v3.14.28 plus the patches mentioned in #10449 and #10450. The bug was triggered by restarting all OSDs of our cluster simultaneously.

Actions #17

Updated by Ilya Dryomov about 9 years ago

This ticket has it mixed up with another issue; we are tracking the rb_erase() crash in #8087.
I'll post your comment and reply there.
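
For anyone landing here later: the rb_erase() crash class tracked in #8087 comes from __remove_osd() being reachable twice for the same osd, and the upstream fix made removal idempotent, roughly along these lines (a sketch of the fix's shape, inferred from the #8087 discussion; see that ticket for the authoritative patch):

static void remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
	/* rb_erase() on a node that is no longer in the tree walks stale
	 * parent/child pointers and can deref NULL, so guard the erase */
	if (!RB_EMPTY_NODE(&osd->o_node)) {
		rb_erase(&osd->o_node, &osdc->osds);
		RB_CLEAR_NODE(&osd->o_node);	/* mark removed for later callers */
	}
}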
