Bug #5428
libceph: null deref in ceph_auth_reset
Status: Closed
% Done: 0%
Source: Q/A
Severity: 3 - minor
Description
<4>[19534.802099] libceph: mon2 10.214.131.7:6790 socket closed (con state CONNECTING)
<1>[19534.809633] BUG: unable to handle kernel NULL pointer dereference at (null)
<1>[19534.817523] IP: [<ffffffff8163353e>] mutex_lock_nested+0xee/0x360
<4>[19534.823650] PGD 0
<4>[19534.825688] Oops: 0002 [#1] SMP

[dumpcommon]kdb> -bt
Stack traceback for pid 13893
0xffff88020cbe3f20    13893        2  1    0   R  0xffff88020cbe43a8 *kworker/0:0
 ffff880211275b28 0000000000000018 ffffffff8163351a ffffffffa081f3c6
 ffff880224f3b598 ffff88021112ea80 0000000000000246 ffff88021112ea80
 0000000000000000 1111111111111111 ffff880211275b48 ffff880211275b98
Call Trace:
 [<ffffffff8163351a>] ? mutex_lock_nested+0xca/0x360
 [<ffffffffa081f3c6>] ? ceph_auth_reset+0x26/0x80 [libceph]
 [<ffffffffa081f3c6>] ? ceph_auth_reset+0x26/0x80 [libceph]
 [<ffffffffa0812776>] ? __close_session+0x76/0xa0 [libceph]
 [<ffffffffa0812e33>] ? mon_fault+0x53/0xe0 [libceph]
 [<ffffffffa080ee21>] ? con_work+0x571/0x2d50 [libceph]
 [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff810605bc>] ? worker_thread+0x11c/0x370
 [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
 [<ffffffff8106727a>] ? kthread+0xea/0xf0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
 [<ffffffff8163ff9c>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
The run was:
ubuntu@teuthology:/a/teuthology-2013-06-22_01:00:51-kernel-next-testing-basic/42855$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  install:
    ceph:
      sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  s3tests:
    branch: next
  workunit:
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - rbd/map-unmap.sh
Updated by Sage Weil almost 11 years ago
First guess was a shutdown race, but ceph_monc_stop() flushes the msgr wq. Also, no other threads appear to be in ceph code at this time.
OK, looking at the test output, all rbds have long since been unmapped (unless there is a bug in the test script), so this is most likely a leaked msgr socket.
Updated by Sage Weil almost 11 years ago
Focusing on the warning leading up to this first: it looks like the socket callback is happening while the socket is in the CLOSED state, which is always preceded by a sock->ops->shutdown() call. The best theory is that shutdown isn't serialized against the callbacks. Alternatively, there is some ugly use-after-free going on, but that seems less likely.
Updated by Sage Weil over 10 years ago
- Status changed from New to Can't reproduce