Bug #5428

libceph: null deref in ceph_auth_reset

Added by Sage Weil almost 11 years ago. Updated over 10 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

<4>[19534.802099] libceph: mon2 10.214.131.7:6790 socket closed (con state CONNECTING)
<1>[19534.809633] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[19534.817523] IP: [<ffffffff8163353e>] mutex_lock_nested+0xee/0x360
<4>[19534.823650] PGD 0 
<4>[19534.825688] Oops: 0002 [#1] SMP 
[dumpcommon]kdb>   -bt

Stack traceback for pid 13893
0xffff88020cbe3f20    13893        2  1    0   R  0xffff88020cbe43a8 *kworker/0:0
 ffff880211275b28 0000000000000018 ffffffff8163351a ffffffffa081f3c6
 ffff880224f3b598 ffff88021112ea80 0000000000000246 ffff88021112ea80
 0000000000000000 1111111111111111 ffff880211275b48 ffff880211275b98
Call Trace:
 [<ffffffff8163351a>] ? mutex_lock_nested+0xca/0x360
 [<ffffffffa081f3c6>] ? ceph_auth_reset+0x26/0x80 [libceph]
 [<ffffffffa081f3c6>] ? ceph_auth_reset+0x26/0x80 [libceph]
 [<ffffffffa0812776>] ? __close_session+0x76/0xa0 [libceph]
 [<ffffffffa0812e33>] ? mon_fault+0x53/0xe0 [libceph]
 [<ffffffffa080ee21>] ? con_work+0x571/0x2d50 [libceph]
 [<ffffffff81080bb3>] ? idle_balance+0x133/0x180
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071b78>] ? finish_task_switch+0x48/0x110
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff8105f3da>] ? process_one_work+0x1da/0x540
 [<ffffffff8105f36f>] ? process_one_work+0x16f/0x540
 [<ffffffff810605bc>] ? worker_thread+0x11c/0x370
 [<ffffffff810604a0>] ? manage_workers.isra.20+0x2e0/0x2e0
 [<ffffffff8106727a>] ? kthread+0xea/0xf0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
 [<ffffffff8163ff9c>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff81067190>] ? flush_kthread_worker+0x150/0x150
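
The faulting address of (null) together with the RIP inside mutex_lock_nested+0xee suggests ceph_auth_reset() was handed a NULL ceph_auth_client and faulted while taking the mutex embedded in it. A minimal userspace sketch of that failure mode (the struct and helper below are illustrative stand-ins, not the actual libceph code):

#include <pthread.h>
#include <stddef.h>

struct auth_client {                     /* stand-in for struct ceph_auth_client */
	pthread_mutex_t mutex;           /* embedded mutex, like ac->mutex */
	int negotiating;
};

/* stand-in for ceph_auth_reset(): the first thing it does is take ac->mutex,
 * so a NULL ac faults at (or very near) address 0 */
static void auth_reset(struct auth_client *ac)
{
	pthread_mutex_lock(&ac->mutex);  /* ac == NULL here reproduces the oops pattern */
	ac->negotiating = 1;
	pthread_mutex_unlock(&ac->mutex);
}

int main(void)
{
	struct auth_client *ac = NULL;   /* auth client already gone or never set */
	auth_reset(ac);                  /* SIGSEGV: NULL pointer dereference at address 0 */
	return 0;
}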

The run was:

ubuntu@teuthology:/a/teuthology-2013-06-22_01:00:51-kernel-next-testing-basic/42855$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  install:
    ceph:
      sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
  s3tests:
    branch: next
  workunit:
    sha1: 94eada40460cc6010be23110ef8ce0e3d92691af
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - rbd/map-unmap.sh

dump.txt View (127 KB) Sage Weil, 06/23/2013 10:18 AM

dump.txt View (85.6 KB) Sage Weil, 07/19/2013 08:49 AM

History

#1 Updated by Sage Weil almost 11 years ago

First guess was a shutdown race, but ceph_monc_stop() flushes the msgr wq. Also, no other threads appear to be in ceph code at this time.

OK, looking at the test output, all rbds have long since been unmapped (unless there is a bug in the test script), so this is most likely a leaked msgr socket.

#2 Updated by Sage Weil almost 11 years ago

leaving plana09 in kdb

#3 Updated by Sage Weil almost 11 years ago

#4 Updated by Ian Colle almost 11 years ago

  • Assignee set to Josh Durgin

#5 Updated by Sage Weil over 10 years ago

hit this again, ubuntu@teuthology:/a/teuthology-2013-07-19_01:01:11-krbd-next-testing-basic/73011

#6 Updated by Sage Weil over 10 years ago

Focusing first on the warning leading up to this: it looks like the socket callback is happening when the socket is in the CLOSED state, which is always preceded by a sock->ops->shutdown(). The best theory is that shutdown isn't serialized against the callbacks. Alternatively, there is some ugly use-after-free going on, but that seems less likely.
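
A userspace analog of that suspected ordering problem (a hypothetical sketch; the connection struct, callback, and teardown below are stand-ins, not the messenger code): the teardown path shuts things down and clears state while a previously armed callback is still free to run, so the callback can observe a closed connection or a NULL auth pointer, which would line up with both the "socket closed (con state ...)" warnings and the oops above.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct connection {
	int closed;              /* set by teardown, like con state CLOSED after shutdown */
	int *auth;               /* stand-in for the auth client pointer */
};

/* analog of the socket state-change callback; nothing here serializes it
 * against the teardown path, which is the suspected bug being illustrated */
static void *sock_callback(void *arg)
{
	struct connection *con = arg;
	if (con->closed)
		printf("callback ran after shutdown (con already closed)\n");
	if (con->auth == NULL)
		printf("callback would dereference a NULL auth pointer here\n");
	return NULL;
}

int main(void)
{
	struct connection con = { 0, malloc(sizeof(int)) };
	pthread_t cb;

	pthread_create(&cb, NULL, sock_callback, &con);  /* callback still in flight */

	/* teardown: close and clear state without waiting for the callback */
	con.closed = 1;
	free(con.auth);
	con.auth = NULL;

	pthread_join(cb, NULL);
	return 0;
}

Whether the callback actually observes the cleared state depends on scheduling, which is the point: without explicitly flushing or cancelling the callback before state is torn down, either ordering is possible.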

#7 Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to High

#8 Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce
