Bug #8087

libceph: null deref in osd_reset -> __reset_osd -> __remove_osd

Added by Sage Weil almost 8 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: libceph
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I thought there was an open bug for this but I couldn't find it.

<4>[54746.470533] libceph: osd3 10.214.131.23:6800 socket closed (con state OPEN)
<4>[54750.230820] libceph: osd2 10.214.131.24:6811 socket closed (con state OPEN)
<6>[54751.166391] libceph: osd5 down
<6>[54756.706455] libceph: osd4 weight 0x10000 (in)
<4>[54757.199901] libceph: osd4 10.214.131.23:6809 socket closed (con state OPEN)
<4>[54758.246418] libceph: osd3 10.214.131.23:6800 socket closed (con state OPEN)
<4>[54761.874915] libceph: osd3 10.214.131.23:6800 socket closed (con state OPEN)
<4>[54761.896173] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<4>[54763.145385] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<4>[54764.153993] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<4>[54766.235252] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<6>[54768.281152] libceph: osd5 primary-affinity 0x10000
<4>[54770.181661] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<6>[54774.521289] libceph: osd5 primary-affinity 0x0
<4>[54778.186468] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<6>[54783.515908] libceph: osd3 down
<4>[54788.156802] libceph: osd0 10.214.131.24:6800 socket closed (con state OPEN)
<6>[54790.331993] libceph: osd3 up
<6>[54798.984619] libceph: osd5 up
<4>[54799.446893] libceph: osd0 10.214.131.24:6800 socket closed (con state OPEN)
<6>[54803.215756] libceph: osd2 weight 0x0 (out)
<4>[54803.335380] libceph: osd3 10.214.131.23:6804 socket closed (con state OPEN)
<4>[54804.821799] libceph: osd2 10.214.131.24:6811 socket closed (con state OPEN)
<6>[54809.166935] libceph: osd0 primary-affinity 0x0
<6>[54815.953029] libceph: osd2 weight 0x10000 (in)
<6>[54860.014440] libceph: osd4 primary-affinity 0x0
<4>[54862.901538] libceph: osd4 10.214.131.23:6809 socket closed (con state OPEN)
<6>[54882.915187] libceph: osd5 primary-affinity 0x1f10
<6>[54889.048579] libceph: osd4 primary-affinity 0x0
<6>[54900.497577] libceph: osd5 primary-affinity 0x10000
<4>[54910.985836] libceph: osd1 10.214.131.24:6809 socket closed (con state OPEN)
<4>[54911.007718] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54912.259483] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54913.256074] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54915.325247] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54916.024475] libceph: osd5 10.214.131.23:6810 socket closed (con state OPEN)
<4>[54916.042183] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54916.317907] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54917.234492] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54919.327734] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54919.327736] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<6>[54922.389440] libceph: osd2 primary-affinity 0x0
<4>[54923.298189] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54927.268480] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<6>[54928.589414] libceph: osd2 weight 0x0 (out)
<4>[54931.302937] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<6>[54936.009719] libceph: osd1 down
<4>[54936.342872] libceph: osd3 10.214.131.23:6804 socket closed (con state OPEN)
<4>[54937.469072] libceph: osd2 10.214.131.24:6811 socket closed (con state OPEN)
<6>[54940.727869] libceph: osd2 weight 0x10000 (in)
<6>[54947.242425] libceph: osd5 down
<6>[54947.245510] libceph: osd5 primary-affinity 0x10000
<4>[54947.344592] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<1>[54947.549149] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[54947.557038] IP: [<ffffffff8136901b>] rb_erase+0x1bb/0x370
<4>[54947.562469] PGD 0 
<4>[54947.564504] Oops: 0002 [#1] SMP 
[5]kdb>  
[5]kdb> bt
Stack traceback for pid 6709
0xffff880223dfe300     6709        2  1    5   R  0xffff880223dfe8a8 *kworker/5:1
 ffff880222153c28 0000000000000018 ffffffff00000001 ffff880222153c48
 ffffffffa07eaed1 ffff880222153ca8 ffff8801f58ca000 ffff880222153c78
 ffffffffa07eb139 ffff8801bd4c3768 ffff880222153ca8 ffff8801f58ca000
Call Trace:
 [<ffffffffa07eaed1>] ? __remove_osd+0x31/0x80 [libceph]
 [<ffffffffa07eb139>] ? __reset_osd+0x109/0x110 [libceph]
 [<ffffffffa07ec445>] ? osd_reset+0x55/0x200 [libceph]
 [<ffffffffa07e5fa5>] ? con_work+0xd15/0x2210 [libceph]
 [<ffffffff8106d9c4>] ? process_one_work+0x174/0x4a0
 [<ffffffff8106ed6b>] ? worker_thread+0x11b/0x370
 [<ffffffff8106ec50>] ? manage_workers.isra.20+0x2d0/0x2d0
 [<ffffffff81075809>] ? kthread+0xc9/0xe0
 [<ffffffff81075740>] ? flush_kthread_worker+0xb0/0xb0
 [<ffffffff816b20ac>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff81075740>] ? flush_kthread_worker+0xb0/0xb0

The workload was /var/lib/teuthworker/archive/teuthology-2014-04-11_23:01:07-kcephfs-master-testing-basic-plana/186991

rb_erase.txt (34 KB) Markus Blank-Burian, 01/20/2015 04:04 AM

kaa-54.log.bz2 - Verbose client log (900 KB) Markus Blank-Burian, 01/30/2015 03:47 PM

IMG_20150130_163313.jpg - Screenshot of bug message (228 KB) Markus Blank-Burian, 01/30/2015 03:49 PM


Related issues

Related to Linux kernel client - Bug #6809: 3.11 kernel panic: Workqueue: ceph-msgr con_work New 11/20/2013

History

#1 Updated by Ian Colle almost 8 years ago

  • Assignee set to Ilya Dryomov

#2 Updated by Ilya Dryomov about 7 years ago

Reported in #5429:

During some tests, I stumbled upon this bug in rb_erase, triggered via osd_reset() -> __reset_osd() -> __remove_osd().
I have not been using rbd, but cephfs, with kernel v3.14.28 plus the patches mentioned in #10449 and #10450.
The bug was triggered by restarting all OSDs of our cluster simultaneously.

Markus, can you reproduce it at least somewhat reliably? Is there any chance you could try restarting all OSDs for a while to try to reproduce this with logging enabled?

#3 Updated by Ilya Dryomov about 7 years ago

Also, how many OSDs do you have? How exactly did you restart all OSDs simultaneously - paste exact commands. Was the cluster under any load or idle when this happened?

#4 Updated by Markus Blank-Burian about 7 years ago

I have noticed this before, but have not reproduced it systematically. We currently have 82 OSDs, and I restarted them simultaneously using a script like this:

# run the given command ($*) on every host in parallel, then wait for all
for host in $hosts
do
        ssh $host $* &
done
wait

The restart goes via the Gentoo initscript "/etc/init.d/ceph restart", which in turn calls "/usr/lib64/ceph/ceph_init.sh" to restart the daemons.
I will see what I can do to reproduce this, but it may need to wait until our next maintenance. Which logging flags do you want me to activate then?

#5 Updated by Markus Blank-Burian about 7 years ago

Actually, the restart was a few hours earlier, at about 22:53, and the crash was at 03:12, so it might also be unrelated. But in the log (see attached file) there were many osd down/up entries, which led me to believe it was connected. As far as I can tell, there were only a few active processes using ceph at the time. Stats show around 4 IOPS.

#6 Updated by Markus Blank-Burian almost 7 years ago

I have reproduced the bug 3 times today and had full kernel logging activated the last time. The log and a screenshot of the bug message are attached. The bug occurred while restarting osd.13, osd.14, osd.62 and osd.38. The noout flag was not set. As you can see from the log, there was some moderate activity on this particular node.

#8 Updated by Ilya Dryomov almost 7 years ago

  • Status changed from 12 to Resolved

libceph: fix double __remove_osd() problem

Also available in: Atom PDF