Bug #8087
libceph: null deref in osd_reset -> __reset_osd -> __remove_osd
Description
I thought there was an open bug for this but I couldn't find it.
<4>[54746.470533] libceph: osd3 10.214.131.23:6800 socket closed (con state OPEN)
<4>[54750.230820] libceph: osd2 10.214.131.24:6811 socket closed (con state OPEN)
<6>[54751.166391] libceph: osd5 down
<6>[54756.706455] libceph: osd4 weight 0x10000 (in)
<4>[54757.199901] libceph: osd4 10.214.131.23:6809 socket closed (con state OPEN)
<4>[54758.246418] libceph: osd3 10.214.131.23:6800 socket closed (con state OPEN)
<4>[54761.874915] libceph: osd3 10.214.131.23:6800 socket closed (con state OPEN)
<4>[54761.896173] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<4>[54763.145385] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<4>[54764.153993] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<4>[54766.235252] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<6>[54768.281152] libceph: osd5 primary-affinity 0x10000
<4>[54770.181661] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<6>[54774.521289] libceph: osd5 primary-affinity 0x0
<4>[54778.186468] libceph: osd3 10.214.131.23:6800 socket closed (con state CONNECTING)
<6>[54783.515908] libceph: osd3 down
<4>[54788.156802] libceph: osd0 10.214.131.24:6800 socket closed (con state OPEN)
<6>[54790.331993] libceph: osd3 up
<6>[54798.984619] libceph: osd5 up
<4>[54799.446893] libceph: osd0 10.214.131.24:6800 socket closed (con state OPEN)
<6>[54803.215756] libceph: osd2 weight 0x0 (out)
<4>[54803.335380] libceph: osd3 10.214.131.23:6804 socket closed (con state OPEN)
<4>[54804.821799] libceph: osd2 10.214.131.24:6811 socket closed (con state OPEN)
<6>[54809.166935] libceph: osd0 primary-affinity 0x0
<6>[54815.953029] libceph: osd2 weight 0x10000 (in)
<6>[54860.014440] libceph: osd4 primary-affinity 0x0
<4>[54862.901538] libceph: osd4 10.214.131.23:6809 socket closed (con state OPEN)
<6>[54882.915187] libceph: osd5 primary-affinity 0x1f10
<6>[54889.048579] libceph: osd4 primary-affinity 0x0
<6>[54900.497577] libceph: osd5 primary-affinity 0x10000
<4>[54910.985836] libceph: osd1 10.214.131.24:6809 socket closed (con state OPEN)
<4>[54911.007718] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54912.259483] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54913.256074] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54915.325247] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<4>[54916.024475] libceph: osd5 10.214.131.23:6810 socket closed (con state OPEN)
<4>[54916.042183] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54916.317907] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54917.234492] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54919.327734] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54919.327736] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<6>[54922.389440] libceph: osd2 primary-affinity 0x0
<4>[54923.298189] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<4>[54927.268480] libceph: osd1 10.214.131.24:6809 socket closed (con state CONNECTING)
<6>[54928.589414] libceph: osd2 weight 0x0 (out)
<4>[54931.302937] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<6>[54936.009719] libceph: osd1 down
<4>[54936.342872] libceph: osd3 10.214.131.23:6804 socket closed (con state OPEN)
<4>[54937.469072] libceph: osd2 10.214.131.24:6811 socket closed (con state OPEN)
<6>[54940.727869] libceph: osd2 weight 0x10000 (in)
<6>[54947.242425] libceph: osd5 down
<6>[54947.245510] libceph: osd5 primary-affinity 0x10000
<4>[54947.344592] libceph: osd5 10.214.131.23:6810 socket closed (con state CONNECTING)
<1>[54947.549149] BUG: unable to handle kernel NULL pointer dereference at (null)
<1>[54947.557038] IP: [<ffffffff8136901b>] rb_erase+0x1bb/0x370
<4>[54947.562469] PGD 0
<4>[54947.564504] Oops: 0002 [#1] SMP
[5]kdb> bt
Stack traceback for pid 6709
0xffff880223dfe300 6709 2 1 5 R 0xffff880223dfe8a8 *kworker/5:1
 ffff880222153c28 0000000000000018 ffffffff00000001 ffff880222153c48
 ffffffffa07eaed1 ffff880222153ca8 ffff8801f58ca000 ffff880222153c78
 ffffffffa07eb139 ffff8801bd4c3768 ffff880222153ca8 ffff8801f58ca000
Call Trace:
 [<ffffffffa07eaed1>] ? __remove_osd+0x31/0x80 [libceph]
 [<ffffffffa07eb139>] ? __reset_osd+0x109/0x110 [libceph]
 [<ffffffffa07ec445>] ? osd_reset+0x55/0x200 [libceph]
 [<ffffffffa07e5fa5>] ? con_work+0xd15/0x2210 [libceph]
 [<ffffffff8106d9c4>] ? process_one_work+0x174/0x4a0
 [<ffffffff8106ed6b>] ? worker_thread+0x11b/0x370
 [<ffffffff8106ec50>] ? manage_workers.isra.20+0x2d0/0x2d0
 [<ffffffff81075809>] ? kthread+0xc9/0xe0
 [<ffffffff81075740>] ? flush_kthread_worker+0xb0/0xb0
 [<ffffffff816b20ac>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff81075740>] ? flush_kthread_worker+0xb0/0xb0
The workload was /var/lib/teuthworker/archive/teuthology-2014-04-11_23:01:07-kcephfs-master-testing-basic-plana/186991
Related issues
History
#1 Updated by Ian Colle almost 10 years ago
- Assignee set to Ilya Dryomov
#2 Updated by Ilya Dryomov about 9 years ago
Reported in #5429:
During some tests, I stumbled upon this bug in rb_erase(), triggered via osd_reset() -> __reset_osd() -> __remove_osd(). However, I have not been using rbd but cephfs, with kernel v3.14.28 plus the patches mentioned in #10449 and #10450. The bug was triggered by restarting all OSDs of our cluster simultaneously.
Markus, can you reproduce it at least somewhat reliably? Is there any chance you could try restarting all OSDs for a while to try to reproduce this with logging enabled?
#3 Updated by Ilya Dryomov about 9 years ago
Also, how many OSDs do you have? How exactly did you restart all OSDs simultaneously - please paste the exact commands. Was the cluster under any load, or idle, when this happened?
#4 Updated by Markus Blank-Burian about 9 years ago
I noticed this before, but have not reproduced it systematically. We currently have 82 OSDs, and I restarted them simultaneously using a script like this:
for host in $hosts; do
    ssh $host $* &
done
wait
The restart goes via the Gentoo initscript "/etc/init.d/ceph restart", which in turn calls "/usr/lib64/ceph/ceph_init.sh" to restart the daemons.
I will see what I can do to reproduce this, but it may have to wait until our next maintenance. Which logging flags do you want me to activate then?
#5 Updated by Markus Blank-Burian about 9 years ago
- File rb_erase.txt View added
Actually, the restart was a few hours earlier, at about 22:53, and the crash was at 03:12, so it might also be unrelated. But in the log (see attached file) there were many osd down/up entries, which led me to believe it was connected. As far as I can tell, there were only a few active processes using ceph at the time; stats show around 4 IOPS.
#6 Updated by Markus Blank-Burian about 9 years ago
- File kaa-54.log.bz2 added
I have reproduced the bug 3 times today and had full kernel logging activated the last time. The log and a screenshot of the bug message are attached. The bug occurred while restarting osd.13, osd.14, osd.62 and osd.38. The noout flag was not set. As you can see from the log, there was some moderate activity on this particular node.
#7 Updated by Markus Blank-Burian about 9 years ago
- File IMG_20150130_163313.jpg View added
#8 Updated by Ilya Dryomov about 9 years ago
- Status changed from 12 to Resolved
libceph: fix double __remove_osd() problem