Bug #2026

closed

osd: ceph::HeartbeatMap::check_touch_file

Added by Wido den Hollander about 12 years ago. Updated over 11 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
% Done: 0%

Description

After losing data to a btrfs bug, I re-installed my whole cluster with 0.41 and kernel 3.2 (ceph-client with the latest btrfs) on my OSDs.

I ran rados bench for some time and that went fine, but when I started running an RBD VM I noticed that 2 OSDs had crashed with:

(gdb) bt
#0  0x00007f2bc1faff2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000006098a2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x0000000000609a6d in handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f2bc052d3a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f2bc0530b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f2bc0debd7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f2bc0de9f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f2bc0de9f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f2bc0dea04e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005dc6b0 in ceph::__ceph_assert_fail (assertion=0x738663 "0 == \"hit suicide timeout\"", file=0x7385da "common/HeartbeatMap.cc", line=78, 
    func=0x7387a0 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:75
#11 0x0000000000670704 in ceph::HeartbeatMap::_check (this=<optimized out>, h=<optimized out>, who=0x7386d8 "is_healthy", now=<optimized out>)
    at common/HeartbeatMap.cc:78
#12 0x0000000000670e87 in ceph::HeartbeatMap::is_healthy (this=0xd9b000) at common/HeartbeatMap.cc:118
#13 0x00000000006710a6 in ceph::HeartbeatMap::check_touch_file (this=0xd9b000) at common/HeartbeatMap.cc:129
#14 0x00000000005e56c7 in CephContextServiceThread::entry (this=0xdab900) at common/ceph_context.cc:64
#15 0x00007f2bc1fa7efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f2bc05d889d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()
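
For readers unfamiliar with the assertion in frame #10: the trace shows is_healthy calling _check, which aborted the daemon with "hit suicide timeout". Below is a minimal sketch of how a watchdog of this kind can work, assuming each worker thread carries a grace deadline and a longer suicide deadline. All names here (HeartbeatSketch, WorkerHandle, touch, clear, check) are hypothetical and chosen for illustration; this is not Ceph's actual HeartbeatMap code, and it returns a status instead of aborting so that it can be exercised safely.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>
#include <vector>

// Hypothetical, simplified model of a heartbeat watchdog.
// Names are illustrative and do not mirror Ceph's real API.
enum class Health { Healthy, Unhealthy, SuicideTimeout };

struct WorkerHandle {
    std::string name;
    std::time_t grace_deadline;    // 0 = idle; past this the worker counts as stuck
    std::time_t suicide_deadline;  // 0 = idle; past this the daemon should abort
};

class HeartbeatSketch {
public:
    // Register a worker thread and return its index.
    std::size_t add_worker(const std::string& name) {
        workers_.push_back(WorkerHandle{name, 0, 0});
        return workers_.size() - 1;
    }

    // A worker calls this when it starts a unit of work:
    // both deadlines are pushed forward relative to `now`.
    void touch(std::size_t id, std::time_t now,
               std::time_t grace_secs, std::time_t suicide_secs) {
        assert(grace_secs <= suicide_secs);
        workers_.at(id).grace_deadline = now + grace_secs;
        workers_.at(id).suicide_deadline = now + suicide_secs;
    }

    // A worker calls this when it goes idle: no deadlines apply.
    void clear(std::size_t id) {
        workers_.at(id).grace_deadline = 0;
        workers_.at(id).suicide_deadline = 0;
    }

    // A service thread calls this periodically. A real daemon would
    // abort on SuicideTimeout; the sketch only reports it.
    Health check(std::time_t now) const {
        Health result = Health::Healthy;
        for (const auto& w : workers_) {
            if (w.suicide_deadline != 0 && now > w.suicide_deadline)
                return Health::SuicideTimeout;
            if (w.grace_deadline != 0 && now > w.grace_deadline)
                result = Health::Unhealthy;
        }
        return result;
    }

private:
    std::vector<WorkerHandle> workers_;
};
```

Under this model, a worker blocked in the kernel (for example on a stalled filesystem call) never reaches its next touch or clear, so the periodic check eventually passes the suicide deadline. That would be consistent with the crash being a symptom of blocked I/O rather than a fault in the checking thread itself, though the trace alone does not prove it.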

Around the same time the OSDs crashed, I saw the following btrfs messages:

[384001.602475] INFO: task ceph-osd:14493 blocked for more than 120 seconds.
[384001.602490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[384001.602508] ceph-osd        D ffff8800b31f1aa0     0 14493      1 0x00000000
[384001.602521]  ffff8800b33afc58 0000000000000086 ffff880138b1db80 ffffffff8109a52a
[384001.602533]  ffff8800b31f16e0 ffff8800b33affd8 ffff8800b33affd8 ffff8800b33affd8
[384001.602546]  ffff8800b30fadc0 ffff8800b31f16e0 ffff8800b31f16e0 ffff880134f4f100
[384001.602559] Call Trace:
[384001.602571]  [<ffffffff8109a52a>] ? exit_robust_list+0x7a/0x130
[384001.602582]  [<ffffffff816135df>] schedule+0x3f/0x60
[384001.602592]  [<ffffffff81066935>] exit_mm+0x85/0x130
[384001.602603]  [<ffffffff81066b4f>] do_exit+0x16f/0x880
[384001.602615]  [<ffffffff810675c4>] do_group_exit+0x44/0xa0
[384001.602627]  [<ffffffff81077cec>] get_signal_to_deliver+0x21c/0x5a0
[384001.602640]  [<ffffffff81013135>] do_signal+0x65/0x700
[384001.602651]  [<ffffffff811824fd>] ? d_free+0x5d/0x70
[384001.602662]  [<ffffffff81189e8e>] ? vfsmount_lock_local_unlock+0x1e/0x30
[384001.602673]  [<ffffffff8118b8c0>] ? mntput_no_expire+0x30/0xf0
[384001.602685]  [<ffffffff81013855>] do_notify_resume+0x65/0x80
[384001.602696]  [<ffffffff8161d750>] int_signal+0x12/0x17

The PID in these messages doesn't match the crashed OSD, but the btrfs messages are from 03:29 and the core dump is from 03:30.

Does this seem related to btrfs or is this an OSD bug?


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #2045: osd: dout_lock deadlock (Can't reproduce, 02/09/2012)
