Bug #2026
Status: closed
Subject: osd: ceph::HeartbeatMap::check_touch_file
Description
After losing data to a btrfs bug, I re-installed my whole cluster with 0.41 and kernel 3.2 (latest ceph-client btrfs code) on my OSDs.
I ran rados bench for some time and that went fine, but when I started running an RBD VM I noticed that two OSDs had crashed with:
(gdb) bt
#0  0x00007f2bc1faff2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000006098a2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x0000000000609a6d in handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f2bc052d3a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f2bc0530b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f2bc0debd7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f2bc0de9f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f2bc0de9f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f2bc0dea04e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005dc6b0 in ceph::__ceph_assert_fail (assertion=0x738663 "0 == \"hit suicide timeout\"", file=0x7385da "common/HeartbeatMap.cc", line=78, func=0x7387a0 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:75
#11 0x0000000000670704 in ceph::HeartbeatMap::_check (this=<optimized out>, h=<optimized out>, who=0x7386d8 "is_healthy", now=<optimized out>) at common/HeartbeatMap.cc:78
#12 0x0000000000670e87 in ceph::HeartbeatMap::is_healthy (this=0xd9b000) at common/HeartbeatMap.cc:118
#13 0x00000000006710a6 in ceph::HeartbeatMap::check_touch_file (this=0xd9b000) at common/HeartbeatMap.cc:129
#14 0x00000000005e56c7 in CephContextServiceThread::entry (this=0xdab900) at common/ceph_context.cc:64
#15 0x00007f2bc1fa7efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f2bc05d889d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()
Around the time the OSDs crashed, I saw the following btrfs messages:
[384001.602475] INFO: task ceph-osd:14493 blocked for more than 120 seconds.
[384001.602490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[384001.602508] ceph-osd        D ffff8800b31f1aa0     0 14493      1 0x00000000
[384001.602521]  ffff8800b33afc58 0000000000000086 ffff880138b1db80 ffffffff8109a52a
[384001.602533]  ffff8800b31f16e0 ffff8800b33affd8 ffff8800b33affd8 ffff8800b33affd8
[384001.602546]  ffff8800b30fadc0 ffff8800b31f16e0 ffff8800b31f16e0 ffff880134f4f100
[384001.602559] Call Trace:
[384001.602571]  [<ffffffff8109a52a>] ? exit_robust_list+0x7a/0x130
[384001.602582]  [<ffffffff816135df>] schedule+0x3f/0x60
[384001.602592]  [<ffffffff81066935>] exit_mm+0x85/0x130
[384001.602603]  [<ffffffff81066b4f>] do_exit+0x16f/0x880
[384001.602615]  [<ffffffff810675c4>] do_group_exit+0x44/0xa0
[384001.602627]  [<ffffffff81077cec>] get_signal_to_deliver+0x21c/0x5a0
[384001.602640]  [<ffffffff81013135>] do_signal+0x65/0x700
[384001.602651]  [<ffffffff811824fd>] ? d_free+0x5d/0x70
[384001.602662]  [<ffffffff81189e8e>] ? vfsmount_lock_local_unlock+0x1e/0x30
[384001.602673]  [<ffffffff8118b8c0>] ? mntput_no_expire+0x30/0xf0
[384001.602685]  [<ffffffff81013855>] do_notify_resume+0x65/0x80
[384001.602696]  [<ffffffff8161d750>] int_signal+0x12/0x17
While the PID in the hung-task trace doesn't match the crashed OSD's, the btrfs messages are from 03:29 and the core dump from 03:30.
Does this seem related to btrfs or is this an OSD bug?