Bug #2026
Status: closed
Subject: osd: ceph::HeartbeatMap::check_touch_file
Description
After losing data to a btrfs bug, I re-installed my whole cluster with 0.41 and kernel 3.2 (latest ceph-client btrfs code) on my OSDs.
I ran rados bench for some time and that went fine, but when I started running an RBD VM I noticed that two OSDs had crashed with:
(gdb) bt
#0  0x00007f2bc1faff2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000006098a2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x0000000000609a6d in handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f2bc052d3a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f2bc0530b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f2bc0debd7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f2bc0de9f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f2bc0de9f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f2bc0dea04e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005dc6b0 in ceph::__ceph_assert_fail (assertion=0x738663 "0 == \"hit suicide timeout\"", file=0x7385da "common/HeartbeatMap.cc", line=78, func=0x7387a0 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:75
#11 0x0000000000670704 in ceph::HeartbeatMap::_check (this=<optimized out>, h=<optimized out>, who=0x7386d8 "is_healthy", now=<optimized out>) at common/HeartbeatMap.cc:78
#12 0x0000000000670e87 in ceph::HeartbeatMap::is_healthy (this=0xd9b000) at common/HeartbeatMap.cc:118
#13 0x00000000006710a6 in ceph::HeartbeatMap::check_touch_file (this=0xd9b000) at common/HeartbeatMap.cc:129
#14 0x00000000005e56c7 in CephContextServiceThread::entry (this=0xdab900) at common/ceph_context.cc:64
#15 0x00007f2bc1fa7efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f2bc05d889d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()
Around the time the OSDs crashed, I saw the following btrfs messages:
[384001.602475] INFO: task ceph-osd:14493 blocked for more than 120 seconds.
[384001.602490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[384001.602508] ceph-osd        D ffff8800b31f1aa0     0 14493      1 0x00000000
[384001.602521]  ffff8800b33afc58 0000000000000086 ffff880138b1db80 ffffffff8109a52a
[384001.602533]  ffff8800b31f16e0 ffff8800b33affd8 ffff8800b33affd8 ffff8800b33affd8
[384001.602546]  ffff8800b30fadc0 ffff8800b31f16e0 ffff8800b31f16e0 ffff880134f4f100
[384001.602559] Call Trace:
[384001.602571]  [<ffffffff8109a52a>] ? exit_robust_list+0x7a/0x130
[384001.602582]  [<ffffffff816135df>] schedule+0x3f/0x60
[384001.602592]  [<ffffffff81066935>] exit_mm+0x85/0x130
[384001.602603]  [<ffffffff81066b4f>] do_exit+0x16f/0x880
[384001.602615]  [<ffffffff810675c4>] do_group_exit+0x44/0xa0
[384001.602627]  [<ffffffff81077cec>] get_signal_to_deliver+0x21c/0x5a0
[384001.602640]  [<ffffffff81013135>] do_signal+0x65/0x700
[384001.602651]  [<ffffffff811824fd>] ? d_free+0x5d/0x70
[384001.602662]  [<ffffffff81189e8e>] ? vfsmount_lock_local_unlock+0x1e/0x30
[384001.602673]  [<ffffffff8118b8c0>] ? mntput_no_expire+0x30/0xf0
[384001.602685]  [<ffffffff81013855>] do_notify_resume+0x65/0x80
[384001.602696]  [<ffffffff8161d750>] int_signal+0x12/0x17
While the PID in the hung-task trace doesn't match the crashed OSD's, the btrfs messages are from 03:29 and the core dump from 03:30.
Does this seem related to btrfs or is this an OSD bug?