Bug #2026

osd: ceph::HeartbeatMap::check_touch_file

Added by Wido den Hollander over 7 years ago. Updated almost 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version:
Start date: 02/06/2012
Due date:
% Done: 0%
Spent time:
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

After losing data to a btrfs bug, I re-installed my whole cluster with 0.41 and kernel 3.2 (ceph-client with latest btrfs) on my OSDs.

I ran rados bench for some time and that went fine, but when I started running an RBD VM I noticed that two OSDs had crashed with:

(gdb) bt
#0  0x00007f2bc1faff2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000006098a2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x0000000000609a6d in handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f2bc052d3a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f2bc0530b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f2bc0debd7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f2bc0de9f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f2bc0de9f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f2bc0dea04e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005dc6b0 in ceph::__ceph_assert_fail (assertion=0x738663 "0 == \"hit suicide timeout\"", file=0x7385da "common/HeartbeatMap.cc", line=78, 
    func=0x7387a0 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:75
#11 0x0000000000670704 in ceph::HeartbeatMap::_check (this=<optimized out>, h=<optimized out>, who=0x7386d8 "is_healthy", now=<optimized out>)
    at common/HeartbeatMap.cc:78
#12 0x0000000000670e87 in ceph::HeartbeatMap::is_healthy (this=0xd9b000) at common/HeartbeatMap.cc:118
#13 0x00000000006710a6 in ceph::HeartbeatMap::check_touch_file (this=0xd9b000) at common/HeartbeatMap.cc:129
#14 0x00000000005e56c7 in CephContextServiceThread::entry (this=0xdab900) at common/ceph_context.cc:64
#15 0x00007f2bc1fa7efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f2bc05d889d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()

Around the same time the OSD's crashed I saw the following btrfs messages:

[384001.602475] INFO: task ceph-osd:14493 blocked for more than 120 seconds.
[384001.602490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[384001.602508] ceph-osd        D ffff8800b31f1aa0     0 14493      1 0x00000000
[384001.602521]  ffff8800b33afc58 0000000000000086 ffff880138b1db80 ffffffff8109a52a
[384001.602533]  ffff8800b31f16e0 ffff8800b33affd8 ffff8800b33affd8 ffff8800b33affd8
[384001.602546]  ffff8800b30fadc0 ffff8800b31f16e0 ffff8800b31f16e0 ffff880134f4f100
[384001.602559] Call Trace:
[384001.602571]  [<ffffffff8109a52a>] ? exit_robust_list+0x7a/0x130
[384001.602582]  [<ffffffff816135df>] schedule+0x3f/0x60
[384001.602592]  [<ffffffff81066935>] exit_mm+0x85/0x130
[384001.602603]  [<ffffffff81066b4f>] do_exit+0x16f/0x880
[384001.602615]  [<ffffffff810675c4>] do_group_exit+0x44/0xa0
[384001.602627]  [<ffffffff81077cec>] get_signal_to_deliver+0x21c/0x5a0
[384001.602640]  [<ffffffff81013135>] do_signal+0x65/0x700
[384001.602651]  [<ffffffff811824fd>] ? d_free+0x5d/0x70
[384001.602662]  [<ffffffff81189e8e>] ? vfsmount_lock_local_unlock+0x1e/0x30
[384001.602673]  [<ffffffff8118b8c0>] ? mntput_no_expire+0x30/0xf0
[384001.602685]  [<ffffffff81013855>] do_notify_resume+0x65/0x80
[384001.602696]  [<ffffffff8161d750>] int_signal+0x12/0x17

While the PID doesn't match the OSD, I do see that the btrfs messages were from 03:29 and the core dump from 03:30.

Does this seem related to btrfs or is this an OSD bug?


Related issues

Related to Ceph - Bug #2045: osd: dout_lock deadlock (Can't reproduce, 02/09/2012)

History

#1 Updated by Sage Weil over 7 years ago

This looks like a btrfs or kernel issue to me. Have you seen it since?

#2 Updated by Wido den Hollander over 7 years ago

A btrfs one, I guess. Right now I'm running a couple of virtual machines without any issues, so for now we can leave this one as 'Can't reproduce'.

#3 Updated by Sage Weil over 7 years ago

  • Assignee set to Sage Weil
  • Target version set to v0.43

I just hit this in qa, ubuntu@teuthology:/var/lib/teuthworker/archive/nightly_coverage_2012-02-14-a/11871.

#4 Updated by Sage Weil over 7 years ago

  • Assignee deleted (Sage Weil)

in my case, this looks like #2045.

#5 Updated by Sage Weil over 7 years ago

  • Target version changed from v0.43 to v0.44

#6 Updated by Sage Weil about 7 years ago

  • Status changed from New to Can't reproduce

#7 Updated by Xiaopong Tran almost 7 years ago

It would be great if we could keep this issue going, and if there is anything I could provide, I'd be happy to do so. I've had two mon crashes in less than 3 days (on the same machine), which look similar to this issue. Every time, it generated a gigantic core dump file of 16GB or so. The mon log file does not show anything suspicious, no stack trace or anything.

Please let me know what other information I can provide.

#8 Updated by Xiaopong Tran almost 7 years ago

Some more information:

OS: Debian Wheezy 3.2.0-3-amd64
Ceph: ceph version 0.48argonaut (c2b20ca74249892c8e5e40c12aa14446a2bf2030)

The mon seems to crash every three days or so, and while it's crashing the cluster is unreachable: the ceph command does not respond, and there seems to be no way to check the cluster status :(

My cluster has been unreachable for hours now, it starts to look scary...

#9 Updated by Xiaopong Tran almost 7 years ago

The Ceph cluster is totally unreachable at this point. The only interesting thing in ceph.log (on one of the other machines, not the one where the mon is crashing, since I can't even ssh into that one) is this:

2012-08-31 08:16:57.839163 osd.30 10.1.0.12:6801/2138 10326088 : [WRN] slow request 1724174.352231 seconds old, received at 2012-08-11 09:20:43.486859: osd_op(mds.0.3:7976960 10000021b5f.00000000 [delete] 0.18371e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:57.839168 osd.30 10.1.0.12:6801/2138 10326089 : [WRN] slow request 1724174.341145 seconds old, received at 2012-08-11 09:20:43.497945: osd_op(mds.0.3:7977137 10000026752.00000001 [delete] 0.b4d35e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:57.839172 osd.30 10.1.0.12:6801/2138 10326090 : [WRN] slow request 1724169.390402 seconds old, received at 2012-08-11 09:20:48.448688: osd_op(mds.0.3:7980425 1000001dcdd.00000000 [delete] 0.634fe8c snapc 1=[]) v4 currently waiting for sub ops
2012-08-31 08:16:57.839176 osd.30 10.1.0.12:6801/2138 10326091 : [WRN] slow request 1724154.529565 seconds old, received at 2012-08-11 09:21:03.309525: osd_op(mds.0.3:7981591 100000268a3.00000001 [delete] 0.52de9e8c snapc 1=[]) v4 currently no flag points reached
2012-08-31 08:17:10.221551 mon.1 10.1.0.13:6789/0 88603 : [INF] pgmap v2260517: 11552 pgs: 11551 active+clean, 1 active+clean+replay; 3351 GB data, 10574 GB used, 93682 GB / 101 TB avail
2012-08-31 08:17:17.377889 mon.1 10.1.0.13:6789/0 88604 : [INF] pgmap v2260518: 11552 pgs: 11551 active+clean, 1 active+clean+replay; 3351 GB data, 10574 GB used, 93682 GB / 101 TB avail
2012-08-31 08:17:17.485759 mon.1 10.1.0.13:6789/0 88605 : [INF] mdsmap e675: 1/1/1 up {0=c=up:active(laggy or crashed)}
2012-08-31 08:17:20.671111 mon.1 10.1.0.13:6789/0 88606 : [INF] pgmap v2260519: 11552 pgs: 11551 active+clean, 1 active+clean+replay; 3352 GB data, 10574 GB used, 93682 GB / 101 TB avail
2012-08-31 08:16:57.839180 osd.30 10.1.0.12:6801/2138 10326092 : [WRN] slow request 1724119.438966 seconds old, received at 2012-08-11 09:21:38.400124: osd_op(mds.0.3:7982856 10000000b6b.00000000 [delete] 0.ab3c4e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:58.841239 osd.30 10.1.0.12:6801/2138 10326093 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 1724175.354302 secs
2012-08-31 08:16:58.841248 osd.30 10.1.0.12:6801/2138 10326094 : [WRN] slow request 1724175.354302 seconds old, received at 2012-08-11 09:20:43.486859: osd_op(mds.0.3:7976960 10000021b5f.00000000 [delete] 0.18371e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:58.841253 osd.30 10.1.0.12:6801/2138 10326095 : [WRN] slow request 1724175.343216 seconds old, received at 2012-08-11 09:20:43.497945: osd_op(mds.0.3:7977137 10000026752.00000001 [delete] 0.b4d35e8c snapc 1=[]) v4 currently started

One of the things I notice is that the mon seems to eat up a lot of memory. Here is some info from another machine:

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                 
 2126 root      20   0 1397m 459m    0 S  13.6  1.4   1208:21 ceph-osd                                                                
 1986 root      20   0 1272m 220m    0 S  11.6  0.7   1217:09 ceph-osd                                                                
 2499 root      20   0 1087m 171m    0 S   1.0  0.5 386:43.58 ceph-osd                                                                
 9705 root      20   0     0    0    0 S   1.0  0.0   0:11.87 kworker/0:0                                                             
 1717 root      20   0 68.0g  25g  456 S   0.7 81.9   9064:40 ceph-mon                                                                
 2056 root      20   0 1051m 155m    0 S   0.7  0.5 402:21.84 ceph-osd                                                                
 2337 root      20   0 1108m 177m    0 S   0.7  0.6 407:13.85 ceph-osd                                                                
 2672 root      20   0 1134m 190m    0 S   0.7  0.6 500:59.67 ceph-osd     

Is that normal?

#10 Updated by Sage Weil almost 7 years ago

This is unrelated to the heartbeat problem; opened new bug #3067!
