Ceph - Bug #2026: osd: ceph::HeartbeatMap::check_touch_file
https://tracker.ceph.com/issues/2026

2012-02-11 09:57 UTC - Sage Weil (sage@newdream.net):
This looks like a btrfs or kernel issue to me. Have you seen it since?

2012-02-11 22:32 UTC - Wido den Hollander (wido@42on.com):
I'd guess a btrfs one. Right now I'm running a couple of virtual machines without any issues, so for now we can leave this as "Can't reproduce", I guess.

2012-02-14 07:59 UTC - Sage Weil (sage@newdream.net):
- Assignee set to Sage Weil
- Target version set to v0.43

I just hit this in qa: ubuntu@teuthology:/var/lib/teuthworker/archive/nightly_coverage_2012-02-14-a/11871.

2012-02-14 09:38 UTC - Sage Weil (sage@newdream.net):
- Assignee deleted (Sage Weil)

In my case, this looks like #2045 (osd: dout_lock deadlock, https://tracker.ceph.com/issues/2045).

2012-03-03 14:26 UTC - Sage Weil (sage@newdream.net):
- Target version changed from v0.43 to v0.44

2012-03-30 16:25 UTC - Sage Weil (sage@newdream.net):
- Status changed from New to Can't reproduce

2012-08-28 07:45 UTC - Xiaopong Tran (xp@renzhi.ca):
It would be great if we could keep this issue going; if there is anything I could provide, I'd be happy to do so. I've had two mon crashes in less than three days (on the same machine), which look similar to this issue, and every time it generated a gigantic core dump of 16 GB or so. The mon log does not show anything suspicious, no stack trace or anything.
Please let me know what other information I can provide.
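A usable backtrace can normally be pulled even from a 16 GB core without loading it all into memory; the binary and core paths below are assumptions for a default Debian install:

<pre>
# Dump a backtrace for every thread in the ceph-mon core
# (point gdb at the actual binary and core file on the crashed host).
gdb -batch -ex 'thread apply all bt' /usr/bin/ceph-mon /path/to/core
</pre>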
2012-08-30 20:58 UTC - Xiaopong Tran (xp@renzhi.ca):

Some more information:
OS: Debian Wheezy 3.2.0-3-amd64
Ceph: ceph version 0.48argonaut (c2b20ca74249892c8e5e40c12aa14446a2bf2030)
The mon seems to crash every three days or so, and while it's down the cluster is unreachable: the ceph command does not respond, and there seems to be no way to check the cluster status. :(
My cluster has been unreachable for hours now; it's starting to look scary...
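When the ceph CLI hangs because no monitor answers, any mon process that is still alive can be queried locally over its admin socket; the socket path below assumes default naming, and the mon id "a" is a placeholder:

<pre>
# Ask a monitor for its own view of state and quorum, bypassing the
# normal client path (run on the machine hosting that mon).
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status
</pre>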
2012-08-30 21:32 UTC - Xiaopong Tran (xp@renzhi.ca):

The Ceph cluster is totally unreachable at this point. The only interesting thing in ceph.log (on one of the other machines, not the one where the mon is crashing, since during this time I can't even ssh into it) is this:
<pre>
2012-08-31 08:16:57.839163 osd.30 10.1.0.12:6801/2138 10326088 : [WRN] slow request 1724174.352231 seconds old, received at 2012-08-11 09:20:43.486859: osd_op(mds.0.3:7976960 10000021b5f.00000000 [delete] 0.18371e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:57.839168 osd.30 10.1.0.12:6801/2138 10326089 : [WRN] slow request 1724174.341145 seconds old, received at 2012-08-11 09:20:43.497945: osd_op(mds.0.3:7977137 10000026752.00000001 [delete] 0.b4d35e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:57.839172 osd.30 10.1.0.12:6801/2138 10326090 : [WRN] slow request 1724169.390402 seconds old, received at 2012-08-11 09:20:48.448688: osd_op(mds.0.3:7980425 1000001dcdd.00000000 [delete] 0.634fe8c snapc 1=[]) v4 currently waiting for sub ops
2012-08-31 08:16:57.839176 osd.30 10.1.0.12:6801/2138 10326091 : [WRN] slow request 1724154.529565 seconds old, received at 2012-08-11 09:21:03.309525: osd_op(mds.0.3:7981591 100000268a3.00000001 [delete] 0.52de9e8c snapc 1=[]) v4 currently no flag points reached
2012-08-31 08:17:10.221551 mon.1 10.1.0.13:6789/0 88603 : [INF] pgmap v2260517: 11552 pgs: 11551 active+clean, 1 active+clean+replay; 3351 GB data, 10574 GB used, 93682 GB / 101 TB avail
2012-08-31 08:17:17.377889 mon.1 10.1.0.13:6789/0 88604 : [INF] pgmap v2260518: 11552 pgs: 11551 active+clean, 1 active+clean+replay; 3351 GB data, 10574 GB used, 93682 GB / 101 TB avail
2012-08-31 08:17:17.485759 mon.1 10.1.0.13:6789/0 88605 : [INF] mdsmap e675: 1/1/1 up {0=c=up:active(laggy or crashed)}
2012-08-31 08:17:20.671111 mon.1 10.1.0.13:6789/0 88606 : [INF] pgmap v2260519: 11552 pgs: 11551 active+clean, 1 active+clean+replay; 3352 GB data, 10574 GB used, 93682 GB / 101 TB avail
2012-08-31 08:16:57.839180 osd.30 10.1.0.12:6801/2138 10326092 : [WRN] slow request 1724119.438966 seconds old, received at 2012-08-11 09:21:38.400124: osd_op(mds.0.3:7982856 10000000b6b.00000000 [delete] 0.ab3c4e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:58.841239 osd.30 10.1.0.12:6801/2138 10326093 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 1724175.354302 secs
2012-08-31 08:16:58.841248 osd.30 10.1.0.12:6801/2138 10326094 : [WRN] slow request 1724175.354302 seconds old, received at 2012-08-11 09:20:43.486859: osd_op(mds.0.3:7976960 10000021b5f.00000000 [delete] 0.18371e8c snapc 1=[]) v4 currently started
2012-08-31 08:16:58.841253 osd.30 10.1.0.12:6801/2138 10326095 : [WRN] slow request 1724175.343216 seconds old, received at 2012-08-11 09:20:43.497945: osd_op(mds.0.3:7977137 10000026752.00000001 [delete] 0.b4d35e8c snapc 1=[]) v4 currently started
</pre>
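For scale: 1,724,174 seconds is just under 20 days, which matches the 2012-08-11 receive times, so these ops were stuck long before the current outage. If this build's admin socket exposes it, the same blocked ops can be listed live on the OSD; the socket path below assumes the default location on osd.30's host:

<pre>
# List the ops the OSD is currently tracking, with age and state
# (the same data behind the "slow request" warnings above).
ceph --admin-daemon /var/run/ceph/ceph-osd.30.asok dump_ops_in_flight
</pre>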
One thing I notice is that the mon seems to eat up a lot of memory. Here is some info from another machine:
<pre>
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2126 root 20 0 1397m 459m 0 S 13.6 1.4 1208:21 ceph-osd
1986 root 20 0 1272m 220m 0 S 11.6 0.7 1217:09 ceph-osd
2499 root 20 0 1087m 171m 0 S 1.0 0.5 386:43.58 ceph-osd
9705 root 20 0 0 0 0 S 1.0 0.0 0:11.87 kworker/0:0
1717 root 20 0 68.0g 25g 456 S 0.7 81.9 9064:40 ceph-mon
2056 root 20 0 1051m 155m 0 S 0.7 0.5 402:21.84 ceph-osd
2337 root 20 0 1108m 177m 0 S 0.7 0.6 407:13.85 ceph-osd
2672 root 20 0 1134m 190m 0 S 0.7 0.6 500:59.67 ceph-osd
</pre>
Is that normal?
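For what it's worth, a 25 GB resident set (82% of RAM) for ceph-mon is far outside the norm next to the OSDs' few hundred MB each. A generic way to confirm steady growth rather than a one-off spike (plain shell, nothing Ceph-specific; the interval and log name are arbitrary):

<pre>
# Sample ceph-mon memory every 5 minutes; a monotonically growing
# RSS column points at a leak rather than a transient spike.
while sleep 300; do
    date '+%F %T'
    ps -C ceph-mon -o pid,rss,vsz,args
done | tee -a ceph-mon-rss.log
</pre>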
2012-08-31 08:21 UTC - Sage Weil (sage@newdream.net):

This is unrelated to the heartbeat problem; opened new bug #3067 (mon: runaway memory, https://tracker.ceph.com/issues/3067)!