Bug #4116 (closed)

common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

Added by Evan Felix about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags: 0.56.2
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have been experiencing slowness lately, with lots of messages in ceph -w about operations taking a long time. During this I keep losing OSDs with a "hit suicide timeout" assert.

ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0x895387]
2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x895ba7]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x896098]
4: (CephContextServiceThread::entry()+0x5d) [0x7b871d]
5: (()+0x7851) [0x7ff2fbd26851]
6: (clone()+0x6d) [0x7ff2fa38511d]

That is the stack trace, but from other bugs it seems this is probably caused by other issues.

I am attaching logs and objdump files. If there is more debugging I can turn on that would help determine the issue, please advise.

osd.8 uses up ~13G of memory before crashing, whereas other OSDs normally use about 2G.


Files

osd.0.log.gz (170 KB) osd.0 log, Evan Felix, 02/13/2013 10:48 AM
osd.8.log.gz (205 KB) osd.8 log, Evan Felix, 02/13/2013 10:48 AM
cepd-osd.objdump.gz (10.1 MB) 0.56.2 objdump, Evan Felix, 02/13/2013 10:48 AM
osd.5.log.gz (2.65 MB) log for 5, Evan Felix, 02/14/2013 10:51 AM
osd.5.log.gz (2.65 MB) log for 5, Evan Felix, 02/14/2013 11:22 AM
gdb.txt (226 KB) pg dump and backtrace, Evan Felix, 02/18/2013 09:58 AM
gdb.txt (51.7 KB) full backtrace of all threads, Evan Felix, 02/18/2013 10:44 AM
osd.5.log.10.gz (3.16 MB) debug log with more debug, Evan Felix, 02/18/2013 02:16 PM
pg4.0.log (3.87 KB) pg 4.0 query, Evan Felix, 02/19/2013 09:43 AM
ceph-osd.37.log.gz (2.85 MB) osd.37 suicide timeout, Wido den Hollander, 03/29/2013 02:43 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Fix #4192: osd: fix log trimming (Resolved, Sage Weil, 02/19/2013)

Actions #1

Updated by Evan Felix about 11 years ago

I've also seen these Journal errors at times:
2013-02-13 12:45:55.029747 7f71cabfd700 1 journal check_for_full at 643530752 : JOURNAL FULL 643530752 >= 1048571903 (max_size 1048576000 start 643530752)
2013-02-13 12:45:55.029792 7f71cabfd700 0 journal JOURNAL TOO SMALL: continuing, but slow: item 1381216256 > journal 1048571904 (usable)

Would it be beneficial for me to increase the journal sizes?

Actions #2

Updated by Sage Weil about 11 years ago

Evan Felix wrote:

I've also seen these Journal errors at times:
2013-02-13 12:45:55.029747 7f71cabfd700 1 journal check_for_full at 643530752 : JOURNAL FULL 643530752 >= 1048571903 (max_size 1048576000 start 643530752)
2013-02-13 12:45:55.029792 7f71cabfd700 0 journal JOURNAL TOO SMALL: continuing, but slow: item 1381216256 > journal 1048571904 (usable)

Would it be beneficial for me to increase the journal sizes?

Hi Evan!

Yes... a larger journal will certainly help avoid the FULL messages. 1GB isn't particularly big; the default is 5 or 10 (I forget).
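
For reference, a minimal ceph.conf sketch of that change; 5120 here is just an example value (the option takes megabytes, so ~5 GB):

[osd]
        osd journal size = 5120    # size in MB; an existing journal must be flushed and recreated for this to take effect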

The heartbeat issue sounds very much like a bug that is fixed and going into 0.56.3 (releasing later today). Can you try that, or latest bobtail branch, and see if the issue goes away? Be warned that the first restart of the osd may take a while as it "catches up".

Actions #3

Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to 7
Actions #5

Updated by Greg Farnum about 11 years ago

That JOURNAL TOO SMALL message indicates you're sending single operations which are larger than your journal. And indeed your journal is 1GB and that op is 1.3GB.

If you're doing a lot of stuff with gigabyte-scale objects, it could also explain why you're getting a bunch of suicide timeouts (which is the OSD complaining that it gave the filesystem something to do, and it went away and never came back). Ceph is in general optimized for megabyte-scale objects, and you can do larger ones but will need to tune things differently — I'd really recommend chunking of some sort instead.
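
For what it's worth, a rough sketch of what client-side chunking could look like with stock tools; the pool name "data", the file name, and the 4 MB chunk size are all illustrative, not taken from this ticket:

# split a large file into ~4 MB pieces and store each piece as its own RADOS object
split -b 4M bigfile bigfile.chunk.
for c in bigfile.chunk.*; do rados -p data put "$c" "$c"; done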

Actions #6

Updated by Evan Felix about 11 years ago

The objects in question are about 1360713573, though I only do small updates inside the file. Does changing the journal size in the ceph.conf file make it grow?

Actions #7

Updated by Greg Farnum about 11 years ago

No, you'd have to shut down the OSD, do a journal flush, and then re-create the journal. It's sparsely documented at http://ceph.com/docs/master/man/8/ceph-osd/?highlight=flush-journal#cmdoption-ceph-osd--flush-journal

Actions #8

Updated by Evan Felix about 11 years ago

I'll try the new code when it is released. For now I've resized the journal on one of the problem OSDs (osd.2) with these commands:

ceph-osd -i 2 --flush-journal     # flush the journal contents to the object store
mv /var/lib/ceph/osd/ceph-2/journal /var/lib/ceph/osd/ceph-2/journal.old
ceph-osd -i 2 --mkjournal         # create a new journal at the configured (larger) size
Actions #9

Updated by Evan Felix about 11 years ago

Whoops, I grabbed the mtime; they are all about 13380480 bytes.

Evan Felix wrote:

The objects in question are about 1360713573, though I only do small updates inside the file. Does changing the journal size in the ceph.conf file make it grow?

Actions #10

Updated by Evan Felix about 11 years ago

I'm using the 0.56.3 RPMs that were released yesterday for elg.x86_64, with a larger journal of 5G now, but two of my OSDs are still getting the suicide error. I have crash dumps if you want them.

Actions #12

Updated by Evan Felix about 11 years ago

Sage, I've been testing for many days now, and two of my OSDs dump core after consuming lots of memory. I've poked around in one and it seems stuck in the OSD::process_peering_events call. The pg it is working on has hundreds of entries, and one vector (_M_buckets) with 12582917 entries.

I am attaching the gdb backtrace and pg entry. Please advise on next steps, as I'd really like to get the system back into a usable state.

Actions #13

Updated by Sage Weil about 11 years ago

It looks like IndexedLog::caller_ops is growing without bound; I'm trying to reproduce that, although it is likely unrelated to your problem here.

What are the other threads doing? It has a lot of old map epochs to process; this is somewhat expected right after the upgrade to .3, but it shouldn't persist after an initial (slow) OSD restart.

Also, how did you make gdb print the STL containers so nicely? :)

Actions #14

Updated by Evan Felix about 11 years ago

Most of the threads are in Wait() or tcp_wait; I'll attach full thread backtraces.

I just used gdb to dump the info (in the OSD::process_peering_events frame):
set logging on
print *pg
set logging off

Actions #15

Updated by Sage Weil about 11 years ago

Hmm, it's not clear why it would get stuck there. Can you reproduce this with 'debug osd = 10'?
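
In case it helps, a minimal sketch of that setting in ceph.conf (put it under [osd] and restart the daemon so it is picked up):

[osd]
        debug osd = 10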

The earlier log you attached may have had it, but we just upgraded redmine and the attachments seem to be missing currently.

Actions #16

Updated by Evan Felix about 11 years ago

See attached log

Actions #17

Updated by Sage Weil about 11 years ago

Whoa, okay... one of your pgs has a 1.3GB log object: 'read_log 0~1381204356'. That is what it is trying to write out when it hits the suicide timeout. The pg claims to have done 90 million updates over its lifetime (the last 10M of which are in the log). Other pgs in that pool have similar numbers, so that does seem to be real... that also explains why that hash_map was huge.

I'm not sure why the PG didn't trim its log before. At this point, though, the simplest thing may be to just let it chew through the peering process (and then trim). You can increase the heartbeat timeouts with 'osd op thread timeout' (the suicide threshold is 10x that value).
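
A minimal ceph.conf sketch of that tweak; 45 is purely an example value:

[osd]
        osd op thread timeout = 45    # per the note above, the suicide threshold is 10x this, i.e. ~450s here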

Actions #18

Updated by Evan Felix about 11 years ago

Is there a way to determine what object it is? I expect all my objects to be about 13M, nowhere near a gig. I'll adjust the timeout and let it run.

Actions #19

Updated by Sage Weil about 11 years ago

It's not a user object; it's the internal pg log of recent requests. No way to change it without surgery...

Let us know if a longer timeout doesn't let it recover and move on!

Actions #20

Updated by Evan Felix about 11 years ago

I've set it to 45... and it's sucking up memory... and the OOM killer got it. I'll see if I can add swap; it was at 26G of memory before it died.

Actions #21

Updated by Evan Felix about 11 years ago

Sage, the rados commands I'm using have started responding again. I had to move through a few OSDs changing the timeout; as one came up a couple of others died, but they are holding for now.

Actions #22

Updated by Sage Weil about 11 years ago

Excellent. Can you paste the output from 'ceph pg 4.0 query' so we can verify that that ginormous log got trimmed?

thanks!

Actions #23

Updated by Sage Weil about 11 years ago

  • Assignee set to Sage Weil
Actions #24

Updated by Evan Felix about 11 years ago

Here is the current output of 'ceph pg 4.0 query'.

Actions #25

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to Urgent

Okay, the problem here is that it's not trimming the log because the PG is degraded. We need to fix that ASAP. I'll get a patch pushed and backported to bobtail for you shortly.

Thanks!

Actions #26

Updated by Sage Weil about 11 years ago

I pushed a wip-4116-workaround branch with a workaround that will make it trim the log. It might be a painful restart (it'll load the log one last time) but it should trim it shortly after that. If you install that and restart osd.8 first, it'll get you back in business! Thanks for helping track this down. Issue #4192 will track a permanent fix.

Actions #27

Updated by Sage Weil about 11 years ago

  • Status changed from 7 to Resolved

Evan, I'm going to assume this is resolved; let me know if not! #4192 will track the long-term fix.

Actions #28

Updated by Evan Felix about 11 years ago

Yes, the cluster has returned to a working state and the OSDs are staying up with the new code.

Actions #29

Updated by Wido den Hollander about 11 years ago

I'm seeing this one on one of my OSDs right now.

It seems to work properly for some time, but then it suddenly dies.

The backtrace is this:

 ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f11b21b1cb0]
 3: (gsignal()+0x35) [0x7f11b0b70425]
 4: (abort()+0x17b) [0x7f11b0b73b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f11b14c269d]
 6: (()+0xb5846) [0x7f11b14c0846]
 7: (()+0xb5873) [0x7f11b14c0873]
 8: (()+0xb596e) [0x7f11b14c096e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8343af]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x7a2547]
 11: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x7a2ef7]
 12: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x7a34a3]
 13: (CephContextServiceThread::entry()+0x55) [0x83fd75]
 14: (()+0x7e9a) [0x7f11b21a9e9a]
 15: (clone()+0x6d) [0x7f11b0c2dcbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I've attached the logs of osd.37.

The debug level wasn't that high, so I'm not sure if all the data is in there.

The OSD doesn't have any disk issues or btrfs messages.

The kernel is 3.8.0 with Ceph 0.56.4.
