Bug #4116 (closed)

common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

Added by Evan Felix about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags: 0.56.2
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have been experiencing slowness lately, with lots of messages in ceph -w about operations taking a long time. During this I keep losing OSDs with a "hit suicide timeout" assert.

ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x307) [0x895387]
2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x895ba7]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x896098]
4: (CephContextServiceThread::entry()+0x5d) [0x7b871d]
5: (()+0x7851) [0x7ff2fbd26851]
6: (clone()+0x6d) [0x7ff2fa38511d]

That is the stack trace, but from other bugs it seems this is probably caused by other issues.

I am attaching logs and objdump files. If there is more debugging I can turn on that would help determine the issue, please advise.

osd.8 uses up ~13G of memory before crashing, whereas other OSDs normally use about 2G.


Files

osd.0.log.gz (170 KB) osd.0 log, Evan Felix, 02/13/2013 10:48 AM
osd.8.log.gz (205 KB) osd.8 log, Evan Felix, 02/13/2013 10:48 AM
cepd-osd.objdump.gz (10.1 MB) 0.56.2 objdump, Evan Felix, 02/13/2013 10:48 AM
osd.5.log.gz (2.65 MB) log for 5, Evan Felix, 02/14/2013 10:51 AM
osd.5.log.gz (2.65 MB) log for 5, Evan Felix, 02/14/2013 11:22 AM
gdb.txt (226 KB) pg dump and backtrace, Evan Felix, 02/18/2013 09:58 AM
gdb.txt (51.7 KB) full backtrace of all threads, Evan Felix, 02/18/2013 10:44 AM
osd.5.log.10.gz (3.16 MB) debug log with more debug, Evan Felix, 02/18/2013 02:16 PM
pg4.0.log (3.87 KB) pg 4.0 query, Evan Felix, 02/19/2013 09:43 AM
ceph-osd.37.log.gz (2.85 MB) osd.37 suicide timeout, Wido den Hollander, 03/29/2013 02:43 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Fix #4192: osd: fix log trimming (Resolved, Sage Weil, 02/19/2013)

Actions #1

Updated by Evan Felix about 11 years ago

I've also seen these Journal errors at times:
2013-02-13 12:45:55.029747 7f71cabfd700 1 journal check_for_full at 643530752 : JOURNAL FULL 643530752 >= 1048571903 (max_size 1048576000 start 643530752)
2013-02-13 12:45:55.029792 7f71cabfd700 0 journal JOURNAL TOO SMALL: continuing, but slow: item 1381216256 > journal 1048571904 (usable)

Would it be beneficial for me to increase the journal sizes?

Actions #2

Updated by Sage Weil about 11 years ago

Evan Felix wrote:

I've also seen these Journal errors at times:
2013-02-13 12:45:55.029747 7f71cabfd700 1 journal check_for_full at 643530752 : JOURNAL FULL 643530752 >= 1048571903 (max_size 1048576000 start 643530752)
2013-02-13 12:45:55.029792 7f71cabfd700 0 journal JOURNAL TOO SMALL: continuing, but slow: item 1381216256 > journal 1048571904 (usable)

Would it be beneficial for me to increase the journal sizes?

Hi Evan!

Yes... a larger journal will certainly help avoid the FULL messages. 1GB isn't particularly big; the default is 5 or 10 (I forget).
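
For reference, a minimal ceph.conf sketch of that change; 5120 here is just an example value (the option takes megabytes, so ~5 GB):

[osd]
        osd journal size = 5120    # size in MB; an existing journal must be flushed and recreated for this to take effect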

The heartbeat issue sounds very much like a bug that is fixed and going into 0.56.3 (releasing later today). Can you try that, or latest bobtail branch, and see if the issue goes away? Be warned that the first restart of the osd may take a while as it "catches up".

Actions #3

Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to 7
Actions #5

Updated by Greg Farnum about 11 years ago

That JOURNAL TOO SMALL message indicates you're sending single operations which are larger than your journal. And indeed your journal is 1GB and that op is 1.3GB.

If you're doing a lot of stuff with gigabyte-scale objects, it could also explain why you're getting a bunch of suicide timeouts (which is the OSD complaining that it gave the filesystem something to do, and it went away and never came back). Ceph is in general optimized for megabyte-scale objects, and you can do larger ones but will need to tune things differently — I'd really recommend chunking of some sort instead.
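
For what it's worth, a rough sketch of what client-side chunking could look like with stock tools; the pool name "data", the file name, and the 4 MB chunk size are all illustrative, not taken from this ticket:

# split a large file into ~4 MB pieces and store each piece as its own RADOS object
split -b 4M bigfile bigfile.chunk.
for c in bigfile.chunk.*; do rados -p data put "$c" "$c"; done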

Actions #6

Updated by Evan Felix about 11 years ago

The objects in question are about 1360713573, though I only do small updates inside the file. Does changing the journal size in the ceph.conf file make it grow?

Actions #7

Updated by Greg Farnum about 11 years ago

No, you'd have to shut down the OSD, do a journal flush, and then re-create the journal. It's sparsely documented at http://ceph.com/docs/master/man/8/ceph-osd/?highlight=flush-journal#cmdoption-ceph-osd--flush-journal

Actions #8

Updated by Evan Felix about 11 years ago

I'll try the new code when it is released. For now I've resized the journal on one of the problem OSDs (osd.2) with these commands:

ceph-osd -i 2 --flush-journal     # flush the journal contents to the object store
mv /var/lib/ceph/osd/ceph-2/journal /var/lib/ceph/osd/ceph-2/journal.old
ceph-osd -i 2 --mkjournal         # create a new journal at the configured (larger) size
Actions #9

Updated by Evan Felix about 11 years ago

Whoops, I grabbed the mtime; they are all about 13380480 bytes.

Evan Felix wrote:

The objects in question are about 1360713573, though I only do small updates inside the file. Does changing the journal size in the ceph.conf file make it grow?

Actions #10

Updated by Evan Felix about 11 years ago

I'm using the 0.56.3 RPMs that were released yesterday for elg.x86_64, with a larger journal of 5G now, but two of my OSDs are still getting the suicide error. I have crash dumps if you want them.

Actions #12

Updated by Evan Felix about 11 years ago

Sage, I've been testing for many days now, and two of my OSDs dump core after consuming lots of memory. I've poked around in one and it seems stuck in the OSD::process_peering_events call. The pg it is working on has hundreds of entries, and one vector (_M_buckets) with 12582917 entries.

I am attaching the gdb backtrace and pg entry. Please advise on next steps, as I'd really like to get the system back into a usable state.

Actions #13

Updated by Sage Weil about 11 years ago

It looks like IndexedLog::caller_ops is growing without bound; I'm trying to reproduce that, although it is likely unrelated to your problem here.

What are the other threads doing? It has a lot of old map epochs to process; this is somewhat expected right after the upgrade to .3, but it shouldn't persist after an initial (slow) OSD restart.

Also, how did you make gdb print the STL containers so nicely? :)

Actions #14

Updated by Evan Felix about 11 years ago

Most of the threads are in Wait() or tcp_wait; I'll attach full thread backtraces.

I just used gdb to dump the info (in the OSD::process_peering_events frame):
set logging on
print *pg
set logging off

Actions #15

Updated by Sage Weil about 11 years ago

Hmm, it's not clear why it would get stuck there. Can you reproduce this with 'debug osd = 10'?
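
In case it helps, a minimal sketch of that setting in ceph.conf (put it under [osd] and restart the daemon so it is picked up):

[osd]
        debug osd = 10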

The earlier log you attached may have had it, but we just upgraded redmine and the attachments seem to be missing currently.

Actions #16

Updated by Evan Felix about 11 years ago

See attached log

Actions #17

Updated by Sage Weil about 11 years ago

Whoa, okay... one of your pgs has a 1.3GB log object: 'read_log 0~1381204356'. That is what it is trying to write out when it hits the suicide timeout. The pg claims to have done 90 million updates over its lifetime (the last 10M of which are in the log). Other pgs in that pool have similar numbers, so that does seem to be real... that also explains why that hash_map was huge.

I'm not sure why the PG didn't trim its log before. At this point, though, the simplest thing may be to just let it chew through the peering process (and then trim). You can increase the heartbeat timeouts with 'osd op thread timeout' (the suicide threshold is 10x that value).
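
A minimal ceph.conf sketch of that tweak; 45 is purely an example value:

[osd]
        osd op thread timeout = 45    # per the note above, the suicide threshold is 10x this, i.e. ~450s here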

Actions #18

Updated by Evan Felix about 11 years ago

Is there a way to determine what object it is? I expect all my objects to be about 13M, nowhere near a gig. I'll adjust the timeout and let it run.

Actions #19

Updated by Sage Weil about 11 years ago

It's not a user object; it's the internal pg log of recent requests. No way to change it without surgery...

Let us know if a longer timeout doesn't let it recover and move on!

Actions #20

Updated by Evan Felix about 11 years ago

I've set it to 45... and it's sucking up memory... and the OOM killer got it. I'll see if I can add swap; it was at 26G of memory before it died.

Actions #21

Updated by Evan Felix about 11 years ago

Sage, the rados commands I'm using have started responding again. I had to move through a few OSDs changing the timeout; as one came up a couple of others died, but they are holding for now.

Actions #22

Updated by Sage Weil about 11 years ago

Excellent. Can you paste the output from 'ceph pg 4.0 query' so we can verify that that ginormous log got trimmed?

thanks!

Actions #23

Updated by Sage Weil about 11 years ago

  • Assignee set to Sage Weil
Actions #24

Updated by Evan Felix about 11 years ago

Here is the current output of 'ceph pg 4.0 query'.

Actions #25

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to Urgent

Okay, the problem here is that it's not trimming the log because the PG is degraded. We need to fix that ASAP. I'll get a patch pushed and backported to bobtail for you shortly.

Thanks!

Actions #26

Updated by Sage Weil about 11 years ago

I pushed a wip-4116-workaround branch with a workaround that will make it trim the log. It might be a painful restart (it'll load the log one last time) but it should trim it shortly after that. If you install that and restart osd.8 first, it'll get you back in business! Thanks for helping track this down. Issue #4192 will track a permanent fix.

Actions #27

Updated by Sage Weil about 11 years ago

  • Status changed from 7 to Resolved

Evan, I'm going to assume this is resolved; let me know if not! #4192 will track the long-term fix.

Actions #28

Updated by Evan Felix about 11 years ago

Yes, the cluster has returned to a working state and the OSDs are staying up with the new code.

Actions #29

Updated by Wido den Hollander about 11 years ago

I'm seeing this one on one of my OSDs right now.

It seems to work properly for some time, but then it suddenly dies.

The backtrace is this:

 ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f11b21b1cb0]
 3: (gsignal()+0x35) [0x7f11b0b70425]
 4: (abort()+0x17b) [0x7f11b0b73b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f11b14c269d]
 6: (()+0xb5846) [0x7f11b14c0846]
 7: (()+0xb5873) [0x7f11b14c0873]
 8: (()+0xb596e) [0x7f11b14c096e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8343af]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x7a2547]
 11: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x7a2ef7]
 12: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x7a34a3]
 13: (CephContextServiceThread::entry()+0x55) [0x83fd75]
 14: (()+0x7e9a) [0x7f11b21a9e9a]
 15: (clone()+0x6d) [0x7f11b0c2dcbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I've attached the logs of osd.37.

The debug level wasn't that high, so I'm not sure if all the data is in there.

The OSD doesn't have any disk issues or btrfs messages.

The kernel is 3.8.0 with Ceph 0.56.4.
