Bug #6386

ceph-osd processes are having hang tasks

Added by Andrei Mikhailovsky over 10 years ago. Updated over 10 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am having issues with ceph-osd processes showing hung task messages in the dmesg output on both of my ceph-osd servers. An example of the messages:

[263403.883274] INFO: task ceph-osd:9440 blocked for more than 120 seconds.
[263403.883380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[263403.883501] ceph-osd D ffff88050f3e94a0 0 9440 1 0x00000000
[263403.883505] ffff8804c1f6bcd8 0000000000000082 0000000000000000 ffff88052fc13f40
[263403.883510] ffff8804c1f6bfd8 ffff8804c1f6bfd8 ffff8804c1f6bfd8 0000000000013f40
[263403.883515] ffff88000ea30000 ffff88000ea345c0 ffff8804f5b58380 ffff8804f5b58380
[263403.883520] Call Trace:
[263403.883525] [<ffffffff816f2de9>] schedule+0x29/0x70
[263403.883529] [<ffffffff8105fbe5>] exit_mm+0x85/0x130
[263403.883534] [<ffffffff8105fdf3>] do_exit+0x163/0x480
[263403.883538] [<ffffffff8106d64b>] ? __dequeue_signal+0x6b/0xb0
[263403.883542] [<ffffffff810601a4>] do_group_exit+0x44/0xa0
[263403.883547] [<ffffffff810701ce>] get_signal_to_deliver+0x22e/0x490
[263403.883551] [<ffffffff81014be9>] do_signal+0x29/0x130
[263403.883555] [<ffffffff810bb13c>] ? do_futex+0x7c/0x1b0
[263403.883559] [<ffffffff8107bdcd>] ? task_work_run+0xcd/0xf0
[263403.883563] [<ffffffff810bb3b7>] ? sys_futex+0x147/0x1a0
[263403.883568] [<ffffffff81014d70>] do_notify_resume+0x80/0xc0
[263403.883572] [<ffffffff816fce5a>] int_signal+0x12/0x17
[263549.958642] init: ceph-osd (ceph/9) main process (9408) killed by ABRT signal
[263549.958704] init: ceph-osd (ceph/9) main process ended, respawning
[995916.698139] init: ceph-osd (ceph/10) main process (28976) killed by KILL signal
[995916.712471] init: ceph-osd (ceph/16) main process (28530) killed by KILL signal
[1038995.998709] init: ceph-osd (ceph/13) main process (31053) killed by ABRT signal
[1038995.999886] init: ceph-osd (ceph/13) main process ended, respawning
[1196156.643770] init: ceph-osd (ceph/12) main process (31355) killed by ABRT signal
[1196156.644902] init: ceph-osd (ceph/12) main process ended, respawning

Both of my OSD servers are running Ubuntu 12.04 with the latest updates, using the backported kernel version 3.8.0-30-generic. Both servers have 24 GB of RAM and run the ceph monitor, osd and mds services. Each server has 8 OSDs. Networking between the servers is 40 Gbit/s InfiniBand (IPoIB).
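For anyone triaging similar output, here is a minimal Python sketch that counts hung-task reports per process and lists the OSDs that init reported as killed. It is not part of any Ceph tooling. The regular expressions are assumptions based only on the line formats in the excerpt above, and the name `summarize_dmesg` is hypothetical.

```python
import re
from collections import Counter

# Kernel hung-task report, e.g.
# [263403.883274] INFO: task ceph-osd:9440 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Upstart (init) report, e.g.
# [263549.958642] init: ceph-osd (ceph/9) main process (9408) killed by ABRT signal
KILL_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+init: ceph-osd \(ceph/(?P<osd>\d+)\) "
    r"main process \((?P<pid>\d+)\) killed by (?P<sig>\w+) signal"
)


def summarize_dmesg(text):
    """Return (hung, killed) for a block of dmesg text.

    hung   -- Counter keyed by (process name, pid) of hung-task reports
    killed -- list of (osd id, pid, signal name) in order of appearance
    """
    hung = Counter()
    killed = []
    for raw in text.splitlines():
        line = raw.strip()
        m = HUNG_RE.match(line)
        if m:
            hung[(m.group("comm"), int(m.group("pid")))] += 1
            continue
        m = KILL_RE.match(line)
        if m:
            killed.append((int(m.group("osd")), int(m.group("pid")), m.group("sig")))
    return hung, killed


# A few lines copied from the excerpt in this report.
SAMPLE = """
[263403.883274] INFO: task ceph-osd:9440 blocked for more than 120 seconds.
[263403.883520] Call Trace:
[263549.958642] init: ceph-osd (ceph/9) main process (9408) killed by ABRT signal
[263549.958704] init: ceph-osd (ceph/9) main process ended, respawning
[995916.698139] init: ceph-osd (ceph/10) main process (28976) killed by KILL signal
"""

hung, killed = summarize_dmesg(SAMPLE)
for (comm, pid), count in hung.items():
    print(f"hung task: {comm} pid={pid} reports={count}")
for osd, pid, sig in killed:
    print(f"osd.{osd} pid={pid} killed by {sig}")
```

To run it against a full log, pass the saved dmesg output to `summarize_dmesg` in place of `SAMPLE`. Grouping by OSD id makes it easier to see whether the failures are spread across many OSDs or concentrated on a few.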

History

#1 Updated by Sage Weil over 10 years ago

  • Status changed from New to Need More Info

We used to see this periodically but had no real clue what caused it. At some point we stopped seeing it. I suspect that it is just a problem with older kernels. Can you try something newer than 3.8? How regularly does this happen?

#2 Updated by Andrei Mikhailovsky over 10 years ago

Sage,

I've seen this once on ceph 0.67.3. However, I upgraded from 0.61.7 about 3 weeks ago, and this happened on both servers, which had 2 weeks of uptime. I've rebooted the servers and will keep an eye on them to see if it happens again.

Regarding the kernel - what version would you recommend to try?

Thanks

Andrei

#3 Updated by Andrei Mikhailovsky over 10 years ago

Let me clarify the previous message. It happened once, but the hung tasks occurred on many OSDs, not just on one. I've provided an example of the dmesg output, which was full of those messages.

#4 Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to Normal
  • Source changed from other to Community (user)

I would try whatever kernel Ubuntu has for precise that is > 3.8; I'm not sure offhand what they have. If nothing else, you can grab their packages of an upstream stable kernel (e.g. 3.10.y).
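As a quick way to confirm which side of that line a host is on, here is a small Python sketch that compares a kernel release string against 3.8. The 3.8 baseline is only the version mentioned in this thread, not a confirmed threshold for the problem, and the helper names `kernel_version` and `is_newer_than` are hypothetical.

```python
import platform
import re


def kernel_version(release):
    """Parse (major, minor) from a release string such as '3.8.0-30-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def is_newer_than(release, baseline=(3, 8)):
    """True if the release's (major, minor) is strictly greater than baseline."""
    return kernel_version(release) > baseline


# Only inspect the running kernel on Linux; other platforms report
# release strings in different formats.
if platform.system() == "Linux":
    running = platform.release()
    print(running, "newer than 3.8:", is_newer_than(running))
```

Comparing (major, minor) tuples avoids the string-comparison trap where "3.10" would sort before "3.8".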

#5 Updated by Andrei Mikhailovsky over 10 years ago

Sage,

I had this happen again while I was at the Ceph Day ((. Anyway, the behaviour seemed pretty similar. A bunch of OSDs died at about the same time on both OSD servers. VMs had hung tasks.

I am downloading the 3.11 kernel from Ubuntu 13.10 to see if it will work better. Will keep an eye on it and update the ticket.

#6 Updated by Andrei Mikhailovsky over 10 years ago

Sage, sorry, I think I've mixed up the bug numbers. I meant to post that to another bug, where I've uploaded a bunch of logs from the OSD server. I can't seem to find that bug now.

#7 Updated by Sage Weil over 10 years ago

  • Status changed from Need More Info to Can't reproduce
