Bug #3546

closed

CEPH 0.48.2 OSD crashed causing kernel RBD clients to reboot

Added by Kevin Scheunemann over 11 years ago. Updated over 11 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)

Description

Here is a stack trace of an OSD crash. After the OSD crashed, all of the hosts using the RBD kernel client rebooted.

Nov 28 00:42:02 d44-1e-a1-3a-a2-50 kernel: [19830344.673997] ceph-osd[23148] general protection ip:7f3a2e915df7 sp:7f3a01137700 error:0 in libtcmalloc.so.0.1.0[7f3a2e8f7000+3f000]
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940146] INFO: task ceph-osd:23017 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940370] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940696] ceph-osd D 0000000000000001 0 23017 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940709] ffff880101d71cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940730] ffff880101d71fd8 ffff880101d71fd8 ffff880101d71fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940745] ffff880809518000 ffff8800030e96f0 ffff8800030e96f0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940760] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940779] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940793] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940803] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940814] [<ffffffff8109d8bf>] ? __unqueue_futex+0x3f/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940826] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940836] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940846] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940858] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940867] [<ffffffff8106653b>] ? do_fork+0x15b/0x2e0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940877] [<ffffffff810a098c>] ? do_futex+0xbc/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940885] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940894] [<ffffffff8107d1c2>] ? set_current_blocked+0x52/0x70
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940903] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940913] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.940925] INFO: task ceph-osd:23018 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941133] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941459] ceph-osd D 0000000000000007 0 23018 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941469] ffff881844849cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941485] ffff881844849fd8 ffff881844849fd8 ffff881844849fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941500] ffff88200c0f44d0 ffff881ce7208000 ffff881ce7208000 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941514] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941523] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941532] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941541] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941550] [<ffffffff8109d8bf>] ? __unqueue_futex+0x3f/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941560] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941569] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941578] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941588] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941597] [<ffffffff810a098c>] ? do_futex+0xbc/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941606] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941615] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941625] [<ffffffff81177fa7>] ? sys_write+0x67/0x90
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941633] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941642] INFO: task ceph-osd:23019 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.941849] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942175] ceph-osd D 0000000000000015 0 23019 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942184] ffff881843513cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942198] ffff881843513fd8 ffff881843513fd8 ffff881843513fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942213] ffff88007d0aade0 ffff881ce72096f0 ffff881ce72096f0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942227] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942235] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942244] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942253] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942263] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942273] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942282] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942292] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942301] [<ffffffff810a098c>] ? do_futex+0xbc/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942309] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942318] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942327] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942335] INFO: task ceph-osd:23020 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942541] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942867] ceph-osd D 000000000000000c 0 23020 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942876] ffff8818410d3cc8 0000000000000082 ffff8818410d3ca8 ffffffff8104c308
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942890] ffff8818410d3fd8 ffff8818410d3fd8 ffff8818410d3fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942905] ffff880e15865bc0 ffff881ce720ade0 0000000000000282 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942919] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942929] [<ffffffff8104c308>] ? __wake_up_common+0x58/0x90
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942938] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942946] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942956] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942966] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942975] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942984] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.942994] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943003] [<ffffffff810a0a16>] ? do_futex+0x146/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943012] [<ffffffff8106bc2b>] ? do_exit+0x26b/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943020] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943030] [<ffffffff810562ca>] ? finish_task_switch+0x4a/0xf0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943040] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943049] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943057] INFO: task ceph-osd:23035 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943590] ceph-osd D 0000000000000007 0 23035 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943598] ffff881843d65cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943613] ffff881843d65fd8 ffff881843d65fd8 ffff881843d65fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943627] ffff882007acdbc0 ffff8820075e5bc0 ffff8820075e5bc0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943641] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943649] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943658] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943667] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943677] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943687] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943696] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943706] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943715] [<ffffffff8106653b>] ? do_fork+0x15b/0x2e0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943723] [<ffffffff810a0978>] ? do_futex+0xa8/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943732] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943740] [<ffffffff8107d1c2>] ? set_current_blocked+0x52/0x70
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943750] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943759] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943767] INFO: task ceph-osd:23036 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.943973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944328] ceph-osd D ffffffff81806240 0 23036 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944337] ffff8818435f1cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944352] ffff8818435f1fd8 ffff8818435f1fd8 ffff8818435f1fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944382] ffff88080c5f8000 ffff88200c0716f0 ffff88200c0716f0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944414] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944429] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944445] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944461] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944476] [<ffffffff8109d8bf>] ? __unqueue_futex+0x3f/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944492] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944513] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944528] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944545] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944561] [<ffffffff810a098c>] ? do_futex+0xbc/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944576] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944592] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944607] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944619] INFO: task ceph-osd:23037 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.944831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945154] ceph-osd D ffffffff81806240 0 23037 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945164] ffff8818426f9cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945179] ffff8818426f9fd8 ffff8818426f9fd8 ffff8818426f9fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945193] ffff88080c56c4d0 ffff882007acade0 ffff882007acade0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945207] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945216] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945225] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945235] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945244] [<ffffffff8109d8bf>] ? __unqueue_futex+0x3f/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945254] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945263] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945272] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945282] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945292] [<ffffffff810a098c>] ? do_futex+0xbc/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945300] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945310] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945319] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945327] INFO: task ceph-osd:23038 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945859] ceph-osd D ffffffff81806240 0 23038 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945868] ffff881843151cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945882] ffff881843151fd8 ffff881843151fd8 ffff881843151fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945896] ffff88080c53dbc0 ffff882007acdbc0 ffff882007acdbc0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945910] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945918] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945928] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945937] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945947] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945957] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945966] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945976] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945984] [<ffffffff8106653b>] ? do_fork+0x15b/0x2e0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.945994] [<ffffffff8107a56b>] ? recalc_sigpending+0x1b/0x50
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946005] [<ffffffff8107ac67>] ? __set_task_blocked+0x37/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946014] [<ffffffff8107d1c2>] ? set_current_blocked+0x52/0x70
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946025] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946041] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946053] INFO: task ceph-osd:23039 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946588] ceph-osd D 0000000000000010 0 23039 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946597] ffff881844819cc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946612] ffff881844819fd8 ffff881844819fd8 ffff881844819fd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946626] ffff88000c36ade0 ffff882007ac96f0 ffff882007ac96f0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946641] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946649] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946658] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946668] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946676] [<ffffffff8109d8bf>] ? __unqueue_futex+0x3f/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946686] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946696] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946704] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946715] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946723] [<ffffffff810a098c>] ? do_futex+0xbc/0x1d0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946732] [<ffffffff810a0baa>] ? sys_futex+0x10a/0x1a0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946741] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946750] [<ffffffff816647d0>] int_signal+0x12/0x17
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946758] INFO: task ceph-osd:23040 blocked for more than 120 seconds.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.946965] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947290] ceph-osd D ffffffff81806240 0 23040 1 0x00000000
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947299] ffff881845befcc8 0000000000000082 0000000000000000 ffffffffffffffe0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947313] ffff881845beffd8 ffff881845beffd8 ffff881845beffd8 0000000000013780
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947327] ffff88080c4ddbc0 ffff882007acc4d0 ffff882007acc4d0 ffff8808096e7380
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947341] Call Trace:
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947349] [<ffffffff81659fdf>] schedule+0x3f/0x60
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947358] [<ffffffff8106b915>] exit_mm+0x85/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947367] [<ffffffff8106bb2e>] do_exit+0x16e/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947377] [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947387] [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947396] [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947406] [<ffffffff81013865>] do_signal+0x45/0x130
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947414] [<ffffffff8106653b>] ? do_fork+0x15b/0x2e0
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947424] [<ffffffff8107a56b>] ? recalc_sigpending+0x1b/0x50
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947434] [<ffffffff8107ac67>] ? __set_task_blocked+0x37/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947443] [<ffffffff8107d1c2>] ? set_current_blocked+0x52/0x70
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947452] [<ffffffff81013b15>] do_notify_resume+0x65/0x80
Nov 28 00:44:22 d44-1e-a1-3a-a2-50 kernel: [19830483.947461] [<ffffffff816647d0>] int_signal+0x12/0x17
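
For triage, the hung-task reports in a log like the one above can be condensed to a list of blocked task IDs per command. The following is only a minimal sketch: the function name `summarize_hung_tasks` is hypothetical, and the pattern assumes the "INFO: task NAME:ID blocked for more than N seconds" wording seen in this log.

```python
import re

# Matches the hung-task header line emitted by the kernel, e.g.
# "INFO: task ceph-osd:23017 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def summarize_hung_tasks(lines):
    """Return {command: sorted list of distinct task IDs} for hung-task reports.

    Lines that are not hung-task headers (register dumps, call traces)
    are ignored.
    """
    tasks = {}
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            command, task_id = match.group(1), int(match.group(2))
            tasks.setdefault(command, set()).add(task_id)
    return {command: sorted(ids) for command, ids in tasks.items()}
```

Feeding it the lines of a syslog file (for example `summarize_hung_tasks(open(path))`, where `path` is whatever file holds the kernel messages) gives one entry per command name, which makes it easier to see that every report here belongs to ceph-osd.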

Actions #1

Updated by Sage Weil over 11 years ago

What kernel version are you running?

Actions #2

Updated by Kevin Scheunemann over 11 years ago

At the time, the clients were running 3.2.0-32, but we have since upgraded to 3.6.9 per another ceph bug.

We have noticed massive memory leaks in the OSDs themselves (we suspect this is what caused the original problem).
An OSD grows by 4 GB to 6 GB of resident memory every 7 days, and virtual memory grows at about the same pace.
We have some pretty neat graphs of this.
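
Growth of this kind can be tracked by sampling a process's resident and virtual size from `/proc/<pid>/status` on Linux. The sketch below is illustrative only: the helper names (`parse_proc_status`, `read_status`, `growth_per_day_gib`) are hypothetical, and it assumes the standard `VmRSS` and `VmSize` fields reported in kB.

```python
def parse_proc_status(text):
    """Extract VmRSS and VmSize (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmRSS", "VmSize"):
            # The value looks like "   4194304 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    return fields


def read_status(pid):
    """Read and parse the status file of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_proc_status(status_file.read())


def growth_per_day_gib(start_kb, end_kb, days):
    """Average growth in GiB per day between two samples taken `days` apart."""
    return (end_kb - start_kb) / (1024 * 1024) / days
```

For scale, 4 GB to 6 GB over 7 days works out to roughly 0.6 to 0.9 GB per day, so two samples taken a day apart should already show the trend.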

Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to Won't Fix

The crash is a known problem with pre-3.4 kernels. Fixes have been backported to 3.4 stable and 3.6 stable kernels, and we will continue that going forward.

What version of the OSDs are you running? Let's open a separate issue for the OSD memory leak. It would be ideal if you could try the current next branch, as some leaks were fixed recently that will go into the next release.
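
To check a client against the pre-3.4 cutoff mentioned above, the kernel release string can be compared numerically. This is a rough sketch with hypothetical helper names; it only looks at the major.minor pair, so it cannot tell whether a particular 3.4 or 3.6 stable point release already contains the backported fixes.

```python
import re


def kernel_version(release):
    """Parse the leading 'major.minor' of a release string such as '3.2.0-32'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_pre_3_4(release):
    """True if the kernel is older than 3.4 (tuple comparison on major, minor)."""
    return kernel_version(release) < (3, 4)
```

On a live host the input could come from `platform.release()`. With the versions reported in this thread, `is_pre_3_4("3.2.0-32")` is true and `is_pre_3_4("3.6.9")` is false.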

Actions #4

Updated by Kevin Scheunemann over 11 years ago

We are using 0.48.2 for the OSDs and our plan is to upgrade to 0.56 (or the next stable release) when it comes out.

Actions #5

Updated by Sage Weil over 11 years ago

There aren't known leaks in argonaut. If you can reproduce with valgrind massif and see where the heap is going, that'd be awesome. Drop by #ceph if you need help.
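
A massif run along the lines suggested above might be launched as follows. This is a sketch under assumptions, not a tested procedure: the function name `massif_command` and the OSD id are hypothetical, and the `ceph-osd` flags (`-i` for the OSD id, `-f` to stay in the foreground) should be checked against the installed version. `--tool=massif` and `--massif-out-file` are standard valgrind options.

```python
import shlex


def massif_command(osd_id, out_file="massif.out.%p"):
    """Build an argv list that runs one OSD in the foreground under massif."""
    return [
        "valgrind",
        "--tool=massif",
        f"--massif-out-file={out_file}",  # valgrind expands %p to the PID
        "ceph-osd",
        "-i", str(osd_id),  # which OSD to start (hypothetical id in the usage)
        "-f",               # assumed: keep the daemon in the foreground
    ]


def as_shell(argv):
    """Render the argv list as a copy-pastable shell command line."""
    return shlex.join(argv)
```

Calling `as_shell(massif_command(12))` yields a command line to run on the OSD host. Once the OSD has run long enough for the growth to show, the resulting output file can be rendered with valgrind's `ms_print` tool to see which allocation sites hold the heap.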
