Bug #3789

OSD core dump and down OSD on CentOS cluster

Added by Anonymous over 6 years ago. Updated over 6 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 01/11/2013
Due date:
% Done: 0%
Spent time:
Source: Q/A
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Running a CentOS VM cluster on v0.56.1.

I had written a bit of data, and stopped writing about 4pm yesterday. I was running scans to validate the writes that had been done, and left it running overnight.

When I came in this morning, 2 of the 3 nodes had core files, and some of the OSDs were down.

[root@centos1 core]# service ceph -a status
=== mon.a ===
mon.a: running {"version":"0.56.1"}
=== mon.b ===
mon.b: not running.
=== mon.c ===
mon.c: running {"version":"0.56.1"}
=== mds.a ===
mds.a: running {"version":"0.56.1"}
=== osd.0 ===
osd.0: running {"version":"0.56.1"}
=== osd.1 ===
osd.1: running {"version":"0.56.1"}
=== osd.2 ===
osd.2: not running.
=== osd.3 ===
osd.3: not running.
=== osd.4 ===
osd.4: not running.
=== osd.5 ===
osd.5: not running.
=== osd.6 ===
osd.6: running {"version":"0.56.1"}
=== osd.7 ===
osd.7: not running.
=== osd.8 ===
osd.8: not running.

The core files come from the OSD daemons:
centos1: cored at 8:49am on Jan 11
centos2: cored at 8:42am on Jan 11
centos3: cored at 17:28 on Jan 10

[root@centos3 core]# file core.0*
core.0.14160: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 7 --pid-file /var/run/ceph/osd.7.pid -c /etc/ceph/ceph.con'
core.0.14401: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.con'

[root@centos2 core]# file core.0.8304
core.0.8304: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /tmp/ceph.conf.268'

[root@centos1 core]# file cor*
core.0.25741: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.con'
core.0.26177: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'

They have different backtraces so I will open different bugs for each.
Backtrace from one of the core files on centos3:

# gdb /usr/bin/ceph-osd core.0.14401
Core was generated by `/usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0 0x00007faa9c2e13cb in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install ceph-0.56.1-0.el6.x86_64
(gdb) bt
#0 0x00007faa9c2e13cb in raise () from /lib64/libpthread.so.0
#1 0x000000000078c557 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3 <signal handler called>
#4 0x00007faa9afae8a5 in raise () from /lib64/libc.so.6
#5 0x00007faa9afb0085 in abort () from /lib64/libc.so.6
#6 0x00007faa9b866a5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7 0x00007faa9b864be6 in ?? () from /usr/lib64/libstdc++.so.6
#8 0x00007faa9b864c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9 0x00007faa9b864d0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x0000000000837839 in ceph::__ceph_assert_fail (assertion=0x2da4d50 "\001", file=0x7faa8011b230 "\360\255\023\200\252\177", line=3294, func=0x9360c0 "virtual void SyncEntryTimeout::finish(int)") at common/assert.cc:77
#11 0x00000000007313ef in SyncEntryTimeout::finish (this=<value optimized out>, r=<value optimized out>) at os/FileStore.cc:3294
#12 0x000000000084f053 in SafeTimer::timer_thread (this=0x2dc6a68) at common/Timer.cc:105
#13 0x000000000085121d in SafeTimerThread::entry (this=<value optimized out>) at common/Timer.cc:38
#14 0x00007faa9c2d9851 in start_thread () from /lib64/libpthread.so.0
#15 0x00007faa9b06367d in clone () from /lib64/libc.so.6

Testing the log roll on centos3, it appears that the OSD stopped writing its logs around 15:30 on Jan 10, so I have no logs after that time.
Logging stopped on centos1 and centos2 at 17:30 on Jan 10.

I am putting the core files and binaries on burnupi40:/home/ubuntu/centos_troubleshooting

Unfortunately, these are VMs inside the Sunnyvale office, so they are not available for troubleshooting by the LA engineers. But I will gladly do whatever you need to pull info off.

History

#1 Updated by Sage Weil over 6 years ago

  • Status changed from New to Need More Info

Check dmesg, or VM responsiveness. This triggers when a call to sync(2) takes more than... 2 minutes? I forget how long. It's there as a safety for when the kernel or underlying fs is hung.
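
As a rough illustration of the safety mechanism described in this comment, the following Python sketch arms a timer before a blocking call and flags a timeout if the call has not returned when the timer fires. It is not Ceph's code: `SyncWatchdog`, `run_guarded`, and the timeout values are hypothetical. Where this sketch only sets a flag, the backtrace in the description shows the real OSD asserting in `SyncEntryTimeout::finish`.

```python
import threading
import time


class SyncWatchdog:
    """Context manager that calls on_timeout if its body runs longer than timeout_s.

    Hypothetical helper for illustration; names and behaviour are assumptions.
    """

    def __init__(self, timeout_s, on_timeout):
        self._timer = threading.Timer(timeout_s, on_timeout)
        self._timer.daemon = True  # do not keep the process alive for the timer

    def __enter__(self):
        self._timer.start()        # arm the timer before the blocking call
        return self

    def __exit__(self, exc_type, exc, tb):
        self._timer.cancel()       # disarm if the call returned in time
        return False               # never swallow exceptions from the body


def run_guarded(operation, timeout_s):
    """Run operation() under the watchdog; return True if the timeout fired."""
    fired = threading.Event()
    with SyncWatchdog(timeout_s, fired.set):
        operation()
    return fired.is_set()


if __name__ == "__main__":
    # A fast call finishes before the timer; a slow one trips it.
    print("fast call timed out:", run_guarded(lambda: None, 1.0))
    print("slow call timed out:", run_guarded(lambda: time.sleep(0.3), 0.05))
```

The point of the design is that the check runs on a separate timer thread, so it still fires when the guarded call itself is stuck in the kernel.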

#2 Updated by Anonymous over 6 years ago

backtrace of core.0.14401 from centos3:
Core was generated by `/usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0 0x00007faa9c2e13cb in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install ceph-0.56.1-0.el6.x86_64
(gdb) bt
#0 0x00007faa9c2e13cb in raise () from /lib64/libpthread.so.0
#1 0x000000000078c557 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3 <signal handler called>
#4 0x00007faa9afae8a5 in raise () from /lib64/libc.so.6
#5 0x00007faa9afb0085 in abort () from /lib64/libc.so.6
#6 0x00007faa9b866a5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7 0x00007faa9b864be6 in ?? () from /usr/lib64/libstdc++.so.6
#8 0x00007faa9b864c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9 0x00007faa9b864d0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x0000000000837839 in ceph::__ceph_assert_fail (assertion=0x2da4d50 "\001", file=0x7faa8011b230 "\360\255\023\200\252\177", line=3294, func=0x9360c0 "virtual void SyncEntryTimeout::finish(int)") at common/assert.cc:77
#11 0x00000000007313ef in SyncEntryTimeout::finish (this=<value optimized out>, r=<value optimized out>) at os/FileStore.cc:3294
#12 0x000000000084f053 in SafeTimer::timer_thread (this=0x2dc6a68) at common/Timer.cc:105
#13 0x000000000085121d in SafeTimerThread::entry (this=<value optimized out>) at common/Timer.cc:38
#14 0x00007faa9c2d9851 in start_thread () from /lib64/libpthread.so.0
#15 0x00007faa9b06367d in clone () from /lib64/libc.so.6

dmesg output from centos3:

hrtimer: interrupt took 3666286 ns
INFO: task ceph-osd:14160 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14160 1 0x00000080
ffff88001b4fbc98 0000000000000082 0000000000010287 ffff88001b4fbc00
ffff88001b4fbc68 ffffffff810a45a0 ffff88001b4fbca0 ffff8800187da040
ffff8800187da5f8 ffff88001b4fbfd8 000000000000fb88 ffff8800187da5f8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81060a83>] ? wake_up_new_task+0xd3/0x120
[<ffffffff8106a873>] ? do_fork+0x133/0x460
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14163 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14163 1 0x00000080
ffff88001b483c98 0000000000000082 ffffffff81ecb198 ffff88001b4a8080
ffff88001b483c68 ffffffff810a45a0 0000000000000001 ffff88001b483c98
ffff88001b4a8638 ffff88001b483fd8 000000000000fb88 ffff88001b4a8638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff8111673e>] ? generic_file_aio_write+0xbe/0xe0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14166 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14166 1 0x00000080
ffff88001d885c98 0000000000000082 ffff88001d885c38 ffffffff814fe99e
ffff88001d885c68 ffffffff810a45a0 ffff88001d885c68 ffff88001d885c48
ffff88001ae19af8 ffff88001d885fd8 000000000000fb88 ffff88001ae19af8
Call Trace:
[<ffffffff814fe99e>] ? __wait_on_bit+0x7e/0x90
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100bb5c>] retint_signal+0x48/0x8c
INFO: task ceph-osd:14171 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14171 1 0x00000080
ffff88001b4b3c98 0000000000000082 0000000000010287 ffff88001b4b3c00
ffff88001b4b3c68 ffffffff810a45a0 ffffea0000000000 ffff88001b4b3b88
ffff88000bb30638 ffff88001b4b3fd8 000000000000fb88 ffff88000bb30638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81060280>] ? wake_up_state+0x10/0x20
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8100988e>] ? __switch_to+0x26e/0x320
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14417 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14417 1 0x00000080
ffff88001b0ddc98 0000000000000082 ffffffff81ec9370 ffff88001b410080
ffff88001b0ddc68 ffffffff810a45a0 ffff88001b0ddca0 ffff88001b410080
ffff88001b410638 ffff88001b0ddfd8 000000000000fb88 ffff88001b410638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14418 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14418 1 0x00000080
ffff88001b4abc98 0000000000000082 ffffffff81eca040 ffff880010f64ae0
ffff88001b4abc68 ffffffff810a45a0 ffff88001b4abca0 ffff880010f64ae0
ffff880010f65098 ffff88001b4abfd8 000000000000fb88 ffff880010f65098
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14419 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14419 1 0x00000080
ffff88001b515c98 0000000000000082 0000000000010287 ffff88001b515c00
ffff88001b515c68 ffffffff810a45a0 ffff88001b515ca0 ffff88001f2ecaa0
ffff88001f2ed058 ffff88001b515fd8 000000000000fb88 ffff88001f2ed058
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14420 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14420 1 0x00000080
ffff88001b4cdc98 0000000000000082 0000000000010287 ffff88001b4cdc00
ffff88001b4cdc68 ffffffff810a45a0 ffff88001b4cdca0 ffff88001f2ed500
ffff88001f2edab8 ffff88001b4cdfd8 000000000000fb88 ffff88001f2edab8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14421 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14421 1 0x00000080
ffff88001b03bc98 0000000000000082 0000000000010287 ffff88001b03bc00
ffff88001b03bc68 ffffffff810a45a0 ffffea0000141c48 ffffea0000000000
ffff88001b0d7af8 ffff88001b03bfd8 000000000000fb88 ffff88001b0d7af8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff811852f7>] ? pipe_read+0x2a7/0x4e0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8121fd4f>] ? selinux_file_permission+0xbf/0x150
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:14422 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 14422 1 0x00000080
ffff880019e4bc98 0000000000000082 ffffffff81eca900 ffff88001d4a0aa0
ffff880019e4bc68 ffffffff810a45a0 ffff880019e4bca0 ffff88001d4a0aa0
ffff88001d4a1058 ffff880019e4bfd8 000000000000fb88 ffff88001d4a1058
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8100988e>] ? __switch_to+0x26e/0x320
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
[root@centos3 core]#
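
A dmesg dump like the one above is easier to compare across hosts once the hung-task lines are reduced to a list of tasks. The following Python sketch shows one way to do that. The function name `hung_tasks` is hypothetical, the regular expression is an assumption based only on the message format shown here, and the sample is a three-line excerpt from this output.

```python
import re
from collections import Counter

# Pattern for the kernel's hung-task message as it appears in this report.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(dmesg_text):
    """Return (command, pid, seconds) for every hung-task message in the text."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(dmesg_text)
    ]


# Excerpt from the centos3 dmesg output in this update.
sample = """\
INFO: task ceph-osd:14160 blocked for more than 120 seconds.
INFO: task ceph-osd:14163 blocked for more than 120 seconds.
INFO: task ceph-osd:14417 blocked for more than 120 seconds.
"""

tasks = hung_tasks(sample)
per_command = Counter(comm for comm, _, _ in tasks)

if __name__ == "__main__":
    for comm, pid, secs in tasks:
        print(f"{comm} pid={pid} blocked > {secs}s")
    print(dict(per_command))
```

In practice the full output of `dmesg` would be passed to `hung_tasks` in place of the excerpt.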

#3 Updated by Anonymous over 6 years ago

From dmesg, it looks like you are right, Sage: the VMs are low on resources.

[root@centos1 core]# gdb /usr/bin/ceph-osd core.0.26177

Core was generated by `/usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0 0x00007fb42de008a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install ceph-0.56.1-0.el6.x86_64
(gdb) bt
#0 0x00007fb42de008a5 in raise () from /lib64/libc.so.6
#1 0x00007fb42de02085 in abort () from /lib64/libc.so.6
#2 0x00007fb42e6b8971 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3 0x00007fb42e6b6be6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x00007fb42e6b6c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x00007fb42e6b6d0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6 0x0000000000837839 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
#7 0x00000000007c4b4b in ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long) ()
#8 0x00000000007c4ee7 in ceph::HeartbeatMap::is_healthy() ()
#9 0x00000000007c5148 in ceph::HeartbeatMap::check_touch_file() ()
#10 0x000000000084d8ad in CephContextServiceThread::entry() ()
#11 0x00007fb42f12b851 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fb42deb567d in clone () from /lib64/libc.so.6

dmesg:
hrtimer: interrupt took 3717906 ns
INFO: task ceph-osd:26177 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26177 1 0x00000080
ffff88001d039c98 0000000000000082 0000000000010287 ffff88001d039c00
ffff88001d039c68 ffffffff810a45a0 ffff88001d039ca0 ffff880000b9a080
ffff880000b9a638 ffff88001d039fd8 000000000000fb88 ffff880000b9a638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81060a83>] ? wake_up_new_task+0xd3/0x120
[<ffffffff8106a873>] ? do_fork+0x133/0x460
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26180 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26180 1 0x00000080
ffff88001f2d9c98 0000000000000082 0000000000000001 ffff88001ac57cf0
0000000000000000 0000000000000000 ffff88001f2d9c18 ffffffff81060262
ffff88001d0fbab8 ffff88001f2d9fd8 000000000000fb88 ffff88001d0fbab8
Call Trace:
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff8111673e>] ? generic_file_aio_write+0xbe/0xe0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff8117b0e2>] ? vfs_write+0x132/0x1a0
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26185 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26185 1 0x00000080
ffff88001f61fc98 0000000000000082 0000000000010287 ffff88001f61fc00
ffff88001f61fc68 ffffffff810a45a0 ffffea0000000000 ffff88001f61fb88
ffff88001ad39098 ffff88001f61ffd8 000000000000fb88 ffff88001ad39098
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81060280>] ? wake_up_state+0x10/0x20
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26815 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26815 1 0x00000080
ffff88001d56dc98 0000000000000082 0000000000010287 ffff88001d56dc00
ffff88001d56dc68 ffffffff810a45a0 ffff88001d56dca0 ffff88001f465540
ffff88001f465af8 ffff88001d56dfd8 000000000000fb88 ffff88001f465af8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26816 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26816 1 0x00000080
ffff88001f5bbc98 0000000000000082 0000000000010287 ffff88001f5bbc00
ffff88001f5bbc68 ffffffff810a45a0 ffff88001fa885f8 ffff88001f5bbc98
ffff88000c9a4638 ffff88001f5bbfd8 000000000000fb88 ffff88000c9a4638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26817 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26817 1 0x00000080
ffff88001ad0dc98 0000000000000082 0000000000010287 ffff88001ad0dc00
ffff88001ad0dc68 ffffffff810a45a0 ffff88001ad0dca0 ffff88000c9a4ae0
ffff88000c9a5098 ffff88001ad0dfd8 000000000000fb88 ffff88000c9a5098
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26818 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26818 1 0x00000080
ffff88001f7e1c98 0000000000000082 0000000000010287 ffff88001f7e1c00
ffff88001f7e1c68 ffffffff810a45a0 ffff880002216680 ffff88001f7e1c98
ffff880008791ab8 ffff88001f7e1fd8 000000000000fb88 ffff880008791ab8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26819 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26819 1 0x00000080
ffff88001acc9c98 0000000000000082 0000000000010287 ffff88001acc9c00
ffff88001acc9c68 ffffffff810a45a0 ffffea0000142738 ffffea0000000000
ffff880008791058 ffff88001acc9fd8 000000000000fb88 ffff880008791058
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff811852f7>] ? pipe_read+0x2a7/0x4e0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8121fd4f>] ? selinux_file_permission+0xbf/0x150
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26820 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26820 1 0x00000080
ffff88001d0c1c98 0000000000000082 0000000000010287 ffff88001d0c1c00
ffff88001d0c1c68 ffffffff810a45a0 ffff88001d0c1d18 ffff88001d0c1c98
ffff88001d01c638 ffff88001d0c1fd8 000000000000fb88 ffff88001d01c638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff810e0f55>] ? call_rcu_sched+0x15/0x20
[<ffffffff810e0f6e>] ? call_rcu+0xe/0x10
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:26821 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 26821 1 0x00000080
ffff88001acf9c98 0000000000000082 0000000000010287 ffff88001acf9c00
ffff88001acf9c68 ffffffff810a45a0 ffff88001acf9ca0 ffff88001f703500
ffff88001f703ab8 ffff88001acf9fd8 000000000000fb88 ffff88001f703ab8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
[root@centos1 core]#

#4 Updated by Anonymous over 6 years ago

All core files have similar backtraces.
Again, Sage, it looks like you are right: low resources.
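
To put numbers on "low resources", the figures in the dmesg output of this update can be converted from pages to kB. The Python sketch below does that arithmetic. The 4 kB page size is an assumption (the report does not state it), although it is consistent with the kill message: the ceph-mon rss of 62840 pages then equals anon-rss plus file-rss exactly. All variable names are illustrative.

```python
PAGE_KB = 4  # assumption: 4 kB pages; not stated in the report

# Values copied from the Mem-Info and process-table lines in the dmesg output.
ram_pages = 131055        # "131055 pages RAM"
total_swap_kb = 1015800   # "Total swap = 1015800kB"
free_swap_kb = 0          # "Free swap = 0kB"
mon_rss_pages = 62840     # rss column for ceph-mon (pid 8060)
osd_rss_pages = [12390, 12542, 12090]  # rss column for the three ceph-osd rows

ram_kb = ram_pages * PAGE_KB
ram_mib = ram_kb / 1024
swap_used_kb = total_swap_kb - free_swap_kb
mon_rss_kb = mon_rss_pages * PAGE_KB   # 251360 kB
osd_rss_kb = sum(osd_rss_pages) * PAGE_KB

# Cross-check against "anon-rss:250868kB, file-rss:492kB" in the kill message.
kill_message_rss_kb = 250868 + 492

if __name__ == "__main__":
    print(f"RAM: {ram_kb} kB (about {ram_mib:.0f} MiB)")
    print(f"Swap in use: {swap_used_kb} of {total_swap_kb} kB")
    print(f"ceph-mon resident: {mon_rss_kb} kB")
    print(f"ceph-osd resident (3 daemons): {osd_rss_kb} kB")
    print("page size consistent with kill message:", mon_rss_kb == kill_message_rss_kb)
```

Under that assumption the VM has roughly 512 MiB of RAM and all of its swap is in use at the moment the OOM killer runs.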

dmesg:
hrtimer: interrupt took 5259323 ns
INFO: task ceph-osd:5038 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 5038 1 0x00000080
ffff88001cc59d18 0000000000000082 ffff88001cc59c88 ffffffff8116b670
ffff88001cc59c98 ffff8800000116c0 0000000000000000 0000000000000000
ffff8800025d7af8 ffff88001cc59fd8 000000000000fb88 ffff8800025d7af8
Call Trace:
[<ffffffff8116b670>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
[<ffffffff8127865d>] ? rb_insert_color+0x9d/0x160
[<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
[<ffffffff81054a04>] ? check_preempt_wakeup+0x1a4/0x260
[<ffffffff810632c4>] ? enqueue_task_fair+0x64/0x100
[<ffffffff814fe323>] wait_for_common+0x123/0x180
[<ffffffff81060250>] ? default_wake_function+0x0/0x20
[<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
[<ffffffff811a46b8>] sync_inodes_sb+0x88/0x190
[<ffffffff811aa212>] __sync_filesystem+0x82/0x90
[<ffffffff811aa41b>] sync_filesystem+0x4b/0x70
[<ffffffff811aa490>] sys_syncfs+0x50/0x80
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
ceph-osd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
ceph-osd cpuset=/ mems_allowed=0
Pid: 17403, comm: ceph-osd Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
[<ffffffff810c4971>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff811170e0>] ? dump_header+0x90/0x1b0
[<ffffffff812146fc>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81117562>] ? oom_kill_process+0x82/0x2a0
[<ffffffff811174a1>] ? select_bad_process+0xe1/0x120
[<ffffffff811179a0>] ? out_of_memory+0x220/0x3c0
[<ffffffff811276be>] ? __alloc_pages_nodemask+0x89e/0x940
[<ffffffff8115c1da>] ? alloc_pages_current+0xaa/0x110
[<ffffffff811144e7>] ? __page_cache_alloc+0x87/0x90
[<ffffffff8112a10b>] ? __do_page_cache_readahead+0xdb/0x210
[<ffffffff8118fdd0>] ? __pollwait+0x0/0xf0
[<ffffffff8112a261>] ? ra_submit+0x21/0x30
[<ffffffff81115813>] ? filemap_fault+0x4c3/0x500
[<ffffffff8118fec0>] ? pollwake+0x0/0x60
[<ffffffff8113ec14>] ? __do_fault+0x54/0x510
[<ffffffff8113f1c7>] ? handle_pte_fault+0xf7/0xb50
[<ffffffff81060280>] ? wake_up_state+0x10/0x20
[<ffffffff810a3bc0>] ? wake_futex+0x40/0x60
[<ffffffff810a43fe>] ? futex_wake+0x10e/0x120
[<ffffffff8113fe04>] ? handle_mm_fault+0x1e4/0x2b0
[<ffffffff810a6340>] ? do_futex+0x100/0xb00
[<ffffffff81044479>] ? __do_page_fault+0x139/0x480
[<ffffffff8142874b>] ? sys_recvfrom+0x16b/0x180
[<ffffffff8100988e>] ? __switch_to+0x26e/0x320
[<ffffffff81012bd9>] ? read_tsc+0x9/0x20
[<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
[<ffffffff81500625>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 30
active_anon:52525 inactive_anon:52561 isolated_anon:0
active_file:218 inactive_file:321 isolated_file:0
unevictable:0 dirty:0 writeback:1 unstable:0
free:1189 slab_reclaimable:2305 slab_unreclaimable:12054
mapped:206 shmem:0 pagetables:1320 bounce:0
Node 0 DMA free:2044kB min:84kB low:104kB high:124kB active_anon:6640kB inactive_anon:6724kB active_file:24kB inactive_file:108kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15368kB mlocked:0kB dirty:0kB writeback:0kB mapped:60kB shmem:0kB slab_reclaimable:92kB slab_unreclaimable:56kB kernel_stack:0kB pagetables:68kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:275 all_unreclaimable? yes
lowmem_reserve[]: 0 489 489 489
Node 0 DMA32 free:2712kB min:2784kB low:3480kB high:4176kB active_anon:203460kB inactive_anon:203520kB active_file:848kB inactive_file:1176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:500896kB mlocked:0kB dirty:0kB writeback:4kB mapped:764kB shmem:0kB slab_reclaimable:9128kB slab_unreclaimable:48160kB kernel_stack:2448kB pagetables:5212kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:704 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2044kB
Node 0 DMA32: 176*4kB 3*8kB 2*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2712kB
6002 total pagecache pages
5455 pages in swap cache
Swap cache stats: add 1182169, delete 1176714, find 251320/357333
Free swap = 0kB
Total swap = 1015800kB
131055 pages RAM
5413 pages reserved
692 pages shared
120653 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 383] 0 383 2865 0 0 -17 -1000 udevd
[ 920] 0 920 6914 27 0 -17 -1000 auditd
[ 936] 0 936 62796 49 0 0 0 rsyslogd
[ 970] 0 970 16016 0 0 -17 -1000 sshd
[ 1471] 0 1471 19667 34 0 0 0 master
[ 1479] 0 1479 29309 31 0 0 0 crond
[ 1485] 89 1485 19730 18 0 0 0 qmgr
[ 1496] 0 1496 1014 1 0 0 0 mingetty
[ 1498] 0 1498 1014 1 0 0 0 mingetty
[ 1500] 0 1500 1014 1 0 0 0 mingetty
[ 1502] 0 1502 1014 1 0 0 0 mingetty
[ 1507] 0 1507 3095 0 0 -17 -1000 udevd
[ 1508] 0 1508 3095 0 0 -17 -1000 udevd
[ 1509] 0 1509 1014 1 0 0 0 mingetty
[ 1511] 0 1511 1014 1 0 0 0 mingetty
[ 8060] 0 8060 384151 62840 0 0 0 ceph-mon
[ 8174] 0 8174 259649 12390 0 0 0 ceph-osd
[ 8242] 0 8242 252767 12542 0 0 0 ceph-osd
[ 8304] 0 8304 258688 12090 0 0 0 ceph-osd
[17882] 89 17882 19687 215 0 0 0 pickup
Out of memory: Kill process 8060 (ceph-mon) score 644 or sacrifice child
Killed process 8060, UID 0, (ceph-mon) total-vm:1536604kB, anon-rss:250868kB, file-rss:492kB
INFO: task ceph-osd:8304 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8304 1 0x00000080
ffff88001f709c98 0000000000000086 0000000000010287 ffff88001f709c00
ffff88001f709c68 ffffffff810a45a0 ffff88001f709ca0 ffff88001da65540
ffff88001da65af8 ffff88001f709fd8 000000000000fb88 ffff88001da65af8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81099015>] ? sched_clock_local+0x25/0x90
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81060a83>] ? wake_up_new_task+0xd3/0x120
[<ffffffff8106a873>] ? do_fork+0x133/0x460
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8305 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8305 1 0x00000080
ffff88001da8dc98 0000000000000086 ffffffff81ecb5d0 ffff880003ac9500
ffff88001da8dc68 ffffffff810a45a0 ffff88001da8dca0 ffff880003ac9500
ffff880003ac9ab8 ffff88001da8dfd8 000000000000fb88 ffff880003ac9ab8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff8111673e>] ? generic_file_aio_write+0xbe/0xe0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8306 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8306 1 0x00000080
ffff88001d779c98 0000000000000086 ffffffff81eca720 ffff88001d69c080
ffff88001d779c68 ffffffff810a45a0 ffff88001d779ca0 ffff88001d69c080
ffff88001d69c638 ffff88001d779fd8 000000000000fb88 ffff88001d69c638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8100988e>] ? __switch_to+0x26e/0x320
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8307 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8307 1 0x00000080
ffff88001f68fc98 0000000000000086 0000000000010287 ffff88001f68fc00
ffff88001f68fc68 ffffffff810a45a0 ffffea0000000000 ffff88001f68fb88
ffff88001f56daf8 ffff88001f68ffd8 000000000000fb88 ffff88001f56daf8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81060280>] ? wake_up_state+0x10/0x20
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff814fd830>] ? thread_return+0x4e/0x76e
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8373 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8373 1 0x00000080
ffff8800024f1c98 0000000000000086 ffffffff81ec9cf8 ffff88001f5c0080
ffff8800024f1c68 ffffffff810a45a0 ffff88001f64f800 ffff8800024f1c98
ffff88001f5c0638 ffff8800024f1fd8 000000000000fb88 ffff88001f5c0638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8374 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8374 1 0x00000080
ffff88001cc57c98 0000000000000086 0000000000010287 ffff88001cc57c00
ffff88001cc57c68 ffffffff810a45a0 ffff88001cc57ca0 ffff88001cd7f500
ffff88001cd7fab8 ffff88001cc57fd8 000000000000fb88 ffff88001cd7fab8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8375 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8375 1 0x00000080
ffff88001dbf9c98 0000000000000086 ffffffff81ecaa18 ffff88001f43c080
ffff88001dbf9c68 ffffffff810a45a0 ffff88001dbf9ca0 ffff88001f43c080
ffff88001f43c638 ffff88001dbf9fd8 000000000000fb88 ffff88001f43c638
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81060280>] ? wake_up_state+0x10/0x20
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8376 1 0x00000080
ffff88001cc15c98 0000000000000086 ffffffff81ec9488 ffff88001f5c1540
ffff88001cc15c68 ffffffff810a45a0 ffff88001cc15ca0 ffff88001f5c1540
ffff88001f5c1af8 ffff88001cc15fd8 000000000000fb88 ffff88001f5c1af8
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff81191d8f>] ? __d_free+0x3f/0x60
[<ffffffff8119a330>] ? mntput_no_expire+0x30/0x110
[<ffffffff810a6dbb>] ? sys_futex+0x7b/0x170
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17
INFO: task ceph-osd:8377 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd D 0000000000000000 0 8377 1 0x00000080
ffff88001dabbc98 0000000000000086 0000000000010287 ffff88001dabbc00
ffff88001dabbc68 ffffffff810a45a0 ffffea00005e2458 ffffea0000000000
ffff88001dba1098 ffff88001dabbfd8 000000000000fb88 ffff88001dba1098
Call Trace:
[<ffffffff810a45a0>] ? exit_robust_list+0x90/0x160
[<ffffffff81070085>] exit_mm+0x95/0x180
[<ffffffff810704cf>] do_exit+0x15f/0x870
[<ffffffff811852f7>] ? pipe_read+0x2a7/0x4e0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81085866>] get_signal_to_deliver+0x1f6/0x460
[<ffffffff8100a2d5>] do_signal+0x75/0x800
[<ffffffff8121fd4f>] ? selinux_file_permission+0xbf/0x150
[<ffffffff8100aaf0>] do_notify_resume+0x90/0xc0
[<ffffffff8100b3c1>] int_signal+0x12/0x17

# gdb /usr/bin/ceph-osd core.0.8174
Core was generated by `/usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /tmp/ceph.conf.263'.
Program terminated with signal 6, Aborted.
#0 0x00007f2c157503cb in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install ceph-0.56.1-0.el6.x86_64
(gdb) bt
#0 0x00007f2c157503cb in raise () from /lib64/libpthread.so.0
#1 0x000000000078c557 in ?? ()
#2 <signal handler called>
#3 0x00007f2c1441d8a5 in raise () from /lib64/libc.so.6
#4 0x00007f2c1441f085 in abort () from /lib64/libc.so.6
#5 0x00007f2c14cd5a5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#6 0x00007f2c14cd3be6 in ?? () from /usr/lib64/libstdc++.so.6
#7 0x00007f2c14cd3c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#8 0x00007f2c14cd3d0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9 0x0000000000837839 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
#10 0x00000000007c4b4b in ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long) ()
#11 0x00000000007c4ee7 in ceph::HeartbeatMap::is_healthy() ()
#12 0x00000000007c5148 in ceph::HeartbeatMap::check_touch_file() ()
#13 0x000000000084d8ad in CephContextServiceThread::entry() ()
#14 0x00007f2c15748851 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f2c144d267d in clone () from /lib64/libc.so.6

#5 Updated by Anonymous over 6 years ago

Deb Barba wrote:

All core files have a similar backtrace.
Again, Sage, it looks like you are right: low resources.

#6 Updated by Anonymous over 6 years ago

  • Status changed from Need More Info to Won't Fix

dmesg shows it was a lack of resources.

I am upping the memory on these VMs from 512M to 2G.

Since it appears to have been a resource problem, I will close this bug.

Do we have any mechanism that I am missing that notifies the end user when crashes like this occur, so they can go in and fix their cluster before a critical number of resources have failed?

Caused by a lack of resources on the system.
Have increased the memory from 512M to 2G; will retest.
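
The figures in the dmesg output earlier in this thread are consistent with that diagnosis. The following is a minimal sketch of the arithmetic, not part of the original report. All numbers are copied from the Mem-Info block and the task dump; the 4 kB page size is an assumption, although the "Killed process" line agrees with it (total_vm 384151 pages corresponds to total-vm:1536604kB).

```python
# Assumption: 4 kB pages (the usual x86-64 default).
PAGE_KB = 4

# Values copied from the Mem-Info section of the dmesg output.
pages_ram = 131055
total_swap_kb = 1015800
free_swap_kb = 0

# rss column (in pages) for the Ceph daemons in the OOM task dump.
rss_pages = {
    "ceph-mon:8060": 62840,
    "ceph-osd:8174": 12390,
    "ceph-osd:8242": 12542,
    "ceph-osd:8304": 12090,
}

# Physical memory seen by the kernel, in MiB.
ram_mib = pages_ram * PAGE_KB / 1024

# Resident size of the killed monitor, in kB.
mon_rss_kb = rss_pages["ceph-mon:8060"] * PAGE_KB

# Combined resident size of all listed Ceph daemons.
ceph_rss_pages = sum(rss_pages.values())
ceph_share = ceph_rss_pages / pages_ram

swap_used_kb = total_swap_kb - free_swap_kb

print(f"RAM: {ram_mib:.0f} MiB")
print(f"ceph-mon rss: {mon_rss_kb} kB")
print(f"Ceph daemons hold {ceph_share:.0%} of RAM pages")
print(f"Swap in use: {swap_used_kb} of {total_swap_kb} kB")
```

The monitor's resident size computed this way equals the sum of anon-rss and file-rss on the "Killed process 8060" line, and the swap figures show no free swap left when the OOM killer ran.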

#7 Updated by Sage Weil over 6 years ago

There is 'ceph health', and a Nagios plugin that runs it. A similarly trivial plugin can probably be written for other monitoring systems... I'm not sure what else people actually use these days.
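
As an illustration of how small such a plugin can be, here is a hedged sketch of a Nagios-style check around `ceph health`. It is not the existing plugin mentioned above. The function names `classify` and `check_ceph_health` are made up for this example, and it assumes the `ceph` CLI is on the PATH and that its output begins with HEALTH_OK, HEALTH_WARN or HEALTH_ERR. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(health_output):
    """Map the first line of `ceph health` output to a Nagios exit code."""
    lines = health_output.strip().splitlines()
    first = lines[0] if lines else ""
    if first.startswith("HEALTH_OK"):
        return OK
    if first.startswith("HEALTH_WARN"):
        return WARNING
    if first.startswith("HEALTH_ERR"):
        return CRITICAL
    # Anything else (empty output, error text) is reported as UNKNOWN.
    return UNKNOWN


def check_ceph_health(timeout=30):
    """Run `ceph health` and return (exit_code, message).

    Hypothetical helper: assumes the `ceph` command is installed and
    can reach the monitors with the default configuration.
    """
    try:
        result = subprocess.run(
            ["ceph", "health"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, f"could not run ceph health: {exc}"
    output = result.stdout.strip() or result.stderr.strip()
    return classify(output), output
```

A wrapper script would call `check_ceph_health()`, print the returned message, and pass the returned code to `sys.exit()`, so the monitoring system can raise an alert when OSDs or monitors go down instead of someone finding core files the next morning.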
