Bug #3789

closed

OSD core dump and down OSD on CentOS cluster

Added by Anonymous over 11 years ago. Updated over 11 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Source: Q/A

Description

Running a CentOS VM cluster on v0.56.1.

I had written a bit of data, and stopped writing about 4pm yesterday. I was running scans to validate the writes that had been done, and left it running overnight.

When I came in this morning, 2 of the 3 nodes had core files, and some of the OSDs were down.

[root@centos1 core]# service ceph -a status
=== mon.a ===
mon.a: running {"version":"0.56.1"}
=== mon.b ===
mon.b: not running.
=== mon.c ===
mon.c: running {"version":"0.56.1"}
=== mds.a ===
mds.a: running {"version":"0.56.1"}
=== osd.0 ===
osd.0: running {"version":"0.56.1"}
=== osd.1 ===
osd.1: running {"version":"0.56.1"}
=== osd.2 ===
osd.2: not running.
=== osd.3 ===
osd.3: not running.
=== osd.4 ===
osd.4: not running.
=== osd.5 ===
osd.5: not running.
=== osd.6 ===
osd.6: running {"version":"0.56.1"}
=== osd.7 ===
osd.7: not running.
=== osd.8 ===
osd.8: not running.
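
For anyone triaging a similar cluster, the status output above can be reduced to a list of down daemons with a short script. This is only a sketch: the function name `down_daemons` is hypothetical, and it assumes the output keeps the `<name>: running` / `<name>: not running.` wording shown in this ticket.

```python
import re

# Matches "<daemon>: running" or "<daemon>: not running" as printed by
# `service ceph -a status` in this ticket. The "=== osd.N ===" banners are
# skipped because the daemon name there is not followed by ": ".
STATUS_RE = re.compile(r"((?:mon|mds|osd)\.\w+): (running|not running)")


def down_daemons(status_text):
    """Return the names of daemons reported as not running, in output order."""
    return [
        name
        for name, state in STATUS_RE.findall(status_text)
        if state == "not running"
    ]


if __name__ == "__main__":
    import sys

    # Example: service ceph -a status | python down_daemons.py
    for name in down_daemons(sys.stdin.read()):
        print(name)
```

Because the pattern does not depend on line breaks, it should work whether each banner is on its own line or run together with the previous status line.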

The core files come from the OSD daemons.
centos1: cored at 8:49am on Jan 11
centos2: cored at 8:42am on Jan 11
centos3: cored at 17:28 on Jan 10

[root@centos3 core]# file core.0*
core.0.14160: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 7 --pid-file /var/run/ceph/osd.7.pid -c /etc/ceph/ceph.con'
core.0.14401: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.con'

[root@centos2 core]# file core.0.8304
core.0.8304: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /tmp/ceph.conf.268'

[root@centos1 core]# file cor*
core.0.25741: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.con'
core.0.26177: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'
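
To match each core file to the OSD that produced it, the `file` output above can be parsed for the `-i <id>` argument. A minimal sketch, assuming the line layout shown in this ticket; `osd_for_core` is a hypothetical helper name, not part of Ceph.

```python
import re

# "<core name>: ... from '<command line>'" as printed by file(1) above.
CORE_LINE_RE = re.compile(r"^(?P<core>\S+): .*from '(?P<cmd>[^']*)'")
# The OSD id is the numeric argument to "ceph-osd -i".
OSD_ID_RE = re.compile(r"ceph-osd -i (\d+)")


def osd_for_core(line):
    """Return (core_file, osd_id) for one line of `file core.*` output.

    Returns None if the line is not a core file written by ceph-osd.
    """
    line_match = CORE_LINE_RE.match(line.strip())
    if line_match is None:
        return None
    id_match = OSD_ID_RE.search(line_match.group("cmd"))
    if id_match is None:
        return None
    return line_match.group("core"), int(id_match.group(1))
```

Note that `file` truncates the recorded command line, so only the leading arguments (including the OSD id) can be relied on.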

They have different backtraces, so I will open a separate bug for each.
Backtrace from one of the core files on centos3:

    # gdb /usr/bin/ceph-osd core.0.14401
    Core was generated by `/usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.con'.
    Program terminated with signal 6, Aborted.
    #0 0x00007faa9c2e13cb in raise () from /lib64/libpthread.so.0
    Missing separate debuginfos, use: debuginfo-install ceph-0.56.1-0.el6.x86_64
    (gdb) bt
    #0 0x00007faa9c2e13cb in raise () from /lib64/libpthread.so.0
    #1 0x000000000078c557 in reraise_fatal (signum=6) at global/signal_handler.cc:58
    #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104
    #3 <signal handler called>
    #4 0x00007faa9afae8a5 in raise () from /lib64/libc.so.6
    #5 0x00007faa9afb0085 in abort () from /lib64/libc.so.6
    #6 0x00007faa9b866a5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
    #7 0x00007faa9b864be6 in ?? () from /usr/lib64/libstdc++.so.6
    #8 0x00007faa9b864c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
    #9 0x00007faa9b864d0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
    #10 0x0000000000837839 in ceph::__ceph_assert_fail (assertion=0x2da4d50 "\001", file=0x7faa8011b230 "\360\255\023\200\252\177", line=3294, func=0x9360c0 "virtual void SyncEntryTimeout::finish(int)") at common/assert.cc:77
    #11 0x00000000007313ef in SyncEntryTimeout::finish (this=<value optimized out>, r=<value optimized out>) at os/FileStore.cc:3294
    #12 0x000000000084f053 in SafeTimer::timer_thread (this=0x2dc6a68) at common/Timer.cc:105
    #13 0x000000000085121d in SafeTimerThread::entry (this=<value optimized out>) at common/Timer.cc:38
    #14 0x00007faa9c2d9851 in start_thread () from /lib64/libpthread.so.0
    #15 0x00007faa9b06367d in clone () from /lib64/libc.so.6

Testing the log roll on centos3, it appears that the OSD stopped writing its logs around 15:30 on Jan 10, so I have no logs after that time.
Logging stopped on centos1 and centos2 at 17:30 on Jan 10.

Putting the core files and binaries on burnupi40:/home/ubuntu/centos_troubleshooting

Unfortunately, these are VMs inside the Sunnyvale office, so they are not available for troubleshooting by the LA engineers. But I will gladly do whatever you need to pull info off them.
