Bug #21532
osd: Abort in thread_name:tp_osd_tp
Description
2017-09-24T02:15:59.401 INFO:tasks.ceph.osd.3.smithi070.stderr:*** Caught signal (Aborted) **
2017-09-24T02:15:59.465 INFO:tasks.ceph.osd.3.smithi070.stderr: in thread 7f2eec83b700 thread_name:tp_osd_tp
2017-09-24T02:15:59.481 INFO:tasks.ceph.osd.3.smithi070.stderr:daemon-helper: command crashed with signal 6
2017-09-24T02:15:59.507 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:28.091912 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:08.091886)
2017-09-24T02:15:59.559 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:29.092137 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:09.092112)
2017-09-24T02:15:59.571 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:30.092504 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:10.092480)
2017-09-24T02:15:59.596 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:31.092755 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:11.092713)
From: /ceph/teuthology-archive/yuriw-2017-09-23_23:29:32-kcephfs-wip-yuri-testing-2017-09-23-2100-testing-basic-smithi/1663744/teuthology.log
Core: /ceph/teuthology-archive/yuriw-2017-09-23_23:29:32-kcephfs-wip-yuri-testing-2017-09-23-2100-testing-basic-smithi/1663744/remote/smithi070/coredump/1506218658.15242.core
Updated by Greg Farnum over 6 years ago
The log ends 14 minutes prior to the signal, which I imagine is related to #21507.
(gdb) bt
#0  0x00007f2f0aad2611 in __pthread_kill (threadid=&lt;optimized out&gt;, signo=6) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:61
#1  0x000055eef2f81940 in ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long) ()
#2  0x000055eef2f81d82 in ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long) ()
#3  0x000055eef293cfc4 in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#4  0x000055eef2e7b85f in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#5  0x000055eef2e7d7b0 in ShardedThreadPool::WorkThreadSharded::entry() ()
#6  0x00007f2f0aacd184 in start_thread (arg=0x7f2eef040700) at pthread_create.c:312
#7  0x00007f2f09bbd37d in eventfd (count=15242, flags=0) at ../sysdeps/unix/sysv/linux/eventfd.c:42
#8  0x0000000000000000 in ?? ()
Updated by Greg Farnum over 6 years ago
- Status changed from New to Need More Info
[10:22:39] <@sage> it looks like everyone is waiting for log flush.. which is deep in snprintf in the core. can't tell if it has been making progress or not without a log. :/
[10:23:26] <gregsfortytwo> okay, that makes me think the log rotation is broken somehow, see http://tracker.ceph.com/issues/21507
[10:23:31] <@sage> this doesn't look like a regression to me... more likely just slow writes to the fullish log disk?
[10:23:46] <gregsfortytwo> maybe the disk is filling up but the log rotation is erasing that evidence?
[10:23:50] <gregsfortytwo> batrick: ^
[10:24:11] <@sage> or the disk is busy doing log rotation (gzipping?), which makes it busy and the log flush slow, which blocked up everything else?
[10:24:21] <@sage> anyway, this should not block 12.2.1 IMO..
Updated by Sage Weil over 6 years ago
It looks like a zillion threads are blocked at
Thread 4 (Thread 0x7f2f07821700 (LWP 15248)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055eef2e83b63 in ceph::logging::Log::submit_entry (this=0x55eefe0aa000, e=0x55ef0272f080) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:216
which is waiting for the flusher thread, which is doing
Thread 2 (Thread 0x7f2f08823700 (LWP 15246)):
#0  0x00007f2f09b3e58e in __GI__IO_default_xsputn (f=0x7f2f088213f0, data=&lt;optimized out&gt;, n=1) at genops.c:470
#1  0x00007f2f09b32600 in __GI__IO_padn (fp=fp@entry=0x7f2f088213f0, pad=pad@entry=48, count=count@entry=1) at iopadn.c:67
#2  0x00007f2f09b0d6bc in _IO_vfprintf_internal (s=s@entry=0x7f2f088213f0, format=&lt;optimized out&gt;, format@entry=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", ap=ap@entry=0x7f2f08821568) at vfprintf.c:1660
#3  0x00007f2f09bcc165 in ___vsnprintf_chk (s=0x7f2f08821670 "2017-09-24 02:01:", maxlen=&lt;optimized out&gt;, flags=1, slen=&lt;optimized out&gt;, format=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", args=args@entry=0x7f2f08821568) at vsnprintf_chk.c:63
#4  0x00007f2f09bcc0c8 in ___snprintf_chk (s=s@entry=0x7f2f08821670 "2017-09-24 02:01:", maxlen=maxlen@entry=325, flags=flags@entry=1, slen=slen@entry=18446744073709551615, format=format@entry=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld") at snprintf_chk.c:34
#5  0x000055eef2e83f74 in snprintf (__fmt=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", __n=325, __s=&lt;optimized out&gt;) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:65
#6  sprintf (outlen=325, out=&lt;optimized out&gt;, this=0x55ef04247380) at /build/ceph-12.2.0-294-gfa7a3c7/src/include/utime.h:312
#7  ceph::logging::Log::_flush (this=this@entry=0x55eefe0aa000, t=t@entry=0x7f2f088218d0, requeue=requeue@entry=0x55eefe0aa110, crash=crash@entry=false) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:314
#8  0x000055eef2e844f9 in ceph::logging::Log::flush (this=this@entry=0x55eefe0aa000) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:274
#9  0x000055eef2e8474e in ceph::logging::Log::entry (this=0x55eefe0aa000) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:448
#10 0x00007f2f0aacd184 in start_thread (arg=0x7f2f08823700) at pthread_create.c:312
#11 0x00007f2f09bbd37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
can't really tell from that whether that thread is busy or stuck.
My guess is busy, but the logging disk is going really slow?
Updated by Sage Weil over 6 years ago
...although even a slow disk shouldn't take long enough for the heartbeat to time out. :/