Bug #21532

osd: Abort in thread_name:tp_osd_tp

Added by Patrick Donnelly over 6 years ago. Updated over 6 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: kcephfs
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-09-24T02:15:59.401 INFO:tasks.ceph.osd.3.smithi070.stderr:*** Caught signal (Aborted) **
2017-09-24T02:15:59.465 INFO:tasks.ceph.osd.3.smithi070.stderr: in thread 7f2eec83b700 thread_name:tp_osd_tp
2017-09-24T02:15:59.481 INFO:tasks.ceph.osd.3.smithi070.stderr:daemon-helper: command crashed with signal 6
2017-09-24T02:15:59.507 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:28.091912 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:08.091886)
2017-09-24T02:15:59.559 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:29.092137 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:09.092112)
2017-09-24T02:15:59.571 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:30.092504 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:10.092480)
2017-09-24T02:15:59.596 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:31.092755 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:11.092713)

From: /ceph/teuthology-archive/yuriw-2017-09-23_23:29:32-kcephfs-wip-yuri-testing-2017-09-23-2100-testing-basic-smithi/1663744/teuthology.log

Core: /ceph/teuthology-archive/yuriw-2017-09-23_23:29:32-kcephfs-wip-yuri-testing-2017-09-23-2100-testing-basic-smithi/1663744/remote/smithi070/coredump/1506218658.15242.core
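For context, the heartbeat_check lines above are osd.2 reporting that it has had no ping reply from osd.3 on either the back or the front interface since 02:01:05, while the cutoff (the log timestamp minus the heartbeat grace period, 20 seconds here) keeps advancing past that last reply. A minimal sketch of that staleness comparison, with hypothetical names rather than the actual OSD::heartbeat_check code:

// Hypothetical sketch of the peer-heartbeat check implied by the
// "heartbeat_check: no reply ..." lines above; not the real OSD code.
#include <chrono>

struct PeerHeartbeat {
  std::chrono::system_clock::time_point last_rx_back;   // last reply on back network
  std::chrono::system_clock::time_point last_rx_front;  // last reply on front network
};

bool peer_is_healthy(const PeerHeartbeat& p,
                     std::chrono::system_clock::time_point now,
                     std::chrono::seconds grace) {   // ~20s, judging by the log above
  auto cutoff = now - grace;                         // replies older than this are stale
  return p.last_rx_back >= cutoff && p.last_rx_front >= cutoff;
}

The 20-second gap between the log timestamps (02:01:28 onward) and the cutoffs (02:01:08 onward) is just this grace window; the interesting question is why osd.3 stopped replying.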

Actions #1

Updated by Greg Farnum over 6 years ago

The log ends 14 minutes prior to the signal, which I imagine is related to #21507.

(gdb) bt
#0  0x00007f2f0aad2611 in __pthread_kill (threadid=<optimized out>, signo=6) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:61
#1  0x000055eef2f81940 in ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long) ()
#2  0x000055eef2f81d82 in ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long) ()
#3  0x000055eef293cfc4 in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#4  0x000055eef2e7b85f in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#5  0x000055eef2e7d7b0 in ShardedThreadPool::WorkThreadSharded::entry() ()
#6  0x00007f2f0aacd184 in start_thread (arg=0x7f2eef040700) at pthread_create.c:312
#7  0x00007f2f09bbd37d in eventfd (count=15242, flags=0) at ../sysdeps/unix/sysv/linux/eventfd.c:42
#8  0x0000000000000000 in ?? ()
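The abort itself comes from the OSD's internal heartbeat map rather than from the peer pings in the description: each tp_osd_tp worker refreshes a per-thread handle through reset_timeout(), and the _check() in frame #1 aborts the daemon when the thread has gone past its hard ("suicide") deadline. A rough sketch of that check, with simplified names that are illustrative rather than copied from src/common/HeartbeatMap.cc:

// Simplified sketch of the per-thread watchdog check seen in frames #1-#3;
// field names and behaviour are approximations, not the real implementation.
#include <cassert>
#include <ctime>

struct heartbeat_handle {
  time_t timeout;          // soft deadline: worth a warning if exceeded
  time_t suicide_timeout;  // hard deadline: abort the daemon if exceeded
};

void check_handle(const heartbeat_handle& h, time_t now) {
  if (h.timeout && now > h.timeout) {
    // soft case: the daemon keeps running, but the thread is reported as timed out
  }
  if (h.suicide_timeout && now > h.suicide_timeout) {
    assert(!"hit suicide timeout");  // SIGABRT, matching "Caught signal (Aborted)" above
  }
}

Since the check fires against the handle's old deadlines, the worker apparently did run again (it is inside reset_timeout()), but only after being stuck for longer than its suicide grace.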
Actions #2

Updated by Greg Farnum over 6 years ago

  • Status changed from New to Need More Info

[10:22:39] <@sage> it looks like everyone is waiting for log flush.. which is deep in snprintf in the core. can't tell if it has been making progress or not without a log. :/
[10:23:26] <gregsfortytwo> okay, that makes me think the log rotation is broken somehow, see http://tracker.ceph.com/issues/21507
[10:23:31] <@sage> this doesn't look like a regression to me... more likely just slow writes to the fullish log disk?
[10:23:46] <gregsfortytwo> maybe the disk is filling up but the log rotation is erasing that evidence?
[10:23:50] <gregsfortytwo> batrick: ^
[10:24:11] <@sage> or the disk is busy doing log rotation (gzipping?), which makes it busy and the log flush slow, which blocked up everything else?
[10:24:21] <@sage> anyway, this should not block 12.2.1 IMO..

Actions #3

Updated by Greg Farnum over 6 years ago

  • Priority changed from Urgent to Normal
Actions #4

Updated by Sage Weil over 6 years ago

It looks like a zillion threads are blocked at

Thread 4 (Thread 0x7f2f07821700 (LWP 15248)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055eef2e83b63 in ceph::logging::Log::submit_entry (this=0x55eefe0aa000, e=0x55ef0272f080) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:216

which are waiting for the flusher thread, which is doing
Thread 2 (Thread 0x7f2f08823700 (LWP 15246)):
#0  0x00007f2f09b3e58e in __GI__IO_default_xsputn (f=0x7f2f088213f0, data=<optimized out>, n=1) at genops.c:470
#1  0x00007f2f09b32600 in __GI__IO_padn (fp=fp@entry=0x7f2f088213f0, pad=pad@entry=48, count=count@entry=1) at iopadn.c:67
#2  0x00007f2f09b0d6bc in _IO_vfprintf_internal (s=s@entry=0x7f2f088213f0, format=<optimized out>, format@entry=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", ap=ap@entry=0x7f2f08821568) at vfprintf.c:1660
#3  0x00007f2f09bcc165 in ___vsnprintf_chk (s=0x7f2f08821670 "2017-09-24 02:01:", maxlen=<optimized out>, flags=1, slen=<optimized out>, format=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", args=args@entry=0x7f2f08821568) at vsnprintf_chk.c:63
#4  0x00007f2f09bcc0c8 in ___snprintf_chk (s=s@entry=0x7f2f08821670 "2017-09-24 02:01:", maxlen=maxlen@entry=325, flags=flags@entry=1, slen=slen@entry=18446744073709551615, format=format@entry=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld") at snprintf_chk.c:34
#5  0x000055eef2e83f74 in snprintf (__fmt=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", __n=325, __s=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:65
#6  sprintf (outlen=325, out=<optimized out>, this=0x55ef04247380) at /build/ceph-12.2.0-294-gfa7a3c7/src/include/utime.h:312
#7  ceph::logging::Log::_flush (this=this@entry=0x55eefe0aa000, t=t@entry=0x7f2f088218d0, requeue=requeue@entry=0x55eefe0aa110, crash=crash@entry=false) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:314
#8  0x000055eef2e844f9 in ceph::logging::Log::flush (this=this@entry=0x55eefe0aa000) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:274
#9  0x000055eef2e8474e in ceph::logging::Log::entry (this=0x55eefe0aa000) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:448
#10 0x00007f2f0aacd184 in start_thread (arg=0x7f2f08823700) at pthread_create.c:312
#11 0x00007f2f09bbd37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Can't really tell from that whether that thread is busy or stuck.

My guess is busy, but the logging disk is going really slow?
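Taking this together with the IRC discussion in #2, the shape in the backtraces is: a single flusher thread formats timestamps with snprintf and writes them to the log file, while every other thread that calls submit_entry() waits on a condition variable once the pending queue is full. A simplified illustration of that producer/flusher pattern, not the actual src/log/Log.cc (the queue cap and names are invented):

// Illustrative only: a bounded log queue where producers block behind a
// single flusher, mirroring the Thread 4 / Thread 2 pair in the backtraces.
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>

class TinyLog {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::string> q;
  static constexpr size_t max_pending = 100;  // invented cap

public:
  void submit_entry(std::string e) {
    std::unique_lock<std::mutex> l(m);
    // Producers park here when the flusher falls behind -- the analogue of
    // Thread 4 blocked in pthread_cond_wait via Log::submit_entry.
    cv.wait(l, [&] { return q.size() < max_pending; });
    q.push_back(std::move(e));
  }

  // Called in a loop by a dedicated flusher thread (the analogue of Thread 2).
  void flush(std::FILE* out) {
    std::deque<std::string> batch;
    {
      std::lock_guard<std::mutex> l(m);
      batch.swap(q);
    }
    cv.notify_all();  // unblock waiting producers
    for (const auto& e : batch)
      std::fprintf(out, "%s\n", e.c_str());  // slow/busy log disk => slow flusher
  }
};

If the flusher stalls on a slow or busy log disk, an op worker stuck in submit_entry() never gets back to reset_timeout(); one plausible reading, consistent with the comments above, is that such a stall lasted longer than the worker's suicide grace, so the next reset_timeout() tripped the abort in comment #1.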

Actions #5

Updated by Sage Weil over 6 years ago

...although even a slow disk shouldn't stall things long enough for the heartbeat to time out. :/
