Bug #21532
osd: Abort in thread_name:tp_osd_tp
Description
2017-09-24T02:15:59.401 INFO:tasks.ceph.osd.3.smithi070.stderr:*** Caught signal (Aborted) **
2017-09-24T02:15:59.465 INFO:tasks.ceph.osd.3.smithi070.stderr: in thread 7f2eec83b700 thread_name:tp_osd_tp
2017-09-24T02:15:59.481 INFO:tasks.ceph.osd.3.smithi070.stderr:daemon-helper: command crashed with signal 6
2017-09-24T02:15:59.507 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:28.091912 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:08.091886)
2017-09-24T02:15:59.559 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:29.092137 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:09.092112)
2017-09-24T02:15:59.571 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:30.092504 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:10.092480)
2017-09-24T02:15:59.596 INFO:tasks.ceph.osd.2.smithi070.stderr:2017-09-24 02:01:31.092755 7fe6a6d8e700 -1 osd.2 68 heartbeat_check: no reply from 172.21.15.70:6807 osd.3 since back 2017-09-24 02:01:05.370066 front 2017-09-24 02:01:05.370066 (cutoff 2017-09-24 02:01:11.092713)
From: /ceph/teuthology-archive/yuriw-2017-09-23_23:29:32-kcephfs-wip-yuri-testing-2017-09-23-2100-testing-basic-smithi/1663744/teuthology.log
Core: /ceph/teuthology-archive/yuriw-2017-09-23_23:29:32-kcephfs-wip-yuri-testing-2017-09-23-2100-testing-basic-smithi/1663744/remote/smithi070/coredump/1506218658.15242.core
Updated by Greg Farnum over 6 years ago
The log ends 14 minutes prior to the signal, which I imagine is related to #21507.
(gdb) bt
#0  0x00007f2f0aad2611 in __pthread_kill (threadid=&lt;optimized out&gt;, signo=6) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:61
#1  0x000055eef2f81940 in ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long) ()
#2  0x000055eef2f81d82 in ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long) ()
#3  0x000055eef293cfc4 in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#4  0x000055eef2e7b85f in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#5  0x000055eef2e7d7b0 in ShardedThreadPool::WorkThreadSharded::entry() ()
#6  0x00007f2f0aacd184 in start_thread (arg=0x7f2eef040700) at pthread_create.c:312
#7  0x00007f2f09bbd37d in eventfd (count=15242, flags=0) at ../sysdeps/unix/sysv/linux/eventfd.c:42
#8  0x0000000000000000 in ?? ()
Updated by Greg Farnum over 6 years ago
- Status changed from New to Need More Info
[10:22:39] <@sage> it looks like everyone is waiting for log flush.. which is deep in snprintf in the core. can't tell if it has been making progress or not without a log. :/
[10:23:26] <gregsfortytwo> okay, that makes me think the log rotation is broken somehow, see http://tracker.ceph.com/issues/21507
[10:23:31] <@sage> this doesn't look like a regression to me... more likely just slow writes to the fullish log disk?
[10:23:46] <gregsfortytwo> maybe the disk is filling up but the log rotation is erasing that evidence?
[10:23:50] <gregsfortytwo> batrick: ^
[10:24:11] <@sage> or the disk is busy doing log rotation (gzipping?), which makes it busy and the log flush slow, which blocked up everything else?
[10:24:21] <@sage> anyway, this should not block 12.2.1 IMO..
Updated by Sage Weil over 6 years ago
It looks like a zillion threads are blocked at
Thread 4 (Thread 0x7f2f07821700 (LWP 15248)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055eef2e83b63 in ceph::logging::Log::submit_entry (this=0x55eefe0aa000, e=0x55ef0272f080) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:216
which is waiting for the flusher thread, which is doing
Thread 2 (Thread 0x7f2f08823700 (LWP 15246)):
#0  0x00007f2f09b3e58e in __GI__IO_default_xsputn (f=0x7f2f088213f0, data=&lt;optimized out&gt;, n=1) at genops.c:470
#1  0x00007f2f09b32600 in __GI__IO_padn (fp=fp@entry=0x7f2f088213f0, pad=pad@entry=48, count=count@entry=1) at iopadn.c:67
#2  0x00007f2f09b0d6bc in _IO_vfprintf_internal (s=s@entry=0x7f2f088213f0, format=&lt;optimized out&gt;, format@entry=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", ap=ap@entry=0x7f2f08821568) at vfprintf.c:1660
#3  0x00007f2f09bcc165 in ___vsnprintf_chk (s=0x7f2f08821670 "2017-09-24 02:01:", maxlen=&lt;optimized out&gt;, flags=1, slen=&lt;optimized out&gt;, format=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", args=args@entry=0x7f2f08821568) at vsnprintf_chk.c:63
#4  0x00007f2f09bcc0c8 in ___snprintf_chk (s=s@entry=0x7f2f08821670 "2017-09-24 02:01:", maxlen=maxlen@entry=325, flags=flags@entry=1, slen=slen@entry=18446744073709551615, format=format@entry=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld") at snprintf_chk.c:34
#5  0x000055eef2e83f74 in snprintf (__fmt=0x55eef33c49f8 "%04d-%02d-%02d %02d:%02d:%02d.%06ld", __n=325, __s=&lt;optimized out&gt;) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:65
#6  sprintf (outlen=325, out=&lt;optimized out&gt;, this=0x55ef04247380) at /build/ceph-12.2.0-294-gfa7a3c7/src/include/utime.h:312
#7  ceph::logging::Log::_flush (this=this@entry=0x55eefe0aa000, t=t@entry=0x7f2f088218d0, requeue=requeue@entry=0x55eefe0aa110, crash=crash@entry=false) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:314
#8  0x000055eef2e844f9 in ceph::logging::Log::flush (this=this@entry=0x55eefe0aa000) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:274
#9  0x000055eef2e8474e in ceph::logging::Log::entry (this=0x55eefe0aa000) at /build/ceph-12.2.0-294-gfa7a3c7/src/log/Log.cc:448
#10 0x00007f2f0aacd184 in start_thread (arg=0x7f2f08823700) at pthread_create.c:312
#11 0x00007f2f09bbd37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
can't really tell from that whether that thread is busy or stuck.
My guess is busy, but the logging disk is going really slow?
Updated by Sage Weil over 6 years ago
...although even a slow disk shouldn't take long enough for the heartbeat to time out. :/