Bug #20222

closed

v12.0.3 Luminous bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

Added by Nokia ceph-users almost 7 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Env: 5 nodes, v12.0.3, EC 4+1 bluestore, RHEL 7.3 (kernel 3.10.0-514.el7.x86_64)

Problem description: bluestore OSDs go down on 12.0.3 when the cluster is under moderate (200 MB/s) load.

OSD logs below, captured after enabling debug_bluestore = 20:

0xa389a531/0xa70e0000, compress 0x0/0x0/0x0)
2017-05-30 08:57:15.707908 7f9d6a71f700 20 bluestore(/var/lib/ceph/osd/ceph-19) statfs store_statfs(0x574506f0000/0x5753b990000, stored 0xa389a531/0xa70e0000, compress 0x0/0x0/0x0)
2017-05-30 08:57:17.410365 7f9d6a71f700 20 bluestore(/var/lib/ceph/osd/ceph-19) statfs store_statfs(0x574506f0000/0x5753b990000, stored 0xa389a531/0xa70e0000, compress 0x0/0x0/0x0)
2017-05-30 08:57:18.512773 7f9d6a71f700 20 bluestore(/var/lib/ceph/osd/ceph-19) statfs store_statfs(0x574506f0000/0x5753b990000, stored 0xa389a531/0xa70e0000, compress 0x0/0x0/0x0)
2017-05-30 08:57:20.005812 7f9d8f712700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60
2017-05-30 08:57:20.006648 7f9d8ff13700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60
2017-05-30 08:57:20.015585 7f9d8f712700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60
2017-05-30 08:57:20.015677 7f9d8f712700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60

..<snip>..

2017-05-30 08:57:37.506457 7f9d875fd700 20 bluestore(/var/lib/ceph/osd/ceph-19) _do_write perform garbage collection, expected benefit = 0 AUs
2017-05-30 08:57:37.506459 7f9d875fd700 20 bluestore.extentmap(0x7f9da0ed50b0) dirty_range mark inline shard dirty
2017-05-30 08:57:37.506461 7f9d875fd700 10 bluestore(/var/lib/ceph/osd/ceph-19) _write meta #-1:7b3f43c4:::osd_superblock:0# 0x0~1f8 = 0
2017-05-30 08:57:37.506463 7f9d875fd700 10 bluestore(/var/lib/ceph/osd/ceph-19) _txc_calc_cost 0x7f9da2d22680 cost 3104677 (2 ios * 1500000 + 104677 bytes)
2017-05-30 08:57:37.506479 7f9d875fd700 20 bluestore(/var/lib/ceph/osd/ceph-19) _txc_write_nodes txc 0x7f9da2d22680 onodes 0x7f9da0ed4fc0,0x7f9da5bd8000,0x7f9da72dad80 shared_blobs
2017-05-30 08:57:37.506482 7f9d875fd700 20 bluestore.extentmap(0x7f9da0ed50b0) update #-1:7b3f43c4:::osd_superblock:0#
2017-05-30 08:57:37.506485 7f9d875fd700 20 bluestore.extentmap(0x7f9da0ed50b0) update inline shard 81 bytes from 1 extents
2017-05-30 08:57:37.506491 7f9d875fd700 20 bluestore(/var/lib/ceph/osd/ceph-19) onode #-1:7b3f43c4:::osd_superblock:0# is 108 (21 bytes onode + 2 bytes spanning blobs + 85 bytes inline extents)
2017-05-30 08:57:37.506495 7f9d875fd700 20 bluestore.extentmap(0x7f9da5bd80f0) update #-1:a51ca935:::osdmap.848:0#
2017-05-30 08:57:37.506497 7f9d875fd700 20 bluestore.extentmap(0x7f9da5bd80f0) update inline shard 235 bytes from 2 extents
2017-05-30 08:57:37.506499 7f9d875fd700 20 bluestore(/var/lib/ceph/osd/ceph-19) onode #-1:a51ca935:::osdmap.848:0# is 264 (23 bytes onode + 2 bytes spanning blobs + 239 bytes inline extents)
2017-05-30 08:57:37.506502 7f9d875fd700 20 bluestore.extentmap(0x7f9da72dae70) update #-1:48acce63:::inc_osdmap.848:0#
2017-05-30 08:57:37.506504 7f9d875fd700 20 bluestore.extentmap(0x7f9da72dae70) update inline shard 82 bytes from 1 extents
2017-05-30 08:57:37.506506 7f9d875fd700 20 bluestore(/var/lib/ceph/osd/ceph-19) onode #-1:48acce63:::inc_osdmap.848:0# is 110 (22 bytes onode + 2 bytes spanning blobs + 86 bytes inline extents)
2017-05-30 08:57:37.506558 7f9d875fd700 20 bluestore(/var/lib/ceph/osd/ceph-19) _txc_finalize_kv txc 0x7f9da2d22680 allocated 0x[388afa0000~30000] released 0x[]
2017-05-30 08:57:37.506562 7f9d875fd700 10 freelist allocate 0x388afa0000~30000
2017-05-30 08:57:37.506563 7f9d875fd700 20 freelist _xor first_key 0x388af80000 last_key 0x388af80000
2017-05-30 08:57:37.510446 7f9d8ff13700 0 -- 10.50.63.152:6857/1940715 >> 10.50.63.153:6869/1572972 conn(0x7f9da698a800 :6857 s=STATE_OPEN pgs=243 cs=1 l=0).fault with nothing to send, going to standby
2017-05-30 08:57:37.514366 7f9d8ff13700 0 -- 10.50.63.152:6857/1940715 >> 10.50.63.153:6863/1572982 conn(0x7f9da620f800 :6857 s=STATE_OPEN pgs=149 cs=1 l=0).fault with nothing to send, going to standby
2017-05-30 08:57:37.517351 7f9d8ef11700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60
2017-05-30 08:57:37.517354 7f9d8ef11700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60

Config value: threadpool_default_timeout = 60

// default timeout while calling WaitInterval on an empty queue
OPTION
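The mechanism behind these messages can be sketched roughly as follows. This is an illustrative Python model, not Ceph's actual HeartbeatMap code: each worker thread "touches" a per-thread timestamp whenever it makes progress, and a separate health check flags any thread whose last touch is older than the grace period.

```python
import time

class HeartbeatMap:
    """Toy model of a per-thread heartbeat watchdog (illustrative only)."""

    def __init__(self, grace):
        self.grace = grace      # seconds a thread may go without progress
        self.last_touch = {}    # thread name -> timestamp of last progress

    def touch(self, name, now=None):
        # Workers call this whenever they make progress (e.g. per work item).
        self.last_touch[name] = time.time() if now is None else now

    def is_healthy(self, name, now=None):
        # False once a thread has been silent longer than the grace period,
        # which is when Ceph would log "... had timed out after <grace>".
        now = time.time() if now is None else now
        return now - self.last_touch[name] <= self.grace

hb = HeartbeatMap(grace=60)
hb.touch("tp_osd_tp", now=0.0)
print(hb.is_healthy("tp_osd_tp", now=30.0))   # True: still within grace
print(hb.is_healthy("tp_osd_tp", now=90.0))   # False: stalled past 60 s
```

The point is that the watchdog only observes progress; it cannot tell whether the thread is blocked on a slow disk or stuck in a software deadlock, which is why raising the timeout merely delays the message.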

I raised this value to 300; the OSD then served I/O for a while before hitting the same timeout:
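For reference, a sketch of how the change was applied in ceph.conf (the [osd] section placement is an assumption; the option could also go under [global], and the OSDs need a restart for it to take effect):

```ini
[osd]
# raise the thread-pool heartbeat grace from the default 60 s
threadpool_default_timeout = 300
```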

2017-05-30 09:44:32.027168 7efc1f2b9700 20 bluestore(/var/lib/ceph/osd/ceph-41) statfs store_statfs(0x574f2860000/0x5753b990000, stored 0x2e65ed1/0x6730000, compress 0x0/0x0/0x0)
2017-05-30 09:44:35.529164 7efc1f2b9700 20 bluestore(/var/lib/ceph/osd/ceph-41) statfs store_statfs(0x574f2860000/0x5753b990000, stored 0x2e65ed1/0x6730000, compress 0x0/0x0/0x0)
2017-05-30 09:44:40.831130 7efc1f2b9700 20 bluestore(/var/lib/ceph/osd/ceph-41) statfs store_statfs(0x574f2860000/0x5753b990000, stored 0x2e65ed1/0x6730000, compress 0x0/0x0/0x0)
2017-05-30 09:44:41.634056 7efc3fafa700 -1 osd.41 878 heartbeat_check: no reply from 10.50.62.152:6853 osd.30 since back 2017-05-30 09:44:21.521517 front 2017-05-30 09:44:21.521517 (cutoff 2017-05-30 09:44:21.634053)
2017-05-30 09:44:42.128280 7efc442ac700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 300
2017-05-30 09:44:42.128282 7efc442ac700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 300
2017-05-30 09:44:42.128284 7efc442ac700 1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 300

Ref - [ceph-users] Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60 (see the thread for more details)

Findings
========

Disk I/O is not backing up (the device below is mostly idle), so we suspect a bluestore bug rather than a slow disk. In the output, the duplicated KBytes/Merged/IOs/Size/Wait column groups are reads first, then writes.

awk 'NR==5 || /sdo/' diskstat-PL6-CN4.txt
<snip>
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
14:17:12 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14:17:13 sdo 0 0 0 0 0 220 47 4 55 12 55 1 12 7 2
14:17:14 sdo 0 0 0 0 0 56 67 10 6 17 5 1 17 10 10
14:17:15 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14:17:16 sdo 0 0 0 0 0 740 102 19 39 25 38 5 25 4 8
14:17:17 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14:17:18 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
<snip>
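The idle pattern above can be checked mechanically. A small Python sketch (column layout assumed from the header line: timestamp, device name, then the numeric counters) that counts sampling intervals where sdo did no I/O at all:

```python
# Count sampling intervals where the disk did no I/O at all, supporting
# the claim that the device is idle rather than saturated.
rows = """\
14:17:12 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14:17:13 sdo 0 0 0 0 0 220 47 4 55 12 55 1 12 7 2
14:17:14 sdo 0 0 0 0 0 56 67 10 6 17 5 1 17 10 10
14:17:15 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14:17:16 sdo 0 0 0 0 0 740 102 19 39 25 38 5 25 4 8
14:17:17 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14:17:18 sdo 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
""".splitlines()

# A row is "idle" when every counter after the timestamp and name is zero.
idle = [r.split()[0] for r in rows
        if all(int(v) == 0 for v in r.split()[2:])]
print(len(idle), idle)  # 4 idle seconds out of 7 samples
```

Four of the seven one-second samples show no I/O at all, and the busy samples are tiny, which is inconsistent with a device that is too overloaded to keep the OSD work queue moving.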

Thanks


Files

osd.12.ops (136 KB) osd.12.ops Output of 'ceph daemon osd.12 ops' Thomas Coelho, 06/28/2017 12:30 PM
ops_14-osd.zip (516 KB) ops_14-osd.zip osd-14 ops report Nokia ceph-users, 09/10/2017 04:39 PM

Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #21171: bluestore: aio submission deadlock (Resolved, Sage Weil, 08/29/2017)
