Bug #47974

Slow requests due to unhealthy heartbeat - 'OSD::osd_op_tp thread 0x7f7f85903700' had timed out after 15s

Added by Sridhar Seshasayee over 3 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Slow requests observed due to unhealthy heartbeats on osd.2.

/a/sseshasa-2020-10-23_12:25:57-rados-wip-sseshasa-testing-2020-10-22-1249-distro-basic-smithi/5551861
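For context, the 'heartbeat_map is_healthy ... had timed out after 15s' messages in the log come from the OSD's internal health check on its op worker threads: a worker that stays busy on a single piece of work longer than the grace period (15s here) is reported as timed out, and while any worker is in that state the OSD treats its internal heartbeat as unhealthy and drops incoming ping requests. The sketch below only illustrates that general mechanism; it is not the actual Ceph HeartbeatMap code, and all names in it are hypothetical.

// Minimal illustrative sketch of the kind of check behind the
// "heartbeat_map is_healthy ... had timed out after 15s" lines: every op worker
// records when it started its current piece of work, and a health check flags
// any worker whose work has been running longer than a grace period.
// NOTE: this is not the real Ceph HeartbeatMap implementation; the names are
// hypothetical and only meant to show the mechanism.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;

struct WorkerHandle {
  const char* name;
  std::atomic<Clock::rep> started_at;  // 0 means "idle"
  explicit WorkerHandle(const char* n) : name(n), started_at(0) {}
};

// Called by a worker thread when it picks up a new unit of work.
void start_work(WorkerHandle& h) {
  h.started_at.store(Clock::now().time_since_epoch().count());
}

// Health check run from another thread; mirrors the spirit of is_healthy().
bool is_healthy(const std::vector<WorkerHandle*>& workers,
                std::chrono::seconds grace) {
  bool healthy = true;
  const auto now = Clock::now();
  for (const auto* h : workers) {
    const auto started = h->started_at.load();
    if (started == 0)
      continue;  // idle worker, nothing to time out
    const Clock::time_point start{Clock::duration(started)};
    if (now - start > grace) {
      std::printf("heartbeat_map is_healthy '%s' had timed out after %llds\n",
                  h->name, static_cast<long long>(grace.count()));
      healthy = false;
    }
  }
  return healthy;
}

int main() {
  WorkerHandle osd_op_tp("OSD::osd_op_tp thread");
  std::vector<WorkerHandle*> workers{&osd_op_tp};

  // Simulate a worker that started 30s ago and is still stuck, e.g. waiting on
  // a very slow bluestore transaction like the 29.9s submit_transact below.
  osd_op_tp.started_at.store(
      (Clock::now() - std::chrono::seconds(30)).time_since_epoch().count());

  // If the check fails, the OSD considers its internal heartbeat unhealthy and
  // drops incoming ping requests, which is what leads to the slow-request report.
  if (!is_healthy(workers, std::chrono::seconds(15)))
    std::printf("internal heartbeat not healthy, dropping ping request\n");
  return 0;
}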

The following is the sequence of events for the op for which the slow request was reported (a quick timing cross-check follows after step 5):

1. Op enqueued
2020-10-24T00:11:26.191+0000 7f7fa8a2f700 15 osd.2 30 enqueue_op 0x55655bc81a20 prio 63 type 42 cost 4194304 latency 0.000038 epoch 30 osd_op(client.4712.0:226 5.d 5.551df92d (undecoded) ondisk+write+known_if_redirected e30) v8
2020-10-24T00:11:26.191+0000 7f7fa8a2f700 20 osd.2 op_wq(5) _enqueue OpSchedulerItem(5.d PGOpItem(op=osd_op(client.4712.0:226 5.d 5.551df92d (undecoded) ondisk+write+known_if_redirected e30) v8) prio 63 cost 4194304 e30)

2. Many connection-related messages appear, along with the unhealthy heartbeat timeout messages,
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f85903700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f86104700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f86905700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f87106700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f87907700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f88909700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8990b700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8a10c700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8b10e700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8b90f700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8c911700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 10 osd.2 30 internal heartbeat not healthy, dropping ping request

3. Op finally gets dequeued,
2020-10-24T00:11:56.126+0000 7f7f8b90f700 0 bluestore(/var/lib/ceph/osd/ceph-2) log_latency slow operation observed for submit_transact, latency = 29.946883267s
...
2020-10-24T00:11:56.126+0000 7f7f86104700 10 osd.2 30 dequeue_op 0x55655bc81a20 prio 63 cost 4194304 latency 29.934464 osd_op(client.4712.0:226 5.d 5.551df92d (undecoded) ondisk+write+known_if_redirected e30) v8 pg pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 luod=30'16 crt=30'19 lcod 30'15 mlcod 30'15 active+clean]
2020-10-24T00:11:56.126+0000 7f7f86104700 10 osd.2 pg_epoch: 30 pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 luod=30'16 crt=30'19 lcod 30'15 mlcod 30'15 active+clean] _handle_message: 0x55655bc81a20

4. Only to be requeued again
2020-10-24T00:12:04.128+0000 7f7f8a10c700 20 osd.2 pg_epoch: 30 pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 crt=30'19 lcod 30'18 mlcod 30'18 active+clean] requeue_op 0x55655bc81a20

5. Op finally completes
2020-10-24T00:12:04.128+0000 7f7f86104700 10 osd.2 30 dequeue_op 0x55655bc81a20 prio 63 cost 4194304 latency 37.935948 osd_op(client.4712.0:226 5.d 5:b49fb8aa:::benchmark_data_smithi101_31109_object225:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304 in=4194304b] snapc 0=[] ondisk+write+known_if_redirected e30) v8 pg pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 crt=30'19 lcod 30'18 mlcod 30'18 active+clean]

2020-10-24T00:12:04.128+0000 7f7f86104700 20 osd.2 pg_epoch: 30 pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 crt=30'19 lcod 30'18 mlcod 30'18 active+clean] do_op: op osd_op(client.4712.0:226 5.d 5:b49fb8aa:::benchmark_data_smithi101_31109_object225:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304 in=4194304b] snapc 0=[] ondisk+write+known_if_redirected e30) v8

2020-10-24T00:12:04.133+0000 7f7f86104700 10 osd.2 30 dequeue_op 0x55655bc81a20 finish
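As a quick cross-check of the timestamps above (plain arithmetic over the log lines, not Ceph code), the dequeue latencies reported by osd.2 line up with the wall-clock gap since the original enqueue_op at 00:11:26.191, and the bulk of the first gap is accounted for by the ~29.9s bluestore submit_transact:

// Plain arithmetic over the log timestamps above; not Ceph code.
#include <cstdio>

int main() {
  // Timestamps from the log, as seconds past 00:00:00 UTC on 2020-10-24.
  const double enqueue       = 11 * 60 + 26.191;  // 00:11:26.191 enqueue_op
  const double first_dequeue = 11 * 60 + 56.126;  // 00:11:56.126 dequeue_op (step 3)
  const double final_dequeue = 12 * 60 +  4.128;  // 00:12:04.128 dequeue_op (step 5)

  // Computed gaps come out to ~29.935 s and ~37.937 s, matching the logged
  // dequeue_op latencies of 29.934464 and 37.935948 within a few milliseconds.
  std::printf("first dequeue delay: %.3f s\n", first_dequeue - enqueue);
  std::printf("final dequeue delay: %.3f s\n", final_dequeue - enqueue);

  // Nearly all of the first gap corresponds to the slow bluestore operation
  // ("log_latency slow operation observed for submit_transact, latency =
  // 29.946883267s"), which is consistent with op worker threads being blocked
  // well past the 15 s heartbeat grace.
  return 0;
}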

#1

Updated by Igor Fedotov over 3 years ago

Did you perform any large pool/PG removals recently? Or maybe some data rebalancing that could result in PG migration to different OSDs?
