Bug #47974
Slow requests due to unhealthy heartbeat - 'OSD::osd_op_tp thread 0x7f7f85903700' had timed out after 15s
Description
Slow requests observed due to unhealthy heartbeats on osd.2.
/a/sseshasa-2020-10-23_12:25:57-rados-wip-sseshasa-testing-2020-10-22-1249-distro-basic-smithi/5551861
The following is the sequence of events for the op for which the slow request was reported:
1. Op enqueued
2020-10-24T00:11:26.191+0000 7f7fa8a2f700 15 osd.2 30 enqueue_op 0x55655bc81a20 prio 63 type 42 cost 4194304 latency 0.000038 epoch 30 osd_op(client.4712.0:226 5.d 5.551df92d (undecoded) ondisk+write+known_if_redirected e30) v8
2020-10-24T00:11:26.191+0000 7f7fa8a2f700 20 osd.2 op_wq(5) _enqueue OpSchedulerItem(5.d PGOpItem(op=osd_op(client.4712.0:226 5.d 5.551df92d (undecoded) ondisk+write+known_if_redirected e30) v8) prio 63 cost 4194304 e30)
2. Many connection-related messages appear, along with the unhealthy timeout messages:
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f85903700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f86104700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f86905700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f87106700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f87907700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f88909700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8990b700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8a10c700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8b10e700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8b90f700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7f8c911700' had timed out after 15s
2020-10-24T00:11:41.841+0000 7f7fa8a2f700 10 osd.2 30 internal heartbeat not healthy, dropping ping request
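The behavior in step 2 can be illustrated with a minimal watchdog sketch (this is a simplified Python model, not Ceph's actual C++ heartbeat_map implementation): each worker thread "touches" its entry while making progress, and a checker flags any thread whose last touch is older than its grace period, at which point the daemon treats itself as internally unhealthy and drops ping requests.

```python
import threading
import time

class HeartbeatMap:
    """Minimal sketch of a thread-health watchdog: workers touch their
    entry while making progress; a checker flags threads whose last
    touch is older than the grace period (15s in the logs above)."""

    def __init__(self, grace=15.0):
        self.grace = grace
        self.last_touch = {}          # thread name -> time of last touch
        self.lock = threading.Lock()

    def touch(self, name, now=None):
        # A worker calls this while it is making progress.
        with self.lock:
            self.last_touch[name] = now if now is not None else time.monotonic()

    def unhealthy_threads(self, now=None):
        # Returns the names of threads that exceeded their grace period,
        # mirroring "'OSD::osd_op_tp thread ...' had timed out after 15s".
        now = now if now is not None else time.monotonic()
        with self.lock:
            return [name for name, t in self.last_touch.items()
                    if now - t > self.grace]

hb = HeartbeatMap(grace=15.0)
hb.touch("osd_op_tp-0", now=0.0)         # worker last checked in at t=0
print(hb.unhealthy_threads(now=10.0))    # → []  (still within grace)
print(hb.unhealthy_threads(now=16.0))    # → ['osd_op_tp-0']  (15s exceeded)
```

While any thread is flagged, the sketch's owner would report "internal heartbeat not healthy" and refuse to answer peer pings, which is why the slow op also manifests as unhealthy heartbeats.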
3. The op finally gets dequeued:
2020-10-24T00:11:56.126+0000 7f7f8b90f700 0 bluestore(/var/lib/ceph/osd/ceph-2) log_latency slow operation observed for submit_transact, latency = 29.946883267s
...
2020-10-24T00:11:56.126+0000 7f7f86104700 10 osd.2 30 dequeue_op 0x55655bc81a20 prio 63 cost 4194304 latency 29.934464 osd_op(client.4712.0:226 5.d 5.551df92d (undecoded) ondisk+write+known_if_redirected e30) v8 pg pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 luod=30'16 crt=30'19 lcod 30'15 mlcod 30'15 active+clean]
2020-10-24T00:11:56.126+0000 7f7f86104700 10 osd.2 pg_epoch: 30 pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 luod=30'16 crt=30'19 lcod 30'15 mlcod 30'15 active+clean] _handle_message: 0x55655bc81a20
4. Only to be requeued again:
2020-10-24T00:12:04.128+0000 7f7f8a10c700 20 osd.2 pg_epoch: 30 pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 crt=30'19 lcod 30'18 mlcod 30'18 active+clean] requeue_op 0x55655bc81a20
5. Op finally completes
2020-10-24T00:12:04.128+0000 7f7f86104700 10 osd.2 30 dequeue_op 0x55655bc81a20 prio 63 cost 4194304 latency 37.935948 osd_op(client.4712.0:226 5.d 5:b49fb8aa:::benchmark_data_smithi101_31109_object225:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304 in=4194304b] snapc 0=[] ondisk+write+known_if_redirected e30) v8 pg pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 crt=30'19 lcod 30'18 mlcod 30'18 active+clean]
2020-10-24T00:12:04.128+0000 7f7f86104700 20 osd.2 pg_epoch: 30 pg[5.d( v 30'19 (0'0,30'19] local-lis/les=28/29 n=19 ec=28/28 lis/c=28/28 les/c/f=29/29/0 sis=28) [2,0] r=0 lpr=28 crt=30'19 lcod 30'18 mlcod 30'18 active+clean] do_op: op osd_op(client.4712.0:226 5.d 5:b49fb8aa:::benchmark_data_smithi101_31109_object225:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304 in=4194304b] snapc 0=[] ondisk+write+known_if_redirected e30) v8
2020-10-24T00:12:04.133+0000 7f7f86104700 10 osd.2 30 dequeue_op 0x55655bc81a20 finish
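The end-to-end delay can be cross-checked directly from the log timestamps. The sketch below (plain Python, using the timestamps quoted above) diffs the enqueue time against the final dequeue time; the result is close to the 37.935948s latency reported by dequeue_op, the small gap being expected since the OSD's latency counter starts at op receipt rather than at the exact enqueue log line.

```python
from datetime import datetime

# Timestamps copied from the enqueue_op (step 1) and final
# dequeue_op (step 5) log lines above.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"
enqueued = datetime.strptime("2020-10-24T00:11:26.191+0000", FMT)
finished = datetime.strptime("2020-10-24T00:12:04.128+0000", FMT)

elapsed = (finished - enqueued).total_seconds()
print(round(elapsed, 3))  # → 37.937
```

This is consistent with the intermediate evidence: roughly 30s lost to the slow submit_transact in BlueStore (step 3) plus another ~8s before the requeued op is processed (steps 4 and 5).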
Updated by Igor Fedotov over 3 years ago
Did you perform any large pool/PG removals recently? Or maybe some data rebalancing that could result in PG migration to different OSDs?