Bug #20343

open

Jewel: OSD thread timeouts in XFS

Added by Eric Choi almost 7 years ago. Updated almost 7 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Creating a tracker ticket following a suggestion from the mailing list:

"
We've been having this ongoing problem with threads timing out on the
OSDs. Typically we'll see the OSD become unresponsive for about a minute,
as threads from other OSDs time out. The timeouts don't seem to be
correlated to high load. We turned up the logs to 10/10 for part of a day
to catch some of these in progress, and saw the pattern below in the logs
several times (grepping for individual threads involved in the timeouts).

We are using Jewel 10.2.7.

Logs:

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2(
v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593
les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475
pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967027 lcod
5484'12967028 active] add_log_entry 5484'12967030 (0'0) modify
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2(
v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593
les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475
pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967028 lcod
5484'12967028 active] append_log: trimming to 5484'12967028 entries
5484'12967028 (5484'12967026) delete
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700 5 write_log with: dirty_to: 0'0,
dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false,
divergent_priors: 0, writeout_from: 5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

Metrics:
OSD Disk IO Wait spikes from 2 ms to 1 s, CPU Procs Blocked spikes from 0 to
16, IO In Progress spikes from 0 to hundreds, and IO Time Weighted and IO
Time both spike. Average Queue Size on the device spikes. One minute later,
Write Time, Reads, and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler

"

To clarify, this is an RGW cluster. When this happens, we do see latency spikes in RGW as well. We are using a blind bucket setup for this zone, with index_type = 1. Our main use case involves many small objects (400K objects/min, each payload about 100 KB); the pattern is that we repeatedly insert, read, and delete within a relatively short time window (<30 ms).
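
A quick way to confirm the indexless (blind) bucket configuration is to dump the zone's placement settings. This is only a sketch and assumes a typical Jewel zone layout where the placement entry exposes an index_type field:

  # Dump the zone and look for the placement index type
  # (index_type = 1 corresponds to indexless / "blind" buckets)
  radosgw-admin zone get | grep -B 2 -A 2 index_type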

This happens repeatedly during periods of high traffic. Is there any other information that would help with debugging?
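
For reference, the "had timed out after 15" messages above come from the internal heartbeat of the OSD::osd_op_tp thread pool. A small sketch for checking the relevant timeouts via the admin socket on the host running one of the affected OSDs (osd.30 is taken from the logs; the quoted defaults are from memory and worth verifying against your build):

  # Internal heartbeat timeout for the op thread pool (default 15 s)
  ceph daemon osd.30 config get osd_op_thread_timeout
  # Suicide timeout for the same pool (default 150 s); exceeding it aborts the OSD
  ceph daemon osd.30 config get osd_op_thread_suicide_timeout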

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to rgw

That IO pattern may just be killing the OSD on its own, but I'm not sure what RGW is turning it into or if there's stuff like metadata logging that needs to get zapped first.

Actions #2

Updated by Josh Durgin almost 7 years ago

  • Project changed from rgw to Ceph

This could be filestore splitting directories into multiple subdirectories when there are many objects, then merging them when you delete objects, then splitting again with a new set of uploads. Have you tried setting filestore_merge_threshold = -10 (or the negative of whatever value you have set now) to prevent directory merging?
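
A minimal sketch of that change as a ceph.conf stanza, assuming the usual [osd] section (the default split multiple shown is from memory; filestore reads these thresholds, and restarting the OSDs is the safest way to make sure they take effect):

  [osd]
  # A negative merge threshold disables directory merging in filestore
  filestore merge threshold = -10
  # Splitting still happens once a subdirectory holds roughly
  # 16 * filestore_split_multiple * |filestore_merge_threshold| objects
  filestore split multiple = 2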

You can inspect the directory structure beneath certain pgs at different times in your workload to see if the tree is getting deeper and shallower for a given PG.
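
A rough way to do that check, assuming the default filestore data path and using pg 10.6d2 from the logs above as an example (take one snapshot at a quiet time and another during an incident, then compare):

  # Directory count and maximum depth of the PG's hashed directory tree
  PGDIR=/var/lib/ceph/osd/ceph-30/current/10.6d2_head
  find "$PGDIR" -type d | wc -l
  find "$PGDIR" -type d | awk -F/ '{ print NF }' | sort -n | tail -1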

Actions #3

Updated by Eric Choi almost 7 years ago

We looked through the mon logs and we can't really find any splitting (or merging) PG states in there. Do we need to up the log level there?

Also, wouldn't this just result in performance degradation rather than the thread lock-ups that we see? In general our performance seems okay (other than the occasional lock-ups).

Actions #4

Updated by Josh Durgin almost 7 years ago

The filestore-level splitting and merging isn't in the logs; the best way to tell is to examine a PG's directory, e.g. by running tree on it. Thread timeouts like these are a symptom of something going slowly, such as this.

Actions #5

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS