Bug #10183

closed

OSD dispatcher thread hangs several seconds due to contention for osd_lock

Added by Guang Yang over 9 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, while investigating long tail latency during backfilling/recovering, I found that on some OSDs the dispatcher thread could hang for several seconds (even more than 10 seconds), so that client ops/sub-ops were stuck in the dispatch queue, resulting in a dramatic latency increase.

The following log snippets demonstrate the observation above:
Sub-op handling:

2014-11-20 11:29:05.882153 7ff466dd9700 10 -- 10.214.140.27:6866/2006571 >> 10.214.140.156:6830/3068565 pipe(0x86c4000 sd=455 :6866 s=2 pgs=2579712 cs=57 l=0 c=0x1fc526e0).reader got message 110789 0x35e10000 MOSDECSubOpRead(3.16des7 42351 ECSubRead(tid=27810863, to_read={b02236d/exxxx.jpg/head//3=0,57344}, attrs_to_read=)) v1
2014-11-20 11:29:05.882172 7ff466dd9700 20 -- 10.214.140.27:6866/2006571 queue 0x35e10000 prio 63
2014-11-20 11:29:12.494715 7ff4db153700  1 -- 10.214.140.27:6866/2006571 <== osd.444 10.214.140.156:6830/3068565 110789 ==== MOSDECSubOpRead(3.16des7 42351 ECSubRead(tid=27810863, to_read={b02236de/default.341342.405_8103466848_264dbb45ea_o.jpg/head//3=0,57344}, attrs_to_read=)) v1 ==== 166+0+0 (3748001437 0 0) 0x35e10000 con 0x1fc526e0
2014-11-20 11:29:12.494738 7ff4db153700 20 osd.310 42351 _dispatch 0x35e10000 MOSDECSubOpRead(3.16des7 42351 ECSubRead(tid=27810863, to_read={b02236de/default.341342.405_8103466848_264dbb45ea_o.jpg/head//3=0,57344}, attrs_to_read=)) v1

It turned out that the dispatcher thread 7ff4db153700 hung for 8 seconds waiting for osd_lock.

OSD tick:

2014-11-20 11:29:04.620614 7ff4e5d82700 10 osd.310 42351 sched_scrub scrubbing 3.14e5s0 at 2014-09-11 04:40:26.086526, diff >= max
2014-11-20 11:29:04.620619 7ff4e5d82700 20 osd.310 42351 inc_scrubs_pending 1 + 0 active >= max 1
2014-11-20 11:29:04.620621 7ff4e5d82700 20 osd.310 pg_epoch: 42351 pg[3.14e5s0( v 42071'128667 (40890'125666,42071'128667] local-les=35345 n=86661 ec=165 les/c 35345/35356 35339/35339/34961) [310,416,303
,81,340,6,463,354,359,457,394] r=0 lpr=35339 crt=42071'128665 lcod 42071'128666 mlcod 42071'128666 active+clean] sched_scrub: failed to reserve locally
2014-11-20 11:29:04.620638 7ff4e5d82700 10 osd.310 42351 sched_scrub scrubbing 4.13a at 2014-09-11 09:06:52.643757, diff >= max
2014-11-20 11:29:04.620643 7ff4e5d82700 20 osd.310 42351 inc_scrubs_pending 1 + 0 active >= max 1
2014-11-20 11:29:04.620645 7ff4e5d82700 20 osd.310 pg_epoch: 42351 pg[4.13a( v 41164'689081 (40890'686080,41164'689081] local-les=34971 n=18 ec=457 les/c 34971/34977 34961/34961/34961) [310,314,26] r=0 l
pr=34961 crt=40890'689079 lcod 41164'689080 mlcod 41164'689080 active+clean] sched_scrub: failed to reserve locally
......######### hang 8 seconds ###########.......
2014-11-20 11:29:12.487913 7ff4e5d82700 10 osd.310 42351 sched_scrub scrubbing 3.175as0 at 2014-09-11 14:46:47.920784, diff >= max

While calling sched_scrub, the OSD iterates over all PGs scheduled for scrubbing (to_schedule_scrub), locking each one to check whether it can be scheduled now; note that this method runs with osd_lock held.

The problem is that if a PG's lock is already held by something time-consuming (in our case, recovery), this sequence can block the entire OSD and cause long tail latency.
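
To make the blocking chain concrete, here is a minimal, simplified C++ sketch of the lock ordering involved (this is not the actual Ceph code; osd_lock, pg_lock and the function names are illustrative stand-ins for the real OSD/PG members):

#include <mutex>

// Illustrative stand-ins for the real OSD/PG members.
std::mutex osd_lock;   // coarse per-OSD lock, held by tick()/sched_scrub() and the dispatcher
std::mutex pg_lock;    // per-PG lock, held for the whole duration of a recovery operation

// Recovery thread: holds the PG lock for a long time under heavy backfill/recovery.
void do_recovery() {
    std::lock_guard<std::mutex> pgl(pg_lock);
    // ... long-running recovery work on the PG ...
}

// Tick thread: sched_scrub() runs with osd_lock held and then takes each PG's lock.
void tick_sched_scrub() {
    std::lock_guard<std::mutex> ol(osd_lock);
    std::lock_guard<std::mutex> pgl(pg_lock);   // blocks behind do_recovery() for seconds
    // ... check whether this PG can be scrubbed now ...
}

// Dispatcher thread: needs osd_lock to dispatch the incoming op/sub-op, so it stalls
// behind tick_sched_scrub(), which is itself stuck on the PG lock.
void dispatch_op() {
    std::lock_guard<std::mutex> ol(osd_lock);   // hangs here for the observed ~8 seconds
    // ... _dispatch(m) ...
}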

Potential fixes:
  1. In sched_scrub, could we do a best-effort iteration to kick off PG scrubbing, i.e., try-lock each PG and, if that fails, move on to the next one (see the sketch after this list)? This should not cause starvation given the current scheduling algorithm (a set, iterated one by one).
  2. Is there a chance to release osd_lock before taking the PG lock? I am not sure whether this breaks some basic assumption; it needs a deeper dive.
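
As a sketch of option 1, a try-lock iteration could look roughly like the following (hypothetical and simplified; PG, its lock member, sched_scrub() and to_schedule_scrub are stand-ins, not the actual Ceph types):

#include <mutex>
#include <set>

struct PG {
    std::mutex lock;
    void sched_scrub() { /* check/kick off scrubbing for this PG */ }
};

// Best-effort iteration: skip any PG whose lock is currently held instead of
// blocking on it while osd_lock is held.
void sched_scrub_best_effort(std::set<PG*>& to_schedule_scrub) {
    for (PG* pg : to_schedule_scrub) {
        std::unique_lock<std::mutex> pgl(pg->lock, std::try_to_lock);
        if (!pgl.owns_lock())
            continue;          // PG is busy (e.g. recovering); retry on the next pass
        pg->sched_scrub();     // check/kick off scrubbing while holding the PG lock
    }
}

Because the scrub schedule is a set that is re-examined on every pass, a skipped PG is simply retried the next time around, which is why starvation should not be a concern.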
#2

Updated by Sage Weil about 7 years ago

  • Status changed from New to Resolved