Bug #9487

closed

dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

Added by Dan van der Ster over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
dumpling
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
using dumpling 0.67.10...

We are doing quite a bit of backfilling these days in order to make room for some new SSD journals. I am removing 2 OSDs at a time using ceph osd crush reweight osd.N 0, and each time I do this I get slow requests which start a few minutes after the backfilling begins and end around 3 minutes later. Otherwise the backfilling completes without incident. I was able to isolate the cause of the slow requests to one single OSD which is busy with snap trimming. Here are some logs of osd.11 from this morning.
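For context, the drain procedure looks roughly like this (standard ceph CLI; osd.11 is just an example id, and the monitoring commands are simply how we watch for the slow requests):

# Start draining: drop the OSD's CRUSH weight to 0 so its data backfills elsewhere
ceph osd crush reweight osd.11 0

# Watch cluster events for backfill progress and slow request warnings
ceph -w

# Or dump the current health warnings (reports blocked/slow requests)
ceph health detail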

Backfilling starts at 2014-09-16 09:03:04.623202

First slow request is
2014-09-16 09:06:42.413217 osd.94 xxx:6920/89989 108 : [WRN] slow request 30.169698 seconds old, received at 2014-09-16 09:06:12.243490: osd_op(client.36356481.0:203616535 rbd_data.22cd9436e995f3.0000000000000f8f [write 2899968~20480] 4.9e398fd7 e95006) v4 currently waiting for subops from [11,1169]

Here is ceph-osd.11.log with debug_osd=20

2014-09-16 09:06:12.275675 7ff0ac575700 20 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer slept for 0.100000
2014-09-16 09:06:12.435938 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer entry
2014-09-16 09:06:12.436803 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer posting
...
2014-09-16 09:06:12.439147 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 1
...
2014-09-16 09:06:12.446790 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 2

then eventually the last one, three minutes later:

2014-09-16 09:09:04.621188 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[7275~1]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 7275

During those three minutes, the osd was only snap trimming; there were no other types of ops getting through. I even tried (re)injecting --debug_osd=20 while the osd was trimming, just to see exactly when osd.11 would be responsive again. The injectargs hung until just after trimming completed:

2014-09-16 09:09:04.820524 7ff0b6f86700 20 osd.11 95020 _dispatch 0x1218ec80 command(tid 24: {"injected_args": ["--debug_osd 20"], "prefix": "injectargs"}) v1
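For reference, that runtime injection was done with something like the following (standard ceph tell syntax, matching the injected_args seen in the log line above):

# Bump the osd debug level on a running daemon; this call hung until trimming finished
ceph tell osd.11 injectargs '--debug_osd 20'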

Obviously the slow requests also disappeared just after the snap trim on osd.11 completed, and during this backfilling exercise there were no other slow requests.

As you can also see, the osd_snap_trim_sleep option is not effective. We have it set to 0.1, but it's useless in this case anyway because the sleep only happens once, at the start of trimming all of PG 5.318.
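For completeness, this is roughly how the sleep is configured and checked on our side (the ceph.conf key and admin socket query are the standard mechanisms; the socket path below is an assumption and may differ per deployment):

# ceph.conf, [osd] section:
#   osd snap trim sleep = 0.1

# or inject it at runtime:
ceph tell osd.11 injectargs '--osd_snap_trim_sleep 0.1'

# verify the running value via the admin socket:
ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok config show | grep osd_snap_trim_sleep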

Full log of osd.11 is attached.

Do you have any suggestions on how to make this more transparent? We have ~150 more OSDs to drain, so I'll have plenty of opportunities to troubleshoot this.

Best Regards, Dan


Files

ceph-osd.11.log.gz (30 MB) ceph-osd.11.log.gz Dan van der Ster, 09/16/2014 01:44 AM
ceph-osd.76.log.gz (24.3 MB) ceph-osd.76.log.gz Dan van der Ster, 11/04/2014 02:31 AM
ceph-osd.90.log.gz (21.1 MB) ceph-osd.90.log.gz Dan van der Ster, 11/17/2014 03:17 AM
cephversion.txt (48.5 KB) cephversion.txt Dan van der Ster, 11/21/2014 08:03 AM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #9503: Dumpling: removing many snapshots in a short time makes OSDs go berserk (Resolved, Samuel Just, 09/17/2014)
