Bug #9487

closed

dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

Added by Dan van der Ster over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
dumpling
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
using dumpling 0.67.10...

We are doing quite a bit of backfilling these days in order to make room for some new SSD journals. I am removing 2 OSDs at a time using ceph osd crush reweight osd.N 0, and each time I do this I get slow requests which start a few minutes after the backfilling begins and end around 3 minutes later. Otherwise the backfilling completes without incident. I was able to isolate the cause of the backfilling to one single OSD which is busy with snap trimming. Here are some logs of osd.11 from this morning.
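
For reference, each drain step is just the CRUSH reweight mentioned above; a rough sketch, with osd ids as placeholders:

  # drain two OSDs at a time by zeroing their CRUSH weight
  ceph osd crush reweight osd.<N> 0
  ceph osd crush reweight osd.<M> 0
  # watch the cluster log for slow request warnings while the backfill runs
  ceph -w | grep -i 'slow request'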

Backfilling starts at 2014-09-16 09:03:04.623202

First slow request is
2014-09-16 09:06:42.413217 osd.94 xxx:6920/89989 108 : [WRN] slow request 30.169698 seconds old, received at 2014-09-16 09:06:12.243490: osd_op(client.36356481.0:203616535 rbd_data.22cd9436e995f3.0000000000000f8f [write 2899968~20480] 4.9e398fd7 e95006) v4 currently waiting for subops from [11,1169]

Here is ceph-osd.11.log with debug_osd=20

2014-09-16 09:06:12.275675 7ff0ac575700 20 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer slept for 0.100000
2014-09-16 09:06:12.435938 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer entry
2014-09-16 09:06:12.436803 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer posting
...
2014-09-16 09:06:12.439147 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 1
...
2014-09-16 09:06:12.446790 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 2

then eventually the last one three minutes later:

2014-09-16 09:09:04.621188 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[7275~1]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 7275

During those three minutes, the osd was only snap trimming; no other types of ops were getting through. I even tried (re)injecting --debug_osd=20 while the osd was trimming, just to see exactly when osd.11 was responsive again. The injectargs hung until just after trimming completed:

2014-09-16 09:09:04.820524 7ff0b6f86700 20 osd.11 95020 _dispatch 0x1218ec80 command(tid 24: {"injected_args": ["--debug_osd 20"], "prefix": "injectargs"}) v1

Obviously the slow requests also disappeared just after the snap trim on osd.11 completed, and during this backfilling exercise there were no other slow requests.

As you can also see, osd_snap_trim_sleep is not effective. We have it set to 0.1, but it's useless in this case anyway because the sleep only happens once, at the start of trimming all of PG 5.318.
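
For context, that setting is just the usual config knob; roughly how it is set here (a sketch, the osd id in the injectargs line is a placeholder):

  # ceph.conf, [osd] section
  osd snap trim sleep = 0.1

  # or injected at runtime on a specific OSD
  ceph tell osd.<N> injectargs '--osd_snap_trim_sleep 0.1'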

Full log of osd.11 is attached.

Do you have any suggestions on how to make this more transparent? We have ~150 more OSDs to drain, so I'll have plenty of opportunities to troubleshoot this.

Best Regards, Dan


Files

ceph-osd.11.log.gz (30 MB) ceph-osd.11.log.gz Dan van der Ster, 09/16/2014 01:44 AM
ceph-osd.76.log.gz (24.3 MB) ceph-osd.76.log.gz Dan van der Ster, 11/04/2014 02:31 AM
ceph-osd.90.log.gz (21.1 MB) ceph-osd.90.log.gz Dan van der Ster, 11/17/2014 03:17 AM
cephversion.txt (48.5 KB) cephversion.txt Dan van der Ster, 11/21/2014 08:03 AM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #9503: Dumpling: removing many snapshots in a short time makes OSDs go berserk (Resolved, Samuel Just, 09/17/2014)

Actions #1

Updated by Dan van der Ster over 9 years ago

In case it wasn't clear, there is nothing special about osd.11. Each time I reweight 2 OSDs, the slow requests are caused by a different osd that becomes busy with snap trimming.

Actions #2

Updated by Dan van der Ster over 9 years ago

I was able to isolate the cause of the backfilling to one single OSD

Typo: that should read "I was able to isolate the cause of the slow requests to one single OSD".

Actions #3

Updated by Dan van der Ster over 9 years ago

Here is a bit more... I checked for "snap_trimmer entry" on other OSDs this morning. There were a few others, but all except osd.11 had only a handful (~5-10) of snaps to trim, e.g.:

/var/log/ceph/ceph-osd.252.log:2014-09-16 09:03:04.407862 7f0665871700 10 osd.252 pg_epoch: 94976 pg[2.174( v 74470'89792 (45121'86792,74470'89792] local-les=94976 n=0 ec=1 les/c 94976/94976 94971/94975/94975) [252,656,694] r=0 lpr=94975 lcod 0'0 mlcod 0'0 active+clean snaptrimq=[1~5]] snap_trimmer entry

/var/log/ceph/ceph-osd.279.log:2014-09-16 09:03:04.929522 7f23c9e2f700 10 osd.279 pg_epoch: 94976 pg[2.3d7( v 74470'98502 (6107'95502,74470'98502] local-les=94976 n=0 ec=1 les/c 94976/94976 94971/94975/94975) [279,496,1140] r=0 lpr=94975 lcod 0'0 mlcod 0'0 active+clean snaptrimq=[1~5]] snap_trimmer entry

/var/log/ceph/ceph-osd.1156.log:2014-09-16 09:03:03.381479 7fe40ff9b700 10 osd.1156 pg_epoch: 94975 pg[2.200( v 74470'95313 (6105'92313,74470'95313] local-les=94975 n=0 ec=1 les/c 94975/94975 94971/94974/94974) [1156,780,304] r=0 lpr=94974 lcod 0'0 mlcod 0'0 active+clean snaptrimq=[1~5]] snap_trimmer entry

/var/log/ceph/ceph-osd.220.log:2014-09-16 09:07:03.202399 7f0c98fbb700 10 osd.220 pg_epoch: 95014 pg[5.29( v 93278'1466 (0'0,93278'1466] local-les=94972 n=483 ec=357 les/c 94972/95014 94971/94971/94971) [220,385,169] r=0 lpr=94971 mlcod 0'0 active+clean snaptrimq=[7236~2,7239~4,723e~38]] snap_trimmer entry

But osd.11 had more than 28000 snaps to trim:

  # grep 'trimming snap' ceph-osd.11.log | wc -l
  28457

It looks like we need a way to do the trimming in chunks, say 16 at a time. Is that doable?

Actions #4

Updated by Dan van der Ster over 9 years ago

I also noticed that before the snap trimmer starts, purged_snaps is [] for 5.318. Is that normal, or should (the complete) purged_snaps normally be copied along to the new OSD? Certainly if it had been, the small incremental (if at all) trim would have completed very quickly. If, on the contrary, the missed purged_snaps update only happens rarely, then that might explain why we don't see these slow requests after every backfill of a pool 5 PG completes.
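
(For anyone following along: one way to read purged_snaps off a live PG is pg query; a sketch, and the exact output layout varies by release:)

  # purged_snaps shows up in the "info" section of the query output
  ceph pg 5.318 query | grep purged_snaps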

I saw in the doc that "if the purged_snaps update is lost, we merely retrim a now empty snap." -- but I didn't find where the purged_snaps update occurs and how it could possibly be lost. In this case, "merely" retrimming takes a long time, so understanding where it got lost could also be a good solution to this issue.

I also wanted to mention that pool 5 is our glance images pool. The number of snaps to trim will only continue to rise over time, so the effect of this issue will get worse and worse.

Actions #5

Updated by Dan van der Ster over 9 years ago

Please comment on https://github.com/ceph/ceph/pull/2516.
Thanks!

Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Priority changed from High to Urgent

Dan van der Ster wrote:

I also noticed that before the snap trimmer starts, purge_snaps is [] for 5.318. Is that normal, or should (the complete) purge_snaps normally be copied along to the new OSD? Certainly if it had been, the small incremental (if at all) trim would have completed very quickly. If on the contrary the missed purged_snaps update is only happening rarely, then that might explain why we don't see these slow requests after every backfill of a pool 5 PG completes.

I think this is exactly the problem... purged_snaps shouldn't start out empty. That's why it is uselessly scanning for all 7000 deleted snaps.

Trying to reproduce this ....

Actions #7

Updated by Sage Weil over 9 years ago

Okay, I can't seem to reproduce this.

Dan or Florian, can you attach a log? What I need is debug ms = 1 and debug osd = 20 on the OSD that the PG is getting backfilled to, but starting before the backfill starts. I want to see why purged_snaps is not getting initialized properly.
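
A sketch of how to enable that ahead of time, either in ceph.conf on the target OSD or injected at runtime (the osd id is a placeholder for whichever OSD the PG will backfill to):

  # ceph.conf on the OSD being backfilled to
  [osd]
      debug ms = 1
      debug osd = 20

  # or at runtime, before triggering the backfill
  ceph tell osd.<N> injectargs '--debug_ms 1 --debug_osd 20'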

Thanks!

Actions #8

Updated by Dan van der Ster over 9 years ago

Thanks Sage. There's a log with debug_osd=20 attached to this issue. I'll try tomorrow to get one with debug_ms=1 too.

Actions #9

Updated by Sage Weil over 9 years ago

Nevermind, I've reproduced it!

Actions #10

Updated by Sage Weil over 9 years ago

  • Status changed from In Progress to Fix Under Review

wip-9487
wip-9487-dumpling for backport

Actions #11

Updated by Dan van der Ster over 9 years ago

Hi Sage, Thanks for the quick patch. I tried wip-9487-dumpling on our test cluster and now there is no snap trimming at all after backfilling. If I perform some rmsnap's during the backfilling, the snap_trimq is never more than 1 or 2 snaps long. So I think this fixes it, thanks!

Actions #12

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to 12

I had some comments on that pull request.

Actions #13

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7
Actions #14

Updated by Dan van der Ster over 9 years ago

Hi Sam,
Same as for #9503
I think this is fixed in master/giant.. correct? Just a gentle reminder that we'd appreciate a backport in dumpling.
Cheers, Dan

Actions #15

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
Actions #16

Updated by Dan van der Ster over 9 years ago

Hi Sage and Sam,
I've just tried wip-9113-9487-dumpling on our test cluster. (Using this build: http://gitbuilder.ceph.com/ceph-rpm-centos6_5-x86_64-basic/ref/wip-9113-9487-dumpling/x86_64/)

osd.76: running {"version":"0.67.11-4-g496e561"}
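
(That JSON is the output of the version command; most likely obtained with something like the following, though the exact invocation is a guess:)

  ceph tell osd.76 version
  # or, on the OSD host, via the admin socket
  ceph --admin-daemon /var/run/ceph/ceph-osd.76.asok version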

This commit:

ReplicatedPG: don't move on to the next snap immediately

seems to work. I see a sleep between each trim operation now.

But this fix:

osd: initialize purged_snap on backfill start; restart backfill if change

doesn't seem to work here. I still see

2014-11-04 11:11:50.037102 7f84cc75f700  5 osd.76 pg_epoch: 18200 pg[35.0( empty local-les=0 n=0 ec=18036 les/c 18151/18151 18188/18188/18188) [76,71,298] r=0 lpr=18188 pi=18148-18187/2 mlcod 0'0 inactive] enter Started/Primary/Active
2014-11-04 11:11:50.037123 7f84cc75f700 10 osd.76 pg_epoch: 18200 pg[35.0( empty local-les=0 n=0 ec=18036 les/c 18151/18151 18188/18188/18188) [76,71,298] r=0 lpr=18188 pi=18148-18187/2 mlcod 0'0 inactive] state<Started/Primary/Active>: In Active, about to call activate
2014-11-04 11:11:50.037138 7f84cc75f700 10 osd.76 pg_epoch: 18200 pg[35.0( empty local-les=18200 n=0 ec=18036 les/c 18151/18151 18188/18188/18188) [76,71,298] r=0 lpr=18188 pi=18148-18187/2 mlcod 0'0 active] check_local
2014-11-04 11:11:50.037152 7f84cc75f700 10 osd.76 pg_epoch: 18200 pg[35.0( empty local-les=18200 n=0 ec=18036 les/c 18151/18151 18188/18188/18188) [76,71,298] r=0 lpr=18188 pi=18148-18187/2 mlcod 0'0 active snaptrimq=[1~6e]] activate - snap_trimq [1~6e]

...

2014-11-04 11:12:01.181457 7f84c955a700 10 osd.76 pg_epoch: 18206 pg[35.0( empty local-les=18200 n=0 ec=18036 les/c 18200/18206 18188/18188/18188) [76,71,298] r=0 lpr=18188 mlcod 0'0 active+clean snaptrimq=[2~6d]] SnapTrimmer state<Trimming/WaitingOnReplicas>: purged_snaps now [1~1], snap_trimq now [2~6d]
2014-11-04 11:12:01.370462 7f84c955a700 10 osd.76 pg_epoch: 18206 pg[35.0( empty local-les=18200 n=0 ec=18036 les/c 18200/18206 18188/18188/18188) [76,71,298] r=0 lpr=18188 mlcod 0'0 active+clean snaptrimq=[3~6c]] SnapTrimmer state<Trimming/WaitingOnReplicas>: purged_snaps now [1~2], snap_trimq now [3~6c]
2014-11-04 11:12:01.508636 7f84c955a700 10 osd.76 pg_epoch: 18206 pg[35.0( empty local-les=18200 n=0 ec=18036 les/c 18200/18206 18188/18188/18188) [76,71,298] r=0 lpr=18188 mlcod 0'0 active+clean snaptrimq=[4~6b]] SnapTrimmer state<Trimming/WaitingOnReplicas>: purged_snaps now [1~3], snap_trimq now [4~6b]
...

A full log is attached (of osd.76, which is receiving PG 35.0, a PG with a few trimmed snaps inside).

(This is strange because the older branch wip-9487-dumpling does fix the purged_snaps initialization problem -- we're still running that branch elsewhere -- and the relevant commit hasn't changed much since then).

Actions #17

Updated by Dan van der Ster over 9 years ago

Looking further, I noticed that the pool 35 PGs are not entering the backfilling state -- only recovery. I'm bringing osd.76 in and out, also reweighting it from 2.72 to 0 to 2.72. I never see the line:

pg 35.0 ... restarting backfill on osd.76

Hence it never enters the block in PG.cc where the purged_snaps are copied over. Pool 35 is a test pool where I've made some snaps, rm'd some snaps, and written some random bench objects. Any idea? Am I doing something wrong in this test, or do you think the purged_snaps also need to be copied for the recovery state?
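
In shell terms, the test amounts to roughly the following (pool and snap names are placeholders):

  # make and remove some pool snaps, and write some bench objects
  rados -p <testpool> mksnap snap1
  rados -p <testpool> rmsnap snap1
  rados -p <testpool> bench 60 write

  # then cycle osd.76 to trigger recovery/backfill
  ceph osd out 76
  # ... wait for active+clean ...
  ceph osd in 76
  ceph osd crush reweight osd.76 0
  # ... wait for active+clean ...
  ceph osd crush reweight osd.76 2.72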

Actions #18

Updated by Samuel Just over 9 years ago

I think that's an annoying special case for snaps purged on an empty pg. Both the old primary which did the trim and the new primary which is empty have the same last update 0'0 and the same empty log. So, the new empty primary's info is chosen over the old empty primary's info and thus purged_snaps goes back to empty. If the old primary had a longer log, it would have resulted in the new primary effectively starting with the old primary's info and purged_snaps. I think that's a separate bug. Are you sure the other patch did not result in this behavior?

Actions #19

Updated by Dan van der Ster over 9 years ago

Well, the PG isn't empty -- I've been writing a bunch of data to it using rados bench. Basically, I'm having trouble getting a new test pool to enter backfilling at all. But maybe that's a different issue.

Anyway, I have other, older pools that are entering backfilling and have a few purged_snaps. But alas, I still see re-trimming of already purged snaps with 0.67.11-4-g496e561. I've attached a log for osd.90 -- PGs 3.2e7 and 3.7e are affected, e.g.

2014-11-17 11:54:38.981159 7f0cc119e700 10 osd.90 pg_epoch: 19542 pg[3.2e7( v 19531'36796 (18972'33790,19531'36796] local-les=19542 n=0 ec=256 les/c 19542/19542 19522/19541/19532) [90,7,99] r=0 lpr=19541 lcod 19531'36795 mlcod 0'0 active+clean snaptrimq=[1~14]] SnapTrimmer state<Trimming/WaitingOnReplicas>: WaitingOnReplicas: adding snap 1 to purged_snaps
2014-11-17 11:54:38.981177 7f0cc119e700 10 osd.90 pg_epoch: 19542 pg[3.2e7( v 19531'36796 (18972'33790,19531'36796] local-les=19542 n=0 ec=256 les/c 19542/19542 19522/19541/19532) [90,7,99] r=0 lpr=19541 lcod 19531'36795 mlcod 0'0 active+clean snaptrimq=[2~13]] SnapTrimmer state<Trimming/WaitingOnReplicas>: purged_snaps now [1~1], snap_trimq now [2~13]
2014-11-17 11:54:39.084524 7f0cc079d700 10 osd.90 pg_epoch: 19542 pg[3.2e7( v 19531'36796 (18972'33790,19531'36796] local-les=19542 n=0 ec=256 les/c 19542/19542 19522/19541/19532) [90,7,99] r=0 lpr=19541 lcod 19531'36795 mlcod 0'0 active+clean snaptrimq=[2~13]] SnapTrimmer state<Trimming/WaitingOnReplicas>: WaitingOnReplicas: adding snap 2 to purged_snaps
2014-11-17 11:54:39.084542 7f0cc079d700 10 osd.90 pg_epoch: 19542 pg[3.2e7( v 19531'36796 (18972'33790,19531'36796] local-les=19542 n=0 ec=256 les/c 19542/19542 19522/19541/19532) [90,7,99] r=0 lpr=19541 lcod 19531'36795 mlcod 0'0 active+clean snaptrimq=[3~12]] SnapTrimmer state<Trimming/WaitingOnReplicas>: purged_snaps now [1~2], snap_trimq now [3~12]
2014-11-17 11:54:39.189007 7f0cc119e700 10 osd.90 pg_epoch: 19542 pg[3.2e7( v 19531'36796 (18972'33790,19531'36796] local-les=19542 n=0 ec=256 les/c 19542/19542 19522/19541/19532) [90,7,99] r=0 lpr=19541 lcod 19531'36795 mlcod 0'0 active+clean snaptrimq=[3~12]] SnapTrimmer state<Trimming/WaitingOnReplicas>: WaitingOnReplicas: adding snap 3 to purged_snaps
2014-11-17 11:54:39.189121 7f0cc119e700 10 osd.90 pg_epoch: 19542 pg[3.2e7( v 19531'36796 (18972'33790,19531'36796] local-les=19542 n=0 ec=256 les/c 19542/19542 19522/19541/19532) [90,7,99] r=0 lpr=19541 lcod 19531'36795 mlcod 0'0 active+clean snaptrimq=[4~11]] SnapTrimmer state<Trimming/WaitingOnReplicas>: purged_snaps now [1~3], snap_trimq now [4~11]
...

The full log is attached. Is this yet another case?

Actions #20

Updated by Samuel Just over 9 years ago

All other osds are running that branch, right? Also, which sha1 was it that you thought was working (the branches have been changed since)?

Actions #21

Updated by Dan van der Ster over 9 years ago

This test cluster is currently running 0.67.11-4-g496e561, mons and osds.

On our prod cluster we still run ceph-0.67.10-15.g23876d7.el6.x86_64 (from post 10 above). I haven't tried that older patch on the test cluster recently. I could revert to that and reconfirm that it's passing my current tests.

Actions #22

Updated by Samuel Just over 9 years ago

Oddly, I'm able to reproduce it easily on v0.67.11, but not wip-9113-9487-dumpling (496e561d81f2dd1bf92d588fc3afc2431e0a5b98). Are you sure the other osds were running wip-9113-9487-dumpling? Can you trigger backfill again on this pg/osd combo (probably mark it out, wait for active+clean, and then back in, wait for active+clean) and capture logging on all of the osds in the old and new acting set?

Actions #23

Updated by Dan van der Ster over 9 years ago

Today I restarted every mon and osd on the test cluster (again) and confirmed it is all running 0.67.11-4-g496e561. Now I cannot reproduce the issue anymore!! Great, but strange! (I still have a few PGs with the "annoying special case for snaps purged on an empty pg" issue, but the other issue I was reporting is no longer there).

This is very strange... I am sure that the OSDs have been running 0.67.11-4-g496e561 since Nov 4 -- I have the startup logs to prove it (attached). Perhaps there is something about this patch that implies that OSDs need to be restarted twice for it to fully work??

Anyway, from my side this looks good now to merge to dumpling (and firefly). I will definitely pay close attention when we are running this in production.

Actions #24

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved
Actions #25

Updated by Sage Weil over 9 years ago

  • Status changed from Resolved to Pending Backport

oops, still need firefly

Actions #26

Updated by Sage Weil over 9 years ago

  • Assignee changed from Sage Weil to Samuel Just
Actions #27

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved