Bug #54396 (closed)

Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue

Added by Dan van der Ster about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus, pacific, quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See https://www.spinics.net/lists/ceph-users/msg71061.html

This time around, after a few hours of snaptrimming, users complained of high IO
latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the
active MDS. I attributed this to the snaptrimming and decided to reduce it by
initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to
help much, so I then set it to 0, which had the surprising effect of
transitioning all PGs back to active+clean (is this intended?). I also restarted
the MDS which seemed to be struggling. IO latency went back to normal
immediately.

In the code, when osd_pg_max_concurrent_snap_trims is 0, PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&) calls pg->snap_mapper.get_next_objects_to_trim asking for 0 objects to trim. get_next_objects_to_trim returns -ENOENT in that case, and the DoSnapWork handler treats -ENOENT as "nothing left to trim for this snap", so it erases the remaining snap_to_trim even though its objects have not actually been trimmed.
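
To make the failure mode concrete, below is a minimal, self-contained C++ sketch of that control flow. FakeSnapMapper and the main() driver are hypothetical stand-ins rather than the real Ceph classes and signatures; the point is only that a request for 0 objects comes back -ENOENT, which the caller then misreads as "this snap is fully trimmed".

    #include <cerrno>
    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for SnapMapper; the real Ceph class and signature differ.
    struct FakeSnapMapper {
      std::vector<std::string> objects;  // objects still referencing the snap being trimmed

      // Returns up to 'max' objects to trim; -ENOENT when nothing was found.
      // With max == 0 nothing is ever found, even though objects remain.
      int get_next_objects_to_trim(unsigned max, std::vector<std::string>* out) {
        while (out->size() < max && !objects.empty()) {
          out->push_back(objects.back());
          objects.pop_back();
        }
        return out->empty() ? -ENOENT : 0;
      }
    };

    int main() {
      FakeSnapMapper mapper{{"obj_a", "obj_b", "obj_c"}};
      std::set<unsigned> snap_trimq{42};       // one snapshot queued for trimming

      unsigned max_concurrent_snap_trims = 0;  // the operator-set value from the report
      std::vector<std::string> to_trim;
      int r = mapper.get_next_objects_to_trim(max_concurrent_snap_trims, &to_trim);

      if (r == -ENOENT) {
        // The caller treats -ENOENT as "snap fully trimmed" and drops it from
        // the queue, even though three objects were never trimmed.
        snap_trimq.erase(42);
        std::printf("snap 42 erased from trimq; %zu objects left untrimmed\n",
                    mapper.objects.size());
      }
      return 0;
    }

With any nonzero max the erase is skipped, which lines up with the report above: setting the option to 1 left PGs snaptrimming, while setting it to 0 flipped them straight back to active+clean.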


Related issues 4 (0 open, 4 closed)

Related to RADOS - Bug #52026: osd: pgs went back into snaptrim state after osd restart (Resolved, Ronen Friedman)

Copied to RADOS - Backport #54466: pacific: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue (Resolved, Laura Flores)
Copied to RADOS - Backport #54467: quincy: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue (Resolved, Laura Flores)
Copied to RADOS - Backport #54468: octopus: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue (Resolved, Laura Flores)
#1

Updated by Dan van der Ster about 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Dan van der Ster
  • Pull request ID set to 45140
#2

Updated by Dan van der Ster about 2 years ago

More context:

ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.

Did CephFS just leak a massive 12 TiB worth of objects...? It seems to me that
the snaptrim operation did not complete at all.

#3

Updated by Radoslaw Zarzynski about 2 years ago

  • Related to Bug #52026: osd: pgs went back into snaptrim state after osd restart added
#4

Updated by Radoslaw Zarzynski about 2 years ago

  • Priority changed from Normal to High
#5

Updated by Laura Flores about 2 years ago

  • Status changed from Fix Under Review to Resolved
#6

Updated by Neha Ojha about 2 years ago

  • Status changed from Resolved to Pending Backport
#7

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54466: pacific: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue added
#8

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54467: quincy: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue added
#9

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54468: octopus: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue added
#10

Updated by Neha Ojha almost 2 years ago

  • Status changed from Pending Backport to Resolved
