Bug #52026
osd: pgs went back into snaptrim state after osd restart
Status: Closed
Description
We are testing snapshots in CephFS. This is a 4-node cluster with only replicated pools. During our tests we did a massive deletion of snapshots, many of whose files were already deleted, which produced a large amount of snaptrim work.
The initial snaptrim after the massive snapshot deletion ran for 10 hours. Some time later, one of our nodes crashed (for an unrelated reason), and when its OSDs came back we saw some pgs switch into the snaptrim/snaptrim_wait state.
We added snaptrimq monitoring (the sum of snaptrimq_len on all the pgs) and realized that the number was high and not decreasing.
The next day we restarted all the OSDs on another box, and again saw pgs going into the snaptrim state.
I attached a screenshot of our Grafana plots. The first spike is the massive snapshot deletion, the second is the crash of one box, and the third is the intentional restart of all the OSDs.
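The snaptrimq monitoring described above can be scripted from `ceph pg dump`. Below is a minimal, hypothetical sketch (not part of the original report) that sums the per-PG queue lengths; it assumes the JSON dump exposes a top-level `pg_stats` array whose entries carry a `snaptrimq_len` field, which may differ across Ceph releases:

```python
import json

def total_snaptrimq(pg_dump_json: str) -> int:
    """Sum snaptrimq_len across all PGs in a `ceph pg dump pgs -f json` blob.

    The `pg_stats` / `snaptrimq_len` field names are assumptions; verify
    against the JSON emitted by your Ceph release before relying on this.
    """
    pgs = json.loads(pg_dump_json)["pg_stats"]
    return sum(pg.get("snaptrimq_len", 0) for pg in pgs)

# Illustrative sample only, not real cluster output:
sample = json.dumps({"pg_stats": [
    {"pgid": "3.1e", "snaptrimq_len": 14048},
    {"pgid": "3.17", "snaptrimq_len": 0},
]})
print(total_snaptrimq(sample))  # 14048
```

Graphing this total over time is what makes the spikes described above visible.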
Updated by Neha Ojha almost 3 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to High
Is it possible for you to share some OSD logs with debug_osd=20 from when this issue happens?
Updated by Arthur Outhenin-Chalandre almost 3 years ago
Here is the log of an OSD that restarted and moved a few pgs into the snaptrim state.
ceph-post-file: 88808267-4ec6-416e-b61c-11da74a4d68e
Updated by Arthur Outhenin-Chalandre almost 3 years ago
I reproduced the issue by doing a `ceph pg repeer` on a pg with a non-zero snaptrimq_len. After the pg has been repeered, it switched into the snaptrim state.
Updated by Arthur Outhenin-Chalandre almost 3 years ago
I searched a bit through the log I sent and I don't see any trace of a pg in the snaptrim state, probably because it was not the primary OSD.
We tested the repeer on a pg with a non-zero snaptrimq_len on our Nautilus production RBD cluster (which uses replicated pools as well). On this cluster the repeer and the snaptrim were simply too fast to be noticeable without looking at the log.
Here is the log of the primary osd (which is much smaller than the previous logs I sent!):
ceph-post-file: f0208568-fbf4-48fa-a758-0c85ab4a6cdc
The repeer makes the snaptrimq_len of the pg go to zero. I am guessing that sometimes the pg doesn't go into the snaptrim state, so snaptrimq_len can grow to a number that becomes noticeable when we do a repeer (which is what happened while I was stress-testing snapshots). However, I don't have any logs showing this (buggy?) state transition.
Updated by Neha Ojha over 2 years ago
- Status changed from Need More Info to New
Thanks for providing the logs. Is there a particular PG we should look at in the logs?
Updated by Arthur Outhenin-Chalandre over 2 years ago
I don't have the logs at hand right now, but they print the state of the PG, so if you search for `snaptrim` in `f0208568-fbf4-48fa-a758-0c85ab4a6cdc` you should be able to find the PG that I repeered. I can check tomorrow if necessary.
Updated by Joshua Baergen over 2 years ago
Just wanted to note that we recently encountered what appears to be the same issue on some Luminous (12.2.12) clusters that we're preparing for upgrade. At least two clusters, if not three, have had some OSDs in a destroyed state for many weeks, awaiting disk replacement. This has naturally led to a significant pileup of osdmaps. (Not sure if that's relevant to this problem.) In each case, restarting certain OSDs (we're not sure which ones - we went host-at-a-time and not all hosts triggered the problem) caused all PGs to go into snaptrim_wait. snaptrimq_len was as high as 1 million.
We carefully cleared this issue (took several days), and in one of the clusters another OSD restart caused the problem to happen again.
Given that this is on Luminous, we weren't going to report this issue, but given the similarity to the issue in this ticket I thought it worth mentioning.
Updated by David Prude over 2 years ago
We are also seeing this issue on 16.2.5. We schedule CephFS snapshots via cron in a 24h/7d/2w rotation schedule. Over several months we observed unexplained storage consumption that kept growing until we eventually received the following error:
[WRN] PG_SLOW_SNAP_TRIMMING: snap trim queue for 1 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
We examined our system and found roughly 499 PGs with queues over 32768 and a total cumulative snaptrim queue length across all PGs of over 12 Million. We took the following actions:
1. Stopped the snapshot schedule
2. Systematically repeered all PGs which had non-zero snaptrim queue lengths.
3. Verified that no PGs listed anything remaining in their snaptrim queues.
4. Re-enabled snapshots.
The above resulted in reclaiming > 94T of raw capacity.
We have now been running our snapshot schedule for 2 days and currently see 109 PGs with non-zero snaptrim queues totaling 33,806 items in the snaptrim queues across those PGs.
As best we can tell, snapshot trimming is not occurring unless we manually repeer the associated PGs.
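The systematic repeer in step 2 above can be generated mechanically from a pg dump. A hedged sketch (the `pg_stats`/`snaptrimq_len` field names are assumptions about the `ceph pg dump pgs -f json` output on a given release, and any emitted commands should be reviewed before running them against a cluster):

```python
import json

def repeer_commands(pg_dump_json: str) -> list[str]:
    """Emit one `ceph pg repeer <pgid>` command per PG whose snap trim
    queue is non-empty (field names assumed; verify for your release)."""
    pgs = json.loads(pg_dump_json)["pg_stats"]
    return [f"ceph pg repeer {pg['pgid']}"
            for pg in pgs
            if pg.get("snaptrimq_len", 0) > 0]

# Illustrative sample only, not real cluster output:
sample = json.dumps({"pg_stats": [
    {"pgid": "3.1e", "snaptrimq_len": 14048},
    {"pgid": "3.17", "snaptrimq_len": 0},
]})
for cmd in repeer_commands(sample):
    print(cmd)  # ceph pg repeer 3.1e
```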
We are more than happy to provide any information that would prove useful.
Updated by Christopher Hoffman over 2 years ago
Can you collect and share OSD logs (with debug_osd=20 and debug_ms=1) when you are encountering this issue?
Updated by David Prude over 2 years ago
Christopher Hoffman wrote:
Can you collect and share OSD logs (with debug_osd=20 and debug_ms=1) when you are encountering this issue?
I'd be more than happy to, but could you please provide me the following guidance:
1. Do you want logs from ALL OSDs or would a single OSD suffice?
2. You've asked for logs "when I'm encountering this issue". I believe snapshot trimming is simply not occurring automatically: we are continuing to rotate snapshots and the queue lengths keep growing. For what time period would you like logs collected? Any time, since you would expect the logs to continuously show what you are looking for, or immediately following a snapshot rotation? Any clarification you could provide would be helpful.
Updated by Christopher Hoffman over 2 years ago
David Prude wrote:
Christopher Hoffman wrote:
Can you collect and share OSD logs (with debug_osd=20 and debug_ms=1) when you are encountering this issue?
I'd be more than happy to, but could you please provide me the following guidance:
1. Do you want logs from ALL OSDs or would a single OSD suffice?
2. You've asked for logs "when I'm encountering this issue". I believe snapshot trimming is simply not occurring automatically: we are continuing to rotate snapshots and the queue lengths keep growing. For what time period would you like logs collected? Any time, since you would expect the logs to continuously show what you are looking for, or immediately following a snapshot rotation? Any clarification you could provide would be helpful.
1. A single OSD will be fine, just ensure it is one exhibiting the issue.
2. Can you collect the logs from just before the repeer until about 12 hours afterwards (adjust the window if needed to capture logs during a snapshot trim)?
Updated by David Prude over 2 years ago
Christopher Hoffman wrote:
1. A single OSD will be fine, just ensure it is one exhibiting the issue.
2. Can you collect the logs from just before the repeer until about 12 hours afterwards (adjust the window if needed to capture logs during a snapshot trim)?
Just to clarify: you want a section of log from just before I manually repeer? As far as I can tell, no snapshot trimming occurs except when I manually repeer, so by "logs during snapshot trim" you mean while trimming is occurring as a result of a manual repeer? If so, I can:
1. collect logs prior to repeer
2. manually repeer
3. collect logs as the snapshot trimming initiated by the repeer is progressing.
How large of a time window would you like for each of the above?
Updated by Christopher Hoffman over 2 years ago
Yes, that is the case.
Can you collect the log starting at the manual repeer? The intent is to capture the logs starting at that point, and to keep collecting while the snaptrim is underway, for 12 hours.
Updated by David Prude over 2 years ago
I will collect these logs as you've requested. As an update: I am now seeing snaptrim occurring automatically without intervention. This was not the case before exhaustively manually re-peering as described.
Updated by David Prude about 2 years ago
Fortunately (or perhaps not so fortunately), in the process of dealing with this issue we performed a full restart of all OSDs. Since this restart, this behavior has resolved. We now see regular snap trimming and it would seem we have not re-entered this state.
Updated by Radoslaw Zarzynski about 2 years ago
- Related to Bug #54396: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue added
Updated by Jack Y about 2 years ago
We are having the same issue with Ceph 15.2.13. We take RBD snapshots that get deleted after 3 days.
The problem gets so bad that the cluster falls behind with deep scrub and scrub.
We have to restart OSDs individually for snaptrim and scrubbing to resume.
The restarts have so far freed about 111 TB of space on our cluster.
Updated by Radoslaw Zarzynski about 2 years ago
Could you please collect the logs exactly like mentioned in the comment #14 (https://tracker.ceph.com/issues/52026#note-14)?
Updated by Josh Durgin about 2 years ago
- Backport set to pacific, quincy
Good news, the root cause was identifiable from logs in a test environment with snapshot-based mirroring enabled.
The issue (at least in pacific) is that a pg exiting scrub is supposed to restart snap trim, but it does not. This happens due to the way snap trim checks for scrub being in progress - when snap trim is trying to requeue, scrub is still 'queued_or_active', and thus the snap trimmer goes back to the NotTrimming state, and only starts trimming again with external input (e.g. restarting the OSD or at least peering).
Relevant code is https://github.com/ceph/ceph/blob/v16.2.7/src/osd/PrimaryLogPG.cc#L15205-L15207 and https://github.com/ceph/ceph/blob/v16.2.7/src/osd/pg_scrubber.cc#L1636-L1639
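The race can be illustrated with a toy simulation (a deliberately simplified model of the behavior described above, not the actual Ceph state machine): scrub wakes the trimmer while its own queued_or_active flag is still set, so the trimmer parks in NotTrimming, and once the flag clears nothing kicks the trimmer again.

```python
class PG:
    """Toy model of the snap-trim / scrub interaction described above."""

    def __init__(self):
        self.scrub_queued_or_active = True    # scrub is still finishing up
        self.trimmer_state = "Trimming/WaitScrub"
        self.snaptrimq_len = 14048

    def kick_trim(self):
        # Mirrors the buggy check: scrub still looks active, so park the
        # trimmer in NotTrimming, where nothing will requeue it.
        if self.scrub_queued_or_active:
            self.trimmer_state = "NotTrimming"
        else:
            self.trim()

    def scrub_finished(self, requeue_trimmer: bool):
        self.kick_trim()                      # scrub wakes the trimmer first...
        self.scrub_queued_or_active = False   # ...then clears its own flag
        if requeue_trimmer:                   # the missing step: kick again
            self.kick_trim()

    def trim(self):
        self.trimmer_state = "Trimming"
        self.snaptrimq_len = 0

# Without a second kick, the queue never drains:
pg = PG()
pg.scrub_finished(requeue_trimmer=False)
print(pg.trimmer_state, pg.snaptrimq_len)    # NotTrimming 14048

# Kicking again after queued_or_active clears lets trimming resume:
pg = PG()
pg.scrub_finished(requeue_trimmer=True)
print(pg.trimmer_state, pg.snaptrimq_len)    # Trimming 0
```

The stuck case corresponds to the pgs that only resumed trimming after an OSD restart or repeer supplied the external kick.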
Details of the investigation follow:
In the secondary cluster, where capacity use is as expected, there is just a single pg with a long snaptrimq, of ~14k snapshots to delete. In the primary cluster, where there is a lot of excess capacity used, 27/32 of the rbd pool pgs have long snaptrim queues (14k snapshots). In both cases, the rest are 0.
Turning up osd debugging, we see only the 5 pgs with no snaptrimq are getting queued for snaptrim:
# zgrep queueing.*snaptrim ceph-osd.0.log-saved.gz > ~/osd.0.snaptrim_queued
# less ~/osd.0.snaptrim_queued
# grep -oE 'pg\[3\.[0-9a-f]+' ~/osd.0.snaptrim_queued | sort | uniq -c
     40 pg[3.17
     40 pg[3.a
     40 pg[3.f
# grep -oE 'pg\[3\.[0-9a-f]+' ~/osd.1.snaptrim_queued | sort | uniq -c
     40 pg[3.10
     40 pg[3.d
Experimentally, I tried disabling the snapshot schedule so no new snapshots would be created. This had no effect on snaptrim - it continued to not run for the 27/32 affected pgs and there was no change in snaptrimq lengths.
Looking at one of the pgs that was not getting scheduled for snaptrim, 3.1e, it became clear what the problem was. When scrubbing completes and tries to wake up the snap trimmer, scrub is still active, so the snap trimmer goes back to waiting for scrub to complete. However, scrub has no confirmation of the snap trimmer state, so it completes and does nothing further to requeue snaptrimming. Thus, the snap trimmer gets stuck in the NotTrimming state until something (like restarting the OSD) wakes it up.
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] scrubber pg(3.1e) scrub finished, requeuing snap_trimmer
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] exit Trimming/WaitScrub
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] enter NotTrimming
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] SnapTrimmer state<NotTrimming>: NotTrimming react KickTrim
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] SnapTrimmer state<NotTrimming>: scrubbing, will requeue snap_trimmer after
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] exit NotTrimming
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] enter Trimming/WaitScrub
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] scrubber pg(3.1e) update_op_mode_text: repair: visible: false, internal: false. Displayed: scrub
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] scrubber pg(3.1e) cleanup_on_finish
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] prepare_stats_for_publish reporting purged_snaps [577~5,590~14,5b8~14,5e0~14,608~14,630~14,658~14,680~14,6a8~14,6d0~14,6f8~14,720~14,748~14,770~14,798~14,7c0~14,7e8~14,810~14,838~14,860~14,888~14,8b0~14,8d8~14,900~14,9>
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] publish_stats_to_osd 18869:1761277
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubber pg(3.1e) clear_scrub_reservations
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] prepare_stats_for_publish reporting purged_snaps [577~5,590~14,5b8~14,5e0~14,608~14,630~14,658~14,680~14,6a8~14,6d0~14,6f8~14,720~14,748~14,770~14,798~14,7c0~14,7e8~14,810~14,838~14,860~14,888~14,8b0~14,8d8~14,900~14,9>
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] publish_stats_to_osd 18869:1761278
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubber pg(3.1e) reset_internal_state
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] PrimaryLog scrubber pg(3.1e) _scrub_clear_state
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] share_pg_info
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubberFSM pg(3.1e) ~ActiveScrubbing
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubber pg(3.1e) unreserve_replicas
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubberFSM pg(3.1e) -- state -->> NotActive
Updated by Josh Durgin about 2 years ago
- Status changed from New to In Progress
- Assignee changed from Christopher Hoffman to Ronen Friedman
- Priority changed from High to Urgent
Updated by Neha Ojha about 2 years ago
- Status changed from In Progress to Fix Under Review
Updated by Neha Ojha about 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 2 years ago
- Copied to Backport #55138: quincy: osd: pgs went back into snaptrim state after osd restart added
Updated by Backport Bot about 2 years ago
- Copied to Backport #55139: pacific: osd: pgs went back into snaptrim state after osd restart added
Updated by Yuri Weinstein about 2 years ago
Updated by Neha Ojha about 2 years ago
- Status changed from Pending Backport to Resolved