Bug #52026
osd: pgs went back into snaptrim state after osd restart
Status: Closed
Description
We are testing snapshots in CephFS. This is a 4-node cluster with only replicated pools. During our tests we did a massive deletion of snapshots, many of whose files were already deleted, which produced a large amount of snaptrim work.
The initial snaptrim after the massive snapshot deletion ran for 10 hours. Some time later, one of our nodes crashed (for an unrelated reason), and when its OSDs came back we saw some pgs switch into the snaptrim/snaptrim_wait state.
We added snaptrimq monitoring (the sum of snaptrimq_len on all the pgs) and realized that the number was high and not decreasing.
The next day we restarted all the OSDs on another box, and again saw pgs going into the snaptrim state.
I attached a screenshot of our Grafana plots. The first spike is the massive snapshot deletion, the second is the crash of one box, and the third is the intentional restart of all the OSDs.
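The snaptrimq monitoring described above can be scripted from `ceph pg dump`. Below is a minimal, hypothetical sketch (not part of the original report) that sums the per-PG queue lengths; it assumes the JSON dump exposes a top-level `pg_stats` array whose entries carry a `snaptrimq_len` field, which may differ across Ceph releases:

```python
import json

def total_snaptrimq(pg_dump_json: str) -> int:
    """Sum snaptrimq_len across all PGs in a `ceph pg dump pgs -f json` blob.

    The `pg_stats` / `snaptrimq_len` field names are assumptions; verify
    against the JSON emitted by your Ceph release before relying on this.
    """
    pgs = json.loads(pg_dump_json)["pg_stats"]
    return sum(pg.get("snaptrimq_len", 0) for pg in pgs)

# Illustrative sample only, not real cluster output:
sample = json.dumps({"pg_stats": [
    {"pgid": "3.1e", "snaptrimq_len": 14048},
    {"pgid": "3.17", "snaptrimq_len": 0},
]})
print(total_snaptrimq(sample))  # 14048
```

Graphing this total over time is what makes the spikes described above visible.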
Updated by Neha Ojha almost 3 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to High
Is it possible for you to share some OSD logs with debug_osd=20 from when this issue happens?
Updated by Arthur Outhenin-Chalandre almost 3 years ago
Here is the log of an OSD that restarted and moved a few pgs into the snaptrim state.
ceph-post-file: 88808267-4ec6-416e-b61c-11da74a4d68e
Updated by Arthur Outhenin-Chalandre almost 3 years ago
I reproduced the issue by doing a `ceph pg repeer` on a pg with a non-zero snaptrimq_len. After the pg has been repeered, it switched into the snaptrim state.
Updated by Arthur Outhenin-Chalandre almost 3 years ago
I searched a bit through the log I sent and I don't see any trace of a pg in the snaptrim state, probably because it was not the primary OSD.
We tested the repeer on a pg with a non-zero snaptrimq_len on our Nautilus production RBD cluster (which uses replicated pools as well). On this cluster the repeer and the snaptrim were simply too fast to be noticeable without looking at the log.
Here is the log of the primary osd (which is much smaller than the previous logs I sent!):
ceph-post-file: f0208568-fbf4-48fa-a758-0c85ab4a6cdc
The repeer makes the snaptrimq_len of the pg go to zero. I am guessing that sometimes the pg doesn't go into the snaptrim state, so snaptrimq_len can grow to a number that becomes noticeable when we do a repeer (which is what happened while I was stress-testing snapshots). However, I don't have any logs showing this (buggy?) state transition.
Updated by Neha Ojha over 2 years ago
- Status changed from Need More Info to New
Thanks for providing the logs. Is there a particular PG we should look at in the logs?
Updated by Arthur Outhenin-Chalandre over 2 years ago
I don't have the logs at hand right now, but they print the state of the PG, so if you search for `snaptrim` in `f0208568-fbf4-48fa-a758-0c85ab4a6cdc` you should be able to find the PG that I repeered. I can check tomorrow if necessary.
Updated by Joshua Baergen over 2 years ago
Just wanted to note that we recently encountered what appears to be the same issue on some Luminous (12.2.12) clusters that we're preparing for upgrade. At least two clusters, if not three, have had some OSDs in a destroyed state for many weeks, awaiting disk replacement. This has naturally led to a significant pileup of osdmaps. (Not sure if that's relevant to this problem.) In each case, restarting certain OSDs (we're not sure which ones - we went host-at-a-time and not all hosts triggered the problem) caused all PGs to go into snaptrim_wait. snaptrimq_len was as high as 1 million.
We carefully cleared this issue (took several days), and in one of the clusters another OSD restart caused the problem to happen again.
Given that this is on Luminous, we weren't going to report this issue, but given the similarity to the issue in this ticket I thought it worth mentioning.
Updated by David Prude over 2 years ago
We are also seeing this issue on 16.2.5. We schedule CephFS snapshots via cron in a 24h/7d/2w rotation schedule. Over several months we observed unexplained storage consumption that kept growing until we eventually received the following error:
[WRN] PG_SLOW_SNAP_TRIMMING: snap trim queue for 1 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
We examined our system and found roughly 499 PGs with queues over 32768 and a total cumulative snaptrim queue length across all PGs of over 12 Million. We took the following actions:
1. Stopped the snapshot schedule
2. Systematically repeered all PGs which had non-zero snaptrim queue lengths.
3. Verified that no PGs listed anything remaining in their snaptrim queues.
4. Re-enabled snapshots.
The above resulted in reclaiming > 94T of raw capacity.
We have now been running our snapshot schedule for 2 days and currently see 109 PGs with non-zero snaptrim queues totaling 33,806 items in the snaptrim queues across those PGs.
As best we can tell, snapshot trimming is not occurring unless we manually repeer the associated PGs.
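The systematic repeer in step 2 above can be generated mechanically from a pg dump. A hedged sketch (the `pg_stats`/`snaptrimq_len` field names are assumptions about the `ceph pg dump pgs -f json` output on a given release, and any emitted commands should be reviewed before running them against a cluster):

```python
import json

def repeer_commands(pg_dump_json: str) -> list[str]:
    """Emit one `ceph pg repeer <pgid>` command per PG whose snap trim
    queue is non-empty (field names assumed; verify for your release)."""
    pgs = json.loads(pg_dump_json)["pg_stats"]
    return [f"ceph pg repeer {pg['pgid']}"
            for pg in pgs
            if pg.get("snaptrimq_len", 0) > 0]

# Illustrative sample only, not real cluster output:
sample = json.dumps({"pg_stats": [
    {"pgid": "3.1e", "snaptrimq_len": 14048},
    {"pgid": "3.17", "snaptrimq_len": 0},
]})
for cmd in repeer_commands(sample):
    print(cmd)  # ceph pg repeer 3.1e
```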
We are more than happy to provide any information that would prove useful.
Updated by Christopher Hoffman over 2 years ago
Can you collect and share OSD logs (with debug_osd=20 and debug_ms=1) when you are encountering this issue?
Updated by David Prude over 2 years ago
Christopher Hoffman wrote:
Can you collect and share OSD logs (with debug_osd=20 and debug_ms=1) when you are encountering this issue?
I'd be more than happy to, but could you please provide me the following guidance:
1. Do you want logs from ALL OSDs or would a single OSD suffice?
2. You've asked for logs "when I'm encountering this issue". I believe snapshot trimming is simply not occurring automatically: we are continuing to rotate snapshots and the queue lengths keep growing. For what time period would you like logs collected? Any time, since you would expect the logs to continuously show what you are looking for, or immediately following a snapshot rotation? Any clarification you could provide would be helpful.
Updated by Christopher Hoffman over 2 years ago
David Prude wrote:
Christopher Hoffman wrote:
Can you collect and share OSD logs (with debug_osd=20 and debug_ms=1) when you are encountering this issue?
I'd be more than happy to, but could you please provide me the following guidance:
1. Do you want logs from ALL OSDs or would a single OSD suffice?
2. You've asked for logs "when I'm encountering this issue". I believe snapshot trimming is simply not occurring automatically: we are continuing to rotate snapshots and the queue lengths keep growing. For what time period would you like logs collected? Any time, since you would expect the logs to continuously show what you are looking for, or immediately following a snapshot rotation? Any clarification you could provide would be helpful.
1. A single OSD will be fine, just ensure it is one exhibiting the issue.
2. Can you collect the logs from just before the repeer until about 12 hours afterwards (adjust the window if needed to capture logs during a snapshot trim)?
Updated by David Prude over 2 years ago
Christopher Hoffman wrote:
1. A single OSD will be fine, just ensure it is one exhibiting the issue.
2. Can you collect the logs from just before the repeer until about 12 hours afterwards (adjust the window if needed to capture logs during a snapshot trim)?
Just to clarify: you want a section of log from just before I manually repeer? As far as I can tell, no snapshot trimming occurs except when I manually repeer, so by "logs during snapshot trim" you mean while trimming is occurring as a result of a manual repeer? If so, I can:
1. collect logs prior to repeer
2. manually repeer
3. collect logs as the snapshot trimming initiated by the repeer is progressing.
How large of a time window would you like for each of the above?
Updated by Christopher Hoffman over 2 years ago
Yes, that is the case.
Can you collect the log starting at the manual repeer? The intent is to capture the logs starting at that point, and to keep collecting while the snaptrim is underway, for 12 hours.
Updated by David Prude over 2 years ago
I will collect these logs as you've requested. As an update: I am now seeing snaptrim occurring automatically without intervention. This was not the case before exhaustively manually re-peering as described.
Updated by David Prude about 2 years ago
Fortunately (or perhaps not so fortunately), in the process of dealing with this issue we performed a full restart of all OSDs. Since this restart, this behavior has resolved. We now see regular snap trimming and it would seem we have not re-entered this state.
Updated by Radoslaw Zarzynski about 2 years ago
- Related to Bug #54396: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue added
Updated by Jack Y about 2 years ago
We are having the same issue with Ceph 15.2.13. We take RBD snapshots that get deleted after 3 days.
The problem gets so bad that the cluster falls behind with deep scrub and scrub.
We have to restart OSDs individually for snaptrim and scrubbing to resume.
The restarts have so far freed about 111 TB of space on our cluster.
Updated by Radoslaw Zarzynski about 2 years ago
Could you please collect the logs exactly like mentioned in the comment #14 (https://tracker.ceph.com/issues/52026#note-14)?
Updated by Josh Durgin about 2 years ago
- Backport set to pacific, quincy
Good news, the root cause was identifiable from logs in a test environment with snapshot-based mirroring enabled.
The issue (at least in pacific) is that a pg exiting scrub is supposed to restart snap trim, but it does not. This happens due to the way snap trim checks for scrub being in progress - when snap trim is trying to requeue, scrub is still 'queued_or_active', and thus the snap trimmer goes back to the NotTrimming state, and only starts trimming again with external input (e.g. restarting the OSD or at least peering).
Relevant code is https://github.com/ceph/ceph/blob/v16.2.7/src/osd/PrimaryLogPG.cc#L15205-L15207 and https://github.com/ceph/ceph/blob/v16.2.7/src/osd/pg_scrubber.cc#L1636-L1639
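The race can be illustrated with a toy simulation (a deliberately simplified model of the behavior described above, not the actual Ceph state machine): scrub wakes the trimmer while its own queued_or_active flag is still set, so the trimmer parks in NotTrimming, and once the flag clears nothing kicks the trimmer again.

```python
class PG:
    """Toy model of the snap-trim / scrub interaction described above."""

    def __init__(self):
        self.scrub_queued_or_active = True    # scrub is still finishing up
        self.trimmer_state = "Trimming/WaitScrub"
        self.snaptrimq_len = 14048

    def kick_trim(self):
        # Mirrors the buggy check: scrub still looks active, so park the
        # trimmer in NotTrimming, where nothing will requeue it.
        if self.scrub_queued_or_active:
            self.trimmer_state = "NotTrimming"
        else:
            self.trim()

    def scrub_finished(self, requeue_trimmer: bool):
        self.kick_trim()                      # scrub wakes the trimmer first...
        self.scrub_queued_or_active = False   # ...then clears its own flag
        if requeue_trimmer:                   # the missing step: kick again
            self.kick_trim()

    def trim(self):
        self.trimmer_state = "Trimming"
        self.snaptrimq_len = 0

# Without a second kick, the queue never drains:
pg = PG()
pg.scrub_finished(requeue_trimmer=False)
print(pg.trimmer_state, pg.snaptrimq_len)    # NotTrimming 14048

# Kicking again after queued_or_active clears lets trimming resume:
pg = PG()
pg.scrub_finished(requeue_trimmer=True)
print(pg.trimmer_state, pg.snaptrimq_len)    # Trimming 0
```

The stuck case corresponds to the pgs that only resumed trimming after an OSD restart or repeer supplied the external kick.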
Details of the investigation follow:
In the secondary cluster, where capacity use is as expected, there is just a single pg with a long snaptrimq, of ~14k snapshots to delete. In the primary cluster, where there is a lot of excess capacity used, 27/32 of the rbd pool pgs have long snaptrim queues (14k snapshots). In both cases, the rest are 0.
Turning up osd debugging, we see only the 5 pgs with no snaptrimq are getting queued for snaptrim:
# zgrep queueing.*snaptrim ceph-osd.0.log-saved.gz > ~/osd.0.snaptrim_queued
# less ~/osd.0.snaptrim_queued
# grep -oE 'pg\[3\.[0-9a-f]+' ~/osd.0.snaptrim_queued | sort | uniq -c
     40 pg[3.17
     40 pg[3.a
     40 pg[3.f
# grep -oE 'pg\[3\.[0-9a-f]+' ~/osd.1.snaptrim_queued | sort | uniq -c
     40 pg[3.10
     40 pg[3.d
Experimentally, I tried disabling the snapshot schedule so no new snapshots would be created. This had no effect on snaptrim - it continued to not run for the 27/32 affected pgs and there was no change in snaptrimq lengths.
Looking at one of the pgs that was not getting scheduled for snaptrim, 3.1e, it became clear what the problem was. When scrubbing completes and tries to wake up the snap trimmer, scrub is still active, so the snap trimmer goes back to waiting for scrub to complete. However, scrub has no confirmation of the snap trimmer state, so it completes and does nothing further to requeue snaptrimming. Thus, the snap trimmer gets stuck in the NotTrimming state until something (like restarting the OSD) wakes it up.
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] scrubber pg(3.1e) scrub finished, requeuing snap_trimmer
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] exit Trimming/WaitScrub
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] enter NotTrimming
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] SnapTrimmer state<NotTrimming>: NotTrimming react KickTrim
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] SnapTrimmer state<NotTrimming>: scrubbing, will requeue snap_trimmer after
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] exit NotTrimming
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] enter Trimming/WaitScrub
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] scrubber pg(3.1e) update_op_mode_text: repair: visible: false, internal: false. Displayed: scrub
031 mlcod 18865'554031 active+clean+scrubbing [ 3.1e: ] trimq=14048 ps=730] scrubber pg(3.1e) cleanup_on_finish
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] prepare_stats_for_publish reporting purged_snaps [577~5,590~14,5b8~14,5e0~14,608~14,630~14,658~14,680~14,6a8~14,6d0~14,6f8~14,720~14,748~14,770~14,798~14,7c0~14,7e8~14,810~14,838~14,860~14,888~14,8b0~14,8d8~14,900~14,9>
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] publish_stats_to_osd 18869:1761277
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubber pg(3.1e) clear_scrub_reservations
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] prepare_stats_for_publish reporting purged_snaps [577~5,590~14,5b8~14,5e0~14,608~14,630~14,658~14,680~14,6a8~14,6d0~14,6f8~14,720~14,748~14,770~14,798~14,7c0~14,7e8~14,810~14,838~14,860~14,888~14,8b0~14,8d8~14,900~14,9>
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] publish_stats_to_osd 18869:1761278
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubber pg(3.1e) reset_internal_state
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] PrimaryLog scrubber pg(3.1e) _scrub_clear_state
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] share_pg_info
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubberFSM pg(3.1e) ~ActiveScrubbing
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubber pg(3.1e) unreserve_replicas
031 mlcod 18865'554031 active+clean trimq=14048 ps=730] scrubberFSM pg(3.1e) -- state -->> NotActive
Updated by Josh Durgin about 2 years ago
- Status changed from New to In Progress
- Assignee changed from Christopher Hoffman to Ronen Friedman
- Priority changed from High to Urgent
Updated by Neha Ojha about 2 years ago
- Status changed from In Progress to Fix Under Review
Updated by Neha Ojha about 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 2 years ago
- Copied to Backport #55138: quincy: osd: pgs went back into snaptrim state after osd restart added
Updated by Backport Bot about 2 years ago
- Copied to Backport #55139: pacific: osd: pgs went back into snaptrim state after osd restart added
Updated by Yuri Weinstein about 2 years ago
Updated by Neha Ojha about 2 years ago
- Status changed from Pending Backport to Resolved