Bug #63883
min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing.
Description
Ceph raised an alarm that the mon has no space. I found that the osdmap epoch range is very large and has not been trimmed.
All osds and pgs are in a healthy, normal state.
On further investigation, I found that min_last_epoch_clean has not changed, so no trimming happens, which is very strange.
Is there some unknown osd or pg information held in the monitor?
Nothing relevant is printed in the code. Does the community have any good ideas? Thanks.
My ceph version is 17.2.5.
- ceph report
"osdmap_clean_epochs": {
    "min_last_epoch_clean": 237271,
"osdmap_first_committed": 237271,
"osdmap_last_committed": 685278,
"osdmap_manifest": {
    "first_pinned": 237271,
    "last_pinned": 684777,
Updated by changzhi tan 4 months ago
There is no logging in this code, but the min epoch should be the one computed here:
epoch_t OSDMonitor::get_min_last_epoch_clean() const
{
  auto floor = last_epoch_clean.get_lower_bound(osdmap);
  // also scan osd epochs
  // don't trim past the oldest reported osd epoch
  for (auto [osd, epoch] : osd_epochs) {
    if (epoch < floor) {
      floor = epoch;
    }
  }
  return floor;
}
Updated by changzhi tan 4 months ago
Another symptom is that the number of epochs in the osdmap keeps growing rapidly. Perhaps none of the newly added epochs are ever considered clean, so min_last_epoch_clean is never updated?
Updated by Matan Breizman 4 months ago
- Status changed from New to Fix Under Review
- Assignee set to Matan Breizman
- Backport set to quincy,reef
- Pull request ID set to 54999
Thank you for the report!
Can you please share the `osd_epochs` section from ceph report?
Updated by changzhi tan 4 months ago
hi Matan,
Thank you for your reply. Yes, I checked the ceph report: some osds appear in the output with an "osd_epochs" value of 237271.
Some of these OSDs are in the current cluster, and some were deleted a long time ago, which is strange. Restarting these osds did not bring the ceph monitor back to normal, so we finally restarted the ceph leader monitor.
I saw the patch you submitted and I hope it can solve this problem. If you have any other questions, feel free to communicate. I have recorded the cluster information when the exception occurred. Thank you very much.
Updated by changzhi tan 4 months ago
Matan Breizman wrote:
Thank you for the report!
Can you please share the `osd_epochs` section from ceph report?
In the ceph report, OSDs that were deleted and no longer exist in the cluster still show up in the output.
Is it related to this submission? https://github.com/ceph/ceph/pull/44303
Updated by Matan Breizman 4 months ago
In the ceph report, OSDs that were deleted and no longer exist in the cluster still show up in the output.
The issue is that when the OSD was marked as out it was erased from "osd_epochs".
However, the OSD may have continued to send beacons and therefore get added back to the map as a stale OSD (Since it won't be erased again).
Restarting the monitor worked because a new "osd_epochs" map was created, without the removed OSDs.
Alternatively, re-adding and marking the OSD as out would perhaps trigger the erasure from the map again.
Is it related to this submission? https://github.com/ceph/ceph/pull/44303
Not exactly, PR#44303 is (mostly) a cleanup, since the erasure which was removed there is redundant.
Thank you again for the tracker.
Updated by changzhi tan 4 months ago
Matan Breizman wrote:
In the ceph report, OSDs that were deleted and no longer exist in the cluster still show up in the output.
The issue is that when the OSD was marked as out it was erased from "osd_epochs".
However, the OSD may have continued to send beacons and therefore get added back to the map as a stale OSD (since it won't be erased again).
Restarting the monitor worked because a new "osd_epochs" map was created, without the removed OSDs.
Alternatively, re-adding and marking the OSD as out would perhaps trigger the erasure from the map again.
Is it related to this submission? https://github.com/ceph/ceph/pull/44303
Not exactly, PR#44303 is (mostly) a cleanup, since the erasure which was removed there is redundant.
Thank you again for the tracker.
thanks (:
Updated by Mykola Golub 2 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 2 months ago
- Copied to Backport #64649: quincy: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. added
Updated by Backport Bot 2 months ago
- Copied to Backport #64650: reef: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. added
Updated by Mykola Golub 2 months ago
- Tags deleted (backport_processed)
- Backport changed from quincy,reef to quincy,reef,squid
Updated by Backport Bot 2 months ago
- Copied to Backport #64651: squid: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. added
Updated by Konstantin Shalygin 2 months ago
- Affected Versions v16.2.15, v17.2.7, v18.1.2, v18.2.2, v19.1.0 added
Updated by Konstantin Shalygin 2 months ago
- Affected Versions deleted (v18.1.2)