Bug #63883


min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing.

Added by changzhi tan 4 months ago. Updated 2 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy,reef,squid
Regression:
No
Severity:
1 - critical
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph raised an alarm that the monitor was running out of space. I found that the osdmap epoch range is very large and has not been trimmed.
All OSDs and PGs are healthy and in a normal state.
On further investigation, I found that min_last_epoch_clean never changes, so the osdmap is never trimmed, which is very strange.
Is there some unknown OSD or PG state held in the monitor? Nothing in the code logs this. Does the community have any good ideas? Thanks.

My ceph version is 17.2.5.

  1. ceph report

"osdmap_clean_epochs": {
    "min_last_epoch_clean": 237271
},
"osdmap_first_committed": 237271,
"osdmap_last_committed": 685278,
"osdmap_manifest": {
    "first_pinned": 237271,
    "last_pinned": 684777
}


Related issues (3 open, 0 closed)

Copied to RADOS - Backport #64649: quincy: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. (In Progress, Mykola Golub)
Copied to RADOS - Backport #64650: reef: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. (In Progress, Mykola Golub)
Copied to RADOS - Backport #64651: squid: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. (In Progress, Mykola Golub)
Actions #1

Updated by changzhi tan 4 months ago

This code has no logging, but the minimum epoch should be the one computed here:
epoch_t OSDMonitor::get_min_last_epoch_clean() const {
  auto floor = last_epoch_clean.get_lower_bound(osdmap);
  // also scan osd epochs
  // don't trim past the oldest reported osd epoch
  for (auto [osd, epoch] : osd_epochs) {
    if (epoch < floor) {
      floor = epoch;
    }
  }
  return floor;
}
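
For illustration only, here is a minimal standalone sketch (not the monitor code itself; the map contents and helper name are made up) of how this floor behaves when osd_epochs still contains an entry for an OSD that no longer reports: a single stale epoch pins the result, and the osdmap can never be trimmed past it.

#include <cstdint>
#include <iostream>
#include <map>

using epoch_t = uint32_t;

// Same floor logic as above: start from the PG-derived lower bound and
// never go above the oldest epoch any tracked OSD has reported.
epoch_t min_last_epoch_clean(epoch_t pg_floor,
                             const std::map<int, epoch_t>& osd_epochs) {
  epoch_t floor = pg_floor;
  for (auto [osd, epoch] : osd_epochs) {
    if (epoch < floor) {
      floor = epoch;
    }
  }
  return floor;
}

int main() {
  std::map<int, epoch_t> osd_epochs = {
      {0, 685270},
      {1, 685275},
      {7, 237271},  // stale entry for an OSD that no longer reports
  };
  // Even though PGs are clean up to a recent epoch, the stale entry wins,
  // so the osdmap can never be trimmed past 237271.
  std::cout << min_last_epoch_clean(685260, osd_epochs) << std::endl;  // prints 237271
}

Run as-is, the sketch prints 237271, which matches the pinned min_last_epoch_clean and osdmap_first_committed values in the report above.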

Actions #2

Updated by Neha Ojha 4 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSDMap)
Actions #3

Updated by changzhi tan 4 months ago

Another symptom is that the number of osdmap epochs keeps growing rapidly. Perhaps none of the newly added epochs are ever reported as clean, so min_last_epoch_clean is never updated?

Actions #4

Updated by Matan Breizman 4 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Matan Breizman
  • Backport set to quincy,reef
  • Pull request ID set to 54999

Thank you for the report!
Can you please share the `osd_epochs` section from ceph report?

Actions #5

Updated by changzhi tan 4 months ago

Hi Matan,
Thank you for your reply. Yes, I checked the ceph report: there are some OSDs in the output whose "osd_epochs" entry is 237271.
Some of these OSDs are still in the cluster, while others were deleted a long time ago, which is strange. Restarting those OSDs did not bring the monitor back to normal, so in the end we restarted the leader monitor.
I saw the patch you submitted and hope it solves this problem. If you have any other questions, feel free to reach out; I saved the cluster information from when the problem occurred. Thank you very much.

Actions #6

Updated by changzhi tan 4 months ago

Matan Breizman wrote:

Thank you for the report!
Can you please share the `osd_epochs` section from ceph report?

In the ceph report, some OSDs that were deleted and no longer exist in the cluster still show up.
Is it related to this submission? https://github.com/ceph/ceph/pull/44303

Actions #7

Updated by Matan Breizman 4 months ago

In the ceph report, some OSDs that were deleted and no longer exist in the cluster still show up.

The issue is that when the OSD was marked out, it was erased from "osd_epochs".
However, the OSD may have continued to send beacons and therefore got added back to the map as a stale entry (since it won't be erased again).

Restarting the monitor worked because a new "osd_epochs" map was created, without the removed OSDs.
Alternatively, re-adding and marking the OSD as out would perhaps trigger the erasure from the map again.
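
To make that sequence concrete, here is a hypothetical sketch of the lifecycle described above (the handler names are invented for illustration; this is not the real monitor code):

#include <cstdint>
#include <iostream>
#include <map>

using epoch_t = uint32_t;

std::map<int, epoch_t> osd_epochs;  // osd id -> last clean epoch it reported

// Hypothetical mark-out handler: the entry is erased so it no longer
// holds back trimming.
void on_marked_out(int osd) {
  osd_epochs.erase(osd);
}

// Hypothetical beacon handler: a late beacon from an already-out (or even
// already-removed) OSD silently re-creates the entry.
void on_beacon(int osd, epoch_t reported_epoch) {
  osd_epochs[osd] = reported_epoch;
}

int main() {
  on_beacon(7, 237271);   // osd.7 reports while still in the cluster
  on_marked_out(7);       // osd.7 is marked out; entry erased, trim can advance
  on_beacon(7, 237271);   // a straggling beacon re-adds the stale entry
  // Nothing erases the entry again, so get_min_last_epoch_clean() stays
  // pinned at 237271 until the monitor restarts and rebuilds osd_epochs.
  for (auto [osd, e] : osd_epochs) {
    std::cout << "osd." << osd << " -> " << e << std::endl;
  }
}

This also matches the observed recovery: restarting the leader monitor rebuilds the map without the stale entries, which is why trimming resumed only after the monitor restart.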

Is it related to this submission? https://github.com/ceph/ceph/pull/44303

Not exactly, PR#44303 is (mostly) a cleanup, since the erasure it removed was already redundant.
Thank you again for the tracker.

Actions #8

Updated by changzhi tan 4 months ago

Matan Breizman wrote:

In the ceph report, some OSDs that were deleted and no longer exist in the cluster still show up.

The issue is that when the OSD was marked out, it was erased from "osd_epochs".
However, the OSD may have continued to send beacons and therefore got added back to the map as a stale entry (since it won't be erased again).

Restarting the monitor worked because a new "osd_epochs" map was created, without the removed OSDs.
Alternatively, re-adding and marking the OSD as out would perhaps trigger the erasure from the map again.

Is it related to this submission? https://github.com/ceph/ceph/pull/44303

Not exactly, PR#44303 is (mostly) a cleanup, since the erasure it removed was already redundant.
Thank you again for the tracker.

thanks (:

Actions #9

Updated by Mykola Golub 2 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Backport Bot 2 months ago

  • Copied to Backport #64649: quincy: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. added
Actions #11

Updated by Backport Bot 2 months ago

  • Copied to Backport #64650: reef: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. added
Actions #12

Updated by Backport Bot 2 months ago

  • Tags set to backport_processed
Actions #13

Updated by Mykola Golub 2 months ago

  • Tags deleted (backport_processed)
  • Backport changed from quincy,reef to quincy,reef,squid
Actions #14

Updated by Backport Bot 2 months ago

  • Copied to Backport #64651: squid: min_last_epoch_clean is not updated, causing osdmap to be unable to be trimmed, and monitor db keeps growing. added
Actions #15

Updated by Backport Bot 2 months ago

  • Tags set to backport_processed
Actions #16

Updated by Konstantin Shalygin 2 months ago

  • Affected Versions v16.2.15, v17.2.7, v18.1.2, v18.2.2, v19.1.0 added
Actions #17

Updated by Konstantin Shalygin 2 months ago

  • Affected Versions deleted (v18.1.2)