
Bug #44419

ops stuck on "wait for new map" for no apparent reason

Added by Nikola Ciprich 28 days ago. Updated 15 days ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
OSDMap
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I'd like to report a problem we've hit on one of our Mimic clusters
(13.2.6). Any manipulation of an OSD (e.g. a restart) causes a lot of
slow ops stuck waiting for a new map. Those seem to be slowed down by the SATA
OSDs, which stay 100% busy reading for a long time until all the ops are gone,
blocking ops on unrelated NVMe pools - the SATA pools are completely unused now.

Is it possible that those maps are being requested from the slow SATA OSDs
and that this is what takes so long for some reason? Why could it take so long?
The cluster is very small and under very light load.

When we restarted one of the nodes, it took literally hours for peering to finish
due to waiting for maps. We've done all possible network checks, as well as hard
drive checks; everything seems to be in order.

We can easily reproduce the problem. I'll soon have a maintenance window, so I'll
try to gather as much debug info as possible.


Related issues

Duplicates RADOS - Bug #37875: osdmaps aren't being cleaned up automatically on healthy cluster New

History

#1 Updated by Nikola Ciprich 20 days ago

While digging deeper, I noticed that when the cluster gets into this
state, osd_map_cache_miss on the OSDs starts growing rapidly. Even when
I increased the osd map cache size to 500 (which was the default at least
for Luminous), it behaves the same.

I think this could be related..
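For anyone wanting to watch the same counter, this is a sketch of the commands involved, assuming the OSD admin socket is at its default path and using osd.0 as an example id:

```shell
# Watch the osdmap cache-miss counter on one OSD (run on the OSD's host).
ceph daemon osd.0 perf dump | grep osd_map_cache_miss

# Bump the cache size at runtime across all OSDs.
ceph tell osd.* injectargs '--osd_map_cache_size 500'

# Make the setting persistent via the config database (Mimic and later).
ceph config set osd osd_map_cache_size 500
```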

#2 Updated by Nikola Ciprich 19 days ago

So I can confirm that, at least in my case, the problem is caused
by old osdmaps not being pruned for some reason and thus not fitting
into the cache. When I increased the osd map cache to 5000, the problem went away.

The question is why they're not being pruned, even though the cluster is in
a healthy state and there are no down OSDs.
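A quick way to check whether old maps are being trimmed is to compare the oldest and newest committed osdmap epochs; on a healthy cluster the gap should stay bounded rather than grow without limit. A sketch, assuming jq is available:

```shell
# Cluster-wide view from the monitors: oldest vs. newest committed osdmap epoch.
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

# Per-OSD view: the range of maps this OSD still stores locally.
ceph daemon osd.0 status | jq '.oldest_map, .newest_map'
```

A large and growing gap between the first and last committed epochs is the symptom tracked in the duplicated bug #37875.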

#3 Updated by Greg Farnum 15 days ago

  • Status changed from New to Duplicate

#4 Updated by Greg Farnum 15 days ago

  • Duplicates Bug #37875: osdmaps aren't being cleaned up automatically on healthy cluster added
