Bug #44419


ops stuck on "wait for new map" for no apparent reason

Added by Nikola Ciprich about 4 years ago. Updated about 4 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
OSDMap
Target version:
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'd like to report a problem we've hit on one of our Mimic clusters
(13.2.6 as well as 13.2.6). Any manipulation of an OSD (e.g. a restart) causes a
lot of slow ops stuck waiting for a new map. Those seem to be slowed down by the SATA
OSDs, which stay 100% busy reading for a long time until all the ops are gone,
blocking ops on unrelated NVMe pools - the SATA pools are completely unused at the moment.

Is it possible that those maps are being requested from the slow SATA OSDs,
and that is why it takes such a long time? Why would it take so long?
The cluster is very small and under very light load.

When we restarted one of the nodes, it literally took hours for peering to finish
because of waiting for maps. We've done all possible network checks, as well as
hard drive checks, and everything seems to be in order.

We can easily reproduce the problem. I'll have a maintenance window soon, so I'll try
to gather as much debug info as possible.
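
For reference, a minimal sketch of how the stuck ops and the SATA read load could be confirmed (osd.0 and /dev/sdb are placeholders, and the commands assume access to the OSD admin sockets):

  # list in-flight ops on one OSD and look for map-related wait states
  ceph daemon osd.0 dump_ops_in_flight | grep -i map

  # watch whether the suspect SATA device is busy with reads while ops are stuck
  iostat -x 1 /dev/sdb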


Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #45400: mon/OSDMonitor: maps not trimmed if osds are down (Resolved, assignee: Joao Eduardo Luis)

Actions #1

Updated by Nikola Ciprich about 4 years ago

While digging deeper, I noticed that when the cluster gets into this
state, osd_map_cache_miss on the OSDs starts growing rapidly. Even when
I increased the osd map cache size to 500 (which was the default at least
for Luminous), it behaves the same.

I think this could be related.
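
For reference, the counter and the setting can be checked and changed roughly like this (osd.0 is a placeholder, jq is assumed to be available, and the config-database form assumes Mimic or later):

  # watch the map cache miss counter on one OSD
  ceph daemon osd.0 perf dump | jq '.osd.osd_map_cache_miss'

  # raise the cache size on a running daemon...
  ceph daemon osd.0 config set osd_map_cache_size 500

  # ...or cluster-wide via the config database
  ceph config set osd osd_map_cache_size 500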

Actions #2

Updated by Nikola Ciprich about 4 years ago

So I can confirm that, at least in my case, the problem is caused
by old osdmaps not being pruned for some reason and thus not fitting
into the cache. When I increased the osd map cache size to 5000, the problem went away.

The question is why they're not being pruned, even though the cluster is in a
healthy state and there are no down OSDs.
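
A rough way to see how many osdmap epochs are being kept around (and hence whether trimming has stopped) is to compare the monitors' committed range with what a single OSD still holds; osd.0 is a placeholder and jq is assumed to be available:

  # range of osdmap epochs the monitors still keep
  ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

  # range a single OSD keeps on disk
  ceph daemon osd.0 status | jq '.oldest_map, .newest_map'

On a healthy cluster the monitor range should normally stay on the order of a few hundred epochs; a gap of many thousands would match the cache-miss behaviour described above.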

Actions #3

Updated by Greg Farnum about 4 years ago

  • Status changed from New to Duplicate
Actions #4

Updated by Greg Farnum about 4 years ago

  • Is duplicate of Bug #37875: osdmaps aren't being cleaned up automatically on healthy cluster added
Actions #5

Updated by Nathan Cutler almost 4 years ago

  • Is duplicate of Bug #45400: mon/OSDMonitor: maps not trimmed if osds are down added
Actions #6

Updated by Nathan Cutler almost 4 years ago

  • Is duplicate of deleted (Bug #37875: osdmaps aren't being cleaned up automatically on healthy cluster)
