Bug #45670

luminous: osd: too many store transactions when osd got an incremental osdmap but failed encode full with correct crc again and again

Added by Song Jin 4 months ago. Updated about 2 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Suppose an OSD receives a message containing only one incremental osdmap (epoch 123), but fails to encode the full map with the correct CRC because the osdmap data structure changed between the old and new code versions during an online upgrade.
1. In this case we should iterate over osdmap epochs from start=123 to last=123, but a CRC error occurs, last is changed to 122 (last = e - 1), and we exit the osdmap epoch loop in func handle_osd_map in OSD.cc.
2. However, the store transaction is still submitted as usual, even though the OSD did not get any correct osdmap.
3. In a large enough cluster, many incremental osdmap messages with CRC errors lead to many store transactions, and in turn to many invocations of the callback func _committed_osd_maps.
4. In func _committed_osd_maps, consume_map has to iterate over all PGs, so we need to take pg_lock again and again.
5. This leads to slow osdmap updates, stuck peering, I/O timeouts, etc.
So store transactions may need to be skipped in this case, since we did not get any correct osdmap at all.
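
The steps above can be illustrated with a small standalone model. This is only a sketch and not the code from OSD.cc: the names IncMap, Result, process_maps and the flag skip_if_nothing_applied are hypothetical, and the CRC check is reduced to a boolean. It shows how a single failing epoch leaves last = start - 1, and how a guard on that condition could avoid submitting a transaction that carries no usable map.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using epoch_t = std::uint32_t;

// Hypothetical stand-in for one incremental map carried in a message.
struct IncMap {
  epoch_t epoch;
  bool full_crc_ok;  // does the re-encoded full map match the expected CRC?
};

// Outcome of processing one message in this simplified model.
struct Result {
  epoch_t last;            // last epoch that was applied successfully
  bool queue_transaction;  // whether a store transaction would be submitted
};

// Simplified model of the epoch loop described in the report.
// Assumes a non-empty, contiguous list of epochs, all >= 1.
Result process_maps(const std::vector<IncMap>& maps,
                    bool skip_if_nothing_applied) {
  assert(!maps.empty());
  const epoch_t start = maps.front().epoch;
  epoch_t last = maps.back().epoch;

  for (const IncMap& m : maps) {
    if (!m.full_crc_ok) {
      last = m.epoch - 1;  // mirrors "last = e - 1" from the report
      break;               // leave the epoch loop early
    }
  }

  // If last < start, not a single epoch from this message was usable.
  const bool applied_any = last >= start;

  // Reported behaviour: always submit. Proposed behaviour: submit only
  // when at least one correct map was obtained.
  const bool submit = skip_if_nothing_applied ? applied_any : true;
  return Result{last, submit};
}
```

With a message holding only epoch 123 and a failing CRC, the model returns last = 122 in both modes. Without the guard a transaction is still submitted; with the guard it is skipped, so no commit callback is triggered for that message.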

History

#1 Updated by Kefu Chai 4 months ago

  • Pull request ID set to 35218

#2 Updated by Nathan Cutler 4 months ago

  • Status changed from New to Need More Info

Echoing what Kefu said in the PR: "hi Eugene, we don't merge luminous PRs anymore as it's EOL already. is master or other branches like nautilus, mimic, or octopus also suffering from this issue? if yes, could you retarget to the latest release branch instead? so we can backport the fix."

#3 Updated by Dan van der Ster about 2 months ago

This should be fixed as a side-effect of https://tracker.ceph.com/issues/46443
(fixed wherever that gets backported)
