Bug #45670: luminous: osd: too many store transactions when osd got an incremental osdmap but failed encode full with correct crc again and again - Ceph - Ceph

Bug #45670

open

luminous: osd: too many store transactions when osd got an incremental osdmap but failed encode full with correct crc again and again

Added by Song Jin almost 4 years ago. Updated over 3 years ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: OSD
Target version: v12.2.14
% Done:
Source:
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 35218
Crash signature (v1):
Crash signature (v2):

Description

Suppose an OSD receives a message containing only one incremental osdmap (epoch 123), but fails to encode the full map with the correct CRC because the osdmap data structure changed between the old and the new code version during an online upgrade.
1. In this case we should iterate over the osdmap epochs from start=123 to last=123, but a CRC error occurs, last is changed to 122 (last = e - 1), and we break out of the osdmap epoch loop in handle_osd_map in OSD.cc.
2. However, the store transaction is still queued as usual, even though the OSD did not obtain any correct osdmap.
3. In a sufficiently large cluster, many incremental osdmap messages that hit the CRC error lead to many store transactions, and therefore to many invocations of the callback _committed_osd_maps.
4. In _committed_osd_maps, consume_map has to iterate over all PGs, so pg_lock is taken again and again.
5. This leads to slow osdmap updates, stuck peering, I/O timeouts, etc.
So the store transaction may need to be skipped in this case, since we did not get any correct osdmap at all.
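
The steps above can be sketched with a small toy model. This is not Ceph code: the class `ToyOSD`, its counters, the `skip_empty_transactions` flag, and the `simulate` helper are all hypothetical names introduced for illustration, and the model only counts queued transactions and PG lock acquisitions under the assumption that every committed transaction triggers one pass over all PGs.

```python
from dataclasses import dataclass


@dataclass
class ToyOSD:
    """Toy stand-in for an OSD (hypothetical, not the real OSD class)."""
    num_pgs: int
    skip_empty_transactions: bool  # assumed fix: skip when no map was applied
    current_epoch: int = 122
    transactions_queued: int = 0
    pg_lock_acquisitions: int = 0

    def handle_osd_map(self, first: int, last: int, crc_ok: bool) -> None:
        start = first
        for e in range(start, last + 1):
            if not crc_ok:
                # Re-encoded full map does not match the expected CRC:
                # drop this epoch and everything after it.
                last = e - 1
                break
        if self.skip_empty_transactions and last < start:
            # No usable epoch was obtained, so there is nothing to commit.
            return
        # Without the guard, a transaction is queued even when last < start.
        self.transactions_queued += 1
        self._committed_osd_maps(start, last)

    def _committed_osd_maps(self, first: int, last: int) -> None:
        if last >= first:
            self.current_epoch = last
        # Modelled cost of consume_map: one lock acquisition per PG.
        self.pg_lock_acquisitions += self.num_pgs


def simulate(messages: int, num_pgs: int, skip: bool) -> ToyOSD:
    """Feed `messages` incremental maps for epoch 123 that all fail the CRC check."""
    osd = ToyOSD(num_pgs=num_pgs, skip_empty_transactions=skip)
    for _ in range(messages):
        osd.handle_osd_map(first=123, last=123, crc_ok=False)
    return osd
```

In this model, `simulate(1000, 200, skip=False)` queues 1000 transactions and takes 200000 PG locks while the epoch stays at 122; with `skip=True` both counters stay at zero. Whether the real callback does exactly this much work per commit is an assumption of the sketch.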

Updated by Kefu Chai almost 4 years ago

Pull request ID set to 35218

Updated by Nathan Cutler almost 4 years ago

Status changed from New to Need More Info

Echoing what Kefu said in the PR: "Hi Eugene, we don't merge luminous PRs anymore as it's EOL already. Is master, or other branches like nautilus, mimic, or octopus, also suffering from this issue? If yes, could you retarget to the latest release branch instead, so we can backport the fix?"

Updated by Dan van der Ster over 3 years ago

This should be fixed as a side effect of https://tracker.ceph.com/issues/46443
(fixed wherever that gets backported).
