Bug #45670
open
luminous: osd: too many store transactions when osd got an incremental osdmap but failed to encode full with correct crc again and again
Description
Suppose an OSD receives a message containing only one incremental osdmap (epoch 123), but fails to encode the full map with the correct CRC because the osdmap data structure changed between the old and the new code version during an online upgrade.
1. In this case, we should iterate over the osdmap epochs from start=123 to last=123. However, a CRC error occurs, last is changed to 122 (last = e - 1), and we break out of the osdmap epoch loop in handle_osd_map in OSD.cc.
2. The store transaction is nevertheless queued as usual, even though the OSD did not get any correct osdmap.
3. In a large enough cluster, many incremental osdmap messages with CRC errors lead to many store transactions, and therefore to many invocations of the callback _committed_osd_maps.
4. In _committed_osd_maps, consume_map has to iterate over all PGs, so pg_lock is taken again and again.
5. This leads to slow osdmap updates, stuck peering, I/O timeouts, etc.
So store transactions may need to be skipped in this case, since we did not get any correct osdmap at all.
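The control flow described above can be modelled in a few lines. The following is only a simplified sketch, not the actual code in OSD.cc: `HandleResult`, `handle_map_message`, the `encode_ok` callback and the `skip_empty` flag are hypothetical names introduced for illustration, and the guard shown is just one possible way to skip the transaction when no epoch was usable.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

using epoch_t = std::uint32_t;

// Hypothetical stand-in for the outcome of handling one osdmap message.
struct HandleResult {
  epoch_t last;             // last epoch that was actually accepted
  bool queued_transaction;  // whether a store transaction was queued
};

// Simplified model of the epoch loop described in the ticket.
// encode_ok(e) models "re-encoding the full map for epoch e gives the
// expected CRC". Epochs are assumed to be >= 1, so e - 1 cannot wrap.
HandleResult handle_map_message(epoch_t start, epoch_t last,
                                const std::function<bool(epoch_t)>& encode_ok,
                                bool skip_empty) {
  for (epoch_t e = start; e <= last; ++e) {
    if (!encode_ok(e)) {
      last = e - 1;  // drop this epoch and everything after it
      break;
    }
  }
  // Possible guard (assumption): no epoch in [start, last] survived,
  // so there is nothing to commit and no transaction is queued.
  if (skip_empty && last < start) {
    return {last, false};
  }
  // Behaviour described in the ticket: the transaction is queued regardless.
  return {last, true};
}

// Usage sketch: one message carrying only epoch 123, whose CRC check fails.
inline void example() {
  auto always_bad = [](epoch_t) { return false; };
  HandleResult r = handle_map_message(123, 123, always_bad, true);
  assert(r.last == 122 && !r.queued_transaction);
}
```

With `skip_empty` set to false the same call still reports a queued transaction, which corresponds to the behaviour in step 2. A message in which at least one epoch passes the check is unaffected by the guard.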
Updated by Nathan Cutler almost 4 years ago
- Status changed from New to Need More Info
Echoing what Kefu said in the PR: "hi Eugene, we don't merge luminous PRs anymore as it's EOL already. is master or other branches like nautilus, mimic, or octopus also suffering from this issue? if yes, could you retarget to the latest release branch instead? so we can backport the fix."
Updated by Dan van der Ster over 3 years ago
This should be fixed as a side-effect of https://tracker.ceph.com/issues/46443
(fixed wherever that gets backported)