Bug #10431

closed

PG can not finish peering due to mismatch between OSD peering queue and PG peering queue

Added by Dong Lei over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Source: other
Backport: giant,firefly
Severity: 3 - minor

Description

We were debugging a problem where a PG was stuck in peering. It may be caused by a peering event being lost or never handled.

We found that some threads call osd->peering_queue.push_back without holding osd_lock. This can cause a race condition when another thread (usually a dispatcher thread) calls push_back on peering_queue at the same time.
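To illustrate the race, here is a minimal C++ sketch. It is not Ceph's code: `PG`, `PeeringQueue`, and `enqueue_concurrently` are hypothetical names, and a local `std::mutex` stands in for `osd_lock`. Two producers, one playing the dispatcher thread and one playing a thread that handles a peering event, enqueue at the same time. Because every `queue()` call takes the lock, no entry is lost; removing the `lock_guard` would turn the two `std::list::push_back` calls into a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a placement group; not Ceph's PG class.
struct PG { int id; };

// Sketch of a peering queue whose producers are serialized by one mutex
// (playing the role of osd_lock in the report).
class PeeringQueue {
 public:
  void queue(PG* pg) {
    std::lock_guard<std::mutex> guard(lock_);  // every producer takes the lock
    queue_.push_back(pg);
  }
  std::size_t size() {
    std::lock_guard<std::mutex> guard(lock_);
    return queue_.size();
  }
 private:
  std::mutex lock_;
  std::list<PG*> queue_;
};

// Runs two concurrent producers and returns how many entries were queued.
std::size_t enqueue_concurrently(int per_thread) {
  PeeringQueue q;
  std::vector<PG> pgs(2 * per_thread);  // sized up front so pointers stay valid
  for (int i = 0; i < 2 * per_thread; ++i) pgs[i].id = i;

  std::thread dispatcher([&] {
    for (int i = 0; i < per_thread; ++i) q.queue(&pgs[i]);
  });
  std::thread event_handler([&] {
    for (int i = per_thread; i < 2 * per_thread; ++i) q.queue(&pgs[i]);
  });
  dispatcher.join();
  event_handler.join();
  return q.size();
}
```

With the lock in place, `enqueue_concurrently(10000)` always yields 20000 queued entries.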

At least one such path exists: when handling a FlushedEvt, the thread pushes onto the OSD's peering_queue.

Can we add checks to ensure that the calling thread holds the lock when calling osd->peering_wq.queue(PG*)?
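One possible shape for such a check, as a hedged sketch rather than a description of Ceph's actual locking classes: a hypothetical `OwnedMutex` wrapper records the owning thread so that `queue()` can assert the caller holds the lock. `std::mutex` has no "held by me" query, which is why the wrapper tracks the owner in a `std::atomic<std::thread::id>`. `OwnedMutex`, `PeeringWQ`, and `PG` are illustrative names, and the `assert` is compiled out when `NDEBUG` is defined.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that remembers which thread holds it.
class OwnedMutex {
 public:
  void lock() {
    m_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void unlock() {
    owner_.store(std::thread::id());  // clear the owner before releasing
    m_.unlock();
  }
  bool is_locked_by_me() const {
    return owner_.load() == std::this_thread::get_id();
  }
 private:
  std::mutex m_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Hypothetical stand-in for a placement group.
struct PG { int id; };

// Work queue whose queue() refuses (in debug builds) to run without the lock.
class PeeringWQ {
 public:
  explicit PeeringWQ(OwnedMutex& lock) : lock_(lock) {}
  void queue(PG* pg) {
    assert(lock_.is_locked_by_me() && "caller must hold the lock");
    queue_.push_back(pg);
  }
  std::size_t size() const { return queue_.size(); }
 private:
  OwnedMutex& lock_;
  std::list<PG*> queue_;
};
```

Because `OwnedMutex` provides `lock()` and `unlock()`, callers can hold it with `std::lock_guard<OwnedMutex>`; a caller that forgets to take it fails the assertion immediately instead of silently racing.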


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #11134: PGs stuck much longer than needed in Peering or Inactive (Duplicate, 03/17/2015)
