Bug #10431

PG can not finish peering due to mismatch between OSD peering queue and PG peering queue

Added by Dong Lei over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
giant,firefly
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

We were debugging a PG stuck in peering. It may be due to a peering event being lost or not handled.

We found that some threads call osd->peering_queue.push_back without holding the osd_lock. This can race with other threads (usually a dispatcher thread) that push_back to peering_queue at the same time.

We found that, at least when handling a FlushedEvt, the thread will push_back to the OSD peering_queue.

Can we add some checks to assert that the thread holds the lock when doing osd->peering_wq.queue(PG*)?
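The kind of checker requested above can be sketched as an assertion inside the queueing call. This is a minimal, hypothetical C++ sketch (OSDQueueSketch, lock()/unlock(), and the locked flag are stand-ins, not the real Ceph Mutex or OSD classes), only illustrating the assert-lock-held pattern:

```cpp
#include <cassert>
#include <list>
#include <mutex>

struct PG {};  // stand-in for the real PG type

// Hypothetical simplified queue owner; real Ceph code uses its own
// Mutex wrapper, this only shows the shape of the requested check.
class OSDQueueSketch {
  std::mutex osd_lock;
  bool locked = false;          // tracks whether osd_lock is held
  std::list<PG*> peering_queue;
public:
  void lock()   { osd_lock.lock(); locked = true; }
  void unlock() { locked = false; osd_lock.unlock(); }
  // queue() asserts the caller already holds osd_lock, catching any
  // unsynchronized push_back like the one described in this report.
  void queue(PG* pg) {
    assert(locked);
    peering_queue.push_back(pg);
  }
  size_t size() const { return peering_queue.size(); }
};
```

A caller that forgets to take the lock before queue() would then fail the assert immediately instead of silently racing.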


Related issues

Related to Ceph - Bug #11134: PGs stuck much longer than needed in Peering or Inactive Duplicate 03/17/2015

Associated revisions

Revision 492ccc90 (diff)
Added by Sage Weil about 5 years ago

osd: requeue PG when we skip handling a peering event

If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.

The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.

Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and bit lots of
people on dumpling.

Backport: giant, firefly
Fixes: #10431
Signed-off-by: Sage Weil <>

Revision 16c023d8 (diff)
Added by Sage Weil about 5 years ago

osd: requeue PG when we skip handling a peering event

If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.

The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.

Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and bit lots of
people on dumpling.

Backport: giant, firefly
Fixes: #10431
Signed-off-by: Sage Weil <>
(cherry picked from commit 492ccc900c3358f36b6b14a207beec071eb06707)

History

#1 Updated by Loic Dachary over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Dong Lei

#2 Updated by Dong Lei over 5 years ago

One more situation was observed.

One op thread was handling a FlushedEvt and calling osd->peering_wq.enqueue, while another op thread was taking jobs from peering_wq and erasing from it.

We caught this by adding code to print the peering_wq size before enqueue, and that itself caused a crash: calling size() and erase() concurrently on a std::list is not thread safe.

Concurrent push_back and erase does not seem to crash, but should still be corrected.
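The race above follows from std::list giving no thread-safety guarantees: every access, including a "read-only" size(), must hold the same mutex when another thread may be mutating the list. A minimal sketch of the corrected pattern (generic producer/consumer, not the actual Ceph peering_wq code):

```cpp
#include <list>
#include <mutex>
#include <thread>

// Shared queue guarded by a single mutex; every access -- push_back,
// erase, empty, size -- must take q_mutex, or the program has a data race.
static std::list<int> q;
static std::mutex q_mutex;

void producer(int n) {
  for (int i = 0; i < n; ++i) {
    std::lock_guard<std::mutex> l(q_mutex);
    q.push_back(i);                 // mutation under the lock
  }
}

void consumer(int n) {
  int taken = 0;
  while (taken < n) {
    std::lock_guard<std::mutex> l(q_mutex);
    if (!q.empty()) {               // even the read holds the lock
      q.erase(q.begin());
      ++taken;
    }
  }
}
```

Run with one thread per function; after both join, the queue is empty and no access ever raced.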

#3 Updated by Dong Lei over 5 years ago

We confirmed that there is a mismatch between the OSD's peering_queue and the PG's peering_queue.

From the log below we can see that there are no PGs (or only 1 PG) in the OSD's peering_queue, but 1 or 2 events in the PG's peering queue. This means the last event in the PG's peering_queue cannot be handled, and it is the event needed to finish peering.

Logs (the size logging was added by us):

2014-12-25 10:02:12.657518 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:Before
append evt:epoch_sent: 103353 epoch_requested: 103353 MInfoRec from 343(1)
info: 3.1bf0s1( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 103353/102515 103352/103352/103060) -- evt_peering_queue
has 1
2014-12-25 10:02:12.657537 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:After append
evt:epoch_sent: 103353 epoch_requested: 103353 MInfoRec from 343(1) info:
3.1bf0s1( v 100677'146529 (90767'143519,100677'146529] local-les=103353 n=84857
ec=165 les/c 103353/102515 103352/103352/103060) -- evt_peering_queue has 2
2014-12-25 10:02:12.657551 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:Before
append pg:3.1bf0 -- osd_peering_queue has 0
2014-12-25 10:02:12.657581 7fe567d8c700 12 OSD::osd_tp worker wq OSD::PeeringWQ
start processing 0x2c33aa00 (1 active)
2014-12-25 10:02:12.657594 7fe564585700 20 OSD::osd_tp worker waiting
2014-12-25 10:02:12.657579 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:After append
pg:3.1bf0 -- osd_peering_queue has 0
2014-12-25 10:02:12.657626 7fe56fd9c700 10 osd.460 103353 do_waiters -- start
2014-12-25 10:02:12.657635 7fe56fd9c700 10 osd.460 103353 do_waiters -- finish
2014-12-25 10:02:12.657650 7fe567d8c700 10 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] handle_peering_event:
epoch_sent: 103353 epoch_requested: 103353 MInfoRec from 408(7) info: 3.1bf0s7(
v 100677'146529 (90767'143519,100677'146529] local-les=103353 n=84857 ec=165
les/c 103353/102515 103352/103352/103060)
2014-12-25 10:02:12.657679 7fe567d8c700 10 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] state<Started/Primary/Active>:
peer osd.408(7) activated and committed
2014-12-25 10:02:12.657696 7fe567d8c700 10 log is not dirty
2014-12-25 10:02:12.657719 7fe567d8c700 15 OSD::osd_tp worker wq OSD::PeeringWQ
done processing 0x2c33aa00 (0 active)
2014-12-25 10:02:12.657723 7fe567d8c700 20 OSD::osd_tp worker waiting

#4 Updated by Dong Lei over 5 years ago

Can we change the title to

"PG can not finish peering due to mismatch between OSD peering queue and PG peering queue"

?

#5 Updated by Guang Yang over 5 years ago

  • Subject changed from OSD peering_queue push_back without holding osd_lock to PG can not finish peering due to mismatch between OSD peering queue and PG peering queue

#6 Updated by Dong Lei over 5 years ago

We added a lock for the OSD's peering_queue like the first commit in the PR.
It worked fine at first, but when we ran more tests the same error occurred.

Finally, we found the root cause of the mismatch from the logs. You can see that an op thread took some PGs and started to process their events, but did not pop an event from the PG's peering_queue before putting the (PG, evt) pair back.

Checking the code, this thread takes the !advance_pg branch in process_peering_events and forgets to pop the event:

  if (!advance_pg(curmap->get_epoch(), pg, handle, &rctx, &split_pgs)) {
    // this branch forgets to pop the event from pg->peering_queue
    pg->queue_null(curmap->get_epoch(), curmap->get_epoch());
  } else if (!pg->peering_queue.empty()) {
    PG::CephPeeringEvtRef evt = pg->peering_queue.front();
    pg->peering_queue.pop_front();
    pg->handle_peering_event(evt, &rctx);
  }

The fix is in the second commit in PR.
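Per the associated commit messages, the fix is to requeue the PG when its event is skipped, so the pending event is not stranded until something else queues the PG. A simplified, self-contained model of the bug and the fix (FakePG and process_one are hypothetical stand-ins, not the real Ceph types):

```cpp
#include <deque>
#include <list>

// Each PG carries its own event queue; the OSD-level queue holds PGs
// with pending events.
struct FakePG { std::list<int> peering_queue; };

// Process one entry from the OSD queue. If the worker cannot advance
// the PG (e.g. the map-advance limit was hit), it must requeue the PG;
// otherwise the pending event sits unhandled until the next event
// happens to queue the PG again -- the stall described in this bug.
void process_one(std::deque<FakePG*>& osd_queue, bool can_advance) {
  FakePG* pg = osd_queue.front();
  osd_queue.pop_front();
  if (!can_advance) {
    osd_queue.push_back(pg);        // the fix: requeue instead of dropping
    return;
  }
  if (!pg->peering_queue.empty())
    pg->peering_queue.pop_front();  // handle one pending event
}
```

With the requeue in place, a skipped PG is retried on the next pass and its pending event eventually drains, rather than waiting for an unrelated event to re-queue the PG.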

LOGS:
2014-12-26 07:54:16.257230 7fdf1147b700 5 osd.331 105056 Will process 20 pgs
2014-12-26 07:54:16.259198 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] take_waiters:Before splice evt, evt_peering_queue has 1 -- evt_peering_waiters has 0
2014-12-26 07:54:16.259227 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] take_waiters:After splice evt, evt_peering_queue has 1 -- evt_peering_waiters has 0
2014-12-26 07:54:16.259244 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] exit Reset 12.715638 20 0.000219
2014-12-26 07:54:16.259263 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] enter Started
2014-12-26 07:54:16.259279 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] enter Start
2014-12-26 07:54:16.259294 7fdf1147b700 1 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-12-26 07:54:16.259311 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] exit Start 0.000032 0 0.000000
2014-12-26 07:54:16.259330 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] enter Started/Stray
2014-12-26 07:54:16.259361 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] queue_peering_event:Before append evt:epoch_sent: 105056 epoch_requested: 105056 NullEvt -- evt_peering_queue has 1
2014-12-26 07:54:16.259378 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] queue_peering_event:After append evt:epoch_sent: 105056 epoch_requested: 105056 NullEvt -- evt_peering_queue has 2

#7 Updated by Dong Lei over 5 years ago

After thinking it over, just popping the event may be careless and risks losing events. I updated the PR to queue the PG back instead, to be consistent.

#8 Updated by Sage Weil about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Priority changed from Normal to Urgent
  • Backport set to giant,firefly

I believe we just found this bug. Can you please look at

https://github.com/ceph/ceph/pull/3326

#9 Updated by Dong Lei about 5 years ago

Hi Sage, sorry for my first commit in the PR, which added an unnecessary lock. But what I did in the second commit is the same as what you did above. https://github.com/boydc2014/ceph/commit/13813bcaa8c77cba4a89e79ba9b77afb083e1af4

#10 Updated by Loic Dachary about 5 years ago

  • Description updated (diff)

#11 Updated by Samuel Just about 5 years ago

  • Status changed from Fix Under Review to Resolved

A fix for this is already merged.
