Bug #10431

PG can not finish peering due to mismatch between OSD peering queue and PG peering queue

Added by Dong Lei over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
giant,firefly
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

We were debugging a PG stuck in peering. It may be due to a peering event being lost or not handled.

We found that some threads call osd->peering_queue.push_back without holding the osd_lock. This can race with other threads (usually a dispatcher thread) that push_back to peering_queue at the same time.

We found that, at least when handling a FlushedEvt, the thread will push_back to the OSD peering_queue.

Can we add some checks to assert that the thread holds the lock when doing osd->peering_wq.queue(PG*)?
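The kind of checker requested above can be sketched as an assertion inside the queueing call. This is a minimal, hypothetical C++ sketch (OSDQueueSketch, lock()/unlock(), and the locked flag are stand-ins, not the real Ceph Mutex or OSD classes), only illustrating the assert-lock-held pattern:

```cpp
#include <cassert>
#include <list>
#include <mutex>

struct PG {};  // stand-in for the real PG type

// Hypothetical simplified queue owner; real Ceph code uses its own
// Mutex wrapper, this only shows the shape of the requested check.
class OSDQueueSketch {
  std::mutex osd_lock;
  bool locked = false;          // tracks whether osd_lock is held
  std::list<PG*> peering_queue;
public:
  void lock()   { osd_lock.lock(); locked = true; }
  void unlock() { locked = false; osd_lock.unlock(); }
  // queue() asserts the caller already holds osd_lock, catching any
  // unsynchronized push_back like the one described in this report.
  void queue(PG* pg) {
    assert(locked);
    peering_queue.push_back(pg);
  }
  size_t size() const { return peering_queue.size(); }
};
```

A caller that forgets to take the lock before queue() would then fail the assert immediately instead of silently racing.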


Related issues

Related to Ceph - Bug #11134: PGs stuck much longer than needed in Peering or Inactive Duplicate 03/17/2015

Associated revisions

Revision 492ccc90 (diff)
Added by Sage Weil about 5 years ago

osd: requeue PG when we skip handling a peering event

If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.

The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.

Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and bit lots of
people on dumpling.

Backport: giant, firefly
Fixes: #10431
Signed-off-by: Sage Weil <>

Revision 16c023d8 (diff)
Added by Sage Weil about 5 years ago

osd: requeue PG when we skip handling a peering event

If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.

The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.

Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and bit lots of
people on dumpling.

Backport: giant, firefly
Fixes: #10431
Signed-off-by: Sage Weil <>
(cherry picked from commit 492ccc900c3358f36b6b14a207beec071eb06707)

History

#1 Updated by Loic Dachary over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Dong Lei

#2 Updated by Dong Lei over 5 years ago

One more situation was observed.

One op thread was handling a FlushedEvt and calling osd->peering_wq.enqueue, while another op thread was taking jobs from peering_wq and erasing from it.

We caught this by adding code to print the peering_wq size before enqueue, and that itself caused a crash: calling size() and erase() concurrently on a std::list is not thread safe.

Concurrent push_back and erase does not seem to crash, but should still be corrected.
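The race above follows from std::list giving no thread-safety guarantees: every access, including a "read-only" size(), must hold the same mutex when another thread may be mutating the list. A minimal sketch of the corrected pattern (generic producer/consumer, not the actual Ceph peering_wq code):

```cpp
#include <list>
#include <mutex>
#include <thread>

// Shared queue guarded by a single mutex; every access -- push_back,
// erase, empty, size -- must take q_mutex, or the program has a data race.
static std::list<int> q;
static std::mutex q_mutex;

void producer(int n) {
  for (int i = 0; i < n; ++i) {
    std::lock_guard<std::mutex> l(q_mutex);
    q.push_back(i);                 // mutation under the lock
  }
}

void consumer(int n) {
  int taken = 0;
  while (taken < n) {
    std::lock_guard<std::mutex> l(q_mutex);
    if (!q.empty()) {               // even the read holds the lock
      q.erase(q.begin());
      ++taken;
    }
  }
}
```

Run with one thread per function; after both join, the queue is empty and no access ever raced.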

#3 Updated by Dong Lei over 5 years ago

We confirmed that there is a mismatch between the OSD's peering_queue and the PG's peering_queue.

From the log below we can see that there are no PGs (or only 1 PG) in the OSD's peering_queue, but 1 or 2 events in the PG's peering queue. This means the last event in the PG's peering_queue cannot be handled, and it is the event needed to finish peering.

Logs (the size logging was added by us):

2014-12-25 10:02:12.657518 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:Before
append evt:epoch_sent: 103353 epoch_requested: 103353 MInfoRec from 343(1)
info: 3.1bf0s1( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 103353/102515 103352/103352/103060) -- evt_peering_queue
has 1
2014-12-25 10:02:12.657537 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:After append
evt:epoch_sent: 103353 epoch_requested: 103353 MInfoRec from 343(1) info:
3.1bf0s1( v 100677'146529 (90767'143519,100677'146529] local-les=103353 n=84857
ec=165 les/c 103353/102515 103352/103352/103060) -- evt_peering_queue has 2
2014-12-25 10:02:12.657551 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:Before
append pg:3.1bf0 -- osd_peering_queue has 0
2014-12-25 10:02:12.657581 7fe567d8c700 12 OSD::osd_tp worker wq OSD::PeeringWQ
start processing 0x2c33aa00 (1 active)
2014-12-25 10:02:12.657594 7fe564585700 20 OSD::osd_tp worker waiting
2014-12-25 10:02:12.657579 7fe56fd9c700 5 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] queue_peering_event:After append
pg:3.1bf0 -- osd_peering_queue has 0
2014-12-25 10:02:12.657626 7fe56fd9c700 10 osd.460 103353 do_waiters -- start
2014-12-25 10:02:12.657635 7fe56fd9c700 10 osd.460 103353 do_waiters -- finish
2014-12-25 10:02:12.657650 7fe567d8c700 10 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] handle_peering_event:
epoch_sent: 103353 epoch_requested: 103353 MInfoRec from 408(7) info: 3.1bf0s7(
v 100677'146529 (90767'143519,100677'146529] local-les=103353 n=84857 ec=165
les/c 103353/102515 103352/103352/103060)
2014-12-25 10:02:12.657679 7fe567d8c700 10 osd.460 pg_epoch: 103353
pg[3.1bf0s0( v 100677'146529 (90767'143519,100677'146529] local-les=103353
n=84857 ec=165 les/c 102515/102515 103352/103352/103060)
[460,343,387,522,426,99,430,408,369,402,257] r=0 lpr=103352 pi=102403-103351/20
crt=100677'146529 lcod 0'0 mlcod 0'0 inactive] state<Started/Primary/Active>:
peer osd.408(7) activated and committed
2014-12-25 10:02:12.657696 7fe567d8c700 10 log is not dirty
2014-12-25 10:02:12.657719 7fe567d8c700 15 OSD::osd_tp worker wq OSD::PeeringWQ
done processing 0x2c33aa00 (0 active)
2014-12-25 10:02:12.657723 7fe567d8c700 20 OSD::osd_tp worker waiting

#4 Updated by Dong Lei over 5 years ago

Can we change the title to

"PG can not finish peering due to mismatch between OSD peering queue and PG peering queue"

?

#5 Updated by Guang Yang over 5 years ago

  • Subject changed from OSD peering_queue push_back without holding osd_lock to PG can not finish peering due to mismatch between OSD peering queue and PG peering queue

#6 Updated by Dong Lei over 5 years ago

We added a lock for the OSD's peering_queue like the first commit in the PR.
It worked fine at first, but when we ran more tests the same error occurred.

Finally, we found the root cause of the mismatch from the logs. You can see that an op thread took some PGs and started to process their events, but did not pop an event from the PG's peering_queue before putting the (PG, evt) pair back.

Checking the code, this thread takes the !advance_pg branch in process_peering_events and forgets to pop the event:

  if (!advance_pg(curmap->get_epoch(), pg, handle, &rctx, &split_pgs)) {
    // this branch forgets to pop the event from pg->peering_queue
    pg->queue_null(curmap->get_epoch(), curmap->get_epoch());
  } else if (!pg->peering_queue.empty()) {
    PG::CephPeeringEvtRef evt = pg->peering_queue.front();
    pg->peering_queue.pop_front();
    pg->handle_peering_event(evt, &rctx);
  }

The fix is in the second commit in PR.
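Per the associated commit messages, the fix is to requeue the PG when its event is skipped, so the pending event is not stranded until something else queues the PG. A simplified, self-contained model of the bug and the fix (FakePG and process_one are hypothetical stand-ins, not the real Ceph types):

```cpp
#include <deque>
#include <list>

// Each PG carries its own event queue; the OSD-level queue holds PGs
// with pending events.
struct FakePG { std::list<int> peering_queue; };

// Process one entry from the OSD queue. If the worker cannot advance
// the PG (e.g. the map-advance limit was hit), it must requeue the PG;
// otherwise the pending event sits unhandled until the next event
// happens to queue the PG again -- the stall described in this bug.
void process_one(std::deque<FakePG*>& osd_queue, bool can_advance) {
  FakePG* pg = osd_queue.front();
  osd_queue.pop_front();
  if (!can_advance) {
    osd_queue.push_back(pg);        // the fix: requeue instead of dropping
    return;
  }
  if (!pg->peering_queue.empty())
    pg->peering_queue.pop_front();  // handle one pending event
}
```

With the requeue in place, a skipped PG is retried on the next pass and its pending event eventually drains, rather than waiting for an unrelated event to re-queue the PG.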

LOGS:
2014-12-26 07:54:16.257230 7fdf1147b700 5 osd.331 105056 Will process 20 pgs
2014-12-26 07:54:16.259198 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] take_waiters:Before splice evt, evt_peering_queue has 1 -- evt_peering_waiters has 0
2014-12-26 07:54:16.259227 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] take_waiters:After splice evt, evt_peering_queue has 1 -- evt_peering_waiters has 0
2014-12-26 07:54:16.259244 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] exit Reset 12.715638 20 0.000219
2014-12-26 07:54:16.259263 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] enter Started
2014-12-26 07:54:16.259279 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] enter Start
2014-12-26 07:54:16.259294 7fdf1147b700 1 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2014-12-26 07:54:16.259311 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] exit Start 0.000032 0 0.000000
2014-12-26 07:54:16.259330 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] enter Started/Stray
2014-12-26 07:54:16.259361 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] queue_peering_event:Before append evt:epoch_sent: 105056 epoch_requested: 105056 NullEvt -- evt_peering_queue has 1
2014-12-26 07:54:16.259378 7fdf1147b700 5 osd.331 pg_epoch: 105055 pg[3.17b5s9( v 100707'146643 (90795'143641,100707'146643] local-les=105035 n=85055 ec=165 les/c 105035/105036 105034/105034/104604) [375,216,231,403,493,62,428,277,7,331,394] r=9 lpr=105036 pi=165-105033/343 crt=100677'146640 lcod 0'0 inactive NOTIFY] queue_peering_event:After append evt:epoch_sent: 105056 epoch_requested: 105056 NullEvt -- evt_peering_queue has 2

#7 Updated by Dong Lei over 5 years ago

After thinking it over, just popping the event may be careless and risks losing events. I updated the PR to queue the PG back instead, to be consistent.

#8 Updated by Sage Weil about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Priority changed from Normal to Urgent
  • Backport set to giant,firefly

I believe we just found this bug. Can you please look at

https://github.com/ceph/ceph/pull/3326

#9 Updated by Dong Lei about 5 years ago

Hi Sage, sorry for my first commit in the PR, which added an unnecessary lock. But what I did in the second commit is the same as what you did above. https://github.com/boydc2014/ceph/commit/13813bcaa8c77cba4a89e79ba9b77afb083e1af4

#10 Updated by Loic Dachary about 5 years ago

  • Description updated (diff)

#11 Updated by Samuel Just about 5 years ago

  • Status changed from Fix Under Review to Resolved

A fix for this is already merged.
