Project

General

Profile

Actions

Bug #26891

open

backfill reservation deadlock/stall

Added by Sage Weil over 5 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

on backfill target:
- get backfill request, queue RequestBackfillPrio

2018-08-09 08:23:50.061 7f4bcfc7d700  1 -- 172.21.15.29:6805/68932 <== osd.6 172.21.15.29:6809/32617 1307 ==== MBackfillReserve(3.1 REQUEST prio: 100 e655/655) v4 ==== 30+0+0 (721037294 0 0) 0x562985bf1600 con 0x5629820a7500
2018-08-09 08:23:50.061 7f4bcfc7d700 15 osd.4 655 enqueue_peering_evt 3.1 epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100
2018-08-09 08:23:50.061 7f4bcfc7d700 20 osd.4 op_wq(1) _enqueue OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100) prio 255 cost 10 e655)

- backfill finishes, sends pg_backfill(finish ...)
2018-08-09 08:23:50.061 7f4bb10e2700 10 osd.4 655 dequeue_op 0x56298189e480 prio 127 cost 0 latency 0.000078 pg_backfill(finish 3.1 e 655/655 lb MAX) v3 pg pg[3.1( v 647'5475 (597'5175,647'5475] lb 3:b8dd2f6d:::smithi14638918-457 ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo:head (bitwise) local-lis/les=653/654 n=44 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}]
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 655 share_map osd.6 172.21.15.29:6809/32617 655
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 655 should_share_map osd.6 172.21.15.29:6809/32617 655
2018-08-09 08:23:50.061 7f4bb10e2700 10 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] lb 3:b8dd2f6d:::smithi14638918-457 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head (bitwise) local-lis/les=653/654 n=44 ec=19/
19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] _handle_message: 0x56298189e480
2018-08-09 08:23:50.061 7f4bb10e2700 10 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] lb 3:b8dd2f6d:::smithi14638918-457 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head (bitwise) local-lis/les=653/654 n=44 ec=19/
19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] do_backfill pg_backfill(finish 3.1 e 655/655 lb MAX) v3
2018-08-09 08:23:50.061 7f4bb10e2700  1 -- 172.21.15.29:6805/68932 --> 172.21.15.29:6809/32617 -- pg_backfill(finish_ack 3.1 e 655/655 lb MIN) v3 -- 0x562984af0f00 con 0

- process resv cancel (from finish??)
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process 3.1 to_process <> waiting <> waiting_peering {}
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RemoteReservationCanceled) prio 255 cost 10 e655) queued
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process 3.1 to_process <OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RemoteReservationCanceled) prio 255 cost 10 e655)> waiting <> waiting_peering {}
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RemoteReservationCanceled) prio 255 cost 10 e655) pg 0x562984cc3000
2018-08-09 08:23:50.061 7f4bb10e2700 10 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] do_peering_event: epoch_sent: 655 epoch_requested: 655 RemoteReservationCanceled
2018-08-09 08:23:50.061 7f4bb10e2700  5 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] exit Started/ReplicaActive/RepRecovering 0.913845 4 0.000087
2018-08-09 08:23:50.061 7f4bb10e2700  5 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] enter Started/ReplicaActive/RepNotRecovering

- process reservation request
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100) prio 255 cost 10 e655) queued
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process 3.1 to_process <OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100) prio 255 cost 10 e655)> waiting <> waiting_peering {}
2018-08-09 08:23:50.061 7f4bb10e2700 20 osd.4 op_wq(1) _process OpQueueItem(3.1 PGPeeringEvent(epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100) prio 255 cost 10 e655) pg 0x562984cc3000
2018-08-09 08:23:50.061 7f4bb80f0700 15 osd.4 655 enqueue_peering_evt 2.3 epoch_sent: 654 epoch_requested: 654 RemoteBackfillReserved
2018-08-09 08:23:50.061 7f4bb80f0700 20 osd.4 op_wq(3) _enqueue OpQueueItem(2.3 PGPeeringEvent(epoch_sent: 654 epoch_requested: 654 RemoteBackfillReserved) prio 255 cost 10 e654)
2018-08-09 08:23:50.061 7f4bb10e2700 10 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] do_peering_event: epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100
2018-08-09 08:23:50.061 7f4bb10e2700  5 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.000023 1 0.000023
2018-08-09 08:23:50.061 7f4bb10e2700  5 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved

- process RecoveryDone
2018-08-09 08:23:50.061 7f4bb10e2700 10 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] do_peering_event: epoch_sent: 655 epoch_requested: 655 RequestBackfillPrio: priority 100
2018-08-09 08:23:50.061 7f4bb10e2700  5 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.000023 1 0.000023
2018-08-09 08:23:50.061 7f4bb10e2700  5 osd.4 pg_epoch: 655 pg[3.1( v 647'5475 (597'5175,647'5475] local-lis/les=653/654 n=49 ec=19/19 lis/c 653/555 les/c/f 654/556/0 652/653/433) [6,2,4]/[6,2,3] r=-1 lpr=653 pi=[555,653)/1 luod=0'0 crt=647'5475 active mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved

...and somehow, through all of that, don't send a response to the backfill resv request

/a/sage-2018-08-08_17:50:08-rados-wip-sage2-testing-2018-08-08-0724-distro-basic-smithi/2882603

Actions #1

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions

Also available in: Atom PDF