Bug #40117

PG stuck in WaitActingChange

Added by Samuel Just almost 5 years ago. Updated almost 4 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous,mimic,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

osd.9 requests a pg_temp switch to acting set [5] from [9,8], but the acting change never shows up in a subsequent map. The teuthology test hangs waiting for that PG to go active+clean. Below are the mon output from the receipt of the pg_temp message and the section of the osd.9 log requesting the acting change; a sketch of the hang mechanism follows the logs.

The suite output is sjust-2019-06-02_20:15:42-rados-wip-sjust-crimson-peering-distro-basic-smithi/3999331

2019-06-03 04:59:20.564 7f6db6915700  1 -- v1:172.21.15.83:6789/0 <== osd.9 v1:172.21.15.83:6808/44419 382 ==== osd_pgtemp(e2675 {15.17=[5]} v2675) v2 ==== 52+0+0 (unknown 1235329701 0 0) 0x5582f1b69a00 con 0x5582f487ad00
2019-06-03 04:59:20.564 7f6db6915700 20 mon.c@0(leader) e1 _ms_dispatch existing session 0x5582eb9e3600 for osd.9
2019-06-03 04:59:20.564 7f6db6915700 20 mon.c@0(leader) e1  caps allow *
2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).paxosservice(osdmap 1501..2675) dispatch 0x5582f1b69a00 osd_pgtemp(e2675 {15.17=[5]} v2675) v2 from osd.9 v1:172.21.15.83:6808/44419 con 0x5582f487ad00
2019-06-03 04:59:20.564 7f6db6915700  5 mon.c@0(leader).paxos(paxos active c 6778..7521) is_readable = 1 - now=2019-06-03 04:59:20.569798 lease_expire=2019-06-03 04:59:25.536743 has v0 lc 7521
2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).osd e2675 preprocess_query osd_pgtemp(e2675 {15.17=[5]} v2675) v2 from osd.9 v1:172.21.15.83:6808/44419
2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).osd e2675 preprocess_pgtemp osd_pgtemp(e2675 {15.17=[5]} v2675) v2
2019-06-03 04:59:20.564 7f6db6915700 20 is_capable service=osd command= exec addr v1:172.21.15.83:6808/44419 on cap allow *
2019-06-03 04:59:20.564 7f6db6915700 20  allow so far , doing grant allow *
2019-06-03 04:59:20.564 7f6db6915700 20  allow all
2019-06-03 04:59:20.564 7f6db6915700 20 mon.c@0(leader).osd e2675  15.17[] -> [5]
2019-06-03 04:59:20.564 7f6db6915700  7 mon.c@0(leader).osd e2675 prepare_update osd_pgtemp(e2675 {15.17=[5]} v2675) v2 from osd.9 v1:172.21.15.83:6808/44419
2019-06-03 04:59:20.564 7f6db6915700  7 mon.c@0(leader).osd e2675 prepare_pgtemp e2675 from osd.9 v1:172.21.15.83:6808/44419
2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).osd e2675 should_propose
2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).paxosservice(osdmap 1501..2675)  proposal_timer already set
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxosservice(mgrstat 1751..1961) propose_pending
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).mgrstat  1962
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader) e1 log_health updated 0 previous 0
2019-06-03 04:59:20.624 7f6db911a700  5 mon.c@0(leader).paxos(paxos active c 6778..7521) queue_pending_finisher 0x5582f7b84800
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos active c 6778..7521) trigger_propose active, proposing now
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos active c 6778..7521) propose_pending 7522 10485 bytes
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos updating c 6778..7521) begin for 7522 10485 bytes
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos updating c 6778..7521)  sending begin to mon.1
2019-06-03 04:59:20.624 7f6db911a700  1 -- v1:172.21.15.83:6789/0 send_to--> mon v1:172.21.15.103:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- ?+0 0x5582ec128c00
2019-06-03 04:59:20.624 7f6db911a700  1 -- v1:172.21.15.83:6789/0 --> v1:172.21.15.103:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- 0x5582ec128c00 con 0x5582e9d5bf80
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos updating c 6778..7521)  sending begin to mon.2
2019-06-03 04:59:20.624 7f6db911a700  1 -- v1:172.21.15.83:6789/0 send_to--> mon v1:172.21.15.200:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- ?+0 0x5582eb1a9600
2019-06-03 04:59:20.624 7f6db911a700  1 -- v1:172.21.15.83:6789/0 --> v1:172.21.15.200:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- 0x5582eb1a9600 con 0x5582e9d5c400
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxosservice(logm 3022..3661) propose_pending
2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).log v3661 encode_full log v 3661
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.5 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675)
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.6 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675)
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.8 15.17( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0)
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.9 15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675)
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] calc_replicated_acting newest update on osd.5 with 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675)
2019-06-03 04:59:20.564 7f3b7848b700 20 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_replicated candidates by cost are: 3903,9
2019-06-03 04:59:20.564 7f3b7848b700 20 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_replicated result want=[5] async_recovery=9
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting want [5] != acting [9,8], requesting pg_temp change
2019-06-03 04:59:20.564 7f3b7848b700  5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetLog 0.000134 0 0.000000
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2019-06-03 04:59:20.564 7f3b7848b700  5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 0.003482 0 0.000000
2019-06-03 04:59:20.564 7f3b7848b700  5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started/Primary 0.003498 0 0.000000
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] clear_primary_state
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] release_backoffs [15:e8000000::::0,15:f0000000::::head)
2019-06-03 04:59:20.564 7f3b7848b700 20 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] agent_stop
2019-06-03 04:59:20.564 7f3b7848b700  5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary
2019-06-03 04:59:20.564 7f3b7848b700  5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary/WaitActingChange
2019-06-03 04:59:20.564 7f3b7c493700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] do_peering_event: epoch_sent: 2675 epoch_requested: 2675 MNotifyRec 15.17 from 6 notify: (query:2675 sent:2675 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675)) features: 0x3f01cfb8ffacffff ([2638,2674] intervals=([2638,2674] acting 5,6)) +create_info
2019-06-03 04:59:20.564 7f3b7c493700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Started/Primary/WaitActingChange>: In WaitActingChange, ignoring MNotifyRec
2019-06-03 04:59:20.664 7f3b7848b700 10 osd.9 2675 dequeue_op 0x562b9ac0edc0 prio 63 cost 65536 latency 0.000147 osd_op(client.108494.0:79368 15.17 15.c0b0eb57 (undecoded) ondisk+write+known_if_redirected e2675) v8 pg pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}]
2019-06-03 04:59:20.664 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] can_handle_while_inactive: 0x562b9ac0edc0
2019-06-03 04:59:20.668 7f3b7c493700 10 osd.9 2675 dequeue_op 0x562bb07749a0 prio 63 cost 65536 latency 0.000147 osd_op(client.108494.0:79375 15.17 15.56224037 (undecoded) ondisk+write+known_if_redirected e2675) v8 pg pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}]
2019-06-03 04:59:20.668 7f3b7c493700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] can_handle_while_inactive: 0x562bb07749a0
2019-06-03 04:59:21.540 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [9,8]/[9,8] -- 9/9
2019-06-03 04:59:21.540 7f3b7848b700 10 osd.9 pg_epoch: 2676 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Started/Primary/WaitActingChange>: verifying no want_acting [] targets didn't go down
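
For context, the hang mechanism can be modeled in isolation. The following is a minimal, self-contained C++ sketch, not Ceph code (all type and function names here are invented): choose_acting requests a pg_temp change and enters Started/Primary/WaitActingChange, and that state is only left when an advanced map actually changes the acting set, so a map like e2676 above that leaves acting at [9,8] never releases the PG.

// Minimal model of the WaitActingChange hang; illustrative only, not Ceph code.
#include <iostream>
#include <vector>

struct PGModel {
  std::vector<int> acting;       // current acting set, e.g. {9, 8}
  std::vector<int> want_acting;  // acting set requested via pg_temp
  bool waiting = false;          // models Started/Primary/WaitActingChange

  // choose_acting: if the computed want differs from acting, request a
  // pg_temp change and wait for a map that reflects it.
  void choose_acting(const std::vector<int>& want) {
    if (want != acting) {
      want_acting = want;
      waiting = true;  // enter WaitActingChange
      std::cout << "requesting pg_temp change, entering WaitActingChange\n";
    }
  }

  // handle_advance_map: while waiting, the state is only left when the new
  // map actually changes the acting set (i.e. a new interval starts).
  void handle_advance_map(const std::vector<int>& new_acting) {
    if (!waiting) return;
    if (new_acting == acting) {
      // The stuck case from the log: e2676 arrives with acting still
      // [9,8], so the PG keeps waiting forever.
      std::cout << "map advanced but acting unchanged; still waiting\n";
      return;
    }
    acting = new_acting;
    waiting = false;  // interval change restarts peering
    std::cout << "acting changed; leaving WaitActingChange\n";
  }
};

int main() {
  PGModel pg{{9, 8}};
  pg.choose_acting({5});          // osd.9 wants [5]; requests pg_temp
  pg.handle_advance_map({9, 8});  // e2676: acting unchanged -> stuck
}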

Related issues

Related to RADOS - Bug #41190: osd: pg stuck in waitactingchange when new acting set doesn't change (Resolved)

History

#1 Updated by Samuel Just almost 5 years ago

  • Description updated (diff)

#2 Updated by Samuel Just almost 5 years ago

  • Description updated (diff)

#3 Updated by Samuel Just almost 5 years ago

  • Project changed from Ceph to RADOS

#4 Updated by qiuzhang chen over 4 years ago

PR: https://github.com/ceph/ceph/pull/29669
I've already committed a PR to master.
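
For illustration, the class of guard such a fix typically adds can be sketched as follows (a hypothetical sketch under assumed semantics, not the literal diff in PR 29669; the function and type names are invented): while in WaitActingChange, a map advance that leaves the acting set unchanged, and whose pending pg_temp request can no longer change it, should restart peering rather than keep waiting.

// Hypothetical guard sketch, for illustration; not the literal PR 29669 diff.
#include <cassert>
#include <vector>

enum class Transition { KeepWaiting, RestartPeering };

// While in WaitActingChange: if the advanced map leaves the acting set
// unchanged and the pending pg_temp request could no longer change it,
// restart peering instead of waiting for an interval change that never comes.
Transition on_advance_map(const std::vector<int>& cur_acting,
                          const std::vector<int>& want_acting,
                          const std::vector<int>& new_acting) {
  if (new_acting != cur_acting)
    return Transition::RestartPeering;  // interval changed; normal exit
  if (want_acting.empty() || want_acting == cur_acting)
    return Transition::RestartPeering;  // no-op request; avoid the hang
  return Transition::KeepWaiting;       // request still plausibly pending
}

int main() {
  // The stuck case from this ticket: want_acting cleared to [], acting [9,8].
  assert(on_advance_map({9, 8}, {}, {9, 8}) == Transition::RestartPeering);
}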

#5 Updated by Neha Ojha over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 29669

#6 Updated by Neha Ojha over 4 years ago

  • Backport set to luminous,mimic,nautilus

#7 Updated by Neha Ojha over 4 years ago

  • Related to Bug #41190: osd: pg stuck in waitactingchange when new acting set doesn't change added

#8 Updated by Neha Ojha almost 4 years ago

  • Status changed from Fix Under Review to Duplicate
