Bug #40117
PG stuck in WaitActingChange
Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
luminous,mimic,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
osd.9 requests a switch to acting set=[5] from [9,5] which never shows up. The teuthology test hangs waiting for that pg to go active+clean. Below is the mon output from the receipt of the pg_temp message as well as the section from osd.9 requesting the acting change:
The suite output is sjust-2019-06-02_20:15:42-rados-wip-sjust-crimson-peering-distro-basic-smithi/3999331
2019-06-03 04:59:20.564 7f6db6915700 1 -- v1:172.21.15.83:6789/0 <== osd.9 v1:172.21.15.83:6808/44419 382 ==== osd_pgtemp(e2675 {15.17=[5]} v2675) v2 ==== 52+0+0 (unknown 1235329701 0 0) 0x5582f1b69a00 con 0x5582f487ad00 2019-06-03 04:59:20.564 7f6db6915700 20 mon.c@0(leader) e1 _ms_dispatch existing session 0x5582eb9e3600 for osd.9 2019-06-03 04:59:20.564 7f6db6915700 20 mon.c@0(leader) e1 caps allow * 2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).paxosservice(osdmap 1501..2675) dispatch 0x5582f1b69a00 osd_pgtemp(e2675 {15.17=[5]} v2675) v2 from osd.9 v1:172.21.15.83:6808/44419 con 0x5582f487ad00 2019-06-03 04:59:20.564 7f6db6915700 5 mon.c@0(leader).paxos(paxos active c 6778..7521) is_readable = 1 - now=2019-06-03 04:59:20.569798 lease_expire=2019-06-03 04:59:25.536743 has v0 lc 7521 2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).osd e2675 preprocess_query osd_pgtemp(e2675 {15.17=[5]} v2675) v2 from osd.9 v1:172.21.15.83:6808/44419 2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).osd e2675 preprocess_pgtemp osd_pgtemp(e2675 {15.17=[5]} v2675) v2 2019-06-03 04:59:20.564 7f6db6915700 20 is_capable service=osd command= exec addr v1:172.21.15.83:6808/44419 on cap allow * 2019-06-03 04:59:20.564 7f6db6915700 20 allow so far , doing grant allow * 2019-06-03 04:59:20.564 7f6db6915700 20 allow all 2019-06-03 04:59:20.564 7f6db6915700 20 mon.c@0(leader).osd e2675 15.17[] -> [5] 2019-06-03 04:59:20.564 7f6db6915700 7 mon.c@0(leader).osd e2675 prepare_update osd_pgtemp(e2675 {15.17=[5]} v2675) v2 from osd.9 v1:172.21.15.83:6808/44419 2019-06-03 04:59:20.564 7f6db6915700 7 mon.c@0(leader).osd e2675 prepare_pgtemp e2675 from osd.9 v1:172.21.15.83:6808/44419 2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).osd e2675 should_propose 2019-06-03 04:59:20.564 7f6db6915700 10 mon.c@0(leader).paxosservice(osdmap 1501..2675) proposal_timer already set 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxosservice(mgrstat 1751..1961) propose_pending 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).mgrstat 1962 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader) e1 log_health updated 0 previous 0 2019-06-03 04:59:20.624 7f6db911a700 5 mon.c@0(leader).paxos(paxos active c 6778..7521) queue_pending_finisher 0x5582f7b84800 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos active c 6778..7521) trigger_propose active, proposing now 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos active c 6778..7521) propose_pending 7522 10485 bytes 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos updating c 6778..7521) begin for 7522 10485 bytes 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos updating c 6778..7521) sending begin to mon.1 2019-06-03 04:59:20.624 7f6db911a700 1 -- v1:172.21.15.83:6789/0 send_to--> mon v1:172.21.15.103:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- ?+0 0x5582ec128c00 2019-06-03 04:59:20.624 7f6db911a700 1 -- v1:172.21.15.83:6789/0 --> v1:172.21.15.103:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- 0x5582ec128c00 con 0x5582e9d5bf80 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxos(paxos updating c 6778..7521) sending begin to mon.2 2019-06-03 04:59:20.624 7f6db911a700 1 -- v1:172.21.15.83:6789/0 send_to--> mon v1:172.21.15.200:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- ?+0 0x5582eb1a9600 2019-06-03 04:59:20.624 7f6db911a700 1 -- v1:172.21.15.83:6789/0 --> v1:172.21.15.200:6789/0 -- paxos(begin lc 7521 fc 0 pn 500 opn 0) v4 -- 0x5582eb1a9600 con 0x5582e9d5c400 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).paxosservice(logm 3022..3661) propose_pending 2019-06-03 04:59:20.624 7f6db911a700 10 mon.c@0(leader).log v3661 encode_full log v 3661
2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.5 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.6 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.8 15.17( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.9 15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] calc_replicated_acting newest update on osd.5 with 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) 2019-06-03 04:59:20.564 7f3b7848b700 20 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_replicated candidates by cost are: 3903,9 2019-06-03 04:59:20.564 7f3b7848b700 20 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_replicated result want=[5] async_recovery=9 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting want [5] != acting [9,8], requesting pg_temp change 2019-06-03 04:59:20.564 7f3b7848b700 5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetLog 0.000134 0 0.000000 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering 2019-06-03 04:59:20.564 7f3b7848b700 5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 0.003482 0 0.000000 2019-06-03 04:59:20.564 7f3b7848b700 5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started/Primary 0.003498 0 0.000000 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] clear_primary_state 2019-06-03 04:59:20.564 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] release_backoffs [15:e8000000::::0,15:f0000000::::head) 2019-06-03 04:59:20.564 7f3b7848b700 20 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] agent_stop 2019-06-03 04:59:20.564 7f3b7848b700 5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary 2019-06-03 04:59:20.564 7f3b7848b700 5 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary/WaitActingChange 2019-06-03 04:59:20.564 7f3b7c493700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] do_peering_event: epoch_sent: 2675 epoch_requested: 2675 MNotifyRec 15.17 from 6 notify: (query:2675 sent:2675 15.17( v 2674'3903 (0'0,2674'3903] local-lis/les=2638/2639 n=2508 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675)) features: 0x3f01cfb8ffacffff ([2638,2674] intervals=([2638,2674] acting 5,6)) +create_info 2019-06-03 04:59:20.564 7f3b7c493700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Started/Primary/WaitActingChange>: In WaitActingChange, ignoring MNotifyRec 2019-06-03 04:59:20.664 7f3b7848b700 10 osd.9 2675 dequeue_op 0x562b9ac0edc0 prio 63 cost 65536 latency 0.000147 osd_op(client.108494.0:79368 15.17 15.c0b0eb57 (undecoded) ondisk+write+known_if_redirected e2675) v8 pg pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] 2019-06-03 04:59:20.664 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] can_handle_while_inactive: 0x562b9ac0edc0 2019-06-03 04:59:20.668 7f3b7c493700 10 osd.9 2675 dequeue_op 0x562bb07749a0 prio 63 cost 65536 latency 0.000147 osd_op(client.108494.0:79375 15.17 15.56224037 (undecoded) ondisk+write+known_if_redirected e2675) v8 pg pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] 2019-06-03 04:59:20.668 7f3b7c493700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] can_handle_while_inactive: 0x562bb07749a0 2019-06-03 04:59:21.540 7f3b7848b700 10 osd.9 pg_epoch: 2675 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [9,8]/[9,8] -- 9/9 2019-06-03 04:59:21.540 7f3b7848b700 10 osd.9 pg_epoch: 2676 pg[15.17( empty local-lis/les=0/0 n=0 ec=2638/2630 lis/c 2638/2638 les/c/f 2639/2639/0 2675/2675/2675) [9,8] r=0 lpr=2675 pi=[2638,2675)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Started/Primary/WaitActingChange>: verifying no want_acting [] targets didn't go down
Related issues
History
#1 Updated by Samuel Just almost 5 years ago
- Description updated (diff)
#2 Updated by Samuel Just almost 5 years ago
- Description updated (diff)
#3 Updated by Samuel Just almost 5 years ago
- Project changed from Ceph to RADOS
#4 Updated by qiuzhang chen over 4 years ago
PR: https://github.com/ceph/ceph/pull/29669
I've already commit a pr to master.
#5 Updated by Neha Ojha over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 29669
#6 Updated by Neha Ojha over 4 years ago
- Backport set to luminous,mimic,nautilus
#7 Updated by Neha Ojha over 4 years ago
- Related to Bug #41190: osd: pg stuck in waitactingchange when new acting set doesn't change added
#8 Updated by Neha Ojha almost 4 years ago
- Status changed from Fix Under Review to Duplicate