Bug #39552
mons fail to process send_alive message causing pg stuck creating
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
sjust-2019-04-28_20:59:54-rados-wip-sjust-peering-refactor-distro-basic-smithi/3905696/
PG 200.3 is stuck creating, waiting for up_thru to be updated. osd.0 does send the osd_alive message; subsequent queue_want_up_thru calls find the request already queued and become no-ops.
2019-04-29 12:10:51.617 7f054f01b700 10 osd.0 pg_epoch: 351 pg[200.3( empty local-lis/les=0/0 n=0 ec=350/350 lis/c 0/0 les/c/f 0/0/0 350/350/350) [0,5] r=0 lpr=351 crt=0'0 mlcod 0'0 creating+peering mbc={}] up_thru 309 < same_since 350, must notify monitor
...
2019-04-29 12:10:51.617 7f054f01b700 10 osd.0 351 queue_want_up_thru now 350 (was 309), currently 309
...
2019-04-29 12:10:51.617 7f054f01b700 10 osd.0 351 send_alive up_thru currently 309 want 350
...
2019-04-29 12:10:51.617 7f054f01b700 1 -- [v2:172.21.15.175:6805/11459,v1:172.21.15.175:6808/11459] --> [v2:172.21.15.175:3300/0,v1:172.21.15.175:6789/0] -- osd_alive(want up_thru 350 have 351) v1 -- 0x55ccdead4780 con 0x55ccdc8b4900
...
2019-04-29 12:14:42.845 7f054b013700 10 osd.0 356 queue_want_up_thru want 350 <= queued 350, currently 309
The mons appear to receive it, but are unable to process it:
2019-04-29 12:10:51.617 7f22ff9c5700 1 -- [v2:172.21.15.175:3300/0,v1:172.21.15.175:6789/0] <== osd.0 v2:172.21.15.175:6805/11459 7 ==== osd_alive(want up_thru 350 have 351) v1 ==== 22+0+0 (crc 0 0 0) 0x5580405ccb20 con 0x55803b50c400
2019-04-29 12:10:51.617 7f22ff9c5700 20 mon.a@0(probing) e1 _ms_dispatch existing session 0x55803ce6ba80 for osd.0
2019-04-29 12:10:51.617 7f22ff9c5700 20 mon.a@0(probing) e1 caps allow *
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:10:51.617 7f22ff9c5700 5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623715 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351) waiting for paxos -> readable (v351)
2019-04-29 12:10:51.617 7f22ff9c5700 5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623725 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
...
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:10:51.617 7f22ff9c5700 5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623715 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351) waiting for paxos -> readable (v351)
2019-04-29 12:10:51.617 7f22ff9c5700 5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623725 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
...
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:10:56.633 7f22ff9c5700 5 mon.a@0(leader).paxos(paxos updating c 9790..10390) is_readable = 1 - now=2019-04-29 12:10:56.638548 lease_expire=2019-04-29 12:11:01.633252 has v0 lc 10390
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).osd e351 preprocess_query osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700 20 is_capable service=osd command= exec addr v2:172.21.15.175:6805/11459 on cap allow *
2019-04-29 12:10:56.633 7f22ff9c5700 20 allow so far , doing grant allow *
2019-04-29 12:10:56.633 7f22ff9c5700 20 allow all
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).osd e351 preprocess_alive want up_thru 350 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700 7 mon.a@0(leader).osd e351 prepare_update osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700 7 mon.a@0(leader).osd e351 prepare_alive want up_thru 350 have 351 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).osd e351 should_propose
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..351) proposal_timer already set
...
2019-04-29 12:11:06.685 7f23021ca700 10 mon.a@0(probing).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:11:06.685 7f23021ca700 5 mon.a@0(probing).paxos(paxos recovering c 9790..10391) is_readable = 0 - now=2019-04-29 12:11:06.689049 lease_expire=2019-04-29 12:11:01.640941 has v0 lc 10391
2019-04-29 12:11:06.685 7f23021ca700 10 mon.a@0(probing).paxosservice(osdmap 31..351) waiting for paxos -> readable (v351)
2019-04-29 12:11:06.685 7f23021ca700 5 mon.a@0(probing).paxos(paxos recovering c 9790..10391) is_readable = 0 - now=2019-04-29 12:11:06.689057 lease_expire=2019-04-29 12:11:01.640941 has v0 lc 10391
...
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..354) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:11:11.752 7f22ff9c5700 5 mon.a@0(leader).paxos(paxos updating c 9790..10403) is_readable = 1 - now=2019-04-29 12:11:11.757505 lease_expire=2019-04-29 12:11:16.752959 has v0 lc 10403
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).osd e354 preprocess_query osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700 20 is_capable service=osd command= exec addr v2:172.21.15.175:6805/11459 on cap allow *
2019-04-29 12:11:11.752 7f22ff9c5700 20 allow so far , doing grant allow *
2019-04-29 12:11:11.752 7f22ff9c5700 20 allow all
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).osd e354 preprocess_alive want up_thru 350 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700 7 mon.a@0(leader).osd e354 prepare_update osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700 7 mon.a@0(leader).osd e354 prepare_alive want up_thru 350 have 351 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).osd e354 should_propose
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..354) proposal_timer already set
Those last two segments appear to repeat until the test terminates -- presumably something is going wrong with waking the pending proposals after paxos recovery?