Bug #39552 (open): mons fail to process send_alive message causing pg stuck creating

Added by Samuel Just almost 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

sjust-2019-04-28_20:59:54-rados-wip-sjust-peering-refactor-distro-basic-smithi/3905696/

PG 200.3 is stuck creating, waiting on up_thru to be updated. osd.0 does send the osd_alive message via send_alive; subsequent calls to queue_want_up_thru see the wanted value already queued and therefore do nothing further.

2019-04-29 12:10:51.617 7f054f01b700 10 osd.0 pg_epoch: 351 pg[200.3( empty local-lis/les=0/0 n=0 ec=350/350 lis/c 0/0 les/c/f 0/0/0 350/350/350) [0,5] r=0 lpr=351 crt=0'0 mlcod 0'0 creating+peering mbc={}] up_thru 309 < same_since 350, must notify monitor
...
2019-04-29 12:10:51.617 7f054f01b700 10 osd.0 351 queue_want_up_thru now 350 (was 309), currently 309
...
2019-04-29 12:10:51.617 7f054f01b700 10 osd.0 351 send_alive up_thru currently 309 want 350
...
2019-04-29 12:10:51.617 7f054f01b700  1 -- [v2:172.21.15.175:6805/11459,v1:172.21.15.175:6808/11459] --> [v2:172.21.15.175:3300/0,v1:172.21.15.175:6789/0] -- osd_alive(want up_thru 350 have 351) v1 -- 0x55ccdead4780 con 0x55ccdc8b4900
...
2019-04-29 12:14:42.845 7f054b013700 10 osd.0 356 queue_want_up_thru want 350 <= queued 350, currently 309
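
For reference, the OSD-side gating implied by these messages is roughly the following. This is a simplified sketch inferred from the log lines above, not verbatim Ceph code; names and structure are approximations.

// Simplified sketch of the OSD-side up_thru gating suggested by the log
// lines above; approximate names, not verbatim Ceph code.
struct OsdAliveSketch {
  unsigned up_thru = 309;         // up_thru in the last published osdmap
  unsigned up_thru_wanted = 309;  // highest value already requested from the mon

  void send_alive() {
    // "send_alive up_thru currently <up_thru> want <up_thru_wanted>"
    // ...send osd_alive(want up_thru <up_thru_wanted> have <epoch>) to the mon.
  }

  void queue_want_up_thru(unsigned want) {
    if (want > up_thru_wanted) {
      // "queue_want_up_thru now <want> (was <old>), currently <up_thru>"
      up_thru_wanted = want;
      send_alive();
    } else {
      // "queue_want_up_thru want <want> <= queued <queued>, currently <up_thru>"
      // Nothing further to send: the request is considered in flight, so the
      // OSD just waits for a new osdmap with up_thru >= want.
    }
  }
};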

The mons appear to receive it, but are unable to process it:

2019-04-29 12:10:51.617 7f22ff9c5700  1 -- [v2:172.21.15.175:3300/0,v1:172.21.15.175:6789/0] <== osd.0 v2:172.21.15.175:6805/11459 7 ==== osd_alive(want up_thru 350 have 351) v1 ==== 22+0+0 (crc 0 0 0) 0x5580405ccb20 con 0x55803b50c400
2019-04-29 12:10:51.617 7f22ff9c5700 20 mon.a@0(probing) e1 _ms_dispatch existing session 0x55803ce6ba80 for osd.0
2019-04-29 12:10:51.617 7f22ff9c5700 20 mon.a@0(probing) e1  caps allow *
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:10:51.617 7f22ff9c5700  5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623715 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351)  waiting for paxos -> readable (v351)
2019-04-29 12:10:51.617 7f22ff9c5700  5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623725 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
...
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:10:51.617 7f22ff9c5700  5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623715 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
2019-04-29 12:10:51.617 7f22ff9c5700 10 mon.a@0(probing).paxosservice(osdmap 31..351)  waiting for paxos -> readable (v351)
2019-04-29 12:10:51.617 7f22ff9c5700  5 mon.a@0(probing).paxos(paxos recovering c 9790..10390) is_readable = 0 - now=2019-04-29 12:10:51.623725 lease_expire=2019-04-29 12:10:46.563901 has v0 lc 10390
...
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:10:56.633 7f22ff9c5700  5 mon.a@0(leader).paxos(paxos updating c 9790..10390) is_readable = 1 - now=2019-04-29 12:10:56.638548 lease_expire=2019-04-29 12:11:01.633252 has v0 lc 10390
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).osd e351 preprocess_query osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700 20 is_capable service=osd command= exec addr v2:172.21.15.175:6805/11459 on cap allow *
2019-04-29 12:10:56.633 7f22ff9c5700 20  allow so far , doing grant allow *
2019-04-29 12:10:56.633 7f22ff9c5700 20  allow all
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).osd e351 preprocess_alive want up_thru 350 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700  7 mon.a@0(leader).osd e351 prepare_update osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700  7 mon.a@0(leader).osd e351 prepare_alive want up_thru 350 have 351 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).osd e351 should_propose
2019-04-29 12:10:56.633 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..351)  proposal_timer already set
...
2019-04-29 12:11:06.685 7f23021ca700 10 mon.a@0(probing).paxosservice(osdmap 31..351) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:11:06.685 7f23021ca700  5 mon.a@0(probing).paxos(paxos recovering c 9790..10391) is_readable = 0 - now=2019-04-29 12:11:06.689049 lease_expire=2019-04-29 12:11:01.640941 has v0 lc 10391
2019-04-29 12:11:06.685 7f23021ca700 10 mon.a@0(probing).paxosservice(osdmap 31..351)  waiting for paxos -> readable (v351)
2019-04-29 12:11:06.685 7f23021ca700  5 mon.a@0(probing).paxos(paxos recovering c 9790..10391) is_readable = 0 - now=2019-04-29 12:11:06.689057 lease_expire=2019-04-29 12:11:01.640941 has v0 lc 10391
...
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..354) dispatch 0x5580405ccb20 osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459 con 0x55803b50c400
2019-04-29 12:11:11.752 7f22ff9c5700  5 mon.a@0(leader).paxos(paxos updating c 9790..10403) is_readable = 1 - now=2019-04-29 12:11:11.757505 lease_expire=2019-04-29 12:11:16.752959 has v0 lc 10403
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).osd e354 preprocess_query osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700 20 is_capable service=osd command= exec addr v2:172.21.15.175:6805/11459 on cap allow *
2019-04-29 12:11:11.752 7f22ff9c5700 20  allow so far , doing grant allow *
2019-04-29 12:11:11.752 7f22ff9c5700 20  allow all
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).osd e354 preprocess_alive want up_thru 350 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700  7 mon.a@0(leader).osd e354 prepare_update osd_alive(want up_thru 350 have 351) v1 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700  7 mon.a@0(leader).osd e354 prepare_alive want up_thru 350 have 351 from osd.0 v2:172.21.15.175:6805/11459
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).osd e354 should_propose
2019-04-29 12:11:11.752 7f22ff9c5700 10 mon.a@0(leader).paxosservice(osdmap 31..354)  proposal_timer already set
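
The monitor-side pattern above follows the usual PaxosService dispatch gating. A simplified sketch of what those log lines correspond to, with assumed structure rather than verbatim Ceph code:

// Simplified sketch of PaxosService dispatch gating reflected in the mon
// log above; approximate structure, not verbatim Ceph code.
struct PaxosServiceSketch {
  bool readable = false;        // false while paxos is recovering after an election
  bool proposal_timer = false;  // stands in for the scheduled proposal callback

  bool preprocess_query() { return false; }  // preprocess_alive: no read-only answer
  void prepare_update()  { /* prepare_alive: stage the up_thru bump in the pending increment */ }
  void propose_pending() { /* commit the staged increment through paxos */ }
  void wait_for_readable() { /* park the op, retry once recovery finishes */ }

  void dispatch() {
    if (!readable) {
      // "waiting for paxos -> readable (v351)"
      wait_for_readable();
      return;
    }
    if (preprocess_query())
      return;
    prepare_update();
    // should_propose(): decide whether/when to propose the pending update.
    if (proposal_timer) {
      // "proposal_timer already set": a proposal is believed to be scheduled
      // already, so nothing new is armed for this message.
    } else {
      proposal_timer = true;    // schedule propose_pending() after a short delay
    }
  }
};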

Those last two segments appear to continue repeating until the test terminates -- presumably something is wrong with how pending proposals are woken up after paxos recovery?
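
If that guess is right, the failure would look something like the driver below, which reuses the PaxosServiceSketch from the previous snippet. This is purely a hypothetical illustration of the suspected race, inferred from the repeating "proposal_timer already set" lines; it is not actual Ceph code.

// Hypothetical driver reproducing the repeating log pattern above under the
// assumption that recovery drops the scheduled proposal without clearing the
// timer state or proposing the staged update; not actual Ceph code.
int main() {
  PaxosServiceSketch svc;

  svc.dispatch();               // probing/recovering: "waiting for paxos -> readable"

  svc.readable = true;
  svc.proposal_timer = true;    // a proposal was scheduled before the election...
  // ...and the hypothesis is that recovery orphaned that callback without
  // clearing this flag or proposing the staged up_thru update.

  for (int i = 0; i < 3; ++i)
    svc.dispatch();             // prepare_alive + "proposal_timer already set", forever

  return 0;
}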

#1 - Updated by Brad Hubbard almost 5 years ago

  • Project changed from Ceph to RADOS

#2 - Updated by Samuel Just almost 5 years ago

  • Assignee deleted (Samuel Just)