Bug #40772

mon: pg size change delayed 1 minute because osdmap 35 delay

Added by David Zafman 12 months ago. Updated 12 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

osd-recovery-prio.sh TEST_recovery_pool_priority fails intermittently due to a delay before recovery starts on a PG. The test requires simultaneous recovery of two PGs.

ceph osd pool set $pool1 size 2
ceph osd pool set $pool2 size 2
Running the test alone doesn't necessarily reproduce the problem; I saw it while running all the standalone tests (although they are run sequentially).
  ../qa/run-standalone.sh "osd-recovery-prio.sh TEST_recovery_pool_priority"
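One way to make the test tolerant of the delay would be to poll until a PG from each pool reports a recovering state, instead of assuming the size change applies immediately. A minimal sketch, not the actual test code: the pool ids (2 and 3) are assumptions based on the pg 2.0 log lines below, and `pgs.json` is a hand-written stand-in for one sample of `ceph pg dump pgs --format json` output.

```shell
# Stand-in for one sample of `ceph pg dump pgs --format json` output;
# the real test would refresh this inside the loop.
cat > pgs.json <<'EOF'
{"pg_stats":[{"pgid":"2.0","state":"active+recovering+undersized+degraded+remapped"},
{"pgid":"3.0","state":"active+recovering+degraded"}]}
EOF

# List the pool ids that currently have a PG in a recovering state.
recovering_pools() {
    grep -o '"pgid":"[0-9]*\.[^"]*","state":"[^"]*recovering[^"]*"' pgs.json |
        sed 's/"pgid":"\([0-9]*\)\..*/\1/' | sort -u
}

# Poll (up to 60s) until both pools are recovering, rather than
# asserting simultaneous recovery right away.
for try in $(seq 1 60); do
    if [ "$(recovering_pools | wc -l)" -ge 2 ]; then
        echo "both pools recovering"
        break
    fi
    sleep 1   # real test: re-run `ceph pg dump pgs --format json` here
done
```

This only papers over the symptom, of course; the underlying osdmap propagation delay would still need a fix on the monitor side.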

No pg[2.0 messages appear for almost 1 minute. Map 35 seems to be the one that carries the size change.

2019-07-12T00:29:49.072-0700 7fc35220f700 10 osd.0 pg_epoch: 34 pg[2.0( v 31'600 (0'0,31'600] local-lis/les=29/30 n=200 ec=15/15 lis/c 29/29 les/c/f 30/30/0 29/29/15) [0] r=0 lpr=29 crt=31'600 lcod 31'599 mlcod 31'599 active+clean] do_peering_event: epoch_sent: 34 epoch_requested: 34 NullEvt
2019-07-12T00:30:44.588-0700 7fc35220f700 10 osd.0 pg_epoch: 34 pg[2.0( v 31'600 (0'0,31'600] local-lis/les=29/30 n=200 ec=15/15 lis/c 29/29 les/c/f 30/30/0 29/29/15) [0] r=0 lpr=29 crt=31'600 lcod 31'599 mlcod 31'599 active+clean] handle_advance_map: 35
"message": "pg 2.0 is stuck undersized for 61.208213, current state active+recovering+undersized+degraded+remapped, last acting [0]"
"message": "pg 2.0 is stuck undersized for 63.209241, current state active+recovering+undersized+degraded+remapped, last acting [0]"

mon.sec1.log.gz - Part 1 of split of ceph-mon during critical time (508 KB) David Zafman, 07/12/2019 11:05 PM

mon.sec2.log.gz - Part 2 of split of ceph-mon during critical time (517 KB) David Zafman, 07/12/2019 11:06 PM


Related issues

Related to RADOS - Bug #40820: standalone/scrub/osd-scrub-test.sh +3 day failed assert New

History

#1 Updated by David Zafman 12 months ago

  • Subject changed from PG 1 minute recovery delay to mon: pg size change delayed 1 minute because osdmap 35 delay

This looks to be a monitor issue. We see that osdmap 35 may be getting hung up during the critical period between 00:29:49 and 00:30:44, when osd.0 finally gets the new map.

2019-07-12T00:29:47.612-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..31) post_refresh
2019-07-12T00:29:47.984-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..31) post_refresh
2019-07-12T00:29:48.016-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..31) post_refresh
2019-07-12T00:29:48.664-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..32) post_refresh
2019-07-12T00:29:48.692-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..32) post_refresh
2019-07-12T00:29:48.772-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..33) post_refresh
2019-07-12T00:29:48.812-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..33) post_refresh
2019-07-12T00:29:48.888-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..34) post_refresh
2019-07-12T00:29:48.992-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..34) post_refresh
2019-07-12T00:29:49.108-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..34) post_refresh
2019-07-12T00:29:49.140-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..35) post_refresh
.... 110 LINES NOT SHOWN ...
2019-07-12T00:30:44.444-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..35) post_refresh
2019-07-12T00:30:44.584-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..36) post_refresh
2019-07-12T00:30:44.688-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..37) post_refresh
2019-07-12T00:30:44.712-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..37) post_refresh
2019-07-12T00:30:44.808-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:44.836-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:44.960-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:45.576-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:45.948-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:46.076-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:46.836-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
2019-07-12T00:30:47.240-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..38) post_refresh
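The stall is easiest to see as the gap between consecutive post_refresh timestamps. A minimal sketch of measuring that gap with awk, using a three-line excerpt of the log above as stand-in input (the file name mon.log is a placeholder for the real monitor log):

```shell
# Small excerpt of the mon log (stand-in for the real file).
cat > mon.log <<'EOF'
2019-07-12T00:29:49.140-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..35) post_refresh
2019-07-12T00:30:44.444-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..35) post_refresh
2019-07-12T00:30:44.584-0700 7f8af4385700 10 mon.a@0(leader).paxosservice(osdmap 1..36) post_refresh
EOF

# Report the largest gap between consecutive post_refresh timestamps.
gap_report=$(awk '/post_refresh/ {
    split($1, dt, "T")        # dt[2] = "00:30:44.444-0700"
    split(dt[2], z, "-")      # z[1]  = "00:30:44.444" (drop TZ offset)
    split(z[1], t, /[:.]/)    # hours, minutes, seconds, millis
    sec = t[1]*3600 + t[2]*60 + t[3] + t[4]/1000
    if (n++ && sec - prev > max) { max = sec - prev; line = $0 }
    prev = sec
}
END { printf "largest gap: %.1fs before: %s", max, line }' mon.log)
echo "$gap_report"
```

On this excerpt it reports a gap of roughly 55 seconds ahead of the second `osdmap 1..35` line, matching the stall between 00:29:49 and 00:30:44 that delays osd.0's map update.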

#2 Updated by David Zafman 12 months ago

#3 Updated by David Zafman 12 months ago

#4 Updated by David Zafman 12 months ago

  • Assignee changed from David Zafman to Kefu Chai

Kefu can you take a look? See the attached monitor logs.

#5 Updated by David Zafman 12 months ago

  • Description updated (diff)

#6 Updated by David Zafman 12 months ago

  • Related to Bug #40820: standalone/scrub/osd-scrub-test.sh +3 day failed assert added
