Project

General

Profile

Actions

Bug #41657

closed

osd/PeeringState.cc: 2540: FAILED ceph_assert(cct->_conf->osd_find_best_info_ignore_history_les || i nfo.last_epoch_started >= info.history.last_epoch_started)

Added by Sage Weil over 4 years ago. Updated over 4 years ago.

Status:
Rejected
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -71> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: processing master log
   -68> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] proc_master_log for osd.3(3): log((0'0,0'0], crt=0'0) missing(0 may_include_deletes = 0)
   -66> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 merge_log log((0'0,0'0], crt=0'0) from osd.3(3) into log((0'0,0'0], crt=0'0)
   -64> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 merge_log result log((0'0,0'0], crt=0'0) missing(0 may_include_deletes = 1) changed=0
   -59> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}]  peer osd.3(3) now 2.1s3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) missing(0 may_include_deletes = 0)
   -56> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] reg_next_scrub pg 2.1s1 register next scrub, scrub time 2019-09-04T13:02:18.510271+0000, must = 0
    -1> 2019-09-04T13:01:52.607+0000 7ff917a01700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4560-gffa27c2/rpm/el7/BUILD/ceph-15.0.0-4560-gffa27c2/src/osd/PeeringState.cc: In function 'void PeeringState::proc_master_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_missing_t&, pg_shard_t)' thread 7ff917a01700 time 2019-09-04T13:01:52.602903+0000

/a/sage-2019-09-04_12:34:34-rados:thrash-wip-sage3-testing-2019-09-03-1016-distro-basic-smithi/4277818

The primary has not local last_epoch_started. From GetInfo,

  -313> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.0(1) 2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49)
  -311> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.3(3) 2.1s3( empty local-lis/les=41/42 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49)
  -309> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.4(2) 2.1s2( empty local-lis/les=41/42 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49)
  -308> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.5(1) 2.1s1( empty local-lis/les=41/42 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49)
  -306> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] For position 0:  failed to fill position 0
  -304> 2019-09-04T13:01:52.599+0000 7ff91ba09700 20 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_ec candidates by cost are: 
  -303> 2019-09-04T13:01:52.599+0000 7ff91ba09700 20 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_ec result want=[2147483647,0,4,3] async_recovery=
  -300> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] acting_recovery_backfill is 0(1),3(3),4(2)
  -297> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting want=[2147483647,0,4,3] backfill_targets= async_recovery_targets=
  -295> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>:  requesting log from osd.3(3)

Actions #1

Updated by Sage Weil over 4 years ago

The find_best_info process excludes getting a master log from an osd with an old(er) last_epoch_started. However, there is no such restriction on the choice of primary that I can see.

I think we should just remove this assertion...

Actions #2

Updated by Sage Weil over 4 years ago

  • Status changed from 12 to Fix Under Review
Actions #3

Updated by Sage Weil over 4 years ago

  • Pull request ID set to 30153
Actions #4

Updated by Sage Weil over 4 years ago

  • Backport set to nautilus, mimic, luminous
Actions #5

Updated by Sage Weil over 4 years ago

  • Status changed from Fix Under Review to Rejected

this is caused by a bug in my test branch

Actions

Also available in: Atom PDF