Actions
Bug #41657
closedosd/PeeringState.cc: 2540: FAILED ceph_assert(cct->_conf->osd_find_best_info_ignore_history_les || i nfo.last_epoch_started >= info.history.last_epoch_started)
Status:
Rejected
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
-71> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: processing master log -68> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] proc_master_log for osd.3(3): log((0'0,0'0], crt=0'0) missing(0 may_include_deletes = 0) -66> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 merge_log log((0'0,0'0], crt=0'0) from osd.3(3) into log((0'0,0'0], crt=0'0) -64> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 merge_log result log((0'0,0'0], crt=0'0) missing(0 may_include_deletes = 1) changed=0 -59> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] peer osd.3(3) now 2.1s3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) missing(0 may_include_deletes = 0) -56> 2019-09-04T13:01:52.601+0000 7ff917a01700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] reg_next_scrub pg 2.1s1 register next scrub, scrub time 2019-09-04T13:02:18.510271+0000, must = 0 -1> 2019-09-04T13:01:52.607+0000 7ff917a01700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4560-gffa27c2/rpm/el7/BUILD/ceph-15.0.0-4560-gffa27c2/src/osd/PeeringState.cc: In function 'void PeeringState::proc_master_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_missing_t&, pg_shard_t)' thread 7ff917a01700 time 2019-09-04T13:01:52.602903+0000
/a/sage-2019-09-04_12:34:34-rados:thrash-wip-sage3-testing-2019-09-03-1016-distro-basic-smithi/4277818
The primary has not local last_epoch_started. From GetInfo,
-313> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.0(1) 2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) -311> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.3(3) 2.1s3( empty local-lis/les=41/42 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) -309> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.4(2) 2.1s2( empty local-lis/les=41/42 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) -308> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.5(1) 2.1s1( empty local-lis/les=41/42 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) -306> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] For position 0: failed to fill position 0 -304> 2019-09-04T13:01:52.599+0000 7ff91ba09700 20 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_ec candidates by cost are: -303> 2019-09-04T13:01:52.599+0000 7ff91ba09700 20 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_ec result want=[2147483647,0,4,3] async_recovery= -300> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] acting_recovery_backfill is 0(1),3(3),4(2) -297> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting want=[2147483647,0,4,3] backfill_targets= async_recovery_targets= -295> 2019-09-04T13:01:52.599+0000 7ff91ba09700 10 osd.0 pg_epoch: 49 pg[2.1s1( empty local-lis/les=0/0 n=0 ec=21/21 lis/c=41/23 les/c/f=42/24/0 sis=49) [2147483647,0,4,3]p0(1) r=1 lpr=49 pi=[23,49)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: requesting log from osd.3(3)
Updated by Sage Weil over 4 years ago
The find_best_info process excludes getting a master log from an osd with an old(er) last_epoch_started. However, there is no such restriction on the choice of primary that I can see.
I think we should just remove this assertion...
Updated by Sage Weil over 4 years ago
- Status changed from 12 to Fix Under Review
Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Rejected
this is caused by a bug in my test branch
Actions