Actions
Bug #57738
closedcrimson: repop ordering bug
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
0# 0x00007FB4B2A0EE3C in /lib64/libc.so.6 1# raise in /lib64/libc.so.6 2# abort in /lib64/libc.so.6 3# ceph::__ceph_assert_fail(ceph::assert_data const&) at /home/sam/git-checkouts/ceph/src/crimson/common/assert.cc:13 4# PeeringState::add_log_entry(pg_log_entry_t const&, bool) at /home/sam/git-checkouts/ceph/src/osd/PeeringState.cc:4099 5# PeeringState::append_log(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, eversion_t, eversion_t, eversion_t, ceph::os::Transaction&, bool, bool) at /home/sam/git-checkouts/ceph/src/osd/PeeringState.cc:4149 6# crimson::osd::PG::handle_rep_op(boost::intrusive_ptr<MOSDRepOp>) at /home/sam/git-checkouts/ceph/src/crimson/osd/pg.cc:1243 7# crimson::osd::RepRequest::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const at /home/sam/git-checkouts/ceph/src/crimson/osd/osd_operations/replicated_request.cc:64 8# auto seastar::futurize_invoke<crimson::osd::RepRequest::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}>(crimson::osd::RepRequest::with_pg(crimson::osd::ShardServices&
Either the primary is sending them out of order or the replica is reordering on receipt.
Updated by Samuel Just over 1 year ago
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): starting start_pg_operation DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8, start_pg_operation in await_active stage DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8 start_pg_operation active, entering await_map DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8 start_pg_operation await_map stage DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8, got map 275, entering get_pg DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): starting start_pg_operation DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8, start_pg_operation in await_active stage DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8 start_pg_operation active, entering await_map DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8 start_pg_operation await_map stage DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8, got map 275, entering get_pg ... DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8 in get_pg core 0 DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): in get_pg DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): !can_create DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): !can_create DEBUG 2022-09-30 18:53:39,163 [shard 1] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): have_pg DEBUG 2022-09-30 18:53:39,163 [shard 1] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): RepRequest::with_pg ... DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8 in get_pg core 0 DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): in get_pg DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): !can_create DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): !can_create DEBUG 2022-09-30 18:53:39,188 [shard 1] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): have_pg DEBUG 2022-09-30 18:53:39,188 [shard 1] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): RepRequest::with_pg
Looks like the ops were received out of order.
Updated by Samuel Just over 1 year ago
DEBUG 2022-09-30 18:53:32,168 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'186 (0'0,275'186] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'186 lcod 275'5 mlcod 275'5 active+clean append_log log((0'0,275'186], crt=275'186) {275'187 (275'186) modify 159:602f83fe:::foo:head by client.4444.0:188 2022-09-30T18:53:30.105115+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap : 1, new_object: 0} DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'186 lcod 275'5 mlcod 275'5 active+clean add_log_entry 275'187 (275'186) modify 159:602f83fe:::foo:head by client.4444.0:188 2022-09-30T18:53:30.105115+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log approx pg log length = 187 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log dups pg log length = 0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log transaction_applied = 1 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - trim proposed trim_to = 0'0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 275'187, trimmed: , trimmed_dups: , clear_divergent_priors: 0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing clearing up to 0'0 dirty_to_dups=0'0 dirty_from_dups=4294967295'18446744073709551615 write_from_dups=4294967295'18446744073709551615 trimmed_dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing going to encode log.dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 1st round encoded log.dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 2st round encoded log.dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - end of _write_log_and_missing TRACE 2022-09-30 18:53:32,169 [shard 2] osd - mutate_object: num_ops=2 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - final snapset 0=[]:{} in 159:602f83fe:::foo:head DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - ReplicatedBackend::_submit_transaction: do_transaction... ... DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log log((0'0,275'187], crt=275'187) {275'188 (275'187) modify 159:602f83fe:::foo:head by client.4444.0:189 2022-09-30T18:53:30.105176+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0} DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean add_log_entry 275'188 (275'187) modify 159:602f83fe:::foo:head by client.4444.0:189 2022-09-30T18:53:30.105176+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'188 lcod 275'5 mlcod 275'5 active+clean append_log approx pg log length = 188 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'188 lcod 275'5 mlcod 275'5 active+clean append_log dups pg log length = 0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'188 lcod 275'5 mlcod 275'5 active+clean append_log transaction_applied = 1 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - trim proposed trim_to = 0'0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 275'188, trimmed: , trimmed_dups: , clear_divergent_priors: 0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing clearing up to 0'0 dirty_to_dups=0'0 dirty_from_dups=4294967295'18446744073709551615 write_from_dups=4294967295'18446744073709551615 trimmed_dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing going to encode log.dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 1st round encoded log.dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 2st round encoded log.dups.size()=0 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - end of _write_log_and_missing TRACE 2022-09-30 18:53:32,169 [shard 2] osd - mutate_object: num_ops=2 DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - final snapset 0=[]:{} in 159:602f83fe:::foo:head DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - ReplicatedBackend::_submit_transaction: do_transaction...
I think the repop message send is being reordered, almost certainly due to the cross-core messaging.
Actions