Bug #57738


crimson: repop ordering bug

Added by Samuel Just over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

 0# 0x00007FB4B2A0EE3C in /lib64/libc.so.6
 1# raise in /lib64/libc.so.6
 2# abort in /lib64/libc.so.6
 3# ceph::__ceph_assert_fail(ceph::assert_data const&) at /home/sam/git-checkouts/ceph/src/crimson/common/assert.cc:13
 4# PeeringState::add_log_entry(pg_log_entry_t const&, bool) at /home/sam/git-checkouts/ceph/src/osd/PeeringState.cc:4099
 5# PeeringState::append_log(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, eversion_t, eversion_t, eversion_t, ceph::os::Transaction&, bool, bool) at /home/sam/git-checkouts/ceph/src/osd/PeeringState.cc:4149
 6# crimson::osd::PG::handle_rep_op(boost::intrusive_ptr<MOSDRepOp>) at /home/sam/git-checkouts/ceph/src/crimson/osd/pg.cc:1243
 7# crimson::osd::RepRequest::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const at /home/sam/git-checkouts/ceph/src/crimson/osd/osd_operations/replicated_request.cc:64
 8# auto seastar::futurize_invoke<crimson::osd::RepRequest::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}>(crimson::osd::RepRequest::with_pg(crimson::osd::ShardServices&

Either the primary is sending the repops out of order, or the replica is reordering them on receipt.
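
The assert at frame 4 is presumably the usual last_update monotonicity check in add_log_entry (ceph_assert(e.version > info.last_update)). A minimal sketch, with simplified stand-in types rather than the real Ceph structures, of how two reordered repops trip that invariant:

// Sketch only: simplified stand-ins for eversion_t / pg_info_t, not Ceph code.
#include <cassert>
#include <cstdint>

struct eversion_t {
  uint64_t epoch = 0;
  uint64_t version = 0;
  bool operator>(const eversion_t& o) const {
    return epoch != o.epoch ? epoch > o.epoch : version > o.version;
  }
};

struct pg_info_t { eversion_t last_update; };
struct pg_log_entry_t { eversion_t version; };

void add_log_entry(pg_info_t& info, const pg_log_entry_t& e) {
  // The invariant add_log_entry enforces: log entries must be appended in
  // strictly increasing version order.
  assert(e.version > info.last_update);
  info.last_update = e.version;
}

int main() {
  pg_info_t info{{275, 186}};         // replica log is at v 275'186
  add_log_entry(info, {{275, 188}});  // repop for 275'188 arrives first: ok
  add_log_entry(info, {{275, 187}});  // then 275'187: the assert fires
}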

#1 - Updated by Samuel Just over 1 year ago

DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): starting start_pg_operation
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8, start_pg_operation in await_active stage
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8 start_pg_operation active, entering await_map
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8 start_pg_operation await_map stage
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8, got map 275, entering get_pg
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): starting start_pg_operation
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8, start_pg_operation in await_active stage
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8 start_pg_operation active, entering await_map
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8 start_pg_operation await_map stage
DEBUG 2022-09-30 18:53:32,184 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8, got map 275, entering get_pg
...
DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): 0x6170000a15b8 in get_pg core 0
DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): in get_pg
DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): !can_create
DEBUG 2022-09-30 18:53:39,163 [shard 0] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): !can_create
DEBUG 2022-09-30 18:53:39,163 [shard 1] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): have_pg
DEBUG 2022-09-30 18:53:39,163 [shard 1] osd - replicated_request(id=253236, detail=RepRequest(from=2 req=osd_repop(client.4444.0:189 159.6 e275/273 159:602f83fe:::foo:head v 275'188, mlcod=275'188) v3)): RepRequest::with_pg
...
DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): 0x6170000a15b8 in get_pg core 0
DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): in get_pg
DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): !can_create
DEBUG 2022-09-30 18:53:39,188 [shard 0] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): !can_create
DEBUG 2022-09-30 18:53:39,188 [shard 1] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): have_pg
DEBUG 2022-09-30 18:53:39,188 [shard 1] osd - replicated_request(id=253237, detail=RepRequest(from=2 req=osd_repop(client.4444.0:188 159.6 e275/273 159:602f83fe:::foo:head v 275'187, mlcod=275'187) v3)): RepRequest::with_pg

Looks like the ops were received out of order: the repop for v 275'188 (id=253236) enters start_pg_operation before the one for v 275'187 (id=253237).
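
As an illustration, a small hypothetical helper (not part of crimson) that scans per-PG repop versions in arrival order and reports inversions; fed the two requests above in the order they were received, it flags 275'187 arriving after 275'188:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (epoch, version) pair mirroring eversion_t ordering.
using eversion = std::pair<uint64_t, uint64_t>;

// Hypothetical debugging aid: report repops whose version is not strictly
// greater than the last version seen for the same PG, i.e. an ordering
// inversion on receipt.
void check_order(const std::vector<std::pair<std::string, eversion>>& repops) {
  std::map<std::string, eversion> last_seen;
  for (const auto& [pg, v] : repops) {
    auto it = last_seen.find(pg);
    if (it != last_seen.end() && !(v > it->second)) {
      std::cout << "inversion in " << pg << ": " << v.first << "'" << v.second
                << " after " << it->second.first << "'" << it->second.second
                << "\n";
    } else {
      last_seen[pg] = v;
    }
  }
}

int main() {
  // Arrival order taken from the log excerpt: 275'188 (id=253236) is seen
  // before 275'187 (id=253237) on pg 159.6.
  check_order({{"159.6", {275, 188}}, {"159.6", {275, 187}}});
}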

#2 - Updated by Samuel Just over 1 year ago

DEBUG 2022-09-30 18:53:32,168 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'186 (0'0,275'186] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'186 lcod 275'5 mlcod 275'5 active+clean append_log log((0'0,275'186], crt=275'186) {275'187 (275'186) modify   159:602f83fe:::foo:head by client.4444.0:188 2022-09-30T18:53:30.105115+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0}
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'186 lcod 275'5 mlcod 275'5 active+clean add_log_entry 275'187 (275'186) modify   159:602f83fe:::foo:head by client.4444.0:188 2022-09-30T18:53:30.105115+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log approx pg log length =  187
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log dups pg log length =  0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log transaction_applied = 1
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - trim proposed trim_to = 0'0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 275'187, trimmed: , trimmed_dups: , clear_divergent_priors: 0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing clearing up to 0'0 dirty_to_dups=0'0 dirty_from_dups=4294967295'18446744073709551615 write_from_dups=4294967295'18446744073709551615 trimmed_dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing going to encode log.dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 1st round encoded log.dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 2st round encoded log.dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - end of _write_log_and_missing
TRACE 2022-09-30 18:53:32,169 [shard 2] osd - mutate_object: num_ops=2
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - final snapset 0=[]:{} in 159:602f83fe:::foo:head
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - ReplicatedBackend::_submit_transaction: do_transaction...
...
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'187 (0'0,275'187] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean append_log log((0'0,275'187], crt=275'187) {275'188 (275'187) modify   159:602f83fe:::foo:head by client.4444.0:189 2022-09-30T18:53:30.105176+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0}
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'187 lcod 275'5 mlcod 275'5 active+clean add_log_entry 275'188 (275'187) modify   159:602f83fe:::foo:head by client.4444.0:189 2022-09-30T18:53:30.105176+0000 0 ObjectCleanRegions clean_offsets: [9~18446744073709551606], clean_omap: 1, new_object: 0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'188 lcod 275'5 mlcod 275'5 active+clean append_log approx pg log length =  188
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'188 lcod 275'5 mlcod 275'5 active+clean append_log dups pg log length =  0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd -  pg_epoch 275 pg[159.6( v 275'188 (0'0,275'188] local-lis/les=273/275 n=1 ec=273/273 lis/c=273/273 les/c/f=275/275/0 sis=273) [2,1,0] r=0 lpr=273 luod=275'5 lua=0'0 crt=275'188 lcod 275'5 mlcod 275'5 active+clean append_log transaction_applied = 1
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - trim proposed trim_to = 0'0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 275'188, trimmed: , trimmed_dups: , clear_divergent_priors: 0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing clearing up to 0'0 dirty_to_dups=0'0 dirty_from_dups=4294967295'18446744073709551615 write_from_dups=4294967295'18446744073709551615 trimmed_dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing going to encode log.dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 1st round encoded log.dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - _write_log_and_missing 2st round encoded log.dups.size()=0
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - end of _write_log_and_missing
TRACE 2022-09-30 18:53:32,169 [shard 2] osd - mutate_object: num_ops=2
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - final snapset 0=[]:{} in 159:602f83fe:::foo:head
DEBUG 2022-09-30 18:53:32,169 [shard 2] osd - ReplicatedBackend::_submit_transaction: do_transaction...

I think the repop message send is being reordered, almost certainly due to the cross-core messaging.
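
A hedged sketch of that failure mode (not crimson code; the shard number, helper names, and delays here are artificial). Two sends issued in order on the PG's core each hop to the messenger's core as an independent future chain, and whichever chain finishes first reaches the wire first. Build against Seastar and run with --smp 2:

// Sketch only (not crimson code): "messenger core" is shard 1, and the sleep
// stands in for whatever asynchronous work each send happens to do there.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/when_all.hh>
#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

// Each send is an independent cross-core future chain; nothing forces
// v 275'187's chain to finish before v 275'188's.
seastar::future<> send_repop(int version, std::chrono::milliseconds delay) {
  return seastar::smp::submit_to(1, [version, delay] {
    return seastar::sleep(delay).then([version] {
      std::cout << "wire sees repop v 275'" << version << "\n";
    });
  });
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    // Issued in order 187, 188 on the PG's core, but 187's chain is delayed
    // on the messenger core, so 188 reaches the wire first -- the inversion
    // the replica's add_log_entry then asserts on.
    auto a = send_repop(187, 10ms);
    auto b = send_repop(188, 0ms);
    return seastar::when_all_succeed(std::move(a), std::move(b)).discard_result();
  });
}

Chaining every send for a given connection onto a shared ordering future, so the next hop only starts once the previous one has reached the wire, is one way to preserve the order; the point of the sketch is only that independent cross-core chains give no such guarantee.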

#3 - Updated by Samuel Just over 1 year ago

  • Status changed from New to Resolved
