Bug #64192


Incorrect behavior on combined cmpext+write ops in the face of session resets

Added by Ilya Dryomov 3 months ago. Updated 2 months ago.

Status:
Pending Backport
Priority:
Urgent
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
reef,quincy
Regression:
Yes
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
55570
Crash signature (v1):
Crash signature (v2):

Description

There seems to be an expectation mismatch between the OSD and the Objecter (or at least the neorados bit of the Objecter). Based on a quick look, I'm inclined to say that this is an OSD bug.

1. A combined cmpext+write op arrives on the OSD:

2024-01-26T10:52:24.783-0500 7f6a861a1700  1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] <== client.4499 172.21.9.34:0/3763700879 4 ==== osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e304) v8 ==== 284+0+28 (crc 0 0 0) 0x5623c974e000 con 0x5623ca360480

2. cmpext fails the compare (as expected -- the client is sending data that doesn't match on purpose):

2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] _handle_message: osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_op: op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] op_has_sufficient_caps session=0x5623c8f2b400 pool=59 (test-librbd-senta04-2299691-58 ) pool_app_metadata={rados={}} need_read_cap=1 need_write_cap=1 classes=[] -> yes
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8 may_write may_read -> write-ordered flags ondisk+write+known_if_redirected+supports_pool_eio
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] get_object_context: found obc in cache: obc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head rwstate(none n=0 w=0))
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] get_object_context: obc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head rwstate(none n=0 w=0)) oi: 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head(304'22 client.4499.0:70 dirty s 526 uv 22 alloc_hint [4194304 4194304 0]) exists: 1 ssc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:snapdir snapset: 0=[]:{} ref: 1 registered: 1 exists: 1)
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] find_object_context 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head @head oi=59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head(304'22 client.4499.0:70 dirty s 526 uv 22 alloc_hint [4194304 4194304 0])
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] maybe_handle_manifest_detail: 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head is not manifest object 
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_op obc obc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head rwstate(excl n=1 w=0))
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx 0x5623c967c400
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] ov 304'22 av 304'28 snapc 0=[] snapset 0=[]:{}
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b]
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op  cmpext 512~14 in=14b
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_extent_cmp
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [sync_read 512~14]
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op  sync_read 512~14
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_read
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]]  read got 14 / 14 bytes from obj 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_ops error: (4100) Unknown error 4100
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx alloc reply 0x5623c80b2d80 result -4100
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::calc_trim_to_aggressive limit = 304'25
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]]  op order client.4499 tid 76 last was 70
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx update_log_only -- result=-4100
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] record_write_error r=-4100
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] submit_log_entries 304'28 (0'0) error    59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head by client.4499.0:76 0.000000 -4100 ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] new_repop: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100)
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::merge_new_log_entries 304'28 (0'0) error    59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head by client.4499.0:76 0.000000 -4100 ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 update missing, append 304'28 (0'0) error    59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head by client.4499.0:76 0.000000 -4100 ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 lua=304'27 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::append_log_entries_update_missing trim_to bool = 1 trim_to = 0'0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 trim proposed trim_to = 0'0
2024-01-26T10:52:24.787-0500 7f6a645e6700  6 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 304'28, trimmed: , trimmed_dups: , clear_divergent_priors: 0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing clearing up to 0'0 dirty_to_dups=0'0 dirty_from_dups=4294967295'18446744073709551615 write_from_dups=4294967295'18446744073709551615 trimmed_dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing going to encode log.dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing 1st round encoded log.dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing 2st round encoded log.dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 end of _write_log_and_missing
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 lua=304'27 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] op_applied version 304'28
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::calc_trim_to_aggressive limit = 304'25
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 304 dequeue_op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8 finish

3. Before the op is ready, OSD session gets reset (due to ms_inject_socket_failures injection on the client):

2024-01-26T10:52:24.787-0500 7f6a861a1700  1 --2- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] >> 172.21.9.34:0/3763700879 conn(0x5623ca360480 0x5623c97b8580 crc :-1 s=READY pgs=20 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-01-26T10:52:24.787-0500 7f6a861a1700  1 --2- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] >> 172.21.9.34:0/3763700879 conn(0x5623ca360480 0x5623c97b8580 crc :-1 s=READY pgs=20 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2024-01-26T10:52:24.787-0500 7f6a75b6c700  2 osd.0 304 ms_handle_reset con 0x5623ca360480 session 0x5623c8f2b400

4. The client resends the op and dup detection on the OSD kicks in:

2024-01-26T10:52:24.791-0500 7f6a861a1700  1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] <== client.4499 172.21.9.34:0/3763700879 1 ==== osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8 ==== 284+0+28 (crc 0 0 0) 0x5623c974e780 con 0x5623ca3e6480

2024-01-26T10:52:24.791-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] _handle_message: osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] do_op: op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] op_has_sufficient_caps session=0x5623c974e280 pool=59 (test-librbd-senta04-2299691-58 ) pool_app_metadata={rados={}} need_read_cap=1 need_write_cap=1 classes=[] -> yes
2024-01-26T10:52:24.791-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] do_op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8 may_write may_read -> write-ordered flags ondisk+retry+write+known_if_redirected+supports_pool_eio
2024-01-26T10:52:24.791-0500 7f6a645e6700  3 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] do_op dup client.4499.0:76 version 304'28

^^^ It's not logged, but I suspect that PG::check_in_progress_op() returns with user_version = 0 and an empty op_returns vector here.

2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: 304'28
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 op_wq(0) _process 59.0 to_process <> waiting <> waiting_peering {}
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100) not committed, returning false

5. The op becomes ready:

2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'26 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] repop_all_committed: repop tid 1813 all committed
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'26 active+clean ps=[2~1]] PeeringState::calc_min_last_complete_ondisk last_complete_ondisk is updated to: 304'27 from: 304'26
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] eval_repop repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]]  commit: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)

6. Reply with user_version = 0 is sent:

2024-01-26T10:52:24.791-0500 7f6a685ee700  1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] --> 172.21.9.34:0/3763700879 -- osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv0 ondisk = -4100 ((4100) Unknown error 4100)) v8 -- 0x5623c92786c0 con 0x5623ca3e6480
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] PeeringState::prepare_stats_for_publish reporting purged_snaps [2~1]
2024-01-26T10:52:24.791-0500 7f6a685ee700 15 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] PeeringState::prepare_stats_for_publish publish_stats_to_osd 304:75
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]]  removing repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]]    q front is repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] finished operator() r=-4100
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]]  sending commit on osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8 0x5623c80b2d80

7. Reply with user_version != 0 is sent:

2024-01-26T10:52:24.791-0500 7f6a685ee700  1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] --> 172.21.9.34:0/3763700879 -- osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv22 ondisk = -4100 ((4100) Unknown error 4100)) v8 -- 0x5623c80b2d80 con 0x5623ca360480
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] remove_repop repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)

8. On the client side, Objecter processes the reply with user_version = 0. This reply also happens to have rval = 0 for all ops, including cmpext even though it failed the compare:

2024-01-26T10:52:24.791-0500 7ff755ffb700  1 -- 172.21.9.34:0/3763700879 <== osd.0 v2:172.21.9.34:6802/39207300 1 ==== osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv0 ondisk = -4100 ((4100) Unknown error 4100)) v8 ==== 223+0+0 (crc 0 0 0) 0x7ff704078040 con 0x7ff72c025d50
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter ms_dispatch 0x55bffcc34780 osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv0 ondisk = -4100 ((4100) Unknown error 4100)) v8
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter in handle_osd_op_reply
2024-01-26T10:52:24.791-0500 7ff755ffb700  7 client.4499.objecter handle_osd_op_reply 76 ondisk uv 0 in 59.0 attempt 1
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter  op 0 rval 0 len 0
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter  op 1 rval 0 len 0
2024-01-26T10:52:24.791-0500 7ff755ffb700 15 client.4499.objecter handle_osd_op_reply completed tid 76
2024-01-26T10:52:24.791-0500 7ff755ffb700 15 client.4499.objecter _finish_op 76

Because of rval = 0 there, Adam's cmpext "handler" from https://github.com/ceph/ceph/pull/52495 doesn't set mismatch_offset and doesn't adjust the overall return code. -4100 is returned to the user (librbd in this case), which chokes on it: we expect -MAX_ERRNO (-4095) overall return code and mismatch_offset = 5.
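
For reference, the -4100 above follows the usual cmpext mismatch encoding: the OSD reports -MAX_ERRNO - mismatch_offset, so -4100 corresponds to offset 5, and the client side is expected to clamp the overall result to -MAX_ERRNO while exposing the offset separately. A minimal standalone sketch of that decode (decode_cmpext_rval() is a hypothetical helper written for illustration, not the actual handler from the PR above):

// Sketch of the cmpext return-code convention described above.
// On a mismatch the OSD reports -MAX_ERRNO - mismatch_offset; the client
// clamps the overall result to -MAX_ERRNO and exposes the offset separately.
#include <cstdint>
#include <iostream>

constexpr int MAX_ERRNO = 4095;

// Hypothetical helper: returns true if rval encodes a cmpext mismatch.
bool decode_cmpext_rval(int rval, uint64_t *mismatch_offset) {
  if (rval >= -MAX_ERRNO) {
    return false;  // plain errno (or success), not a cmpext mismatch
  }
  *mismatch_offset = static_cast<uint64_t>(-rval) - MAX_ERRNO;
  return true;
}

int main() {
  uint64_t off = 0;
  if (decode_cmpext_rval(-4100, &off)) {
    std::cout << "mismatch_offset = " << off << std::endl;  // prints 5
  }
}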

For comparison, here is Objecter processing the same thing when retries aren't involved:

2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter ms_dispatch 0x55bffca54cd0 osd_op_reply(67 rbd_data.102eba81544b.0000000000000000 [cmpext 512~14,write 512~14] v12'23 uv21 ondisk = -4100 ((4100) Unknown error 4100)) v8
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter in handle_osd_op_reply
2024-01-26T10:47:07.259-0500 7ff743fff700  7 client.4142.objecter handle_osd_op_reply 67 ondisk uv 21 in 2.0 attempt 0
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter  op 0 rval -4100 len 0
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter ERROR: tid 67: handler function threw CmpExt mismatch [osd:4095]
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter  op 1 rval 0 len 0
2024-01-26T10:47:07.259-0500 7ff743fff700 15 client.4142.objecter handle_osd_op_reply completed tid 67
2024-01-26T10:47:07.259-0500 7ff743fff700 15 client.4142.objecter _finish_op 67

Radoslaw, is it expected that PG::check_in_progress_op() returns with user_version = 0 and an empty op_returns vector, causing a dummy reply with user_version = 0 and rval = 0 for all ops to be sent first? Should such a dummy reply be sent at all?


Related issues 2 (2 open, 0 closed)

Copied to RADOS - Backport #64575: reef: Incorrect behavior on combined cmpext+write ops in the face of session resets (New, Radoslaw Zarzynski)
Copied to RADOS - Backport #64576: quincy: Incorrect behavior on combined cmpext+write ops in the face of session resets (New, Radoslaw Zarzynski)
#1

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from New to In Progress
#2

Updated by Radoslaw Zarzynski 3 months ago

Hi Ilya!

I looked over the log's reader (PG::check_in_progress_op() + PGLog::get_request()) and nothing struck me at first glance -- it seems that for every single return-true path the op_returns should be set.
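
For context, here is a rough sketch (paraphrased with stand-in types, not the actual PG/PGLog source) of what that reader behavior implies: the dup path hands back copies of whatever the matching pg_log_entry_t holds, so uv == 0 and an empty op_returns in the reply simply mean that is what was recorded in the log entry.

// Sketch only, with stand-in types -- the point is that the reader copies the
// values recorded in the log entry, it does not recompute anything.
#include <vector>

struct log_entry_sketch {
  unsigned long user_version = 0;   // 0 for entries made by record_write_error()
  int return_code = 0;
  std::vector<int> op_returns;      // stand-in for pg_log_op_return_item_t
};

bool get_request_sketch(const log_entry_sketch &e,
                        unsigned long *user_version,
                        int *return_code,
                        std::vector<int> *op_returns) {
  *user_version = e.user_version;
  *return_code = e.return_code;
  *op_returns = e.op_returns;       // empty unless set_op_returns() was called
  return true;                      // every "return true" hands op_returns back
}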

However, the writer part looks suspicious:

void PrimaryLogPG::record_write_error(OpRequestRef op, const hobject_t &soid,
                                      MOSDOpReply *orig_reply, int r,
                                      OpContext *ctx_for_op_returns)
{
  dout(20) << __func__ << " r=" << r << dendl;
  ceph_assert(op->may_write());
  const osd_reqid_t &reqid = op->get_req<MOSDOp>()->get_reqid();
  mempool::osd_pglog::list<pg_log_entry_t> entries;
  entries.push_back(pg_log_entry_t(pg_log_entry_t::ERROR, soid,
                                   get_next_version(), eversion_t(), 0,
                                   reqid, utime_t(), r));
  if (ctx_for_op_returns) {
    entries.back().set_op_returns(*ctx_for_op_returns->ops);
    dout(20) << __func__ << " op_returns=" << entries.back().op_returns << dendl;
  }

There is a level-20 log line here that I can't see in the snippets:

  if (ctx_for_op_returns) {
    entries.back().set_op_returns(*ctx_for_op_returns->ops);
    dout(20) << __func__ << " op_returns=" << entries.back().op_returns << dendl;
  }

If that line was in fact emitted (just not captured in the snippets), ctx_for_op_returns has been passed by the caller.

#3

Updated by Radoslaw Zarzynski 3 months ago

The caller of the recorder is:

void PrimaryLogPG::execute_ctx(OpContext *ctx)
{
  // ...
  if (ctx->update_log_only) {
    if (result >= 0)
      do_osd_op_effects(ctx, m->get_connection());

    dout(20) << __func__ << " update_log_only -- result=" << result << dendl;
    // save just what we need from ctx
    MOSDOpReply *reply = ctx->reply;
    ctx->reply = nullptr;
    reply->get_header().data_off = (ctx->data_off ? *ctx->data_off : 0);

    if (result == -ENOENT) {
      reply->set_enoent_reply_versions(info.last_update,
                                       info.last_user_version);
    }
    reply->add_flags(CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK);
    // append to pg log for dup detection - don't save buffers for now
    record_write_error(op, soid, reply, result,
                       ctx->op->allows_returnvec() ? ctx : nullptr);
    close_op_ctx(ctx);
    return;
  }

The question is: what was the actual value of allows_returnvec()?

Let's track the allows_returnvec() paths:

bool OpInfo::allows_returnvec() const {
  return check_rmw(CEPH_OSD_RMW_FLAG_RETURNVEC);
}
void OpInfo::set_returnvec() { set_rmw_flags(CEPH_OSD_RMW_FLAG_RETURNVEC); }
int OpInfo::set_from_op(
  const MOSDOp *m,
  const OSDMap &osdmap)
{
  // client flags have no bearing on whether an op is a read, write, etc.
  clear();

  if (m->has_flag(CEPH_OSD_FLAG_RWORDERED)) {
    set_force_rwordered();
  }
  if (m->has_flag(CEPH_OSD_FLAG_RETURNVEC)) {
    set_returnvec();
  }
  return set_from_op(m->ops, m->get_pg(), osdmap);
}

So, for ctx_for_op_returns to be passed, the request message would have had to be flagged with CEPH_OSD_FLAG_RETURNVEC.

What can help disprove the hypothesis is ceph_osd_flag_name(), which maps these flags to the names printed in the message logs:

const char *ceph_osd_flag_name(unsigned flag)
{
  switch (flag) {
  case CEPH_OSD_FLAG_ACK: return "ack";
  case CEPH_OSD_FLAG_ONNVRAM: return "onnvram";
  case CEPH_OSD_FLAG_ONDISK: return "ondisk";
  case CEPH_OSD_FLAG_RETRY: return "retry";
  case CEPH_OSD_FLAG_READ: return "read";
  case CEPH_OSD_FLAG_WRITE: return "write";
  case CEPH_OSD_FLAG_ORDERSNAP: return "ordersnap";
  case CEPH_OSD_FLAG_PEERSTAT_OLD: return "peerstat_old";
  case CEPH_OSD_FLAG_BALANCE_READS: return "balance_reads";
  case CEPH_OSD_FLAG_PARALLELEXEC: return "parallelexec";
  case CEPH_OSD_FLAG_PGOP: return "pgop";
  case CEPH_OSD_FLAG_EXEC: return "exec";
  case CEPH_OSD_FLAG_EXEC_PUBLIC: return "exec_public";
  case CEPH_OSD_FLAG_LOCALIZE_READS: return "localize_reads";
  case CEPH_OSD_FLAG_RWORDERED: return "rwordered";
  case CEPH_OSD_FLAG_IGNORE_CACHE: return "ignore_cache";
  case CEPH_OSD_FLAG_SKIPRWLOCKS: return "skiprwlocks";
  case CEPH_OSD_FLAG_IGNORE_OVERLAY: return "ignore_overlay";
  case CEPH_OSD_FLAG_FLUSH: return "flush"; 
  case CEPH_OSD_FLAG_MAP_SNAP_CLONE: return "map_snap_clone";
  case CEPH_OSD_FLAG_ENFORCE_SNAPC: return "enforce_snapc";
  case CEPH_OSD_FLAG_REDIRECTED: return "redirected";
  case CEPH_OSD_FLAG_KNOWN_REDIR: return "known_if_redirected";
  case CEPH_OSD_FLAG_FULL_TRY: return "full_try";
  case CEPH_OSD_FLAG_FULL_FORCE: return "full_force";
  case CEPH_OSD_FLAG_IGNORE_REDIRECT: return "ignore_redirect";
  case CEPH_OSD_FLAG_RETURNVEC: return "returnvec";
  case CEPH_OSD_FLAG_SUPPORTSPOOLEIO: return "supports_pool_eio";
  default: return "???";
  }
}

From the provided snippets we can see that this flag has NOT been set (there is no "returnvec" among the op flags):

2024-01-26T10:52:24.791-0500 7f6a861a1700  1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] <== client.4499 172.21.9.34:0/3763700879 1 ==== osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8 ==== 284+0+28 (crc 0 0 0) 0x5623c974e780 con 0x5623ca3e6480
2024-01-26T10:52:24.783-0500 7f6a861a1700  1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] <== client.4499 172.21.9.34:0/3763700879 4 ==== osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e304) v8 ==== 284+0+28 (crc 0 0 0) 0x5623c974e000 con 0x5623ca360480
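
For illustration, the returnvec behavior is opt-in on the client side. A rough sketch of how a librados client would request it (the exact librados calls and signatures here are recalled from memory and should be treated as assumptions; the client in this ticket is neorados/librbd, which evidently did not set the flag):

// Sketch, assuming librados::OPERATION_RETURNVEC and the cmpext()/write()/
// operate() signatures below exist as written -- verify against librados.hpp.
#include <rados/librados.hpp>
#include <string>

int submit_cmpext_write(librados::IoCtx &ioctx, const std::string &oid,
                        librados::bufferlist cmp_bl, librados::bufferlist data_bl)
{
  librados::ObjectWriteOperation op;
  int cmpext_rval = 0;
  op.cmpext(512, cmp_bl, &cmpext_rval);  // compare cmp_bl against data at offset 512
  op.write(512, data_bl);                // then overwrite the same extent
  // Without OPERATION_RETURNVEC the request never carries CEPH_OSD_FLAG_RETURNVEC,
  // so set_returnvec() in OpInfo::set_from_op() above is never reached.
  return ioctx.operate(oid, &op, librados::OPERATION_RETURNVEC);
}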
#4

Updated by Radoslaw Zarzynski 3 months ago

Now the question is why the other reply carried the op_returns vector.

#5

Updated by Ilya Dryomov 3 months ago

  • Category deleted (Correctness/Safety)
  • Assignee changed from Radoslaw Zarzynski to Greg Farnum
  • Severity deleted (3 - minor)

Radoslaw Zarzynski wrote:

Now the question is why the other reply carried the op_returns vector.

By the other reply, do you mean the reply from step 7 (with user_version != 0)? I don't think we know whether it actually carried the op_returns vector -- only that it was "more populated" in the sense that user_version was set.

#6

Updated by Ilya Dryomov 3 months ago

  • Category set to Correctness/Safety
  • Assignee changed from Greg Farnum to Radoslaw Zarzynski

I'm puzzled by how this ended up being reassigned to Greg or how the Category got cleared. I haven't touched any of those fields...

#7

Updated by Ilya Dryomov 3 months ago

  • Regression changed from No to Yes
#8

Updated by Ilya Dryomov 3 months ago

  • Priority changed from High to Urgent

Bumping priority -- this would be a blocker from the RBD side for Squid.

#9

Updated by Radoslaw Zarzynski 3 months ago

Let's take a look at how a retry (a dup request) is handled:

void PrimaryLogPG::do_op(OpRequestRef& op)
{
  // ...
  // dup/resent?
  if (op->may_write() || op->may_cache()) {
    // warning: we will get back *a* request for this reqid, but not
    // necessarily the most recent.  this happens with flush and
    // promote ops, but we can't possible have both in our log where
    // the original request is still not stable on disk, so for our
    // purposes here it doesn't matter which one we get.
    eversion_t version;
    version_t user_version;
    int return_code = 0;
    vector<pg_log_op_return_item_t> op_returns;
    bool got = check_in_progress_op(
      m->get_reqid(), &version, &user_version, &return_code, &op_returns);
    if (got) {
      dout(3) << __func__ << " dup " << m->get_reqid()
              << " version " << version << dendl;
      if (already_complete(version)) {
        osd->reply_op_error(op, return_code, version, user_version, op_returns);
      } else {
        dout(10) << " waiting for " << version << " to commit" << dendl;
        // always queue ondisk waiters, so that we can requeue if needed
        waiting_for_ondisk[version].emplace_back(op, user_version, return_code,
                                                 op_returns);
        op->mark_delayed("waiting for ondisk");
      }
      return;
    }
  }

There are two paths:

1. If the underlying operation is completed (the already_complete() true case), a response based solely on the PGLog (regardless of whether it has been filled with proper values in record_write_error()) is prepared:

      if (already_complete(version)) {
        osd->reply_op_error(op, return_code, version, user_version, op_returns);

It's a no-brainer. However, this is NOT what has happened in the case described by Ilya:

2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100) not committed, returning false

2. If the operation is still in progress, a more complex path through waiting_for_ondisk is taken:

      } else {
        dout(10) << " waiting for " << version << " to commit" << dendl;
        // always queue ondisk waiters, so that we can requeue if needed
        waiting_for_ondisk[version].emplace_back(op, user_version, return_code,
                                                 op_returns);
        op->mark_delayed("waiting for ondisk");
      }

The entry is translated into a response in eval_repop(). In this case the method first dealt with the response to the retry (the dup):

void PrimaryLogPG::eval_repop(RepGather *repop)
{
  // ...
    // send dup commits, in order
    auto it = waiting_for_ondisk.find(repop->v);
    if (it != waiting_for_ondisk.end()) {
      ceph_assert(waiting_for_ondisk.begin()->first == repop->v);
      for (auto& i : it->second) {
        int return_code = repop->r;
        if (return_code >= 0) {
          return_code = std::get<2>(i);
        }
        osd->reply_op_error(std::get<0>(i), return_code, repop->v,
                            std::get<1>(i), std::get<3>(i));
      }
      waiting_for_ondisk.erase(it);
    }

    publish_stats_to_osd();
    // ...

Although return_code can be altered in some cases, the user version (fourth parameter to reply_op_error) shouldn't.

  void reply_op_error(OpRequestRef op, int err, eversion_t v, version_t uv,
                      std::vector<pg_log_op_return_item_t> op_returns);

I think this is the place where the reply with uv 0 was sent -- it's the response to the retry. The `0` came from the PGLog, as that's exactly what the recorder put there. Let's take another look:

void PrimaryLogPG::record_write_error(OpRequestRef op, const hobject_t &soid,
                                      MOSDOpReply *orig_reply, int r,
                                      OpContext *ctx_for_op_returns)
{ 
  // ...
  entries.push_back(pg_log_entry_t(pg_log_entry_t::ERROR, soid,
                                   get_next_version(), eversion_t(), 0,
                                   reqid, utime_t(), r));
  if (ctx_for_op_returns) {
    entries.back().set_op_returns(*ctx_for_op_returns->ops);
    dout(20) << __func__ << " op_returns=" << entries.back().op_returns << dendl;
  }
  pg_log_entry_t(int _op, const hobject_t& _soid,
                const eversion_t& v, const eversion_t& pv,
                version_t uv,
                const osd_reqid_t& rid, const utime_t& mt,
                int return_code)
   : soid(_soid), reqid(rid), version(v), prior_version(pv), user_version(uv),
     mtime(mt), return_code(return_code), op(_op),
     invalid_hash(false), invalid_pool(false) {
    snaps.reassign_to_mempool(mempool::mempool_osd_pglog);
  }

uv is always zero on these two paths, and I'm afraid they haven't changed in the last 4 years.

To summarize: a response to a retry is sent with uv == 0 and contains the op_returns vector only if the client sets the returnvec flag (as shown in the previous comments). I guess this is the differentiation point with the original message. Let's take a look.

The response to the original request came from eval_repop() as well; it was sent after the response to the retry:

void PrimaryLogPG::eval_repop(RepGather *repop)
{
  // ...
    // send dup commits, in order
    // ...

    publish_stats_to_osd();

    dout(10) << " removing " << *repop << dendl;
    ceph_assert(!repop_queue.empty());
    dout(20) << "   q front is " << *repop_queue.front() << dendl;
    if (repop_queue.front() == repop) {
      RepGather *to_remove = nullptr;
      while (!repop_queue.empty() &&
             (to_remove = repop_queue.front())->all_committed) {
        repop_queue.pop_front();
        for (auto p = to_remove->on_success.begin();
             p != to_remove->on_success.end();
             to_remove->on_success.erase(p++)) {
          (*p)();
        }
        remove_repop(to_remove);
      }
    }

The callback here came from the same record_write_error() that puts uv = 0 into the PGLog (for the sake of retries):

void PrimaryLogPG::record_write_error(OpRequestRef op, const hobject_t &soid,
                                      MOSDOpReply *orig_reply, int r,
                                      OpContext *ctx_for_op_returns)
{ 
  // ...
  struct OnComplete {
    PrimaryLogPG *pg;
    OpRequestRef op;
    boost::intrusive_ptr<MOSDOpReply> orig_reply;
    int r;
    OnComplete(
      PrimaryLogPG *pg, 
      OpRequestRef op,
      MOSDOpReply *orig_reply,
      int r)
      : pg(pg), op(op),
        orig_reply(orig_reply, false /* take over ref */), r(r)
      {}
    void operator()() {
      ldpp_dout(pg, 20) << "finished " << __func__ << " r=" << r << dendl;
      auto m = op->get_req<MOSDOp>();
      MOSDOpReply *reply = orig_reply.detach();
      ldpp_dout(pg, 10) << " sending commit on " << *m << " " << reply << dendl;
      pg->osd->send_message_osd_client(reply, m->get_connection());
    }
  };

  ObcLockManager lock_manager;
  submit_log_entries(
    entries,
    std::move(lock_manager),
    std::optional<std::function<void(void)> >(
      OnComplete(this, op, orig_reply, r)),
    op,
    r);
}

The final look goes to how the MOSDOpReply is crafted in the error scenario:

  bool ignore_out_data = false;
  if (!ctx->op_t->empty() &&
      op->may_write() &&
      result >= 0) {
    // successful update
    // ...
  }

  // prepare the reply
  ctx->reply = new MOSDOpReply(m, result, get_osdmap_epoch(), 0,
                               ignore_out_data);
  // ...
  ctx->reply->set_reply_versions(ctx->at_version, ctx->user_at_version);
  // ...
  if (ctx->update_log_only) {
    // ...
    // append to pg log for dup detection - don't save buffers for now
    record_write_error(op, soid, reply, result,
                       ctx->op->allows_returnvec() ? ctx : nullptr);

To summarize: for the reply to the original request the presence of returnvec in the error scenario (result < 0) is unconditional.

#10

Updated by Radoslaw Zarzynski 3 months ago

The intuitive approach (not yet saying: proper or good) would be to replicate the behavior of the main path and record the returnvec on result < 0 regardless of the client's wishes.
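
A minimal sketch of that direction (an illustration of the idea only, not necessarily the change that landed in PR 55570): in execute_ctx(), pass the ctx into record_write_error() whenever the op failed, instead of gating it solely on the client's returnvec flag:

    // append to pg log for dup detection - don't save buffers for now
    // (sketch: keep the per-op return values on error as well, so a dup reply
    //  built from the PG log can carry them even without CEPH_OSD_FLAG_RETURNVEC)
    record_write_error(op, soid, reply, result,
                       (result < 0 || ctx->op->allows_returnvec()) ? ctx : nullptr);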

#11

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 55570
#12

Updated by Radoslaw Zarzynski 2 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef,quincy
#13

Updated by Backport Bot 2 months ago

  • Copied to Backport #64575: reef: Incorrect behavior on combined cmpext+write ops in the face of session resets added
#14

Updated by Backport Bot 2 months ago

  • Copied to Backport #64576: quincy: Incorrect behavior on combined cmpext+write ops in the face of session resets added
#15

Updated by Backport Bot 2 months ago

  • Tags set to backport_processed
