Bug #39304

short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p

Added by Yuri Weinstein 10 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
luminous, mimic, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/nautilus-p2p
Component(RADOS):
Pull request ID:
Crash signature:

Description

Run: http://pulpito.ceph.com/yuriw-2019-04-13_15:18:33-upgrade:nautilus-p2p-wip-yuri6-testing-2019-04-12-1636-nautilus-distro-basic-smithi/
Job: 3843006
Logs: /a/yuriw-2019-04-13_15:18:33-upgrade:nautilus-p2p-wip-yuri6-testing-2019-04-12-1636-nautilus-distro-basic-smithi/3843006/teuthology.log

2019-04-13T17:25:02.255 INFO:tasks.rados.rados.0.smithi074.stdout:2911:  expect (ObjNum 226 snap 0 seq_num 226)
2019-04-13T17:25:02.256 INFO:tasks.rados.rados.0.smithi074.stdout:2912: delete oid 344 current snap is 0
2019-04-13T17:25:02.257 INFO:tasks.rados.rados.0.smithi074.stdout:2904:  finishing write tid 3 to smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
2019-04-13T17:25:02.257 INFO:tasks.rados.rados.0.smithi074.stderr:Error: finished tid 3 when last_acked_tid was 5
2019-04-13T17:25:02.257 INFO:tasks.rados.rados.0.smithi074.stderr:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f07167fc700 time 2019-04-13 17:25:02.253653
2019-04-13T17:25:02.257 INFO:tasks.rados.rados.0.smithi074.stderr:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/test/osd/RadosModel.h: 905: abort()
2019-04-13T17:25:02.257 INFO:tasks.rados.rados.0.smithi074.stderr: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
2019-04-13T17:25:02.258 INFO:tasks.rados.rados.0.smithi074.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xd8) [0x7f072b62181a]
2019-04-13T17:25:02.258 INFO:tasks.rados.rados.0.smithi074.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x5ea) [0x55c3f4bad66a]
2019-04-13T17:25:02.258 INFO:tasks.rados.rados.0.smithi074.stderr: 3: (write_callback(void*, void*)+0x19) [0x55c3f4bcb139]
2019-04-13T17:25:02.258 INFO:tasks.rados.rados.0.smithi074.stderr: 4: (()+0x88176) [0x7f0734a60176]
2019-04-13T17:25:02.259 INFO:tasks.rados.rados.0.smithi074.stderr: 5: (()+0x4e659) [0x7f0734a26659]
2019-04-13T17:25:02.259 INFO:tasks.rados.rados.0.smithi074.stderr: 6: (Finisher::finisher_thread_entry()+0x156) [0x7f072b665096]
2019-04-13T17:25:02.259 INFO:tasks.rados.rados.0.smithi074.stderr: 7: (()+0x7e25) [0x7f0729d41e25]
2019-04-13T17:25:02.259 INFO:tasks.rados.rados.0.smithi074.stderr: 8: (clone()+0x6d) [0x7f07289f4bad]
2019-04-13T17:25:02.264 INFO:teuthology.orchestra.run.smithi204:Running:
2019-04-13T17:25:02.264 INFO:teuthology.orchestra.run.smithi204:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_historic_ops
2019-04-13T17:25:02.281 INFO:teuthology.orchestra.run.smithi132:Running:
2019-04-13T17:25:02.281 INFO:teuthology.orchestra.run.smithi132:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'osd.*' injectargs --osd_enable_op_tracker=false

Related issues

Copied to RADOS - Backport #39719: luminous: short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p Resolved
Copied to RADOS - Backport #39720: mimic: short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p Resolved
Copied to RADOS - Backport #39721: nautilus: short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p Resolved

History

#1 Updated by David Zafman 10 months ago

After writing tids 1, 2, and 3, the output shows finishing tids 1, 2, 3, 4, and 5. We can see that the third set of writes was interrupted by a change of primary: the tid 3 write at offset 2775265 had RETRY=2 after the primary came back to osd.0.

The sequence of writes from teuthology.log:

2019-04-13T17:21:49.394 INFO:tasks.rados.rados.0.smithi074.stdout:109:  writing smithi07488316-109 ... from 615156 to 1170990 tid 1
2019-04-13T17:21:49.399 INFO:tasks.rados.rados.0.smithi074.stdout:109:  writing smithi07488316-109 ... from 1603467 to 2019828 tid 2
2019-04-13T17:21:49.407 INFO:tasks.rados.rados.0.smithi074.stdout:109:  writing smithi07488316-109 ... from 2816389 to 3509884 tid 3
2019-04-13T17:22:13.986 INFO:tasks.rados.rados.0.smithi074.stdout:825:  writing smithi07488316-109 ... from 746409 to 1314242 tid 1
2019-04-13T17:22:13.993 INFO:tasks.rados.rados.0.smithi074.stdout:825:  writing smithi07488316-109 ... from 1901600 to 2445841 tid 2
2019-04-13T17:22:13.999 INFO:tasks.rados.rados.0.smithi074.stdout:825:  writing smithi07488316-109 ... from 3065949 to 3546409 tid 3
2019-04-13T17:24:47.743 INFO:tasks.rados.rados.0.smithi074.stdout:2904:  writing smithi07488316-109 ... from 726564 to 1161491 tid 1
2019-04-13T17:24:47.751 INFO:tasks.rados.rados.0.smithi074.stdout:2904:  writing smithi07488316-109 ... from 1874534 to 2312847 tid 2
2019-04-13T17:24:47.758 INFO:tasks.rados.rados.0.smithi074.stdout:2904:  writing smithi07488316-109 ... from 2775265 to 3126564 tid 3
Primary is osd.0

2019-04-13 17:21:49.387 7ffa2a7b0700 10 osd.0 pg_epoch: 2002 pg[38.4( v 2002'34 (0'0,2002'34] local-lis/les=1998/1999 n=11 ec=1998/1998 lis/c 1998/1998 les/c/f 1999/1999/0 1998/1998/1998) [0,3] r=0 lpr=1998 crt=2002'34 lcod 2001'33 mlcod 2001'32 active+clean]  sending reply on osd_op(client.28586.0:481 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 615156~555834,stat] snapc 0=[] ondisk+write+known_if_redirected e2002) v8 0x557503530000
2019-04-13 17:21:49.391 7ffa2a7b0700 10 osd.0 pg_epoch: 2002 pg[38.4( v 2002'35 (0'0,2002'35] local-lis/les=1998/1999 n=11 ec=1998/1998 lis/c 1998/1998 les/c/f 1999/1999/0 1998/1998/1998) [0,3] r=0 lpr=1998 crt=2002'35 lcod 2002'34 mlcod 2001'33 active+clean]  sending reply on osd_op(client.28586.0:482 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 1603467~416361] snapc 0=[] ondisk+write+known_if_redirected e2002) v8 0x557503530000
2019-04-13 17:21:49.399 7ffa2a7b0700 10 osd.0 pg_epoch: 2002 pg[38.4( v 2002'37 (0'0,2002'37] local-lis/les=1998/1999 n=11 ec=1998/1998 lis/c 1998/1998 les/c/f 1999/1999/0 1998/1998/1998) [0,3] r=0 lpr=1998 luod=2002'36 crt=2002'37 lcod 2002'35 mlcod 2002'34 active+clean]  sending reply on osd_op(client.28586.0:483 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 2816389~693495] snapc 0=[] ondisk+write+known_if_redirected e2002) v8 0x557503530000
2019-04-13 17:22:13.980 7ffa2a7b0700 10 osd.0 pg_epoch: 2020 pg[38.4( v 2020'159 (2004'100,2020'159] local-lis/les=1998/1999 n=36 ec=1998/1998 lis/c 1998/1998 les/c/f 1999/1999/0 1998/1998/1998) [0,3] r=0 lpr=1998 crt=2020'159 lcod 2019'158 mlcod 2019'157 active+clean]  sending reply on osd_op(client.28586.0:3390 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 746409~567833] snapc 0=[] ondisk+write+known_if_redirected e2020) v8 0x557502a82780
2019-04-13 17:22:13.986 7ffa2a7b0700 10 osd.0 pg_epoch: 2020 pg[38.4( v 2020'160 (2004'100,2020'160] local-lis/les=1998/1999 n=36 ec=1998/1998 lis/c 1998/1998 les/c/f 1999/1999/0 1998/1998/1998) [0,3] r=0 lpr=1998 crt=2020'160 lcod 2020'159 mlcod 2019'158 active+clean]  sending reply on osd_op(client.28586.0:3391 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 1901600~544241] snapc 0=[] ondisk+write+known_if_redirected e2020) v8 0x557502a82780
2019-04-13 17:22:13.992 7ffa2a7b0700 10 osd.0 pg_epoch: 2020 pg[38.4( v 2020'162 (2004'100,2020'162] local-lis/les=1998/1999 n=36 ec=1998/1998 lis/c 1998/1998 les/c/f 1999/1999/0 1998/1998/1998) [0,3] r=0 lpr=1998 luod=2020'161 crt=2020'162 lcod 2020'160 mlcod 2020'159 active+clean]  sending reply on osd_op(client.28586.0:3392 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 3065949~480460] snapc 0=[] ondisk+write+known_if_redirected e2020) v8 0x557502a82780

Primary fails over to osd.3

2019-04-13 17:24:47.736 7eff26f07700 10 osd.3 pg_epoch: 2119 pg[38.4( v 2119'370 (2075'300,2119'370] local-lis/les=2093/2094 n=14 ec=1998/1998 lis/c 2093/2093 les/c/f 2094/2094/0 2090/2093/2090) [3,5] r=0 lpr=2093 crt=2119'370 lcod 2117'369 mlcod 2117'368 active+clean]  sending reply on osd_op(client.28586.0:10662 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 726564~434927] snapc 0=[] ondisk+write+known_if_redirected e2119) v8 0x55a927f04500
2019-04-13 17:24:47.744 7eff26f07700 10 osd.3 pg_epoch: 2119 pg[38.4( v 2119'371 (2075'300,2119'371] local-lis/les=2093/2094 n=14 ec=1998/1998 lis/c 2093/2093 les/c/f 2094/2094/0 2090/2093/2090) [3,5] r=0 lpr=2093 crt=2119'371 lcod 2119'370 mlcod 2117'369 active+clean]  sending reply on osd_op(client.28586.0:10663 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 1874534~438313] snapc 0=[] ondisk+write+known_if_redirected e2119) v8 0x55a927f04500

Primary fails back to osd.0

2019-04-13 17:25:02.167 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean]  sending reply on osd_op(client.28586.0:10664 38.4 38:21f0cbf0:::smithi07488316-109 ...:head [write 2775265~351299] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e2130) v8 0x557502d09180
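The failed assertion comes from the test harness, not the OSD. A minimal Python sketch of the tid-ordering check that src/test/osd/RadosModel.h's WriteOp::_finish enforces (the class and field names below are illustrative, not the real C++ API): replies for one object must arrive in increasing tid order, so an ack for tid 3 after tid 5 has already been acked is exactly the failure in this ticket.

```python
class WriteOp:
    """Illustrative model of RadosModel.h's per-object ack-ordering check."""

    def __init__(self):
        self.last_acked_tid = 0
        self.error = None

    def finish(self, tid):
        # A reply with a tid at or below last_acked_tid means the
        # cluster delivered completions out of order.
        if tid <= self.last_acked_tid:
            self.error = (f"Error: finished tid {tid} when "
                          f"last_acked_tid was {self.last_acked_tid}")
            return
        self.last_acked_tid = tid


op = WriteOp()
for tid in (1, 2, 4, 5, 3):   # the tid 3 reply is delayed by the retry
    op.finish(tid)
print(op.error)   # Error: finished tid 3 when last_acked_tid was 5
```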

#2 Updated by David Zafman 10 months ago

Before the primary changed from osd.3 back to osd.0, these four operations came in with versions 2119'370 (write 726564), 2119'371 (write 1874534), 2119'372 (write 2775265), and 2119'373 (setattr/truncate):

2019-04-13 17:24:47.734 7eff2af0f700 10 osd.3 pg_epoch: 2119 pg[38.4( v 2117'369 (2075'300,2117'369] local-lis/les=2093/2094 n=14 ec=1998/1998 lis/c 2093/2093 les/c/f 2094/2094/0 2090/2093/2090) [3,5] r=0 lpr=2093 crt=2117'369 lcod 2117'368 mlcod 2117'368 active+clean] execute_ctx 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 726564~434927] ov 2102'343 av 2119'370 snapc 0=[] snapset 0=[]:{}
2019-04-13 17:24:47.742 7eff2af0f700 10 osd.3 pg_epoch: 2119 pg[38.4( v 2119'370 (2075'300,2119'370] local-lis/les=2093/2094 n=14 ec=1998/1998 lis/c 2093/2093 les/c/f 2094/2094/0 2090/2093/2090) [3,5] r=0 lpr=2093 crt=2119'370 lcod 2117'369 mlcod 2117'369 active+clean] execute_ctx 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 1874534~438313] ov 2119'370 av 2119'371 snapc 0=[] snapset 0=[]:{}
2019-04-13 17:24:47.749 7eff26f07700 10 osd.3 pg_epoch: 2119 pg[38.4( v 2119'371 (2075'300,2119'371] local-lis/les=2093/2094 n=14 ec=1998/1998 lis/c 2093/2093 les/c/f 2094/2094/0 2090/2093/2090) [3,5] r=0 lpr=2093 crt=2119'371 lcod 2119'370 mlcod 2119'370 active+clean] execute_ctx 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2775265~351299] ov 2119'371 av 2119'372 snapc 0=[] snapset 0=[]:{}
2019-04-13 17:24:47.750 7eff26f07700 10 osd.3 pg_epoch: 2119 pg[38.4( v 2119'372 (2075'300,2119'372] local-lis/les=2093/2094 n=14 ec=1998/1998 lis/c 2093/2093 les/c/f 2094/2094/0 2090/2093/2090) [3,5] r=0 lpr=2093 luod=2119'371 crt=2119'372 lcod 2119'370 mlcod 2119'370 active+clean] execute_ctx 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (358),truncate 3126564] ov 2119'372 av 2119'373 snapc 0=[] snapset 0=[]:{}

But already_complete() determines that the client.28586.0:10665 setattr at 2119'373 has finished, even though the client also sent the client.28586.0:10664 write at 2119'372:

2019-04-13 17:25:02.161 7ffa2a7b0700 10 osd.0 2130 dequeue_op 0x557502c48420 prio 63 cost 365 latency 0.001541 osd_op(client.28586.0:10665 38.4 38.fd30f84 (undecoded) ondisk+retry+write+known_if_redirected e2130) v8 pg pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean]
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 2130 share_map client.28586 172.21.15.74:0/3781542038 2130
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 2130 should_share_map client.28586 172.21.15.74:0/3781542038 2130
2019-04-13 17:25:02.161 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] _handle_message: 0x557502c48420
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] do_op: op osd_op(client.28586.0:10665 38.4 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (358),truncate 3126564] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e2130) v8
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] op_has_sufficient_caps session=0x557502e36800 pool=38 (unique_pool_15 ) pool_app_metadata={rados={}} need_read_cap=0 need_write_cap=1 classes=[] -> yes
2019-04-13 17:25:02.161 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] do_op osd_op(client.28586.0:10665 38.4 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (358),truncate 3126564] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e2130) v8 may_write -> write-ordered flags ondisk+retry+write+known_if_redirected
2019-04-13 17:25:02.161 7ffa2a7b0700  3 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] do_op dup client.28586.0:10665 version 2119'373
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] already_complete: 2119'373
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] already_complete: repgather(0x557502bc1e60 2130'374 rep_tid=3338 committed?=0 r=0)
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] already_complete: repgather(0x557502bc1e60 2130'374 rep_tid=3338 committed?=0 r=0) (*i)->v past v
2019-04-13 17:25:02.161 7ffa2a7b0700 20 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] already_complete: returning true
2019-04-13 17:25:02.161 7ffa3e42f700 20 journal do_write latency 0.000250
2019-04-13 17:25:02.161 7ffa3e42f700 20 journal do_write queueing finishers through seq 1375327
2019-04-13 17:25:02.161 7ffa3e42f700 10 journal queue_completions_thru seq 1375327 queueing seq 1375327 0x5575043e00f0 lat 0.000407
2019-04-13 17:25:02.161 7ffa2a7b0700  1 -- [v2:172.21.15.132:6800/296244,v1:172.21.15.132:6801/296244] --> 172.21.15.74:0/3781542038 -- osd_op_reply(10665 smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo [setxattr _header (358),truncate 3126564] v2119'373 uv373 ondisk = 0) v8 -- 0x557502fc9680 con 0x557500063c00

This reply was sent out of order; the retried write 10664 was re-executed and acknowledged only afterwards:

2019-04-13 17:25:02.167 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] _handle_message: 0x5575025f3600
2019-04-13 17:25:02.167 7ffa2a7b0700  7 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] do_repop_reply: tid 3338 op  ack_type 4 from 3
2019-04-13 17:25:02.167 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 luod=2119'373 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] repop_all_committed: repop tid 3338 all committed
2019-04-13 17:25:02.167 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] eval_repop repgather(0x557502bc1e60 2130'374 rep_tid=3338 committed?=1 r=0)
2019-04-13 17:25:02.167 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean]  commit: repgather(0x557502bc1e60 2130'374 rep_tid=3338 committed?=1 r=0)
2019-04-13 17:25:02.167 7ffa2a7b0700 15 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean] log_op_stats osd_op(client.28586.0:10664 38.4 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2775265~351299] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e2130) v8 inb 351299 outb 0 lat 0.008033
2019-04-13 17:25:02.167 7ffa2a7b0700 10 osd.0 pg_epoch: 2130 pg[38.4( v 2130'374 (2119'372,2130'374] local-lis/les=2125/2126 n=14 ec=1998/1998 lis/c 2125/2125 les/c/f 2126/2126/0 2120/2125/2125) [0,3] r=0 lpr=2125 crt=2130'374 lcod 2119'373 mlcod 0'0 active+clean]  sending reply on osd_op(client.28586.0:10664 38.4 38:21f0cbf0:::smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2775265~351299] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e2130) v8 0x557502d09180
2019-04-13 17:25:02.167 7ffa2a7b0700  1 -- [v2:172.21.15.132:6800/296244,v1:172.21.15.132:6801/296244] --> 172.21.15.74:0/3781542038 -- osd_op_reply(10664 smithi07488316-109 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo [write 2775265~351299] v2130'374 uv374 ondisk = 0) v8 -- 0x557502d09180 con 0x557500063c00
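The already_complete walk visible in the log above can be sketched roughly in Python (names and structures are illustrative, not the real PrimaryLogPG code): an op at version v counts as complete once every in-flight repop at or below v has committed, and a repop whose version is past v ends the scan. Here the only pending repop is the uncommitted 2130'374, which is past 2119'373, so the setattr is answered immediately as a dup.

```python
def already_complete(v, repop_queue):
    """Rough model of the already_complete() scan over in-flight repops."""
    for repop in repop_queue:
        if repop["v"] > v:
            # "(*i)->v past v" -- nothing at or below v is still pending
            return True
        if not repop["committed"]:
            return False
    return True


# State from the log: one uncommitted repop at 2130'374 (the retried
# write 10664), and the retried setattr being checked at 2119'373.
repops = [{"v": (2130, 374), "committed": False}]
print(already_complete((2119, 373), repops))  # True: setattr acked as a dup
print(already_complete((2130, 374), repops))  # False: the write must commit
```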

#3 Updated by David Zafman 10 months ago

The short pg log settings from 1.1.short_pg_log.yaml probably produce this:

osd_max_pg_log_entries: 2
osd_min_pg_log_entries: 1

This breaks duplicate detection, so a write retried because of the [3,5] -> [0,3] PG change is re-executed, and we get out-of-order replies.

#4 Updated by Neha Ojha 10 months ago

  • Subject changed from "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p to short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p
  • Status changed from New to 12

2119'372 (write 2775265) does not get identified as a dup when the log boundaries are (2119'372,2119'373], while 2119'373 (setattr/trunc) does and hence gets acked first.
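A hypothetical sketch of why the short log misses the dup (a simplified model, not the real pg_log lookup): a retried reqid is recognized only if its entry is still inside the log window (tail, head]. With osd_max_pg_log_entries=2, the window here is (2119'372, 2119'373], so only the setattr survives in the log.

```python
def find_dup(pg_log, reqid):
    """pg_log holds only the entries still within (tail, head]."""
    for entry in pg_log:
        if entry["reqid"] == reqid:
            return entry["version"]   # recognized as a dup: ack immediately
    return None                        # not found: the op is re-executed


# Trimmed log with boundaries (2119'372, 2119'373]: only the setattr remains.
pg_log = [{"reqid": "client.28586.0:10665", "version": (2119, 373)}]

print(find_dup(pg_log, "client.28586.0:10665"))  # (2119, 373), acked first
print(find_dup(pg_log, "client.28586.0:10664"))  # None, write is re-done
```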

#5 Updated by David Zafman 10 months ago

  • Assignee set to David Zafman

Before fix:

{
   "pg_log_t": {
       "head": "26'102",
       "tail": "26'101",
       "log": [
           {
               "op": "modify",
               "object": "1:6aa5e219:::obj-102:head",
               "version": "26'102",
               "prior_version": "0'0",
               "reqid": "client.4470.0:1",
               "extra_reqids": [],
               "mtime": "2019-04-30 15:05:12.402443",
               "return_code": 0,
               "mod_desc": {
                   "object_mod_desc": {
                       "can_local_rollback": false,
                       "rollback_info_completed": false,
                       "ops": []
                   }
               }
           }
       ],
       "dups": []
   },
   "pg_missing_t": {
       "missing": [],
       "may_include_deletes": true
   }
}

After fix:

{
    "pg_log_t": {
        "head": "26'102",
        "tail": "26'101",
        "log": [
            {
                "op": "modify",
                "object": "1:6aa5e219:::obj-102:head",
                "version": "26'102",
                "prior_version": "0'0",
                "reqid": "client.4470.0:1",
                "extra_reqids": [],
                "mtime": "2019-04-30 16:00:06.738212",
                "return_code": 0,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            }
        ],
        "dups": [
            {
                "reqid": "client.4167.0:1",
                "version": "26'1",
                "user_version": "1",
                "return_code": "0" 
            },
            {
                "reqid": "client.4170.0:1",
                "version": "26'2",
                "user_version": "2",
                "return_code": "0" 
            },
            {
                "reqid": "client.4173.0:1",
                "version": "26'3",
                "user_version": "3",
                "return_code": "0" 
            },
            {
                "reqid": "client.4176.0:1",
                "version": "26'4",
                "user_version": "4",
                "return_code": "0" 
            },
            {
                "reqid": "client.4179.0:1",
                "version": "26'5",
                "user_version": "5",
                "return_code": "0" 
            },
            {
                "reqid": "client.4182.0:1",
                "version": "26'6",
                "user_version": "6",
                "return_code": "0" 
            },
            {
                "reqid": "client.4185.0:1",
                "version": "26'7",
                "user_version": "7",
                "return_code": "0" 
            },
            {
                "reqid": "client.4188.0:1",
                "version": "26'8",
                "user_version": "8",
                "return_code": "0" 
            },
...
        ]
    },
    "pg_missing_t": {
        "missing": [],
        "may_include_deletes": true
    }
}

#6 Updated by David Zafman 10 months ago

I have a version that not only copies the existing dups but, when copy_up_to() excludes any log entries, also adds those excluded entries to the dups.

pg[1.0( v 26'102 (26'100,26'102] local-lis/les=28/29 n=102 ec=21/21 lis/c 25/25 les/c/f 26/26/0 27/28/28) [4,3,5]/[0,1,2] backfill=[3,4,5] r=0 lpr=28 pi=[25,28)/1 crt=26'102 lcod 26'101 mlcod 0'0 remapped mbc={}] activate calling copy_up_to() 1
pg[1.0( v 26'102 (26'100,26'102] local-lis/les=28/29 n=102 ec=21/21 lis/c 25/25 les/c/f 26/26/0 27/28/28) [4,3,5]/[0,1,2] backfill=[3,4,5] r=0 lpr=28 pi=[25,28)/1 crt=26'102 lcod 26'101 mlcod 0'0 remapped mbc={}] activate calling copy_up_to() log size 2
pg[1.0( v 26'102 (26'100,26'102] local-lis/les=28/29 n=102 ec=21/21 lis/c 25/25 les/c/f 26/26/0 27/28/28) [4,3,5]/[0,1,2] backfill=[3,4,5] r=0 lpr=28 pi=[25,28)/1 crt=26'102 lcod 26'101 mlcod 0'0 remapped mbc={}] activate calling copy_up_to() dups size 100
pg[1.0( v 26'102 (26'100,26'102] local-lis/les=28/29 n=102 ec=21/21 lis/c 25/25 les/c/f 26/26/0 27/28/28) [4,3,5]/[0,1,2] backfill=[3,4,5] r=0 lpr=28 pi=[25,28)/1 crt=26'102 lcod 26'101 mlcod 0'0 remapped mbc={}] activate after copy_up_to() log size 1
pg[1.0( v 26'102 (26'100,26'102] local-lis/les=28/29 n=102 ec=21/21 lis/c 25/25 les/c/f 26/26/0 27/28/28) [4,3,5]/[0,1,2] backfill=[3,4,5] r=0 lpr=28 pi=[25,28)/1 crt=26'102 lcod 26'101 mlcod 0'0 remapped mbc={}] activate after copy_up_to() dups size 101
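That behaviour can be modelled with a short Python sketch (the function below is illustrative; the real copy_up_to() is C++ with a different signature): keep only the newest max_entries log entries, and append anything excluded to the dups so the request IDs remain available for duplicate detection, matching the log-size 2 -> 1 and dups-size 100 -> 101 transition in the trace above.

```python
def copy_up_to(log, dups, max_entries):
    """Model: trim the log to max_entries, moving excluded entries to dups."""
    kept = log[-max_entries:] if max_entries else []
    excluded = log[:len(log) - len(kept)]
    new_dups = dups + [
        {"reqid": e["reqid"], "version": e["version"]} for e in excluded
    ]
    return kept, new_dups


# Mirrors the trace: log size 2 -> 1, dups size 100 -> 101.
log = [{"reqid": f"client.{i}", "version": (26, 100 + i)} for i in (1, 2)]
dups = [{"reqid": f"client.{i}", "version": (26, i)} for i in range(100)]
kept, new_dups = copy_up_to(log, dups, 1)
print(len(kept), len(new_dups))  # 1 101
```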

#7 Updated by David Zafman 10 months ago

  • Status changed from 12 to In Progress
  • Pull request ID set to 27941

#8 Updated by David Zafman 10 months ago

  • Status changed from In Progress to Fix Under Review

#9 Updated by David Zafman 9 months ago

  • Status changed from Fix Under Review to Resolved

#10 Updated by David Zafman 9 months ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous, mimic, nautilus
  • Pull request ID changed from 27941 to 27914

#11 Updated by David Zafman 9 months ago

  • Copied to Backport #39719: luminous: short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p added

#12 Updated by David Zafman 9 months ago

  • Copied to Backport #39720: mimic: short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p added

#13 Updated by David Zafman 9 months ago

  • Copied to Backport #39721: nautilus: short pg log+nautilus-p2p-stress-split: "Error: finished tid 3 when last_acked_tid was 5" in upgrade:nautilus-p2p added

#14 Updated by David Zafman 7 months ago

  • Status changed from Pending Backport to Resolved
