Project

General

Profile

Bug #42328

osd/PrimaryLogPG.cc: 3962: ceph_abort_msg("out of order op")

Added by Mykola Golub 11 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

Observing on the recent master when running rbd suite [1]:

2019-10-15T09:03:57.023 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr:2019-10-15T09:03:57.025+0000 7fe9d0384700 -1 osd.0 pg_epoch: 34 pg[2.3b( v 34'214 (0'0,34'214] local-lis/les=13/14 n=14 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [0,1] r=0 lpr=13 luod=34'213 crt=34'213 lcod 34'212 mlcod 34'212 active+clean ps=[1b~5]] bad op order, already applied 31182 > this 31181
2019-10-15T09:03:57.026 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6149-g3ebc546/rpm/el7/BUILD/ceph-15.0.0-6149-g3ebc546/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)' thread 7fe9d0384700 time 2019-10-15T09:03:57.025331+0000
2019-10-15T09:03:57.026 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-6149-g3ebc546/rpm/el7/BUILD/ceph-15.0.0-6149-g3ebc546/src/osd/PrimaryLogPG.cc: 3962: ceph_abort_msg("out of order op")
2019-10-15T09:03:57.027 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: ceph version 15.0.0-6149-g3ebc546 (3ebc5465d42ca4df6f6d8453da743de24a676228) octopus (dev)
2019-10-15T09:03:57.027 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x558ccdc73cc0]
2019-10-15T09:03:57.027 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 2: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x187e) [0x558ccdeabdce]
2019-10-15T09:03:57.027 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 3: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x324f) [0x558ccdeaf16f]
2019-10-15T09:03:57.027 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc8d) [0x558ccdeb4f6d]
2019-10-15T09:03:57.027 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x558ccdd533c9]
2019-10-15T09:03:57.028 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 6: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x558ccdf98299]
2019-10-15T09:03:57.028 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x558ccdd6e58a]
2019-10-15T09:03:57.028 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558cce33ff56]
2019-10-15T09:03:57.028 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558cce342aa0]
2019-10-15T09:03:57.028 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 10: (()+0x7ea5) [0x7fe9f5608ea5]
2019-10-15T09:03:57.028 INFO:tasks.ceph.cluster1.osd.0.smithi045.stderr: 11: (clone()+0x6d) [0x7fe9f44cc8cd]

[1] http://pulpito.ceph.com/trociny-2019-10-15_07:49:13-rbd-master-distro-basic-smithi/

History

#1 Updated by Neha Ojha 10 months ago

  • Priority changed from Normal to High

/a/trociny-2019-10-15_07:49:13-rbd-master-distro-basic-smithi/4414497/

#5 Updated by Neha Ojha 9 months ago

  • Priority changed from High to Urgent

#6 Updated by Samuel Just 9 months ago

I observed something similar on a ceph_test_rados teuthology run: sjust-2019-12-19_20:05:13-rados-wip-sjust-read-from-replica-distro-basic-smithi/4615580

At least on that one, it seems to be the case that the backoff release ack jumped the queue ahead of a write that should have been dropped. OpSchedulerItem::PGOpItem::get_scheduler_class() needs to return client for backoff messages.

#7 Updated by Samuel Just 9 months ago

  • Status changed from New to Resolved

I can't check the original reports (logs have been removed), but assuming it's the same root cause PR #32382 5bb932c3d388cb515b63e0807fbc9da6569d473a should have resolved it.

#8 Updated by Jason Dillaman 9 months ago

  • Status changed from Resolved to New

#9 Updated by Neha Ojha 9 months ago

Reproduces with -s rbd:mirror-thrash and --filter 'rbd-mirror-fsx-workunit'

http://pulpito.ceph.com/nojha-2020-01-10_19:11:03-rbd:mirror-thrash-master-distro-basic-smithi/

#10 Updated by Neha Ojha 9 months ago

http://pulpito.ceph.com/nojha-2020-01-10_19:11:03-rbd:mirror-thrash-master-distro-basic-smithi/4653675/

Observations from osd.2's log

2020-01-10T19:33:17.675+0000 7fa94972e700 20 osd.2 pg_epoch: 145 pg[2.21( v 145'1417 (0'0,145'1417] local-lis/les=13/14 n=191 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [2,0] r=0 lpr=13 luod=145'1415 crt=145'1417 lcod 145'1414 mlcod 145'1414 active+clean ps=[1b9~1]] do_op: op osd_op(client.4316.0:91264 2.21 2:8583b784:::rbd_header.10dc1668adb4:head [watch reconnect cookie 140413085368928 gen 4] snapc 0=[] ondisk+write+known_if_redirected e145) v8

2020-01-10T19:33:17.676+0000 7fa94972e700 20 osd.2 pg_epoch: 145 pg[2.21( v 145'1417 (0'0,145'1417] local-lis/les=13/14 n=191 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [2,0] r=0 lpr=13 luod=145'1415 crt=145'1417 lcod 145'1414 mlcod 145'1414 active+clean ps=[1b9~1]] op order client.4316 tid 91264 last was 91220

2020-01-10T19:33:17.679+0000 7fa94972e700 20 osd.2 pg_epoch: 145 pg[2.21( v 145'1418 (0'0,145'1418] local-lis/les=13/14 n=191 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [2,0] r=0 lpr=13 luod=145'1415 crt=145'1417 lcod 145'1414 mlcod 145'1414 active+clean ps=[1b9~1]] do_op: op osd_op(client.4316.0:91263 2.21 2:8583b784:::rbd_header.10dc1668adb4:head [watch reconnect cookie 140413085368928 gen 3] snapc 0=[] ondisk+write+known_if_redirected e145) v8

2020-01-10T19:33:17.680+0000 7fa94972e700 -1 osd.2 pg_epoch: 145 pg[2.21( v 145'1418 (0'0,145'1418] local-lis/les=13/14 n=191 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [2,0] r=0 lpr=13 luod=145'1415 crt=145'1417 lcod 145'1414 mlcod 145'1414 active+clean ps=[1b9~1]] bad op order, already applied 91264 > this 91263

#11 Updated by Neha Ojha 8 months ago

Looking at the logs that Jason added http://qa-proxy.ceph.com/teuthology/jdillaman-2020-01-09_16:20:49-rbd-wip-jd-testing-distro-basic-smithi/4649815/teuthology.log, it is pretty clear that the common factor here is that this out of order issue happens on a "watch reconnect" op.

020-01-09T21:45:31.043+0000 7f9b6b363700 20 osd.1 pg_epoch: 48 pg[2.24( v 48'12 (0'0,48'12] local-lis/les=13/14 n=1 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [1,0] r=0 lpr=13 luod=46'11 crt=46'11 lcod 46'10 mlcod 46'10 active+clean ps=[f~1,2d~5,38~5]] do_op: op osd_op(client.4343.0:5553 2.24 2:26751c6f:::journal.10f7a7e099b5:head [watch reconnect cookie 94368403188352 gen 6] snapc 0=[] ondisk+write+known_if_redirected e48) v8

2020-01-09T21:45:31.043+0000 7f9b6b363700 20 osd.1 pg_epoch: 48 pg[2.24( v 48'12 (0'0,48'12] local-lis/les=13/14 n=1 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [1,0] r=0 lpr=13 luod=46'11 crt=46'11 lcod 46'10 mlcod 46'10 active+clean ps=[f~1,2d~5,38~5]] op order client.4343 tid 5553 last was 5425

2020-01-09T21:45:31.045+0000 7f9b6735b700 20 osd.1 pg_epoch: 48 pg[2.24( v 48'13 (0'0,48'13] local-lis/les=13/14 n=1 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [1,0] r=0 lpr=13 luod=46'11 crt=46'11 lcod 46'10 mlcod 46'10 active+clean ps=[f~1,2d~5,38~5]] do_op: op osd_op(client.4343.0:5507 2.24 2:26751c6f:::journal.10f7a7e099b5:head [watch reconnect cookie 94368403188352 gen 5] snapc 0=[]
ondisk+write+known_if_redirected e48) v8

2020-01-09T21:45:31.045+0000 7f9b6735b700 -1 osd.1 pg_epoch: 48 pg[2.24( v 48'13 (0'0,48'13] local-lis/les=13/14 n=1 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [1,0] r=0 lpr=13 luod=46'11 crt=46'11 lcod 46'10 mlcod 46'10 active+clean ps=[f~1,2d~5,38~5]] bad op order, already applied 5553 > this 5507

I'll look through the merges during the questionable time period to see if anything related stands out.

#12 Updated by Neha Ojha 8 months ago

I reproduced the bug with increased messenger and objecter logging on the client and osd.

Logs: /a/nojha-2020-01-18_17:55:30-rbd:mirror-thrash-master-distro-basic-smithi/4681540

On the client side:

client.4325.0:54833 is processed first

2020-01-18T18:38:33.322+0000 7f83dbe2e700 15 client.4325.objecter _send_op 54833 to 2.4 on osd.1
.
.
2020-01-18T18:38:33.323+0000 7f83de633700  5 --2- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd0c60c900 0x55cd0996f600 crc :-1 s=READY pgs=366 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x55cd08b05b00 seq=32 osd_op(client.4325.0:54833 2.4 2:208d1b4b:::journal.10d444fac985:head [watch reconnect cookie 94339097717376 gen 6] snapc 0=[] ondisk+write+known_if_redirected e489) v8

then client.4325.0:54879

2020-01-18T18:38:33.326+0000 7f83dbe2e700 15 client.4325.objecter _send_op 54879 to 2.4 on osd.1
.
.
2020-01-18T18:38:33.327+0000 7f83de633700  5 --2- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd08506000 0x55cd1000b180 crc :-1 s=READY pgs=367 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x55cd08912600 seq=32 osd_op(client.4325.0:54879 2.4 2:208d1b4b:::journal.10d444fac985:head [watch reconnect cookie 94339097717376 gen 7] snapc 0=[] ondisk+write+known_if_redirected e489) v8 

The messenger received 54879 before 54833.

2020-01-18T18:38:33.333+0000 7fd93f555700  5 --2- [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] >> 172.21.15.89:0/3092513796 conn(0x561607d0f180 0x561600e0e680 crc :-1 s=READ_MESSAGE_COMPLETE pgs=169 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x5615fe07d500 seq=32 from=client.4325 type=42 osd_op(client.4325.0:54879 2.4 2.d2d8b104 (undecoded) ondisk+write+known_if_redirected e489) v8
.
.
2020-01-18T18:38:33.338+0000 7fd93e553700  5 --2- [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] >> 172.21.15.89:0/3092513796 conn(0x5616077cb200 0x561600e0ec00 crc :-1 s=READ_MESSAGE_COMPLETE pgs=168 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x5615fe495200 seq=32 from=client.4325 type=42 osd_op(client.4325.0:54833 2.4 2.d2d8b104 (undecoded) ondisk+write+known_if_redirected e489) v8

then the osd complains

2020-01-18T18:38:33.338+0000 7fd91d299700 20 osd.1 pg_epoch: 489 pg[2.4( v 489'809 (0'0,489'809] local-lis/les=14/15 n=189 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=14) [1,0] r=0 lpr=14 luod=489'808 crt=489'807 lcod 489'807 mlcod 489'807 active+clean]  op order client.4325 tid 54833 last was 54879

Unless I am missing something, the problem seems to be in the messenger layer.

#13 Updated by Sage Weil 8 months ago

client log is cluster2-client.mirror.3.24765.log

#14 Updated by Sage Weil 8 months ago

Ah, those are two different connections. after sending the first message,

2020-01-18T18:38:33.324+0000 7f83dbe2e700  1 client.4325.objecter ms_handle_reset 0x55cd0c60c900 session 0x55cd0660c2c0 osd.1
2020-01-18T18:38:33.324+0000 7f83dbe2e700 10 client.4325.objecter reopen_session osd.1 session, addr now [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791]
2020-01-18T18:38:33.324+0000 7f83dbe2e700  1 -- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd0c60c900 msgr2=0x55cd0996f600 unknown :-1 s=STATE_CLOSED l=1).mark_down
2020-01-18T18:38:33.324+0000 7f83dbe2e700  1 --2- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd0c60c900 0x55cd0996f600 unknown :-1 s=CLOSED pgs=366 cs=0 l=1 rx=0 tx=0).stop
2020-01-18T18:38:33.324+0000 7f83dbe2e700 10 -- 172.21.15.89:0/3092513796 create_connect [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791], creating connection and registering
2020-01-18T18:38:33.324+0000 7f83dbe2e700 10 -- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd08506000 msgr2=0x55cd1000b180 unknown :-1 s=STATE_NONE l=1)._connect
2020-01-18T18:38:33.324+0000 7f83dbe2e700  1 --2- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd08506000 0x55cd1000b180 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2020-01-18T18:38:33.324+0000 7f83dbe2e700 20 Event(0x55cd04c88bc0 nevent=5000 time_id=124).wakeup
2020-01-18T18:38:33.324+0000 7f83dbe2e700 10 -- 172.21.15.89:0/3092513796 create_connect 0x55cd08506000 [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791]
2020-01-18T18:38:33.324+0000 7f83dbe2e700 10 -- 172.21.15.89:0/3092513796 connect_to [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] new 0x55cd08506000
...
2020-01-18T18:38:33.324+0000 7f83dbe2e700 15 client.4325.objecter cancel_op 54833
2020-01-18T18:38:33.324+0000 7f83dbe2e700 15 client.4325.objecter _finish_op 54833
2020-01-18T18:38:33.324+0000 7f83dbe2e700 20 client.4325.objecter put_session s=0x55cd0660c2c0 osd=1 64
2020-01-18T18:38:33.324+0000 7f83dbe2e700 15 client.4325.objecter _session_op_remove 1 54833
...
2020-01-18T18:38:33.326+0000 7f83dbe2e700 10 client.4325.objecter _op_submit oid journal.10d444fac985 '@2' '@2' [watch reconnect cookie 94339097717376 gen 7] tid 54879 osd.1
2020-01-18T18:38:33.326+0000 7f83dbe2e700 20 client.4325.objecter get_session s=0x55cd0660c2c0 osd=1 81
2020-01-18T18:38:33.326+0000 7f83dbe2e700 15 client.4325.objecter _session_op_assign 1 54879
2020-01-18T18:38:33.326+0000 7f83dbe2e700 15 client.4325.objecter _send_op 54879 to 2.4 on osd.1
2020-01-18T18:38:33.326+0000 7f83dbe2e700  1 -- 172.21.15.89:0/3092513796 --> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] -- osd_op(unknown.0.0:54879 2.4 2:208d1b4b:::journal.10d444fac985:head [watch reconnect cookie 94339097717376 gen 7] snapc 0=[] ondisk+write+known_if_redirected e489) v8 -- 0x55cd08912600 con 0x55cd08506000
2020-01-18T18:38:33.326+0000 7f83dbe2e700  5 --2- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd08506000 0x55cd1000b180 crc :-1 s=READY pgs=367 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x55cd08912600 type=42 osd_op(client.4325.0:54879 2.4 2:208d1b4b:::journal.10d444fac985:head [watch reconnect cookie 94339097717376 gen 7] snapc 0=[] ondisk+write+known_if_redirected e489) v8
2020-01-18T18:38:33.326+0000 7f83dbe2e700 15 --2- 172.21.15.89:0/3092513796 >> [v2:172.21.15.89:6808/22791,v1:172.21.15.89:6809/22791] conn(0x55cd08506000 0x55cd1000b180 crc :-1 s=READY pgs=367 cs=0 l=1 rx=0 tx=0).send_message inline write is denied, reschedule m=0x55cd08912600

#15 Updated by Sage Weil 8 months ago

I'm not really sure what the fix is here. The OSD doesn't really know (and can't really know) that the message coming in on the old connection is old.

Normally, the new connection would send the old request and the new request, so the old request would show up as a dup. I think that is the root of this problem: that the objecter is canceling the old op and that old op is possibly in flight and may still be delivered.

I vaguely recall that it was important that resends of reconnect have a fresh tid. Maybe a "fix" is to send a no-op write on the new connection with the old tid so that it gets registered in the pglog/dup ops map?

#16 Updated by Neha Ojha 8 months ago

The root cause of this issue was introduced in https://github.com/ceph/ceph/pull/30223, which aligns with Jason's comment https://tracker.ceph.com/issues/42328#note-4

#17 Updated by Sage Weil 8 months ago

  • Related to Bug #42977: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added

#18 Updated by Sage Weil 8 months ago

  • Target version set to v15.0.0

#19 Updated by Neha Ojha 8 months ago

  • Related to deleted (Bug #42977: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()))

#20 Updated by Sage Weil 8 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 33163

#21 Updated by Sage Weil 7 months ago

  • Status changed from Fix Under Review to Resolved

#22 Updated by Jason Dillaman 7 months ago

  • Status changed from Resolved to New

#23 Updated by Sage Weil 7 months ago

follow-up fix: https://github.com/ceph/ceph/pull/33559 (typo in original commit)

#24 Updated by Jason Dillaman 7 months ago

  • Status changed from New to Resolved

Also available in: Atom PDF