Bug #13666 (Closed)
"osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")" with async msgr
Description
Run: http://pulpito.ceph.com/teuthology-2015-10-29_21:00:11-rados-infernalis-distro-basic-multi/
Jobs: ['1131148', '1131285']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-10-29_21:00:11-rados-infernalis-distro-basic-multi/1131148/teuthology.log
Assertion: osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")
ceph version 9.1.0-98-g3af21a1 (3af21a1432aabb86f82f9063aef7a6ab9865345e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fcbb9c1894b]
 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x183d) [0x7fcbb987640d]
 3: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x2962) [0x7fcbb9879182]
 4: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6dd) [0x7fcbb981493d]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x7fcbb968743d]
 6: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x7fcbb968765d]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8b4) [0x7fcbb96ac614]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0x7fcbb9c0936f]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fcbb9c0b270]
 10: (()+0x8182) [0x7fcbb8238182]
 11: (clone()+0x6d) [0x7fcbb657f47d]
Updated by Yuri Weinstein over 8 years ago
- Related to Bug #13017: osd/ReplicatedPG.cc: 2752: FAILED assert(0 == "out of order op") added
Updated by Sage Weil over 8 years ago
- Subject changed from "osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")" in rados-infernalis-distro-basic-multi to "osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")" with async msgr
Updated by Haomai Wang over 8 years ago
Oh, I found why this doesn't happen with the simple messenger. The async messenger batches acks instead of replying immediately (https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L847).
So I guess we need to reply with an ack ASAP.
Updated by Greg Farnum over 8 years ago
I'm not understanding. How does batching up acks lead to this assert?
Updated by Haomai Wang over 8 years ago
For example, osd.0 sends three messages to osd.1 successfully, but because of ack batching osd.1 never replies with an ack. Then a socket error is injected, so osd.0 reconnects to osd.1 and resends the three earlier messages. Meanwhile osd.1 has trimmed its pg log, so do_op can no longer detect the duplicate ops (https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L1664), and osd.1 hits this assert later. If we always replied with an ack ASAP, there would be no gap between the pg log trim and the acked messages.
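The race above can be sketched in a few lines of plain Python (this is an illustrative model, not Ceph code; the class, method, and field names are invented for the sketch). The receiver keeps a bounded "recent ops" log for dup detection, like the pg log, and asserts if an op arrives with a sequence number it has already passed but can no longer recognize as a replay:

```python
class Receiver:
    """Toy model of an OSD's dup-op detection via a trimmed log."""

    def __init__(self, log_limit):
        self.log_limit = log_limit
        self.recent = []      # bounded dup-detection log (like the pg log)
        self.last_seq = 0     # highest sequence number applied so far

    def handle(self, seq):
        if seq in self.recent:
            return "dup"      # replay detected in the log, dropped safely
        if seq <= self.last_seq:
            # Already applied, but trimmed out of the log: the replay
            # looks like a brand-new op arriving out of order.
            return "ASSERT out of order"
        self.last_seq = seq
        self.recent.append(seq)
        del self.recent[:-self.log_limit]   # trim, keeping the newest entries
        return "applied"

r = Receiver(log_limit=2)
for seq in (1, 2, 3):
    r.handle(seq)   # delivered and applied, but never acked (batched acks)

# Socket error injected: the sender reconnects and resends seqs 1..3.
# By now the log has been trimmed down to [2, 3], so seq 1 is not
# recognized as a dup and trips the out-of-order check.
print(r.handle(1))  # "ASSERT out of order"
print(r.handle(2))  # "dup" -- still in the log, dropped safely
```

With immediate acks the sender would never resend ops the receiver has already applied, so a replay could only reference ops still present in the log, which is the gap Haomai's fix closes.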
Updated by Greg Farnum over 8 years ago
Oh, so this is just a log trimming/op replay race that the async messenger is making a lot larger than usual. Got it. That's...pretty impressive.
Updated by Yuri Weinstein over 8 years ago
Run: http://pulpito.ceph.com/teuthology-2015-10-31_21:00:10-rados-infernalis-distro-basic-multi/
Jobs: ['1133613', '1133950']
Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
- Assignee set to Haomai Wang
Updated by Haomai Wang over 8 years ago
- Status changed from Fix Under Review to Resolved