Project

General

Profile

Actions

Bug #13666

closed

"osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")" with async msgr

Added by Yuri Weinstein over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2015-10-29_21:00:11-rados-infernalis-distro-basic-multi/
Jobs: ['1131148', '1131285']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-10-29_21:00:11-rados-infernalis-distro-basic-multi/1131148/teuthology.log

Assertion: osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")
ceph version 9.1.0-98-g3af21a1 (3af21a1432aabb86f82f9063aef7a6ab9865345e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fcbb9c1894b]
 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x183d) [0x7fcbb987640d]
 3: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x2962) [0x7fcbb9879182]
 4: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6dd) [0x7fcbb981493d]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x7fcbb968743d]
 6: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x7fcbb968765d]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8b4) [0x7fcbb96ac614]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0x7fcbb9c0936f]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fcbb9c0b270]
 10: (()+0x8182) [0x7fcbb8238182]
 11: (clone()+0x6d) [0x7fcbb657f47d]

Related issues 1 (0 open1 closed)

Related to Ceph - Bug #13017: osd/ReplicatedPG.cc: 2752: FAILED assert(0 == "out of order op")Closed09/10/2015

Actions
Actions #1

Updated by Yuri Weinstein over 8 years ago

  • Related to Bug #13017: osd/ReplicatedPG.cc: 2752: FAILED assert(0 == "out of order op") added
Actions #2

Updated by Sage Weil over 8 years ago

  • Subject changed from "osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")" in rados-infernalis-distro-basic-multi to "osd/ReplicatedPG.cc: 2818: FAILED assert(0 == "out of order op")" with async msgr
Actions #3

Updated by Haomai Wang over 8 years ago

Oh, I found why simple messenger won't happen with this. Async messenger reply ack together(https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L847).

So I guess we need to reply ack ASAP

Actions #4

Updated by Greg Farnum over 8 years ago

I'm not understanding. How does batching up acks lead to this assert?

Actions #5

Updated by Haomai Wang over 8 years ago

For example, osd.0 send three messages to osd.1 successfully, and because of batch ack osd.1 doesn't reply ack to osd.0. Then inject socket error happen, osd.0 reconnect to osd.1 and send previous three messages again. Meanwhile osd.1 trimed pg log and make do_op not detect dup op(https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L1664). So osd.1 would hit this assert later. If we always reply ack message ASAP, it wouldn't happen gap between pg log trim and acked messages.

Actions #6

Updated by Greg Farnum over 8 years ago

Oh, so this is just a log trimming/op replay race that the async messenger is making a lot larger than usual. Got it. That's...pretty impressive.

Actions #8

Updated by Kefu Chai over 8 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Haomai Wang
Actions #9

Updated by Haomai Wang over 8 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF