Bug #8130

Objecter: resending Ops to wrong target

Added by Greg Farnum over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Objecter
Target version: -
Start date: 04/16/2014
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

From teuthology:/a/gregf-2014-04-16_12:06:55-rados:thrash-wip-fast-dispatch-testing-basic-plana

Note how it calls mark_down on a connection (con 0x34552c0) and then immediately tries to use that same connection to send a message.

2014-04-16 13:18:09.724311 7ff4e6a84700  1 -- 10.214.132.11:6804/6733 mark_down 0x34552c0 -- 0x2d7ea00
2014-04-16 13:18:09.724336 7ff4e6a84700  1 -- 10.214.132.11:6804/6733 --> 10.214.132.11:6814/8215 -- osd_op(osd.5.2:33 plana6530821-4 [assert-version v389,copy-get max 8388608] 3.8b29dc11 RETRY=2 ack+retry+read e619) v4 -- ?+0 0x3317240 con 0x34552c0
2014-04-16 13:18:09.727863 7ff4e6a84700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7ff4e6a84700 time 2014-04-16 13:18:09.724427
common/Mutex.cc: 93: FAILED assert(r == 0)

 ceph version 0.79-263-g3fb425e (3fb425e42bba6a053a2f58d46066a7718e32b476)
 1: (Mutex::Lock(bool)+0x1c3) [0xa25fa3]
 2: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0x59) [0xa4c209]
 3: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0x288) [0xa4d358]
 4: (Objecter::send_op(Objecter::Op*)+0x866) [0x70f2e6]
 5: (Objecter::handle_osd_map(MOSDMap*)+0xaa2) [0x718a42]
 6: (OSD::handle_osd_map(MOSDMap*)+0x43c) [0x66f2fc]
 7: (OSD::_dispatch(Message*)+0x30b) [0x67344b]
 8: (OSD::ms_dispatch(Message*)+0x1f6) [0x673b66]
 9: (DispatchQueue::entry()+0x4e9) [0xb140e9]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0xa4f67d]
 11: (()+0x7e9a) [0x7ff4f4b1de9a]
 12: (clone()+0x6d) [0x7ff4f30de3fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The lock assert fires because taking the mutex returned EINVAL. I have similar failures on 197059, 197060 (? I think; heap corruption), 197082, 197083, 197099, 197122, 197132, 197146.

I'm not quite seeing how it's happening right now, but it seems pretty clear that:
1) handle_osd_map is calling scan_requests
2) scan_requests is finding Ops targeted at now-down OSDs and putting them on the need_resend list
2b) but somehow not resetting the op->session
3) handle_osd_map is calling mark_down() on the down OSDs
4) handle_osd_map calls send_op() on the Op that has a bad session pointer.
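
The suspected sequence above can be illustrated with a small toy model. This is only a sketch: ToySession, ToyOp, ToyObjecter and count_bad_resends are hypothetical names invented for illustration, not the real Objecter types, and the reset_session flag simply models whether step 2b happens.

```cpp
#include <map>
#include <vector>

// Hypothetical stand-ins for illustration; not the real Objecter classes.
struct ToySession {
  int osd;
  bool down;  // becomes true once the connection has been marked down
  explicit ToySession(int o) : osd(o), down(false) {}
};

struct ToyOp {
  int tid;
  ToySession* session;  // nullptr models an op with no current target
};

struct ToyObjecter {
  std::map<int, ToySession> sessions;  // map nodes keep stable addresses
  std::vector<ToyOp> ops;

  ToySession* get_session(int osd) {
    return &sessions.emplace(osd, ToySession(osd)).first->second;
  }

  // Steps 1-2: collect ops aimed at now-down OSDs.
  // reset_session == false models the missing step 2b.
  std::vector<ToyOp*> scan_requests(const std::vector<int>& down_osds,
                                    bool reset_session) {
    std::vector<ToyOp*> need_resend;
    for (ToyOp& op : ops) {
      for (int osd : down_osds) {
        if (op.session && op.session->osd == osd) {
          if (reset_session) op.session = nullptr;
          need_resend.push_back(&op);
          break;
        }
      }
    }
    return need_resend;
  }

  // Step 3: the session for a down OSD is marked down.
  void mark_down(int osd) {
    auto it = sessions.find(osd);
    if (it != sessions.end()) it->second.down = true;
  }
};

// Step 4: count resends that would go through a marked-down session.
int count_bad_resends(bool reset_session) {
  ToyObjecter o;
  o.ops.push_back(ToyOp{33, o.get_session(1)});
  o.ops.push_back(ToyOp{34, o.get_session(2)});
  std::vector<int> down_osds{1};

  std::vector<ToyOp*> need_resend = o.scan_requests(down_osds, reset_session);
  for (int osd : down_osds) o.mark_down(osd);

  int bad = 0;
  for (ToyOp* op : need_resend)
    if (op->session != nullptr && op->session->down) ++bad;
  return bad;
}
```

In this model, count_bad_resends(false) finds one op that would be resent through a dead session, while count_bad_resends(true) finds none, which matches the hypothesis that the stale op->session is what lets send_op() reach a marked-down connection.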

Associated revisions

Revision 93c0515f (diff)
Added by Sage Weil over 5 years ago

osdc/Objecter: fix osd target for newly-homeless op

If we recalculate the mapping and find that there is no primary, we need
to set the 'osd' field to -1. Otherwise, the caller will try to resend
to a dead session with bad results.

This was introduced in the refactor 860d72770c.

Fixes: #8130
Signed-off-by: Sage Weil <>
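
The idea described in the commit message can be sketched as follows. This is not the actual diff: recalc_osd and its apply_fix parameter are hypothetical, and the sketch assumes a mapping with no primary can be modeled as an empty acting list.

```cpp
#include <vector>

// Hypothetical helper for illustration; not the real Objecter code.
// Returns the OSD an op should target after the mapping is recalculated.
// An empty `acting` list models "there is no primary".
int recalc_osd(int current_osd, const std::vector<int>& acting,
               bool apply_fix) {
  if (!acting.empty())
    return acting.front();  // normal case: target the primary

  // No primary: with the fix the target becomes -1 (homeless);
  // without it the stale OSD id survives and the caller resends
  // to a session that has already been marked down.
  return apply_fix ? -1 : current_osd;
}
```

With apply_fix set, an op whose placement group has lost its primary ends up with osd == -1, so a caller that checks for a negative target can skip the resend instead of using the dead session.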

History

#1 Updated by Sage Weil over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

This is affecting master now too:
teuthology-2014-04-16_02:30:03-rados-master-testing-basic-plana has many failures.

#2 Updated by Sage Weil over 5 years ago

  • Status changed from In Progress to Need Review
  • Assignee deleted (Sage Weil)

#3 Updated by Sage Weil over 5 years ago

  • Status changed from Need Review to Resolved
