Bug #9384

OSD is crashing while IO is running and querying with the admin socket

Added by Somnath Roy over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I hit a crash in the OSD with the latest Ceph master. Here is the log trace:

ceph version 0.85-677-gd5777c4 (d5777c421548e7f039bb2c77cb0df2e9c7404723)
1: ceph-osd() [0x990def]
2: (()+0xfbb0) [0x7f72ae6e6bb0]
3: (gsignal()+0x37) [0x7f72acc08f77]
4: (abort()+0x148) [0x7f72acc0c5e8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
6: (()+0x5e856) [0x7f72ad512856]
7: (()+0x5e883) [0x7f72ad512883]
8: (()+0x5eaae) [0x7f72ad512aae]
9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int, unsigned int)+0x277) [0xa88747]
10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81) [0xa89541]
11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) [0x6b6483]
14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057]
15: (OSD::asok_command(std::string, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&, std::string, std::ostream&)+0x1d7) [0x612cb7]
16: (OSDSocketHook::call(std::string, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&, std::string, ceph::buffer::list&)+0x67) [0x67c8b7]
17: (AdminSocket::do_accept()+0x1007) [0xa79817]
18: (AdminSocket::entry()+0x258) [0xa7b448]
19: (()+0x7f6e) [0x7f72ae6def6e]
20: (clone()+0x6d) [0x7f72acccc9cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Steps to reproduce:
-----------------------

1. Run I/O against the cluster.
2. While the I/O is running, run the following command continuously:

"ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight"

3. At some point the OSD will crash.

Associated revisions

Revision 11082f7a (diff)
Added by Somnath Roy over 9 years ago

OpTracker: Race condition removed while dumping ops through admin socket

The OSD was crashing due to a race condition between ongoing IO and a
user dumping in-flight ops. This happened because the Message data and
payloads are removed before the op itself is removed from the in-flight
list. Calling op->_unregistered() after removing the op from the
in-flight list fixes the issue.

Fixes: #9384

Signed-off-by: Somnath Roy <>
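
To make the ordering described in this commit concrete, here is a minimal sketch with simplified stand-in types (this is not the actual Ceph code; OpTracker, TrackedOp and remove_op below are illustrative stand-ins only):

    #include <list>
    #include <mutex>
    #include <string>

    // Simplified stand-ins for the real types; the point is only the ordering.
    struct TrackedOp {
        std::string message_data;                       // stands in for the Message data/payload
        void _unregistered() { message_data.clear(); }  // stands in for op->_unregistered()
    };

    struct OpTracker {
        std::mutex ops_in_flight_lock;
        std::list<TrackedOp*> ops_in_flight;

        // Cleanup path with the fixed ordering: take the same lock the dump
        // path uses, drop the op from the in-flight list, and only then
        // release its message data.
        void remove_op(TrackedOp *op) {
            {
                std::lock_guard<std::mutex> lock(ops_in_flight_lock);
                ops_in_flight.remove(op);   // op is no longer visible to dumpers
            }
            op->_unregistered();            // the buggy ordering did this first,
            delete op;                      // while the op was still on the list
        }
    };

With the original ordering, a dump running between the _unregistered() call and the list removal could still find the op and try to print a message whose buffers were already gone.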

History

#1 Updated by Somnath Roy over 9 years ago

I think I have root-caused it:

1. OpTracker::RemoveOnDelete::operator() calls op->_unregistered(), which clears out the message's data and payload.
2. After that, if op tracking is enabled, we call unregister_inflight_op(), which removes the op from the xlist.
3. Meanwhile, while dumping ops, TrackedOp::dump calls _dump_op_descriptor_unlocked(), which tries to print the message.
4. So there is a race: the dump can try to print a message whose ops (data) field has already been cleared.

The fix could be to call op->_unregistered() (when op tracking is enabled) only after the op has been removed from the xlist.
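
For the reader side of the race (steps 3 and 4), here is a rough sketch of what the dump path does, again with simplified stand-in types rather than the actual Ceph code:

    #include <cstddef>
    #include <list>
    #include <mutex>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    // Stand-ins only: the op records a length (as OSDOp does) separately
    // from the data buffer itself.
    struct TrackedOp {
        std::string data;          // stands in for the message's bufferlist
        std::size_t recorded_len;  // stands in for the length stored in the OSDOp
    };

    std::mutex ops_in_flight_lock;
    std::list<TrackedOp*> ops_in_flight;

    // Rough equivalent of dump_ops_in_flight() -> TrackedOp::dump() ->
    // MOSDOp::print(): walk the list under the lock and render each op.
    std::string dump_ops_in_flight() {
        std::ostringstream out;
        std::lock_guard<std::mutex> lock(ops_in_flight_lock);
        for (const TrackedOp *op : ops_in_flight) {
            // With the pre-fix ordering, step 1 may already have cleared
            // op->data while the op is still on the list (step 2 not yet
            // done), so the buffer is shorter than recorded_len.  Modelled
            // here as an explicit throw; in the real code the equivalent
            // out-of-range access appears to be what escapes from
            // bufferlist::substr_of() (frame 9 above) and aborts the OSD.
            if (op->recorded_len > op->data.size())
                throw std::out_of_range("op data shorter than recorded length");
            out << op->data.substr(0, op->recorded_len) << "\n";
        }
        return out.str();
    }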

#2 Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Somnath Roy over 9 years ago

The following pull request has the fix:

https://github.com/ceph/ceph/pull/2440

#4 Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved

#5 Updated by Lukas Pustina over 9 years ago

We're running 0.80.7-1trusty and I observed the exact same trace in the log of the crashed OSD.
I tried to figure out whether the fix is already in Firefly, but could not find it there. Can you please confirm whether this is still an issue in Firefly 0.80.7?

Thanks a lot,
Lukas

#6 Updated by Daniel Schneller about 9 years ago

Any idea which release this fix will be incorporated in, now that the status says "Resolved"? We have had to disable monitoring of this metric for now to keep the monitoring system from crashing OSDs (0.80.7). Would the REST API be a workaround for this, or does it go through the same code path that triggers this crash?
