Project

General

Profile

Actions

Bug #18696

open

OSD might assert when LTTNG tracing is enabled

Added by Igor Fedotov about 7 years ago. Updated almost 7 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
LTTNG
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Following assert happens occasionally when LTTNG is enabled:
2017-01-27 13:52:07.451981 7f9edbf80700 -1 /root/ceph/reddy/src/messages/MOSDOp.h: In function 'object_t& MOSDOp::get_oid()' thread 7f9edbf80700 time 2017-01-27 13:52:07.449173
/root/ceph/reddy/src/messages/MOSDOp.h: 126: FAILED assert(!final_decode_needed)

ceph version 11.1.0-6973-g8f196e5 (8f196e5b4b9cebfcbeb744fdd35456bb334f83ad)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f9ef8f001f2]
2: (()+0x55da55) [0x7f9ef91caa55]
3: (EventTrace::trace_oid_event(Message*, char const*, char const*, char const*, int, bool)+0x6f) [0x7f9ef91cb71f]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e3) [0x55885749fbe3]
5: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x55885749ff52]
6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x776) [0x5588574c67c6]
7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x7f9ef8f068c9]
8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f9ef8f09c90]
9: (()+0x76fa) [0x7f9ef87116fa]
10: (clone()+0x6d) [0x7f9ef779ab5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The root cause for that is the call for MOSDOp::get_oid() from
OID_EVENT_TRACE_WITH_MSG(op->get_req(), "DEQUEUE_OP_END", true)
macro at OSD::dequeue_op while the message isn't fully decoded.
This occurs when message handling is delayed due to 'flushes in progress' in PrimaryLogPG::do_request.
In this case no MOSDOp::finish_decode is called and hence the message isn't fully decoded.
Corresponding log output:
2017-01-27 13:52:07.450896 7f9edb77f700 10 osd.0 6 dequeue_op 0x55885b8054a0 prio 127 cost 5048 latency 0.000148 osd_op(mds.0.4:25 2.c52
65ab3 (undecoded) ondisk+write+known_if_redirected+full_force e6) v7 pg pg[2.3( empty local-les=0 n=0 ec=5 les/c/f 0/0/0 6/6/6) [0] r=0
lpr=6 pi=5-5/1 crt=0'0 mlcod 0'0 creating+peering]
2017-01-27 13:52:07.450951 7f9edb77f700 20 osd.0 pg_epoch: 6 pg[2.3( empty local-les=0 n=0 ec=5 les/c/f 0/0/0 6/6/6) [0] r=0 lpr=6 pi=5-
5/1 crt=0'0 mlcod 0'0 creating+peering] 1 flushes_in_progress pending waiting for active on 0x55885b8054a0
2017-01-27 13:52:07.450977 7f9edb77f700 10 osd.0 6 dequeue_op 0x55885b8054a0 finish


Related issues 1 (0 open1 closed)

Related to Ceph - Bug #18576: Enabling LTTnG causes Segmentation fault in libgccResolved01/17/2017

Actions
Actions #1

Updated by Nathan Cutler about 7 years ago

  • Related to Bug #18576: Enabling LTTnG causes Segmentation fault in libgcc added
Actions #2

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to LTTNG
  • Component(RADOS) OSD added
Actions #3

Updated by Ganesh Mahalingam almost 7 years ago

Wonder if this PR https://github.com/ceph/ceph/pull/14304 fixes this issue as well.

Actions

Also available in: Atom PDF