Bug #18696: OSD might assert when LTTNG tracing is enabled - RADOS - Ceph

Actions

Copy link

Bug #18696

open

OSD might assert when LTTNG tracing is enabled

Added by Igor Fedotov about 7 years ago. Updated almost 7 years ago.

Status:

New

Priority:

Normal

Assignee:

Category:

LTTNG

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

OSD

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Following assert happens occasionally when LTTNG is enabled:
2017-01-27 13:52:07.451981 7f9edbf80700 -1 /root/ceph/reddy/src/messages/MOSDOp.h: In function 'object_t& MOSDOp::get_oid()' thread 7f9edbf80700 time 2017-01-27 13:52:07.449173
/root/ceph/reddy/src/messages/MOSDOp.h: 126: FAILED assert(!final_decode_needed)

ceph version 11.1.0-6973-g8f196e5 (8f196e5b4b9cebfcbeb744fdd35456bb334f83ad)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f9ef8f001f2]
 2: (()+0x55da55) [0x7f9ef91caa55]
 3: (EventTrace::trace_oid_event(Message*, char const*, char const*, char const*, int, bool)+0x6f) [0x7f9ef91cb71f]
 4: (OSD::dequeue_op(boost::intrusive_ptr&lt;PG&gt;, std::shared_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x3e3) [0x55885749fbe3]
 5: (PGQueueable::RunVis::operator()(std::shared_ptr&lt;OpRequest&gt; const&)+0x52) [0x55885749ff52]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x776) [0x5588574c67c6]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x7f9ef8f068c9]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f9ef8f09c90]
 9: (()+0x76fa) [0x7f9ef87116fa]
 10: (clone()+0x6d) [0x7f9ef779ab5d]
 NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

The root cause for that is the call for MOSDOp::get_oid() from
OID_EVENT_TRACE_WITH_MSG(op->get_req(), "DEQUEUE_OP_END", true)
macro at OSD::dequeue_op while the message isn't fully decoded.
This occurs when message handling is delayed due to 'flushes in progress' in PrimaryLogPG::do_request.
In this case no MOSDOp::finish_decode is called and hence the message isn't fully decoded.
Corresponding log output:
2017-01-27 13:52:07.450896 7f9edb77f700 10 osd.0 6 dequeue_op 0x55885b8054a0 prio 127 cost 5048 latency 0.000148 osd_op(mds.0.4:25 2.c52
65ab3 (undecoded) ondisk+write+known_if_redirected+full_force e6) v7 pg pg[2.3( empty local-les=0 n=0 ec=5 les/c/f 0/0/0 6/6/6) [0] r=0
lpr=6 pi=5-5/1 crt=0'0 mlcod 0'0 creating+peering]
2017-01-27 13:52:07.450951 7f9edb77f700 20 osd.0 pg_epoch: 6 pg[2.3( empty local-les=0 n=0 ec=5 les/c/f 0/0/0 6/6/6) [0] r=0 lpr=6 pi=5-
5/1 crt=0'0 mlcod 0'0 creating+peering] 1 flushes_in_progress pending waiting for active on 0x55885b8054a0
2017-01-27 13:52:07.450977 7f9edb77f700 10 osd.0 6 dequeue_op 0x55885b8054a0 finish

Related issues 1 (0 open — 1 closed)