Project

General

Profile

Bug #21931

osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len()))

Added by Patrick Donnelly about 1 year ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
10/25/2017
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
mimic, luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
EC plugins
Pull request ID:

Description

2017-10-25T18:19:16.475 INFO:tasks.ceph.osd.0.smithi171.stderr:/build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc: In function 'void ECBackend::objects_read_async(const hobject_t&, const std::list<std::pair<boost::tuples::tuple<long unsigned int, long unsigned int, unsigned int>, std::pair<ceph::buffer::list*, Context*> > >&, Context*, bool)::cb::operator()(std::map<hobject_t, std::pair<int, interval_map<long unsigned int, ceph::buffer::list, bl_split_merge> > >&&)' thread 7f6ebf10f700 time 2017-10-25 18:19:16.472433
2017-10-25T18:19:16.475 INFO:tasks.ceph.osd.0.smithi171.stderr:/build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len()))
2017-10-25T18:19:16.479 INFO:tasks.ceph.osd.0.smithi171.stderr: ceph version 13.0.0-2328-gfc31752 (fc317524b8a24874a4a57c192e5da46ea11d593f) mimic (dev)
2017-10-25T18:19:16.479 INFO:tasks.ceph.osd.0.smithi171.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x55711ade9eee]
2017-10-25T18:19:16.479 INFO:tasks.ceph.osd.0.smithi171.stderr: 2: (()+0x825151) [0x55711ab63151]
2017-10-25T18:19:16.479 INFO:tasks.ceph.osd.0.smithi171.stderr: 3: (ECBackend::kick_reads()+0x48) [0x55711ab79bc8]
2017-10-25T18:19:16.479 INFO:tasks.ceph.osd.0.smithi171.stderr: 4: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x569) [0x55711ab89879]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 5: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x6f) [0x55711ab6123f]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 6: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xf7e) [0x55711ab6854e]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 7: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x19f) [0x55711ab7242f]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x78) [0x55711aa88db8]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x57c) [0x55711aa42e8c]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e3) [0x55711a899ec3]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 11: (PGOpItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4a) [0x55711aaef1da]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd83) [0x55711a8ad823]
2017-10-25T18:19:16.480 INFO:tasks.ceph.osd.0.smithi171.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x55711adef6af]
2017-10-25T18:19:16.481 INFO:tasks.ceph.osd.0.smithi171.stderr: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55711adf1600]
2017-10-25T18:19:16.481 INFO:tasks.ceph.osd.0.smithi171.stderr: 15: (()+0x8184) [0x7f6ed95a7184]
2017-10-25T18:19:16.481 INFO:tasks.ceph.osd.0.smithi171.stderr: 16: (clone()+0x6d) [0x7f6ed8696ffd]
2017-10-25T18:19:16.481 INFO:tasks.ceph.osd.0.smithi171.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

From: /ceph/teuthology-archive/pdonnell-2017-10-25_18:05:03-kcephfs-wip-pdonnell-testing-201710240526-testing-basic-smithi/1773350/teuthology.log

Better backtrace from gdb:

(gdb) bt
#0  0x00007f6ed95af1fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000055711adadb77 in reraise_fatal (signum=6) at /build/ceph-13.0.0-2328-gfc31752/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /build/ceph-13.0.0-2328-gfc31752/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007f6ed85cfc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f6ed85d3028 in __GI_abort () at abort.c:89
#6  0x000055711adea060 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x55711b339c30 "(offset + length) <= (range.first.get_off() + range.first.get_len())", 
    file=file@entry=0x55711b339a98 "/build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc", line=line@entry=2164, 
    func=func@entry=0x55711b33b4e0 <ECBackend::objects_read_async(hobject_t const&, std::list<std::pair<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::pair<ceph::buffer::list*, Context*> >, std::allocator<std::pair<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::pair<ceph::buffer::list*, Context*> > > > const&, Context*, bool)::cb::operator()(std::map<hobject_t, std::pair<int, interval_map<unsigned long, ceph::buffer::list, bl_split_merge> >, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<int, interval_map<unsigned long, ceph::buffer::list, bl_split_merge> > > > >&&)::__PRETTY_FUNCTION__> "void ECBackend::objects_read_async(const hobject_t&, const std::list<std::pair<boost::tuples::tuple<long unsigned int, long unsigned int, unsigned int>, std::pair<ceph::buffer::list*, Context*> > >&, "...)
    at /build/ceph-13.0.0-2328-gfc31752/src/common/assert.cc:66
#7  0x000055711ab63151 in ECBackend::cb::operator()(<unknown type in /usr/lib/debug/.build-id/65/a8bbeba956c61cef84edc7c56b27f6fb569fc5.debug, CU 0x3673e91, DIE 0x3a19678>) (this=0x5571283f5608, 
    results=<optimized out>) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc:2162
#8  0x000055711ab79bc8 in complete<std::map<hobject_t, std::pair<int, interval_map<unsigned long, ceph::buffer::list, bl_split_merge> > > > (t=<optimized out>, this=0x5571283f5600)
    at /build/ceph-13.0.0-2328-gfc31752/src/include/Context.h:48
#9  run (this=<optimized out>) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.h:176
#10 ECBackend::kick_reads (this=0x557127e6b380) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.h:208
#11 0x000055711ab89879 in CallClientContexts::finish (this=0x55712b6ff980, in=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc:2253
#12 0x000055711ab6123f in complete<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&> (t=..., this=0x55712b6ff980) at /build/ceph-13.0.0-2328-gfc31752/src/include/Context.h:48
#13 ECBackend::complete_read_op (this=this@entry=0x557127e6b380, rop=..., m=m@entry=0x7f6ebf10bab0) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc:1269
#14 0x000055711ab6854e in ECBackend::handle_sub_read_reply (this=this@entry=0x557127e6b380, from=..., op=..., m=m@entry=0x7f6ebf10bab0, trace=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc:1252
#15 0x000055711ab7242f in ECBackend::_handle_message (this=0x557127e6b380, _op=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/ECBackend.cc:788
#16 0x000055711aa88db8 in PGBackend::handle_message (this=<optimized out>, op=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/PGBackend.cc:114
#17 0x000055711aa42e8c in PrimaryLogPG::do_request (this=0x557127f20000, op=..., handle=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/PrimaryLogPG.cc:1713
#18 0x000055711a899ec3 in OSD::dequeue_op (this=this@entry=0x5571258e2000, pg=..., op=..., handle=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/OSD.cc:9126
#19 0x000055711aaef1da in PGOpItem::run (this=<optimized out>, osd=0x5571258e2000, pg=..., handle=...) at /build/ceph-13.0.0-2328-gfc31752/src/osd/OpQueueItem.cc:22
#20 0x000055711a8ad823 in run (handle=..., pg=..., osd=<optimized out>, this=0x7f6ebf10c1d0) at /build/ceph-13.0.0-2328-gfc31752/src/osd/OpQueueItem.h:121
#21 OSD::ShardedOpWQ::_process (this=0x5571258e32b8, thread_index=<optimized out>, hb=0x557125c842e0) at /build/ceph-13.0.0-2328-gfc31752/src/osd/OSD.cc:9907
#22 0x000055711adef6af in ShardedThreadPool::shardedthreadpool_worker (this=0x5571258e2a48, thread_index=<optimized out>) at /build/ceph-13.0.0-2328-gfc31752/src/common/WorkQueue.cc:339
#23 0x000055711adf1600 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /build/ceph-13.0.0-2328-gfc31752/src/common/WorkQueue.h:689
#24 0x00007f6ed95a7184 in start_thread (arg=0x7f6ebf10f700) at pthread_create.c:312
#25 0x00007f6ed8696ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Related issues

Related to RADOS - Bug #22887: osd/ECBackend.cc: 2202: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) Duplicate 02/01/2018
Related to RADOS - Bug #23195: Read operations segfaulting multiple OSDs Resolved 03/02/2018
Related to RADOS - Bug #22330: ec: src/common/interval_map.h: 161: FAILED assert(len > 0) Resolved 12/06/2017
Copied to RADOS - Backport #36439: mimic: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) Resolved
Copied to RADOS - Backport #36440: luminous: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) Resolved

History

#1 Updated by Patrick Donnelly about 1 year ago

Dead OSD is accessible at smithi171 as of now.

http://pulpito.ceph.com/pdonnell-2017-10-25_18:05:03-kcephfs-wip-pdonnell-testing-201710240526-testing-basic-smithi/1773350/

Eventually I'll kill it so teuthology colelcts the coredump.

#2 Updated by Patrick Donnelly about 1 year ago

Logs and coredump saved in: teuthology:/home/pdonnell/1773350

#3 Updated by Greg Farnum about 1 year ago

Oh, I think this is the thing where we're requesting reads of non-existent chunks?

#5 Updated by Greg Farnum 8 months ago

That was the fix I was wondering about, but it was merged to master as https://github.com/ceph/ceph/pull/15712 and so should have been included here.

#6 Updated by Greg Farnum 8 months ago

  • Related to Bug #22887: osd/ECBackend.cc: 2202: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) added

#7 Updated by Greg Farnum 8 months ago

  • Related to Bug #23195: Read operations segfaulting multiple OSDs added

#8 Updated by Greg Farnum 8 months ago

From #22887, this also appeared in /ceph/teuthology-archive/pdonnell-2018-01-30_23:38:56-kcephfs-wip-pdonnell-i22627-testing-basic-smithi/2129806/teuthology.log

/ceph/teuthology-archive/pdonnell-2018-03-11_22:42:18-multimds-wip-pdonnell-testing-20180311.180352-testing-basic-smithi/2279890/teuthology.log

#9 Updated by Patrick Donnelly 8 months ago

  • Source set to Q/A
  • Severity changed from 3 - minor to 2 - major
  • Component(RADOS) EC plugins added
  • Component(RADOS) deleted (OSD)

Live multimds run: /ceph/teuthology-archive/pdonnell-2018-04-13_04:19:33-multimds-wip-pdonnell-testing-20180413.022831-testing-basic-smithi/2393142/teuthology.log

#10 Updated by Josh Durgin 8 months ago

  • Status changed from New to In Progress
  • Assignee set to Josh Durgin

This is a bug in trimtrunc handling with EC pools.

#11 Updated by Patrick Donnelly 4 months ago

Run with cores/logs: /ceph/teuthology-archive/pdonnell-2018-08-02_13:06:29-multimds-wip-pdonnell-testing-20180802.044402-testing-basic-smithi/2852861

#12 Updated by Josh Durgin 3 months ago

  • Assignee changed from Josh Durgin to Neha Ojha
  • Backport set to mimic, luminous

#13 Updated by Neha Ojha 3 months ago

Running the multimds:basic suite with --filter 'clusters/9-mds.yaml conf/{client.yaml mds.yaml mon.yaml osd.yaml} inline/yes.yaml mount/kclient/{mount.yaml overrides/{distro/random/{k-testing.yaml supported$/{rhel_latest.yaml}} ms-die-on-skipped.yaml}} objectstore-ec/bluestore-ec-root.yaml overrides/{basic/{frag_enable.yaml whitelist_health.yaml whitelist_wrongly_marked_down.yaml} fuse-default-perm-no.yaml} q_check_counter/check_counter.yaml tasks/cfuse_workunit_suites_fsx.yaml' N=20 times, reproduces this failure reliably.

http://pulpito.ceph.com/nojha-2018-09-17_15:49:25-multimds:basic-master-distro-basic-smithi/

#14 Updated by Neha Ojha 2 months ago

  • Related to Bug #22330: ec: src/common/interval_map.h: 161: FAILED assert(len > 0) added

#15 Updated by Neha Ojha 2 months ago

  • Status changed from In Progress to Need Review

#16 Updated by Neha Ojha 2 months ago

  • Status changed from Need Review to Pending Backport

#17 Updated by Nathan Cutler about 2 months ago

  • Copied to Backport #36439: mimic: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) added

#18 Updated by Nathan Cutler about 2 months ago

  • Copied to Backport #36440: luminous: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) added

#19 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF