Bug #41816

Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update

Added by Sage Weil almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

(gdb) bt
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000555b460bf090 in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:326
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007efcd5006801 in __GI_abort () at abort.c:79
#6  0x0000555b45a64d1f in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, 
    func=0x555b4696e440 <PeeringState::recover_got(hobject_t const&, eversion_t, bool, ceph::os::Transaction&)::__PRETTY_FUNCTION__> "void PeeringState::recover_got(const hobject_t&, eversion_t, bool, ObjectStore::Transaction&)") at ./src/common/assert.cc:73
#7  0x0000555b45a64ea9 in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:78
#8  0x0000555b45d507cb in PeeringState::recover_got (this=this@entry=0x555b53b4d428, oid=..., v=..., is_delete=is_delete@entry=false, t=...) at ./src/osd/PeeringState.cc:3671
#9  0x0000555b45bc5db2 in PrimaryLogPG::on_local_recover (this=<optimized out>, hoid=..., _recovery_info=..., obc=std::shared_ptr<ObjectContext> (use count 3, weak count 1) = {...}, is_delete=<optimized out>, t=0x7efcb6609af0) at ./src/osd/PrimaryLogPG.cc:388
#10 0x0000555b45e0ed8f in ReplicatedBackend::handle_pull_response (this=this@entry=0x555b5076f680, from=..., from@entry=..., pop=..., response=0x555b51922400, to_continue=to_continue@entry=0x7efcb6609ad0, t=t@entry=0x7efcb6609af0) at ./src/osd/ReplicatedBackend.cc:1868
#11 0x0000555b45e0fc46 in ReplicatedBackend::_do_pull_response (this=this@entry=0x555b5076f680, op=...) at ./src/osd/ReplicatedBackend.cc:845
#12 0x0000555b45e10317 in ReplicatedBackend::do_push (op=..., this=0x555b5076f680) at ./src/osd/ReplicatedBackend.h:233
#13 ReplicatedBackend::_handle_message (this=0x555b5076f680, op=...) at ./src/osd/ReplicatedBackend.cc:193
#14 0x0000555b45c9ad77 in PGBackend::handle_message (this=<optimized out>, op=...) at ./src/osd/PGBackend.cc:114
#15 0x0000555b45c45a45 in PrimaryLogPG::do_request (this=0x555b53b4c000, op=..., handle=...) at ./src/osd/PrimaryLogPG.cc:1548
#16 0x0000555b45ad42fb in OSD::dequeue_op (this=this@entry=0x555b51504000, pg=..., op=..., handle=...) at ./src/osd/OSD.cc:9843
#17 0x0000555b45d2b6a2 in PGOpItem::run (this=<optimized out>, osd=0x555b51504000, sdata=<optimized out>, pg=..., handle=...) at ./src/osd/OpQueueItem.cc:24
#18 0x0000555b45af341c in OpQueueItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7efcb660a510) at ./src/osd/OpQueueItem.h:134
#19 OSD::ShardedOpWQ::_process (this=0x555b51505058, thread_index=<optimized out>, hb=<optimized out>) at ./src/osd/OSD.cc:11035
#20 0x0000555b4611040c in ShardedThreadPool::shardedthreadpool_worker (this=0x555b51504a08, thread_index=1) at ./src/common/WorkQueue.cc:311
#21 0x0000555b46113570 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at ./src/common/WorkQueue.h:705
#22 0x00007efcd63476db in start_thread (arg=0x7efcb660d700) at pthread_create.c:463
#23 0x00007efcd50e788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p info.last_complete
$2 = {version = 0, epoch = 0, __pad = 0}
(gdb) p info.last_update
$3 = {version = 100, epoch = 29, __pad = 0}

/a/sage-2019-09-12_21:32:16-rados-wip-sage-testing-2019-09-12-1043-distro-basic-smithi/4300731

Core dump captured, but the log is incomplete :/


Related issues

Related to RADOS - Bug #41900: auto-scaler breaks many standalone tests Resolved 09/17/2019
Copied to RADOS - Backport #42014: nautilus: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update Resolved
Copied to RADOS - Backport #42036: mimic: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update Resolved
Copied to RADOS - Backport #42037: luminous: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update Resolved

History

#1 Updated by Neha Ojha almost 2 years ago

  • Assignee set to Neha Ojha

I'll try to reproduce this.

#2 Updated by Neha Ojha almost 2 years ago

Reproduced with logs: /a/nojha-2019-09-13_21:45:51-rados:standalone-master-distro-basic-smithi/4304313/remote/smithi164/log


2019-09-13T22:52:00.103+0000 7fec3b5a5700 10 osd.1 pg_epoch: 51 pg[1.1( v 29'100 lc 0'0 (29'90,29'100] local-lis/les=50/51 n=25 ec=37/23 lis/c=50/23 les/c/f=51/24/0 sis=50) [1,0,2] r=0 lpr=50 pi=[23,50)/2 crt=29'100 mlcod 0'0 active+recovering+degraded rops=1 m=1 mbc={255={(2+0)=2}}] last_complete now 0'0 log.complete_to at end
2019-09-13T22:52:00.115+0000 7fec3b5a5700 -1 /build/ceph-15.0.0-5017-g9e59ba4/src/osd/PeeringState.cc: In function 'void PeeringState::recover_got(const hobject_t&, eversion_t, bool, ObjectStore::Transaction&)' thread 7fec3b5a5700 time 2019-09-13T22:52:00.108500+0000
/build/ceph-15.0.0-5017-g9e59ba4/src/osd/PeeringState.cc: 3671: FAILED ceph_assert(info.last_complete == info.last_update)

 ceph version 15.0.0-5017-g9e59ba4 (9e59ba40c35b5155e8a0f6d3b453dbab6e2e430f) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x558a6d293e4e]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x558a6d294029]
 3: (PeeringState::recover_got(hobject_t const&, eversion_t, bool, ceph::os::Transaction&)+0x44b) [0x558a6d57fa1b]
 4: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0xa2) [0x558a6d3f5052]
 5: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ceph::os::Transaction*)+0x65f) [0x558a6d63dfcf]
 6: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x196) [0x558a6d63ee86]
 7: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x107) [0x558a6d63f557]
 8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x558a6d4c9fc7]
 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x715) [0x558a6d474c25]
 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x558a6d302dcb]
 11: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x558a6d55a8f2]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x558a6d321eec]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x558a6d93f41c]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558a6d942580]
 15: (()+0x76db) [0x7fec5aade6db]
 16: (clone()+0x3f) [0x7fec5987e88f]

Reproduces with rados:standalone/{supported-random-distro$/{ubuntu_latest.yaml} workloads/osd.yaml}
http://pulpito.ceph.com/nojha-2019-09-13_21:45:51-rados:standalone-master-distro-basic-smithi/

#3 Updated by David Zafman almost 2 years ago

This crash didn't reproduce for me using run-standalone.sh with the auto-scaler turned off.

#4 Updated by David Zafman almost 2 years ago

  • Related to Bug #41900: auto-scaler breaks many standalone tests added

#5 Updated by David Zafman almost 2 years ago

  • Subject changed from src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update to Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update
  • Assignee changed from Neha Ojha to Sage Weil

This can be reproduced by setting the config option osd_pool_default_pg_autoscale_mode="on" and running this test:

../qa/run-standalone.sh "divergent-priors.sh TEST_divergent_2"

#6 Updated by David Zafman almost 2 years ago

  • Assignee changed from Sage Weil to Neha Ojha

#7 Updated by Sage Weil almost 2 years ago

The complete_to pointer is already at log end before recover_got() is called. I think it's because during split() we do:

  // reset last_complete, we might have modified pg_log & missing above
  pg_log.reset_complete_to(&info);
  child->pg_log.reset_complete_to(&child->info);

#8 Updated by Sage Weil almost 2 years ago

  • Status changed from 12 to Fix Under Review
  • Pull request ID set to 30500

#9 Updated by Neha Ojha almost 2 years ago

  • Assignee changed from Neha Ojha to Sage Weil

#10 Updated by Sage Weil almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus
  • Pull request ID changed from 30500 to 30503

#11 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42014: nautilus: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update added

#12 Updated by Sage Weil almost 2 years ago

  • Backport changed from nautilus to nautilus, mimic, luminous

#13 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42036: mimic: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update added

#14 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42037: luminous: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update added

#15 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
