Bug #41816
Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update
Description
(gdb) bt
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000555b460bf090 in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:326
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007efcd5006801 in __GI_abort () at abort.c:79
#6  0x0000555b45a64d1f in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, func=0x555b4696e440 <PeeringState::recover_got(hobject_t const&, eversion_t, bool, ceph::os::Transaction&)::__PRETTY_FUNCTION__> "void PeeringState::recover_got(const hobject_t&, eversion_t, bool, ObjectStore::Transaction&)") at ./src/common/assert.cc:73
#7  0x0000555b45a64ea9 in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:78
#8  0x0000555b45d507cb in PeeringState::recover_got (this=this@entry=0x555b53b4d428, oid=..., v=..., is_delete=is_delete@entry=false, t=...) at ./src/osd/PeeringState.cc:3671
#9  0x0000555b45bc5db2 in PrimaryLogPG::on_local_recover (this=<optimized out>, hoid=..., _recovery_info=..., obc=std::shared_ptr<ObjectContext> (use count 3, weak count 1) = {...}, is_delete=<optimized out>, t=0x7efcb6609af0) at ./src/osd/PrimaryLogPG.cc:388
#10 0x0000555b45e0ed8f in ReplicatedBackend::handle_pull_response (this=this@entry=0x555b5076f680, from=..., from@entry=..., pop=..., response=0x555b51922400, to_continue=to_continue@entry=0x7efcb6609ad0, t=t@entry=0x7efcb6609af0) at ./src/osd/ReplicatedBackend.cc:1868
#11 0x0000555b45e0fc46 in ReplicatedBackend::_do_pull_response (this=this@entry=0x555b5076f680, op=...) at ./src/osd/ReplicatedBackend.cc:845
#12 0x0000555b45e10317 in ReplicatedBackend::do_push (op=..., this=0x555b5076f680) at ./src/osd/ReplicatedBackend.h:233
#13 ReplicatedBackend::_handle_message (this=0x555b5076f680, op=...) at ./src/osd/ReplicatedBackend.cc:193
#14 0x0000555b45c9ad77 in PGBackend::handle_message (this=<optimized out>, op=...) at ./src/osd/PGBackend.cc:114
#15 0x0000555b45c45a45 in PrimaryLogPG::do_request (this=0x555b53b4c000, op=..., handle=...) at ./src/osd/PrimaryLogPG.cc:1548
#16 0x0000555b45ad42fb in OSD::dequeue_op (this=this@entry=0x555b51504000, pg=..., op=..., handle=...) at ./src/osd/OSD.cc:9843
#17 0x0000555b45d2b6a2 in PGOpItem::run (this=<optimized out>, osd=0x555b51504000, sdata=<optimized out>, pg=..., handle=...) at ./src/osd/OpQueueItem.cc:24
#18 0x0000555b45af341c in OpQueueItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7efcb660a510) at ./src/osd/OpQueueItem.h:134
#19 OSD::ShardedOpWQ::_process (this=0x555b51505058, thread_index=<optimized out>, hb=<optimized out>) at ./src/osd/OSD.cc:11035
#20 0x0000555b4611040c in ShardedThreadPool::shardedthreadpool_worker (this=0x555b51504a08, thread_index=1) at ./src/common/WorkQueue.cc:311
#21 0x0000555b46113570 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at ./src/common/WorkQueue.h:705
#22 0x00007efcd63476db in start_thread (arg=0x7efcb660d700) at pthread_create.c:463
#23 0x00007efcd50e788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p info.last_complete
$2 = {version = 0, epoch = 0, __pad = 0}
(gdb) p info.last_update
$3 = {version = 100, epoch = 29, __pad = 0}
/a/sage-2019-09-12_21:32:16-rados-wip-sage-testing-2019-09-12-1043-distro-basic-smithi/4300731
core, but incomplete log :/
Related issues
History
#1 Updated by Neha Ojha over 4 years ago
- Assignee set to Neha Ojha
I'll try to see if I can reproduce this.
#2 Updated by Neha Ojha over 4 years ago
Reproduced with logs: /a/nojha-2019-09-13_21:45:51-rados:standalone-master-distro-basic-smithi/4304313/remote/smithi164/log
2019-09-13T22:52:00.103+0000 7fec3b5a5700 10 osd.1 pg_epoch: 51 pg[1.1( v 29'100 lc 0'0 (29'90,29'100] local-lis/les=50/51 n=25 ec=37/23 lis/c=50/23 les/c/f=51/24/0 sis=50) [1,0,2] r=0 lpr=50 pi=[23,50)/2 crt=29'100 mlcod 0'0 active+recovering+degraded rops=1 m=1 mbc={255={(2+0)=2}}] last_complete now 0'0 log.complete_to at end
2019-09-13T22:52:00.115+0000 7fec3b5a5700 -1 /build/ceph-15.0.0-5017-g9e59ba4/src/osd/PeeringState.cc: In function 'void PeeringState::recover_got(const hobject_t&, eversion_t, bool, ObjectStore::Transaction&)' thread 7fec3b5a5700 time 2019-09-13T22:52:00.108500+0000
/build/ceph-15.0.0-5017-g9e59ba4/src/osd/PeeringState.cc: 3671: FAILED ceph_assert(info.last_complete == info.last_update)
ceph version 15.0.0-5017-g9e59ba4 (9e59ba40c35b5155e8a0f6d3b453dbab6e2e430f) octopus (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x558a6d293e4e]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x558a6d294029]
3: (PeeringState::recover_got(hobject_t const&, eversion_t, bool, ceph::os::Transaction&)+0x44b) [0x558a6d57fa1b]
4: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0xa2) [0x558a6d3f5052]
5: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ceph::os::Transaction*)+0x65f) [0x558a6d63dfcf]
6: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x196) [0x558a6d63ee86]
7: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x107) [0x558a6d63f557]
8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x558a6d4c9fc7]
9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x715) [0x558a6d474c25]
10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x558a6d302dcb]
11: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x558a6d55a8f2]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x558a6d321eec]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x558a6d93f41c]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558a6d942580]
15: (()+0x76db) [0x7fec5aade6db]
16: (clone()+0x3f) [0x7fec5987e88f]
Reproduces with rados:standalone/{supported-random-distro$/{ubuntu_latest.yaml} workloads/osd.yaml}
http://pulpito.ceph.com/nojha-2019-09-13_21:45:51-rados:standalone-master-distro-basic-smithi/
#3 Updated by David Zafman over 4 years ago
This crash didn't reproduce for me using run-standalone.sh with the auto scaler turned off.
#4 Updated by David Zafman over 4 years ago
- Related to Bug #41900: auto-scaler breaks many standalone tests added
#5 Updated by David Zafman over 4 years ago
- Subject changed from src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update to Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update
- Assignee changed from Neha Ojha to Sage Weil
This can be reproduced by setting the config option osd_pool_default_pg_autoscale_mode="on" and executing this test:
../qa/run-standalone.sh "divergent-priors.sh TEST_divergent_2"
#6 Updated by David Zafman over 4 years ago
- Assignee changed from Sage Weil to Neha Ojha
#7 Updated by Sage Weil over 4 years ago
The complete_to pointer is already at log end before recover_got() is called. I think it's because during split() we do
// reset last_complete, we might have modified pg_log & missing above
pg_log.reset_complete_to(&info);
child->pg_log.reset_complete_to(&child->info);
#8 Updated by Sage Weil over 4 years ago
- Status changed from 12 to Fix Under Review
- Pull request ID set to 30500
#9 Updated by Neha Ojha over 4 years ago
- Assignee changed from Neha Ojha to Sage Weil
#10 Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to nautilus
- Pull request ID changed from 30500 to 30503
#11 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42014: nautilus: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update added
#12 Updated by Sage Weil over 4 years ago
- Backport changed from nautilus to nautilus, mimic, luminous
#13 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42036: mimic: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update added
#14 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42037: luminous: Enable auto-scaler and get src/osd/PeeringState.cc:3671: failed assert info.last_complete == info.last_update added
#15 Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".