Bug #45202
Repeatedly OSD crashes in PrimaryLogPG::hit_set_trim()
Description
After network troubles I got 1 PG in the recovery_unfound state.
I tried to solve this problem using the command:
ceph pg 2.f8 mark_unfound_lost revert
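For reference, a minimal sketch of how the unfound objects and PG state could be inspected before and after such a revert (standard ceph CLI; the PG id 2.f8 is taken from above):
# overall health and which PGs report unfound objects
ceph health detail
# detailed state of the affected PG
ceph pg 2.f8 query
# list the missing/unfound objects of the PG (this release still calls it list_missing)
ceph pg 2.f8 list_missing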
About one hour after connectivity was restored, OSD.12 crashed:
ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
 1: (()+0x911e70) [0x564d0067fe70]
 2: (()+0xf5d0) [0x7f1272dad5d0]
 3: (gsignal()+0x37) [0x7f1271dce2c7]
 4: (abort()+0x148) [0x7f1271dcf9b8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f12762252b2]
 6: (()+0x25a337) [0x7f1276225337]
 7: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x930) [0x564d002ab480]
 8: (PrimaryLogPG::hit_set_persist()+0xa0c) [0x564d002afafc]
 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2989) [0x564d002c5f09]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x564d002cac09]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x564d00124c87]
 12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564d0039d8c2]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x564d00144ae2]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f127622aec3]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f127622bab0]
 16: (()+0x7dd5) [0x7f1272da5dd5]
 17: (clone()+0x6d) [0x7f1271e95f6d]
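Since hit_set_trim() removes hit set archive objects from the cache pool, the hit set configuration and the archive objects themselves may be worth inspecting. A minimal sketch, assuming the standard ceph/rados CLI on this release; vms-cache is the cache pool mentioned below:
# hit set configuration of the cache pool
ceph osd pool get vms-cache hit_set_type
ceph osd pool get vms-cache hit_set_count
ceph osd pool get vms-cache hit_set_period
# list objects in all namespaces and pick out the hit set archives (if visible on this release)
rados -p vms-cache ls --all | grep hit_set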
Crashes of this OSD were repeated many times.
I tried the following (a sketch of the concrete command forms follows this list):
- deep-scrub for all PGs on this OSD;
- ceph-bluestore-tool fsck --deep yes for this OSD;
- upgrading Ceph on this node from 13.2.4 to 13.2.9.
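For completeness, a sketch of the concrete forms these steps could take; the OSD id 12, the data path, and the systemd unit name are assumptions based on default deployments, the commands themselves are standard:
# schedule a deep scrub of the PGs this OSD is primary for
ceph osd deep-scrub 12
# offline BlueStore consistency check (the OSD must be stopped first)
systemctl stop ceph-osd@12
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12 --deep yes
systemctl start ceph-osd@12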
After this I tried to flush PGs from the cache pool using:
rados -p vms-cache cache-try-flush-evict-all
And got a crash of OSD.13 on another node.
Additionally, both OSDs crash within seconds after start (from ~5 seconds to <60 seconds).
I set:
ceph osd tier cache-mode vms-cache forward --yes-i-really-mean-it
And decreased target_max_bytes.
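For reference, a sketch of the pool setting used to shrink the cache tier; target_max_bytes is the standard pool parameter, but the exact value shown is only an illustrative assumption:
# lower the cache tier's target size (example value only)
ceph osd pool set vms-cache target_max_bytes 100000000000
# verify the new value
ceph osd pool get vms-cache target_max_bytes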
About one hour after this change the OSDs stopped crashing, and for the last ~30 minutes they have been working properly.
But I think that when I continue to flush PGs from the cache pool, the OSDs may crash again.
I collected the log output of both OSDs and uploaded the logs using ceph-post-file:
ceph-post-file: 900533d2-8558-11ea-ad44-00144fca4038
ceph-post-file: d45082b8-8558-11ea-ad44-00144fca4038
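For anyone reproducing this, the logs were uploaded roughly as in the sketch below; the log file paths are assumptions based on default locations:
ceph-post-file /var/log/ceph/ceph-osd.12.log
ceph-post-file /var/log/ceph/ceph-osd.13.log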