Bug #45202

OSD repeatedly crashes in PrimaryLogPG::hit_set_trim()

Added by KOT MATPOCKuH about 4 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After network troubles I got 1 PG in the recovery_unfound state.
I tried to solve this problem using the command:

ceph pg 2.f8 mark_unfound_lost revert
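
The recovery_unfound state was visible in the usual status output beforehand; these are just the generic forms of those commands, not copies from my logs:

ceph health detail
ceph pg 2.f8 query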

About one hour after connectivity was restored, OSD.12 crashed:

 ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
 1: (()+0x911e70) [0x564d0067fe70]
 2: (()+0xf5d0) [0x7f1272dad5d0]
 3: (gsignal()+0x37) [0x7f1271dce2c7]
 4: (abort()+0x148) [0x7f1271dcf9b8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f12762252b2]
 6: (()+0x25a337) [0x7f1276225337]
 7: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x930) [0x564d002ab480]
 8: (PrimaryLogPG::hit_set_persist()+0xa0c) [0x564d002afafc]
 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2989) [0x564d002c5f09]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x564d002cac09]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x564d00124c87]
 12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564d0039d8c2]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x564d00144ae2]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f127622aec3]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f127622bab0]
 16: (()+0x7dd5) [0x7f1272da5dd5]
 17: (clone()+0x6d) [0x7f1271e95f6d]

Crashes for this OSD repeated many times.
I tried the following (illustrative command forms are sketched after this list):
- a deep scrub of all PGs on this OSD;
- ceph-bluestore-tool fsck --deep yes for this OSD;
- upgrading ceph on this node from 13.2.4 to 13.2.9.
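
The first two steps correspond to commands of roughly this form; the OSD id and data path here are illustrative, and ceph-bluestore-tool needs the OSD stopped:

ceph osd deep-scrub osd.12
systemctl stop ceph-osd@12
ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12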

After this I tried to flush PGs from the cache pool using:

rados -p vms-cache cache-try-flush-evict-all
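
While this runs, the draining of the cache pool can be watched with the usual pool statistics commands, e.g.:

ceph df detail
ceph osd pool stats vms-cache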

This produced a crash on OSD.13 on another node as well.
Additionally, both OSDs now crash within seconds of starting (from ~5 seconds to <60 seconds).

I set:

ceph osd tier cache-mode vms-cache forward --yes-i-really-mean-it

And decreased target_max_bytes on the cache pool (the generic form of that command is shown below).
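
The generic form of that setting, with a placeholder instead of the actual value, is:

ceph osd pool set vms-cache target_max_bytes <bytes>
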
About one hour after this change the OSDs stopped crashing, and for the last ~30 minutes they have been working properly.
But I think that when I continue flushing PGs from the cache pool, the OSDs may crash again.

I collected log output from both OSDs and uploaded the logs using ceph-post-file:
ceph-post-file: 900533d2-8558-11ea-ad44-00144fca4038
ceph-post-file: d45082b8-8558-11ea-ad44-00144fca4038
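
(ceph-post-file is run against the log files directly; the paths below are only illustrative, not the exact ones used:)

ceph-post-file /var/log/ceph/ceph-osd.12.log /var/log/ceph/ceph-osd.13.log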

My problem may be a duplicate of:
Bug #19185
Bug #40388
