Bug #23258
OSDs keep crashing.
Status:
Open
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
At least two OSDs (#11 and #20) on two different hosts in our cluster keep crashing, which prevents the cluster from reaching HEALTH_OK.
Sometimes both run for a longer period, sometimes only one crashes, and sometimes both do.
As far as I can see, both log the same backtrace every time they crash:
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x55871ead3234]
 2: (()+0x11390) [0x7feb910da390]
 3: (gsignal()+0x38) [0x7feb90075428]
 4: (abort()+0x16a) [0x7feb9007702a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55871eb169fe]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x55871e687d43]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x55871e81532a]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55871e81555e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) [0x55871e824861]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55871e733ca0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x55871e6989d3]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55871e5123b9]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55871e7b5047]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55871e53a9ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55871eb1b664]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55871eb1e6a0]
 17: (()+0x76ba) [0x7feb910d06ba]
 18: (clone()+0x6d) [0x7feb9014741d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
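The NOTE at the end of the trace asks for an annotated dump of the executable. A minimal sketch of how that can be produced on one of the affected hosts (the binary path and the debug-symbol package name are assumptions for a stock Ubuntu/Luminous install):

# Sketch only: install debug symbols, then dump annotated disassembly as requested by the NOTE.
# Package and path names are assumptions and may differ on other distributions.
apt-get install ceph-osd-dbg
objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump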
As far as I can tell, everything started when we were trying to repair a scrub error on pg 0.1b2.
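For reference, a minimal sketch of the usual repair sequence for an inconsistent pg; these are the standard commands, not necessarily the exact ones we ran:

# Sketch of the standard scrub-repair sequence for pg 0.1b2 (commands as in stock Luminous).
ceph pg deep-scrub 0.1b2   # re-run a deep scrub to refresh the inconsistency report
ceph pg repair 0.1b2       # ask the primary OSD to repair the inconsistent replicas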
ceph -s:
  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            133367/16098531 objects misplaced (0.828%)
            4 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 2/16098531 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head2(active), standbys: head1, head3
    osd: 24 osds: 24 up, 24 in; 15 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5240k objects, 18357 GB
    usage:   60198 GB used, 29166 GB / 89364 GB avail
    pgs:     2/16098531 objects degraded (0.000%)
             133367/16098531 objects misplaced (0.828%)
             750 active+clean
             14  active+remapped+backfill_wait
             2   active+clean+scrubbing+deep
             1   active+remapped+backfilling
             1   active+recovery_wait+degraded+inconsistent

  io:
    recovery: 22638 kB/s, 6 objects/s
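The 4 scrub errors and the inconsistent pg reported above can be broken down per object; a sketch of the inspection commands (the pool name in the second command is an assumption, since the cluster has a single pool with id 0):

ceph health detail                                      # names the inconsistent pg and degraded objects
rados list-inconsistent-pg rbd                          # pool name "rbd" is an assumption (pool id 0)
rados list-inconsistent-obj 0.1b2 --format=json-pretty  # per-object scrub error details for the pg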
ceph osd tree:
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       87.27049 root default
-2       29.08960     host ceph1
 0   hdd  3.63620         osd.0      up  1.00000 1.00000
 1   hdd  3.63620         osd.1      up  1.00000 1.00000
 2   hdd  3.63620         osd.2      up  1.00000 1.00000
 3   hdd  3.63620         osd.3      up  1.00000 1.00000
 4   hdd  3.63620         osd.4      up  1.00000 1.00000
 5   hdd  3.63620         osd.5      up  1.00000 1.00000
 6   hdd  3.63620         osd.6      up  1.00000 1.00000
 7   hdd  3.63620         osd.7      up  1.00000 1.00000
-3       29.08960     host ceph2
 8   hdd  3.63620         osd.8      up  1.00000 1.00000
 9   hdd  3.63620         osd.9      up  1.00000 1.00000
10   hdd  3.63620         osd.10     up  1.00000 1.00000
11   hdd  3.63620         osd.11     up  1.00000 1.00000
12   hdd  3.63620         osd.12     up  1.00000 1.00000
13   hdd  3.63620         osd.13     up  1.00000 1.00000
14   hdd  3.63620         osd.14     up  1.00000 1.00000
15   hdd  3.63620         osd.15     up  1.00000 1.00000
-4       29.09129     host ceph3
16   hdd  3.63620         osd.16     up  1.00000 1.00000
18   hdd  3.63620         osd.18     up  1.00000 1.00000
19   hdd  3.63620         osd.19     up  1.00000 1.00000
20   hdd  3.63620         osd.20     up  1.00000 1.00000
21   hdd  3.63620         osd.21     up  1.00000 1.00000
22   hdd  3.63620         osd.22     up  1.00000 1.00000
23   hdd  3.63620         osd.23     up  1.00000 1.00000
24   hdd  3.63789         osd.24     up  1.00000 1.00000
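The two crashing OSDs are osd.11 (host ceph2) and osd.20 (host ceph3). A sketch of how more verbose logs could be captured before the next crash, assuming the extra log volume is acceptable:

# Sketch only: raise OSD and messenger debug levels on the two affected daemons.
ceph tell osd.11 injectargs '--debug_osd 20 --debug_ms 1'
ceph tell osd.20 injectargs '--debug_osd 20 --debug_ms 1'
# Collect /var/log/ceph/ceph-osd.11.log and ceph-osd.20.log after the next crash.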