Bug #23258

OSDs keep crashing.

Added by Jan Marquardt about 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

At least two OSDs (#11 and #20) on two different hosts in our cluster keep crashing, which prevents the cluster from reaching HEALTH_OK.
Sometimes both run fine for a while, sometimes only one crashes, sometimes both crash.

As far as I can see, both log the same backtrace every time they crash:

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x55871ead3234]
 2: (()+0x11390) [0x7feb910da390]
 3: (gsignal()+0x38) [0x7feb90075428]
 4: (abort()+0x16a) [0x7feb9007702a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55871eb169fe]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x55871e687d43]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x55871e81532a]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55871e81555e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) [0x55871e824861]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55871e733ca0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x55871e6989d3]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55871e5123b9]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55871e7b5047]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55871e53a9ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55871eb1b664]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55871eb1e6a0]
 17: (()+0x76ba) [0x7feb910d06ba]
 18: (clone()+0x6d) [0x7feb9014741d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
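
If it helps with interpreting the addresses, I assume the frames could be resolved against our ceph-osd binary once matching debug symbols are installed, with something like this (untested on our side; the offset is taken from frame 1 above):

 # assumed: debug symbols matching ceph-osd 12.2.4 are installed
 # resolve the first in-binary frame (offset 0xa74234) to a function and source line
 addr2line -C -f -e /usr/bin/ceph-osd 0xa74234
 # or produce the annotated disassembly mentioned in the NOTE above
 objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump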

As far as I can tell, everything started when we tried to repair a scrub error on pg 0.1b2.
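
From memory, the repair attempt looked roughly like this (exact invocation not recorded):

 # list the objects the scrub flagged as inconsistent in pg 0.1b2
 rados list-inconsistent-obj 0.1b2 --format=json-pretty
 # then instruct the primary OSD to repair the pg
 ceph pg repair 0.1b2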

ceph -s:

  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            133367/16098531 objects misplaced (0.828%)
            4 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 2/16098531 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head2(active), standbys: head1, head3
    osd: 24 osds: 24 up, 24 in; 15 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5240k objects, 18357 GB
    usage:   60198 GB used, 29166 GB / 89364 GB avail
    pgs:     2/16098531 objects degraded (0.000%)
             133367/16098531 objects misplaced (0.828%)
             750 active+clean
             14  active+remapped+backfill_wait
             2   active+clean+scrubbing+deep
             1   active+remapped+backfilling
             1   active+recovery_wait+degraded+inconsistent

  io:
    recovery: 22638 kB/s, 6 objects/s

ceph osd tree:

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       87.27049 root default
-2       29.08960     host ceph1
 0   hdd  3.63620         osd.0      up  1.00000 1.00000
 1   hdd  3.63620         osd.1      up  1.00000 1.00000
 2   hdd  3.63620         osd.2      up  1.00000 1.00000
 3   hdd  3.63620         osd.3      up  1.00000 1.00000
 4   hdd  3.63620         osd.4      up  1.00000 1.00000
 5   hdd  3.63620         osd.5      up  1.00000 1.00000
 6   hdd  3.63620         osd.6      up  1.00000 1.00000
 7   hdd  3.63620         osd.7      up  1.00000 1.00000
-3       29.08960     host ceph2
 8   hdd  3.63620         osd.8      up  1.00000 1.00000
 9   hdd  3.63620         osd.9      up  1.00000 1.00000
10   hdd  3.63620         osd.10     up  1.00000 1.00000
11   hdd  3.63620         osd.11     up  1.00000 1.00000
12   hdd  3.63620         osd.12     up  1.00000 1.00000
13   hdd  3.63620         osd.13     up  1.00000 1.00000
14   hdd  3.63620         osd.14     up  1.00000 1.00000
15   hdd  3.63620         osd.15     up  1.00000 1.00000
-4       29.09129     host ceph3
16   hdd  3.63620         osd.16     up  1.00000 1.00000
18   hdd  3.63620         osd.18     up  1.00000 1.00000
19   hdd  3.63620         osd.19     up  1.00000 1.00000
20   hdd  3.63620         osd.20     up  1.00000 1.00000
21   hdd  3.63620         osd.21     up  1.00000 1.00000
22   hdd  3.63620         osd.22     up  1.00000 1.00000
23   hdd  3.63620         osd.23     up  1.00000 1.00000
24   hdd  3.63789         osd.24     up  1.00000 1.00000
