Bug #23258

OSDs keep crashing.

Added by Jan Marquardt over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
03/07/2018
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

At least two OSDs (#11 and #20) on two different hosts in our cluster keep crashing, which prevents our cluster from reaching HEALTH_OK.
Sometimes both run for a longer time, sometimes only one crashes, sometimes both crash.

As far as I can see, both log the same backtrace every time they crash:

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x55871ead3234]
 2: (()+0x11390) [0x7feb910da390]
 3: (gsignal()+0x38) [0x7feb90075428]
 4: (abort()+0x16a) [0x7feb9007702a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55871eb169fe]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x55871e687d43]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x55871e81532a]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55871e81555e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) [0x55871e824861]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55871e733ca0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x55871e6989d3]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55871e5123b9]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55871e7b5047]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55871e53a9ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55871eb1b664]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55871eb1e6a0]
 17: (()+0x76ba) [0x7feb910d06ba]
 18: (clone()+0x6d) [0x7feb9014741d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
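The NOTE above asks for symbol resolution. One hedged sketch of how the anonymous `(()+0x...)` frames could be turned into source lines, assuming the crashing `ceph-osd` binary sits at `/usr/bin/ceph-osd` with its distro debug-symbol package installed, and that the backtrace was saved to a hypothetical `ceph-osd.log`:

```shell
# Extract the raw offsets from the pasted backtrace; frames such as
# "(()+0xa74234)" carry only an offset into the ceph-osd executable.
grep -oE '\(\)\+0x[0-9a-f]+' ceph-osd.log | sed 's/()+//' > offsets.txt

# Feed the offsets to addr2line (binutils) against the matching binary.
# -C demangles C++ names, -f prints the enclosing function, -e names
# the executable. Without matching debug symbols this prints "??:0".
xargs addr2line -Cfe /usr/bin/ceph-osd < offsets.txt
```

The binary must be byte-identical to the one that crashed (same 12.2.4 build), otherwise the offsets resolve to the wrong lines.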

As far as I can tell, everything started when we were trying to repair a scrub error on pg 0.1b2.

ceph -s:

  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            133367/16098531 objects misplaced (0.828%)
            4 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 2/16098531 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head2(active), standbys: head1, head3
    osd: 24 osds: 24 up, 24 in; 15 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5240k objects, 18357 GB
    usage:   60198 GB used, 29166 GB / 89364 GB avail
    pgs:     2/16098531 objects degraded (0.000%)
             133367/16098531 objects misplaced (0.828%)
             750 active+clean
             14  active+remapped+backfill_wait
             2   active+clean+scrubbing+deep
             1   active+remapped+backfilling
             1   active+recovery_wait+degraded+inconsistent

  io:
    recovery: 22638 kB/s, 6 objects/s

ceph osd tree:

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       87.27049 root default
-2       29.08960     host ceph1
 0   hdd  3.63620         osd.0      up  1.00000 1.00000
 1   hdd  3.63620         osd.1      up  1.00000 1.00000
 2   hdd  3.63620         osd.2      up  1.00000 1.00000
 3   hdd  3.63620         osd.3      up  1.00000 1.00000
 4   hdd  3.63620         osd.4      up  1.00000 1.00000
 5   hdd  3.63620         osd.5      up  1.00000 1.00000
 6   hdd  3.63620         osd.6      up  1.00000 1.00000
 7   hdd  3.63620         osd.7      up  1.00000 1.00000
-3       29.08960     host ceph2
 8   hdd  3.63620         osd.8      up  1.00000 1.00000
 9   hdd  3.63620         osd.9      up  1.00000 1.00000
10   hdd  3.63620         osd.10     up  1.00000 1.00000
11   hdd  3.63620         osd.11     up  1.00000 1.00000
12   hdd  3.63620         osd.12     up  1.00000 1.00000
13   hdd  3.63620         osd.13     up  1.00000 1.00000
14   hdd  3.63620         osd.14     up  1.00000 1.00000
15   hdd  3.63620         osd.15     up  1.00000 1.00000
-4       29.09129     host ceph3
16   hdd  3.63620         osd.16     up  1.00000 1.00000
18   hdd  3.63620         osd.18     up  1.00000 1.00000
19   hdd  3.63620         osd.19     up  1.00000 1.00000
20   hdd  3.63620         osd.20     up  1.00000 1.00000
21   hdd  3.63620         osd.21     up  1.00000 1.00000
22   hdd  3.63620         osd.22     up  1.00000 1.00000
23   hdd  3.63620         osd.23     up  1.00000 1.00000
24   hdd  3.63789         osd.24     up  1.00000 1.00000

History

#1 Updated by Jan Marquardt over 1 year ago

Additional info: We were running on Kraken until last week, then upgraded to 12.2.3, where the problems started, and upgraded today to 12.2.4.

#2 Updated by Jan Marquardt over 1 year ago

After extending the cluster to 40 OSDs and removing osd.11 from it, the problem has moved to osd.1:

     0> 2018-03-12 11:21:10.129505 7f8c87765700 -1 *** Caught signal (Aborted) **
 in thread 7f8c87765700 thread_name:tp_osd_tp

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x55d904890234]
 2: (()+0x11390) [0x7f8ca4598390]
 3: (gsignal()+0x38) [0x7f8ca3533428]
 4: (abort()+0x16a) [0x7f8ca353502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55d9048d39fe]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x1325) [0x55d904445305]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x55d9045d232a]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55d9045d255e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) [0x55d9045e1861]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55d9044f0ca0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x55d9044559d3]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55d9042cf3b9]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55d904572047]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55d9042f79ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55d9048d8664]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d9048db6a0]
 17: (()+0x76ba) [0x7f8ca458e6ba]
 18: (clone()+0x6d) [0x7f8ca360541d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#3 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#4 Updated by Jan Marquardt over 1 year ago

We are now seeing the same issue on osd.1, osd.11, osd.20 and osd.25, each located on a different host. osd.1 uses Filestore, the others BlueStore.

#5 Updated by Jan Marquardt over 1 year ago

  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            2302693/18411429 objects misplaced (12.507%)
            2008/6137143 objects unfound (0.033%)
            4 scrub errors
            Reduced data availability: 8 pgs inactive, 1 pg down
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1556558/18411429 objects degraded (8.454%), 226 pgs degraded, 200 pgs undersized

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head3(active), standbys: head1, head2
    osd: 40 osds: 36 up, 36 in; 266 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5993k objects, 20942 GB
    usage:   65889 GB used, 68581 GB / 131 TB avail
    pgs:     1.042% pgs not active
             1556558/18411429 objects degraded (8.454%)
             2302693/18411429 objects misplaced (12.507%)
             2008/6137143 objects unfound (0.033%)
             501 active+clean
             164 active+undersized+degraded+remapped+backfill_wait
             40  active+remapped+backfill_wait
             26  active+recovery_wait+degraded+remapped
             21  active+recovery_wait+undersized+degraded+remapped
             7   active+undersized+degraded+remapped+backfilling
             6   recovery_wait+undersized+degraded+remapped+peered
             1   undersized+degraded+remapped+backfill_wait+peered
             1   down
             1   active+undersized+degraded+remapped+inconsistent+backfill_wait

  io:
    recovery: 84261 kB/s, 22 objects/s
ID  CLASS WEIGHT    TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       143.29654 root default
 -2        26.45340     host ceph1
  0   hdd   3.63620         osd.0      up  1.00000 1.00000
  1   hdd   1.00000         osd.1    down        0 1.00000
  2   hdd   3.63620         osd.2      up  1.00000 1.00000
  3   hdd   3.63620         osd.3      up  1.00000 1.00000
  4   hdd   3.63620         osd.4      up  1.00000 1.00000
  5   hdd   3.63620         osd.5      up  1.00000 1.00000
  6   hdd   3.63620         osd.6      up  1.00000 1.00000
  7   hdd   3.63620         osd.7      up  1.00000 1.00000
 -3        29.13217     host ceph2
  8   hdd   3.63620         osd.8      up  1.00000 1.00000
  9   hdd   3.63620         osd.9      up  1.00000 1.00000
 10   hdd   3.63620         osd.10     up  1.00000 1.00000
 11   hdd   3.65749         osd.11   down        0 1.00000
 12   hdd   3.63620         osd.12     up  1.00000 1.00000
 13   hdd   3.65749         osd.13     up  1.00000 1.00000
 14   hdd   3.63620         osd.14     up  1.00000 1.00000
 15   hdd   3.63620         osd.15     up  1.00000 1.00000
 -4        29.11258     host ceph3
 16   hdd   3.63620         osd.16     up  1.00000 1.00000
 18   hdd   3.63620         osd.18     up  1.00000 1.00000
 19   hdd   3.63620         osd.19     up  1.00000 1.00000
 20   hdd   3.65749         osd.20   down        0 1.00000
 21   hdd   3.63620         osd.21     up  1.00000 1.00000
 22   hdd   3.63620         osd.22     up  1.00000 1.00000
 23   hdd   3.63620         osd.23     up  1.00000 1.00000
 24   hdd   3.63789         osd.24     up  1.00000 1.00000
 -9        29.29919     host ceph4
 17   hdd   3.66240         osd.17     up  1.00000 1.00000
 25   hdd   3.66240         osd.25   down        0 1.00000
 26   hdd   3.66240         osd.26     up  1.00000 1.00000
 27   hdd   3.66240         osd.27     up  1.00000 1.00000
 28   hdd   3.66240         osd.28     up  1.00000 1.00000
 29   hdd   3.66240         osd.29     up  1.00000 1.00000
 30   hdd   3.66240         osd.30     up  1.00000 1.00000
 31   hdd   3.66240         osd.31     up  1.00000 1.00000
-11        29.29919     host ceph5
 32   hdd   3.66240         osd.32     up  1.00000 1.00000
 33   hdd   3.66240         osd.33     up  1.00000 1.00000
 34   hdd   3.66240         osd.34     up  1.00000 1.00000
 35   hdd   3.66240         osd.35     up  1.00000 1.00000
 36   hdd   3.66240         osd.36     up  1.00000 1.00000
 37   hdd   3.66240         osd.37     up  1.00000 1.00000
 38   hdd   3.66240         osd.38     up  1.00000 1.00000
 39   hdd   3.66240         osd.39     up  1.00000 1.00000

#6 Updated by Brian Woods over 1 year ago

I was about to file a new bug and found this one; I am also seeing 0xa74234 and ceph::__ceph_assert_fail...

A while back I filed this bug: https://tracker.ceph.com/issues/23365
I haven't really had any time to work on it, but after further investigation it looks like one of my OSDs is stuck in a loop of some sort rather than failed, as I had thought.

Looking through the log I just see this repeating over and over:

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x558d65225234]
 2: (()+0x11390) [0x7f66bdbd5390]
 3: (gsignal()+0x38) [0x7f66bcb70428]
 4: (abort()+0x16a) [0x7f66bcb7202a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558d652689fe]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x435) [0x558d64eed115]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x396) [0x558d64eca8a6]
 8: (OSD::build_past_intervals_parallel()+0xd9d) [0x558d64c7417d]
 9: (OSD::load_pgs()+0x14fb) [0x558d64c76c7b]
 10: (OSD::init()+0x2217) [0x558d64c94d07]
 11: (main()+0x2f07) [0x558d64ba3f17]
 12: (__libc_start_main()+0xf0) [0x7f66bcb5b830]
 13: (_start()+0x29) [0x558d64c2f6b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---

#7 Updated by Greg Farnum over 1 year ago

Brian, that's a separate bug; the code address you've picked up on is just part of the generic failure handling code.

Jan, do you have any other logs of this event? There are several asserts in the PrimaryLogPG::on_local_recover function that could be involved with this.
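For reference, one hedged way to capture more context around the assert would be raising the OSD debug levels before reproducing (standard ceph.conf subsystem settings; level 20 logs grow very quickly, so revert afterwards):

```ini
# Sketch of a ceph.conf fragment for the affected hosts (assumption:
# on-disk config; the same settings can usually be injected at runtime).
[osd]
    debug osd = 20
    debug ms = 1
```

At runtime the equivalent is typically `ceph tell osd.<id> injectargs '--debug-osd 20 --debug-ms 1'`, applied only to the crashing OSDs.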
