Bug #23258
OSDs keep crashing
Status: Open
Description
At least two OSDs (osd.11 and osd.20) on two different hosts in our cluster keep crashing, which prevents the cluster from reaching HEALTH_OK.
Sometimes both run for a while, sometimes only one crashes, and sometimes both crash.
As far as I can see, both log the same backtrace every time they crash:
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x55871ead3234]
 2: (()+0x11390) [0x7feb910da390]
 3: (gsignal()+0x38) [0x7feb90075428]
 4: (abort()+0x16a) [0x7feb9007702a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55871eb169fe]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x55871e687d43]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x55871e81532a]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55871e81555e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) [0x55871e824861]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55871e733ca0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x55871e6989d3]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55871e5123b9]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55871e7b5047]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55871e53a9ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55871eb1b664]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55871eb1e6a0]
 17: (()+0x76ba) [0x7feb910d06ba]
 18: (clone()+0x6d) [0x7feb9014741d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
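To turn the raw offsets in a trace like this into source locations, the matching debug symbols are needed; a minimal sketch, assuming the ceph-osd binary and its debug package (e.g. ceph-osd-dbg / ceph-debuginfo) for exactly 12.2.4 are installed:

# Resolve frame 1's offset (0xa74234, from the trace above) to file:line.
addr2line -Cfe /usr/bin/ceph-osd 0xa74234
# Or disassemble with interleaved source, as the NOTE suggests:
objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm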
As far as I can tell, everything started when we tried to repair a scrub error on pg 0.1b2.
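For reference, a typical luminous workflow for inspecting and repairing a scrub error like this (a sketch; not necessarily the exact commands we ran):

# Show which pg is inconsistent and why.
ceph health detail
# List the inconsistent objects in the pg.
rados list-inconsistent-obj 0.1b2 --format=json-pretty
# Ask the primary OSD to repair the pg.
ceph pg repair 0.1b2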
ceph -s:
  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            133367/16098531 objects misplaced (0.828%)
            4 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 2/16098531 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head2(active), standbys: head1, head3
    osd: 24 osds: 24 up, 24 in; 15 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5240k objects, 18357 GB
    usage:   60198 GB used, 29166 GB / 89364 GB avail
    pgs:     2/16098531 objects degraded (0.000%)
             133367/16098531 objects misplaced (0.828%)
             750 active+clean
             14  active+remapped+backfill_wait
             2   active+clean+scrubbing+deep
             1   active+remapped+backfilling
             1   active+recovery_wait+degraded+inconsistent

  io:
    recovery: 22638 kB/s, 6 objects/s
ceph osd tree:
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       87.27049 root default
-2       29.08960     host ceph1
 0   hdd  3.63620         osd.0      up  1.00000 1.00000
 1   hdd  3.63620         osd.1      up  1.00000 1.00000
 2   hdd  3.63620         osd.2      up  1.00000 1.00000
 3   hdd  3.63620         osd.3      up  1.00000 1.00000
 4   hdd  3.63620         osd.4      up  1.00000 1.00000
 5   hdd  3.63620         osd.5      up  1.00000 1.00000
 6   hdd  3.63620         osd.6      up  1.00000 1.00000
 7   hdd  3.63620         osd.7      up  1.00000 1.00000
-3       29.08960     host ceph2
 8   hdd  3.63620         osd.8      up  1.00000 1.00000
 9   hdd  3.63620         osd.9      up  1.00000 1.00000
10   hdd  3.63620         osd.10     up  1.00000 1.00000
11   hdd  3.63620         osd.11     up  1.00000 1.00000
12   hdd  3.63620         osd.12     up  1.00000 1.00000
13   hdd  3.63620         osd.13     up  1.00000 1.00000
14   hdd  3.63620         osd.14     up  1.00000 1.00000
15   hdd  3.63620         osd.15     up  1.00000 1.00000
-4       29.09129     host ceph3
16   hdd  3.63620         osd.16     up  1.00000 1.00000
18   hdd  3.63620         osd.18     up  1.00000 1.00000
19   hdd  3.63620         osd.19     up  1.00000 1.00000
20   hdd  3.63620         osd.20     up  1.00000 1.00000
21   hdd  3.63620         osd.21     up  1.00000 1.00000
22   hdd  3.63620         osd.22     up  1.00000 1.00000
23   hdd  3.63620         osd.23     up  1.00000 1.00000
24   hdd  3.63789         osd.24     up  1.00000 1.00000
Updated by Jan Marquardt about 6 years ago
Additional info: We were running Kraken until last week, then upgraded to 12.2.3, where the problems started, and upgraded today to 12.2.4.
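For the record, a quick way to verify all daemons actually restarted on the new release (a sketch; `ceph versions` is available from luminous on):

# Summarize the release each running daemon reports.
ceph versions
# Per-OSD detail:
ceph tell osd.* version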
Updated by Jan Marquardt about 6 years ago
After extending the cluster to 40 OSDs and removing osd.11 from it, the problem has moved to osd.1:
     0> 2018-03-12 11:21:10.129505 7f8c87765700 -1 *** Caught signal (Aborted) **
 in thread 7f8c87765700 thread_name:tp_osd_tp

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x55d904890234]
 2: (()+0x11390) [0x7f8ca4598390]
 3: (gsignal()+0x38) [0x7f8ca3533428]
 4: (abort()+0x16a) [0x7f8ca353502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55d9048d39fe]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x1325) [0x55d904445305]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x55d9045d232a]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55d9045d255e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) [0x55d9045e1861]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55d9044f0ca0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x55d9044559d3]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55d9042cf3b9]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55d904572047]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55d9042f79ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55d9048d8664]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d9048db6a0]
 17: (()+0x76ba) [0x7f8ca458e6ba]
 18: (clone()+0x6d) [0x7f8ca360541d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
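For completeness, removing osd.11 followed roughly the standard luminous procedure (a sketch, not a verbatim transcript of what we ran):

ceph osd out 11                # stop placing data on the OSD
systemctl stop ceph-osd@11     # stop the daemon on its host
ceph osd crush remove osd.11   # remove it from the CRUSH map
ceph auth del osd.11           # delete its authentication key
ceph osd rm 11                 # remove it from the OSD map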
Updated by Greg Farnum about 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
Updated by Jan Marquardt about 6 years ago
We are now having the same issue on osd.1, osd.11, osd.20 and osd.25, each located on a different host. osd.1 uses FileStore; the others use BlueStore.
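The backend of each OSD can be confirmed from its metadata; a sketch:

# osd_objectstore reports "filestore" or "bluestore" per OSD.
ceph osd metadata 1 | grep osd_objectstore
ceph osd metadata 11 | grep osd_objectstore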
Updated by Jan Marquardt about 6 years ago
  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            2302693/18411429 objects misplaced (12.507%)
            2008/6137143 objects unfound (0.033%)
            4 scrub errors
            Reduced data availability: 8 pgs inactive, 1 pg down
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1556558/18411429 objects degraded (8.454%), 226 pgs degraded, 200 pgs undersized

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head3(active), standbys: head1, head2
    osd: 40 osds: 36 up, 36 in; 266 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5993k objects, 20942 GB
    usage:   65889 GB used, 68581 GB / 131 TB avail
    pgs:     1.042% pgs not active
             1556558/18411429 objects degraded (8.454%)
             2302693/18411429 objects misplaced (12.507%)
             2008/6137143 objects unfound (0.033%)
             501 active+clean
             164 active+undersized+degraded+remapped+backfill_wait
             40  active+remapped+backfill_wait
             26  active+recovery_wait+degraded+remapped
             21  active+recovery_wait+undersized+degraded+remapped
             7   active+undersized+degraded+remapped+backfilling
             6   recovery_wait+undersized+degraded+remapped+peered
             1   undersized+degraded+remapped+backfill_wait+peered
             1   down
             1   active+undersized+degraded+remapped+inconsistent+backfill_wait

  io:
    recovery: 84261 kB/s, 22 objects/s
 ID CLASS WEIGHT    TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       143.29654 root default
 -2        26.45340     host ceph1
  0   hdd   3.63620         osd.0      up  1.00000 1.00000
  1   hdd   1.00000         osd.1    down        0 1.00000
  2   hdd   3.63620         osd.2      up  1.00000 1.00000
  3   hdd   3.63620         osd.3      up  1.00000 1.00000
  4   hdd   3.63620         osd.4      up  1.00000 1.00000
  5   hdd   3.63620         osd.5      up  1.00000 1.00000
  6   hdd   3.63620         osd.6      up  1.00000 1.00000
  7   hdd   3.63620         osd.7      up  1.00000 1.00000
 -3        29.13217     host ceph2
  8   hdd   3.63620         osd.8      up  1.00000 1.00000
  9   hdd   3.63620         osd.9      up  1.00000 1.00000
 10   hdd   3.63620         osd.10     up  1.00000 1.00000
 11   hdd   3.65749         osd.11   down        0 1.00000
 12   hdd   3.63620         osd.12     up  1.00000 1.00000
 13   hdd   3.65749         osd.13     up  1.00000 1.00000
 14   hdd   3.63620         osd.14     up  1.00000 1.00000
 15   hdd   3.63620         osd.15     up  1.00000 1.00000
 -4        29.11258     host ceph3
 16   hdd   3.63620         osd.16     up  1.00000 1.00000
 18   hdd   3.63620         osd.18     up  1.00000 1.00000
 19   hdd   3.63620         osd.19     up  1.00000 1.00000
 20   hdd   3.65749         osd.20   down        0 1.00000
 21   hdd   3.63620         osd.21     up  1.00000 1.00000
 22   hdd   3.63620         osd.22     up  1.00000 1.00000
 23   hdd   3.63620         osd.23     up  1.00000 1.00000
 24   hdd   3.63789         osd.24     up  1.00000 1.00000
 -9        29.29919     host ceph4
 17   hdd   3.66240         osd.17     up  1.00000 1.00000
 25   hdd   3.66240         osd.25   down        0 1.00000
 26   hdd   3.66240         osd.26     up  1.00000 1.00000
 27   hdd   3.66240         osd.27     up  1.00000 1.00000
 28   hdd   3.66240         osd.28     up  1.00000 1.00000
 29   hdd   3.66240         osd.29     up  1.00000 1.00000
 30   hdd   3.66240         osd.30     up  1.00000 1.00000
 31   hdd   3.66240         osd.31     up  1.00000 1.00000
-11        29.29919     host ceph5
 32   hdd   3.66240         osd.32     up  1.00000 1.00000
 33   hdd   3.66240         osd.33     up  1.00000 1.00000
 34   hdd   3.66240         osd.34     up  1.00000 1.00000
 35   hdd   3.66240         osd.35     up  1.00000 1.00000
 36   hdd   3.66240         osd.36     up  1.00000 1.00000
 37   hdd   3.66240         osd.37     up  1.00000 1.00000
 38   hdd   3.66240         osd.38     up  1.00000 1.00000
 39   hdd   3.66240         osd.39     up  1.00000 1.00000
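To see where the 2008 unfound objects live, the usual luminous commands are (a sketch; 0.1b2 is just an example pg id, taken from the scrub error above):

# List pgs that currently report unfound objects.
ceph health detail | grep unfound
# Show the missing/unfound objects of one pg.
ceph pg 0.1b2 list_missing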
Updated by Brian Woods about 6 years ago
I was about to open a new bug and found this one; I am also seeing 0xa74234 and ceph::__ceph_assert_fail...
A while back I filed this bug: https://tracker.ceph.com/issues/23365
I haven't really had any time to work on it, but after further investigation it looks like one of my OSDs is in a loop of some sort rather than failed, as I had thought.
Looking through the log, I just see this repeating over and over:
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x558d65225234]
 2: (()+0x11390) [0x7f66bdbd5390]
 3: (gsignal()+0x38) [0x7f66bcb70428]
 4: (abort()+0x16a) [0x7f66bcb7202a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558d652689fe]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x435) [0x558d64eed115]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x396) [0x558d64eca8a6]
 8: (OSD::build_past_intervals_parallel()+0xd9d) [0x558d64c7417d]
 9: (OSD::load_pgs()+0x14fb) [0x558d64c76c7b]
 10: (OSD::init()+0x2217) [0x558d64c94d07]
 11: (main()+0x2f07) [0x558d64ba3f17]
 12: (__libc_start_main()+0xf0) [0x7f66bcb5b830]
 13: (_start()+0x29) [0x558d64c2f6b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
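To check whether the daemon is really flapping rather than staying down (a sketch, assuming stock systemd units; osd.9 is taken from the log_file path above):

# Unit state and recent restarts.
systemctl status ceph-osd@9
# Count repeated assertion failures in the last hour of the journal.
journalctl -u ceph-osd@9 --since "1 hour ago" | grep -c __ceph_assert_fail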
Updated by Greg Farnum about 6 years ago
Brian, that's a separate bug; the code address you've picked up on is just part of the generic failure handling code.
Jan, do you have any other logs of this event? There are several asserts in the PrimaryLogPG::on_local_recover function that could be involved with this.
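If it helps, raising the debug levels on one of the affected OSDs before the next crash should capture the surrounding context (a sketch; adjust the OSD id):

# Bump logging on a running OSD until its next restart.
ceph tell osd.11 injectargs '--debug_osd 20 --debug_ms 1'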