Support #36326
Huge traffic spike and assert(is_primary())
Description
Hello,
We are running Ceph 12.2.8 now; the cluster was upgraded from Jewel.
We hit an OSD assert after a weird network incident.
A huge amount of traffic was generated between the MONs and OSDs on the client network, and between OSDs on the cluster network. All hosts started processing that traffic and consuming a lot of CPU. In one pool, OSD CPU utilization went up to 100%. There are a lot of "heartbeat_check: no reply" messages in the OSD logs, with timestamps matching the minute the traffic spike happened. After that, many OSDs in that pool went down, across different hosts.
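We located and counted those messages with roughly the following (the /var/log/ceph path is the default on our hosts and is just an assumption here; adjust as needed):

    # count "heartbeat_check: no reply" entries per minute across all OSD logs
    grep 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log \
        | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c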
After this incident, one of the OSDs went down with the following in its log:
0> 2018-10-03 16:18:16.620351 7f4c33e29700 -1 /build/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f4c33e29700 time 2018-10-03 16:18:16.214698
/build/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 376: FAILED assert(is_primary())
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x5651433e1d9e]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x7de) [0x565142fc94be]
 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2d3) [0x565143134bb3]
 4: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x190) [0x565143134e30]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a1) [0x565143142351]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x78) [0x56514306de88]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x56c) [0x565142fdbd3c]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e6) [0x565142e67006]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x47) [0x5651430dc947]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfbf) [0x565142e9556f]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x5651433e755f]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5651433e94b0]
 13: (()+0x8184) [0x7f4c59f3a184]
 14: (clone()+0x6d) [0x7f4c59029ffd]
When I tried to start this OSD, it went down again. We had one host in that pool with primary-affinity set to 0 for all of its OSDs. I set primary-affinity on that host back to 1, and after that it was possible to start the failed OSD.
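For reference, the change was roughly this; the OSD IDs below are placeholders for the OSDs on that host and for the crashing OSD:

    # restore primary-affinity for every OSD on the affected host (IDs are placeholders)
    for id in 4 5 6 7; do
        ceph osd primary-affinity osd.$id 1.0
    done
    # then start the previously crashing OSD (placeholder ID)
    systemctl start ceph-osd@12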
I assume it was possible to start this OSD because, after the affinity change, it was no longer primary for the buggy PG.
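One way to check this is to map the PG and look at the acting set; the PG ID below is just a placeholder, since I have not yet pinned down the exact PG from the log:

    # the first OSD in the "acting" set is the primary for the PG
    ceph pg map 1.2f
    # or list UP_PRIMARY / ACTING_PRIMARY for all PGs at once
    ceph pg dump pgs_brief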
While we are trying to figure out the root cause of the traffic spike: what could cause that OSD assertion to fail? And could the problem behind the failed assert itself be the root cause of the traffic spike?
Will this assert fire again if we set primary-affinity back to its previous state?
In the attached logs:
16:09 - traffic problem starts.
16:18:16 - assert failed.
16:43:40 - first start attempt after the failure.
16:44:43 - assert failed again.
17:39:01 - start after primary-affinity was changed 0->1 on another host.