Support #36326

open

Huge traffic spike and assert(is_primary())

Added by Aleksei Zakharov over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:

Description

Hello,

We are running ceph version 12.2.8 now; the cluster was upgraded from jewel.

We faced an OSD assert after a weird network incident.

A huge amount of traffic was generated between the mons and OSDs on the client network, and between OSDs on the cluster network. All hosts started working on that traffic, consuming a lot of CPU. In one pool the OSDs reached up to 100% CPU utilization. There are a lot of "heartbeat_check: no reply" messages in the OSD logs, with timestamps matching the minute when the traffic spike happened. After that, many OSDs on different hosts in that pool went down.

After this incident one of the OSDs went down with this in its logs:

     0> 2018-10-03 16:18:16.620351 7f4c33e29700 -1 /build/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f4c33e29700 time 2018-10-03 16:18:16.214698
/build/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 376: FAILED assert(is_primary())

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x5651433e1d9e]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x7de) [0x565142fc94be]
 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2d3) [0x565143134bb3]
 4: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x190) [0x565143134e30]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a1) [0x565143142351]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x78) [0x56514306de88]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x56c) [0x565142fdbd3c]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e6) [0x565142e67006]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x47) [0x5651430dc947]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfbf) [0x565142e9556f]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x5651433e755f]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5651433e94b0]
 13: (()+0x8184) [0x7f4c59f3a184]
 14: (clone()+0x6d) [0x7f4c59029ffd]

When I tried to start this OSD, it went down again. We had one host in that pool with primary-affinity set to 0 for all of its OSDs. I set the primary-affinity on that host to 1, and after that it was possible to start the failing OSD.
I assume it became possible to start this OSD because it was no longer primary for the buggy PG.
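
For reference, this is roughly how the primary-affinity change described above was applied with the standard ceph CLI; the OSD id below is a placeholder, not the actual id from our cluster, and the same command was repeated for every OSD on that host:

     # placeholder OSD id; run once per OSD on the host in question
     ceph osd primary-affinity osd.12 1.0    # allow this OSD to be selected as primary again
     # reverting to the previous state would be:
     ceph osd primary-affinity osd.12 0.0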

While we are trying to figure out the root cause of the traffic spike, what could cause that OSD assertion failure? Could the problem behind the assert failure itself be the root cause of the traffic spike?
Will this assert fail again if we set the primary-affinity back to its previous state?
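
In case it matters, this is a sketch of how we can check which OSD is currently acting as primary for the affected PG before reverting the affinity (the PG id here is a placeholder, since we have not yet identified the exact PG from the log):

     # show the up and acting sets for the PG; the first OSD in the acting set is the primary
     ceph pg map 1.2a
     # more detail, including the recovery state, if needed
     ceph pg 1.2a query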

In the attached logs:
16:09 - the traffic problem starts.
16:18:16 - assert failed.
16:43:40 - first restart after the failure.
16:44:43 - assert failed again.
17:39:01 - started after primary-affinity was changed 0->1 on another host.


Files

ceph-osd.bad.day.88.log.gz (383 KB) - Aleksei Zakharov, 10/05/2018 09:40 AM