Support #36326

Huge traffic spike and assert(is_primary())

Added by Aleksei Zakharov over 5 years ago. Updated over 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
OSD
Pull request ID:

Description

Hello,

We are currently using ceph version 12.2.8; it was upgraded from jewel.

We hit an OSD assert after a weird network incident.

A huge amount of traffic was generated between the mons and OSDs on the client network, and between OSDs on the cluster network. All hosts started processing that traffic and consuming a lot of CPU. In one pool the OSDs' CPU utilization went up to 100%. The OSD logs contain a lot of "heartbeat_check: no reply" messages with timestamps matching the minute when the traffic spike happened. After that, a lot of OSDs on different hosts in that pool went down.
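
To get a feel for how widespread the heartbeat failures were, the messages can be counted per OSD log with something like this (a rough sketch; the paths assume the default /var/log/ceph layout):

    # Count "heartbeat_check: no reply" messages per OSD log on a host.
    grep -c 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log

    # Narrow it down to the minute the traffic spike started (16:09 here).
    grep 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log | grep ' 16:09'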

After this incident, one of the OSDs went down with the following in its log:

     0> 2018-10-03 16:18:16.620351 7f4c33e29700 -1 /build/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f4c33e29700 time 2018-10-03 16:18:16.214698
/build/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 376: FAILED assert(is_primary())

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x5651433e1d9e]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x7de) [0x565142fc94be]
 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2d3) [0x565143134bb3]
 4: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x190) [0x565143134e30]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a1) [0x565143142351]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x78) [0x56514306de88]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x56c) [0x565142fdbd3c]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e6) [0x565142e67006]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x47) [0x5651430dc947]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfbf) [0x565142e9556f]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x5651433e755f]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5651433e94b0]
 13: (()+0x8184) [0x7f4c59f3a184]
 14: (clone()+0x6d) [0x7f4c59029ffd]

When I tried to start this OSD, it went down again. We had one host in that pool with primary-affinity set to 0 for all of its OSDs. I set primary-affinity on that host to 1, and after that it was possible to start the failing OSD.
I assume that it became possible to start this OSD because it was no longer primary for the buggy PG.
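
For reference, the primary-affinity change on that host was along these lines (the OSD ids below are placeholders for that host's OSDs):

    # Raise primary-affinity from 0 back to 1 for each OSD on that host
    # (osd.10 / osd.11 are placeholder ids).
    ceph osd primary-affinity osd.10 1.0
    ceph osd primary-affinity osd.11 1.0

    # Verify the change in the PRIMARY-AFFINITY column.
    ceph osd tree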

While we're trying to figure out the root cause of the traffic spike, what could cause that OSD assertion failure? Could the problem that caused the assert failure also be the root cause of the traffic spike?
Will this assert fail again if we set primary-affinity back to its previous state?

In the attached log:
16:09 - the traffic problem starts
16:18:16 - the assert fails
16:43:40 - first start after the failure
16:44:43 - the assert fails again
17:39:01 - start after primary-affinity was changed 0->1 on another host


Files

ceph-osd.bad.day.88.log.gz (383 KB) - Aleksei Zakharov, 10/05/2018 09:40 AM
#1

Updated by Greg Farnum over 5 years ago

  • Tracker changed from Bug to Support

Given what you've shown here, it's unlikely that the network issue was caused by this; more likely the other way around. I speculate the issue is that with the primary-affinity settings you managed to force the cluster role assignments into an otherwise-unanticipated or disallowed state, but it's not immediately clear why.

#2

Updated by Aleksei Zakharov over 5 years ago

Thanks for the answer! It looks like the traffic spike was caused by another issue: the ceph-mon DB grows to 15GB and shrinks only after all monitors are restarted one by one. So when there are a lot of PG epoch changes, it leads to huge traffic spikes from the monitors to the OSDs. Is that possible, or am I missing something?
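
For context, this is roughly how the mon store growth and the osdmap churn can be observed (the data path and mon id are placeholders assuming the default layout; there is also a standard compact command that may help without a full restart):

    # Size of the monitor store on a mon host (default data dir layout;
    # "ceph-mon1" stands in for ceph-<mon-id>).
    du -sh /var/lib/ceph/mon/ceph-mon1/store.db

    # Current osdmap epoch; an epoch that climbs quickly means the mons are
    # pushing a lot of map updates to the OSDs.
    ceph osd dump | grep '^epoch'

    # Ask a monitor to compact its store without restarting it.
    ceph tell mon.mon1 compact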

Since we can't know the root cause of the is_primary() assert, another question: is it safe to change primary-affinity settings while the cluster is in an unhealthy state, for example during recovery or peering operations?
