Project

General

Profile

Bug #38579

osd: should not mark cluster_messenger when commited new osdmap

Added by bing lin 8 months ago. Updated 2 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
03/05/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

when we run some fault test in Luminous 12.2.10, got coredump like

/root/rpmbuild/BUILD/ceph-12.2.10-448-g236c32a/src/os/bluestore/BlueStore.cc: 9726: FAILED assert(0 == "unexpected er
ror")

 ceph version 12.2.10-448-g236c32a (236c32aee16791ef308baae875f21776f3303e4a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55c97c283470]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1487) [0x55c97c119367]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<Obje
ctStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a0) [0x55c97c11a3b0]
 4: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction>
>&, boost::intrusive_ptr<OpRequest>)+0x65) [0x55c97be7d245]
 5: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb0b) [0x55c97bf8ecfb]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2b4) [0x55c97bf91a44]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55c97beb2cf0]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x55c97be1d5dc]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x55c9
7bc9aed9]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55c97bf282e7]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x55c97bcc9e0e]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x55c97c288f89]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55c97c28af20]

replica recive an error Op which has not been written.
1. when Primary osd got SIGHUB and prepare to stop, sending MOSDMarkMeDown to monitor
2. monitor inc new epoch osdmap like 20
3. Primary osd got new obj write, like op1、op2,and op1 will generate touch transaction.
4. before Primary osd send op1 to Replica osd, Replica osd handle new osdmap 20, and note peer Primary osd cluster_messenger's AsyncConnection down. and peer Primary osd will set AsyncConnection to Standby state.
5. so when Primary osd want sending op1 to Replica osd, she will found a Standby AsyncConnection, so op1 should not be send.
6. when another logic call get_connection with peer Replica, the Conneciton with peer Replica will rebuild.
7. so when before op2 dequeue, the Conneciton is rebuild, op2 will be sending to Replica osd.
8. During this period, Primary osd haven't receive new osdmap 20.
9. Replica osd have got osdmap 20 , but pg have not peering, so pg do not know osdmap 20, and when she process op2. op2 does not have touch transaction, and will cause op setatter error.

i think the problem is we should not mark cluster_message's AsyncConnection down, because when osd shutdown, AsyncConnection is still in use, why we need to intervation?

History

#1 Updated by Greg Farnum 8 months ago

Do you have a reproducer for this?

I get nervous when people want to remove mark_down calls, as they are generally quite important to maintain correctness. In this case, the replica should not be getting any incoming messages from the primary, but it also shouldn't be allowing a new incoming Connection — that's probably the part of the process that is breaking. (And maybe something should be sharing the missing osdmap? I think we try and do that as well.)

#2 Updated by bing lin 8 months ago

Greg Farnum wrote:

Do you have a reproducer for this?

I get nervous when people want to remove mark_down calls, as they are generally quite important to maintain correctness. In this case, the replica should not be getting any incoming messages from the primary, but it also shouldn't be allowing a new incoming Connection — that's probably the part of the process that is breaking. (And maybe something should be sharing the missing osdmap? I think we try and do that as well.)

yes,i have reproducer this in my test env many times.
what i think like you before, Primary should not sending to Replica during this period. but
1. Primary does not konw if Replica have know he is down, connection shutdown not mean that Replica konw Primary is shutdown, and Primary need drain remained op.
2. Replica pg has not peering, even if Replica OSD has committed new osdmap, but pg do not kown.
3. i agree with you that maybe something should be sharing the missing osdmap, but i think we don't need to mark_down an inuse connection also.
4. maybe i missunderstand mark_down effect, could you piont out, thankyou.

#3 Updated by Sage Weil 8 months ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to Urgent

So in step 5, the primary hasn't seen osdmap 20, right? Only the replica has? The part I don't understand is that in step 6, the connection will still be in standby state on the primary, and when the connection get's rebuilt, it should fail to connect/deliver to the replica. That is the intention, at least.. I think that is where the problem is.

The problem with not marking down the peer connection is that we will never clean up the state for these connections and we'll leak memory over the long term. Maybe this could be paired with a mark_down() call when the osd comes back up so that then we replace the old connection, at a minimum?

#4 Updated by bing lin 8 months ago

Sage Weil wrote:

So in step 5, the primary hasn't seen osdmap 20, right? Only the replica has? The part I don't understand is that in step 6, the connection will still be in standby state on the primary, and when the connection get's rebuilt, it should fail to connect/deliver to the replica. That is the intention, at least.. I think that is where the problem is.

The problem with not marking down the peer connection is that we will never clean up the state for these connections and we'll leak memory over the long term. Maybe this could be paired with a mark_down() call when the osd comes back up so that then we replace the old connection, at a minimum?

1. yes, Primary hasn't seen osdmap 20. Replica has seen osdmap 20.
2. why should standby connection rebuild failed? even if Replica has seen osdmap 20, but pg hasn't seen, he has not peering. means that pg has not known Primary is going down.
3. yep, i guessed that why we need to mark down when committed_osdmap found down osd is that there is no other way to reclaim socket。so why not add epoll event EPOLLRDHUP to capture peer socket close event, and then reclaim socket.

#5 Updated by Sage Weil 7 months ago

Ok, I think I understand. This would noramlly trigger a RESETSESSION in the v1 protocol because the primary's connect_seq > 0 and the replica has no record of the session (it did mark_down). In v2 the reconnects are more explicit and generally safer. However, I'm not sure we have the correct behavior there yet... what we really want is for the replica to never accept that (re)connection, but I suspect we might be retrying.

In any case, can you add a reproducer for this scenario to ceph_test_msgr? see src/test/msgr/test_msgr.cc. That will make it easy for us to verify we are resolving the problem... especially in the v2 protocol. I have a feeling v1 will be harder to fix, but I think it can be done too.

Thanks!

#6 Updated by Greg Farnum 2 months ago

  • Priority changed from Urgent to Normal

Also available in: Atom PDF