Bug #21006
assert in can_discard_replica_op
Status: Closed
Description
Recently we hit an assert in can_discard_replica_op() while an OSD was handling a replica op reply. The assert fires inside get_down_at(), which checks that the source OSD still exists() in the OSDMap and asserts otherwise.
It seems that in our test environment the source OSD sent an op reply to the primary OSD and then died.
Should we check exists() first and avoid triggering the assert in get_down_at(), or is the source OSD expected to always exists() in this situation?
1: (()+0x9322fd) [0x7f6c7bbe52fd]
2: (()+0xf100) [0x7f6c79a1e100]
3: (gsignal()+0x37) [0x7f6c77fe05f7]
4: (abort()+0x148) [0x7f6c77fe1ce8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f6c7bce2b16]
6: (()+0x30cc20) [0x7f6c7b5bfc20]
7: (bool PG::can_discard_replica_op<MOSDRepOpReply, 113>(std::shared_ptr<OpRequest>&)+0xd5) [0x7f6c7b73a595]
8: (PG::can_discard_request(std::shared_ptr<OpRequest>&)+0x1c5) [0x7f6c7b6f5095]
9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x99) [0x7f6c7b797419]
10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x7f6c7b6493b5]
11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f6c7b6495cd]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7f6c7b64e1e9]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f6c7bcd2907]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f6c7bcd4870]
15: (()+0x7dc5) [0x7f6c79a16dc5]
16: (clone()+0x6d) [0x7f6c780a1ced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum over 6 years ago
The OSD should always exist in the map. Which version of Ceph are you running?
We've seen a few crashes where we were passing "-1" in as an OSD ID, and that doesn't work — what OSD ID was in use here?
Updated by sheng qiu over 6 years ago
Greg Farnum wrote:
The OSD should always exist in the map. Which version of Ceph are you running?
We've seen a few crashes where we were passing "-1" in as an OSD ID, and that doesn't work — what OSD ID was in use here?
Thanks for the quick response.
The OSD id is a positive value, so that should not be the problem here.
I saw that when handling an osdping message, we do check whether the sending OSD exists in the OSDMap, and there was a fix for that here: http://tracker.ceph.com/issues/5223
In my understanding, if the sending OSD dies after sending an op to its peer, it might cause the assert. Can you clarify?
Thanks.
Updated by Greg Farnum over 6 years ago
No, an OSD exists in the map until it's been actually deleted by the administrator. It doesn't have to be up and in to exist; it just needs to be there as a valid OSD ID number.
That ticket was for dealing with an OSD which really didn't exist because it had been deleted entirely (in the course of our testing runs).
Updated by sheng qiu over 6 years ago
Greg Farnum wrote:
No, an OSD exists in the map until it's been actually deleted by the administrator. It doesn't have to be up and in to exist; it just needs to be there as a valid OSD ID number.
That ticket was for dealing with an OSD which really didn't exist because it had been deleted entirely (in the course of our testing runs).
Does that mean that if we delete the OSD after it has sent the op, it may cause the assert? That might be our testing case.
Thanks.
Updated by Greg Farnum over 6 years ago
- Status changed from New to Closed
Yes, but it would have to happen very soon afterwards. I'm surprised you managed to construct a case that hit it!
Still, if that's likely in your setup, I guess we can close this bug. Administrators deleting live OSDs from the map isn't really a scenario worth the CPU-time cost of supporting.