Bug #21006
assert in can_discard_replica_op
Status: Closed
Description
Recently we hit an assert in can_discard_replica_op() while an OSD was handling a replica op reply. The assert fires inside get_down_at(), which checks that the source OSD still exists() in the OSDMap and asserts otherwise.
It seems that in our test environment the source OSD sent an op reply to the primary OSD and then died.
Should we check exists() first and avoid triggering the assert in get_down_at(), or is the source OSD expected to always exists() in this situation?
1: (()+0x9322fd) [0x7f6c7bbe52fd]
2: (()+0xf100) [0x7f6c79a1e100]
3: (gsignal()+0x37) [0x7f6c77fe05f7]
4: (abort()+0x148) [0x7f6c77fe1ce8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f6c7bce2b16]
6: (()+0x30cc20) [0x7f6c7b5bfc20]
7: (bool PG::can_discard_replica_op<MOSDRepOpReply, 113>(std::shared_ptr<OpRequest>&)+0xd5) [0x7f6c7b73a595]
8: (PG::can_discard_request(std::shared_ptr<OpRequest>&)+0x1c5) [0x7f6c7b6f5095]
9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x99) [0x7f6c7b797419]
10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x7f6c7b6493b5]
11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f6c7b6495cd]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7f6c7b64e1e9]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f6c7bcd2907]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f6c7bcd4870]
15: (()+0x7dc5) [0x7f6c79a16dc5]
16: (clone()+0x6d) [0x7f6c780a1ced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum over 6 years ago
The OSD should always exist in the map. Which version of Ceph are you running?
We've seen a few crashes where we were passing "-1" in as an OSD ID, and that doesn't work — what OSD ID was in use here?
Updated by sheng qiu over 6 years ago
Greg Farnum wrote:
The OSD should always exist in the map. Which version of Ceph are you running?
We've seen a few crashes where we were passing "-1" in as an OSD ID, and that doesn't work — what OSD ID was in use here?
Thanks for the quick response.
The OSD id is a positive value, so that should not be the problem here.
I saw that when handling an osdping message, we do check whether the sending OSD exists in the OSDMap, and there was a fix for that here: http://tracker.ceph.com/issues/5223
In my understanding, if the sending OSD dies after sending an op to its peer, it might cause the assert. Can you clarify?
Thanks.
Updated by Greg Farnum over 6 years ago
No, an OSD exists in the map until it's been actually deleted by the administrator. It doesn't have to be up and in to exist; it just needs to be there as a valid OSD ID number.
That ticket was for dealing with an OSD which really didn't exist because it had been deleted entirely (in the course of our testing runs).
Updated by sheng qiu over 6 years ago
Greg Farnum wrote:
No, an OSD exists in the map until it's been actually deleted by the administrator. It doesn't have to be up and in to exist; it just needs to be there as a valid OSD ID number.
That ticket was for dealing with an OSD which really didn't exist because it had been deleted entirely (in the course of our testing runs).
Does that mean that if we delete the OSD after it has sent the op, it may cause the assert? That might be our testing case.
Thanks.
Updated by Greg Farnum over 6 years ago
- Status changed from New to Closed
Yes, but it would have to happen very soon afterwards. I'm surprised you managed to construct a case that hit it!
Still, if that's likely in your setup, I guess we can close this bug. Administrators deleting live OSDs from the map isn't really a scenario worth the CPU-time cost of supporting.