Bug #17968

Ceph:OSD can't finish recovery+backfill process due to assertion failure

Added by Xuehan Xu almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Dev Interfaces
Target version: -
Start date: 11/20/2016
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Component(RADOS): OSD, Objecter, librados
Pull request ID:

Description

Under some conditions, an OSD can abort during the recovery process because of the following assertion failure:

2016-11-19 07:00:49.133814 7fc7a77ff700 -1 error_msg osd/ReplicatedPG.cc: In function 'void ReplicatedPG::wait_for_unreadable_object(const hobject_t&, OpRequestRef)' thread 7fc7a77ff700 time 2016-11-19 07:00:48.914231
osd/ReplicatedPG.cc: 387: FAILED assert(needs_recovery)

ceph version 0.94.5-12-g83f56a1 (83f56a1c84e3dbd95a4c394335a7b1dc926dd1c4)
1: (ReplicatedPG::wait_for_unreadable_object(hobject_t const&, std::tr1::shared_ptr<OpRequest>)+0x3f5) [0x8b5a65]
2: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x5e9) [0x8f0c79]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4e3) [0x87fdc3]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x66b3f8]
5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59e) [0x66f8ee]
6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xa76d85]
7: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa7a610]
8: /lib64/libpthread.so.0() [0x393da07a51]
9: (clone()+0x6d) [0x393d6e893d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

History

#1 Updated by Sage Weil almost 2 years ago

  • Status changed from New to Need More Info

This is due to the BALANCE_READS option, right?

#2 Updated by Greg Farnum over 1 year ago

  • Status changed from Need More Info to Can't reproduce

#3 Updated by Greg Farnum over 1 year ago

  • Status changed from Can't reproduce to Need More Info

#4 Updated by Xuehan Xu over 1 year ago

Hi, everyone.

Sorry, I forgot to watch my issues.

We found that the problem is caused by "librados::OPERATION_BALANCE_READS". If a read op with this flag reaches a non-primary OSD and the target object has not yet been recovered on that OSD, this assertion failure occurs.
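The triggering condition described above can be written as a small predicate. This is only a toy model, not Ceph code: `ToyOsd`, `ReadOp`, `FLAG_BALANCE_READS`, and `hits_needs_recovery_assert` are hypothetical names made up for illustration, and the flag value is a stand-in rather than the real value of `librados::OPERATION_BALANCE_READS`.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for librados::OPERATION_BALANCE_READS.
// The real constant's value is not reproduced here.
constexpr unsigned FLAG_BALANCE_READS = 1u << 0;

// Hypothetical, simplified read request: target object name plus op flags.
struct ReadOp {
  std::string object;
  unsigned flags;
};

// Hypothetical, simplified view of one OSD's state for a placement group.
struct ToyOsd {
  bool is_primary;                          // is this OSD the primary?
  std::set<std::string> not_yet_recovered;  // objects still missing locally
};

// Returns true when all three conditions from the report hold together:
//   1. the read carries the balance-reads flag,
//   2. it is handled by a non-primary OSD,
//   3. the target object has not yet been recovered on that OSD.
bool hits_needs_recovery_assert(const ToyOsd& osd, const ReadOp& op) {
  const bool balanced = (op.flags & FLAG_BALANCE_READS) != 0;
  const bool missing_here = osd.not_yet_recovered.count(op.object) > 0;
  return balanced && !osd.is_primary && missing_here;
}
```

In this model, a flagged read for an unrecovered object on a replica returns true, while the same read on the primary, a read without the flag, or a read for an already recovered object returns false.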

#5 Updated by Xuehan Xu over 1 year ago

I have a document that details our analysis of this problem, but it's written in Chinese. If needed, I'll translate and upload it.

#7 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Dev Interfaces
  • Status changed from Need More Info to Testing
  • Component(RADOS) OSD, Objecter, librados added

#8 Updated by Greg Farnum over 1 year ago

  • Assignee set to Xuehan Xu

#9 Updated by Kefu Chai over 1 year ago

  • Status changed from Testing to Resolved
