Bug #24652 (closed): OSD crashes when repairing pg
Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After a deep-scrub on the primary OSD for the pg we get:
2018-06-25 14:13:39.261196 7fbc0c821700  0 log_channel(cluster) log [INF] : 0.20 deep-scrub starts
2018-06-25 14:14:11.362752 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 shard 31 missing d5bd3420/rbd_data.15cec2ae8944a.00000000000dfbd4/head//0
2018-06-25 14:14:11.362759 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 shard 35 missing d5bd3420/rbd_data.15cec2ae8944a.00000000000dfbd4/head//0
2018-06-25 14:15:27.416461 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 deep-scrub 1 missing, 0 inconsistent objects
2018-06-25 14:15:27.416478 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 deep-scrub 2 errors
After issuing a pg repair, the pg is seemingly repaired (HEALTH_OK), but the primary OSD crashes right after fixing the pg. The crashed OSD restarts automatically without problems. However, when manually issuing a deep-scrub on the pg, we get back to the initial inconsistent pg.
OSD logs (debug osd = 20):
OSD.35 primary b1a76fc3-ebd3-4061-a6c3-7d75ffc51471
OSD.31 secondary f8134aef-f141-4241-b0f0-a60b2630e446
OSD.44 secondary 53272d75-014b-40af-a064-36c043d8cd57
Updated by Josh Durgin almost 6 years ago
- Status changed from New to Won't Fix
This should be fixed in later versions - hammer is end of life.
The crash was:
 0> 2018-06-25 14:30:24.657285 7fcaa0e81700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fcaa0e81700 time 2018-06-25 14:30:24.652025
osd/ReplicatedPG.cc: 244: FAILED assert(recovering.count(obc->obs.oi.soid))
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b]
 2: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xba2) [0x83aa62]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x91c) [0x9f929c]
 4: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x1d8) [0x9f9898]
 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x2ee) [0x9fcdee]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x826b27]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x6961dd]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x338) [0x696708]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x875) [0xb98555]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb9a670]
 11: (()+0x8182) [0x7fcabdfe0182]
 12: (clone()+0x6d) [0x7fcabc54b47d]