Bug #24652 (closed): OSD crashes when repairing pg

Added by Ana Aviles almost 6 years ago. Updated almost 6 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a deep-scrub on the primary OSD for the pg we get:

2018-06-25 14:13:39.261196 7fbc0c821700  0 log_channel(cluster) log [INF] : 0.20 deep-scrub starts
2018-06-25 14:14:11.362752 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 shard 31 missing d5bd3420/rbd_data.15cec2ae8944a.00000000000dfbd4/head//0
2018-06-25 14:14:11.362759 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 shard 35 missing d5bd3420/rbd_data.15cec2ae8944a.00000000000dfbd4/head//0
2018-06-25 14:15:27.416461 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 deep-scrub 1 missing, 0 inconsistent objects
2018-06-25 14:15:27.416478 7fbc0c821700 -1 log_channel(cluster) log [ERR] : 0.20 deep-scrub 2 errors

After issuing a pg repair, the pg is seemingly repaired (HEALTH_OK), but the primary OSD crashes right after fixing the pg. The crashed OSD restarts automatically without problems. However, when we manually issue a deep-scrub on the pg, it comes back to the initial inconsistent state.

OSD logs (debug osd = 20):
OSD.35 primary b1a76fc3-ebd3-4061-a6c3-7d75ffc51471
OSD.31 secondary f8134aef-f141-4241-b0f0-a60b2630e446
OSD.44 secondary 53272d75-014b-40af-a064-36c043d8cd57
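
A minimal sketch of the scrub/repair cycle described above, assuming pg 0.20 and the primary osd.35 taken from the logs (the exact invocations on this cluster may have differed):

  ceph pg deep-scrub 0.20   # deep-scrub flags the pg inconsistent (1 missing object, 2 errors)
  ceph health detail        # reports the inconsistent pg
  ceph pg repair 0.20       # repair appears to succeed (HEALTH_OK), but the primary osd.35 asserts and restarts
  ceph pg deep-scrub 0.20   # a later deep-scrub reports the same inconsistency again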

#1 Updated by Greg Farnum almost 6 years ago

  • Project changed from Ceph to RADOS
#2 Updated by Josh Durgin almost 6 years ago

  • Status changed from New to Won't Fix

This should be fixed in later versions; Hammer is end of life.

The crash was:

     0> 2018-06-25 14:30:24.657285 7fcaa0e81700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7fcaa0e81700 time 2018-06-25 14:30:24.652025
osd/ReplicatedPG.cc: 244: FAILED assert(recovering.count(obc->obs.oi.soid))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b]
 2: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0xba2) [0x83aa62]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x91c) [0x9f929c]
 4: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x1d8) [0x9f9898]
 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x2ee) [0x9fcdee]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x826b27]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x6961dd]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x338) [0x696708]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x875) [0xb98555]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb9a670]
 11: (()+0x8182) [0x7fcabdfe0182]
 12: (clone()+0x6d) [0x7fcabc54b47d]
