Bug #20854
(small-scoped) recovery_lock being blocked by pg lock holders
Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
-5> 2017-07-29 15:10:15.977505 7f1754bb5700 -1 received signal: Hangup from PID: 12079 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 4 UID: 0
-4> 2017-07-29 15:10:15.977587 7f1773c22700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f175a3c0700' had timed out after 15
-3> 2017-07-29 15:10:15.977595 7f1773c22700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f175a3c0700' had suicide timed out after 150
-2> 2017-07-29 15:10:15.979706 7f175ebc9700 10 monclient: tick
-1> 2017-07-29 15:10:15.979721 7f175ebc9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2017-07-29 15:09:45.979719)
 0> 2017-07-29 15:10:15.982934 7f175a3c0700 -1 *** Caught signal (Aborted) **
 in thread 7f175a3c0700 thread_name:tp_osd_tp

 ceph version 12.1.1-848-g425db1b (425db1b94d6b388d84f9c1996d471264018c9b6a) luminous (rc)
 1: (()+0xa490e9) [0x7f177ab940e9]
 2: (()+0x10330) [0x7f1778653330]
 3: (()+0xef1c) [0x7f1778651f1c]
 4: (()+0xa649) [0x7f177864d649]
 5: (pthread_mutex_lock()+0x70) [0x7f177864d470]
 6: (Mutex::Lock(bool)+0x48) [0x7f177abb1658]
 7: (OSDService::start_recovery_op(PG*, hobject_t const&)+0x2f) [0x7f177a653c2f]
 8: (PG::start_recovery_op(hobject_t const&)+0x5d) [0x7f177a6fe05d]
 9: (PrimaryLogPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x6e1) [0x7f177a7e49a1]
 10: (PrimaryLogPG::recover_replicas(unsigned long, ThreadPool::TPHandle&)+0xec3) [0x7f177a820ea3]
 11: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x82c) [0x7f177a82791c]
 12: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x74e) [0x7f177a6815fe]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xeee) [0x7f177a6a13be]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x7f177abd5f7f]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f177abd7ed0]
 16: (()+0x8184) [0x7f177864b184]
 17: (clone()+0x6d) [0x7f177773b37d]
/a/kchai-2017-07-29_05:31:06-rados-wip-kefu-testing-distro-basic-mira/1460302$ zless remote/*/log/ceph-osd.4.l*
Updated by Greg Farnum over 6 years ago
Naively this looks like something else was blocked while holding the recovery_lock, which is a bit scary since that lock sure looks like it's supposed to have pretty narrow scope. But it looks like OSDService::adjust_pg_priorities() invokes PG::change_recovery_force_mode(), which requires taking the PG lock and then calling publish_stats_to_osd().
I think this got busted by ff9a32d94bed8a03942b2b4e50455b3d7c5c892c.
Updated by Greg Farnum over 6 years ago
That's from https://github.com/ceph/ceph/pull/13723, which was 7 days ago.
Updated by Greg Farnum over 6 years ago
- Subject changed from abort in OSDService::start_recovery_op() to (small-scoped) recovery_lock being blocked by pg lock holders
- Priority changed from Normal to Urgent
Updated by Greg Farnum over 6 years ago
- Is duplicate of Bug #20808: osd deadlock: forced recovery added