Bug #20854

closed

(small-scoped) recovery_lock being blocked by pg lock holders

Added by Kefu Chai over 6 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -5> 2017-07-29 15:10:15.977505 7f1754bb5700 -1 received  signal: Hangup from  PID: 12079 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 4
 UID: 0
    -4> 2017-07-29 15:10:15.977587 7f1773c22700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f175a3c0700' had timed out after 15
    -3> 2017-07-29 15:10:15.977595 7f1773c22700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f175a3c0700' had suicide timed out after 150
    -2> 2017-07-29 15:10:15.979706 7f175ebc9700 10 monclient: tick
    -1> 2017-07-29 15:10:15.979721 7f175ebc9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2017-07-29 15:09:45.979719)
     0> 2017-07-29 15:10:15.982934 7f175a3c0700 -1 *** Caught signal (Aborted) **
 in thread 7f175a3c0700 thread_name:tp_osd_tp

 ceph version 12.1.1-848-g425db1b (425db1b94d6b388d84f9c1996d471264018c9b6a) luminous (rc)
 1: (()+0xa490e9) [0x7f177ab940e9]
 2: (()+0x10330) [0x7f1778653330]
 3: (()+0xef1c) [0x7f1778651f1c]
 4: (()+0xa649) [0x7f177864d649]
 5: (pthread_mutex_lock()+0x70) [0x7f177864d470]
 6: (Mutex::Lock(bool)+0x48) [0x7f177abb1658]
 7: (OSDService::start_recovery_op(PG*, hobject_t const&)+0x2f) [0x7f177a653c2f]
 8: (PG::start_recovery_op(hobject_t const&)+0x5d) [0x7f177a6fe05d]
 9: (PrimaryLogPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x6e1) [0x7f177a7e49a1]
 10: (PrimaryLogPG::recover_replicas(unsigned long, ThreadPool::TPHandle&)+0xec3) [0x7f177a820ea3]
 11: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x82c) [0x7f177a82791c]
 12: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x74e) [0x7f177a6815fe]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xeee) [0x7f177a6a13be]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x7f177abd5f7f]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f177abd7ed0]
 16: (()+0x8184) [0x7f177864b184]
 17: (clone()+0x6d) [0x7f177773b37d]

/a/kchai-2017-07-29_05:31:06-rados-wip-kefu-testing-distro-basic-mira/1460302$ zless remote/*/log/ceph-osd.4.l*
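For context, the -4/-3 lines above are the OSD's internal heartbeat check: the op thread stopped reporting progress (per the backtrace it is stuck in Mutex::Lock inside OSDService::start_recovery_op), so once the suicide grace expires the process aborts itself. Below is a minimal sketch of that watchdog pattern; the names (last_heartbeat, watchdog, grace, suicide_grace) are hypothetical and this is not Ceph's actual heartbeat_map code.

    // Minimal sketch of a heartbeat/suicide-timeout watchdog (hypothetical
    // names, not Ceph's heartbeat_map): the worker stamps a timestamp before
    // each work item; if it stops stamping (e.g. blocked on a mutex), the
    // watchdog first logs a timeout and eventually aborts the whole process.
    #include <atomic>
    #include <chrono>
    #include <csignal>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    using Clock = std::chrono::steady_clock;
    std::atomic<Clock::time_point> last_heartbeat{Clock::now()};

    void watchdog(std::chrono::seconds grace, std::chrono::seconds suicide_grace) {
        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            auto idle = Clock::now() - last_heartbeat.load();
            if (idle > suicide_grace) {
                std::fprintf(stderr, "worker had suicide timed out, aborting\n");
                std::raise(SIGABRT);                     // the "Caught signal (Aborted)" above
            } else if (idle > grace) {
                std::fprintf(stderr, "worker had timed out\n");  // the "-4>" style warning
            }
        }
    }

    void worker(std::mutex& stuck_lock) {
        for (;;) {
            last_heartbeat.store(Clock::now());          // "I'm still making progress"
            std::lock_guard<std::mutex> g(stuck_lock);   // if this never returns,
                                                         // the watchdog fires
        }
    }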


Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #20808: osd deadlock: forced recovery (Resolved, Greg Farnum, 07/27/2017)

Actions #1

Updated by Kefu Chai over 6 years ago

  • Category set to Correctness/Safety
Actions #2

Updated by Greg Farnum over 6 years ago

Naively this looks like something else was blocked while holding the recovery_lock, which is a bit scary since that lock sure looks like it's supposed to have a pretty narrow scope. But it looks like OSDService::adjust_pg_priorities invokes PG::change_recovery_force_mode(), and that requires locking the PG and then calling publish_stats_to_osd().

I think this got busted by ff9a32d94bed8a03942b2b4e50455b3d7c5c892c.
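In other words, the two paths take the same two locks in opposite order. A minimal sketch of that inversion, with hypothetical stand-in mutexes (the real ones are OSDService::recovery_lock and the per-PG lock):

    #include <mutex>

    std::mutex recovery_lock;   // stand-in for OSDService::recovery_lock
    std::mutex pg_lock;         // stand-in for the per-PG lock

    // Recovery path (roughly OSD::do_recovery -> PrimaryLogPG ->
    // OSDService::start_recovery_op): the PG lock is already held when
    // recovery_lock is taken.
    void recovery_path() {
        std::lock_guard<std::mutex> pg(pg_lock);
        std::lock_guard<std::mutex> rec(recovery_lock);   // waits here if the other
                                                          // path holds recovery_lock
    }

    // Priority-adjustment path (roughly OSDService::adjust_pg_priorities ->
    // PG::change_recovery_force_mode -> publish_stats_to_osd): recovery_lock
    // is held, then the PG lock is taken -- the opposite order, hence the
    // potential ABBA deadlock that trips the suicide timeout above.
    void adjust_priorities_path() {
        std::lock_guard<std::mutex> rec(recovery_lock);
        std::lock_guard<std::mutex> pg(pg_lock);
    }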

Actions #3

Updated by Greg Farnum over 6 years ago

That's from https://github.com/ceph/ceph/pull/13723, which was 7 days ago.

Actions #4

Updated by Greg Farnum over 6 years ago

  • Subject changed from abort in OSDService::start_recovery_op() to (small-scoped) recovery_lock being blocked by pg lock holders
  • Priority changed from Normal to Urgent
Actions #5

Updated by Greg Farnum over 6 years ago

  • Status changed from New to Duplicate
Actions #6

Updated by Greg Farnum over 6 years ago

  • Is duplicate of Bug #20808: osd deadlock: forced recovery added
