Bug #48525

closed

[rbd-mirror] UnlinkPeerRequest state machine might loop

Added by Jason Dillaman over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A MGR process died with over 40,000 frames of backtrace in the following loop:

#0  0x00007f72363a739e in md_config_t::_get_val (this=0x55e500973450, values=..., o=..., stack=0x0, err=0x0) at /home/jdillaman/ceph_wip/src/common/config.cc:1082
#1  0x00007f72363a7b97 in md_config_t::_get_val (this=0x55e500973450, values=..., key=..., stack=0x0, err=0x0) at /home/jdillaman/ceph_wip/src/common/config.cc:1074
#2  0x00007f72363a7e63 in md_config_t::get_val_generic[abi:cxx11](ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >) const (this=<optimized out>, values=..., key=...)
    at /home/jdillaman/ceph_wip/src/common/config.cc:1052
#3  0x00007f721b100a11 in md_config_t::get_val<unsigned long> (key=..., values=..., this=0x55e500973450) at /home/jdillaman/ceph_wip/src/common/config.h:353
#4  ceph::common::ConfigProxy::get_val<unsigned long> (key="rbd_mirroring_max_mirroring_snapshots", this=0x55e500970008) at /home/jdillaman/ceph_wip/src/common/config_proxy.h:143
#5  librbd::mirror::snapshot::CreatePrimaryRequest<librbd::ImageCtx>::unlink_peer (this=0x55e5006764e0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/CreatePrimaryRequest.cc:185
#6  0x00007f721b11006b in Context::complete (r=0, this=0x55e50268b190) at /home/jdillaman/ceph_wip/src/include/Context.h:99
#7  librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::finish (this=this@entry=0x55e50167f7c0, r=r@entry=0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/UnlinkPeerRequest.cc:226
#8  0x00007f721b1117ce in librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::remove_snapshot (this=0x55e50167f7c0) at /home/jdillaman/ceph_wip/src/log/SubsystemMap.h:72
#9  0x00007f721b100c53 in librbd::mirror::snapshot::CreatePrimaryRequest<librbd::ImageCtx>::unlink_peer (this=0x55e5006764e0) at /usr/include/c++/10/bits/basic_string.h:907
#10 0x00007f721b11006b in Context::complete (r=0, this=0x55e50268b180) at /home/jdillaman/ceph_wip/src/include/Context.h:99
#11 librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::finish (this=this@entry=0x55e50167f7c0, r=r@entry=0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/UnlinkPeerRequest.cc:226

The loop occurred because the request attempted to remove a snapshot that was no longer linked to any peers, but the snapshot failed the test in "remove_snapshot":

$ rbd --cluster cluster1 --pool mirror snap ls image0001 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  4336  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151  1 MiB             Wed Dec  9 19:05:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
  7836  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea  1 MiB             Wed Dec  9 19:10:00 2020  mirror (primary peer_uuids:[])                                    
 11348  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.507b7ec7-472e-4ad9-ad9a-225db0af7e67  1 MiB             Wed Dec  9 19:15:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 14113  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.1a4350ca-1a95-4144-beb7-34f1d52f5a4f  1 MiB             Wed Dec  9 19:21:29 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
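
To spot such snapshots in a listing, the peer list can be pulled out of the NAMESPACE column. The following is a minimal sketch, not part of librbd or the rbd CLI: `parse_peer_uuids` is a hypothetical helper, and the assumption that multiple UUIDs are separated by commas and/or spaces is not confirmed by the output above, which only shows lists with zero or one entry.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: extract the peer UUIDs from a NAMESPACE column value
// such as "mirror (primary peer_uuids:[<uuid>])".
// Returns an empty vector when the list is empty or the marker is missing.
// Assumption: multiple UUIDs are separated by commas and/or spaces.
std::vector<std::string> parse_peer_uuids(const std::string& ns) {
  static const std::string marker = "peer_uuids:[";
  std::vector<std::string> peers;

  const std::size_t start = ns.find(marker);
  if (start == std::string::npos) return peers;

  const std::size_t begin = start + marker.size();
  const std::size_t end = ns.find(']', begin);
  if (end == std::string::npos) return peers;

  // Split the text between the brackets on separators.
  std::string current;
  for (std::size_t i = begin; i < end; ++i) {
    const char c = ns[i];
    if (c == ',' || c == ' ') {
      if (!current.empty()) {
        peers.push_back(current);
        current.clear();
      }
    } else {
      current += c;
    }
  }
  if (!current.empty()) peers.push_back(current);
  return peers;
}
```

A mirror snapshot whose parsed list is empty, like snapshot 7836 above, is one that is no longer linked to any peer.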

Related issues 1 (0 open, 1 closed)

Copied to rbd - Backport #48561: octopus: [rbd-mirror] UnlinkPeerRequest state machine might loop (Resolved, Jason Dillaman)
Actions #1

Updated by Jason Dillaman over 3 years ago

Found another image that was missing its peers in the mirror snapshot:

$ rbd --cluster cluster1 --pool mirror snap ls image0002 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  7837  .mirror.primary.dd9747a6-7628-4e2b-a39a-8757735ac1e0.3bb60c7d-442a-41cc-ae88-a68bc3ad3d6f  1 MiB             Wed Dec  9 19:10:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 11350  .mirror.primary.dd9747a6-7628-4e2b-a39a-8757735ac1e0.e21dfc60-7bfe-4ae0-9a3d-062f829d06a7  1 MiB             Wed Dec  9 19:15:00 2020  mirror (primary peer_uuids:[])                                    

The source of this bug needs to be fixed, but unlink should remove the snapshot if it's not the first mirror snapshot. It should also gracefully handle the case where the notify refresh fails due to a timeout.
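
A literal reading of the proposed rule can be sketched as follows. This is only an illustration under stated assumptions, not the change made in the pull request: `MirrorSnap`, `should_remove`, and `prune_unlinked` are hypothetical names, the list is assumed to contain only mirror snapshots ordered oldest first, and the notify refresh timeout handling is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified record; not the librbd snapshot type.
struct MirrorSnap {
  uint64_t id;
  std::vector<std::string> peer_uuids;
};

// Literal reading of the proposal: a snapshot with no linked peers may be
// removed as long as it is not the first mirror snapshot in the list.
// Assumption: snaps holds only mirror snapshots, ordered oldest first.
bool should_remove(const std::vector<MirrorSnap>& snaps, std::size_t index) {
  assert(index < snaps.size());
  return index != 0 && snaps[index].peer_uuids.empty();
}

// Apply the rule to every snapshot and return the ids that would remain.
std::vector<uint64_t> prune_unlinked(const std::vector<MirrorSnap>& snaps) {
  std::vector<uint64_t> kept;
  for (std::size_t i = 0; i < snaps.size(); ++i) {
    if (!should_remove(snaps, i)) {
      kept.push_back(snaps[i].id);
    }
  }
  return kept;
}
```

Applied to the image0002 listing, this sketch would drop snapshot 11350 and keep 7837. Because a peerless snapshot is actually removed rather than skipped, a later pass would not select it again.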

Actions #2

Updated by Jason Dillaman over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
Actions #3

Updated by Jason Dillaman over 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 38517
Actions #4

Updated by Mykola Golub over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Backport Bot over 3 years ago

  • Copied to Backport #48561: octopus: [rbd-mirror] UnlinkPeerRequest state machine might loop added
Actions #6

Updated by Loïc Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
