Bug #48525

closed

[rbd-mirror] UnlinkPeerRequest state machine might loop

Added by Jason Dillaman over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A MGR process died with a backtrace of over 40,000 frames, repeating the following loop:

#0  0x00007f72363a739e in md_config_t::_get_val (this=0x55e500973450, values=..., o=..., stack=0x0, err=0x0) at /home/jdillaman/ceph_wip/src/common/config.cc:1082
#1  0x00007f72363a7b97 in md_config_t::_get_val (this=0x55e500973450, values=..., key=..., stack=0x0, err=0x0) at /home/jdillaman/ceph_wip/src/common/config.cc:1074
#2  0x00007f72363a7e63 in md_config_t::get_val_generic[abi:cxx11](ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >) const (this=<optimized out>, values=..., key=...)
    at /home/jdillaman/ceph_wip/src/common/config.cc:1052
#3  0x00007f721b100a11 in md_config_t::get_val<unsigned long> (key=..., values=..., this=0x55e500973450) at /home/jdillaman/ceph_wip/src/common/config.h:353
#4  ceph::common::ConfigProxy::get_val<unsigned long> (key="rbd_mirroring_max_mirroring_snapshots", this=0x55e500970008) at /home/jdillaman/ceph_wip/src/common/config_proxy.h:143
#5  librbd::mirror::snapshot::CreatePrimaryRequest<librbd::ImageCtx>::unlink_peer (this=0x55e5006764e0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/CreatePrimaryRequest.cc:185
#6  0x00007f721b11006b in Context::complete (r=0, this=0x55e50268b190) at /home/jdillaman/ceph_wip/src/include/Context.h:99
#7  librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::finish (this=this@entry=0x55e50167f7c0, r=r@entry=0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/UnlinkPeerRequest.cc:226
#8  0x00007f721b1117ce in librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::remove_snapshot (this=0x55e50167f7c0) at /home/jdillaman/ceph_wip/src/log/SubsystemMap.h:72
#9  0x00007f721b100c53 in librbd::mirror::snapshot::CreatePrimaryRequest<librbd::ImageCtx>::unlink_peer (this=0x55e5006764e0) at /usr/include/c++/10/bits/basic_string.h:907
#10 0x00007f721b11006b in Context::complete (r=0, this=0x55e50268b180) at /home/jdillaman/ceph_wip/src/include/Context.h:99
#11 librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::finish (this=this@entry=0x55e50167f7c0, r=r@entry=0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/UnlinkPeerRequest.cc:226

The cause was an attempt to remove a snapshot that was no longer linked to any peers (SNAPID 7836 below, with empty peer_uuids), which failed the test in "remove_snapshot":

$ rbd --cluster cluster1 --pool mirror snap ls image0001 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  4336  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151  1 MiB             Wed Dec  9 19:05:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
  7836  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea  1 MiB             Wed Dec  9 19:10:00 2020  mirror (primary peer_uuids:[])                                    
 11348  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.507b7ec7-472e-4ad9-ad9a-225db0af7e67  1 MiB             Wed Dec  9 19:15:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 14113  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.1a4350ca-1a95-4144-beb7-34f1d52f5a4f  1 MiB             Wed Dec  9 19:21:29 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
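
The cycle visible in frames #5 to #11 (unlink_peer, remove_snapshot, finish, Context::complete, unlink_peer again) can be illustrated with a toy model. This is only a sketch in Python, not the librbd code: the selection rule, the limit value of 3, the depth cap, and the names `listing`, `NoProgress` and `can_remove` are all assumptions made for illustration. The report does not say which test in "remove_snapshot" failed, so the test is passed in as an opaque predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, List

PEER = "61a576e8-a1cc-46d7-befc-3b7c82ebc12f"


@dataclass
class Snapshot:
    snap_id: int
    peer_uuids: List[str] = field(default_factory=list)


def listing() -> List[Snapshot]:
    """Snapshot IDs and peer lists copied from the `rbd snap ls` output above."""
    return [
        Snapshot(4336, [PEER]),
        Snapshot(7836, []),  # the snapshot no longer linked to any peer
        Snapshot(11348, [PEER]),
        Snapshot(14113, [PEER]),
    ]


class NoProgress(Exception):
    """Raised by the toy model instead of overflowing the real stack."""


def unlink_peer(snaps: List[Snapshot], max_snaps: int,
                can_remove: Callable[[Snapshot], bool],
                depth: int = 0, depth_limit: int = 100) -> int:
    """Toy stand-in for the unlink_peer -> remove_snapshot -> finish cycle.

    Returns the nesting depth at which pruning stopped.
    """
    if len(snaps) <= max_snaps:
        return depth  # at or under the limit: nothing to prune
    if depth >= depth_limit:
        raise NoProgress(
            f"still {len(snaps)} snapshots after {depth} nested calls")

    # Hypothetical selection rule: prefer a snapshot with no linked peers,
    # otherwise fall back to the oldest one.
    victim = next((s for s in snaps if not s.peer_uuids), snaps[0])

    # "remove_snapshot": only removes when the (unspecified) test passes.
    if can_remove(victim):
        snaps.remove(victim)

    # "finish(0)" completes the caller's context synchronously, which
    # re-enters unlink_peer one frame deeper whether or not anything changed.
    return unlink_peer(snaps, max_snaps, can_remove, depth + 1, depth_limit)
```

With a predicate that always passes, `unlink_peer(listing(), 3, lambda s: True)` removes snapshot 7836 and stops after one nested call. With a predicate that always fails, `unlink_peer(listing(), 3, lambda s: False)` selects the same snapshot on every pass and nests until the depth cap raises `NoProgress`. In this model, that case corresponds to the backtrace growing without bound, because each pass reports success (r=0) without changing the snapshot list.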

Related issues: 1 (0 open, 1 closed)

Copied to rbd - Backport #48561: octopus: [rbd-mirror] UnlinkPeerRequest state machine might loop (Resolved, Jason Dillaman)