Bug #48525
[rbd-mirror] UnlinkPeerRequest state machine might loop
Status: Resolved
Priority: High
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus
Regression: No
Severity: 3 - minor
Reviewed:
Description
An MGR process died with over 40,000 frames of backtrace in the following loop:
#0 0x00007f72363a739e in md_config_t::_get_val (this=0x55e500973450, values=..., o=..., stack=0x0, err=0x0) at /home/jdillaman/ceph_wip/src/common/config.cc:1082
#1 0x00007f72363a7b97 in md_config_t::_get_val (this=0x55e500973450, values=..., key=..., stack=0x0, err=0x0) at /home/jdillaman/ceph_wip/src/common/config.cc:1074
#2 0x00007f72363a7e63 in md_config_t::get_val_generic[abi:cxx11](ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >) const (this=<optimized out>, values=..., key=...) at /home/jdillaman/ceph_wip/src/common/config.cc:1052
#3 0x00007f721b100a11 in md_config_t::get_val<unsigned long> (key=..., values=..., this=0x55e500973450) at /home/jdillaman/ceph_wip/src/common/config.h:353
#4 ceph::common::ConfigProxy::get_val<unsigned long> (key="rbd_mirroring_max_mirroring_snapshots", this=0x55e500970008) at /home/jdillaman/ceph_wip/src/common/config_proxy.h:143
#5 librbd::mirror::snapshot::CreatePrimaryRequest<librbd::ImageCtx>::unlink_peer (this=0x55e5006764e0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/CreatePrimaryRequest.cc:185
#6 0x00007f721b11006b in Context::complete (r=0, this=0x55e50268b190) at /home/jdillaman/ceph_wip/src/include/Context.h:99
#7 librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::finish (this=this@entry=0x55e50167f7c0, r=r@entry=0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/UnlinkPeerRequest.cc:226
#8 0x00007f721b1117ce in librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::remove_snapshot (this=0x55e50167f7c0) at /home/jdillaman/ceph_wip/src/log/SubsystemMap.h:72
#9 0x00007f721b100c53 in librbd::mirror::snapshot::CreatePrimaryRequest<librbd::ImageCtx>::unlink_peer (this=0x55e5006764e0) at /usr/include/c++/10/bits/basic_string.h:907
#10 0x00007f721b11006b in Context::complete (r=0, this=0x55e50268b180) at /home/jdillaman/ceph_wip/src/include/Context.h:99
#11 librbd::mirror::snapshot::UnlinkPeerRequest<librbd::ImageCtx>::finish (this=this@entry=0x55e50167f7c0, r=r@entry=0) at /home/jdillaman/ceph_wip/src/librbd/mirror/snapshot/UnlinkPeerRequest.cc:226
This happened because a snapshot that was no longer linked to any peers was being removed, but it failed the test in "remove_snapshot":
$ rbd --cluster cluster1 --pool mirror snap ls image0001 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                NAMESPACE
  4336  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151  1 MiB             Wed Dec 9 19:05:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
  7836  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea  1 MiB             Wed Dec 9 19:10:00 2020  mirror (primary peer_uuids:[])
 11348  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.507b7ec7-472e-4ad9-ad9a-225db0af7e67  1 MiB             Wed Dec 9 19:15:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 14113  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.1a4350ca-1a95-4144-beb7-34f1d52f5a4f  1 MiB             Wed Dec 9 19:21:29 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
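The recursion in the backtrace can be illustrated with a toy model. This sketch is not the librbd code: the function names, the dictionary layout, the choice of the oldest snapshot as the pruning candidate, and the exact form of the test in remove_snapshot are all assumptions made for illustration. It only shows how a request that completes inline with r=0 without changing any state sends its caller straight back into the same step.

```python
class LoopDetected(RuntimeError):
    """Raised by the toy model when pruning makes no progress."""


def snap(snap_id, peer_uuids):
    # Hypothetical in-memory stand-in for a mirror snapshot.
    return {"id": snap_id, "peer_uuids": set(peer_uuids), "removed": False}


def unlink_peer_request(snapshot, peer_uuid, on_finish):
    # Toy UnlinkPeerRequest. Assumed test: only remove a snapshot that is
    # linked to exactly the peer being unlinked.
    if snapshot["peer_uuids"] == {peer_uuid}:
        snapshot["removed"] = True
    # Completes inline with r=0 whether or not anything was removed.
    on_finish(0)


def prune(snapshots, peer_uuid, max_snapshots, max_depth=100):
    # Toy CreatePrimaryRequest::unlink_peer. Returns the number of unlink
    # requests issued, or raises LoopDetected once max_depth is exceeded.
    depth = 0

    def unlink_peer(r=0):
        nonlocal depth
        live = [s for s in snapshots if not s["removed"]]
        if len(live) <= max_snapshots:
            return  # within the limit, nothing left to prune
        depth += 1
        if depth > max_depth:
            raise LoopDetected("no progress after %d re-entries" % max_depth)
        # Assumption: the oldest live snapshot is the pruning candidate.
        unlink_peer_request(live[0], peer_uuid, unlink_peer)

    unlink_peer()
    return depth


if __name__ == "__main__":
    peer = "61a576e8-a1cc-46d7-befc-3b7c82ebc12f"
    healthy = [snap(1, [peer]), snap(2, [peer]), snap(3, [peer])]
    print("healthy:", prune(healthy, peer, max_snapshots=2))
    stuck = [snap(1, []), snap(2, [peer]), snap(3, [peer])]
    try:
        prune(stuck, peer, max_snapshots=2)
    except LoopDetected as exc:
        print("stuck:", exc)
```

In the healthy case one request removes the oldest snapshot and pruning stops. In the stuck case the peerless snapshot never passes the assumed test, so every completion re-enters the same step; the depth guard exists only in this model and stands in for the stack exhaustion seen in the real process.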
Updated by Jason Dillaman over 3 years ago
Found another image that was missing its peers in the mirror snapshot:
$ rbd --cluster cluster1 --pool mirror snap ls image0002 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                NAMESPACE
  7837  .mirror.primary.dd9747a6-7628-4e2b-a39a-8757735ac1e0.3bb60c7d-442a-41cc-ae88-a68bc3ad3d6f  1 MiB             Wed Dec 9 19:10:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 11350  .mirror.primary.dd9747a6-7628-4e2b-a39a-8757735ac1e0.e21dfc60-7bfe-4ae0-9a3d-062f829d06a7  1 MiB             Wed Dec 9 19:15:00 2020  mirror (primary peer_uuids:[])
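Spotting affected images by eye is tedious, so a small helper that scans the text of "rbd snap ls --all" for primary mirror snapshots with an empty peer_uuids list may be useful. This is a sketch only: the function name is hypothetical, it assumes one snapshot per line, and it only recognizes the "(primary peer_uuids:[...])" form shown in the listings here.

```python
import re

# Matches a snapshot row and captures SNAPID, NAME and the peer_uuids body.
_MIRROR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s.*\(primary peer_uuids:\[([^\]]*)\]\)\s*$"
)


def find_unlinked_mirror_snapshots(listing):
    """Return (snap_id, name) for primary mirror snapshots with no peers."""
    unlinked = []
    for line in listing.splitlines():
        match = _MIRROR_ROW.match(line)
        if match and not match.group(3).strip():
            unlinked.append((int(match.group(1)), match.group(2)))
    return unlinked


# Listing for image0002, copied from the report with one row per line.
SAMPLE = """\
SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE
7837 .mirror.primary.dd9747a6-7628-4e2b-a39a-8757735ac1e0.3bb60c7d-442a-41cc-ae88-a68bc3ad3d6f 1 MiB Wed Dec 9 19:10:00 2020 mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
11350 .mirror.primary.dd9747a6-7628-4e2b-a39a-8757735ac1e0.e21dfc60-7bfe-4ae0-9a3d-062f829d06a7 1 MiB Wed Dec 9 19:15:00 2020 mirror (primary peer_uuids:[])
"""

if __name__ == "__main__":
    for snap_id, name in find_unlinked_mirror_snapshots(SAMPLE):
        print(snap_id, name)
```

The header row and any snapshots outside the mirror namespace simply fail to match and are skipped.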
The source of this bug needs to be fixed, but unlink should remove the snapshot if it is not the first mirror snapshot. It should also gracefully handle the case where the notify refresh fails due to a timeout.
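One possible reading of the removal rule is sketched here. It is an interpretation, not the change made in the pull request: the function name and data layout are hypothetical, and "first mirror snapshot" is assumed to mean the oldest entry in the list. A snapshot that ends up with no peers is dropped unless it is that first entry.

```python
def unlink(snapshots, snap_id, peer_uuid):
    """Return a new snapshot list after unlinking peer_uuid from snap_id.

    snapshots is an oldest-first list of dicts with "id" and "peer_uuids"
    (a set). Assumed rule: a snapshot left with no peers is removed unless
    it is the first mirror snapshot.
    """
    index = next(
        (i for i, s in enumerate(snapshots) if s["id"] == snap_id), None
    )
    if index is None:
        return list(snapshots)  # already gone, nothing to do
    remaining = snapshots[index]["peer_uuids"] - {peer_uuid}
    if not remaining and index != 0:
        # No peers left and not the first mirror snapshot: remove it.
        return snapshots[:index] + snapshots[index + 1:]
    updated = dict(snapshots[index], peer_uuids=remaining)
    return snapshots[:index] + [updated] + snapshots[index + 1:]


if __name__ == "__main__":
    peer = "61a576e8-a1cc-46d7-befc-3b7c82ebc12f"
    # Snapshot IDs and peer sets for image0001 as listed in the report.
    image0001 = [
        {"id": 4336, "peer_uuids": {peer}},
        {"id": 7836, "peer_uuids": set()},
        {"id": 11348, "peer_uuids": {peer}},
        {"id": 14113, "peer_uuids": {peer}},
    ]
    print([s["id"] for s in unlink(image0001, 7836, peer)])
```

Under this rule the peerless snapshot 7836 is removed even though the peer is not in its set, which is the case that previously made no progress.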
Updated by Jason Dillaman over 3 years ago
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
Updated by Jason Dillaman over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 38517
Updated by Mykola Golub over 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 3 years ago
- Copied to Backport #48561: octopus: [rbd-mirror] UnlinkPeerRequest state machine might loop added
Updated by Loïc Dachary about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".