Bug #44159
[rbd-mirror] Mirror daemon never recovers from being blacklisted
Status: Closed
Description
I can reproduce this rather reliably by:
- Restarting many OSDs (old nodes with slow spinning disks, likely exceeding the default blacklist timeout).
- Sometimes, it also happens when restarting other RBD mirror daemons (we have 3).
The attached log, captured at log level 15, is extracted from one blacklisted rbd-mirror daemon that was unable to recover.
RBD volume names and domains are sanitized, otherwise the log is untouched.
Updated by Mykola Golub about 4 years ago
- Status changed from New to In Progress
- Assignee set to Mykola Golub
Updated by Mykola Golub about 4 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 33411
In the provided log there are many messages like these:
2020-02-14 02:14:56.653 7f42f7ac1700 -1 rbd::mirror::InstanceReplayer: 0x55bab1dc3b80 start_image_replayer: global_image_id=446b538f-1f61-4daa-b05f-93f76cd5e652: blacklisted detected during image replay
2020-02-14 02:14:56.660 7f42f7ac1700  5 rbd::mirror::LeaderWatcher: 0x55bab29a9200 handle_rewatch_complete: r=-108
So both the rbd-mirror InstanceReplayer and LeaderWatcher detected the "blacklisted" state, but the error was not propagated to the higher level to restart the PoolReplayer.
Updated by Jason Dillaman about 4 years ago
- Backport set to luminous,mimic,nautilus
Updated by Jason Dillaman about 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #44262: mimic: [rbd-mirror] Mirror daemon never recovers from being blacklisted added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #44263: nautilus: [rbd-mirror] Mirror daemon never recovers from being blacklisted added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #44264: luminous: [rbd-mirror] Mirror daemon never recovers from being blacklisted added
Updated by Nathan Cutler about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".