Bug #53537

open

[rbd-mirror] ceph_assert on rollback_snapshot info during a forcepromote while snapshot is syncing

Added by Deepika Upadhyay over 2 years ago. Updated over 1 year ago.

Status:
In Progress
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

-35> 2021-12-08T12:29:20.346+0000 7f1b9effd700 20 librbd::io::AsyncOperation: 0x7f1b78019450 finish_op
   -34> 2021-12-08T12:29:20.346+0000 7f1b9effd700 10 librbd::exclusive_lock::PostAcquireRequest: 0x7f1b98048000 handle_refresh: r=0
   -33> 2021-12-08T12:29:20.346+0000 7f1b9effd700 10 librbd::exclusive_lock::PostAcquireRequest: 0x7f1b98048000 send_open_object_map:
   -32> 2021-12-08T12:29:20.346+0000 7f1b9effd700 20 librbd::object_map::RefreshRequest: 0x7f1b64003170 send: object_count=25600
   -31> 2021-12-08T12:29:20.346+0000 7f1b9effd700 10 librbd::object_map::RefreshRequest: 0x7f1b64003170 send_lock: oid=rbd_object_map.10336bdbe570
   -30> 2021-12-08T12:29:20.346+0000 7f1b9effd700 10 librbd::object_map::LockRequest: 0x7f1b78018350 send_lock: oid=rbd_object_map.10336bdbe570
   -29> 2021-12-08T12:29:20.387+0000 7f1b9e7fc700 10 librbd::object_map::LockRequest: 0x7f1b78018350 handle_lock: r=0
   -28> 2021-12-08T12:29:20.387+0000 7f1b9e7fc700 10 librbd::object_map::RefreshRequest: 0x7f1b64003170 handle_lock
   -27> 2021-12-08T12:29:20.387+0000 7f1b9e7fc700 10 librbd::object_map::RefreshRequest: 0x7f1b64003170 send_load: oid=rbd_object_map.10336bdbe570
   -26> 2021-12-08T12:29:20.387+0000 7f1b9effd700 10 librbd::object_map::RefreshRequest: 0x7f1b64003170 handle_load: r=0
   -25> 2021-12-08T12:29:20.388+0000 7f1b9effd700 20 librbd::object_map::RefreshRequest: refreshed object map: num_objs=25600
   -24> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::exclusive_lock::PostAcquireRequest: 0x7f1b98048000 handle_open_object_map: r=0
   -23> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::ExclusiveLock: 0x7f1b78003b80 handle_post_acquiring_lock
   -22> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::ImageState: 0x55b796d79410 handle_prepare_lock_complete
   -21> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::exclusive_lock::PostAcquireRequest: 0x7f1b98048000 send_process_plugin_acquire_lock:
   -20> 2021-12-08T12:29:20.388+0000 7f1b9effd700 20 librbd::PluginRegistry: 0x55b796d76b30 acquired_exclusive_lock:
   -19> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::exclusive_lock::PostAcquireRequest: 0x7f1b98048000 handle_process_plugin_acquire_lock: r=0
   -18> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 10 librbd::ExclusiveLock: 0x7f1b78003b80 handle_post_acquired_lock: r=0
   -17> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 10 librbd::ImageWatcher: 0x7f1b78008a00 notify acquired lock
   -16> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 10 librbd::ImageWatcher: 0x7f1b78008a00 current lock owner: [4193,139755618649744]
   -15> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 20 librbd::watcher::Notifier: 0x7f1b78008ac0 notify: pending=1
   -14> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700  1 lockdep reusing last freed id 75
   -13> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 20 librbd::exclusive_lock::ImageDispatch: 0x7f1b78015cf0 set_require_lock: direction=2, enabled=0
   -12> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 10 librbd::ManagedLock: 0x7f1b78003b98 handle_post_acquire_lock: r=0
   -11> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 15 librbd::mirror::snapshot::PromoteRequest: 0x7f1b64005310 handle_acquire_exclusive_lock: r=0
   -10> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 20 librbd::ManagedLock: 0x7f1b78003b98 is_lock_owner: =1
    -9> 2021-12-08T12:29:20.388+0000 7f1b9e7fc700 15 librbd::mirror::snapshot::PromoteRequest: 0x7f1b64005310 rollback:
    -8> 2021-12-08T12:29:20.388+0000 7f1b9effd700  5 librbd::Watcher: 0x7f1b78008a00 notifications_blocked: blocked=0
    -7> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::Watcher::C_NotifyAck 0x7f1b64004e90 C_NotifyAck: id=150323855466, handle=139755618649744
    -6> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::ImageWatcher: 0x7f1b78008a00 image exclusively locked announcement
    -5> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::ImageWatcher: 0x7f1b78008a00 current lock owner: [4193,139755618649744]
    -4> 2021-12-08T12:29:20.388+0000 7f1b9effd700 10 librbd::Watcher::C_NotifyAck 0x7f1b64004e90 finish: r=0
    -3> 2021-12-08T12:29:20.388+0000 7f1b9effd700 20 librbd::watcher::Notifier: 0x7f1b78008ac0 handle_notify: r=0
    -2> 2021-12-08T12:29:20.388+0000 7f1b9effd700 20 librbd::watcher::Notifier: 0x7f1b78008ac0 handle_notify: pending=0
    -1> 2021-12-08T12:29:20.389+0000 7f1b9e7fc700 -1 ../src/librbd/mirror/snapshot/PromoteRequest.cc: In function 'void librbd::mirror::snapshot::PromoteRequest<ImageCtxT>::rollback() [with ImageCtxT = librbd::ImageCtx]' thread 7f1b9e7fc700 time 2021-12-08T12:29:20.389837+0000
../src/librbd/mirror/snapshot/PromoteRequest.cc: 261: FAILED ceph_assert(info != nullptr)

 ceph version 17.0.0-9223-gf3d3dcee87a (f3d3dcee87aa6ccf5c8e905a275b4673aab98570) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7f1baf8930cf]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f1baf893301]

Files

relocate.sh (8.66 KB), Deepika Upadhyay, 12/08/2021 01:06 PM

Related issues 2 (0 open, 2 closed)

Has duplicate: rbd - Bug #56045: Force Promote Pool Fails (Duplicate)

Has duplicate: rbd - Bug #58172: get_rollback_snap_id throws bad variant access (Duplicate, Christopher Hoffman)

Actions #1

Updated by Sunny Kumar over 2 years ago

  • Project changed from RADOS to rbd
Actions #2

Updated by Deepika Upadhyay over 2 years ago

Actions #3

Updated by Deepika Upadhyay over 2 years ago

more logs present at: /home/ideepika/tracker-53537 on vossi03 sepia lab

Actions #4

Updated by Mykola Golub over 2 years ago

Deepika Upadhyay wrote:

more logs present at: /home/ideepika/tracker-53537 on vossi03 sepia lab

[trociny@vossi03 ~]$ cd  /home/ideepika/tracker-53537
-bash: cd: /home/ideepika/tracker-53537: Permission denied
[trociny@vossi03 ~]$ ls -ld /home/ideepika/
drwx------. 25 ideepika ideepika 4096 Dec  8 19:48 /home/ideepika/

Ah, sudo works on this host. Sorry, never mind.

Actions #5

Updated by Mykola Golub over 2 years ago

Deepika Upadhyay wrote:

more logs present at: /home/ideepika/tracker-53537 on vossi03 sepia lab

What was happening there:

The force promote command was running while rbd-mirror (in snapshot mode) was still replaying the image. The image had a complete snapshot (id=8) and an incomplete snapshot (id=13) that was still being copied at that time. As the last snapshot (13) was incomplete and the force flag was used, the promote decided to roll back to the last complete snapshot (id=8).

The next step for the "promote" command was to create an orphan mirror snapshot, which is used as a "signal" for currently replaying rbd-mirrors to detect the state change and stop replaying. After this the "promote" waited on the watcher for all mirroring listeners to detach (i.e. to stop replay). But the rbd-mirror did not stop immediately, because it was already in the process of copying the snapshot, so it completed the copy first, then removed the older complete snapshot (8), which was not needed any more, and only then stopped.

After this the "promote" proceeded with the rollback, found that the snapshot (id=8) no longer existed, and crashed.
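
To make the race concrete, here is a minimal standalone sketch (the snapshot map, IDs and find_snap() helper below are made up for illustration and are not the actual librbd code; only the ordering matches the log):

// Standalone illustration of the race; the snapshot map, IDs and find_snap()
// helper are invented for this example, only the ordering matches the log.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

struct SnapInfo { std::string name; };
using SnapMap = std::map<uint64_t, SnapInfo>;

const SnapInfo* find_snap(const SnapMap& snaps, uint64_t snap_id) {
  auto it = snaps.find(snap_id);
  return it != snaps.end() ? &it->second : nullptr;
}

int main() {
  // Image as seen when "promote --force" starts: one complete mirror snapshot
  // (id=8) and one partial snapshot (id=13) that is still being copied.
  SnapMap snaps = {{8, {"complete"}}, {13, {"partial"}}};

  // The promote picks the last complete snapshot as the rollback target up front.
  uint64_t rollback_snap_id = 8;

  // While the promote waits for the watchers to detach, rbd-mirror finishes
  // copying snapshot 13 and removes the no-longer-needed snapshot 8.
  snaps.erase(8);

  // rollback() then looks the target up again and expects it to still exist;
  // this is where the ceph_assert(info != nullptr) from the log fires.
  const SnapInfo* info = find_snap(snaps, rollback_snap_id);
  std::printf("rollback snapshot %sfound -> assert would %s\n",
              info ? "" : "not ", info ? "pass" : "fail");
}

The point is that the rollback target is chosen before waiting for the rbd-mirror to detach, but it is only dereferenced afterwards, by which time the snapshot may already be gone.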

In general, this is not a scenario where "forced" promotion is supposed to be used. It is intended as a last-resort operation, when there is no connection with the other side and the rbd-mirror therefore cannot complete the operation. So if users observe this in real life I would rather investigate what they are doing, because it seems they are either doing the wrong thing or have wrong expectations about the commands they use. In general, using the "force" promotion flag means losing some data. Running it while the rbd-mirror is successfully replaying the image is wrong.

Anyway, we can try to improve "promote" [1] to work more gracefully in this scenario. Instead of detecting "requires_orphan" and "rollback_snap_id" in one step (by calling util::can_create_primary_snapshot) at the beginning, we could first just check whether we need an "orphan". Then, if we do, proceed with creating the "orphan" snapshot, wait for all watchers to detach (i.e. for the rbd-mirror to stop replay), and only then check whether we need a rollback and to which snapshot. In other words, we need to call `util::can_create_primary_snapshot` a second time, after the `acquire_exclusive_lock` step (see the sketch below).

[1] src/librbd/mirror/snapshot/PromoteRequest.cc
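
A rough, hypothetical sketch of that reordering (the helper names are invented stubs; the real PromoteRequest in src/librbd/mirror/snapshot/PromoteRequest.cc is an asynchronous state machine, this only illustrates the proposed order of steps):

// Hypothetical, simplified ordering for a forced promote; all helpers below
// are invented stubs standing in for the real librbd steps.
#include <cstdint>
#include <optional>

bool check_requires_orphan() { return true; }           // stub
void create_orphan_snapshot() {}                        // stub
void wait_for_watchers_to_detach() {}                   // stub
void acquire_exclusive_lock() {}                        // stub
std::optional<uint64_t> recheck_rollback_snap_id() {    // stub
  return std::nullopt;
}
void rollback_to(uint64_t) {}                           // stub
void create_promote_snapshot() {}                       // stub

void promote_force_sketch() {
  // 1. Up front, only decide whether an orphan "signal" snapshot is needed;
  //    do not pick the rollback snapshot yet.
  if (check_requires_orphan()) {
    create_orphan_snapshot();        // tells replaying rbd-mirrors to stop
    wait_for_watchers_to_detach();   // rbd-mirror finishes its copy and detaches
  }

  acquire_exclusive_lock();

  // 2. Re-check (i.e. call util::can_create_primary_snapshot() a second time)
  //    now that rbd-mirror may have completed the partial snapshot and pruned
  //    the older complete one, and only then decide whether/where to roll back.
  if (auto rollback_snap_id = recheck_rollback_snap_id()) {
    rollback_to(*rollback_snap_id);
  }

  create_promote_snapshot();
}

int main() { promote_force_sketch(); }

With this ordering the rollback snapshot is only selected after the rbd-mirror has finished and detached, so it cannot be pruned out from under the request.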

Actions #6

Updated by Neha Ojha over 2 years ago

  • Project changed from RADOS to rbd
Actions #7

Updated by Ilya Dryomov about 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Deepika Upadhyay
  • Backport set to octopus,pacific
  • Pull request ID set to 45375
Actions #8

Updated by Ilya Dryomov almost 2 years ago

  • Has duplicate Bug #56045: Force Promote Pool Fails added
Actions #9

Updated by Ilya Dryomov almost 2 years ago

  • Backport changed from octopus,pacific to octopus,pacific,quincy
Actions #10

Updated by Christopher Hoffman over 1 year ago

  • Assignee changed from Deepika Upadhyay to Christopher Hoffman
Actions #11

Updated by Ilya Dryomov over 1 year ago

  • Has duplicate Bug #58172: get_rollback_snap_id throws bad variant access added
Actions #12

Updated by Ilya Dryomov over 1 year ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Christopher Hoffman to Ilya Dryomov
  • Backport changed from octopus,pacific,quincy to pacific,quincy
  • Pull request ID deleted (45375)
Actions #13

Updated by Ilya Dryomov over 1 year ago

  • Priority changed from Normal to High