Bug #19413

closed

Cannot delete some snapshots after upgrade from jewel to kraken

Added by Benoit Loriot about 7 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: High
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source:
Tags:
Backport: kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

after upgrade from Jewel to Kraken, I got snapshots on several images that I can't delete.

# rbd -p qa-volumes snap ls --image volume-51883b81-87a9-4353-b317-9bef54e2e92f
SNAPID NAME SIZE
485 snapshot-d5d9b870-2416-4200-bd6d-d9a40be892d6 10240 MB
# rbd -p qa-volumes snap rm --image volume-51883b81-87a9-4353-b317-9bef54e2e92f --snap snapshot-d5d9b870-2416-4200-bd6d-d9a40be892d6
Removing snap: 0% complete...failed.
rbd: failed to remove snapshot: (22) Invalid argument
# ceph --version
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

Related issues: 1 (0 open, 1 closed)

Copied to rbd - Backport #19833: kraken: Cannot delete some snapshots after upgrade from jewel to kraken (Resolved, Jason Dillaman)
Actions #1

Updated by Denis Horbunov about 7 years ago

Hi there. I've also stumbled upon this issue. My Ceph cluster was migrated from 10.2.5 to 11.2.0 last week (ceph --version reports ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)). I was aware of this issue in advance, and all snapshots were deleted prior to the migration as a precautionary measure, but one image has an undeletable snapshot that was taken after the migration to Kraken.
# rbd info oneimages-pool/one-73
rbd image 'one-73':
size 200 GB in 51200 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.10425036238e1f29
format: 2
features: layering, exclusive-lock
flags:

# rbd snap ls oneimages-pool/one-73
SNAPID NAME SIZE
35637 snapmirror.1 200 GB
35704 snapmirror.2 200 GB

# rbd snap unprotect
rbd: snap is already unprotected

# rbd snap rm
Removing snap: 0% complete...failed.
rbd: failed to remove snapshot: (22) Invalid argument

Actions #2

Updated by Denis Horbunov about 7 years ago

Maybe this will be more helpful.

rbd snap rm --debug-rbd 50
2017-04-26 13:55:07.388358 7f89d2b88000 5 librbd::AioImageRequestWQ: 0x55887fef8a70 : ictx=0x55887ff677b0
2017-04-26 13:55:07.388485 7f89d2b88000 20 librbd::ImageState: 0x55887ff66e80 open
2017-04-26 13:55:07.388500 7f89d2b88000 10 librbd::ImageState: 0x55887ff66e80 0x55887ff66e80 send_open_unlock
2017-04-26 13:55:07.388519 7f89d2b88000 10 librbd::image::OpenRequest: 0x55887ff68d70 send_v2_detect_header
2017-04-26 13:55:07.390219 7f89ab7fe700 10 librbd::image::OpenRequest: handle_v2_detect_header: r=0
2017-04-26 13:55:07.390237 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_v2_get_id
2017-04-26 13:55:07.390986 7f89ab7fe700 10 librbd::image::OpenRequest: handle_v2_get_id: r=0
2017-04-26 13:55:07.391005 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_v2_get_immutable_metadata
2017-04-26 13:55:07.392548 7f89ab7fe700 10 librbd::image::OpenRequest: handle_v2_get_immutable_metadata: r=0
2017-04-26 13:55:07.392559 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_v2_get_stripe_unit_count
2017-04-26 13:55:07.393214 7f89ab7fe700 10 librbd::image::OpenRequest: handle_v2_get_stripe_unit_count: r=-8
2017-04-26 13:55:07.393219 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_v2_get_data_pool
2017-04-26 13:55:07.393855 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 handle_v2_get_data_pool: r=0
2017-04-26 13:55:07.393870 7f89ab7fe700 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.10425036238e1f29 format rbd_data.10425036238e1f29.%016llx
2017-04-26 13:55:07.393877 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_v2_apply_metadata: start_key=conf_
2017-04-26 13:55:07.394675 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 handle_v2_apply_metadata: r=0
2017-04-26 13:55:07.394694 7f89ab7fe700 20 librbd::ImageCtx: apply_metadata
2017-04-26 13:55:07.395198 7f89ab7fe700 20 librbd::ImageCtx: enabling caching...
2017-04-26 13:55:07.395203 7f89ab7fe700 20 librbd::ImageCtx: Initial cache settings: size=1024000 num_objects=10 max_dirty=768000 target_dirty=512000 max_dirty_age=1
2017-04-26 13:55:07.395290 7f89ab7fe700 10 librbd::ImageCtx: cache bytes 1024000 -> about 32 objects
2017-04-26 13:55:07.395330 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_register_watch
2017-04-26 13:55:07.395433 7f89ab7fe700 10 librbd::ImageWatcher: 0x7f8998034c50 registering image watcher
2017-04-26 13:55:07.401575 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 handle_register_watch: r=0
2017-04-26 13:55:07.401586 7f89ab7fe700 10 librbd::image::OpenRequest: 0x55887ff68d70 send_refresh
2017-04-26 13:55:07.401589 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_get_mutable_metadata
2017-04-26 13:55:07.402798 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_get_mutable_metadata: r=0
2017-04-26 13:55:07.402821 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_get_flags
2017-04-26 13:55:07.403775 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_get_flags: r=0
2017-04-26 13:55:07.403788 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_get_group
2017-04-26 13:55:07.404642 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_get_group: r=0
2017-04-26 13:55:07.404652 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_get_snapshots
2017-04-26 13:55:07.406129 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_get_snapshots: r=0
2017-04-26 13:55:07.406143 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_get_snap_namespaces
2017-04-26 13:55:07.407092 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_get_snap_namespaces: r=0
2017-04-26 13:55:07.407115 7f89ab7fe700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_init_exclusive_lock
2017-04-26 13:55:07.407123 7f89ab7fe700 10 librbd::ExclusiveLock: 0x7f8998041710 init
2017-04-26 13:55:07.407126 7f89ab7fe700 5 librbd::AioImageRequestWQ: block_writes: 0x55887ff677b0, num=1
2017-04-26 13:55:07.407198 7f89aaffd700 10 librbd::ExclusiveLock: 0x7f8998041710 handle_init_complete
2017-04-26 13:55:07.407208 7f89aaffd700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_init_exclusive_lock: r=0
2017-04-26 13:55:07.407211 7f89aaffd700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_v2_apply
2017-04-26 13:55:07.407215 7f89aaffd700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_v2_apply
2017-04-26 13:55:07.407222 7f89aaffd700 20 librbd::image::RefreshRequest: 0x7f8998036ad0 apply
2017-04-26 13:55:07.407228 7f89aaffd700 20 librbd::image::RefreshRequest: new snapshot id=35704 name=snapmirror.2 size=214748364800
2017-04-26 13:55:07.407231 7f89aaffd700 20 librbd::image::RefreshRequest: new snapshot id=35637 name=snapmirror.1 size=214748364800
2017-04-26 13:55:07.407244 7f89aaffd700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 send_flush_aio
2017-04-26 13:55:07.407249 7f89aaffd700 10 librbd::image::RefreshRequest: 0x7f8998036ad0 handle_flush_aio: r=0
2017-04-26 13:55:07.407254 7f89aaffd700 10 librbd::image::OpenRequest: handle_refresh: r=0
2017-04-26 13:55:07.407259 7f89aaffd700 10 librbd::ImageState: 0x55887ff66e80 0x55887ff66e80 handle_open: r=0
2017-04-26 13:55:07.407315 7f89d2b88000 20 librbd: snap_remove 0x55887ff677b0 snapmirror.1 flags: 0
2017-04-26 13:55:07.407327 7f89d2b88000 20 librbd: get_snap_namespace 0x55887ff677b0 snapmirror.1
Removing snap: 0% complete...failed.
rbd: failed to remove snapshot: (22) Invalid argument
2017-04-26 13:55:07.407495 7f89d2b88000 20 librbd::ImageState: 0x55887ff66e80 close
2017-04-26 13:55:07.407502 7f89d2b88000 10 librbd::ImageState: 0x55887ff66e80 0x55887ff66e80 send_close_unlock
2017-04-26 13:55:07.407504 7f89d2b88000 10 librbd::image::CloseRequest: 0x55887ff68a50 send_shut_down_update_watchers
2017-04-26 13:55:07.407505 7f89d2b88000 20 librbd::ImageState: 0x55887ff66e80 shut_down_update_watchers
2017-04-26 13:55:07.407506 7f89d2b88000 20 librbd::ImageState: 0x55887ff66f60 ImageUpdateWatchers::shut_down
2017-04-26 13:55:07.407508 7f89d2b88000 20 librbd::ImageState: 0x55887ff66f60 ImageUpdateWatchers::shut_down: completing shut down
2017-04-26 13:55:07.407542 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_shut_down_update_watchers: r=0
2017-04-26 13:55:07.407550 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 send_unregister_image_watcher
2017-04-26 13:55:07.407554 7f89aaffd700 10 librbd::ImageWatcher: 0x7f8998034c50 unregistering image watcher
2017-04-26 13:55:07.412588 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_unregister_image_watcher: r=0
2017-04-26 13:55:07.412597 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 send_shut_down_aio_queue
2017-04-26 13:55:07.412600 7f89aaffd700 5 librbd::AioImageRequestWQ: shut_down: in_flight=0
2017-04-26 13:55:07.412604 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_shut_down_aio_queue: r=0
2017-04-26 13:55:07.412607 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 send_shut_down_exclusive_lock
2017-04-26 13:55:07.412608 7f89aaffd700 10 librbd::ExclusiveLock: 0x7f8998041710 shut_down
2017-04-26 13:55:07.412612 7f89aaffd700 10 librbd::ExclusiveLock: 0x7f8998041710 handle_shutdown: r=0
2017-04-26 13:55:07.412615 7f89aaffd700 20 librbd::AioImageRequestWQ: clear_require_lock_on_read
2017-04-26 13:55:07.412617 7f89aaffd700 5 librbd::AioImageRequestWQ: unblock_writes: 0x55887ff677b0, num=0
2017-04-26 13:55:07.412628 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_shut_down_exclusive_lock: r=0
2017-04-26 13:55:07.412634 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 send_flush_readahead
2017-04-26 13:55:07.412640 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_flush_readahead: r=0
2017-04-26 13:55:07.412644 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 send_shut_down_cache
2017-04-26 13:55:07.412700 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_shut_down_cache: r=0
2017-04-26 13:55:07.412706 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 send_flush_op_work_queue
2017-04-26 13:55:07.412710 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_flush_op_work_queue: r=0
2017-04-26 13:55:07.412713 7f89aaffd700 10 librbd::image::CloseRequest: 0x55887ff68a50 handle_flush_image_watcher: r=0
2017-04-26 13:55:07.412734 7f89aaffd700 10 librbd::ImageState: 0x55887ff66e80 0x55887ff66e80 handle_close: r=0

Actions #3

Updated by Jason Dillaman almost 7 years ago

  • Project changed from Ceph to rbd
  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
  • Priority changed from Normal to High
Actions #4

Updated by Jason Dillaman almost 7 years ago

  • Backport set to kraken
Actions #5

Updated by Jason Dillaman almost 7 years ago

  • Backport deleted (kraken)
Actions #6

Updated by Jason Dillaman almost 7 years ago

  • Status changed from In Progress to Need More Info

@Denis: can you do me a favor, run the following, and attach the results?

# rados -p oneimages-pool listomapvals rbd_header.10425036238e1f29
Actions #7

Updated by Jason Dillaman almost 7 years ago

  • Status changed from Need More Info to Fix Under Review
  • Backport set to kraken
Actions #8

Updated by Benoit Loriot almost 7 years ago

# rados -p qa-volumes listomapvals rbd_data.3d91f429d7ac52
error getting omap keys qa-volumes/rbd_data.3d91f429d7ac52: (2) No such file or directory
Actions #9

Updated by Jason Dillaman almost 7 years ago

@Benoit: replace "rbd_data" with "rbd_header" in your command above.
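
The substitution can be written as a small helper. This is only a sketch: the function name `header_object_for` is hypothetical, and it assumes the naming seen in this thread, where the data prefix `rbd_data.<id>` reported by `rbd info` and the header object `rbd_header.<id>` share the same image id.

```python
def header_object_for(block_name_prefix: str) -> str:
    """Map an image's data prefix (from `rbd info`) to its header object name.

    Assumption: the header object is named rbd_header.<id>, where <id> is the
    part of block_name_prefix after "rbd_data.", as in the commands above.
    """
    data_prefix = "rbd_data."
    if not block_name_prefix.startswith(data_prefix):
        raise ValueError(f"unexpected block_name_prefix: {block_name_prefix!r}")
    image_id = block_name_prefix[len(data_prefix):]
    return f"rbd_header.{image_id}"


if __name__ == "__main__":
    # Prefix taken from the failing command earlier in this thread.
    print(header_object_for("rbd_data.3d91f429d7ac52"))
```

The resulting name is the object to pass to `rados listomapvals`.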

Actions #10

Updated by Benoit Loriot almost 7 years ago

Here is the correct output

# rados -p qa-volumes listomapvals rbd_header.3d91f429d7ac52 --debug-rbd 50
features
value (8 bytes) :
00000000  3d 00 00 00 00 00 00 00                           |=.......|
00000008

object_prefix
value (27 bytes) :
00000000  17 00 00 00 72 62 64 5f  64 61 74 61 2e 33 64 39  |....rbd_data.3d9|
00000010  31 66 34 32 39 64 37 61  63 35 32                 |1f429d7ac52|
0000001b

order
value (1 bytes) :
00000000  16                                                |.|
00000001

size
value (8 bytes) :
00000000  00 00 00 80 02 00 00 00                           |........|
00000008

snap_seq
value (8 bytes) :
00000000  e8 01 00 00 00 00 00 00                           |........|
00000008

snapshot_00000000000001e5
value (132 bytes) :
00000000  05 01 7e 00 00 00 e5 01  00 00 00 00 00 00 2d 00  |..~...........-.|
00000010  00 00 73 6e 61 70 73 68  6f 74 2d 64 35 64 39 62  |..snapshot-d5d9b|
00000020  38 37 30 2d 32 34 31 36  2d 34 32 30 30 2d 62 64  |870-2416-4200-bd|
00000030  36 64 2d 64 39 61 34 30  62 65 38 39 32 64 36 00  |6d-d9a40be892d6.|
00000040  00 00 80 02 00 00 00 3d  00 00 00 00 00 00 00 01  |.......=........|
00000050  01 1c 00 00 00 ff ff ff  ff ff ff ff ff 00 00 00  |................|
00000060  00 fe ff ff ff ff ff ff  ff 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 01 01 04 00 00 00  |................|
00000080  ff ff ff ff                                       |....|
00000084
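
The snapshot value above can be checked mechanically. The following is a minimal sketch that rebuilds the raw bytes from the hexdump and pulls out a few fields. The field layout (a 6-byte header, then a little-endian 64-bit snapshot id, then a length-prefixed name) is an assumption inferred from the bytes themselves, not something stated in this thread, and `parse_hexdump` is a hypothetical helper name.

```python
import struct

# Hexdump of the snapshot_00000000000001e5 value, copied from the output above.
DUMP = """\
00000000  05 01 7e 00 00 00 e5 01  00 00 00 00 00 00 2d 00  |..~...........-.|
00000010  00 00 73 6e 61 70 73 68  6f 74 2d 64 35 64 39 62  |..snapshot-d5d9b|
00000020  38 37 30 2d 32 34 31 36  2d 34 32 30 30 2d 62 64  |870-2416-4200-bd|
00000030  36 64 2d 64 39 61 34 30  62 65 38 39 32 64 36 00  |6d-d9a40be892d6.|
00000040  00 00 80 02 00 00 00 3d  00 00 00 00 00 00 00 01  |.......=........|
00000050  01 1c 00 00 00 ff ff ff  ff ff ff ff ff 00 00 00  |................|
00000060  00 fe ff ff ff ff ff ff  ff 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 01 01 04 00 00 00  |................|
00000080  ff ff ff ff                                       |....|
00000084
"""


def parse_hexdump(text: str) -> bytes:
    """Rebuild raw bytes from `hexdump -C` style lines."""
    raw = bytearray()
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip the closing offset line and any non-dump lines
        fields = line.split("|", 1)[0].split()
        raw += bytes.fromhex("".join(fields[1:]))  # fields[0] is the offset
    return bytes(raw)


value = parse_hexdump(DUMP)

# Assumed layout: 6-byte header, u64 snap id, u32 name length, name bytes.
snap_id = struct.unpack_from("<Q", value, 6)[0]
name_len = struct.unpack_from("<I", value, 14)[0]
name = value[18:18 + name_len].decode("ascii")
trailer = value[-4:]  # the four bytes discussed in this issue

print(len(value), snap_id, name, trailer.hex())
```

Under that assumed layout, the decoded id equals the hexadecimal suffix of the key name (0x1e5 = 485) and the SNAPID shown by `rbd snap ls` in the description, and the value ends in four 0xFF bytes.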

Actions #11

Updated by Jason Dillaman almost 7 years ago

@Benoit: Great, thanks. As I expected, the problem is the four 0xFF bytes at the end of your snapshot_XYZ value. The associated PR should prevent that issue from occurring in the future. In the meantime, the easiest way to resolve the issue would be to use a pre-Kraken rbd CLI to remove the snapshot. Alternatively, you can use the rados CLI to write the key value to a file, use a hex editor to change the last four bytes to all 0x00, and again use the rados CLI to update the value using the contents of said file.

The issue occurred when the OSDs were upgraded to Kraken and a Jewel RBD client created a snapshot. The Kraken OSD incorrectly wrote the four 0xFF values (-1) instead of four 0x00 values since the Jewel RBD client doesn't understand snapshot namespaces. When the RBD client was later upgraded to Kraken, it interpreted that namespace value as an invalid namespace and returned a -EINVAL error code.
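
For anyone who prefers a script to a hex editor, the byte change itself could look roughly like the sketch below. It covers only the editing step: it assumes the key value has already been written to a file, and it does not show the rados commands for reading or writing the key, since those are not given in this thread. The names `zero_bad_namespace` and `patch_file` are hypothetical.

```python
from pathlib import Path

BAD_TRAILER = b"\xff\xff\xff\xff"    # value written incorrectly by the Kraken OSD
FIXED_TRAILER = b"\x00\x00\x00\x00"  # value suggested above


def zero_bad_namespace(value: bytes) -> bytes:
    """Return value with a trailing ff ff ff ff replaced by 00 00 00 00.

    Values that do not end in four 0xFF bytes are returned unchanged.
    """
    if len(value) < 4:
        raise ValueError("value too short to hold a 4-byte trailer")
    if value[-4:] != BAD_TRAILER:
        return value
    return value[:-4] + FIXED_TRAILER


def patch_file(src: str, dst: str) -> bool:
    """Read the dumped key value from src and write the patched copy to dst.

    Returns True if the trailer was changed. Writing to a separate file
    keeps the original dump as a backup.
    """
    original = Path(src).read_bytes()
    patched = zero_bad_namespace(original)
    Path(dst).write_bytes(patched)
    return patched != original
```

Only the last four bytes are touched, so the length of the value and everything before the trailer stay as they were.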

Actions #12

Updated by Benoit Loriot almost 7 years ago

Thanks Jason, snapshot deletion worked from a Jewel client.

Actions #13

Updated by Mykola Golub almost 7 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #14

Updated by Jason Dillaman almost 7 years ago

  • Copied to Backport #19833: kraken: Cannot delete some snapshots after upgrade from jewel to kraken added
Actions #15

Updated by Denis Horbunov almost 7 years ago

I'm sorry for being late. Thank you very much indeed Jason! This solved my problem too.

Actions #16

Updated by Ross Martyn over 6 years ago

Using an old client also fixed this issue for me. Glad this has been fixed in 11.2.1. Appreciate the info.

Actions #17

Updated by Jason Dillaman over 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #18

Updated by Lionel BEARD over 6 years ago

Hi,

FYI, we just hit exactly the same issue when upgrading from Kraken to Luminous (four 0xFF bytes).
It was fixed by using a Jewel RBD client (the Kraken RBD client crashed!) to remove the snapshot.

I think I probably did something wrong when upgrading.
Do I have to mark the OSDs out before upgrading them?
What are the requirements when upgrading clients? Should they be upgraded before or after the OSDs? Must the clients be stopped?

Actions #19

Updated by Jason Dillaman over 6 years ago

@Lionel: the bug was fixed in kraken 11.2.1 but if you had previously created any snapshots on kraken 11.2.0 using jewel or earlier clients, the snapshot header would have already been incorrectly written. If you can still repeat this issue on luminous when creating a new snapshot using a jewel client and attempting to delete the snapshot using a luminous client, let me know.

Actions #20

Updated by Ross Martyn almost 6 years ago

Just seen this issue again: as above, FFFF in the header info. I was unable to remove the snapshot with the Luminous rbd client (12.2.5), but was able to remove it with the Mimic client (13.2.0).

Actions #21

Updated by Jason Dillaman almost 6 years ago

@Ross: what version of Ceph is your cluster running and what version of Ceph is your client running when you created a snapshot w/ 0xFFFF at the end? Can you provide the hexdump for its data?
