Bug #6047

mon: Assert and monitor-crash when attempting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool

Added by Oliver Daudey over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Joao Eduardo Luis
Category:
Monitor
Target version:
-
% Done:
0%

Source:
Community (user)
Severity:
1 - critical

Description

While playing around on my test-cluster, I ran into a problem that I've seen before but had never been able to reproduce until now. The use of pool-snapshots and rbd-snapshots seems to be mutually exclusive within the same pool, even if you have used one type of snapshot before and have since deleted all snapshots of that type. Unfortunately, the condition doesn't appear to be handled gracefully yet, leading in one case to a monitor crash. I think this one goes back at least as far as Bobtail and still exists in Dumpling. My cluster is a straightforward one with three Debian Squeeze nodes, each running a mon, mds and osd. To reproduce:

  1. ceph osd pool create test 256 256
    pool 'test' created
  2. ceph osd pool mksnap test snapshot
    created pool test snap snapshot
  3. ceph osd pool rmsnap test snapshot
    removed pool test snap snapshot

So far, so good. Now we try to create an rbd-snapshot in the same pool:

  1. rbd --pool=test create --size=102400 image
  2. rbd --pool=test snap create image@snapshot
    rbd: failed to create snapshot: (22) Invalid argument
    2013-08-18 19:27:50.892291 7f983bc10780 -1 librbd: failed to create snap id: (22) Invalid argument
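
As far as I can tell, the EINVAL here is the snapshot-mode check failing cleanly in this direction: the earlier mksnap/rmsnap left the pool in pool-snaps mode, and that state is sticky even after the snapshot is removed. Below is a minimal sketch of mode bookkeeping that would produce exactly this behaviour; all names are hypothetical and this is not the actual Ceph source (the real logic lives around pg_pool_t in osd/osd_types.cc):

    // Hypothetical model of the observed behaviour: the first snapshot type
    // ever used on a pool fixes its "snap mode", and deleting all snapshots
    // of that type does not reset it.
    #include <cerrno>

    enum class SnapMode { None, PoolSnaps, Unmanaged };

    struct PoolSnapModel {
      SnapMode mode = SnapMode::None;
      unsigned snap_seq = 0;  // highest snapshot id ever issued

      // "ceph osd pool mksnap" path.
      int add_pool_snap() {
        if (mode == SnapMode::Unmanaged)
          return -EINVAL;            // a graceful refusal, mirroring below
        mode = SnapMode::PoolSnaps;  // sticky: rmsnap never resets this
        ++snap_seq;
        return 0;
      }

      // "rbd snap create" path: the monitor allocates a self-managed snap id.
      int allocate_unmanaged_snap(unsigned *snapid) {
        if (mode == SnapMode::PoolSnaps)
          return -EINVAL;            // the "(22) Invalid argument" seen above
        mode = SnapMode::Unmanaged;  // sticky as well
        *snapid = ++snap_seq;
        return 0;
      }
    };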

That failed, but at least the cluster is OK. Now we start over again and create the rbd-snapshot first:

  1. ceph osd pool delete test test --yes-i-really-really-mean-it
    pool 'test' deleted
  2. ceph osd pool create test 256 256
    pool 'test' created
  3. rbd --pool=test create --size=102400 image
  4. rbd --pool=test snap create image@snapshot
  5. rbd --pool=test snap ls image
    SNAPID NAME         SIZE
         2 snapshot 102400 MB
  6. rbd --pool=test snap rm image@snapshot
  7. ceph osd pool mksnap test snapshot
    2013-08-18 19:35:59.494551 7f48d75a1700 0 monclient: hunting for new mon
    ^CError EINTR: (I pressed CTRL-C)

My leader monitor crashed at that last command, here's the apparent critical point in the logs:

-3> 2013-08-18 19:35:59.315956 7f9b870b1700  1 -- 194.109.43.18:6789/0 <== client.5856 194.109.43.18:0/1030570 8 ==== mon_command({"snap": "snapshot", "prefix": "osd pool mksnap", "pool": "test"} v 0) v1 ==== 107+0+0 (1111983560 0 0) 0x23e4200 con 0x2d202c0
-2> 2013-08-18 19:35:59.316020 7f9b870b1700  0 mon.a@0(leader) e1 handle_command mon_command({"snap": "snapshot", "prefix": "osd pool mksnap", "pool": "test"} v 0) v1
-1> 2013-08-18 19:35:59.316033 7f9b870b1700  1 mon.a@0(leader).paxos(paxos active c 1190049..1190629) is_readable now=2013-08-18 19:35:59.316034 lease_expire=2013-08-18 19:36:03.535809 has v0 lc 1190629
 0> 2013-08-18 19:35:59.317612 7f9b870b1700 -1 osd/osd_types.cc: In function 'void pg_pool_t::add_snap(const char*, utime_t)' thread 7f9b870b1700 time 2013-08-18 19:35:59.316102
osd/osd_types.cc: 682: FAILED assert(!is_unmanaged_snaps_mode())
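
The crash, then, looks like the same mutual-exclusion check taken in the opposite direction, except that here it is enforced with an assert inside pg_pool_t::add_snap() rather than validated in the command handler, so a purely user-triggerable condition aborts the leader monitor. A simplified sketch of the two failure styles (hypothetical names, not the actual Ceph source):

    #include <cassert>
    #include <cerrno>

    struct pool_model {
      bool unmanaged_snaps_used = false;  // true once an rbd snap was created

      // What the backtrace shows: the precondition is an assert deep in the
      // type, reached only after the command was already accepted.
      void add_snap_asserting(const char *name) {
        (void)name;
        assert(!unmanaged_snaps_used);  // FAILED assert => monitor aborts
        // ... record the pool snapshot ...
      }

      // The graceful alternative: check in the handler and return EINVAL to
      // the client, as the rbd path above already does.
      int add_snap_checked(const char *name) {
        (void)name;
        if (unmanaged_snaps_used)
          return -EINVAL;
        // ... record the pool snapshot ...
        return 0;
      }
    };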

Related issues 1 (0 open, 1 closed)

Related to Ceph - Fix #4635: mon: many ops expose uncommitted state (Resolved, Joao Eduardo Luis, 04/02/2013)

Actions #1

Updated by Joao Eduardo Luis over 10 years ago

  • Subject changed from Assert and monitor-crash when attempting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool to mon: Assert and monitor-crash when attempting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool
  • Status changed from New to 12
  • Assignee set to Joao Eduardo Luis
  • Priority changed from High to Urgent

This is pretty much the same as #5959, which was reported on Cuttlefish and which we believed to have been fixed by commit d1501938f5d07c067d908501fc5cfe3c857d7281.

It appears that this is not in fact fixed, and it's amazingly easy to reproduce on 0.67.1 following your instructions. Thanks!

Actions #2

Updated by Joao Eduardo Luis over 10 years ago

  • Status changed from 12 to In Progress
Actions #3

Updated by Joao Eduardo Luis over 10 years ago

  • Status changed from In Progress to Pending Backport
Actions #4

Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved
  • Source changed from other to Q/A
Actions #5

Updated by Sage Weil over 10 years ago

  • Source changed from Q/A to Community (user)