Bug #11228

closed

Multiple monitors are crashing after pool rename

Added by karan singh about 9 years ago. Updated about 9 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
ceph monitor crashing
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi Developers,

Recently I performed some pool operations: copying, renaming, and deleting pools.
I wanted to reduce pg_num for a pool, and the only way to do this is to delete and recreate the pool, so that is what I did:

ceph osd pool create glance-devel-bkp 128 128
rados cppool glance-devel glance-devel-bkp
ceph osd pool rename glance-devel glance-devel-old
ceph osd pool rename glance-devel-bkp glance-devel
ceph osd pool delete glance-devel-old glance-devel-old --yes-i-really-really-mean-it
  • I did the same for the glance-test pool. During the pool rename and pool deletion, the RBD clients (glance and cinder) were active.

Just after these pool operations, 2 of the 3 monitors crashed. I tried restarting the monitors several times, but they crash again after about 40 seconds. Since 2 out of 3 monitors are down, I cannot connect to the cluster.

  • Sage assisted me with this issue on IRC and suggested that "firefly RBD clients identify the pool by name in some cases (instead of id)".
  • Attached are the monitor logs with debug mon = 20 and debug ms = 1.
# ceph osd lspools
1 metadata,7 cinder-devel,8 cinder-test,9 cinder-production,10 glance-production,11 glance-devel,12 glance-test,
#
# ceph -s
    cluster 98d89661-f616-49eb-9ccf-84d720e179c0
     health HEALTH_WARN 3025 pgs degraded; 914 pgs peering; 28 pgs stale; 890 pgs stuck inactive; 3694 pgs stuck unclean; recovery 53/489 objects degraded (10.838%); mds storage0101 is laggy; 19/110 in osds are down; nodown,noout,norecover flag(s) set; 1 mons down, quorum 1,2 storage0105,storage0110
     monmap e4: 3 mons at {storage0101=X.X.X.X:6789/0,storage0105=X.X.X.X:6789/0,storage0110=X.X.X.X:6789/0}, election epoch 216, quorum 1,2 storage0105,storage0110
     mdsmap e21: 1/1/1 up {0=storage0101=up:active(laggy or crashed)}
     osdmap e1007: 110 osds: 91 up, 110 in
            flags nodown,noout,norecover
      pgmap v110567: 8768 pgs, 7 pools, 1033 MB data, 163 objects
            55727 MB used, 297 TB / 298 TB avail
            53/489 objects degraded (10.838%)
                  37 inactive
                  15 stale+active+clean
                   1 degraded+remapped
                 830 peering
                  89 active+degraded+remapped
                4293 active+clean
                   7 stale+active+degraded
                   3 stale+active+remapped
                  11 degraded
                2917 active+degraded
                  81 remapped+peering
                 481 active+remapped
                   3 stale+peering

Output of ceph osd dump -f json-pretty: http://pastebin.com/rHqQLfEe

Monitor log: https://www.dropbox.com/s/akaw02rm5gvix6k/ceph-mon.storage0101.log?dl=1

Actions #1

Updated by Sage Weil about 9 years ago

  • Assignee set to Sage Weil
Actions #2

Updated by Sage Weil about 9 years ago

  • Status changed from New to Need More Info
  • Priority changed from Immediate to Urgent

Still puzzled by this. The code is:

  case POOL_OP_DELETE_UNMANAGED_SNAP:
    if (!pp.is_removed_snap(m->snapid)) {
      pp.remove_unmanaged_snap(m->snapid);
      changed = true;
    }
    break;

and AFAICS the pg_pool_t methods are doing the right thing. What I can't tell is what m->snapid is.

I assume this is no longer happening?

Actions #3

Updated by karan singh about 9 years ago

Hi Sage

This problem didn't recur after I restarted the cinder and glance services.

But I'm not sure whether it's permanently fixed or not.

Actions #4

Updated by Sage Weil about 9 years ago

  • Priority changed from Urgent to High
Actions #5

Updated by Sage Weil about 9 years ago

  • Status changed from Need More Info to Can't reproduce
Actions
