Bug #11228
Multiple monitors are crashing after pool rename
Status: Closed
Description
Hi Developers
Recently I performed some pool operations: copying, renaming and deleting pools.
I did this because I wanted to reduce pg_num for a pool, and the only way to do that is to delete and recreate the pool, so that is what I did:
ceph osd pool create glance-devel-bkp 128 128
rados cppool glance-devel glance-devel-bkp
ceph osd pool rename glance-devel glance-devel-old
ceph osd pool rename glance-devel-bkp glance-devel
ceph osd pool delete glance-devel-old glance-devel-old --yes-i-really-really-mean-it
- I did the same for the glance-test pool. During the pool renames and deletions, the RBD clients, i.e. glance and cinder, were active.
Just after performing the pool operations, 2 of the 3 monitors crashed. I tried restarting the monitors several times, but they keep crashing after about 40 seconds. Since 2 out of 3 monitors are down, I cannot connect to the cluster.
- Sage assisted me with this issue on IRC and suggested that "firefly RBD clients identify the pool by name in some cases (instead of id)".
- Attached are the monitor logs with debug mon = 20 and debug ms = 1.
# ceph osd lspools
1 metadata,7 cinder-devel,8 cinder-test,9 cinder-production,10 glance-production,11 glance-devel,12 glance-test,
#
# ceph -s
    cluster 98d89661-f616-49eb-9ccf-84d720e179c0
     health HEALTH_WARN 3025 pgs degraded; 914 pgs peering; 28 pgs stale; 890 pgs stuck inactive; 3694 pgs stuck unclean; recovery 53/489 objects degraded (10.838%); mds storage0101 is laggy; 19/110 in osds are down; nodown,noout,norecover flag(s) set; 1 mons down, quorum 1,2 storage0105,storage0110
     monmap e4: 3 mons at {storage0101=X.X.X.X:6789/0,storage0105=X.X.X.X:6789/0,storage0110=X.X.X.X:6789/0}, election epoch 216, quorum 1,2 storage0105,storage0110
     mdsmap e21: 1/1/1 up {0=storage0101=up:active(laggy or crashed)}
     osdmap e1007: 110 osds: 91 up, 110 in
            flags nodown,noout,norecover
      pgmap v110567: 8768 pgs, 7 pools, 1033 MB data, 163 objects
            55727 MB used, 297 TB / 298 TB avail
            53/489 objects degraded (10.838%)
                  37 inactive
                  15 stale+active+clean
                   1 degraded+remapped
                 830 peering
                  89 active+degraded+remapped
                4293 active+clean
                   7 stale+active+degraded
                   3 stale+active+remapped
                  11 degraded
                2917 active+degraded
                  81 remapped+peering
                 481 active+remapped
                   3 stale+peering
Output of ceph osd dump -f json-pretty: http://pastebin.com/rHqQLfEe
Monitor log: https://www.dropbox.com/s/akaw02rm5gvix6k/ceph-mon.storage0101.log?dl=1
Updated by Sage Weil about 9 years ago
- Status changed from New to Need More Info
- Priority changed from Immediate to Urgent
Still puzzled by this. The code is:
case POOL_OP_DELETE_UNMANAGED_SNAP:
  if (!pp.is_removed_snap(m->snapid)) {
    pp.remove_unmanaged_snap(m->snapid);
    changed = true;
  }
  break;
and as far as I can see the pg_pool_t methods are doing the right thing. What I can't tell is what m->snapid is.
I assume this is no longer happening?
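For readers following along, below is a minimal, self-contained model of the handler quoted above. It is an illustrative sketch, not the Ceph source: the real pg_pool_t keeps removed snaps in an interval set, and the names used here (pg_pool_t_model, removed_snaps, snap_seq) are simplified stand-ins. It only shows that the handler records m->snapid as removed; what value the (firefly) client actually sent after the rename is the open question.

// Simplified model of the snap bookkeeping discussed above; not the Ceph source.
#include <cstdint>
#include <iostream>
#include <set>

using snapid_t = uint64_t;

struct pg_pool_t_model {
  // The real pg_pool_t uses an interval_set; a std::set is enough for this sketch.
  std::set<snapid_t> removed_snaps;
  snapid_t snap_seq = 0;

  bool is_removed_snap(snapid_t s) const {
    return removed_snaps.count(s) > 0;
  }

  void remove_unmanaged_snap(snapid_t s) {
    removed_snaps.insert(s);
    if (s > snap_seq)
      snap_seq = s;  // keep the snap sequence monotonic in this model
  }
};

int main() {
  pg_pool_t_model pp;
  snapid_t snapid = 4;  // stands in for m->snapid from the client request

  // The case handler quoted above, inlined:
  bool changed = false;
  if (!pp.is_removed_snap(snapid)) {
    pp.remove_unmanaged_snap(snapid);
    changed = true;
  }

  std::cout << "changed=" << changed << " snap_seq=" << pp.snap_seq << "\n";
  return 0;
}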
Updated by karan singh about 9 years ago
Hi Sage
The problem did not come back after I restarted the cinder and glance services.
But I am not sure whether it is permanently fixed or not.
Updated by Sage Weil about 9 years ago
- Status changed from Need More Info to Can't reproduce