Bug #12047

monitor segmentation fault on faulty crushmap

Added by Jonas Weismüller over 7 years ago. Updated over 7 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
I accidentally removed a root bucket with "ceph osd crush remove platter" (root=platter). The platter bucket was still referenced by a ruleset, and the ruleset was still in use by a pool. My issue looks similar to #9485.

Outline of the crush map:

root platter {
        id -1           # do not change unnecessarily
        # weight 12.000
        alg straw
        hash 0  # rjenkins1
        item platter-rack1 weight 4.000
        item platter-rack2 weight 4.000
        item platter-rack3 weight 4.000
}

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take platter
        step chooseleaf firstn 0 type rack
        step emit
}
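A bucket that a rule still takes should not be removed in the first place. One way to catch this up front is to decompile the crush map and grep for references before removing the bucket; a minimal sketch, using a stand-in file in place of the real `crushtool --decompile` output:

```shell
# Stand-in for the decompiled map; on a live cluster you would run:
#   ceph osd getcrushmap -o /tmp/crushmap
#   crushtool --decompile /tmp/crushmap --outfn /tmp/crush.txt
cat > /tmp/crush.txt <<'EOF'
rule replicated_ruleset {
        ruleset 0
        step take platter
        step chooseleaf firstn 0 type rack
        step emit
}
EOF

# Refuse to remove a bucket that any rule still takes.
bucket=platter
if grep -q "step take ${bucket}\$" /tmp/crush.txt; then
    echo "bucket ${bucket} is still referenced by a rule; not removing"
fi
```

The same grep could be wrapped around any `ceph osd crush remove` invocation as a guard.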

So far I have not been able to get the monitors up and running again.

mon_crash_20150617.log (3.93 KB) Jonas Weismüller, 06/17/2015 06:54 AM

store.db.tar.bz2 (865 KB) Jonas Weismüller, 06/30/2015 12:46 PM


Related issues

Related to Ceph - Feature #12193: OSD's are not updating osdmap properly after monitoring crash Resolved 07/01/2015
Duplicates Ceph - Bug #11680: mon crashes when "ceph osd tree 85 --format json" Can't reproduce 05/19/2015

History

#1 Updated by Kefu Chai over 7 years ago

  • Status changed from New to Duplicate

Jonas, it should be addressed by the latest master branch, and the fix will be backported to the next hammer release.

#2 Updated by Jonas Weismüller over 7 years ago

Hi Kefu,
thanks a lot for the information. Is there any chance to manually recover the cluster?

Or will the new release fix it automatically?

I am also willing to test the new fix. Is there already a pre-built deb package to test with? Or do I have to manually build it?

Cheers Jonas

#3 Updated by Kefu Chai over 7 years ago

Jonas Weismüller wrote:

Hi Kefu,
thanks a lot for the information. Is there any chance to manually recover the cluster?

i'd say "yes, but it's non-trivial". Is your cluster in production?

Or will the new release fix it automatically?

see #11815, we plan to have a CLI tool to help users fix it manually, and I am working on it. you can watch that ticket if you want to keep yourself updated.

I am also willing to test the new fix. Is there already a pre-built deb package to test with? Or do I have to manually build it?

The fix i mentioned in #12047-1 will prevent you from injecting a faulty crush map, but your monitors won't come alive with it.


#4 Updated by Jonas Weismüller over 7 years ago

Luckily it is not a production cluster but a development cluster. Still, I would prefer to get the cluster fixed and up and running again.

So could I be an alpha tester of your CLI tool? I would really love to offer my help. The only thing is that I will need my development cluster for further testing fairly soon, so it depends on how long the alpha testing of your tool would take.

#5 Updated by Kefu Chai over 7 years ago

Jonas Weismüller wrote:

So could I be an alpha tester of your CLI tool? I would really love to offer my help.

jonas, sorry to get back to you late. thanks! i pulled together a script at https://github.com/ceph/ceph/pull/5052 .

but this patch also includes a change in a CLI tool, so you will need to recompile it or download the latest package from one of our gitbuilders. i will update you once the package is built.

The only thing is that I will need my development cluster for further testing fairly soon, so it depends on how long the alpha testing of your tool would take.

in minutes =)

#6 Updated by Jonas Weismüller over 7 years ago

I can wait for the package. Thanks a lot so far.

#7 Updated by Kefu Chai over 7 years ago

jonas, the packages are ready. may i learn what your distro/release/arch is?

the repo addresses for the packages under testing differ from distro to distro. for example:

ubuntu precise amd64: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/sha1/b8795313b2d0028f1e71b584bb28e014b5e06984/pool/main/c/ceph/

please note, the patch is still under testing. it seems there is a chance the mon store does not get fixed. i am looking into it.

#8 Updated by Jonas Weismüller over 7 years ago

# lsb_release -a
No LSB modules are available.
Distributor ID:    Debian
Description:    Debian GNU/Linux 7.8 (wheezy)
Release:    7.8
Codename:    wheezy
# uname -r
3.2.0-4-amd64

#10 Updated by Kefu Chai over 7 years ago

the package is ready. you need to install ceph-test_9.0.1-1080-g0e78e54-1wheezy_amd64.deb in the repo.

and /usr/lib/ceph/ceph-monstore-update-crush.sh will help you with the broken mon store.

#11 Updated by Jonas Weismüller over 7 years ago

Which additional ceph packages do I have to update as well?

Can you give me a short hint, which commands to execute:

I tried the following, which ends up in a loop:

# /usr/lib/ceph/ceph-monstore-update-crush.sh --rewrite /var/lib/ceph/mon/ceph-vs64/ 
no action specified; -h for help
no action specified; -h for help
no action specified; -h for help
no action specified; -h for help
no action specified; -h for help

#12 Updated by Kefu Chai over 7 years ago

no action specified

this is printed by crushtool, which recently received some updates enabling it to check the completeness of a crush map.

it is packaged in "ceph". the most recent build with my fix can be found at http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/4f4c8fe40e3912cb895d5f9cd4d40b085b3e12ab/pool/main/c/ceph/ . you can extract crushtool from the debian package using dpkg-deb without the risk of installing it.
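For anyone unfamiliar with `dpkg-deb -x`: it unpacks a package into a directory without registering anything with the system. A self-contained sketch; it builds a throwaway toy package so the example runs without the real ceph .deb, but with the real download you would point `-x` at the ceph_*.deb instead:

```shell
# Build a throwaway package purely for demonstration; all field values are toy data.
mkdir -p pkg/DEBIAN pkg/usr/bin
cat > pkg/DEBIAN/control <<'EOF'
Package: toytool
Version: 1.0
Architecture: all
Maintainer: nobody <nobody@example.com>
Description: demo package
EOF
printf '#!/bin/sh\necho toytool\n' > pkg/usr/bin/toytool
chmod +x pkg/usr/bin/toytool
dpkg-deb --build pkg toytool.deb

# Extract without installing; the binary is then usable in place.
# For the real case:
#   dpkg-deb -x ceph_9.0.1-1080-g0e78e54-1wheezy_amd64.deb ./extracted
#   ./extracted/usr/bin/crushtool --help
dpkg-deb -x toytool.deb extracted
extracted/usr/bin/toytool
```

Nothing here touches the dpkg database, so the system's installed ceph packages stay untouched.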

#13 Updated by Jonas Weismüller over 7 years ago

The script does not revert to a previous monmap:

# /usr/lib/ceph/ceph-monstore-update-crush.sh --mon-store /var/lib/ceph/mon/ceph-vs64/ --rewrite
good crush map found at epoch 1547/1547
and mon store has no faulty crush maps.

Attached you will find the store.db folder as a tar.bz2 file.

#14 Updated by Jonas Weismüller over 7 years ago

I downloaded:
http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/0e78e54e07e637fa458adaae6165b7259c79e089/pool/main/c/ceph/ceph_9.0.1-1080-g0e78e54-1wheezy_amd64.deb
http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/0e78e54e07e637fa458adaae6165b7259c79e089/pool/main/c/ceph/ceph-test_9.0.1-1080-g0e78e54-1wheezy_amd64.deb

Then I replaced crushtool, osdmaptool and ceph-monstore-tool in /usr/bin on the leader monitor. I extracted the crushmap and fixed the faulty references to non-existing buckets (see diff).

# ceph-monstore-tool /var/lib/ceph/mon/ceph-vs64/store.db/ get osdmap -- -v 1547 -o /tmp/osdmap.1547
# ceph-monstore-tool /var/lib/ceph/mon/ceph-vs64/ get osdmap -- -v 1547 -o /tmp/osdmap.1547
# osdmaptool --export-crush /tmp/crush.1547 /tmp/osdmap.1547
# crushtool --decompile /tmp/crush.1547 --outfn /tmp/crush.1547.src_broken
# cp /tmp/crush.1547.src_broken /tmp/crush.1547.src
# diff /tmp/crush.1547.src_broken /tmp/crush.1547.src

148c148
<       step take bucket0
---
>       step take default
159c159
<       step take bucket0
---
>       step take default
168c168
<       step take bucket15
---
>       step take default
177c177
<       step take bucket15
---
>       step take default
180c180
<       step take bucket0
---
>       step take default
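
The hand edit captured by this diff can also be scripted. A minimal sketch, assuming (as in this cluster) that `default` is the surviving root the orphaned rules should take, and using a stand-in file in place of the real decompiled map:

```shell
# Stand-in for crushtool --decompile output containing the orphaned
# bucket0/bucket15 references the monitor left behind.
cat > /tmp/crush.src <<'EOF'
rule replicated_ruleset {
        ruleset 0
        step take bucket0
        step chooseleaf firstn 0 type rack
        step emit
}
EOF

# Retarget every orphaned "step take bucketN" at the surviving root.
sed -i 's/step take bucket[0-9]*/step take default/' /tmp/crush.src
grep 'step take' /tmp/crush.src
```

After the rewrite, the file can be recompiled with `crushtool -c` exactly as in the commands below.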

I recompiled the crushmap and injected it back to the leader monitor.

# crushtool -c /tmp/crush.1547.src -o /tmp/crush.1547.src_compiled
# /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-vs64/ rewrite-crush -- --crush /tmp/crush.1547.src_compiled --good-epoch 1546

Then I restarted all other monitors and ended up in a "healthy" monitor state:

     cluster 76632a36-a0fe-4a3a-b8c0-b08cec03efd0
      health HEALTH_WARN
             103 pgs degraded
             16 pgs down
             42 pgs peering
             223 pgs stale
             103 pgs stuck degraded
             60 pgs stuck inactive
             223 pgs stuck stale
             512 pgs stuck unclean
             103 pgs stuck undersized
             103 pgs undersized
             recovery 1429/12927 objects degraded (11.054%)
             recovery 10423/12927 objects misplaced (80.630%)
             too few PGs per OSD (21 < min 30)
      monmap e1: 5 mons at {vs64=10.1.128.6:6789/0,vs65=10.1.128.7:6789/0,vs66=10.1.128.8:6789/0,vs67=10.1.128.9:6789
             election epoch 224, quorum 0,1,2,3,4 vs64,vs65,vs66,vs67,vs68
      osdmap e1548: 24 osds: 24 up, 24 in; 289 remapped pgs
       pgmap v41595: 512 pgs, 1 pools, 4309 MB data, 4309 objects
             14193 MB used, 2382 GB / 2396 GB avail
             1429/12927 objects degraded (11.054%)
             10423/12927 objects misplaced (80.630%)
                  258 active+remapped
                   91 stale+active+remapped
                   58 stale+active+undersized+degraded+remapped
                   31 active+undersized+degraded+remapped
                   26 stale+remapped+peering
                   18 stale+remapped
                   14 stale+down+remapped+peering
                   14 stale+active+undersized+degraded
                    2 stale+down+peering

#15 Updated by Jonas Weismüller over 7 years ago

Now I have the problem that the PGs are not recovering. This is probably because approximately half of the OSDs are crashing right now:

2015-06-30 13:43:00.878364 7f9feb23d840  0 osd.5 1547 load_pgs
2015-06-30 13:43:01.076863 7f9feb23d840  0 osd.5 1547 load_pgs opened 43 pgs
2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
2015-06-30 13:43:01.086179 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:01.089618 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -31> 2015-06-30 13:43:00.127913 7f9feb23d840  5 asok(0x50be000) register_command perfcounters_dump hook 0x5082050
   -30> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command 1 hook 0x5082050
   -29> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf dump hook 0x5082050
   -28> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perfcounters_schema hook 0x5082050
   -27> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command 2 hook 0x5082050
   -26> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf schema hook 0x5082050
   -25> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf reset hook 0x5082050
   -24> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command config show hook 0x5082050
   -23> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command config set hook 0x5082050
   -22> 2015-06-30 13:43:00.128058 7f9feb23d840  5 asok(0x50be000) register_command config get hook 0x5082050
   -21> 2015-06-30 13:43:00.128060 7f9feb23d840  5 asok(0x50be000) register_command config diff hook 0x5082050
   -20> 2015-06-30 13:43:00.128067 7f9feb23d840  5 asok(0x50be000) register_command log flush hook 0x5082050
   -19> 2015-06-30 13:43:00.128067 7f9feb23d840  5 asok(0x50be000) register_command log dump hook 0x5082050
   -18> 2015-06-30 13:43:00.128085 7f9feb23d840  5 asok(0x50be000) register_command log reopen hook 0x5082050
   -17> 2015-06-30 13:43:00.130445 7f9feb23d840  0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 2947
   -16> 2015-06-30 13:43:00.132297 7f9feb23d840  1 finished global_init_daemonize
   -15> 2015-06-30 13:43:00.152866 7f9feb23d840  0 filestore(/var/lib/ceph/osd/ceph-5) backend xfs (magic 0x58465342)
   -14> 2015-06-30 13:43:00.244827 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is supported and appears to work
   -13> 2015-06-30 13:43:00.244853 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -12> 2015-06-30 13:43:00.604742 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: syscall(SYS_syncfs, fd) fully supported
   -11> 2015-06-30 13:43:00.604855 7f9feb23d840  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: disabling extsize, kernel 3.2.0-4-amd64 is older than 3.5 and has buggy extsize ioctl
   -10> 2015-06-30 13:43:00.714798 7f9feb23d840  0 filestore(/var/lib/ceph/osd/ceph-5) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
    -9> 2015-06-30 13:43:00.869904 7f9feb23d840  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
    -8> 2015-06-30 13:43:00.877680 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for clients
    -7> 2015-06-30 13:43:00.878332 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
    -6> 2015-06-30 13:43:00.878332 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for osds
    -5> 2015-06-30 13:43:00.878364 7f9feb23d840  0 osd.5 1547 load_pgs
    -4> 2015-06-30 13:43:01.076863 7f9feb23d840  0 osd.5 1547 load_pgs opened 43 pgs
    -3> 2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
    -2> 2015-06-30 13:43:01.086179 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
    -1> 2015-06-30 13:43:01.089618 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
     0> 2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#16 Updated by Jonas Weismüller over 7 years ago

I also rebooted all of the OSD nodes, but half of them are still crashing (DOWN).

# ceph osd tree
ID  WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-22 24.00000 root default                                         
-28  8.00000     rack rack1                                       
-23  4.00000         host vs72                                    
 23  1.00000             osd.23      up  1.00000          1.00000 
 22  1.00000             osd.22    down        0          1.00000 
 20  1.00000             osd.20    down        0          1.00000 
 21  1.00000             osd.21    down        0          1.00000 
-21  4.00000         host vs73                                    
 12  1.00000             osd.12      up  1.00000          1.00000 
  0  1.00000             osd.0     down        0          1.00000 
  8  1.00000             osd.8       up  1.00000          1.00000 
  1  1.00000             osd.1     down        0          1.00000 
-29  8.00000     rack rack2                                       
-24  4.00000         host vs74                                    
 10  1.00000             osd.10      up  1.00000          1.00000 
 13  1.00000             osd.13      up  1.00000          1.00000 
  7  1.00000             osd.7     down        0          1.00000 
  3  1.00000             osd.3     down        0          1.00000 
-25  4.00000         host vs75                                    
 14  1.00000             osd.14      up  1.00000          1.00000 
  9  1.00000             osd.9       up  1.00000          1.00000 
  6  1.00000             osd.6     down        0          1.00000 
  4  1.00000             osd.4     down        0          1.00000 
-30  8.00000     rack rack3                                       
-26  4.00000         host vs76                                    
 15  1.00000             osd.15      up  1.00000          1.00000 
 11  1.00000             osd.11      up  1.00000          1.00000 
  5  1.00000             osd.5     down        0          1.00000 
  2  1.00000             osd.2     down        0          1.00000 
-27  4.00000         host vs77                                    
 19  1.00000             osd.19      up  1.00000          1.00000 
 18  1.00000             osd.18      up  1.00000          1.00000 
 16  1.00000             osd.16    down        0          1.00000 
 17  1.00000             osd.17    down        0          1.00000 

#17 Updated by Jonas Weismüller over 7 years ago

Opened #12193 to track the osdmap issue further.

#18 Updated by Kefu Chai over 7 years ago

Jonas Weismüller wrote:

Hi,
I accidentally removed a root bucket with "ceph osd crush remove platter" (root=platter). The platter bucket was still referenced by a ruleset, and the ruleset was still in use by a pool. My issue looks similar to #9485.

weird, the monitor fails this command for me:

ceph osd crush rm default
Error EBUSY: (16) Device or resource busy

above is the output from my vstart cluster. i removed all descendants of the "default" bucket before issuing this command.
