Bug #11680
mon crashes when "ceph osd tree 85 --format json"
Status: Closed
Description
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mon() [0x9017e2]
 2: (()+0xf130) [0x7f30fab97130]
 3: (gsignal()+0x37) [0x7f30f95b15d7]
 4: (abort()+0x148) [0x7f30f95b2cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f30f9eb59b5]
 6: (()+0x5e926) [0x7f30f9eb3926]
 7: (()+0x5e953) [0x7f30f9eb3953]
 8: (()+0x5eb73) [0x7f30f9eb3b73]
 9: (std::__throw_logic_error(char const*)+0x77) [0x7f30f9f08717]
 10: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0xa1) [0x7f30f9f14561]
 11: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x38) [0x7f30f9f14918]
 12: (CrushTreeDumper::dump_item_fields(CrushWrapper const*, CrushTreeDumper::Item const&, ceph::Formatter*)+0xb9) [0x6175c9]
 13: (OSDMap::print_tree(std::ostream*, ceph::Formatter*) const+0x10d9) [0x787ab9]
 14: (OSDMonitor::preprocess_command(MMonCommand*)+0xe55) [0x60a0b5]
 15: (OSDMonitor::preprocess_query(PaxosServiceMessage*)+0x20b) [0x60f75b]
 16: (PaxosService::dispatch(PaxosServiceMessage*)+0x833) [0x5cacd3]
 17: (Monitor::handle_command(MMonCommand*)+0x147c) [0x591a4c]
 18: (Monitor::dispatch(MonSession*, Message*, bool)+0xf9) [0x594c89]
 19: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x595936]
 20: (Monitor::ms_dispatch(Message*)+0x23) [0x5b5403]
 21: (DispatchQueue::entry()+0x64a) [0x8a1d9a]
 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x79bd9d]
 23: (()+0x7df5) [0x7f30fab8fdf5]
 24: (clone()+0x6d) [0x7f30f96721ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The most recent log messages:
-25> 2015-05-18 06:20:09.143639 7f09817a0700 10 log_client log_queue is 5 last_log 24 sent 23 num 5 unsent 1 sending 1
-24> 2015-05-18 06:20:09.143642 7f09817a0700 10 log_client will send 2015-05-18 06:20:09.143637 mon.0 10.8.128.33:6789/0 24 : audit [DBG] from='client.? 10.8.128.33:0/1004078' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
-23> 2015-05-18 06:20:09.143652 7f09817a0700 1 -- 10.8.128.33:6789/0 --> 10.8.128.33:6789/0 -- log(1 entries) v1 -- ?+0 0x47f2400 con 0x4609600
-22> 2015-05-18 06:20:09.143795 7f09817a0700 1 -- 10.8.128.33:6789/0 --> 10.8.128.33:0/1004078 -- mon_command_ack([{"prefix": "status", "format": "json"}]=0 v0) v1 -- ?+1108 0x466b000 con 0x460aaa0
-21> 2015-05-18 06:20:09.143817 7f09817a0700 1 -- 10.8.128.33:6789/0 <== mon.0 10.8.128.33:6789/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x47f2400 con 0x4609600
-20> 2015-05-18 06:20:09.143827 7f09817a0700 5 mon.Mon@0(leader).paxos(paxos active c 78062..78631) is_readable = 1 - now=2015-05-18 06:20:09.143828 lease_expire=0.000000 has v0 lc 78631
-19> 2015-05-18 06:20:09.145454 7f09817a0700 1 -- 10.8.128.33:6789/0 <== client.184098 10.8.128.33:0/1004078 8 ==== mon_command({"prefix": "osd dump", "format": "json"} v 0) v1 ==== 82+0+0 (3917390200 0 0) 0x466a000 con 0x460aaa0
-18> 2015-05-18 06:20:09.145494 7f09817a0700 0 mon.Mon@0(leader) e1 handle_command mon_command({"prefix": "osd dump", "format": "json"} v 0) v1
-17> 2015-05-18 06:20:09.145520 7f09817a0700 0 log_channel(audit) log [DBG] : from='client.? 10.8.128.33:0/1004078' entity='client.admin' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
-16> 2015-05-18 06:20:09.145522 7f09817a0700 10 log_client _send_to_monlog to self
-15> 2015-05-18 06:20:09.145523 7f09817a0700 10 log_client log_queue is 6 last_log 25 sent 24 num 6 unsent 1 sending 1
-14> 2015-05-18 06:20:09.145527 7f09817a0700 10 log_client will send 2015-05-18 06:20:09.145522 mon.0 10.8.128.33:6789/0 25 : audit [DBG] from='client.? 10.8.128.33:0/1004078' entity='client.admin' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
-13> 2015-05-18 06:20:09.145535 7f09817a0700 1 -- 10.8.128.33:6789/0 --> 10.8.128.33:6789/0 -- log(1 entries) v1 -- ?+0 0x47f6540 con 0x4609600
-12> 2015-05-18 06:20:09.145546 7f09817a0700 5 mon.Mon@0(leader).paxos(paxos active c 78062..78631) is_readable = 1 - now=2015-05-18 06:20:09.145547 lease_expire=0.000000 has v0 lc 78631
-11> 2015-05-18 06:20:09.145779 7f09817a0700 1 -- 10.8.128.33:6789/0 --> 10.8.128.33:0/1004078 -- mon_command_ack([{"prefix": "osd dump", "format": "json"}]=0 v85) v1 -- ?+3436 0x4669800 con 0x460aaa0
-10> 2015-05-18 06:20:09.145805 7f09817a0700 1 -- 10.8.128.33:6789/0 <== mon.0 10.8.128.33:6789/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x47f6540 con 0x4609600
-9> 2015-05-18 06:20:09.145814 7f09817a0700 5 mon.Mon@0(leader).paxos(paxos active c 78062..78631) is_readable = 1 - now=2015-05-18 06:20:09.145815 lease_expire=0.000000 has v0 lc 78631
-8> 2015-05-18 06:20:09.147249 7f09817a0700 1 -- 10.8.128.33:6789/0 <== client.184098 10.8.128.33:0/1004078 9 ==== mon_command({"prefix": "osd tree", "epoch": 85, "format": "json"} v 0) v1 ==== 95+0+0 (942654298 0 0) 0x466b800 con 0x460aaa0
-7> 2015-05-18 06:20:09.147309 7f09817a0700 0 mon.Mon@0(leader) e1 handle_command mon_command({"prefix": "osd tree", "epoch": 85, "format": "json"} v 0) v1
-6> 2015-05-18 06:20:09.147337 7f09817a0700 0 log_channel(audit) log [DBG] : from='client.? 10.8.128.33:0/1004078' entity='client.admin' cmd=[{"prefix": "osd tree", "epoch": 85, "format": "json"}]: dispatch
-5> 2015-05-18 06:20:09.147343 7f09817a0700 10 log_client _send_to_monlog to self
-4> 2015-05-18 06:20:09.147345 7f09817a0700 10 log_client log_queue is 7 last_log 26 sent 25 num 7 unsent 1 sending 1
-3> 2015-05-18 06:20:09.147347 7f09817a0700 10 log_client will send 2015-05-18 06:20:09.147343 mon.0 10.8.128.33:6789/0 26 : audit [DBG] from='client.? 10.8.128.33:0/1004078' entity='client.admin' cmd=[{"prefix": "osd tree", "epoch": 85, "format": "json"}]: dispatch
-2> 2015-05-18 06:20:09.147355 7f09817a0700 1 -- 10.8.128.33:6789/0 --> 10.8.128.33:6789/0 -- log(1 entries) v1 -- ?+0 0x47f3cc0 con 0x4609600
-1> 2015-05-18 06:20:09.147365 7f09817a0700 5 mon.Mon@0(leader).paxos(paxos active c 78062..78631) is_readable = 1 - now=2015-05-18 06:20:09.147366 lease_expire=0.000000 has v0 lc 78631
0> 2015-05-18 06:20:09.149437 7f09817a0700 -1 *** Caught signal (Aborted) ** in thread 7f09817a0700
Originally reported at https://bugzilla.redhat.com/show_bug.cgi?id=1222509.
Files
Updated by Kefu Chai almost 9 years ago
- File file_1222509.log.txt file_1222509.log.txt added
- File ceph-mon.tgz ceph-mon.tgz added
Updated by Kefu Chai almost 9 years ago
And I am able to dump the osd map using:
ceph-monstore-tool /var/lib/ceph/mon/ceph-Mon get osdmap -- --out /tmp/osdmap.85
osdmaptool /tmp/osdmap.85 --print --tree --dump-json
the output is attached:
epoch 85
fsid 283bbd8b-9205-4cfc-83fa-80546ba2fe36
created 2015-05-06 15:39:42.874568
modified 2015-05-18 06:20:07.885802
flags 
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 4
osd.0 down in weight 1 up_from 78 up_thru 78 down_at 85 last_clean_interval [46,72) 10.8.128.40:6800/25974 10.8.128.40:6804/25974 10.8.128.40:6808/25974 10.8.128.40:6809/25974 exists 4d27fc96-3a2f-478f-bca5-a4e55ec7e53c
osd.1 up in weight 1 up_from 69 up_thru 80 down_at 63 last_clean_interval [8,62) 10.8.128.86:6800/19549 10.8.128.86:6801/19549 10.8.128.86:6802/19549 10.8.128.86:6803/19549 exists,up 61f9fbec-687e-4d89-8900-fc7d9ff5f88a
osd.2 up in weight 1 up_from 84 up_thru 80 down_at 83 last_clean_interval [80,83) 10.8.128.40:6810/26274 10.8.128.40:6801/1026274 10.8.128.40:6802/1026274 10.8.128.40:6803/1026274 exists,up 7823abe2-208c-4381-9020-54b6bbf11b07
osd.3 down in weight 1 up_from 71 up_thru 80 down_at 85 last_clean_interval [17,65) 10.8.128.86:6804/19849 10.8.128.86:6805/19849 10.8.128.86:6808/19849 10.8.128.86:6809/19849 exists 8551c2bb-82ed-4b15-ad6e-305e6d560b59

{ "epoch": 85,
  "fsid": "283bbd8b-9205-4cfc-83fa-80546ba2fe36",
  "created": "2015-05-06 15:39:42.874568",
  "modified": "2015-05-18 06:20:07.885802",
  "flags": "", "cluster_snapshot": "", "pool_max": 0, "max_osd": 4,
  "pools": [
    { "pool": 0, "pool_name": "rbd", "flags": 1, "flags_names": "hashpspool", "type": 1, "size": 2, "min_size": 1, "crush_ruleset": 0, "object_hash": 2, "pg_num": 64, "pg_placement_num": 64, "crash_replay_interval": 0, "last_change": "1", "last_force_op_resend": "0", "auid": 0, "snap_mode": "selfmanaged", "snap_seq": 0, "snap_epoch": 0, "pool_snaps": [], "removed_snaps": "[]", "quota_max_bytes": 0, "quota_max_objects": 0, "tiers": [], "tier_of": -1, "read_tier": -1, "write_tier": -1, "cache_mode": "none", "target_max_bytes": 0, "target_max_objects": 0, "cache_target_dirty_ratio_micro": 0, "cache_target_full_ratio_micro": 0, "cache_min_flush_age": 0, "cache_min_evict_age": 0, "erasure_code_profile": "", "hit_set_params": { "type": "none" }, "hit_set_period": 0, "hit_set_count": 0, "min_read_recency_for_promote": 0, "stripe_width": 0, "expected_num_objects": 0 } ],
  "osds": [
    { "osd": 0, "uuid": "4d27fc96-3a2f-478f-bca5-a4e55ec7e53c", "up": 0, "in": 1, "weight": 1.000000, "primary_affinity": 1.000000, "last_clean_begin": 46, "last_clean_end": 72, "up_from": 78, "up_thru": 78, "down_at": 85, "lost_at": 0, "public_addr": "10.8.128.40:6800\/25974", "cluster_addr": "10.8.128.40:6804\/25974", "heartbeat_back_addr": "10.8.128.40:6808\/25974", "heartbeat_front_addr": "10.8.128.40:6809\/25974", "state": [ "exists" ] },
    { "osd": 1, "uuid": "61f9fbec-687e-4d89-8900-fc7d9ff5f88a", "up": 1, "in": 1, "weight": 1.000000, "primary_affinity": 1.000000, "last_clean_begin": 8, "last_clean_end": 62, "up_from": 69, "up_thru": 80, "down_at": 63, "lost_at": 0, "public_addr": "10.8.128.86:6800\/19549", "cluster_addr": "10.8.128.86:6801\/19549", "heartbeat_back_addr": "10.8.128.86:6802\/19549", "heartbeat_front_addr": "10.8.128.86:6803\/19549", "state": [ "exists", "up" ] },
    { "osd": 2, "uuid": "7823abe2-208c-4381-9020-54b6bbf11b07", "up": 1, "in": 1, "weight": 1.000000, "primary_affinity": 1.000000, "last_clean_begin": 80, "last_clean_end": 83, "up_from": 84, "up_thru": 80, "down_at": 83, "lost_at": 0, "public_addr": "10.8.128.40:6810\/26274", "cluster_addr": "10.8.128.40:6801\/1026274", "heartbeat_back_addr": "10.8.128.40:6802\/1026274", "heartbeat_front_addr": "10.8.128.40:6803\/1026274", "state": [ "exists", "up" ] },
    { "osd": 3, "uuid": "8551c2bb-82ed-4b15-ad6e-305e6d560b59", "up": 0, "in": 1, "weight": 1.000000, "primary_affinity": 1.000000, "last_clean_begin": 17, "last_clean_end": 65, "up_from": 71, "up_thru": 80, "down_at": 85, "lost_at": 0, "public_addr": "10.8.128.86:6804\/19849", "cluster_addr": "10.8.128.86:6805\/19849", "heartbeat_back_addr": "10.8.128.86:6808\/19849", "heartbeat_front_addr": "10.8.128.86:6809\/19849", "state": [ "exists" ] } ],
  "osd_xinfo": [
    { "osd": 0, "down_stamp": "2015-05-18 06:20:07.885802", "laggy_probability": 0.249900, "laggy_interval": 47, "features": 1125899906842623, "old_weight": 65536 },
    { "osd": 1, "down_stamp": "2015-05-14 04:58:18.937390", "laggy_probability": 0.357000, "laggy_interval": 933, "features": 1125899906842623, "old_weight": 65536 },
    { "osd": 2, "down_stamp": "2015-05-18 05:32:05.357656", "laggy_probability": 0.474930, "laggy_interval": 8, "features": 1125899906842623, "old_weight": 0 },
    { "osd": 3, "down_stamp": "2015-05-18 06:20:07.885802", "laggy_probability": 0.357000, "laggy_interval": 928, "features": 1125899906842623, "old_weight": 65536 } ],
  "pg_temp": [], "primary_temp": [], "blacklist": [],
  "erasure_code_profiles": { "default": { "directory": "\/usr\/lib64\/ceph\/erasure-code", "k": "2", "m": "1", "plugin": "jerasure", "technique": "reed_sol_van" } } }

ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 0      0 osd.0        down  1.00000          1.00000
 1      0 osd.1          up  1.00000          1.00000
 2      0 osd.2          up  1.00000          1.00000
 3      0 osd.3        down  1.00000          1.00000
Updated by Joao Eduardo Luis almost 9 years ago
Kefu, we need logs from osdmap epochs 81 to 82. The last working crushmap is from epoch 81; all subsequent maps are broken.
Updated by Joao Eduardo Luis almost 9 years ago
Right now, the only way I can easily reproduce this is by setting an empty crushmap with 'ceph osd setcrushmap -i some-empty.crushmap'.
I have no idea whether this was deliberately performed or if this was an unfortunate side effect of some operation.
Updated by Kefu Chai almost 9 years ago
diff --git a/src/tools/osdmaptool.cc b/src/tools/osdmaptool.cc
index 14331c3..a628454 100644
--- a/src/tools/osdmaptool.cc
+++ b/src/tools/osdmaptool.cc
@@ -465,9 +465,14 @@ int main(int argc, const char **argv)
     osdmap.print(cout);
   if (print_json)
     osdmap.dump_json(cout);
-  if (tree)
-    osdmap.print_tree(&cout, NULL);
-
+  if (tree) {
+    boost::scoped_ptr<Formatter> f(Formatter::create("json"));
+    f->open_object_section("tree");
+    osdmap.print_tree(NULL, f.get());
+    f->close_section();
+    f->flush(cout);
+    // osdmap.print_tree(&cout, NULL);
+  }
   if (modified) {
     bl.clear();
     osdmap.encode(bl, CEPH_FEATURES_SUPPORTED_DEFAULT | CEPH_FEATURE_RESERVED);
With the patch above, we are able to reproduce the crash using:
./osdmaptool /tmp/osdmap.85 --tree
where /tmp/osdmap.85 is the osdmap extracted using ceph-monstore-tool.
Updated by Kefu Chai almost 9 years ago
yes, "osd setcrushmap" was issued before we ran into this issue.
$ zgrep setcrushmap *
ceph.audit.log-20150519.gz:2015-05-18 05:14:12.613798 mon.0 10.8.128.33:6789/0 224816 : audit [INF] from='client.? 10.8.128.33:0/1023292' entity='client.admin' cmd=[{"prefix": "osd setcrushmap"}]: dispatch
ceph.audit.log-20150519.gz:2015-05-18 05:14:12.613798 mon.0 10.8.128.33:6789/0 224816 : audit [INF] from='client.? 10.8.128.33:0/1023292' entity='client.admin' cmd=[{"prefix": "osd setcrushmap"}]: dispatch
Updated by Joao Eduardo Luis almost 9 years ago
Yeah, I noticed that a while ago. The annoying thing is that I can't find logs for the period when the osdmap moves from epoch 81 (the last correctly formed crushmap) to epoch 82 (first wrong map). The 'ceph osd setcrushmap' commands do fall in this gap, so I'm assuming this was indeed an incorrectly formed crushmap that was pushed to the cluster -- and by 'incorrectly formed', in this case, I mean an empty map.
I think the real bug here is that we are allowing setting an empty map without requiring the user to be really sure about that.
Updated by Kefu Chai almost 9 years ago
Echoing Joao:
(gdb) p *this->crush
$9 = {mapper_lock = {name = 0xb298b8 "CrushWrapper::mapper_lock", id = -1, recursive = false, lockdep = true, backtrace = false,
    _m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0},
    nlock = 0, locked_by = 0, cct = 0x0, logger = 0x0},
  type_map = std::map with 0 elements, name_map = std::map with 0 elements, rule_name_map = std::map with 0 elements,
  crush = 0x2cfeff0, have_rmaps = false,
  type_rmap = std::map with 0 elements, name_rmap = std::map with 0 elements, rule_name_rmap = std::map with 0 elements}
So in osdmap epoch 85 we have 4 OSDs, but since the crushmap is empty, crush->get_typename(0) returns NULL, hence the crash.
> I think the real bug here is that we are allowing setting an empty map without requiring the user to be really sure about that.
We perform some sanity tests when handling "osd setcrushmap" or "osd crush set", but they failed to detect a crush map that is not able to cover all OSDs in the osdmap.
Will add a test for it.
Updated by Kefu Chai almost 9 years ago
- Status changed from New to Fix Under Review
Updated by Kefu Chai almost 9 years ago
- File 1029402.log.gz 1029402.log.gz added
- File 1029386.log.gz 1029386.log.gz added
- File CalAPITester.py CalAPITester.py added
Updated by Kefu Chai almost 9 years ago
In http://tracker.ceph.com/attachments/download/1716/1029402.log.gz, we have:
-1> 2015-05-25 05:33:57.741951 7ffe076b2700 5 mon.Mon@0(leader).paxos(paxos active c 1..531) is_readable = 1 - now=2015-05-25 05:33:57.741956 lease_expire=0.000000 has v0 lc 531
0> 2015-05-25 05:33:57.748845 7ffe076b2700 -1 osd/OSDMap.h: In function 'unsigned int OSDMap::get_weight(int) const' thread 7ffe076b2700 time 2015-05-25 05:33:57.742429
osd/OSDMap.h: 374: FAILED assert(o < max_osd)
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7b3a95]
 2: /usr/bin/ceph-mon() [0x773223]
 3: (OSDMap::print_tree(std::ostream*, ceph::Formatter*) const+0x192a) [0x78835a]
 4: (OSDMonitor::preprocess_command(MMonCommand*)+0xe55) [0x60a105]
 5: (OSDMonitor::preprocess_query(PaxosServiceMessage*)+0x20b) [0x60f7ab]
 6: (PaxosService::dispatch(PaxosServiceMessage*)+0x833) [0x5cad23]
 7: (Monitor::handle_command(MMonCommand*)+0x147c) [0x591a9c]
 8: (Monitor::dispatch(MonSession*, Message*, bool)+0xf9) [0x594cd9]
 9: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x595986]
where the crush map contains an OSD that is missing from the osdmap.
BTW, per http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and http://ceph.com/docs/master/install/manual-deployment/, the suggested practice of adding an OSD manually is:
- add the bucket hosting the OSD to the crush map
- add the OSD to the host
- start the OSD instance
Updated by Kefu Chai almost 9 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to firefly,hammer
Updated by Nathan Cutler over 8 years ago
- Status changed from Pending Backport to Resolved
Updated by Kefu Chai over 8 years ago
- Status changed from Resolved to 12
Reopening this ticket, as we have a reported failure due to a bad crushmap.
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
> 1: /usr/bin/ceph-mon() [0x9a98aa]
> 2: (()+0x10340) [0x7f8ab3a3d340]
> 3: (crush_do_rule()+0x292) [0x85ada2]
> 4: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0xeb) [0x7a85cb]
> 5: (OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >*, int*) const+0x94) [0x7a8a64]
> 6: (OSDMap::remove_redundant_temporaries(CephContext*, OSDMap const&, OSDMap::Incremental*)+0x317) [0x7ab8f7]
> 7: (OSDMonitor::create_pending()+0xf69) [0x60fdb9]
> 8: (PaxosService::_active()+0x709) [0x6047b9]
> 9: (PaxosService::election_finished()+0x67) [0x604ad7]
> 10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x236) [0x5c34a6]
> 11: (Monitor::win_standalone_election()+0x1cc) [0x5c388c]
> 12: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
> 13: (Monitor::init()+0xd5) [0x5c4645]
> 14: (main()+0x2470) [0x5769c0]
> 15: (__libc_start_main()+0xf5) [0x7f8ab1ec7ec5]
> 16: /usr/bin/ceph-mon() [0x5984f7]
Updated by Kefu Chai over 8 years ago
- Status changed from 12 to Need More Info
Quoting Eino's report:

> I'm getting a segmentation fault error from the monitor of our test
> cluster. The cluster was in a bad state because I have recently removed
> three hosts from it. Now I started cleaning it up and first marked the
> removed osd's as lost (ceph osd lost), and then I tried to remove the
> osd's from the crush map (ceph osd crush remove). After a few successful
> commands the cluster ceased to respond. One monitor seemed to stay up (it
> was responding through the admin socket), so I stopped it and used
> monmaptool to remove the failed monitor from the monmap. But, now also the
> second monitor segfaults when I try to start it.
>
> The cluster does not have any important data, but I'd like to get the
> monitors up as a practice. How do I debug this further?
>
> Linux cephmon-test-02 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> The output:
Updated by Kefu Chai over 8 years ago
Quoting Kefu's inquiry:

> Eino, I was looking at your issue at http://tracker.ceph.com/issues/12876.
> Seems it is due to a faulty crush rule, see http://tracker.ceph.com/issues/12876#note-5.
> May I know how you managed to inject it into the monitor? I tried using
>
> $ ceph osd setcrushmap -i new-crush-map
> Error EINVAL: Failed to parse crushmap: ** Caught signal (Segmentation fault) **
>
> but no luck.
Quoting Eino's reply:

> I'm pretty sure I did it just like you were trying to do. The cluster has since been upgraded a couple of times. Unfortunately I can't remember when I created that particular faulty rule.
Updated by Kefu Chai over 8 years ago
- Status changed from Need More Info to Can't reproduce