Bug #1694
monitor crash: FAILED assert(get_max_osd() >= crush.get_max_devices())
Description
I just did a fresh install of my cluster and after starting I saw my monitors go down with:
Nov 8 14:40:35 monitor-sec mon.sec[1611]: ./osd/OSDMap.h: In function 'int OSDMap::_pg_to_osds(const pg_pool_t&, pg_t, std::vector<int>&)', in thread '7f9c04742700'
./osd/OSDMap.h: 454: FAILED assert(get_max_osd() >= crush.get_max_devices())
Nov 8 14:40:35 monitor-sec mon.sec[1611]: ceph version 0.37-314-g40843eb (commit:40843eb36c3c029925d62f35aa8a4dee2876381c)
 1: /usr/bin/ceph-mon() [0x45c27f]
 2: (PGMonitor::send_pg_creates()+0x15b4) [0x4c6974]
 3: (PGMonitor::update_from_paxos()+0x4ef) [0x4c959f]
 4: (PaxosService::_active()+0x39) [0x4861e9]
 5: (Context::complete(int)+0xa) [0x472aca]
 6: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x482e2a]
 7: (Paxos::handle_lease(MMonPaxos*)+0x36b) [0x47d35b]
 8: (Paxos::dispatch(PaxosServiceMessage*)+0x21b) [0x481d3b]
 9: (Monitor::_ms_dispatch(Message*)+0x84e) [0x47062e]
 10: (Monitor::ms_dispatch(Message*)+0x35) [0x479485]
 11: (SimpleMessenger::dispatch_entry()+0x84b) [0x56d05b]
 12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45fd5c]
 13: (()+0x7efc) [0x7f9c08007efc]
 14: (clone()+0x6d) [0x7f9c06a4189d]
(the same backtrace is logged a second time)
This seems to be due to commit 885d71481bf06915569fadb938a0245097f2a9e0
On that specific monitor I checked the OSD maps and this showed:
root@monitor-sec:/var/lib/ceph/mon.sec/osdmap_full# osdmaptool --print 3
osdmaptool: osdmap file '3'
epoch 3
fsid 4bd06b88-1d07-53de-ea22-73f1fb4fe0c4
created 2011-11-08 14:09:40.450317
modified 2011-11-08 14:39:01.498778
flags

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 2496 pgp_num 2496 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 2496 pgp_num 2496 lpg_num 2 lpgp_num 2 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 2496 pgp_num 2496 lpg_num 2 lpgp_num 2 last_change 1 owner 0

max_osd 39
osd.17 up in weight 1 up_from 2 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6803/5177 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6804/5177 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6805/5177
osd.20 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6800/8492 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6801/8492 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6802/8492
osd.22 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/8693 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6807/8693 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6808/8693
osd.29 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6803/5478 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6804/5478 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6805/5478
osd.30 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6806/5600 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6807/5600 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6808/5600
osd.31 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6809/5711 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6810/5711 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6811/5711
osd.36 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6800/5663 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6801/5663 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6802/5663
osd.37 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6803/5753 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6804/5753 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6805/5753
root@monitor-sec:/var/lib/ceph/mon.sec/osdmap_full#
I extracted the crushmap (attached) out of the osdmap, and counting its OSD items shows:
root@monitor-sec:~# cat /root/crushmap.txt | grep item | grep osd | wc -l
40
root@monitor-sec:~#
I don't see why this assert should come up: 39 (max_osd) is less than 40 (max devices).
Could it be a problem that the "devices" were renamed to "items" in the crushmap? I haven't dumped max_devices out of the crushmap to test it though.
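For reference, the failing check can be sketched in Python. The function below is a hypothetical stand-in for the C++ assert in OSDMap.h (which compares OSDMap::get_max_osd() against the crush map's max_devices), not the actual Ceph code:

```python
# Sketch of the invariant the monitor asserts. If the CRUSH map declares
# more devices than the osdmap has OSD slots, CRUSH can map a PG to a
# device id the osdmap cannot represent, so the monitor aborts.

def check_osdmap_invariant(max_osd: int, max_devices: int) -> bool:
    """True iff the osdmap reserves enough slots for every CRUSH device."""
    return max_osd >= max_devices

# Values from this report: max_osd 39 in the osdmap, 40 devices in CRUSH.
print(check_osdmap_invariant(39, 40))  # False, so the assert fires
```

With these numbers the check fails, which matches the crash above.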
History
#1 Updated by Wido den Hollander over 12 years ago
I just made a small adjustment to crushtool so it would print max_devices:
root@monitor-sec:~# ./crushtool -d crushmap -o crushmap.txt
max_devices 40
root@monitor-sec:~#
That seems OK?
#2 Updated by Sage Weil over 12 years ago
- Target version set to v0.39
max_osd in the osdmap needs to be >= the max_devices in the crush map. how did you set up the cluster? did mkcephfs generate the crush map or did you feed one in manually?
#3 Updated by Wido den Hollander over 12 years ago
Aha! Read that wrong, tnx.
I used mkcephfs to generate the crushmap, I did not write my own.
#4 Updated by Sage Weil over 12 years ago
Can you try this and see if there is a mismatch?
osdmaptool --create-from-conf -c your.ceph.conf osdmap
osdmaptool -p osdmap | grep max
osdmaptool --export-crush crushmap osdmap
crushtool -d crushmap | grep device | tail -1
#5 Updated by Sage Weil over 12 years ago
- Status changed from New to Need More Info
#6 Updated by Wido den Hollander over 12 years ago
Ok, I've run those commands and they give me:
root@monitor:~# osdmaptool -p osdmap | grep max
max_osd 39
root@monitor:~#
So, that is one short of the 40 I have.
root@monitor:~# crushtool -d crushmap | grep device | tail -1
device 39 osd.39
root@monitor:~#

That is correct: the highest device is osd.39. Counting the devices in the crushmap also gives 40:

root@monitor:~# crushtool -d crushmap | grep device | grep osd | wc -l
40
root@monitor:~#
So, max_osd is set one short in the osdmap.
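The off-by-one is consistent with the outputs above: device ids run from 0 to 39, so the crush map holds 40 devices, and max_osd must be the highest id plus one. A small illustrative sketch (the helper name is made up for this comment, not Ceph code):

```python
# Illustrative only: how many OSD slots the osdmap must reserve so that
# every CRUSH device id (0 .. highest) is addressable.

def required_max_osd(device_ids):
    """max_osd must exceed the highest CRUSH device id, i.e. highest + 1."""
    return max(device_ids) + 1

ids = list(range(40))              # osd.0 .. osd.39, as in this cluster
print(required_max_osd(ids))       # 40
print(required_max_osd(ids) > 39)  # True: the osdmap's max_osd of 39 is one short
```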
#7 Updated by Wido den Hollander over 12 years ago
The monitor that was generating the osdmap was running 5bd029ef01fcb59bea9170af563c3499cce1e8c4 and that failed.
I just ran it with the latest master and that gave me a map with max_osd = 40, but I don't see a relevant change in the last 24 hours. Did I miss it?
I'll update asap with a test with the latest master.
#8 Updated by Sage Weil over 12 years ago
- Assignee set to Sage Weil
Great. Can you attach (or email) the ceph.conf you're using?
Thanks!
#9 Updated by Sage Weil over 12 years ago
oh never mind, didn't see that second comment. the fix is 0bcdd4f3b2a2dba405639122b84f7aad978f347b, which comes after 5bd029ef01fcb59bea9170af563c3499cce1e8c4.
#10 Updated by Sage Weil over 12 years ago
- Status changed from Need More Info to Resolved