Bug #7611
closed
All mon nodes crash when running "ceph tell osd.X" and using the "version" command
Description
I'm on 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
On one of the mon nodes I ran:
$ ceph tell osd.151
which brings up a "ceph>" prompt. I entered "help" to get a list of commands. It listed "version", among others, so I entered "version" and hit Enter.
Result: All 3 mon nodes stopped working.
From the log of the leading mon node:
2014-03-05 09:50:36.960086 7f9120574700 1 mon.csqaeubap-u01mon01@0(leader).paxos(paxos active c 14057..14605) is_readable now=2014-03-05 09:50:36.960088 lease_expire=2014-03-05 09:50:40.225030 has v0 lc 14605
2014-03-05 09:50:36.961349 7f9120574700 0 mon.csqaeubap-u01mon01@0(leader) e9 handle_command mon_command({"prefix": "version"} v 0) v1
2014-03-05 09:50:36.964103 7f9120574700 -1 mon/Monitor.cc: In function 'bool Monitor::_allowed_command(MonSession*, std::string&, std::string&, std::map<std::basic_string<char>, boost::variant<std::basic_string<char>, bool, long int, double, std::vector<std::basic_string<char> > > >&)' thread 7f9120574700 time 2014-03-05 09:50:36.961413
mon/Monitor.cc: 1898: FAILED assert(this_cmd != __null)
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
1: /usr/bin/ceph-mon() [0x613701]
2: (Monitor::handle_command(MMonCommand*)+0x713) [0x6144f3]
3: (Monitor::dispatch(MonSession*, Message*, bool)+0x3e2) [0x61d6a2]
4: (Monitor::_ms_dispatch(Message*)+0x1c6) [0x61db16]
5: (Monitor::ms_dispatch(Message*)+0x32) [0x63ba82]
6: (DispatchQueue::entry()+0x4eb) [0x88c3db]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c469d]
8: (()+0x6b50) [0x7f9125840b50]
9: (clone()+0x6d) [0x7f91242120ed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-10000> 2014-03-05 09:46:28.929050 7f9120574700 1 -- 10.88.32.11:6789/0 <== client.? 10.88.32.11:0/1007096 1 ==== auth(proto 0 25 bytes epoch 0) v1 ==== 55+0+0 (1030628714 0 0) 0x39e2b40 con 0x7229a20
-9999> 2014-03-05 09:46:28.929092 7f9120574700 1 mon.csqaeubap-u01mon01@0(leader).paxos(paxos active c 14057..14599) is_readable now=2014-03-05 09:46:28.929095 lease_expire=2014-03-05 09:46:31.417707 has v0 lc 14599
-9998> 2014-03-05 09:46:28.929138 7f9120574700 1 -- 10.88.32.11:6789/0 --> 10.88.32.11:0/1007096 -- mon_map v1 -- ?+0 0x41a45a0 con 0x7229a20
-9997> 2014-03-05 09:46:28.929181 7f9120574700 1 -- 10.88.32.11:6789/0 --> 10.88.32.11:0/1007096 -- auth_reply(proto 2 0 Success) v1 -- ?+0 0x4f5d000 con 0x7229a20
...
-2> 2014-03-05 09:50:36.961238 7f9120574700 1 -- 10.88.32.11:6789/0 <== client.2375582 10.88.32.11:0/1012537 4 ==== mon_command({"prefix": "version"} v 0) v1 ==== 63+0+0 (2936324440 0 0) 0x41a4b40 con 0x5dba840
-1> 2014-03-05 09:50:36.961349 7f9120574700 0 mon.csqaeubap-u01mon01@0(leader) e9 handle_command mon_command({"prefix": "version"} v 0) v1
0> 2014-03-05 09:50:36.964103 7f9120574700 -1 mon/Monitor.cc: In function 'bool Monitor::_allowed_command(MonSession*, std::string&, std::string&, std::map<std::basic_string<char>, boost::variant<std::basic_string<char>, bool, long int, double, std::vector<std::basic_string<char> > > >&)' thread 7f9120574700 time 2014-03-05 09:50:36.961413
mon/Monitor.cc: 1898: FAILED assert(this_cmd != __null)
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
1: /usr/bin/ceph-mon() [0x613701]
2: (Monitor::handle_command(MMonCommand*)+0x713) [0x6144f3]
3: (Monitor::dispatch(MonSession*, Message*, bool)+0x3e2) [0x61d6a2]
4: (Monitor::_ms_dispatch(Message*)+0x1c6) [0x61db16]
5: (Monitor::ms_dispatch(Message*)+0x32) [0x63ba82]
6: (DispatchQueue::entry()+0x4eb) [0x88c3db]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c469d]
8: (()+0x6b50) [0x7f9125840b50]
9: (clone()+0x6d) [0x7f91242120ed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.csqaeubap-u01mon01.log
--- end dump of recent events ---
2014-03-05 09:50:37.106625 7f9120574700 -1 *** Caught signal (Aborted) ** in thread 7f9120574700
Upon asking on IRC in #ceph, another user (calit) was able to reproduce it on 0.72.2 as well. A third user (fghaas) tried on dumpling: the mons did not die, but answered with "Error: 22 EINVAL, Status: unrecognized command" (although running "help" on dumpling also lists "version" as a valid command).
So it seems the error can easily be reproduced on a standard 0.72.2 release.
Do you need anything else, more logs, more tests?
Updated by Sage Weil about 10 years ago
- Status changed from New to 12
- Assignee set to Joao Eduardo Luis
- Priority changed from Normal to Urgent
I think the reason we never saw this is that nobody uses the interactive command.
Joao, this sounds trivial to reproduce and debug!
Also, we can add some simple tests to qa/workunit/cephtool/test.sh by piping input with newlines into the interactive ceph mode.
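A minimal sketch of what such a piped test could look like, assuming the interactive mode reads one command per line from standard input when it is not attached to a terminal. The helper name `tell_interactive`, the `CEPH_BIN` variable, and the `osd.0` target are hypothetical and are not taken from test.sh.

```shell
#!/bin/sh
# Hypothetical helper: drive the interactive "ceph tell" prompt from a pipe.
# CEPH_BIN is an assumed override so the binary can be swapped out in tests.
CEPH_BIN="${CEPH_BIN:-ceph}"

# Usage: tell_interactive <target> <command>...
# Sends each command on its own line to "ceph tell <target>" via stdin.
tell_interactive() {
    target="$1"
    shift
    printf '%s\n' "$@" | "$CEPH_BIN" tell "$target"
}

# Example against a running test cluster (osd.0 is a placeholder):
#   tell_interactive osd.0 help version
#   ceph mon stat    # then confirm the monitors are still up
```

Following the piped commands with a monitor status check is what would catch this bug, since the failure shows up as the mons going down rather than as an error returned to the client.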
Updated by Joao Eduardo Luis about 10 years ago
Easily reproducible on 0.72.2; unable to reproduce on current master. Will look into it further.
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from 12 to In Progress
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from In Progress to Fix Under Review
- Target version set to 0.79
Updated by Joao Eduardo Luis about 10 years ago
- Backport set to emperor, dumpling
Updated by Sage Weil about 10 years ago
- Status changed from Fix Under Review to Resolved