Bug #7611

All mon nodes crash when running "ceph tell osd.X" and using the "version" command

Added by Volker Voigt over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Category: Monitor
Target version:
% Done: 0%
Source:
Tags:
Backport: emperor, dumpling
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm on 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

On one of the mon nodes I ran:

$ ceph tell osd.151

which brings up a "ceph>" prompt. I entered "help" to get a list of commands. The list included (among others) "version", so I entered "version" and hit enter.

Result: All 3 mon nodes stopped working.

From the log of the leading mon node:

2014-03-05 09:50:36.960086 7f9120574700  1 mon.csqaeubap-u01mon01@0(leader).paxos(paxos active c 14057..14605) is_readable now=2014-03-05 09:50:36.960088 lease_expire=2014-03-05 09:50:40.225030 has v0 lc 14605
2014-03-05 09:50:36.961349 7f9120574700  0 mon.csqaeubap-u01mon01@0(leader) e9 handle_command mon_command({"prefix": "version"} v 0) v1
2014-03-05 09:50:36.964103 7f9120574700 -1 mon/Monitor.cc: In function 'bool Monitor::_allowed_command(MonSession*, std::string&, std::string&, std::map<std::basic_string<char>, boost::variant<std::basic_string<char>, bool, long int, double, std::vector<std::basic_string<char> > > >&)' thread 7f9120574700 time 2014-03-05 09:50:36.961413
mon/Monitor.cc: 1898: FAILED assert(this_cmd != __null)

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: /usr/bin/ceph-mon() [0x613701]
 2: (Monitor::handle_command(MMonCommand*)+0x713) [0x6144f3]
 3: (Monitor::dispatch(MonSession*, Message*, bool)+0x3e2) [0x61d6a2]
 4: (Monitor::_ms_dispatch(Message*)+0x1c6) [0x61db16]
 5: (Monitor::ms_dispatch(Message*)+0x32) [0x63ba82]
 6: (DispatchQueue::entry()+0x4eb) [0x88c3db]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c469d]
 8: (()+0x6b50) [0x7f9125840b50]
 9: (clone()+0x6d) [0x7f91242120ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2014-03-05 09:46:28.929050 7f9120574700  1 -- 10.88.32.11:6789/0 <== client.? 10.88.32.11:0/1007096 1 ==== auth(proto 0 25 bytes epoch 0) v1 ==== 55+0+0 (1030628714 0 0) 0x39e2b40 con 0x7229a20
 -9999> 2014-03-05 09:46:28.929092 7f9120574700  1 mon.csqaeubap-u01mon01@0(leader).paxos(paxos active c 14057..14599) is_readable now=2014-03-05 09:46:28.929095 lease_expire=2014-03-05 09:46:31.417707 has v0 lc 14599
 -9998> 2014-03-05 09:46:28.929138 7f9120574700  1 -- 10.88.32.11:6789/0 --> 10.88.32.11:0/1007096 -- mon_map v1 -- ?+0 0x41a45a0 con 0x7229a20
 -9997> 2014-03-05 09:46:28.929181 7f9120574700  1 -- 10.88.32.11:6789/0 --> 10.88.32.11:0/1007096 -- auth_reply(proto 2 0 Success) v1 -- ?+0 0x4f5d000 con 0x7229a20

...

    -2> 2014-03-05 09:50:36.961238 7f9120574700  1 -- 10.88.32.11:6789/0 <== client.2375582 10.88.32.11:0/1012537 4 ==== mon_command({"prefix": "version"} v 0) v1 ==== 63+0+0 (2936324440 0 0) 0x41a4b40 con 0x5dba840
    -1> 2014-03-05 09:50:36.961349 7f9120574700  0 mon.csqaeubap-u01mon01@0(leader) e9 handle_command mon_command({"prefix": "version"} v 0) v1
     0> 2014-03-05 09:50:36.964103 7f9120574700 -1 mon/Monitor.cc: In function 'bool Monitor::_allowed_command(MonSession*, std::string&, std::string&, std::map<std::basic_string<char>, boost::variant<std::basic_string<char>, bool, long int, double, std::vector<std::basic_string<char> > > >&)' thread 7f9120574700 time 2014-03-05 09:50:36.961413
mon/Monitor.cc: 1898: FAILED assert(this_cmd != __null)

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: /usr/bin/ceph-mon() [0x613701]
 2: (Monitor::handle_command(MMonCommand*)+0x713) [0x6144f3]
 3: (Monitor::dispatch(MonSession*, Message*, bool)+0x3e2) [0x61d6a2]
 4: (Monitor::_ms_dispatch(Message*)+0x1c6) [0x61db16]
 5: (Monitor::ms_dispatch(Message*)+0x32) [0x63ba82]
 6: (DispatchQueue::entry()+0x4eb) [0x88c3db]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c469d]
 8: (()+0x6b50) [0x7f9125840b50]
 9: (clone()+0x6d) [0x7f91242120ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.csqaeubap-u01mon01.log
--- end dump of recent events ---
2014-03-05 09:50:37.106625 7f9120574700 -1 *** Caught signal (Aborted) **
 in thread 7f9120574700

Upon asking on IRC in #ceph, another user (calit) was able to reproduce it on 0.72.2 too. A third user (fghaas) tried on dumpling: the mons did not die, but answered with "Error: 22 EINVAL, Status: unrecognized command" (although running "help" on dumpling offers "version" as a valid command too).

So it seems the error can easily be reproduced on a standard 0.72.2 release.
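
To illustrate the difference between the two behaviours reported above (assert and abort on 0.72.2 versus an EINVAL reply on dumpling), here is a toy model in Python. It is not Ceph code and is not claimed to be the fix that was applied: the command table, `lookup`, `handle_asserting` and `handle_rejecting` are all hypothetical names, and the only thing taken from the log is that the lookup for "version" came back null on the monitor.

```python
import errno

# Hypothetical command table for illustration only; the real monitor's
# table is not shown in this report. "version" is deliberately absent,
# matching the failed lookup seen in the log.
KNOWN_COMMANDS = {"health", "status", "quorum_status"}


def lookup(prefix):
    """Return the command name if it is in the table, else None."""
    return prefix if prefix in KNOWN_COMMANDS else None


def handle_asserting(prefix):
    """Models the 0.72.2 behaviour: an unknown command trips an assert."""
    this_cmd = lookup(prefix)
    # Analogous to: FAILED assert(this_cmd != __null)
    assert this_cmd is not None
    return (0, "ok")


def handle_rejecting(prefix):
    """Models the dumpling behaviour: an unknown command gets an error reply."""
    this_cmd = lookup(prefix)
    if this_cmd is None:
        return (-errno.EINVAL, "unrecognized command")
    return (0, "ok")
```

In this sketch, `handle_rejecting("version")` returns an error tuple that a client could print, while `handle_asserting("version")` raises and would take the whole process down, which is the behaviour that matters when the process is a monitor.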

Do you need anything else, more logs, more tests?

History

#1 Updated by Sage Weil over 7 years ago

  • Status changed from New to 12
  • Assignee set to Joao Eduardo Luis
  • Priority changed from Normal to Urgent

I think the reason why we never saw this is that nobody uses the interactive command.

joao, this sounds trivial to reproduce and debug!

Also, we can add some simple tests into qa/workunit/cephtool/test.sh by piping stuff with newlines into the interactive ceph mode.
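
A minimal sketch of the "pipe commands with newlines into the interactive mode" idea, written in Python rather than as an addition to test.sh. The helper name `run_interactive` and its parameters are made up for illustration, and it assumes the interactive prompt reads commands from stdin when stdin is a pipe, as the comment above suggests.

```python
import subprocess


def run_interactive(argv, commands, timeout=30):
    """Feed newline-terminated commands to argv's stdin.

    Returns (returncode, stdout, stderr) once the process exits after
    reading end-of-file on stdin.
    """
    # One command per line, each followed by a newline.
    script = "".join(cmd + "\n" for cmd in commands)
    proc = subprocess.run(
        argv,
        input=script,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Against a test cluster this could be called as `run_interactive(["ceph", "tell", "osd.0"], ["version"])`, where `osd.0` is a placeholder OSD id, followed by a separate check that the monitors are still up.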

#2 Updated by Joao Eduardo Luis over 7 years ago

Easily reproducible on 0.72.2; unable to reproduce on current master. Will look into it further.

#3 Updated by Joao Eduardo Luis over 7 years ago

  • Status changed from 12 to In Progress

#4 Updated by Joao Eduardo Luis over 7 years ago

  • Status changed from In Progress to Fix Under Review
  • Target version set to 0.79

#5 Updated by Joao Eduardo Luis over 7 years ago

  • Backport set to emperor, dumpling

#6 Updated by Sage Weil over 7 years ago

  • Status changed from Fix Under Review to Resolved
