
Bug #21300

"ceph osd df" crashes ceph-mon if mgr is offline

Added by Artemy Kapitula 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Monitor
Target version:
-
Start date:
09/07/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
luminous, master
Needs Doc:
No

Description

ceph-mon crashes when "ceph osd df" is issued while no ceph-mgr is running:

Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 2017-09-07 14:32:28.543543 7fe69f24b700 0 mon.dpr-2a1713-063-crd@0(leader) e1 handle_command mon_command({"prefix": "df", "format": "json"} v
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 2017-09-07 14:32:28.543583 7fe69f24b700 0 log_channel(audit) log [DBG] : from='client.? 10.118.63.11:0/33089080' entity='client.admin' cmd=[{
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 2017-09-07 14:32:28.547544 7fe69f24b700 0 mon.dpr-2a1713-063-crd@0(leader) e1 handle_command mon_command({"prefix": "osd df", "format": "json
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 2017-09-07 14:32:28.547574 7fe69f24b700 0 log_channel(audit) log [DBG] : from='client.? 10.118.63.11:0/4177904687' entity='client.admin' cmd=
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: *** Caught signal (Aborted) **
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: in thread 7fe69f24b700 thread_name:ms_dispatch
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 1: (()+0x8c6e21) [0x56080d4f7e21]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 2: (()+0xf130) [0x7fe6a6ad1130]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 3: (gsignal()+0x37) [0x7fe6a50715d7]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 4: (abort()+0x148) [0x7fe6a5072cc8]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 5: (()+0x4427b9) [0x56080d0737b9]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 6: (print_osd_utilization(OSDMap const&, PGStatService const*, std::ostream&, ceph::Formatter*, bool)+0x1a7) [0x56080d2eb287]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 7: (OSDMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0x105d) [0x56080d148ddd]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 8: (OSDMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x3d6) [0x56080d150e86]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 9: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x56080d1014a1]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 10: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1cb1) [0x56080cfdfff1]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 11: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x8b9) [0x56080cfe5789]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 12: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x56080cfe6a3b]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 13: (Monitor::ms_dispatch(Message*)+0x23) [0x56080d011263]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 14: (DispatchQueue::entry()+0x792) [0x56080d4b3102]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x56080d2b64cd]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 16: (()+0x7df5) [0x7fe6a6ac9df5]
Sep 07 14:32:28 dpr-2a1713-063-crd ceph-custom-daemon1839: 17: (clone()+0x6d) [0x7fe6a51321ad]


Related issues

Copied to Ceph - Backport #22079: luminous: "ceph osd df" crashes ceph-mon if mgr is offline Resolved

History

#1 Updated by huanwen ren 3 months ago

@Artemy Kapitula
This PR: https://github.com/ceph/ceph/pull/17322 can resolve it,
but it has not been merged into the Luminous version.

#2 Updated by Kefu Chai 3 months ago

  • Subject changed from ceph-mon crash to "ceph osd df" crashes ceph-mon if mgr is offline
  • Status changed from New to Need More Info
  • Target version deleted (v12.2.0)

i am not able to reproduce this issue with a vstart cluster with ddf84249fa8a8ec3655c39bac5331ab81c0307b1.

  1. start vstart cluster with 1 mgr
  2. stop this mgr
  3. "ceph osd df" hangs

or

  1. start vstart cluster without mgr
  2. "ceph osd df" hangs

#3 Updated by Joao Luis 3 months ago

kefu, were you able to reproduce the issue on 12.2.0?

#4 Updated by huanwen ren 3 months ago

Follow the steps below to reproduce, with only one mon and one mgr
(multiple mons and mgrs will have the same problem; using one simplifies reproduction):
1. start mon and mgr
2. stop mgr
3. stop mon
4. start mon
5. run "ceph --verbose osd df", which prints:

        bestcmds_sorted: 
      [{u'cmd019': {u'avail': u'cli,rest',
              u'flags': 0,
              u'help': u'show OSD utilization',
              u'module': u'osd',
              u'perm': u'r',
              u'sig': [argdesc(<class 'ceph_argparse.CephPrefix'>, name=prefix, req=True, n=1, prefix=osd),
                       argdesc(<class 'ceph_argparse.CephPrefix'>, name=prefix, req=True, n=1, prefix=df),
                       argdesc(<class 'ceph_argparse.CephChoices'>, name=output_method, req=False, n=1,      
      strings=plain|tree)]}}]
      Submitting command:  {'prefix': u'osd df'}

6. the mon core dumps

#5 Updated by Joao Luis 3 months ago

  • Assignee set to Joao Luis

I'll look into this.

#6 Updated by Kefu Chai 3 months ago

  • Status changed from Need More Info to New

Joao, i have not tested it on 12.2.0.

#7 Updated by Joao Luis 3 months ago

  • Status changed from New to Verified
  • Priority changed from Normal to High
  • Severity changed from 3 - minor to 2 - major

okay, managed to reproduce this somehow, on v12.2.0.

The only obvious difference between the run that reproduced the crash and all the runs that didn't is that, in the reproducing case, the ceph tool actually sends the command to the monitor; when the issue is not triggered, it's because the ceph tool blocks trying to reach the mgr directly.

I suspect this happened because, the one time it triggered, the client contacted the monitor directly since the mgr had already been marked down at that point. Not sure yet, but looking into it now.

#8 Updated by Joao Luis 3 months ago

This is reproducible in last night's luminous.

#9 Updated by Joao Luis 3 months ago

I did figure out how to fix the crash on luminous, but the underlying issue is nevertheless present in master as well.

In master, even though the command does not crash the monitor, it does have the side-effect of returning EINVAL (because the monitor has no idea what this command is):

better match: 2 > 1: cmd019:osd df {plain|tree} 
bestcmds_sorted: 
[{u'cmd019': {u'avail': u'cli,rest',
              u'flags': 0,
              u'help': u'show OSD utilization',
              u'module': u'osd',
              u'perm': u'r',
              u'sig': [argdesc(<class 'ceph_argparse.CephPrefix'>, name=prefix, req=True, n=1, prefix=osd),
                       argdesc(<class 'ceph_argparse.CephPrefix'>, name=prefix, req=True, n=1, prefix=df),
                       argdesc(<class 'ceph_argparse.CephChoices'>, name=output_method, req=False, n=1, strings=plain|tree)]}}]
Submitting command:  {'prefix': u'osd df'}
2017-09-20 17:22:05.569963 7f54f913f700 10 monclient: _send_command 2 [{"prefix": "osd df"}]
2017-09-20 17:22:05.569973 7f54f913f700 10 monclient: _send_mon_message to mon.a at 127.0.0.1:40000/0
2017-09-20 17:22:05.570807 7f54f086e700 10 monclient: handle_mon_command_ack 2 [{"prefix": "osd df"}]
2017-09-20 17:22:05.570812 7f54f086e700 10 monclient: _finish_command 2 = -22 (22) Invalid argument
Error EINVAL: (22) Invalid argument
2017-09-20 17:22:05.572510 7f54f913f700 10 monclient: shutdown
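The EINVAL above can be read as a command-table miss: on master, "osd df" is registered as a mgr command, so when the client falls back to the monitor, the mon's lookup fails and it answers -EINVAL instead of crashing. A hedged sketch of that lookup behavior (the struct and method names are assumptions for illustration, not Ceph's actual dispatch code):

```cpp
#include <cerrno>
#include <map>
#include <string>

// Simplified model of a daemon's registered-command table. In Ceph, the
// mon and mgr each carry their own tables; a prefix absent from the mon's
// table gets rejected rather than handled.
struct MonCommandTable {
  std::map<std::string, std::string> commands;  // prefix -> handler name

  int dispatch(const std::string& prefix) const {
    auto it = commands.find(prefix);
    if (it == commands.end())
      return -EINVAL;  // unknown command: surfaces as "Error EINVAL: (22)"
    return 0;          // known command: would be handled normally
  }
};
```

Under this model, a mon table that lacks "osd df" reproduces the "-22 (22) Invalid argument" ack seen in the monclient log.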

#10 Updated by Joao Luis 3 months ago

  • Status changed from Verified to Need Review
  • Backport set to luminous
  • Release master added

#11 Updated by Joao Luis 2 months ago

Merged part of the fix on master; the backport of that fix, along with a second patch required for luminous, can be found in https://github.com/ceph/ceph/pull/18038

#12 Updated by Nathan Cutler 2 months ago

  • Status changed from Need Review to Pending Backport

#13 Updated by Joao Luis 2 months ago

Nathan, backport has been merged as part of https://github.com/ceph/ceph/pull/18038 - is there anything else to be done?

#14 Updated by Nathan Cutler 2 months ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (luminous)

#15 Updated by Nathan Cutler about 1 month ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous

#16 Updated by Nathan Cutler about 1 month ago

  • Copied to Backport #22079: luminous: "ceph osd df" crashes ceph-mon if mgr is offline added

#17 Updated by Nathan Cutler about 1 month ago

  • Status changed from Pending Backport to Resolved
