Bug #56672

'ceph zabbix send' can block (mon) ceph commands and messages

Added by Rafael Lopez 5 months ago. Updated 5 months ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: zabbix module
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It is possible to DoS the MGR by repeatedly running `ceph zabbix send` manually while the zabbix server is unresponsive.

This can hold up important MON messages and commands until the zabbix send returns. If the sends are scheduled (e.g. as a cron job) and keep stacking up, MON commands can be blocked indefinitely.

When the mon message throttler is full, `ceph status` shows stale and inaccurate info (presumably it falls back to cached info when it can't get the latest from the MGR), and other important commands hang, e.g. `ceph osd df`, `ceph pg dump`, etc.

How to reproduce:

1. To make the problem easier to trigger, set the MON message throttler to a very low value. In ceph.conf, for the MGR:

mgr mon messages = 2

and restart the MGRs.

2. Monitor the mon message throttler on the active MGR using `ceph daemon mgr.{id} perf dump | jq '."throttle-mgr_mon_messsages"'` (note the three s's in "messsages", as in the captured session below).
The output looks something like this:

root@soz-mon2:/home/debian# ceph daemon mgr.`hostname` perf dump | jq '."throttle-mgr_mon_messsages"'
{
  "val": 0,
  "max": 2,
  "get_started": 0,
  "get": 9781,
  "get_sum": 9781,
  "get_or_fail_fail": 297888,
  "get_or_fail_success": 9781,
  "take": 0,
  "take_sum": 0,
  "put": 9781,
  "put_sum": 9781,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
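For scripted monitoring, the same check can be done without `jq`. The following is a minimal sketch, not part of the original report: the function names `find_mon_throttle` and `is_saturated` are made up for illustration, and the section is matched by prefix because the counter key is spelled both `messages` and `messsages` in this report. Treating `max` of 0 as "throttle disabled" is an assumption.

```python
import json


def find_mon_throttle(perf_dump):
    """Return (key, counters) for the MGR's MON message throttle section.

    Matches by prefix so either spelling of the key seen in this report
    ("...messages" or "...messsages") is found.
    """
    for key, counters in perf_dump.items():
        if key.startswith("throttle-mgr_mon_mess"):
            return key, counters
    raise KeyError("no throttle-mgr_mon_mess* section in perf dump")


def is_saturated(counters):
    """True when the throttle is full, i.e. "val" has reached "max".

    Assumption: a "max" of 0 means the throttle is disabled, so it is
    never reported as saturated.
    """
    return counters["max"] > 0 and counters["val"] >= counters["max"]


# Values copied from the idle perf dump above.
sample = json.loads('{"throttle-mgr_mon_messsages": {"val": 0, "max": 2}}')
key, counters = find_mon_throttle(sample)
print(key, "saturated:", is_saturated(counters))
```

On a live cluster the input would come from `ceph daemon mgr.{id} perf dump`, e.g. piped in and parsed with `json.load(sys.stdin)`.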

3. Set up a fake, non-responsive zabbix server (or use a real zabbix server if you can make it unresponsive).

apt install netcat
nc -k -l -v -p 10051
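If netcat is not available, a listener with the same behaviour (accept connections, never reply) can be sketched in Python. This is an illustrative stand-in for the `nc` command above, not something from the original reproduction; `start_silent_server` is a hypothetical helper name.

```python
import socket
import threading


def start_silent_server(host="127.0.0.1", port=10051):
    """Listen on host:port, accept every connection, and never answer.

    Returns the listening socket and the list that holds accepted
    connections (kept referenced so they stay open).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(16)
    held = []

    def accept_loop():
        while True:
            try:
                conn, _addr = listener.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # hold the connection open, send nothing

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, held
```

To use it in place of netcat, call `start_silent_server("0.0.0.0", 10051)` and keep the process alive (for example with `threading.Event().wait()`), since the accept loop runs in a daemon thread.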

4. Configure your ceph zabbix module to that server

ceph zabbix config-set zabbix_host {your netcat server}

5. Set up the ceph zabbix host config

ceph zabbix config-set zabbix_host {your fake/unresponsive zabbix server}

6. Run `ceph zabbix send` a few times in the background, probably 5-10 is enough.

for i in `seq 1 5`; do ceph zabbix send & done

7. Check the throttler from perf dump again. It should show that "val" has reached "max" and that get_or_fail_fail keeps increasing.
Commands such as `ceph osd df`, `ceph pg dump`, `ceph fs status` etc. will now hang.

{
  "val": 2,
  "max": 2,
  "get_started": 0,
  "get": 13085,
  "get_sum": 13085,
  "get_or_fail_fail": 518045,
  "get_or_fail_success": 13085,
  "take": 0,
  "take_sum": 0,
  "put": 13083,
  "put_sum": 13083,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
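Comparing the two perf dumps numerically makes the stuck state easier to see. The sketch below uses only values copied from the dumps in this report. `throttle_delta` is a hypothetical helper, and reading `get_sum - put_sum` as "slots currently held" is an assumption that happens to agree with "val" in both samples.

```python
def throttle_delta(before, after):
    """Summarise how a throttle section changed between two perf dumps."""
    return {
        # Assumption: slots taken minus slots returned = slots still held.
        "in_flight": after["get_sum"] - after["put_sum"],
        # The throttle is full once "val" has reached "max".
        "saturated": after["val"] >= after["max"],
        # Failed non-blocking acquisitions counted since the first sample.
        "new_rejections": after["get_or_fail_fail"] - before["get_or_fail_fail"],
    }


# Idle sample (step 2) and blocked sample (step 7) from this report.
before = {"val": 0, "max": 2, "get_sum": 9781, "put_sum": 9781,
          "get_or_fail_fail": 297888}
after = {"val": 2, "max": 2, "get_sum": 13085, "put_sum": 13083,
         "get_or_fail_fail": 518045}

print(throttle_delta(before, after))
# {'in_flight': 2, 'saturated': True, 'new_rejections': 220157}
```

Both slots are held (13085 gets against 13083 puts), which matches "val": 2 with "max": 2.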

In this test, once all the zabbix commands time out (60s), the throttler is released and everything recovers. If `ceph zabbix send` keeps being run indefinitely, however, the throttler never drains and MON commands stay blocked (we discovered this behaviour from a `ceph zabbix send` cron job).
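Until a fix lands, one possible way to stop a cron job from stacking sends is to skip a run while the previous one is still waiting. This is only a sketch of a workaround, not the fix under review: `run_exclusive` and the lock path are hypothetical, and it assumes the `ceph zabbix send` client blocks until the MGR replies, so the lock stays held for as long as the send is outstanding.

```python
import fcntl
import subprocess


def run_exclusive(cmd, lock_path):
    """Run cmd only if nobody else holds lock_path.

    Returns the command's exit code, or None when the run was skipped
    because another invocation still holds the lock.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed on leaving the block.
        return subprocess.run(cmd).returncode
```

A cron wrapper could then call `run_exclusive(["ceph", "zabbix", "send"], "/var/lock/ceph-zabbix-send.lock")` (path chosen for illustration), so at most one send is in flight per host. This limits the pile-up; it does not release a throttle slot that is already held.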

History

#2 Updated by Konstantin Shalygin 5 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Rafael Lopez
  • Source set to Community (user)
  • Pull request ID set to 47225

#3 Updated by Patrick Donnelly 5 months ago

  • Target version changed from v17.2.2 to v18.0.0
