Bug #57309

open

Unhandled exception in monitor when muting health alert

Added by Paul Mezzanini over 1 year ago. Updated 12 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've discovered that ceph health mute accepts a malformed TTL, and passing one causes an unhandled exception on the monitor. An easy way to reproduce it:

ceph osd set norecover
ceph health mute OSDMAP_FLAGS --ttl=':'

My coworker did a quick test to see what he could come up with as a patch, and this is the untested result:

diff --git a/src/mon/HealthMonitor.cc b/src/mon/HealthMonitor.cc
index 3adbdc3de59f..1f6345e3ab61 100644
--- a/src/mon/HealthMonitor.cc
+++ b/src/mon/HealthMonitor.cc
@@ -301,8 +301,10 @@ bool HealthMonitor::prepare_command(MonOpRequestRef op)
       string ttl_str;
       utime_t ttl;
       if (cmd_getval(cmdmap, "ttl", ttl_str)) {
-        auto secs = parse_timespan(ttl_str);
-        if (secs == 0s) {
+        auto secs = 0s;
+        try {
+          secs = parse_timespan(ttl_str);
+        } catch (const std::invalid_argument& e) {
           r = -EINVAL;
           ss << "not a valid duration: " << ttl_str;
           goto out;

I lost my dev cluster, so I can't test it without affecting a live cluster.
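
For anyone who wants to try the idea outside a cluster, here is a minimal standalone sketch of the same guard. It is not Ceph code: parse_seconds_or_throw is a hypothetical stand-in for a parser that throws std::invalid_argument on bad input, and parse_ttl is a hypothetical wrapper name. It also keeps the zero-duration check that the patch above drops, which is an assumption about the intended behavior.

```cpp
#include <cctype>
#include <chrono>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a throwing duration parser; NOT Ceph's parse_timespan.
// Accepts only "<digits>" or "<digits>s" and throws std::invalid_argument otherwise.
std::chrono::seconds parse_seconds_or_throw(const std::string& s) {
  std::size_t pos = 0;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    ++pos;
  }
  if (pos == 0) {
    throw std::invalid_argument("expected digit");
  }
  if (pos > 9) {
    // Cap the digit count so std::stoll below cannot overflow.
    throw std::invalid_argument("value too large");
  }
  const long long n = std::stoll(s.substr(0, pos));
  const std::string rest = s.substr(pos);
  if (!rest.empty() && rest != "s") {
    throw std::invalid_argument("unexpected trailing '" + rest + "'");
  }
  return std::chrono::seconds(n);
}

// Wrapper in the style of a command handler: no exception caused by user
// input escapes. Returns std::nullopt and fills `err` on bad input.
std::optional<std::chrono::seconds> parse_ttl(const std::string& ttl_str,
                                              std::string& err) {
  try {
    const auto secs = parse_seconds_or_throw(ttl_str);
    if (secs == std::chrono::seconds(0)) {
      // Zero is treated as invalid (assumption, mirrors the pre-patch check).
      err = "not a valid duration: " + ttl_str;
      return std::nullopt;
    }
    return secs;
  } catch (const std::invalid_argument& e) {
    err = "not a valid duration: " + ttl_str + " (" + e.what() + ")";
    return std::nullopt;
  }
}
```

With this shape, an input such as ':' produces an error string for the caller to report instead of an exception propagating up the call stack.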

Actions #1

Updated by Paul Mezzanini 12 months ago

This is still an issue that will crash a monitor.

I just tried to mute a scrub warning for one month and caused a monitor to bounce.

I ran:
ceph health mute PG_NOT_SCRUBBED 1M --sticky

I was hoping M meant month, since m means minute. It doesn't.
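
One plausible reading of this failure is that the unit suffix is looked up case-sensitively, so M matches nothing and the parser throws. The sketch below illustrates that behavior only. The function name unit_to_seconds and the table entries (including "mo" for a 30-day month) are assumptions for illustration, not copied from Ceph.

```cpp
#include <chrono>
#include <map>
#include <stdexcept>
#include <string>

// Illustrative unit table; entries are assumptions, not taken from Ceph.
// std::map lookup on std::string is case-sensitive, so "M" matches
// neither "m" (minute) nor "mo" (month).
std::chrono::seconds unit_to_seconds(const std::string& unit) {
  static const std::map<std::string, long long> units = {
    {"s", 1},
    {"m", 60},
    {"h", 3600},
    {"d", 86400},
    {"mo", 30LL * 86400},  // a "month" approximated as 30 days
  };
  const auto it = units.find(unit);
  if (it == units.end()) {
    throw std::invalid_argument("unrecognized unit '" + unit + "'");
  }
  return std::chrono::seconds(it->second);
}
```

If the real parser behaves like this, any caller that does not catch std::invalid_argument would see the same kind of crash from an unknown suffix as from ':'.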
