Feature #6325

open

mon: mon_status should make it clear when the mon has connection issues

Added by Alfredo Deza over 10 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS): Monitor
Pull request ID:

Description

If `mon_status` is called while the monitor is having network connection issues, nothing in its output alerts a user or system to the problem.

At most, the monitor will be in the 'probing' state, which doesn't really hint at the problem.

Only after raising the log levels did information about the connectivity problem show up in the monitor logs:

2013-09-16 14:30:42.750312 7f484b44a700  2 -- 192.168.111.101:6789/0 >> 0.0.0.0:0/1 pipe(0x33e0280 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x3331080).fault 111: Connection refused

And also:

2013-09-16 14:28:04.289943 7f966c96f700  2 -- 192.168.111.100:6789/0 >> 192.168.111.101:6789/0 pipe(0x1dcf780 sd=22 :0 s=1 pgs=0 cs=0 l=0 c=0x1db9420).connect error 192.168.111.101:6789/0, 113: No route to host

The rationale here is that if the logs can report 'No route to host' and/or 'Connection refused', the same information should be available in `mon_status`.


Subtasks 1 (1 open, 0 closed)

Feature #1894: mon: implement internal heartbeating, New, 01/05/2012

Actions #1

Updated by Joao Eduardo Luis over 10 years ago

  • Subject changed from mon_status should make it clear when the mon has connection issues to mon: mon_status should make it clear when the mon has connection issues
  • Assignee set to Joao Eduardo Luis
  • Source changed from other to Community (dev)
Actions #2

Updated by Joao Eduardo Luis over 10 years ago

  • Tracker changed from Bug to Feature
Actions #3

Updated by Joao Eduardo Luis over 10 years ago

  • Status changed from New to 4

possible approach:

Since the Monitor class is a dispatcher of the messenger, add a new courtesy function, 'ms_handle_error()', to the messenger. The Monitor class will implement this function and add each error to a list. From time to time, the monitor may check whether the entries in the list are still valid by attempting to reproduce them, although this feels more like the messenger's own responsibility.

Showing these errors in mon_status is just a matter of going through the list.

Given a TTL per error, the monitor will periodically pop the head of the list.
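
A minimal sketch of the TTL-bounded list described above, to make the idea concrete. Everything here is an assumption for illustration: `ConnError` and `ErrorList` are hypothetical names, not Ceph types, and the proposed 'ms_handle_error()' hook does not exist in the messenger; `record()` only stands in for what such a hook might call. Timestamps are passed in explicitly so the expiry logic is easy to follow.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one messenger error; not a Ceph type.
struct ConnError {
    std::string peer;     // e.g. "192.168.111.101:6789/0"
    int code;             // errno value, e.g. 111 or 113
    std::string message;  // e.g. "Connection refused"
    std::chrono::steady_clock::time_point seen;
};

// Hypothetical TTL-bounded error list kept by the monitor.
class ErrorList {
public:
    explicit ErrorList(std::chrono::seconds ttl) : ttl_(ttl) {
        assert(ttl.count() > 0);
    }

    // What an implementation of the proposed ms_handle_error() might call.
    // Entries are assumed to arrive in non-decreasing time order.
    void record(std::string peer, int code, std::string message,
                std::chrono::steady_clock::time_point now) {
        errors_.push_back({std::move(peer), code, std::move(message), now});
    }

    // Pop entries from the head while they are at least as old as the TTL.
    void expire(std::chrono::steady_clock::time_point now) {
        while (!errors_.empty() && now - errors_.front().seen >= ttl_) {
            errors_.pop_front();
        }
    }

    // Lines that a mon_status-style report could include.
    std::vector<std::string> report() const {
        std::vector<std::string> out;
        for (const auto& e : errors_) {
            out.push_back(e.peer + ": " + std::to_string(e.code) + ": " + e.message);
        }
        return out;
    }

private:
    std::chrono::seconds ttl_;
    std::deque<ConnError> errors_;
};
```

Because entries are appended in time order, the oldest error is always at the head, so expiry only ever needs to look at the front of the deque.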

Then again, while this is a simple way to get the errors to the monitor, it feels a lot like something the messenger itself should handle: keep a list populated with the messenger's errors, then periodically check whether they still occur, while making sure we do not attempt to reproduce an error more than a couple of times.

Just a thought.

Actions #4

Updated by Greg Farnum over 10 years ago

Hmmm. The issue with doing this in the Messenger is that all those errors are expected to occur at some point; failures happen! Most of the time the failures we're seeing are the fault of the other guy, and the cluster will route around them appropriately (but of course the daemon can't know that in the moment). We can add some sort of error reporting interface as you suggest, but we'll need to be careful in designing it: we'd probably want to associate the error with a Connection, but we need to be sure the Connection stays valid long enough. (I forget if we've got them properly ref-counted now or not.)
Then the daemon can use its greater knowledge to decide whether the error is a problem or not. I don't think a list of errors that we spit out is the right answer, though; there are good odds of it just filling up with garbage from disappearing clients that we don't care about. Instead we'd design the interface well, and then the Monitor can look at incoming failures and keep track of failed connections to other monitors to do analysis on, e.g. "lost connection to mon.x" or "getting connect errors for every monitor!"
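
A rough sketch of the per-monitor tracking suggested above, as opposed to a flat error list. All names are hypothetical: `MonPeerTracker` is not a Ceph class, and peers are identified by plain name strings here rather than by a Connection, which sidesteps the lifetime question raised in this comment. The point is only that failures from unknown endpoints (such as disappearing clients) are dropped, and that the summary distinguishes one lost peer from errors on every monitor.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tracker; names are illustrative, not Ceph's API.
class MonPeerTracker {
public:
    explicit MonPeerTracker(std::set<std::string> peers)
        : peers_(std::move(peers)) {}

    // Called for each reported connection failure. Failures from
    // endpoints that are not known monitors are ignored.
    void on_failure(const std::string& peer, int code) {
        if (peers_.count(peer)) {
            last_error_[peer] = code;
        }
    }

    // Called when a connection to a peer succeeds again.
    void on_connect(const std::string& peer) { last_error_.erase(peer); }

    // Human-readable analysis that a status command could print.
    std::vector<std::string> summary() const {
        std::vector<std::string> out;
        if (!peers_.empty() && last_error_.size() == peers_.size()) {
            out.push_back("getting connect errors for every monitor");
            return out;
        }
        for (const auto& kv : last_error_) {
            out.push_back("lost connection to " + kv.first +
                          " (error " + std::to_string(kv.second) + ")");
        }
        return out;
    }

private:
    std::set<std::string> peers_;            // known peer monitors
    std::map<std::string, int> last_error_;  // peer -> last errno seen
};
```

Keeping only the last error per known peer bounds the state by the number of monitors, so it cannot fill up with entries from clients the monitor does not care about.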

Actions #5

Updated by Joao Eduardo Luis almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category changed from Monitor to Administration/Usability
  • Status changed from 4 to New
Actions #6

Updated by Joao Eduardo Luis almost 7 years ago

  • Component(RADOS) Monitor added
Actions #7

Updated by Joao Eduardo Luis almost 7 years ago

  • Target version set to v13.0.0
Actions #8

Updated by Joao Eduardo Luis over 3 years ago

  • Assignee deleted (Joao Eduardo Luis)