mon: mon_status should make it clear when the mon has connection issues
If `mon_status` is called while the monitor is having network connection issues, nothing in its output alerts a user or system to the problem.
At most, the monitor will be in the 'probing' state, which doesn't really hint at the problem.
Only after raising the log levels did the monitor logs show some information about the connectivity problem:
2013-09-16 14:30:42.750312 7f484b44a700 2 -- 192.168.111.101:6789/0 >> 0.0.0.0:0/1 pipe(0x33e0280 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x3331080).fault 111: Connection refused
2013-09-16 14:28:04.289943 7f966c96f700 2 -- 192.168.111.100:6789/0 >> 192.168.111.101:6789/0 pipe(0x1dcf780 sd=22 :0 s=1 pgs=0 cs=0 l=0 c=0x1db9420).connect error 192.168.111.101:6789/0, 113: No route to host
The rationale here is that if the logs can have output of 'No route to host' and/or 'Connection refused', the expectation would be to have the same
information in mon_status.
#3 Updated by Joao Eduardo Luis over 6 years ago
- Status changed from New to 4
Considering that class Monitor is a dispatcher of the messenger, we could add a new courtesy function to the messenger, 'ms_handle_error()'. Class Monitor would implement this function and add each error to a list. From time to time, the monitor may check whether the elements on the list are still valid by attempting to reproduce them, although this feels more like the responsibility of the messenger itself.
Showing these errors on mon_status is just a matter of going through the list.
Given a TTL per error, the monitor will periodically pop the head of the list.
Then again, while this is a simple way to get the errors to the monitor, it feels a lot like something that should be handled by the messenger itself: it would keep a list populated with its own errors, then periodically check whether they are still valid, while making sure we do not attempt to reproduce an error more than a couple of times.
Just a thought.
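A minimal sketch of the list-with-TTL idea from this comment, to make the proposal concrete. It is not Ceph code: `ErrorList`, `ConnError`, `record()`, `expire()` and `dump()` are hypothetical names, and the only name taken from the proposal is `ms_handle_error()`, which appears here solely in a comment marking where the hook would feed the list. It assumes errors are recorded in time order, so the oldest entry is always at the head.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical sketch; none of these types exist in Ceph under these names.
using Clock = std::chrono::steady_clock;

struct ConnError {
  std::string peer;        // address of the remote end, as seen in the logs
  int err;                 // errno value, e.g. 111 or 113 in the logs above
  Clock::time_point when;  // time the messenger reported the error
};

class ErrorList {
public:
  explicit ErrorList(std::chrono::seconds ttl) : ttl_(ttl) {
    assert(ttl.count() > 0);  // a zero TTL would drop every error at once
  }

  // Would be called from the proposed ms_handle_error() hook (assumption).
  void record(const std::string& peer, int err, Clock::time_point now) {
    errors_.push_back(ConnError{peer, err, now});
  }

  // Pop entries from the head for as long as they have outlived the TTL.
  void expire(Clock::time_point now) {
    while (!errors_.empty() && now - errors_.front().when >= ttl_) {
      errors_.pop_front();
    }
  }

  // What mon_status would walk through to report the errors.
  std::vector<std::string> dump() const {
    std::vector<std::string> out;
    for (const auto& e : errors_) {
      out.push_back(e.peer + ": errno " + std::to_string(e.err));
    }
    return out;
  }

  std::size_t size() const { return errors_.size(); }

private:
  std::chrono::seconds ttl_;
  std::deque<ConnError> errors_;
};
```

The sketch only ages errors out; it does not attempt to reproduce them, which is the part that arguably belongs in the messenger.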
#4 Updated by Greg Farnum over 6 years ago
Hmmm. The issue with doing this in the Messenger is that all those errors are expected to occur at some point — failures happen! And most of the time the failures we're seeing are the fault of the other guy and the cluster will route around them appropriately (but of course the daemon can't know that in the moment, period). We can add some sort of error reporting interface as you suggest, but we'll need to be careful in designing it — we'd probably want to associate the error with a Connection, but we need to be sure the Connection stays valid long enough. (I forget if we've got them properly ref-counted now or not.)
Then the daemon can use its greater knowledge to decide if the error is a problem or not. I don't think a list of errors that we spit out is the right answer, though; there are good odds of it just filling up with garbage from disappearing clients that we don't care about. Instead we'd design the interface well, and then the Monitor can look at incoming failures and keep track of failed connections to other monitors to do analysis on, e.g., "lost connection to mon.x" or "getting connect errors for every monitor!"
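One way the per-peer tracking described here could look, as a rough sketch only. `MonPeerTracker` and its methods are hypothetical and do not correspond to any existing Ceph interface; peers are identified by plain name strings rather than by Connection, which sidesteps the ref-counting question raised above. Errors from anything that is not a known monitor are dropped, so disappearing clients never fill the state.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch; names are illustrative, not Ceph's API.
class MonPeerTracker {
public:
  explicit MonPeerTracker(std::set<std::string> peers)
      : peers_(std::move(peers)) {}

  // Record a failure, but only if it concerns another monitor.
  void on_connect_error(const std::string& who, int err) {
    if (peers_.count(who) == 0) {
      return;  // some client went away; not interesting here
    }
    failed_[who] = err;  // keep only the latest errno per peer
  }

  // A successful connection clears the peer's failed state.
  void on_connect_ok(const std::string& who) { failed_.erase(who); }

  // One-line analysis that mon_status could include in its output.
  std::string summary() const {
    assert(failed_.size() <= peers_.size());  // failed_ only holds peers
    if (failed_.empty()) {
      return "ok";
    }
    if (failed_.size() == peers_.size()) {
      return "getting connect errors for every monitor";
    }
    std::string out = "lost connection to";
    for (const auto& kv : failed_) {
      out += " " + kv.first;
    }
    return out;
  }

private:
  std::set<std::string> peers_;        // names of the other monitors
  std::map<std::string, int> failed_;  // peer name -> last errno seen
};
```

State is bounded by the number of monitors, and the summary distinguishes a single unreachable peer from the local monitor being cut off from everyone.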