Bug #23320
Closed
OSD commits suicide because of a firewall rule but reports a received signal
Description
We (leseb & I) had an issue where the OSD crashed with the following messages:
2018-03-08 14:30:26.042607 7f6142b7a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2018-03-08 14:30:26.042623 7f6142b7a700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
2018-03-08 14:30:26.042626 7f6142b7a700 -1 osd.9 4733 * Got signal Interrupt *
2018-03-08 14:30:26.042635 7f6142b7a700 -1 osd.9 4733 shutdown
This message is pretty misleading, as it reports an unknown task name with PID 0.
To better understand the state of the siginfo structure, the following patch has been applied:
diff --git a/src/global/signal_handler.cc b/src/global/signal_handler.cc
index d4099e1..365c148 100644
--- a/src/global/signal_handler.cc
+++ b/src/global/signal_handler.cc
@@ -298,6 +298,10 @@ struct SignalHandler : public Thread {
<< " from " << " PID: " << siginfo->si_pid
<< " task name: " << task_name
<< " UID: " << siginfo->si_uid
+ << " STATUS: " << siginfo->si_status
+ << " ERRNO: " << siginfo->si_errno
+ << " CODE: " << siginfo->si_code
+ << " VALUE: " << siginfo->si_value.sival_int
<< dendl;
handlers[signum]->handler(signum);
}
All of the added siginfo fields were reported as 0.
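For comparison, here is a minimal sketch of what a signal that really is sent with kill() looks like to a handler installed with SA_SIGINFO: POSIX specifies that si_code is SI_USER and si_pid is the sender's PID, so a non-zero PID would be expected. The names capture_handler, send_self_signal_and_capture, g_seen, g_code and g_pid are hypothetical and not part of Ceph; the sketch only illustrates the siginfo fields, not the Ceph signal handler thread.

```cpp
#include <csignal>
#include <unistd.h>

// Hypothetical globals holding what the handler observed.
static volatile sig_atomic_t g_seen = 0;
static volatile sig_atomic_t g_code = 0;
static volatile sig_atomic_t g_pid = 0;

// SA_SIGINFO-style handler: record si_code and si_pid only.
static void capture_handler(int, siginfo_t* info, void*) {
  g_code = info->si_code;
  g_pid = info->si_pid;
  g_seen = 1;
}

// Installs the handler for SIGUSR1, sends SIGUSR1 to the current
// process with kill(), and returns true once the handler has run.
bool send_self_signal_and_capture() {
  struct sigaction sa {};
  sa.sa_sigaction = capture_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, nullptr) != 0) {
    return false;
  }
  g_seen = 0;
  if (kill(getpid(), SIGUSR1) != 0) {
    return false;
  }
  // Bounded wait in case the signal is delivered to another thread.
  for (int i = 0; i < 1000 && !g_seen; ++i) {
    usleep(1000);
  }
  return g_seen == 1;
}
```

After send_self_signal_and_capture() returns true, g_code should equal SI_USER and g_pid should equal getpid(), which is unlike the all-zero values seen in the OSD log.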
So in this case I see two different issues:
- the OSD crashes because of a firewall rule but treats this as an external signal (perhaps one triggered by pthread_kill(), raise(), abort() or alarm())
- the message isn't appropriate
In this case, if si_code is set to SI_USER and si_pid is set to 0, we should report a different message saying that the OSD is committing suicide, rather than reporting that someone killed us.
This bug points to a more general issue: depending on the combination of si_code and signal number, a different message should be printed so that the log carries relevant information.
I also suggest applying that diff to give more context regarding the signal received.