Bug #23320

closed

OSD suicide itself because of a firewall rule but reports a received signal

Added by Anonymous about 6 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We (leseb & I) hit an issue where the OSD crashed with the following messages:

2018-03-08 14:30:26.042607 7f6142b7a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2018-03-08 14:30:26.042623 7f6142b7a700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
2018-03-08 14:30:26.042626 7f6142b7a700 -1 osd.9 4733 * Got signal Interrupt *
2018-03-08 14:30:26.042635 7f6142b7a700 -1 osd.9 4733 shutdown

This message is misleading: it reports an unknown task name with PID 0.

To better understand the state of the siginfo structure, the following patch was applied:

diff --git a/src/global/signal_handler.cc b/src/global/signal_handler.cc
index d4099e1..365c148 100644
--- a/src/global/signal_handler.cc
+++ b/src/global/signal_handler.cc
@@ -298,6 +298,10 @@ struct SignalHandler : public Thread {
                   << " from " << " PID: " << siginfo->si_pid
                   << " task name: " << task_name
                   << " UID: " << siginfo->si_uid
+                  << " STATUS: " << siginfo->si_status
+                  << " ERRNO: " << siginfo->si_errno
+                  << " CODE: " << siginfo->si_code
+                  << " VALUE: " << siginfo->si_value.sival_int
                   << dendl;
              handlers[signum]->handler(signum);
            }

All of the added fields were reported as 0.

So in this case I see two different topics:
- the OSD shuts itself down because of a firewall rule but treats this as an external signal (it may have been triggered by pthread_kill(), raise(), abort() or alarm())
- the message isn't appropriate

In this case, if si_code is set to SI_USER and si_pid is set to 0, we should print a different message saying that the OSD is committing suicide, rather than reporting that someone killed us.
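A minimal sketch of that check, for illustration only. The helper name describe_signal_origin and the message wording are hypothetical and are not taken from Ceph or from the eventual fix. Treating "SI_USER with PID 0" as a self-inflicted shutdown is the heuristic proposed in this report, not a guarantee given by the kernel.

```cpp
#include <cassert>
#include <signal.h>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Ceph): builds the log line for a
// received signal from the siginfo the handler was given.
std::string describe_signal_origin(int signum, const siginfo_t* siginfo) {
  assert(siginfo != nullptr);
  std::ostringstream out;
  if (siginfo->si_code == SI_USER && siginfo->si_pid == 0) {
    // Assumption from the report: no sender PID means the signal was
    // queued by the daemon itself, so do not blame an external process.
    out << "signal " << signum
        << " carries no sender information (si_code SI_USER, PID 0);"
        << " it was probably raised inside this process";
  } else {
    out << "received signal " << signum
        << " from PID: " << siginfo->si_pid
        << " UID: " << siginfo->si_uid;
  }
  return out.str();
}
```

With a handler like the one in the diff above, the returned string would replace the unconditional "received signal ... from PID" line.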

This bug points to a more general issue: depending on the combination of si_code and signal number, a different message should be printed so that the log carries relevant information.
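One possible shape for such a matrix, as a rough sketch. The function name explain_si_code is hypothetical, and only a handful of POSIX-defined codes are covered; a real implementation would need many more entries and platform-specific codes.

```cpp
#include <signal.h>
#include <string>

// Hypothetical helper (not Ceph code): maps a (signal number, si_code)
// pair to a short human-readable explanation.
std::string explain_si_code(int signum, int si_code) {
  // Codes that can accompany any signal.
  switch (si_code) {
    case SI_USER:  return "sent with kill() or a similar call";
    case SI_QUEUE: return "sent with sigqueue()";
    case SI_TIMER: return "generated by a POSIX timer expiring";
    default: break;
  }
  // Codes whose meaning depends on the signal number.
  switch (signum) {
    case SIGSEGV:
      if (si_code == SEGV_MAPERR) return "address not mapped to an object";
      if (si_code == SEGV_ACCERR) return "invalid permissions for mapped object";
      break;
    case SIGFPE:
      if (si_code == FPE_INTDIV) return "integer divide by zero";
      break;
    case SIGCHLD:
      if (si_code == CLD_EXITED) return "child has exited";
      break;
    default: break;
  }
  return "unrecognized si_code " + std::to_string(si_code);
}
```

The generic codes are checked first because they apply to every signal, while values such as SEGV_MAPERR and CLD_EXITED only make sense once the signal number is known.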

I also suggest applying that diff to give more context regarding the signal received.


Files

smssecure-2018-03-08-180723.jpg (247 KB), added by Anonymous, 03/12/2018 11:08 AM
patch.txt (714 Bytes), siginfo patch, added by Anonymous, 03/12/2018 01:07 PM
Actions #1

Updated by Anonymous about 6 years ago

Actions #2

Updated by Anonymous about 6 years ago

I'm attaching the patch for better readability.

Actions #3

Updated by Anonymous about 6 years ago

I used this URL https://www.mkssoftware.com/docs/man5/siginfo_t.5.asp#Signal_Codes to get a better understanding of the siginfo fields and their meaning.

Actions #4

Updated by Greg Farnum about 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #5

Updated by Anonymous about 6 years ago

Can I have some input on this topic? I can make the PR, but I'd love to have your opinion on it.

Thx,

Actions #6

Updated by Kefu Chai about 6 years ago

  • Description updated (diff)
Actions #7

Updated by Greg Farnum about 6 years ago

  • Status changed from New to Fix Under Review

github.com/ceph/ceph/pull/21000

Actions #8

Updated by Mykola Golub about 5 years ago

  • Status changed from Fix Under Review to Resolved
  • Target version deleted (v12.2.0)
  • Pull request ID set to 21000