Bug #50647: common: the fault handling becomes inoperational when multiple faults happen the same time - RADOS - Ceph

common: the fault handling becomes inoperational when multiple faults happen the same time

Added by Radoslaw Zarzynski about 3 years ago. Updated about 3 years ago.

Status: Fix Under Review
Priority: Normal
Assignee: Radoslaw Zarzynski
Source: Development
Severity: 3 - minor
Pull request ID: 41154

Description

The problem arises from installing the fault handlers with the SA_RESETHAND flag, which instructs the kernel to restore the default handler for a signal upon entry to our handler. Unfortunately, when more than one fault happens at the same time (which may occur when e.g. two `tp_osd_tp` threads run into the same buggy path), the second fault is handled by the default handler, which may terminate the process while our original handler is still executing for the first fault.
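The reset-on-entry behaviour can be observed in isolation with a minimal sketch. This is not Ceph's handler code: the names `demo_handler` and `default_restored_during_handler` are hypothetical, and the harmless SIGUSR1 stands in for a real fault signal so that the example can run safely. The handler queries its own disposition while it is still running. The result described in the note below is what Linux is expected to produce.

```cpp
#include <signal.h>
#include <cassert>

// Hypothetical demo state; set from inside the signal handler.
static volatile sig_atomic_t g_default_restored_in_handler = 0;

// Hypothetical handler: checks what the disposition of `signum` is
// *while the handler is still executing*.
static void demo_handler(int signum) {
  struct sigaction cur;
  // sigaction() is async-signal-safe; a null `act` only queries.
  if (sigaction(signum, nullptr, &cur) == 0 && cur.sa_handler == SIG_DFL) {
    g_default_restored_in_handler = 1;
  }
}

// Installs demo_handler for SIGUSR1 with the given sa_flags, raises the
// signal once, and reports whether the default disposition was already
// in place during the handler's execution.
bool default_restored_during_handler(int flags) {
  struct sigaction sa = {};
  sa.sa_handler = demo_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = flags;

  struct sigaction old;
  g_default_restored_in_handler = 0;
  if (sigaction(SIGUSR1, &sa, &old) != 0) {
    return false;
  }
  raise(SIGUSR1);                     // handler runs before raise() returns
  sigaction(SIGUSR1, &old, nullptr);  // put the previous disposition back
  return g_default_restored_in_handler == 1;
}
```

Calling `default_restored_during_handler(SA_RESETHAND)` is expected to return true, and `default_restored_during_handler(0)` false. In the first case any further occurrence of the signal that arrives while the handler is still running gets the default action, which for fault signals such as SIGSEGV means terminating the process.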

The following instrumentation can be used to demonstrate the issue:

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefbd..cde46776d53 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -6617,6 +6617,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
       ++ctx->num_write;
       result = 0;
       { // write
+        *((int*)((int)ceph_gettid() % 0x42)) = 0xdeadbeef;
         __u32 seq = oi.truncate_seq;
        tracepoint(osd, do_osd_op_pre_write, soid.oid.name.c_str(), soid.snap.val, oi.size, seq, op.extent.offset, op.extent.length, op.extent.truncate_size, op.extent.truncate_seq);
        if (op.extent.length != osd_op.indata.length()) {