Bug #50647

open

common: fault handling becomes inoperative when multiple faults happen at the same time

Added by Radoslaw Zarzynski about 3 years ago. Updated about 3 years ago.

Status: Fix Under Review
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The problem arises because the fault handlers are installed with the SA_RESETHAND flag, which instructs the kernel to restore the default disposition for a signal upon entry to its handler. Unfortunately, when more than one fault happens at the same time (e.g. when two `tp_osd_tp` threads run into the same buggy path), the second fault is handled by the default action, which may terminate the process while our original handler is still executing.
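The reset-on-entry behaviour described above can be observed in isolation. The following is a minimal sketch, not Ceph's actual handler code: it assumes a POSIX system, uses the harmless SIGUSR1 in place of a real fault signal, and the names `handler`, `reset_on_entry` and `disposition_reset_on_entry` are made up for illustration. The handler queries its own disposition and records whether it has already been reverted to SIG_DFL.

```cpp
#include <signal.h>

// Hypothetical flag: set to 1 by the handler if the disposition of the
// signal was already SIG_DFL when the handler started running.
static volatile sig_atomic_t reset_on_entry = 0;

static void handler(int signum) {
  struct sigaction cur;
  // sigaction() is async-signal-safe; passing nullptr as the new action
  // only queries the current disposition without changing it.
  if (sigaction(signum, nullptr, &cur) == 0 && cur.sa_handler == SIG_DFL) {
    reset_on_entry = 1;
  }
}

// Installs `handler` for SIGUSR1 with SA_RESETHAND, raises the signal once,
// and reports whether the handler saw the default disposition in place.
bool disposition_reset_on_entry() {
  reset_on_entry = 0;

  struct sigaction sa = {};
  sa.sa_handler = handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND;  // the flag discussed in this ticket
  if (sigaction(SIGUSR1, &sa, nullptr) != 0) {
    return false;
  }

  // raise() returns only after the handler has run in this thread.
  raise(SIGUSR1);
  return reset_on_entry == 1;
}
```

If the function returns true, a second delivery of the same signal while `handler` is still running (for instance in another thread) would be served by the default action rather than by `handler`, which is the window this ticket describes.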

The following instrumentation can be used to demonstrate the issue:

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefbd..cde46776d53 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -6617,6 +6617,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
       ++ctx->num_write;
       result = 0;
       { // write
+        *((int*)((int)ceph_gettid() % 0x42)) = 0xdeadbeef;
         __u32 seq = oi.truncate_seq;
        tracepoint(osd, do_osd_op_pre_write, soid.oid.name.c_str(), soid.snap.val, oi.size, seq, op.extent.offset, op.extent.length, op.extent.truncate_size, op.extent.truncate_seq);
        if (op.extent.length != osd_op.indata.length()) {

#1

Updated by Radoslaw Zarzynski about 3 years ago

  • Status changed from New to Fix Under Review

#3

Updated by Neha Ojha about 3 years ago

  • Pull request ID set to 41154