Bug #50647
opencommon: the fault handling becomes inoperative when multiple faults happen at the same time
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
The problem arises from installing the fault handlers with the `SA_RESETHAND` flag. The flag instructs the kernel to restore the default handler for a signal upon entry to its handler. Unfortunately, when more than one fault happens at the same time (e.g. when two `tp_osd_tp` threads run into the same buggy path), the default handler may interrupt and terminate the process while our original handler is still executing.
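The reset-on-entry behaviour can be observed in isolation with a small sketch. It is not taken from the Ceph sources: `demo_handler` and `disposition_reset_during_handler` are hypothetical names, and `SIGUSR1` stands in for a real fault signal such as `SIGSEGV` so that the program can continue afterwards. The handler queries its own disposition, which on Linux is already `SIG_DFL` when `SA_RESETHAND` was used; a second signal of the same kind arriving in another thread at that moment would therefore get the default action.

```cpp
#include <signal.h>

// Set by the handler when the disposition was already SIG_DFL on entry.
static volatile sig_atomic_t reset_on_entry = 0;

// Hypothetical handler: only inspects the current disposition of its signal.
static void demo_handler(int signum) {
  struct sigaction cur;
  // sigaction() is async-signal-safe; a null "new action" only queries.
  if (sigaction(signum, nullptr, &cur) == 0 && cur.sa_handler == SIG_DFL) {
    reset_on_entry = 1;
  }
}

// Installs demo_handler with the given flags, raises the signal once in the
// calling thread, and reports whether the disposition had been reset to the
// default while the handler was running.
bool disposition_reset_during_handler(int signum, int flags) {
  struct sigaction sa = {};
  sa.sa_handler = demo_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = flags;

  reset_on_entry = 0;
  if (sigaction(signum, &sa, nullptr) != 0) {
    return false;
  }
  raise(signum);            // handler runs before raise() returns
  signal(signum, SIG_DFL);  // leave the process in a known state
  return reset_on_entry == 1;
}
```

Calling `disposition_reset_during_handler(SIGUSR1, SA_RESETHAND)` is expected to report the reset, while passing `0` as the flags keeps the handler installed for the whole duration of the call.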
The following instrumentation can be used to demonstrate the issue:
diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefbd..cde46776d53 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -6617,6 +6617,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
       ++ctx->num_write;
       result = 0;
       { // write
+        *((int*)((int)ceph_gettid() % 0x42)) = 0xdeadbeef;
         __u32 seq = oi.truncate_seq;
         tracepoint(osd, do_osd_op_pre_write, soid.oid.name.c_str(), soid.snap.val, oi.size, seq, op.extent.offset, op.extent.length, op.extent.truncate_size, op.extent.truncate_seq);
         if (op.extent.length != osd_op.indata.length()) {
Updated by Radoslaw Zarzynski about 3 years ago
- Status changed from New to Fix Under Review
Updated by Radoslaw Zarzynski about 3 years ago
Just for the record: https://gist.github.com/rzarzynski/eb21e48a4458b593912eccd50ab8da46.