Bug #10541

objecter memory corruption

Added by Jason Scheck about 9 years ago. Updated about 9 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: librados
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With ceph 0.90, we have been fighting infrequent memory corruption with our client programs using librados. Unfortunately, our client is complex enough that running it with a memory checker can be difficult.

However, I did get a test program to crash fairly reliably when linked with Electric Fence. The crash occurs when the OSD on another system is shut down, and appears to be an access through a pointer to a freed object.

The stack trace at the time of the crash is:

(gdb) where
#0 Objecter::_op_submit_with_budget (this=0x7fff9cbbdaa0, op=0x7fffb92a6e08, lc=...,
    ctx_budget=<optimized out>) at osdc/Objecter.cc:1722
#1 0x00007ffff3f5ac9d in Objecter::op_submit (this=0x7fff9cbbdaa0, op=0x7fffb92a6e08,
    ctx_budget=0x0) at osdc/Objecter.cc:1696
#2 0x00007ffff3f2f4dd in librados::IoCtxImpl::operate (this=0x7fffc0f3de58, oid=...,
    o=0x7fffc47d15f0, pmtime=<optimized out>, flags=0) at librados/IoCtxImpl.cc:514
#3 0x00007ffff3f35618 in librados::IoCtxImpl::write_full (this=0x7fffc0f3de58,
    oid=..., bl=...) at librados/IoCtxImpl.cc:469
#4 0x00007ffff3f0526a in librados::IoCtx::write_full (this=0x7fffc0f49fe0, oid=...,
    bl=...) at librados/librados.cc:1056
#5 0x000000000041def3 in RadosIoContext::write_full (this=0x7fffc0f49fd0, oid=...,
    bl=...) at ../include/jrados.h:204

Since Electric Fence uses guard bands, "*op" appears to be zero here, but it probably would not be zero without Electric Fence.


Related issues

Duplicates Ceph - Bug #10340: "[ FAILED ] LibRadosIo.ReadTimeout" in upgrade:dumpling-firefly-x:parallel-giant-distro-basic-vps run Resolved 12/16/2014

History

#1 Updated by Jason Scheck about 9 years ago

With more testing, I get crashes in this section even without an OSD going down. The following patch seems to mostly close the race, and is in line with what the rest of the code is doing:

--- ceph-0.90/src/osdc/Objecter.cc.orig 2015-01-14 16:02:53.384279182 -0800
+++ ceph-0.90/src/osdc/Objecter.cc 2015-01-14 16:03:41.365094389 -0800
@@ -1715,15 +1715,16 @@
}
}

- ceph_tid_t tid = _op_submit(op, lc);
+ if (op->tid == 0)
+ op->tid = last_tid.inc();

if (osd_timeout > 0) {
Mutex::Locker l(timer_lock);
- op->ontimeout = new C_CancelOp(tid, this);
+ op->ontimeout = new C_CancelOp(op->tid, this);
timer.add_event_after(osd_timeout, op->ontimeout);
}

- return tid;
+ return _op_submit(op, lc);
}

ceph_tid_t Objecter::_op_submit(Op *op, RWLock::Context& lc)
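
The ordering the patch aims for can be illustrated with a small standalone sketch. This is not Ceph code: ToyOp, ToyObjecter, submit_with_timeout and fire_timeouts are hypothetical names, and the sketch assumes the submit step may complete and free the op before it returns. Everything that needs the op (assigning the tid, registering the timeout) happens before the handoff, and nothing dereferences the op afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins only; these are not the real Objecter types.
struct ToyOp {
  uint64_t tid = 0;
  std::function<void()> ontimeout;  // cancel callback, set before submit
};

class ToyObjecter {
 public:
  // Assign the tid and register the timeout while we still own the op,
  // then hand the op off as the very last step.
  uint64_t submit_with_timeout(std::unique_ptr<ToyOp> op) {
    assert(op != nullptr);
    if (op->tid == 0)
      op->tid = ++last_tid_;
    const uint64_t tid = op->tid;  // copy out before ownership moves
    op->ontimeout = [this, tid] { cancelled_.push_back(tid); };
    timers_.push_back(op->ontimeout);
    submit(std::move(op));  // the op may be destroyed inside here
    return tid;             // uses the copy, never the op
  }

  // Run and clear every registered timeout callback.
  void fire_timeouts() {
    for (auto& t : timers_) t();
    timers_.clear();
  }

  const std::vector<uint64_t>& cancelled() const { return cancelled_; }

 private:
  // Models a submit path where the op completes (and is freed) at once.
  void submit(std::unique_ptr<ToyOp> op) { op.reset(); }

  uint64_t last_tid_ = 0;
  std::vector<std::function<void()>> timers_;
  std::vector<uint64_t> cancelled_;
};
```

In this toy, the cancel callback captures the tid by value, so it stays valid even though the op itself is already gone by the time the timeout fires.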

#2 Updated by Saurav Sengupta about 9 years ago

I am working on the same project as Jason. His patch above seems to address the crash, but now all reads come back with the right size and no actual data (all bytes set to 0). We are using timeouts, and the patch changes the behavior so that op->ontimeout is already set when _op_submit is called. _op_submit in turn calls _send_op, which has this code from #9582:


if (op->outbl &&
    op->ontimeout == NULL && // only post rx_buffer if no timeout; see #9582
    op->outbl->length()) {
  ldout(cct, 20) << " posting rx buffer for " << op->tid << " on " << con << dendl;
  op->con = con;
  op->con->post_rx_buffer(op->tid, *op->outbl);
}

Since op->ontimeout is no longer NULL, post_rx_buffer is not getting called. From the comments this should only be a performance optimization change, but the end result is that the buffers are not getting filled in with the retrieved data.
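
A toy model of the symptom, as a rough sketch only. ToyRead, toy_send and toy_handle_reply are hypothetical names, and the copy_fallback flag is an assumption standing in for whatever path is supposed to fill the caller's buffer when no rx buffer was posted; this does not claim to show what librados actually does on that path. It only shows that a preallocated, zero-filled buffer keeps its size and its zeros if neither path writes into it.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a read op; not a real Objecter type.
struct ToyRead {
  std::string* outbl = nullptr;  // caller's preallocated buffer
  bool has_timeout = false;      // stands in for op->ontimeout != NULL
  bool rx_posted = false;        // set when the rx buffer is posted
};

// Mirrors the shape of the condition quoted above: the rx buffer is
// posted only when there is a non-empty buffer and no timeout.
void toy_send(ToyRead& op) {
  if (op.outbl && !op.has_timeout && !op.outbl->empty())
    op.rx_posted = true;
}

// With a posted buffer the payload lands in place. Without one, the
// payload only reaches the caller if a fallback copy happens
// (copy_fallback is an assumption of this sketch).
void toy_handle_reply(ToyRead& op, const std::string& payload,
                      bool copy_fallback) {
  if (!op.outbl) return;
  if (op.rx_posted || copy_fallback)
    op.outbl->assign(payload);
}
```

With has_timeout set and copy_fallback false, the buffer passed in comes back unchanged, which matches the "right size, all zeros" reads described above.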

#3 Updated by Sage Weil about 9 years ago

  • Subject changed from Client crash when OSD goes down. to objecter memory corruption
  • Source changed from other to Community (dev)

#4 Updated by Samuel Just about 9 years ago

  • Status changed from New to Duplicate

I ended up with a somewhat similar patch while tracking down #10340; I think this is the same bug.
