Bug #666

OSD: ReplicatedPG::eval_repop

Added by Wido den Hollander over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Category: OSD
Start date: 12/21/2010
% Done: 0%
Regression: No
Severity: 3 - minor
Description

On my 'noisy' machine I wanted to see how fast writing a 10GB file would go, but this caused some crashes.

This machine hosts 4 OSDs, each with a 200MB journal on a tmpfs.

root@noisy:~# dd if=/dev/zero bs=1024k count=10240|pv > /mnt/ceph/10GB.bin
 9.8GB 0:06:47 [31.5MB/s] [                                    <=> 

The write stalled there and didn't go any further.

Around 3~4GB into the write the OSDs started to crash; eventually 3 out of 4 OSDs crashed with the same backtrace:

2010-12-21 10:38:51.771608 7fe3e3eb2700 filestore(/var/lib/ceph/osd.0) async snap create 'snap_5545' transid 79 got 0 Success
2010-12-21 10:38:51.771673 7fe3e3eb2700 filestore(/var/lib/ceph/osd.0)  waiting for transid 79 to complete
2010-12-21 10:38:51.771699 7fe3de3a4700 filestore(/var/lib/ceph/osd.0) queue_op 0x44986c0 seq 5546 1306 bytes   (queue has 1 ops and 1306 bytes)
2010-12-21 10:38:51.771731 7fe3de3a4700 osd0 17 dequeue_op 0x413eb80 finish
2010-12-21 10:38:51.771747 7fe3de3a4700 osd0 17 dequeue_op osd_sub_op_reply(client4104.1:890 0.51 10000000000.00000373/head [] ondisk = 0) v1 pg pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean], 28 more pending
2010-12-21 10:38:51.771795 7fe3de3a4700 osd0 17 take_peer_stat peer osd1 stat(2010-12-21 10:38:51.014427 oprate=2.67588 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-12-21 10:38:51.771835 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean] repop_ack rep_tid 478 op osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[]) result 0 ack_type 4 from osd1
2010-12-21 10:38:51.771863 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean] eval_repop repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[])) wants=d
2010-12-21 10:38:51.771888 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean]  sending commit on repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[])) 0x43838c0
2010-12-21 10:38:51.771934 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]  removing repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[]))
2010-12-21 10:38:51.771959 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]    q front is repgather(0x4819f00 applied 17'3 rep_tid=461 wfack= wfdisk=1 op=osd_op(client4104.1:833 10000000000.0000033b [write 0~4194304 [1@-1]] 0.2451 snapc 1=[]))
2010-12-21 10:38:51.771983 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]  removing repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[]))
2010-12-21 10:38:51.772004 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]    q front is repgather(0x4819f00 applied 17'3 rep_tid=461 wfack= wfdisk=1 op=osd_op(client4104.1:833 10000000000.0000033b [write 0~4194304 [1@-1]] 0.2451 snapc 1=[]))
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)':
osd/ReplicatedPG.cc:2147: FAILED assert(repop_queue.front() == repop)
 ceph version 0.24 (commit:180a4176035521940390f4ce24ee3eb7aa290632)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x852) [0x484cb2]
 2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x168) [0x4865b8]
 3: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x13c) [0x4868dc]
 4: (OSD::dequeue_op(PG*)+0x122) [0x4e4172]
 5: (ThreadPool::worker()+0x28f) [0x5dd40f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x508e1d]
 7: (Thread::_entry_func(void*)+0xa) [0x479baa]
 8: (()+0x69ca) [0x7fe3e9d2e9ca]
 9: (clone()+0x6d) [0x7fe3e89ba70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
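
For context, the assertion at osd/ReplicatedPG.cc:2147 enforces that replicated writes are retired from the PG's repop_queue strictly in submission order. Below is a minimal illustrative sketch of that invariant, not the actual Ceph source; the names RepGather and repop_queue come from the backtrace above, everything else is simplified for illustration.

#include <cassert>
#include <list>

// Illustrative only: a minimal model of the in-order completion invariant
// behind the failed assert; not the actual Ceph ReplicatedPG code.
struct RepGather {
  unsigned rep_tid;     // replication transaction id (as seen in the log)
  bool all_committed;   // all replica acks and the local commit arrived
};

struct ReplicatedPGSketch {
  std::list<RepGather*> repop_queue;   // repops in submission order

  // Called as acks/commits arrive for 'repop'.
  void eval_repop(RepGather* repop) {
    if (!repop->all_committed)
      return;                          // still waiting; leave it queued

    // Repops must retire strictly in order: the one being removed has to
    // be the oldest outstanding one. In this bug a later repop finished
    // first, so the check fired.
    assert(repop_queue.front() == repop);
    repop_queue.pop_front();
  }
};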

It's easily reproducible: just set up a similar cluster and start writing the 10GB file.

What I did notice is that write speeds dropped pretty hard as soon as my journal filled up, but I assume this is the same as #531 and #615?

I ran cdebugpack and uploaded the data to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_eval_repop. During the test, debug osd and debug filestore were at 20.

After restarting the crashed OSDs the cluster fully recovered, but the FS stayed unresponsive; a 'sync' stalled:

libceph:  tid 2530 timed out on osd1, will reset osd

But the logs didn't show that the OSD was stalled; its logs showed it was running and doing work.

I then restarted osd1 and after a few minutes the 'sync' finished.

Associated revisions

Revision e060d7a1 (diff)
Added by Sage Weil over 8 years ago

filejournal: rewrite completion handling, fix ordering on full->notfull

Rewriting the completion handling to be simpler, clearer, so that it is
easier to maintain a strict completion ordering invariant.

This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions. That same logic needs to also apply to new items
submitted during that commit interval. This was broken before, but the
simpler structure fixes it. Fixes #666.

Tested-by: Jim Schutt <>
Signed-off-by: Sage Weil <>
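
In rough terms, the change described above keeps every pending completion on a single queue in submission order and only fires completions once the journal's committed_thru has caught up, including items submitted while the previous commit was still in flight. A minimal sketch of that idea under those assumptions (simplified types; not the actual FileJournal code):

#include <cstdint>
#include <deque>
#include <functional>

// Illustrative only: a simplified journal that fires write completions
// strictly in submission order, including items submitted while a
// previous commit is still in flight.
struct JournalSketch {
  struct Entry {
    uint64_t seq;                     // journal sequence number
    std::function<void()> on_commit;  // completion callback
  };

  std::deque<Entry> pending;          // completions in submission order
  uint64_t committed_thru = 0;        // highest seq known committed

  void submit(uint64_t seq, std::function<void()> on_commit) {
    // New items always join the back of the same queue, so they cannot
    // complete ahead of items from an earlier (still committing) interval.
    pending.push_back({seq, std::move(on_commit)});
  }

  void commit_finished(uint64_t thru) {
    committed_thru = thru;
    // Drain strictly from the front; anything beyond committed_thru waits.
    while (!pending.empty() && pending.front().seq <= committed_thru) {
      pending.front().on_commit();
      pending.pop_front();
    }
  }
};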

History

#1 Updated by Wido den Hollander over 8 years ago

Forgot to mention, this was on the RC branch, commit: 180a4176035521940390f4ce24ee3eb7aa290632 (v0.24)

#2 Updated by Sage Weil over 8 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Sage Weil over 8 years ago

  • Target version set to v0.24.1

#4 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved
  • Assignee set to Sage Weil

#5 Updated by Jim Schutt over 8 years ago

Hit another instance of this on current unstable (d386ea2ea0b).

I've got logs from debug osd = 20 + debug filestore = 20, if that helps.

This was 64 clients writing 16 GiB data total to 16 OSDs, w/ 4 OSDs/server.
Each OSD had a ~500 MiB journal partition.

Here's a gdb backtrace:

(gdb) bt
#0  0x00007fdab48509dd in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x000000000065c4ce in handle_fatal_signal (signum=6) at config.cc:253
#2  <signal handler called>
#3  0x00007fdab36bb265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#4  0x00007fdab36bcd10 in abort () at abort.c:88
#5  0x00007fdab3f31cb4 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#6  0x00007fdab3f2fdb6 in ?? () from /usr/lib64/libstdc++.so.6
#7  0x00007fdab3f2fde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#8  0x00007fdab3f2feca in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9  0x0000000000646e18 in ceph::__ceph_assert_fail (assertion=0x6803e0 "repop_queue.front() == repop", file=0x67f332 "osd/ReplicatedPG.cc", line=2496, 
    func=0x682080 "void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)") at common/assert.cc:25
#10 0x00000000004c300e in ReplicatedPG::eval_repop (this=0x7fda94516c30, repop=0x7fdaa07d0e50) at osd/ReplicatedPG.cc:2496
#11 0x00000000004c3529 in ReplicatedPG::repop_ack (this=0x7fda94516c30, repop=0x7fdaa07d0e50, result=0, ack_type=4, fromosd=4, peer_lcod=...) at osd/ReplicatedPG.cc:2638
#12 0x00000000004c3960 in ReplicatedPG::sub_op_modify_reply (this=0x7fda94516c30, r=0x1cb6c20) at osd/ReplicatedPG.cc:3058
#13 0x0000000000505082 in OSD::dequeue_op (this=0x13c7460, pg=0x7fda94516c30) at osd/OSD.cc:5252
#14 0x0000000000648ed9 in ThreadPool::worker (this=0x13c7818) at common/WorkQueue.cc:44
#15 0x000000000054197d in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:113
#16 0x0000000000493877 in Thread::_entry_func (arg=0x31b7) at ./common/Thread.h:39
#17 0x00007fdab484873d in start_thread (arg=<value optimized out>) at pthread_create.c:301
#18 0x00007fdab375ef6d in clone () from /lib64/libc.so.6

#6 Updated by Sage Weil over 8 years ago

Can you post logs?

Thanks!

#7 Updated by Sage Weil over 8 years ago

  • Status changed from Resolved to In Progress

#8 Updated by Jim Schutt over 8 years ago

I have to use our managed transfer app for that; you should be getting an email about the transfer.

Let me know if you don't get it soon.

#9 Updated by Jim Schutt over 8 years ago

FWIW, I seem to be able to reproduce this at will.

I've been trying to understand what I'm seeing in the logs; it seems like the op that is (incorrectly?) at the front of the repop_queue hasn't been committed at the time removal of the other op is attempted.

I think I've several sets of logs now where that seems to be the case.
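
That matches the log excerpt in the description, where removal of rep_tid 478 is attempted while rep_tid 461 (still wfdisk) sits at the front of the queue. A hypothetical timeline against the simplified ReplicatedPGSketch from the description above (illustrative only, assuming those definitions are in scope; running it would abort on the assert, which is the point):

// Hypothetical timeline illustrating the mismatch, with rep_tid values
// taken from the log excerpt in the description.
int main() {
  ReplicatedPGSketch pg;
  RepGather early{461, false};   // still waiting for disk (wfdisk=1)
  RepGather late{478, true};     // all acks received first

  pg.repop_queue.push_back(&early);
  pg.repop_queue.push_back(&late);

  // The later repop completes first, but it is not at the queue front,
  // so the in-order invariant fires: assert(repop_queue.front() == repop).
  pg.eval_repop(&late);
  return 0;
}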

#10 Updated by Sage Weil over 8 years ago

  • Target version changed from v0.24.1 to v0.24.2

#11 Updated by Sage Weil over 8 years ago

  • Status changed from In Progress to Resolved
