Bug #666

OSD: ReplicatedPG::eval_repop

Added by Wido den Hollander over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Category: OSD
Start date: 12/21/2010
% Done: 0%
Regression: No
Severity: 3 - minor
Description

On my 'noisy' machine I wanted to see how fast writing a 10GB file would go, but this caused some crashes.

This machine hosts 4 OSDs, each with a 200MB journal on a tmpfs.

root@noisy:~# dd if=/dev/zero bs=1024k count=10240|pv > /mnt/ceph/10GB.bin
 9.8GB 0:06:47 [31.5MB/s] [                                    <=> 

The write stalled there and didn't go any further.

Around 3~4GB into the write the OSDs started to crash; eventually 3 out of 4 OSDs crashed with the same backtrace:

2010-12-21 10:38:51.771608 7fe3e3eb2700 filestore(/var/lib/ceph/osd.0) async snap create 'snap_5545' transid 79 got 0 Success
2010-12-21 10:38:51.771673 7fe3e3eb2700 filestore(/var/lib/ceph/osd.0)  waiting for transid 79 to complete
2010-12-21 10:38:51.771699 7fe3de3a4700 filestore(/var/lib/ceph/osd.0) queue_op 0x44986c0 seq 5546 1306 bytes   (queue has 1 ops and 1306 bytes)
2010-12-21 10:38:51.771731 7fe3de3a4700 osd0 17 dequeue_op 0x413eb80 finish
2010-12-21 10:38:51.771747 7fe3de3a4700 osd0 17 dequeue_op osd_sub_op_reply(client4104.1:890 0.51 10000000000.00000373/head [] ondisk = 0) v1 pg pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean], 28 more pending
2010-12-21 10:38:51.771795 7fe3de3a4700 osd0 17 take_peer_stat peer osd1 stat(2010-12-21 10:38:51.014427 oprate=2.67588 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-12-21 10:38:51.771835 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean] repop_ack rep_tid 478 op osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[]) result 0 ack_type 4 from osd1
2010-12-21 10:38:51.771863 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean] eval_repop repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[])) wants=d
2010-12-21 10:38:51.771888 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'1 active+clean]  sending commit on repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[])) 0x43838c0
2010-12-21 10:38:51.771934 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]  removing repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[]))
2010-12-21 10:38:51.771959 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]    q front is repgather(0x4819f00 applied 17'3 rep_tid=461 wfack= wfdisk=1 op=osd_op(client4104.1:833 10000000000.0000033b [write 0~4194304 [1@-1]] 0.2451 snapc 1=[]))
2010-12-21 10:38:51.771983 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]  removing repgather(0x444ce10 applied 17'5 rep_tid=478 wfack= wfdisk= op=osd_op(client4104.1:890 10000000000.00000373 [write 0~4194304 [1@-1]] 0.ba51 snapc 1=[]))
2010-12-21 10:38:51.772004 7fe3de3a4700 osd0 17 pg[0.51( v 17'6 (17'1,17'6] n=6 ec=2 les=14 12/12/2) [0,2,1] r=0 mlcod 17'4 active+clean]    q front is repgather(0x4819f00 applied 17'3 rep_tid=461 wfack= wfdisk=1 op=osd_op(client4104.1:833 10000000000.0000033b [write 0~4194304 [1@-1]] 0.2451 snapc 1=[]))
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)':
osd/ReplicatedPG.cc:2147: FAILED assert(repop_queue.front() == repop)
 ceph version 0.24 (commit:180a4176035521940390f4ce24ee3eb7aa290632)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x852) [0x484cb2]
 2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x168) [0x4865b8]
 3: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x13c) [0x4868dc]
 4: (OSD::dequeue_op(PG*)+0x122) [0x4e4172]
 5: (ThreadPool::worker()+0x28f) [0x5dd40f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x508e1d]
 7: (Thread::_entry_func(void*)+0xa) [0x479baa]
 8: (()+0x69ca) [0x7fe3e9d2e9ca]
 9: (clone()+0x6d) [0x7fe3e89ba70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
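
For context, the assertion at osd/ReplicatedPG.cc:2147 enforces that replicated writes are retired from the PG's repop_queue strictly in submission order. Below is a minimal illustrative sketch of that invariant, not the actual Ceph source; the names RepGather and repop_queue come from the backtrace above, everything else is simplified for illustration.

#include <cassert>
#include <list>

// Illustrative only: a minimal model of the in-order completion invariant
// behind the failed assert; not the actual Ceph ReplicatedPG code.
struct RepGather {
  unsigned rep_tid;     // replication transaction id (as seen in the log)
  bool all_committed;   // all replica acks and the local commit arrived
};

struct ReplicatedPGSketch {
  std::list<RepGather*> repop_queue;   // repops in submission order

  // Called as acks/commits arrive for 'repop'.
  void eval_repop(RepGather* repop) {
    if (!repop->all_committed)
      return;                          // still waiting; leave it queued

    // Repops must retire strictly in order: the one being removed has to
    // be the oldest outstanding one. In this bug a later repop finished
    // first, so the check fired.
    assert(repop_queue.front() == repop);
    repop_queue.pop_front();
  }
};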

It's easily reproducible: just set up a similar cluster and start writing the 10GB file.

What I did notice is that write speeds dropped pretty hard as soon as my journal filled up, but I assume this is the same as #531 and #615?

I ran cdebugpack and uploaded the data to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_eval_repop. During the test, debug osd and debug filestore were at 20.

After restarting the crashed OSDs the cluster fully recovered, but the FS stayed unresponsive; a 'sync' stalled:

libceph:  tid 2530 timed out on osd1, will reset osd

But the logs didn't show that the OSD was stalled; its logs showed it was running and doing work.

I then restarted osd1 and after a few minutes the 'sync' finished.

Associated revisions

Revision e060d7a1 (diff)
Added by Sage Weil over 8 years ago

filejournal: rewrite completion handling, fix ordering on full->notfull

Rewriting the completion handling to be simpler, clearer, so that it is
easier to maintain a strict completion ordering invariant.

This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions. That same logic needs to also apply to new items
submitted during that commit interval. This was broken before, but the
simpler structure fixes it. Fixes #666.

Tested-by: Jim Schutt <>
Signed-off-by: Sage Weil <>
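
In rough terms, the change described above keeps every pending completion on a single queue in submission order and only fires completions once the journal's committed_thru has caught up, including items submitted while the previous commit was still in flight. A minimal sketch of that idea under those assumptions (simplified types; not the actual FileJournal code):

#include <cstdint>
#include <deque>
#include <functional>

// Illustrative only: a simplified journal that fires write completions
// strictly in submission order, including items submitted while a
// previous commit is still in flight.
struct JournalSketch {
  struct Entry {
    uint64_t seq;                     // journal sequence number
    std::function<void()> on_commit;  // completion callback
  };

  std::deque<Entry> pending;          // completions in submission order
  uint64_t committed_thru = 0;        // highest seq known committed

  void submit(uint64_t seq, std::function<void()> on_commit) {
    // New items always join the back of the same queue, so they cannot
    // complete ahead of items from an earlier (still committing) interval.
    pending.push_back({seq, std::move(on_commit)});
  }

  void commit_finished(uint64_t thru) {
    committed_thru = thru;
    // Drain strictly from the front; anything beyond committed_thru waits.
    while (!pending.empty() && pending.front().seq <= committed_thru) {
      pending.front().on_commit();
      pending.pop_front();
    }
  }
};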

History

#1 Updated by Wido den Hollander over 8 years ago

Forgot to mention, this was on the RC branch, commit: 180a4176035521940390f4ce24ee3eb7aa290632 (v0.24)

#2 Updated by Sage Weil over 8 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Sage Weil over 8 years ago

  • Target version set to v0.24.1

#4 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved
  • Assignee set to Sage Weil

#5 Updated by Jim Schutt over 8 years ago

Hit another instance of this on current unstable (d386ea2ea0b).

I've got logs from debug osd = 20 + debug filestore = 20, if that helps.

This was 64 clients writing 16 GiB data total to 16 OSDs, w/ 4 OSDs/server.
Each OSD had a ~500 MiB journal partition.

Here's a gdb backtrace:

(gdb) bt
#0  0x00007fdab48509dd in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x000000000065c4ce in handle_fatal_signal (signum=6) at config.cc:253
#2  <signal handler called>
#3  0x00007fdab36bb265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#4  0x00007fdab36bcd10 in abort () at abort.c:88
#5  0x00007fdab3f31cb4 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#6  0x00007fdab3f2fdb6 in ?? () from /usr/lib64/libstdc++.so.6
#7  0x00007fdab3f2fde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#8  0x00007fdab3f2feca in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9  0x0000000000646e18 in ceph::__ceph_assert_fail (assertion=0x6803e0 "repop_queue.front() == repop", file=0x67f332 "osd/ReplicatedPG.cc", line=2496, 
    func=0x682080 "void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)") at common/assert.cc:25
#10 0x00000000004c300e in ReplicatedPG::eval_repop (this=0x7fda94516c30, repop=0x7fdaa07d0e50) at osd/ReplicatedPG.cc:2496
#11 0x00000000004c3529 in ReplicatedPG::repop_ack (this=0x7fda94516c30, repop=0x7fdaa07d0e50, result=0, ack_type=4, fromosd=4, peer_lcod=...) at osd/ReplicatedPG.cc:2638
#12 0x00000000004c3960 in ReplicatedPG::sub_op_modify_reply (this=0x7fda94516c30, r=0x1cb6c20) at osd/ReplicatedPG.cc:3058
#13 0x0000000000505082 in OSD::dequeue_op (this=0x13c7460, pg=0x7fda94516c30) at osd/OSD.cc:5252
#14 0x0000000000648ed9 in ThreadPool::worker (this=0x13c7818) at common/WorkQueue.cc:44
#15 0x000000000054197d in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:113
#16 0x0000000000493877 in Thread::_entry_func (arg=0x31b7) at ./common/Thread.h:39
#17 0x00007fdab484873d in start_thread (arg=<value optimized out>) at pthread_create.c:301
#18 0x00007fdab375ef6d in clone () from /lib64/libc.so.6

#6 Updated by Sage Weil over 8 years ago

Can you post logs?

Thanks!

#7 Updated by Sage Weil over 8 years ago

  • Status changed from Resolved to In Progress

#8 Updated by Jim Schutt over 8 years ago

I have to use our managed transfer app for that; you should be getting an email about the transfer.

Let me know if you don't get it soon.

#9 Updated by Jim Schutt over 8 years ago

FWIW, I seem to be able to reproduce this at will.

I've been trying to understand what I'm seeing in the logs; it seems like the op that is (incorrectly?) at the front of the repop_queue hasn't been committed at the time removal of the other op is attempted.

I think I've several sets of logs now where that seems to be the case.
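
That matches the log excerpt in the description, where removal of rep_tid 478 is attempted while rep_tid 461 (still wfdisk) sits at the front of the queue. A hypothetical timeline against the simplified ReplicatedPGSketch from the description above (illustrative only, assuming those definitions are in scope; running it would abort on the assert, which is the point):

// Hypothetical timeline illustrating the mismatch, with rep_tid values
// taken from the log excerpt in the description.
int main() {
  ReplicatedPGSketch pg;
  RepGather early{461, false};   // still waiting for disk (wfdisk=1)
  RepGather late{478, true};     // all acks received first

  pg.repop_queue.push_back(&early);
  pg.repop_queue.push_back(&late);

  // The later repop completes first, but it is not at the queue front,
  // so the in-order invariant fires: assert(repop_queue.front() == repop).
  pg.eval_repop(&late);
  return 0;
}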

#10 Updated by Sage Weil over 8 years ago

  • Target version changed from v0.24.1 to v0.24.2

#11 Updated by Sage Weil over 8 years ago

  • Status changed from In Progress to Resolved
