Bug #11027
scrub and digests: osd/ReplicatedPG.cc: 7506: FAILED assert(!i->mod_desc.empty())
Status: Closed
Description
This issue is related to #10536; it seems the problem still happens on v0.92 and v0.93, as reported here: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17499.html
Description:
The first OSD to go down was osd.8; a few minutes later, another OSD on the same host, osd.1, went down as well. I tried restarting the OSDs (osd.8 and osd.1), but that didn't work, so I marked these OSDs out of the cluster and waited for recovery to complete.
During recovery, two more OSDs went down: osd.6 on another host, and seconds later osd.0 on the same host as the first failed OSD.
Looking at the "ceph -w" output, I noticed some slow/stuck ops and decided to stop writes to the cluster. After that I restarted OSDs 0 and 6, both came back UP, and I was able to wait for recovery to finish, which completed successfully.
I realized that when the first OSD went down, the cluster was performing a deep-scrub, and I found the trace below in the logs of osd.8. Can anyone help me understand why osd.8, and the other OSDs, unexpectedly went down?
Below is the osd.8 trace:
-2> 2015-03-03 16:31:48.191796 7f91a388b700 5 -- op tracker -- seq: 2633606, time: 2015-03-03 16:31:48.191796, event: done, op: osd_op(client.3880912.0:2368430 notify.6 [watch ping cookie 140352686583296] 40.97c520d4 ack+write+known_if_redirected e4231)
-1> 2015-03-03 16:31:48.192174 7f91af8a3700 1 -- 10.32.30.11:6804/3991 <== client.3880912 10.32.30.10:0/1001424 282597 ==== ping magic: 0 v1 ==== 0+0+0 (0 0 0) 0x3333f500 con 0x1535c580
-0> 2015-03-03 16:31:48.251131 7f91a0084700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)' thread 7f91a0084700 time 2015-03-03 16:31:48.169895
osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())
ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcc86c2]
2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc]
3: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x9698ba]
4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]
5: (PG::scrub_compare_maps()+0x511) [0x90f0d1]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x204) [0x910bb4]
7: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x912c53]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7ebdd3]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcbade9]
10: (ThreadPool::WorkThread::entry()+0x10) [0xcbbfe0]
11: (()+0x6b50) [0x7f91bfe46b50]
12: (clone()+0x6d) [0x7f91be8627bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Yann Dupont about 9 years ago
Sam asked me to post a log with debugging enabled. It took half an hour; I stripped the log because it is very large. Here you have just the crash plus a few seconds leading up to it.
I can post the full log, but it is very large.
Updated by Samuel Just about 9 years ago
- Status changed from New to 7
- Assignee set to Samuel Just