Bug #11027

_scrub and digests: osd/ReplicatedPG.cc: 7506: FAILED assert(!i->mod_desc.empty())

Added by Italo Santos about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Samuel Just
Category:
OSD
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This issue relates to #10536; it seems the problem still occurs on v0.92 and v0.93, as can be seen here: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17499.html

Description:

The first OSD to go down was osd.8; a few minutes later, another OSD on the same host, osd.1, went down as well. I tried restarting the OSDs (osd.8 and osd.1), but that didn't work, so I decided to take these OSDs out of the cluster and wait for the recovery to complete.

During the recovery, two more OSDs went down: osd.6, on another host, and seconds later osd.0, on the same host where the first OSD had gone down.

Looking at the "ceph -w" output I noticed some slow/stuck ops, so I decided to stop writes to the cluster. After that I restarted OSDs 0 and 6, both came back up, and I was able to wait for the recovery to finish, which it did successfully.

I realised that when the first OSD went down the cluster was performing a deep-scrub, and I found the trace below in the logs of osd.8. Can anyone help me understand why osd.8 (and the other OSDs) went down unexpectedly?

Below is the osd.8 trace:

-2> 2015-03-03 16:31:48.191796 7f91a388b700  5 - op tracker -- seq: 2633606, time: 2015-03-03 16:31:48.191796, event: done, op: osd_op(client.3880912.0:2368430 notify.6 [watch ping cookie 140352686583296] 40.97c520d4 ack+write+known_if_redirected e4231)
-1> 2015-03-03 16:31:48.192174 7f91af8a3700 1 - 10.32.30.11:6804/3991 <== client.3880912 10.32.30.10:0/1001424 282597 ==== ping magic: 0 v1 ==== 0+0+0 (0 0 0) 0x3333f500 con 0x1535c580
0> 2015-03-03 16:31:48.251131 7f91a0084700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)' thread 7f91a0084700 time 2015-03-03 16:31:48.169895
osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())
ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcc86c2]
2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc]
3: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x9698ba]
4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]
5: (PG::scrub_compare_maps()+0x511) [0x90f0d1]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x204) [0x910bb4]
7: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x912c53]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7ebdd3]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcbade9]
10: (ThreadPool::WorkThread::entry()+0x10) [0xcbbfe0]
11: (()+0x6b50) [0x7f91bfe46b50]
12: (clone()+0x6d) [0x7f91be8627bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
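
For context on the trace: the assertion fires in ReplicatedPG::issue_repop while the scrub path (frames 7 -> 6 -> 5 -> 4: scrub -> chunky_scrub -> scrub_compare_maps -> _scrub) submits a repair write, and it checks that every log entry attached to the repop carries a non-empty mod_desc (the per-entry modification descriptor used for rollback). The sketch below is only a minimal, self-contained illustration of that invariant; ModDesc, LogEntry and the free function issue_repop are simplified stand-ins, not Ceph's actual classes.

// Illustrative sketch only: simplified stand-ins for Ceph's pg_log_entry_t and
// ObjectModDesc, showing the invariant that the failed assert enforces.
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

struct ModDesc {                         // stand-in for ObjectModDesc (rollback info)
    std::vector<std::string> ops;
    bool empty() const { return ops.empty(); }
};

struct LogEntry {                        // stand-in for pg_log_entry_t
    std::string oid;
    ModDesc mod_desc;
};

// Stand-in for the replication step: before a transaction is sent to replicas,
// every log entry must describe how to roll its change back. An entry whose
// mod_desc was never filled in trips exactly this kind of assert.
void issue_repop(const std::vector<LogEntry>& log_entries) {
    for (auto i = log_entries.begin(); i != log_entries.end(); ++i) {
        assert(!i->mod_desc.empty());    // the failing check in the trace
    }
    std::cout << "repop issued for " << log_entries.size() << " entries\n";
}

int main() {
    LogEntry ok{"notify.6", {{"setattrs"}}};
    LogEntry bad{"scrub-repaired-object", {}};   // no rollback info recorded
    issue_repop({ok});                           // fine
    issue_repop({ok, bad});                      // aborts with a failed assert
}

Compiled with a plain g++ -std=c++11, the second call aborts in the same style as the trace above; in this crash the offending write presumably comes from the digest repair that _scrub issues (frame 4), per the ticket title.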

Files

osd_crash.log (14.2 KB) osd_crash.log Italo Santos, 03/05/2015 04:16 AM
ceph-osd.2.log-for-sam-stripped.bz2 (156 KB) ceph-osd.2.log-for-sam-stripped.bz2 Yann Dupont, 03/05/2015 09:39 PM
Actions #1

Updated by Yann Dupont about 9 years ago

Sam asked me to post a log with debug on. It took half an hour, and I stripped the log because it's very large. Here you have just the crash plus a few seconds before it.

I can post the full log, but it's very large.

Actions #2

Updated by Samuel Just about 9 years ago

  • Status changed from New to 7
  • Assignee set to Samuel Just
Actions #3

Updated by Samuel Just about 9 years ago

  • Status changed from 7 to Resolved