Bug #37777


OSD dies on assert triggered by a specific other OSD joining the cluster

Added by Peter Bortas over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Short description: In a cluster with 44 OSDs, osd.8 will always assert and die if osd.7 is part of or joins the cluster. If not, osd.8 stays up and accepts writes.

Longer description: After rsyncing about 50T to a newly created CephFS on erasure-coded BlueStore, several of the OSDs started rapidly restarting. After taking down the whole cluster and starting the OSDs up one by one, it became clear that osd.8 and osd.27 assert and die if specific other OSDs are in the cluster. For now I will only talk about findings on osd.8; osd.27 also has a low disk warning, so it's not a clean example to prod.

osd.8 logs this assert within seconds of osd.7 joining:

ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
1: (()+0x902970) [0x55eae34ce970]
2: (()+0xf5d0) [0x7f1895fe25d0]
3: (gsignal()+0x37) [0x7f1895003207]
4: (abort()+0x148) [0x7f18950048f8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f1899484802]
6: (()+0x285887) [0x7f1899484887]
7: (PGLog::rewind_divergent_log(eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x703) [0x55eae309dd93]
8: (PG::rewind_divergent_log(ObjectStore::Transaction&, eversion_t)+0x61) [0x55eae2ffdd21]
9: (PG::RecoveryState::Stray::react(MInfoRec const&)+0xa1) [0x55eae30290e1]

After dumping the core and loading it in gdb, this turns up when walking the stack:

─── Source ──────────────────────────────────────────────────────────────────────────
3614 */
3615 ++p;
3616 divergent.splice(divergent.begin(), log, p, log.end());
3617 break;
3618 }
3619 assert(p->version > newhead);
3620 }
3621 head = newhead;
3622
3623 if (can_rollback_to > newhead)
3624 can_rollback_to = newhead;
─── Stack ───────────────────────────────────────────────────────────
[8] from 0x000055eae309dd93 in rewind_from_head+1587 at /usr/src/debug/ceph-13.2.2/src/osd/osd_types.h:3619
arg newhead = {
version = 35677,
epoch = 7320,
__pad = 0
}
arg this = 0x55eaf892fe58
arg this = <optimized out>

p is unfortunately optimized out here.

Probably unrelated to my problems, but shouldn't that be "assert(p->version > newhead->version)"?
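
Digging slightly further, and with the caveat that I'm reading this code for the first time: the log entries' "version" field looks like it is itself an eversion_t, i.e. an (epoch, version) pair like the newhead dumped above, and eversion_t appears to be ordered on the whole pair. If that's right, "p->version > newhead" is already comparing two full eversions. A minimal stand-alone sketch of that shape, with simplified stand-in types rather than the real Ceph definitions:

// Simplified stand-ins, not the real Ceph types: an eversion_t-style
// (epoch, version) pair, ordered on the whole pair.
#include <cassert>
#include <cstdint>

struct eversion_like {
  uint32_t epoch;
  uint64_t version;
};

// Compare epochs first, then the per-epoch version counter.
inline bool operator>(const eversion_like &l, const eversion_like &r) {
  return l.epoch != r.epoch ? l.epoch > r.epoch : l.version > r.version;
}

// A pg_log entry's "version" field is itself a full eversion.
struct log_entry_like {
  eversion_like version;
};

int main() {
  eversion_like newhead{7320, 35677};    // the values gdb shows for newhead above
  log_entry_like entry{{7320, 35678}};   // hypothetical entry one version later
  // Same shape as the assert at osd_types.h:3619: both sides of the
  // comparison are (epoch, version) pairs, not a bare counter vs. a struct.
  assert(entry.version > newhead);
  return 0;
}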

Environment: Up-to-date CentOS 7 with the latest Mimic 13.2.2 RPMs from upstream

I'm new to Ceph, and this is the first time I've looked at the source. Hints on how to solve this would be appreciated.

#1

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
#2

Updated by Peter Bortas over 5 years ago

For the record: I can no longer reproduce this crash. I fixed the crashes on osd.27 yesterday by doing the following (roughly the commands sketched after the list):

1. taking down the OSD that killed it (osd.28)
2. draining all the data from osd.27 with "ceph osd out 27"
3. letting that complete
4. starting osd.28 again
5. letting recovery and rebalance complete
6. reintroducing osd.27 with "ceph osd in 27"
7. letting recovery and rebalance complete
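
Roughly the commands behind those steps, in case it helps anyone else. This assumes the stock systemd units (ceph-osd@<id>) from the upstream CentOS 7 RPMs, and "ceph -s" is just how I watched for recovery to settle between steps:

# steps 1-3: stop the OSD that kills osd.27, drain osd.27, and wait
systemctl stop ceph-osd@28
ceph osd out 27
ceph -s              # repeat until recovery/backfill has settled

# steps 4-5: bring osd.28 back and let recovery and rebalance complete
systemctl start ceph-osd@28
ceph -s

# steps 6-7: reintroduce osd.27 and let recovery and rebalance complete again
ceph osd in 27
ceph -s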

The thought was that after doing this I could debug osd.8 without having two sources of error in the cluster, but when I tried triggering the error it no longer happened. The only things I have left from this error are one core file and some lingering "active+clean+inconsistent" PGs that are currently being repaired.

#3

Updated by Neha Ojha over 5 years ago

  • Status changed from New to Closed