Bug #37777


OSD dies on assert triggered by a specific other OSD joining the cluster

Added by Peter Bortas over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Short description: In a cluster with 44 OSDs, osd.8 will always assert and die if osd.7 is part of or joins the cluster. If not, osd.8 stays up and accepts writes.

Longer description: After rsyncing about 50T to a newly created CephFS on erasure-coded BlueStore, several of the OSDs started rapidly restarting. After taking down the whole cluster and starting the OSDs up one by one, it became clear that osd.8 and osd.27 assert and die if specific other OSDs are in the cluster. For now I will only talk about findings on osd.8; osd.27 also has a low disk warning, so it's not a clean example to prod.

osd.8 logs this assert within seconds of osd.7 joining:

ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
1: (()+0x902970) [0x55eae34ce970]
2: (()+0xf5d0) [0x7f1895fe25d0]
3: (gsignal()+0x37) [0x7f1895003207]
4: (abort()+0x148) [0x7f18950048f8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f1899484802]
6: (()+0x285887) [0x7f1899484887]
7: (PGLog::rewind_divergent_log(eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x703) [0x55eae309dd93]
8: (PG::rewind_divergent_log(ObjectStore::Transaction&, eversion_t)+0x61) [0x55eae2ffdd21]
9: (PG::RecoveryState::Stray::react(MInfoRec const&)+0xa1) [0x55eae30290e1]

After dumping the core and loading it in gdb, this turns up when walking the stack:

─── Source ──────────────────────────────────────────────────────────────────────────
3614 */
3615 ++p;
3616 divergent.splice(divergent.begin(), log, p, log.end());
3617 break;
3618 }
3619 assert(p->version > newhead);
3620 }
3621 head = newhead;
3622
3623 if (can_rollback_to > newhead)
3624 can_rollback_to = newhead;
─── Stack ───────────────────────────────────────────────────────────
[8] from 0x000055eae309dd93 in rewind_from_head+1587 at /usr/src/debug/ceph-13.2.2/src/osd/osd_types.h:3619
arg newhead = {
version = 35677,
epoch = 7320,
__pad = 0
}
arg this = 0x55eaf892fe58
arg this = <optimized out>

p is unfortunately optimized out here.

Probably unrelated to my problems, but shouldn't that be "assert(p->version > newhead->version)"?
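
Digging slightly further, and with the caveat that I'm reading this code for the first time: the log entries' "version" field looks like it is itself an eversion_t, i.e. an (epoch, version) pair like the newhead dumped above, and eversion_t appears to be ordered on the whole pair. If that's right, "p->version > newhead" is already comparing two full eversions. A minimal stand-alone sketch of that shape, with simplified stand-in types rather than the real Ceph definitions:

// Simplified stand-ins, not the real Ceph types: an eversion_t-style
// (epoch, version) pair, ordered on the whole pair.
#include <cassert>
#include <cstdint>

struct eversion_like {
  uint32_t epoch;
  uint64_t version;
};

// Compare epochs first, then the per-epoch version counter.
inline bool operator>(const eversion_like &l, const eversion_like &r) {
  return l.epoch != r.epoch ? l.epoch > r.epoch : l.version > r.version;
}

// A pg_log entry's "version" field is itself a full eversion.
struct log_entry_like {
  eversion_like version;
};

int main() {
  eversion_like newhead{7320, 35677};    // the values gdb shows for newhead above
  log_entry_like entry{{7320, 35678}};   // hypothetical entry one version later
  // Same shape as the assert at osd_types.h:3619: both sides of the
  // comparison are (epoch, version) pairs, not a bare counter vs. a struct.
  assert(entry.version > newhead);
  return 0;
}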

Environment: Up-to-date CentOS 7 with the latest Mimic 13.2.2 RPMs from upstream

I'm new to Ceph, and this is the first time I've looked at the source. Hints on how to solve this would be appreciated.

#1

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
#2

Updated by Peter Bortas over 5 years ago

For the record: I can no longer reproduce this crash. I fixed the crashes on osd.27 yesterday by doing the following (roughly the commands sketched after the list):

1. taking down the OSD that killed it (osd.28)
2. draining all the data from osd.27 with "ceph osd out 27"
3. letting that complete
4. starting osd.28 again
5. letting recovery and rebalance complete
6. reintroducing osd.27 with "ceph osd in 27"
7. letting recovery and rebalance complete
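
Roughly the commands behind those steps, in case it helps anyone else. This assumes the stock systemd units (ceph-osd@<id>) from the upstream CentOS 7 RPMs, and "ceph -s" is just how I watched for recovery to settle between steps:

# steps 1-3: stop the OSD that kills osd.27, drain osd.27, and wait
systemctl stop ceph-osd@28
ceph osd out 27
ceph -s              # repeat until recovery/backfill has settled

# steps 4-5: bring osd.28 back and let recovery and rebalance complete
systemctl start ceph-osd@28
ceph -s

# steps 6-7: reintroduce osd.27 and let recovery and rebalance complete again
ceph osd in 27
ceph -s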

The thought was that after doing this I could debug osd.8 without having two sources of error in the cluster, but when I tried triggering the error it no longer happened. The only things I have left from this error are one core file and some lingering "active+clean+inconsistent" PGs that are currently being repaired.

#3

Updated by Neha Ojha over 5 years ago

  • Status changed from New to Closed