Bug #39978

closed

Adding OSD to Luminous Cluster will crash the active mon

Added by Henry Spanka almost 5 years ago. Updated over 4 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I recently upgraded my cluster to Luminous v12.2.11. While adding a new OSD, the active monitor crashes with "attempt to free invalid pointer". The other mons keep running, but the OSD is stuck in the "new" state. Restarting the OSD process crashes the monitor again.

Crash Log: https://pastebin.com/pMpth7dV
Binary: http://mirror.centos.org/centos/7/storage/x86_64/ceph-luminous/ceph-12.2.11-0.el7.x86_64.rpm
OSD Tree: https://pastebin.com/RZQX2zAz

I think it crashes at this point: https://github.com/ceph/ceph/blob/26dc3775efc7bb286a1d6d66faee0ba30ea23eee/src/crush/CrushWrapper.cc#L463
The OSD is added on a new node (not in the crush map yet). Could that be a problem?


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #40029: ceph-mon: Caught signal (Aborted) in (CrushWrapper::update_choose_args(CephContext*)+0x2fa) [0x7f516505614a] (Resolved)

Actions #1

Updated by Henry Spanka almost 5 years ago

Indeed, the issue is related to adding a new host to the crush map.
I fixed it by manually adding the host to the crush map first and then activating the new OSD. Consider this solved, but it would still be good to fix this bug, as it may cause unexpected downtime if a monitor fails because of it.

Commands to fix the issue:

ceph osd crush add-bucket newhost host
ceph osd crush move newhost root=default
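
For anyone who needs to repeat this for several new nodes, the workaround could be wrapped roughly as below. This is only a sketch: the function name `prepare_host_bucket` and the `CEPH` override variable are made up for illustration, and the second step uses `ceph osd crush move`, which is the documented form of the subcommand. Run it before activating any OSD on the new host.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph: pre-create the host bucket so the
# first OSD on a new node does not have to introduce the host to the crush map.
# CEPH can be overridden (e.g. CEPH="echo ceph") to preview the commands
# without touching a cluster.
CEPH="${CEPH:-ceph}"

prepare_host_bucket() {
    host="$1"
    root="${2:-default}"   # assumption: the new host belongs under root "default"

    # Create an empty bucket of type "host" for the new node.
    $CEPH osd crush add-bucket "$host" host || return 1

    # Move the new bucket under the chosen root.
    $CEPH osd crush move "$host" "root=$root" || return 1
}
```

Usage would be `prepare_host_bucket newhost` (or `prepare_host_bucket newhost myroot` for a non-default root), followed by the normal OSD activation.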

Actions #2

Updated by Greg Farnum almost 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
Actions #3

Updated by Neha Ojha almost 5 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by Greg Farnum almost 5 years ago

  • Related to Bug #40029: ceph-mon: Caught signal (Aborted) in (CrushWrapper::update_choose_args(CephContext*)+0x2fa) [0x7f516505614a] added
Actions #5

Updated by Greg Farnum over 4 years ago

  • Status changed from New to Duplicate

Closing in favor of the other since we've lost all the pastebins. :(
