Bug #45393

Containerized osd config must be updated when adding/removing mons

Added by Tim Serong 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee:
Category: cephadm
Target version:
% Done: 0%
Source:
Tags:
Backport: octopus
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Try this:

- bootstrap a cluster (1 mon, 1 mgr)
- add a bunch of osds (ceph orch apply osd --all-available-devices)
- add some more mons and mgrs (ceph orch apply mon 3 ; ceph orch apply mgr 3)

At this point, ceph.conf as seen by the osds (/var/lib/ceph/$FSID/osd.$ID/config) still only lists the first mon. If you restart any osds while that first mon is down for some reason, the osds can't join the cluster, because they don't know the other mons exist. They'll just sit there logging "monclient(hunting): authenticate timed out after 300" every five minutes until that first mon comes back.

I can think of two potential ways to address this:

1) Have cephadm mon apply go out and update every single osd config file with the new list of mons. This would of course not work completely if some osd hosts were down at the time. Also it might take a while...
2) Have the osds update their own config file automatically based on current monmaps.

This probably also needs to go into the troubleshooting docs (check mon_host in each containerized osd's individual config file).
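
As a rough illustration of that troubleshooting check, here is a minimal Python sketch that reads each osd's config under a cluster directory and reports which expected mon addresses are missing from mon_host. The function names (read_mon_host, osds_missing_mons) are hypothetical and not part of cephadm. It assumes the config files are ceph.conf-style "key = value" text, that mons are identified by IPv4 address, and that the caller supplies the expected addresses (for example, copied from the output of ceph mon dump).

```python
import re
from pathlib import Path


def read_mon_host(config_text):
    """Return the mon_host value from ceph.conf-style text, or None if absent."""
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [global].
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        # Ceph accepts both "mon host" and "mon_host" spellings.
        if sep and key.strip().replace(" ", "_") == "mon_host":
            return value.strip()
    return None


def osds_missing_mons(cluster_dir, expected_addrs):
    """Map each osd.* directory name to the expected mon addresses its config lacks.

    cluster_dir is assumed to be /var/lib/ceph/$FSID on an osd host.
    """
    missing = {}
    for cfg in sorted(Path(cluster_dir).glob("osd.*/config")):
        mon_host = read_mon_host(cfg.read_text()) or ""
        absent = []
        for addr in expected_addrs:
            # Match whole IPv4 addresses only, so 10.0.0.1 does not match 10.0.0.11.
            pattern = r"(?<![\d.])" + re.escape(addr) + r"(?![\d.])"
            if not re.search(pattern, mon_host):
                absent.append(addr)
        if absent:
            missing[cfg.parent.name] = absent
    return missing
```

Calling osds_missing_mons("/var/lib/ceph/<fsid>", [...]) on each osd host returns an empty dict when every osd config already lists all of the given mons. Any osd that appears in the result would be stuck hunting if the mons it does know about were down.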


Related issues

Related to Orchestrator - Feature #45378: cephadm: manage /etc/ceph/ceph.conf (Resolved)

History

#1 Updated by Tim Serong 7 months ago

  • Related to Feature #45378: cephadm: manage /etc/ceph/ceph.conf added

#2 Updated by Sebastian Wagner 7 months ago

  • Priority changed from Normal to High
  • Regression changed from No to Yes

This was fixed in https://github.com/ceph/ceph/pull/33855 . Looks like we have to figure out what went wrong here.

#3 Updated by Tim Serong 7 months ago

  • Assignee set to Tim Serong

Thanks for the pointer, I'll try to figure out what's going on, seeing as I'm the one who hit this :-)

#4 Updated by Tim Serong 7 months ago

A quick grep of my logs shows it reconfiguring the mons and mgrs, but not the osds.

#5 Updated by Tim Serong 7 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 34922

It's always the little things...

#6 Updated by Tim Serong 7 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to octopus

#7 Updated by Sebastian Wagner 6 months ago

  • Status changed from Pending Backport to Resolved
  • Target version set to v15.2.4
