Feature #51004

closed

cephadm agent 2.0

Added by Sebastian Wagner almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

The idea is to rebuild the cephadm agent and make it push data to the mgr.

Things to push to mgr/cephadm

  • ls - running containers
  • list-networks
  • gather-facts
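A minimal sketch of what a single push could bundle, assuming the agent wraps the output of the three cephadm commands above into one JSON document. The helper `build_payload` and the field names `host`, `ls`, `networks` and `facts` are hypothetical, not the actual wire format.

```python
import json
from typing import Any, Dict, List


def build_payload(host: str,
                  ls: List[Dict[str, Any]],
                  networks: Dict[str, Any],
                  facts: Dict[str, Any]) -> str:
    """Bundle `cephadm ls`, `cephadm list-networks` and `cephadm gather-facts`
    output into one JSON document (hypothetical format)."""
    payload = {
        "host": host,          # which cluster member sent this push
        "ls": ls,              # running containers / daemons
        "networks": networks,  # networks seen on the host
        "facts": facts,        # hardware and OS facts
    }
    # sort_keys gives deterministic output, which makes pushes easy to diff
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # made-up sample data, for illustration only
    print(build_payload("node1", [{"name": "mon.node1"}], {}, {"cpu_count": 8}))
```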

Architecture

  • systemd unit
  • non-containerized
  • cephadm agent to push to mgr/cephadm
  • cherrypy
  • client authentication
  • HTTPS certificates
  • make it mandatory? yes!
  • only cluster members for now, no client hosts
  • verify that the agent runs a binary with the same hash as the mgr's, so that we can avoid any compatibility requirements (only if this actually helps reduce the complexity)
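The client-authentication and binary-hash items could be combined into one check on the mgr side. This is a minimal sketch under assumptions the issue does not spell out: each host gets a random per-host key when the agent is deployed (a stand-in for the keyring), the agent reports the SHA-256 digest of the cephadm binary it runs, and the mgr knows the expected digest. `generate_agent_key`, `binary_digest` and `validate_push` are hypothetical names; the HTTPS certificates would sit underneath this at the transport layer.

```python
import hashlib
import hmac
import secrets
from typing import Dict


def generate_agent_key() -> str:
    """Create a random per-host secret when the agent is deployed (assumption)."""
    return secrets.token_hex(32)


def binary_digest(path: str) -> str:
    """SHA-256 of a file, read in chunks so large binaries are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def validate_push(host: str, key: str, agent_digest: str,
                  keys: Dict[str, str], expected_digest: str) -> bool:
    """Accept a push only from a known member host that presents its key
    and runs the same cephadm binary the mgr expects."""
    stored = keys.get(host)
    if stored is None:
        return False  # unknown host, e.g. a client host: rejected for now
    # constant-time comparison so the key does not leak through timing
    if not hmac.compare_digest(stored.encode(), key.encode()):
        return False
    return agent_digest == expected_digest
```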

lots of new failure modes

We need to handle all of these properly:

  • what to do when the agent has never pushed?
  • load spikes!
  • the mgr moves around: the agent needs the latest mgr endpoint
  • problem: flapping daemons
  • how do we detect offline hosts? -> timeout
  • problem: a firewall blocking HTTP connections to the mgr
  • problem: thrashing, e.g. the mgr is overloaded for 5 minutes and the agent can no longer push information to it
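One way to implement the timeout-based offline detection, while keeping "never pushed" separate from "offline", assuming the mgr records when each host's agent last pushed. `AgentTracker`, its three state names and the timeout value are hypothetical. A timeout alone cannot tell an offline host from a firewall blocking the push or an overloaded mgr, so it should sit well above the push interval.

```python
import time
from typing import Callable, Dict, Optional


class AgentTracker:
    """Track when each host's agent last pushed and classify the host."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout  # seconds without a push before "offline"
        self.clock = clock      # injectable so tests need not wait
        self.last_push: Dict[str, float] = {}

    def record_push(self, host: str) -> None:
        self.last_push[host] = self.clock()

    def state(self, host: str) -> str:
        seen: Optional[float] = self.last_push.get(host)
        if seen is None:
            return "never-pushed"  # agent never reported: not the same as offline
        if self.clock() - seen > self.timeout:
            return "offline"
        return "online"
```

A monotonic clock is used by default so that wall-clock adjustments on the mgr host do not mark hosts offline.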

race conditions

  1. the agent caches the running daemons
  2. mgr/cephadm deploys a new mgr
  3. the agent pushes outdated information to the mgr
  4. mgr/cephadm, unaware of the mgr from step 2, deploys a second new mgr

solution: Lamport clock: https://en.wikipedia.org/wiki/Lamport_timestamp
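A minimal sketch of how a Lamport clock could catch the race above. It assumes the mgr stamps each deployment on a host, the agent stamps each gather, and (for brevity) the push carries only the gather timestamp; the deployment reaching the host is modeled as a message from mgr to agent. `LamportClock` and `is_fresh` are hypothetical names, not mgr/cephadm's actual code.

```python
class LamportClock:
    """Textbook Lamport clock: a counter advanced on every event."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event or message send: advance the counter and return it."""
        self.time += 1
        return self.time

    def receive(self, remote: int) -> int:
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, remote) + 1
        return self.time


def is_fresh(gather_ts: int, last_deploy_ts: int) -> bool:
    """If the deployment happened before the gather, C(deploy) < C(gather).
    Data stamped <= C(deploy) therefore cannot reflect the deployment."""
    return gather_ts > last_deploy_ts


if __name__ == "__main__":
    mgr, agent = LamportClock(), LamportClock()
    gathered = agent.tick()    # 1. agent caches running daemons (ts 1)
    deployed = mgr.tick()      # 2. mgr/cephadm deploys a new mgr (ts 1)
    mgr.receive(gathered)      # 3. the outdated push reaches the mgr
    print(is_fresh(gathered, deployed))    # False: discarded, so step 4 is avoided
    agent.receive(deployed)    # the deployment reaches the host (ts 2)
    regathered = agent.tick()  # a fresh gather (ts 3)
    print(is_fresh(regathered, deployed))  # True
```

Lamport timestamps only guarantee C(a) < C(b) when a happened before b, not the converse. The check therefore safely discards data that cannot have observed the deployment, but a larger timestamp alone does not prove the agent saw it.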

open questions

  • should we access the config-key store directly from the MGR endpoint?

if not: race conditions?
if yes: slow REST API endpoint?

Action items

  • add mgr/cephadm endpoint
  • add a cephadm command to push results to the endpoint
  • add a cephadm command that pushes results periodically
  • replace the exporter with a cephadm agent daemon mode
  • teach mgr/cephadm to deploy the agent (including generating a keyring for each host/agent)
  • add a Lamport clock
  • simplify mgr/cephadm serve() loop
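The periodic push in daemon mode might look roughly like the loop below. It is a sketch under assumptions the issue does not state: `push_loop` is a hypothetical name, `gather` and `push` are injected callables (with network failures surfacing as `OSError`), and capped exponential backoff is one possible answer to load spikes and an overloaded mgr.

```python
import time
from typing import Any, Callable, Dict


def push_loop(gather: Callable[[], Dict[str, Any]],
              push: Callable[[Dict[str, Any]], bool],
              interval: float = 20.0,
              max_backoff: float = 300.0,
              iterations: int = -1,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Gather and push results periodically, backing off while pushes fail.

    iterations=-1 loops forever (daemon mode); a positive count is for tests.
    """
    delay = interval
    while iterations != 0:
        try:
            ok = push(gather())
        except OSError:  # e.g. connection refused or blocked by a firewall
            ok = False
        if ok:
            delay = interval                     # mgr healthy: normal cadence
        else:
            delay = min(delay * 2, max_backoff)  # mgr busy or unreachable: slow down
        sleep(delay)
        if iterations > 0:
            iterations -= 1
```

Injecting `sleep` and `iterations` makes the loop testable without waiting; in daemon mode both keep their defaults.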

Files

agent 2.0 (43.9 KB), added by Sebastian Wagner, 05/27/2021 02:23 PM

Subtasks: 7 (0 open, 7 closed)

Feature #51005: add mgr/cephadm endpoint (Resolved)
Feature #51006: add cephadm command to push results to endpoint (Resolved)
Feature #51007: add cephadm command that push results every so often (Resolved)
Feature #51008: replace exporter with cephadm agent daemon mode (Resolved)
Feature #51009: teach mgr/cephadm to deploy agent (incl generating a keyring for each host/agent) (Resolved)
Feature #51010: add lamport clock (Resolved)
Feature #51011: simplify mgr/cephadm serve() loop (Resolved)

Related issues: 1 (0 open, 1 closed)

Has duplicate Orchestrator - Bug #49079: cephadm: slow to clear CEPHADM_FAILED_DAEMON (Duplicate)

#1

Updated by Paul Cuzner almost 3 years ago

If the intent is to move to an agent architecture, why not offload the work entirely to the agent, i.e. don't just have a passive endpoint, but enable it to handle OSD creation, daemon deployments, etc.? Wouldn't this help scale too?

This is what I would expect from an agent. Otherwise, isn't agent v2 just another exporter that provides state?

#2

Updated by Ernesto Puerta almost 3 years ago

Interesting move. Just beware of the Second-System Effect (Dashboard v2 speaking :P)!

#3

Updated by Sebastian Wagner almost 3 years ago

  • Assignee set to Adam King

#4

Updated by Sebastian Wagner almost 3 years ago

  • Status changed from New to In Progress

#5

Updated by Sebastian Wagner over 2 years ago

  • Has duplicate Bug #49079: cephadm: slow to clear CEPHADM_FAILED_DAEMON added

#6

Updated by Sage Weil over 2 years ago

  • Status changed from In Progress to Resolved

#7

Updated by Ramana Raja over 2 years ago

  • Pull request ID set to 42384