Feature #51004
cephadm agent 2.0 (closed)
100%
Description
The idea is to re-build the cephadm agent and make it push data to the mgr
Things to push to mgr/cephadm
- ls - running containers
- list-networks
- gather-facts
Architecture
- systemd unit
- non-containerized
- cephadm agent to push to mgr/cephadm
- cherrypy
- client authentication
- HTTPS certificates
- make it mandatory? yes!
- just members for now. No client hosts.
- have to verify that we have the same hash of the binary. Let's avoid any compatibility requirements. Only do this if it actually helps reduce the complexity
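A minimal sketch of the push side described above, using only the Python standard library. The payload layout, the per-host `keyring` field used for client authentication, and the function names are assumptions for illustration; the ticket does not specify a wire format.

```python
import json
import ssl
import urllib.request


def build_push_request(mgr_url, hostname, keyring, data):
    """Build an HTTPS POST carrying one host's metadata (hypothetical format)."""
    body = json.dumps({
        "host": hostname,
        "keyring": keyring,                     # assumed per-host secret
        "ls": data.get("ls", []),               # running containers
        "networks": data.get("networks", {}),   # list-networks output
        "facts": data.get("facts", {}),         # gather-facts output
    }).encode("utf-8")
    return urllib.request.Request(
        mgr_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def make_ssl_context(ca_cert_path):
    """Verify the mgr's certificate against a CA file placed on the host."""
    return ssl.create_default_context(cafile=ca_cert_path)


def push(request, context, timeout=10.0):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return resp.status
```

Building the request separately from sending it keeps the payload easy to inspect without a running mgr.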
lots of new failure modes
We need to handle all of these properly:
- what to do when agent never pushed?
- load spikes!
- mgr moves around. needs latest MGR endpoint
- problem: flapping daemons
- how do we detect offline hosts? -> timeout
- problem: firewall blocking http connections to the MGR
- problem: thrashing, e.g. if the MGR is overloaded for 5 mins and the agent can no longer push information to the mgr.
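Two of the cases above (an agent that never pushed, and offline detection via timeout) can be sketched as a small mgr-side tracker. The class name, the status strings, and the injectable clock are hypothetical and not taken from mgr/cephadm.

```python
import time


class HostTracker:
    """Track when each host's agent last pushed (hypothetical helper)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def register(self, host):
        # A known host starts without any report.
        self.last_seen.setdefault(host, None)

    def record_push(self, host):
        self.last_seen[host] = self.clock()

    def status(self, host):
        seen = self.last_seen.get(host)
        if seen is None:
            return "never-reported"
        if self.clock() - seen > self.timeout_seconds:
            return "offline"
        return "online"
```

Keeping "never-reported" distinct from "offline" lets the mgr treat a freshly added host differently from one that stopped reporting.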
race conditions
- agent caches running daemons
- mgr/cephadm deploys a new mgr
- agent pushes outdated information to the mgr
- mgr/cephadm deploys a second new mgr
solution: lamport clock: https://en.wikipedia.org/wiki/Lamport_timestamp
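One way to read this is sketched below. It is a simplified logical counter in the spirit of a Lamport timestamp, not a full Lamport clock: only the mgr advances the counter, and the agent echoes the highest value it has seen. All class and method names are hypothetical.

```python
class MgrSide:
    """Mgr-side view of one host (hypothetical)."""

    def __init__(self):
        self.counter = 0        # logical time, advanced on every deploy
        self.daemons = set()

    def deploy(self, daemon):
        self.counter += 1
        self.daemons.add(daemon)
        return self.counter     # sent to the agent along with the deploy

    def handle_push(self, ack, daemons):
        # A snapshot taken before the latest deploy is stale: drop it.
        if ack < self.counter:
            return False
        self.daemons = set(daemons)
        return True


class AgentSide:
    """Agent-side cache of running daemons (hypothetical)."""

    def __init__(self):
        self.ack = 0
        self.cached = set()

    def observe(self, mgr_counter, daemons):
        # Refresh the cache and remember the newest mgr counter seen.
        self.ack = max(self.ack, mgr_counter)
        self.cached = set(daemons)

    def snapshot(self):
        return self.ack, set(self.cached)
```

In the race above, the agent's cached snapshot carries an older counter than the mgr's, so the mgr rejects it instead of concluding that the new mgr daemon is missing and deploying a second one.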
open questions
- should we access the config-key store directly from the MGR endpoint?
if not: race conditions?
if yes: slow REST API endpoint?
Action items
- add mgr/cephadm endpoint
- add cephadm command to push results to endpoint
- add cephadm command that pushes results periodically
- replace exporter with cephadm agent daemon mode
- teach mgr/cephadm to deploy agent (incl generating a keyring for each host/agent)
- add lamport clock
- simplify mgr/cephadm serve() loop
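The periodic push could be sketched as a loop with jitter and exponential backoff, which also touches on the load-spike and overloaded-MGR concerns listed earlier. The function names, default intervals, and the injectable `sleep` are assumptions for illustration only.

```python
import random
import time


def next_delay(base, failures, max_delay=300.0, jitter=0.2, rng=random):
    """Delay before the next push: exponential backoff plus random jitter."""
    delay = min(base * (2 ** failures), max_delay)
    spread = delay * jitter
    return delay + rng.uniform(-spread, spread)


def run_agent(push_once, base=10.0, sleep=time.sleep, max_iterations=None):
    """Call push_once() repeatedly, backing off while pushes keep failing."""
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        try:
            push_once()
            failures = 0
        except OSError:
            # urllib's URLError is an OSError subclass, so network
            # failures from a urllib-based push land here.
            failures += 1
        sleep(next_delay(base, failures))
        iterations += 1
```

The jitter keeps agents on many hosts from pushing in lockstep, and the backoff reduces pressure on a mgr that is already failing to answer.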
Updated by Paul Cuzner almost 3 years ago
If the intent is to move to an agent architecture, why not offload the work entirely to the agent? That is, don't just have a passive endpoint; enable it to provide OSD creation, daemon deployments, etc. Wouldn't this help scale too?
This is what I would expect from an agent. Otherwise, isn't agent v2 just another exporter that provides state?
Updated by Ernesto Puerta almost 3 years ago
Interesting move. Just beware of the Second-System Effect (Dashboard v2 speaking :P)!
Updated by Sebastian Wagner almost 3 years ago
- Status changed from New to In Progress
Updated by Sebastian Wagner over 2 years ago
- Has duplicate Bug #49079: cephadm: slow to clear CEPHADM_FAILED_DAEMON added
Updated by Sage Weil over 2 years ago
- Status changed from In Progress to Resolved