Feature #47368

Provide a daemon mode for cephadm to handle host/daemon state requests

Added by Paul Cuzner 5 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
cephadm (binary)
Target version:
% Done:

100%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Scaling the current methods for list_daemons and gather_facts to 100+ hosts presents problems both in mgr multi-threading and in the actual time it takes to gather the information from hosts. This feature allows cephadm to install itself as a systemd unit that responds to these data queries instantly. The idea is to provide a simple web service, similar to a Prometheus exporter, on each host, which proactively gathers the data and responds to requests from mgr/cephadm out of a local cache - reducing latency and providing more current insight into daemon state.
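The exporter-style design described above can be sketched roughly as follows: a background thread does the slow gathering into a local cache, and an HTTP handler serves the cache instantly. This is an illustrative sketch only - the names (gather_host_facts, PORT, SCRAPE_INTERVAL) are assumptions, not the actual cephadm implementation.

```python
import json
import sys
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = 5003            # illustrative; matches the --tcp-ports value used later in the thread
SCRAPE_INTERVAL = 10   # seconds between background refreshes

cache = {"data": {}, "scrape_time": 0.0}
lock = threading.Lock()

def gather_host_facts():
    """Placeholder for the slow work (cephadm ls, gather-facts, ...)."""
    return {"hostname": "example-host", "daemons": []}

def scraper():
    # Proactively refresh the cache so GET requests never do slow work.
    while True:
        facts = gather_host_facts()
        with lock:
            cache["data"] = facts
            cache["scrape_time"] = time.time()
        time.sleep(SCRAPE_INTERVAL)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve straight from the cache: latency is milliseconds, not seconds.
        with lock:
            body = json.dumps(cache).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__" and "--serve" in sys.argv:
    threading.Thread(target=scraper, daemon=True).start()
    HTTPServer(("", PORT), Handler).serve_forever()
```

The key property is that the cost of gathering is decoupled from the cost of answering a query, which is what makes scaling to 100+ hosts tractable.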


Related issues

Related to Orchestrator - Tasks #47369: Ceph scales to 100's of hosts, 1000's of OSDs....can orchestrator? New

History

#1 Updated by Sebastian Wagner 5 months ago

This has the potential to screw up cephadm pretty badly for all eternity, if we design the architecture wrong. We have to get the architecture and the goals right.

And we have a lot of choices to make here:

  • mgr/cephadm pulls things VS cephadm daemon pushes directly into config-key
  • containerized cephadm daemon VS native systemd service
  • optional VS mandatory
  • cephx authentication vs other auth mechanisms
  • just for gathering data or also for deploying daemons?
  • user visible vs implementation detail.

In any case, we're again re-implementing Kubernetes and/or Salt here.

#2 Updated by Juan Miguel Olmo Martínez 5 months ago

I think the target here is to provide a way to collect information about the hosts, and the daemons running on each host, in an efficient and fast way.
This information is essential for the orchestrator in order to reconcile the cluster and make decisions, and that is not possible if collecting the information takes a long time (which is what happens in a big cluster with the current implementation).

It does not seem a bad idea to turn cephadm into a systemd daemon that serves requests about host information almost instantly (a background task and a good cache can do the slow work).

This is also related to the "scale" problem in cephadm: we have opened another tracker issue to discuss this theme. See https://tracker.ceph.com/issues/47369

#3 Updated by Sebastian Wagner 5 months ago

  • Related to Tasks #47369: Ceph scales to 100's of hosts, 1000's of OSDs....can orchestrator? added

#4 Updated by Sebastian Wagner 5 months ago

Juan Miguel Olmo Martínez wrote:

I think the target here is to provide a way to collect information about the hosts, and the daemons running on each host, in an efficient and fast way.

Right.

Note that we have additional bottlenecks in cephadm:

  • Some daemons are deployed on all hosts, like ceph-crash or node-exporter. We should think about making this new daemon capable of deploying daemons as well. Otherwise we'll simply introduce the next bottleneck if we only think about gathering information.
  • What's the point in establishing SSH connections from the MGR, if we already have a daemon on all hosts?

This is also related to the "scale" problem in cephadm: we have opened another tracker issue to discuss this theme. See https://tracker.ceph.com/issues/47369

Redmine supports directly linking issues as "Related issues" - see above!

#5 Updated by Juan Miguel Olmo Martínez 5 months ago

Sebastian Wagner wrote:

  • What's the point in establishing SSH connections from the MGR, if we already have a daemon on all hosts?

mmmm.... a "cephadm tool" with a REST API interface to provide functionality to the cephadm orchestrator.... maybe it's not a bad idea :-)

#6 Updated by Sebastian Wagner 5 months ago

Juan Miguel Olmo Martínez wrote:

Sebastian Wagner wrote:

  • What's the point in establishing SSH connections from the MGR, if we already have a daemon on all hosts?

mmmm.... a "cephadm tool" with a REST API interface to provide functionality to the cephadm orchestrator.... maybe it's not a bad idea :-)

  • cephadm daemon provides a REST API
  • mgr/cephadm fetches data from all hosts
  • mgr/cephadm writes all data to config-key

Why not simplify this to:

  • cephadm daemon writes all data to config-key
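The simplification above - the per-host daemon pushing its gathered facts straight into the monitors' config-key store, so mgr/cephadm only needs a local read - could be sketched like this. The key naming scheme here is an illustrative assumption, not the real mgr/cephadm layout; `ceph config-key set` itself is a real command.

```python
import json
import socket
import subprocess

def facts_key(hostname: str) -> str:
    # Illustrative key naming scheme, not the actual mgr/cephadm layout.
    return f"mgr/cephadm/host.{hostname}.facts"

def push_facts(facts: dict) -> None:
    # 'ceph config-key set' stores an arbitrary blob in the monitors'
    # config-key store, where mgr/cephadm could later read it locally
    # instead of reaching out to every host over SSH or HTTP.
    subprocess.run(
        ["ceph", "config-key", "set",
         facts_key(socket.gethostname()), json.dumps(facts)],
        check=True,
    )
```

The trade-off is that pushing from each host requires the daemon to hold credentials for the cluster, which is one of the authentication choices listed earlier in the thread.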

#7 Updated by Paul Cuzner 5 months ago

  • % Done changed from 0 to 40

This work has a simple goal - provide the data that mgr/cephadm needs, faster, to make the data more current and the UI more responsive.

I'm tempted, too, to make cephadmd an 'agent' on each host - but at that point it's no longer a cephadm (ssh) orchestrator but a REST API one. While that could be a long-term strategy, it brings with it security and functionality concerns and would be quite impactful. Fundamentally it would be another orchestrator plugin.

What I'm proposing here is a simple means for cephadm to install itself as a systemd unit that responds to queries for host facts - daemon states, host information and anything else we need. Its scope is to act as a kind of Prometheus exporter for cephadm metadata - that's all, and because of this limited scope, issues like security, functionality etc. are not a problem.

When it comes to switching to the data provided by the "exporter", we could implement it as: try the HTTP endpoint first, and fall back to the existing cephadm invocations as plan B.
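The try-HTTP-first, fall-back-to-SSH strategy might look roughly like this. The function names and port are assumptions for illustration; fetch_via_ssh stands in for the existing mgr/cephadm SSH code path.

```python
import json
import urllib.request
from urllib.error import URLError

def fetch_via_ssh(host: str) -> dict:
    """Stand-in for the existing mgr/cephadm SSH + 'cephadm ls' path."""
    return {"source": "ssh", "host": host}

def fetch_host_state(host: str, port: int = 5003, timeout: float = 2.0) -> dict:
    # Plan A: ask the per-host exporter; a short timeout keeps the
    # orchestrator responsive when the daemon is down or unreachable.
    try:
        url = f"http://{host}:{port}/"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (URLError, OSError, ValueError):
        # Plan B: fall back to the current cephadm invocation over SSH.
        return fetch_via_ssh(host)
```

This keeps the exporter strictly optional: if it isn't deployed on a host, behaviour degrades to exactly what cephadm does today.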

Please, let's not let "scope creep" change the intent here ... this is not about a new orchestrator plugin

#8 Updated by Paul Cuzner 5 months ago

I'll push the code tomorrow as a draft for people to poke at.

At this point deployment from cephadm looks like
cephadm deploy --fsid 3b33d172-cd75-11ea-b2c9-b083fee4188b --name cephadm.rhs-srv-01 --tcp-ports 5003

and removal
cephadm --verbose rm-daemon --fsid 3b33d172-cd75-11ea-b2c9-b083fee4188b --name cephadm.rhs-srv-01

#9 Updated by Paul Cuzner 5 months ago

  • % Done changed from 40 to 60

#10 Updated by Sebastian Wagner 5 months ago

Actually this change is really impactful, as it introduces a fundamental change to the architecture of cephadm.

We really want to be able to run things like `ceph-volume inventory` from the cephadm daemon in the future. I honestly see the daemon as an extraordinarily important design change.

This is a one-way street here. The thing is, having a versatile daemon on hosts is going to make so many things so much easier:

  • Deploying all OSDs at once? Piece of cake.
  • Deploying the ceph-crash daemon on all hosts in parallel? Piece of cake.
  • Deploying the node-exporter on all hosts in parallel? Piece of cake.
  • Running ceph-volume inventory every minute or so? Piece of cake.
  • Detecting newly attached disks and running ceph-volume inventory instantly? Piece of cake.
  • Having nearly real-time updates of `cephadm ls`, far beyond what mgr/cephadm is able to query in a reasonable time frame? Piece of cake.

We're obviously not going to implement those things in the first PR, but we have to have a story of how to extend the capabilities of cephadm daemon to solve those problems.

The thing is, just providing a REST API for `cephadm ls` is on the edge of not being worth it, as instead of:

  • for each host, establish an SSH connection and run cephadm ls in 0.5 seconds, we'd now
  • for each host, establish an HTTP connection and run a GET on the REST API.

And we're not even going to solve connection timeouts! We need to be able to do more things from a cephadm daemon than just provide a REST API for `ls` in the future, to make it worth the added complexity.

#11 Updated by Paul Cuzner 5 months ago

  • % Done changed from 60 to 70

cephadm ls in 0.5 secs... on what system? :) The current ls code takes 10 secs per host in my physical lab!

Like I said, the idea of an agent is compelling - but is it cephadm? The foundation of cephadm was SSH, so moving forward with an RPC- or HTTP-based agent sounds more like a different plugin to me.

As far as including ceph-volume inventory...done.

A query to the web service responds in 15ms, passing back ceph-volume, list_daemons and host-facts.
Here's an example of the output: https://gist.github.com/pcuzner/33f1049a13ed590700ae25ab371b2173
This captures ceph_volume, list_daemons (current code), and gather-facts. Each scrape reports its scrape time (how old the data is) and the duration taken, and the state of the scraper threads is shown in the health section.

15ms per host - not worth it? In your current implementation, the ls and inventory for 100 hosts would take 30 minutes (100 × (10 + 8) seconds), so your data currency in orch and the dashboard is 30 mins old - that is not really viable. With this service, 100 hosts serially scanned takes a few seconds, and the data currency is the last scrape time, which could be 10-20 seconds old. Surely this benefits orch and the dashboard?
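As an illustration of the per-section scrape metadata described above: each section could carry its own timestamp and duration so the caller can judge data currency. The field names below are assumptions; the real structure is in the linked gist.

```python
import time

# Hypothetical payload shape; real field names are in the linked gist.
payload = {
    "health": {"scrapers": {"ceph_volume": "active",
                            "daemons": "active",
                            "host": "active"}},
    "ceph_volume": {
        "scrape_timestamp": time.time(),   # when the data was gathered
        "scrape_duration_secs": 8.2,       # how long the gather took
        "data": {},                        # ceph-volume inventory output
    },
    "daemons": {
        "scrape_timestamp": time.time(),
        "scrape_duration_secs": 0.4,
        "data": [],                        # cephadm ls output
    },
    "host": {
        "scrape_timestamp": time.time(),
        "scrape_duration_secs": 0.1,
        "data": {},                        # gather-facts output
    },
}

def data_age_secs(section: dict, now: float) -> float:
    # The consumer (orch, dashboard) can judge currency per section.
    return now - section["scrape_timestamp"]
```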

I think the idea of an agent to offload the work could be a next step for the orchestrator, and worth exploring. However, the focus here is current cephadm/ssh with minimal impact to resolve the data currency issues we have in the orchestrator and dashboard layers.

Adopting an agent-based architecture is a big thing, not least from a security standpoint, since now you need to protect the endpoint because you're allowing it to do potentially disruptive things to the cluster - which until now has been the remit of the mgr.

Personally, I would rather focus on completing cephadm, so it's more production ready, and then investigate a new plugin to offload actions to the agents running on each host.

#13 Updated by Sebastian Wagner 5 months ago

Paul, can you make a WIP PR with your current code?

#14 Updated by Joshua Schmid 5 months ago

Could you push the code to a branch, please? What Sebastian said :)

Also, what about finding a suitable timeslot to talk about the general architecture of that thing?

#15 Updated by Paul Cuzner 5 months ago

  • Pull request ID set to 37130

Take a look at https://github.com/ceph/ceph/pull/37130

It needs a rebase, but the basics are there in terms of the approach and scope.

Just to re-iterate, my goal here is simple - cache the metadata, instead of forcing mgr to do the work. That's it.

The code is simple, and should be easy to knit into the current cephadm deployment process.

#16 Updated by Paul Cuzner 4 months ago

  • % Done changed from 70 to 90

#17 Updated by Paul Cuzner about 2 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

merged

#18 Updated by Nathan Cutler about 1 month ago

  • Status changed from Closed to Resolved
