Ceph Orchestrator - Feature #47368: Provide a daemon mode for cephadm to handle host/daemon state requests
https://tracker.ceph.com/issues/47368?journal_id=174706 (2020-09-09T08:24:40Z, Sebastian Wagner)
<ul></ul><p>This has the potential to screw up cephadm pretty badly for all eternity, if we design the architecture wrong. We have to get the architecture and the goals right.</p>
<p>And we have a <strong>lot</strong> of choices to make here:</p>
<ul>
<li>mgr/cephadm pulls things VS cephadm daemon pushes directly into config-key</li>
<li>containerized cephadm daemon VS native systemd service</li>
<li>optional VS mandatory</li>
<li>cephx authentication vs other auth mechanisms </li>
<li>just for gathering data or also for deploying daemons? </li>
<li>user visible vs implementation detail.</li>
</ul>
<p>In any case we're again re-implementing kuberentes and/or Salt here.</p> Orchestrator - Feature #47368: Provide a daemon mode for cephadm to handle host/daemon state requestshttps://tracker.ceph.com/issues/47368?journal_id=1747092020-09-09T08:54:35ZJuan Miguel Olmo Martínez
<ul></ul><p>I think that the target here is to provide a way to collect information about the hosts, and the daemons running on each host, in an efficient and fast way.<br />This information is essential for the orchestrator in order to reconcile the cluster and to take decisions, and that is not possible if collecting the information takes a lot of time (which is what happens in a big cluster with the current implementation).</p>
<p>It does not seem a bad idea to convert cephadm into a systemd daemon serving requests about host information almost instantly (a background task and a good cache can do the slow work).</p>
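As a rough illustration of that "background task plus cache" idea, here is a hypothetical sketch (not the actual cephadm code; the class and field names are invented): the slow collection runs on a timer in a background thread, while requests are answered instantly from the cache.

```python
import json
import threading
import time

class HostFactCache:
    """Run slow collection in a background thread; answer requests
    instantly from an in-memory cache."""

    def __init__(self, collect_fn, interval=30):
        self._collect = collect_fn      # slow callable, e.g. gathering host facts
        self._interval = interval       # seconds between refreshes
        self._lock = threading.Lock()
        self._data = {"ts": 0.0, "facts": None}

    def _refresh_loop(self):
        while True:
            facts = self._collect()     # this may take many seconds
            with self._lock:
                self._data = {"ts": time.time(), "facts": facts}
            time.sleep(self._interval)

    def start(self):
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def get(self):
        # Instant: serve the last cached result plus its age in seconds.
        with self._lock:
            snapshot = dict(self._data)
        snapshot["age"] = time.time() - snapshot["ts"]
        return json.dumps(snapshot)
```

A request handler built on top of this only ever touches the cache, so response time is independent of how slow the underlying collection is.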
<p>This is also related to the "scale" problem in cephadm: we have opened another tracker issue to discuss this topic. See <a class="external" href="https://tracker.ceph.com/issues/47369">https://tracker.ceph.com/issues/47369</a></p>
https://tracker.ceph.com/issues/47368?journal_id=174724 (2020-09-09T11:16:12Z, Sebastian Wagner)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-5 status-3 priority-4 priority-default closed" href="/issues/47369">Tasks #47369</a>: Ceph scales to 100's of hosts, 1000's of OSDs....can orchestrator?</i> added</li></ul>
https://tracker.ceph.com/issues/47368?journal_id=174727 (2020-09-09T11:20:40Z, Sebastian Wagner)
<ul></ul><p>Juan Miguel Olmo Martínez wrote:</p>
<blockquote>
<p>I think that the target here is to provide a way to collect information of the hosts and daemons running in each host in an efficient and fast way.</p>
</blockquote>
<p>Right.</p>
<p>Note that we have additional bottlenecks in cephadm:</p>
<ul>
<li>Some daemons are deployed on all hosts, like ceph-crash or node-exporter. We should think about making this new daemon capable of deploying daemons as well. Otherwise we'll simply introduce the next bottleneck, if we <strong>only</strong> think about gathering information.</li>
</ul>
<ul>
<li>What's the point in establishing SSH connections from the MGR, if we already have a daemon on all hosts?</li>
</ul>
<blockquote>
<p>This is also related with the "scale" problem in cephadm: We have opened another track issue to discuss about this theme. See <a class="external" href="https://tracker.ceph.com/issues/47369">https://tracker.ceph.com/issues/47369</a></p>
</blockquote>
<p>Redmine supports directly linking issues as "Related issues". See above!</p>
https://tracker.ceph.com/issues/47368?journal_id=174738 (2020-09-09T14:26:24Z, Juan Miguel Olmo Martínez)
<ul></ul><p>Sebastian Wagner wrote:</p>
<blockquote>
<ul>
<li>What's the point in establishing SSH connections from the MGR, if we already have a daemon on all hosts?</li>
</ul>
</blockquote>
<p>mmmm.... a "cephadm tool" with a REST API interface to provide functionality to the cephadm orchestrator.... maybe that is not a bad idea :-)</p>
https://tracker.ceph.com/issues/47368?journal_id=174739 (2020-09-09T14:55:36Z, Sebastian Wagner)
<ul></ul><p>Juan Miguel Olmo Martínez wrote:</p>
<blockquote>
<p>Sebastian Wagner wrote:</p>
<blockquote>
<ul>
<li>What's the point in establishing SSH connections from the MGR, if we already have a daemon on all hosts?</li>
</ul>
</blockquote>
<p>mmmm.... a "cephadm tool" with a REST API interface to provide functionality to the cephadm orchestrator.... maybe that is not a bad idea :-)</p>
</blockquote>
<ul>
<li>cephadm daemon provides a REST API</li>
<li>mgr/cephadm fetches data from <strong>all</strong> hosts</li>
<li>mgr/cephadm writes all data to config-key</li>
</ul>
<p>Why not simplify this to:</p>
<ul>
<li>cephadm daemon writes all data to config-key</li>
</ul>
https://tracker.ceph.com/issues/47368?journal_id=174752 (2020-09-09T22:49:03Z, Paul Cuzner)
<ul><li><strong>% Done</strong> changed from <i>0</i> to <i>40</i></li></ul><p>This work has a simple goal: provide the data that mgr/cephadm needs faster, to make the data more current and the UI more responsive.</p>
<p>I'm tempted, too, to make cephadmd an 'agent' on each host - but at that point it's no longer a cephadm (SSH) orchestrator but a REST API one. Whilst that could be a long-term strategy, it brings with it security and functionality concerns and would be quite impactful. Fundamentally, it would be another orchestrator plugin.</p>
<p>What I'm proposing here is a simple means of cephadm installing itself as a systemd unit to respond to queries for host facts - daemon states, host information and anything else we need. Its scope is to act as a kind of Prometheus exporter for cephadm metadata - that's all, and because of this limited scope, issues like security, functionality etc. are not a problem.</p>
<p>When it comes to switching to the data provided by the "exporter", we could implement it as: try the HTTP endpoint first, and fall back to the existing cephadm invocations as plan B.</p>
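That "plan A / plan B" flow could look roughly like this (a sketch only: the endpoint path and the fallback callable are hypothetical, not the actual mgr/cephadm implementation):

```python
import json
import urllib.request

def get_host_metadata(host, ssh_fallback, port=5003, timeout=5):
    """Try the exporter's HTTP endpoint first; on any failure,
    fall back to the existing SSH-based cephadm invocation (plan B)."""
    url = f"http://{host}:{port}/v1/metadata"   # hypothetical endpoint path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read())
    except Exception:
        # plan B: the slower but known-good path used today
        return ssh_fallback(host)
```

Usage would be something like `get_host_metadata("rhs-srv-01", ssh_fallback=run_cephadm_over_ssh)`, where `run_cephadm_over_ssh` stands in for whatever the orchestrator currently does per host.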
<p>Please, let's not let "scope creep" change the intent here... this is <strong>not</strong> about a new orchestrator plugin.</p>
https://tracker.ceph.com/issues/47368?journal_id=174754 (2020-09-10T05:33:30Z, Paul Cuzner)
<ul></ul><p>I'll push the code tomorrow as a draft for people to poke at.</p>
<p>At this point, deployment from cephadm looks like:</p>
<pre>cephadm deploy --fsid 3b33d172-cd75-11ea-b2c9-b083fee4188b --name cephadm.rhs-srv-01 --tcp-ports 5003</pre>
<p>and removal:</p>
<pre>cephadm --verbose rm-daemon --fsid 3b33d172-cd75-11ea-b2c9-b083fee4188b --name cephadm.rhs-srv-01</pre>
https://tracker.ceph.com/issues/47368?journal_id=174755 (2020-09-10T05:33:41Z, Paul Cuzner)
<ul><li><strong>% Done</strong> changed from <i>40</i> to <i>60</i></li></ul>
https://tracker.ceph.com/issues/47368?journal_id=174766 (2020-09-10T08:20:00Z, Sebastian Wagner)
<ul></ul><p>Actually this change is really impactful, as this introduces a fundamental change in the architecture of cephadm.</p>
<p>We <strong>really</strong> want to be able to run things like `ceph-volume inventory` from the cephadm daemon in the future. I honestly see the daemon as an extraordinarily important design change.</p>
<p>This is a one-way street here. The thing is, having a versatile daemon on hosts is going to make so many things so much easier:</p>
<ul>
<li>Deploying all OSDs at once? Piece of cake.</li>
<li>Deploying the ceph-crash daemon on all hosts in parallel? Piece of cake.</li>
<li>Deploying the node-exporter on all hosts in parallel? Piece of cake.</li>
<li>Running ceph-volume inventory every minute or so? Piece of cake.</li>
<li>Detecting newly attached disks and running ceph-volume inventory instantly? Piece of cake.</li>
<li>Having nearly real-time updates of `cephadm ls` far beyond what mgr/cephadm is able to query in a reasonable time frame? Piece of cake.</li>
</ul>
<p>We're obviously not going to implement those things in the first PR, but we have to have a story of how to extend the capabilities of cephadm daemon to solve those problems.</p>
<p>The thing is, just providing a REST API for `cephadm ls` is on the edge of not being worth it, as instead of:</p>
<ul>
<li>for each host, establish an SSH connection and run cephadm ls (~0.5 seconds); we'd now instead</li>
<li>for each host, establish an HTTP connection and run a GET on the REST API.</li>
</ul>
<p>And we're not even going to solve connection timeouts! We need to be able to do more things from a cephadm daemon than just provide a REST API for `ls` in the future, to make it worth the added complexity.</p>
https://tracker.ceph.com/issues/47368?journal_id=174811 (2020-09-10T23:10:17Z, Paul Cuzner)
<ul><li><strong>% Done</strong> changed from <i>60</i> to <i>70</i></li></ul><p>cephadm ls in 0.5 secs... on what system? :) The current ls code takes 10 secs per host in my physical lab!</p>
<p>Like I said, the idea of an agent is compelling - but is it cephadm? The foundation for cephadm was SSH, so moving forward with an RPC- or HTTP-based agent sounds more like a different plugin to me.</p>
<p>As far as including ceph-volume inventory...done.</p>
<p>A query to the web service responds in 15 ms, passing back ceph-volume, list_daemons and host facts.<br />Here's an example of the output: <a class="external" href="https://gist.github.com/pcuzner/33f1049a13ed590700ae25ab371b2173">https://gist.github.com/pcuzner/33f1049a13ed590700ae25ab371b2173</a><br />This captures ceph_volume, list_daemons (current code), and gather-facts. Each scrape reports its scrape time (how old the data is) and the duration taken, and the state of the scraper threads is seen in the health section.</p>
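For orientation only, the described payload groups into a structure roughly like the following. The field names here are guesses inferred from the description above, not copied from the real exporter; the linked gist shows the actual output.

```python
import json

# Illustrative shape only - each scraped section carries its own
# timestamp and duration, and a health section reports thread state.
payload = {
    "ceph_volume":  {"scrape_timestamp": 0.0, "scrape_duration_secs": 0.0, "data": []},
    "list_daemons": {"scrape_timestamp": 0.0, "scrape_duration_secs": 0.0, "data": []},
    "host_facts":   {"scrape_timestamp": 0.0, "scrape_duration_secs": 0.0, "data": {}},
    "health": {"scraper_threads": {"ceph_volume": "active",
                                   "list_daemons": "active",
                                   "host_facts": "active"}},
}
print(json.dumps(payload, indent=2))
```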
<p>15 ms per host, not worth it? In your current implementation, the ls and inventory for 100 hosts would take 30 minutes (100 × (10 + 8) seconds) - so the data currency in orch and the dashboard is 30 mins old, which is not really viable. With this service, 100 hosts <strong>serially</strong> scanned takes a few seconds - and the data currency is at the last scrape time, which could be 10-20 seconds old. This benefits orch and the dashboard, surely?</p>
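The back-of-the-envelope numbers above, spelled out using the per-host timings quoted in this thread:

```python
hosts = 100
ssh_ls_secs = 10             # per-host `cephadm ls` over SSH, as quoted above
ssh_inventory_secs = 8       # per-host ceph-volume inventory over SSH
exporter_query_secs = 0.015  # 15 ms per host via the exporter

current_total = hosts * (ssh_ls_secs + ssh_inventory_secs)
exporter_total = hosts * exporter_query_secs

print(current_total / 60)   # 30.0 -> minutes with the current implementation
print(exporter_total)       # about 1.5 seconds, scanned serially
```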
<p>I think the idea of an agent to offload the work could be a next step for the orchestrator, and worth exploring. However, the focus here is current cephadm/ssh with minimal impact to resolve the data currency issues we have in the orchestrator and dashboard layers.</p>
<p>Adopting an agent-based architecture is a big thing, not least from a security standpoint, since now you need to protect the endpoint: you're allowing it to do potentially disruptive things to the cluster, which until now has been the remit of the mgr.</p>
<p>Personally, I would rather focus on completing cephadm, so it's more production-ready, and then investigate a new plugin to offload actions to the agents running on each host.</p>
https://tracker.ceph.com/issues/47368?journal_id=174821 (2020-09-11T05:24:10Z, Paul Cuzner)
<ul></ul><p>quick demo of the work so far</p>
<p><a class="external" href="https://gist.githubusercontent.com/pcuzner/574113c5976e05a78a6334901eba6319/raw/ae2f8c8f3785dffe58079e06a98da122eaa23ee4/Peek%25202020-09-11%252016-40.gif">https://gist.githubusercontent.com/pcuzner/574113c5976e05a78a6334901eba6319/raw/ae2f8c8f3785dffe58079e06a98da122eaa23ee4/Peek%25202020-09-11%252016-40.gif</a></p>
https://tracker.ceph.com/issues/47368?journal_id=174864 (2020-09-11T15:01:34Z, Sebastian Wagner)
<ul></ul><p>Paul, can you make a WIP PR with your current code?</p>
https://tracker.ceph.com/issues/47368?journal_id=174866 (2020-09-11T15:04:49Z, Joshua Schmid &lt;jschmid@suse.de&gt;)
<ul></ul><p><del>Could you push the code to a branch, please?</del> What Sebastian said :)</p>
<p>Also, what about finding a suitable timeslot to talk about the general architecture of that thing?</p>
https://tracker.ceph.com/issues/47368?journal_id=174957 (2020-09-14T07:41:49Z, Paul Cuzner)
<ul><li><strong>Pull request ID</strong> set to <i>37130</i></li></ul><p>Take a look at <a class="external" href="https://github.com/ceph/ceph/pull/37130">https://github.com/ceph/ceph/pull/37130</a></p>
<p>It needs a rebase, but the basics are there in terms of the approach and scope.</p>
<p>Just to reiterate, my goal here is simple - cache the metadata instead of forcing the mgr to do the work. That's it.</p>
<p>The code is simple, and should be easy to knit into the current cephadm deployment process.</p>
https://tracker.ceph.com/issues/47368?journal_id=175024 (2020-09-15T03:19:48Z, Paul Cuzner)
<ul><li><strong>% Done</strong> changed from <i>70</i> to <i>90</i></li></ul>
https://tracker.ceph.com/issues/47368?journal_id=181217 (2020-12-13T21:19:12Z, Paul Cuzner)
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Closed</i></li><li><strong>% Done</strong> changed from <i>90</i> to <i>100</i></li></ul><p>merged</p>
https://tracker.ceph.com/issues/47368?journal_id=181236 (2020-12-14T12:14:09Z, Nathan Cutler &lt;ncutler@suse.cz&gt;)
<ul><li><strong>Status</strong> changed from <i>Closed</i> to <i>Resolved</i></li></ul>