https://tracker.ceph.com/https://tracker.ceph.com/favicon.ico2022-11-01T20:22:41ZCeph Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2279162022-11-01T20:22:41ZDavid Heap
<ul></ul><p>We currently have this situation again after a mgr failover for node maintenance/reboots, and managed to enable debug logging.</p>
<p>It looks like the orchestrator keeps reconfiguring the service because it thinks the mgr IP has changed when it hasn't. Based on the code in serve.py here <a class="external" href="https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/serve.py#L910">https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/serve.py#L910</a>, the IP address it's getting as `last_deps` isn't updating, so it keeps trying to reconfigure the IP each time:</p>
<pre><code>
Nov 01 18:35:51 node1 ceph-mon[3901]: iscsi.iscsi.node0.ldfwch deps ['172.16.1.4'] -> ['172.16.1.6']
Nov 01 18:35:51 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node0.ldfwch (dependencies changed)...
Nov 01 18:36:03 node1 ceph-mon[3901]: iscsi.iscsi.node2.lofqpp deps ['172.16.1.5'] -> ['172.16.1.6']
Nov 01 18:36:03 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node2.lofqpp (dependencies changed)...
Nov 01 18:36:06 node1 ceph-mon[3901]: iscsi.iscsi.node0.ldfwch deps ['172.16.1.4'] -> ['172.16.1.6']
Nov 01 18:36:06 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node0.ldfwch (dependencies changed)...
Nov 01 18:37:10 node1 ceph-mon[3901]: iscsi.iscsi.node2.lofqpp deps ['172.16.1.5'] -> ['172.16.1.6']
Nov 01 18:37:10 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node2.lofqpp (dependencies changed)...
Nov 01 18:37:15 node1 ceph-mon[3901]: iscsi.iscsi.node0.ldfwch deps ['172.16.1.4'] -> ['172.16.1.6']
Nov 01 18:37:15 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node0.ldfwch (dependencies changed)...
Nov 01 18:38:19 node1 ceph-mon[3901]: iscsi.iscsi.node2.lofqpp deps ['172.16.1.5'] -> ['172.16.1.6']
Nov 01 18:38:19 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node2.lofqpp (dependencies changed)...
Nov 01 18:38:25 node1 ceph-mon[3901]: iscsi.iscsi.node0.ldfwch deps ['172.16.1.4'] -> ['172.16.1.6']
Nov 01 18:38:25 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node0.ldfwch (dependencies changed)...
Nov 01 18:39:31 node1 ceph-mon[3901]: iscsi.iscsi.node2.lofqpp deps ['172.16.1.5'] -> ['172.16.1.6']
Nov 01 18:39:31 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node2.lofqpp (dependencies changed)...
Nov 01 18:39:35 node1 ceph-mon[3901]: iscsi.iscsi.node0.ldfwch deps ['172.16.1.4'] -> ['172.16.1.6']
Nov 01 18:39:35 node1 ceph-mon[3901]: Reconfiguring iscsi.iscsi.node0.ldfwch (dependencies changed)...
</code></pre>
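<p>A minimal sketch of why a stale `last_deps` would produce the loop seen in the log above. This is a simplified model, not the cephadm code: the names `needs_reconfig`, `run_checks` and the `persist` flag are hypothetical and exist only to illustrate the comparison.</p>

```python
from typing import Dict, List


def needs_reconfig(last_deps: List[str], deps: List[str]) -> bool:
    """Return True when the recorded and freshly computed deps differ."""
    return sorted(last_deps) != sorted(deps)


def run_checks(stored: Dict[str, List[str]], daemon: str,
               current: List[str], rounds: int, persist: bool) -> int:
    """Count reconfigures over several orchestrator passes.

    `persist` models whether the new deps are recorded after a reconfigure.
    """
    count = 0
    for _ in range(rounds):
        if needs_reconfig(stored[daemon], current):
            count += 1
            if persist:
                # Healthy case: the next pass sees matching deps.
                stored[daemon] = list(current)
    return count


# Stale record: every pass sees ['172.16.1.4'] -> ['172.16.1.6'] again.
stuck = run_checks({"iscsi.a": ["172.16.1.4"]}, "iscsi.a",
                   ["172.16.1.6"], rounds=5, persist=False)
# Updated record: only the first pass after the failover reconfigures.
healthy = run_checks({"iscsi.a": ["172.16.1.4"]}, "iscsi.a",
                     ["172.16.1.6"], rounds=5, persist=True)
print(stuck, healthy)
```

<p>In this model, the repeating log entries correspond to the `persist=False` case, where the recorded value never catches up with the computed one.</p>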
<p>When we pause orchestration and check inside the container that's created, the trusted IP list looks to have the correct mgr IP appended, as expected in quincy (which I believe is the reason the reconfiguration takes place since <a class="external" href="https://github.com/ceph/ceph/commit/cda82c98a32f51cb392fc51ba854bcae409567f8">https://github.com/ceph/ceph/commit/cda82c98a32f51cb392fc51ba854bcae409567f8</a> )</p>
<pre><code>
trusted_ip_list = 172.16.1.4,172.16.1.5,172.16.1.6,172.16.1.7,172.16.1.8,172.16.1.6
</code></pre> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2279262022-11-02T12:51:23ZDavid Heap
<ul></ul><p>Attempts to stop or redeploy the daemon don't work as they seem to invoke a dependency check, which then restarts the service again</p>
<pre><code>
Nov 02 12:18:04 dh-ceph01-test ceph-mon[2054]: Schedule stop daemon iscsi.iscsi.dh-ceph01-test.jrdndc
Nov 02 12:18:05 dh-ceph01-test ceph-mon[2054]: Reconfiguring iscsi.iscsi.dh-ceph01-test.jrdndc (dependencies changed)...
Nov 02 12:18:05 dh-ceph01-test ceph-mon[2054]: Reconfiguring daemon iscsi.iscsi.dh-ceph01-test.jrdndc on dh-ceph01-test
Nov 02 12:18:23 dh-ceph01-test ceph-mon[2054]: Schedule redeploy daemon iscsi.iscsi.dh-ceph01-test.jrdndc
Nov 02 12:18:23 dh-ceph01-test ceph-mon[2054]: Reconfiguring iscsi.iscsi.dh-ceph01-test.jrdndc (dependencies changed)...
Nov 02 12:18:23 dh-ceph01-test ceph-mon[2054]: Reconfiguring daemon iscsi.iscsi.dh-ceph01-test.jrdndc on dh-ceph01-test
</code></pre> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2279892022-11-04T12:29:46ZRedouane Kachach Elhichou
<ul></ul><p>In this case I think it would be helpful to see the actual content of the deps. To get this information, enable the debug log level from a cephadm shell and grep for 'deps'. Something like:</p>
<pre>
ceph config set mgr mgr/cephadm/log_to_cluster_level debug; ceph -W cephadm --watch-debug | grep -e iscsi -e deps
</pre> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2279902022-11-04T12:45:47ZAdam King
<ul></ul><p>This is a painful one. @David at least until we have a fix for this, I will mention that setting the iscsi spec to unmanaged (<a class="external" href="https://docs.ceph.com/en/quincy/cephadm/services/#disabling-automatic-management-of-daemons">https://docs.ceph.com/en/quincy/cephadm/services/#disabling-automatic-management-of-daemons</a>) will stop cephadm from doing these dependency checks and will at least stop this from happening. The downside is it won't update the trusted ip list when it actually should (that's what it's trying to do here since the active mgr ip needs to be in that list), but theoretically just a "ceph orch redeploy &lt;iscsi-service-name&gt;" after the mgr failover happens should resolve it.</p> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2280782022-11-08T12:03:18ZDavid Heap
<ul></ul><p>Adam King wrote:</p>
<blockquote>
<p>This is a painful one. @David at least until we have a fix for this, I will mention that setting the iscsi spec to unmanaged (<a class="external" href="https://docs.ceph.com/en/quincy/cephadm/services/#disabling-automatic-management-of-daemons">https://docs.ceph.com/en/quincy/cephadm/services/#disabling-automatic-management-of-daemons</a>) will stop cephadm from doing these dependency checks and will at least stop this from happening. The downside is it won't update the trusted ip list when it actually should (that's what it's trying to do here since the active mgr ip needs to be in that list), but theoretically just a "ceph orch redeploy &lt;iscsi-service-name&gt;" after the mgr failover happens should resolve it.</p>
</blockquote>
<p>Thanks, we will look into setting it to unmanaged for now. We've currently got orchestration paused at a point where all the containers are up, which I guess does a similar thing in a more global way.</p>
<p>Our mgr IPs are in the initial trusted IP lists we configured when deploying the iscsi services, so the updates aren't really required in our clusters anyway.</p>
<p>Redouane Kachach Elhichou wrote:</p>
<blockquote>
<p>In this case I think it would be helpful to see what's the actual content of the deps. To get this information, from a cephadm shell enable the debug level and grep for 'deps'.</p>
</blockquote>
<p>It's the same IP address issue as above when attempting to stop or redeploy the service</p>
<pre><code>
2022-11-08T11:59:11.048572+0000 mgr.dh-ceph00-test [INF] Schedule stop daemon iscsi.iscsi.dh-ceph00-test.asunqg
2022-11-08T11:59:11.279916+0000 mgr.dh-ceph00-test [DBG] iscsi.iscsi.dh-ceph00-test.asunqg deps ['10.10.15.183'] -> ['10.10.15.182']
2022-11-08T11:59:11.280093+0000 mgr.dh-ceph00-test [INF] Reconfiguring iscsi.iscsi.dh-ceph00-test.asunqg (dependencies changed)...
2022-11-08T11:59:21.609811+0000 mgr.dh-ceph00-test [INF] Schedule redeploy daemon iscsi.iscsi.dh-ceph00-test.asunqg
2022-11-08T11:59:21.822075+0000 mgr.dh-ceph00-test [DBG] iscsi.iscsi.dh-ceph00-test.asunqg deps ['10.10.15.183'] -> ['10.10.15.182']
2022-11-08T11:59:21.822242+0000 mgr.dh-ceph00-test [INF] Reconfiguring iscsi.iscsi.dh-ceph00-test.asunqg (dependencies changed)...
</code></pre> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2281892022-11-14T11:08:50ZRedouane Kachach Elhichou
<ul></ul><p>I tried to reproduce the issue with the same setup but wasn't successful so far. I started from a cluster running v17.2.4 with iscsi devices and then upgraded to v17.2.5. Upgrade was smooth and I didn't see any issues.</p> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2283242022-11-20T13:24:29ZMykola Golubmgolub@suse.com
<ul></ul><p>David mentioned [1] as the potential commit that introduced the issue, but I actually think it is [2]. And apart from the fact that there seems to be a bug causing the issue in David's environment, I am not sure I like the approach very much: on every mgr failover we now reconfigure and restart the iscsi gateways because of the trusted_ip_list change. I think it could be improved if the list included not only the active mgr ip, but all mgr ips.</p>
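<p>To illustrate the suggestion above, here is a rough sketch comparing deps built from the active mgr IP alone with deps built from every mgr IP. The helper names `deps_active_only` and `deps_all_mgrs` are hypothetical and do not correspond to functions in cephadm.</p>

```python
from typing import Iterable, List


def deps_active_only(active_mgr_ip: str) -> List[str]:
    """Deps that track only the currently active mgr (changes on failover)."""
    return [active_mgr_ip]


def deps_all_mgrs(mgr_ips: Iterable[str]) -> List[str]:
    """Deps that track every mgr; sorted so the order is deterministic."""
    return sorted(set(mgr_ips))


mgrs = ["172.16.1.4", "172.16.1.5", "172.16.1.6"]

# A failover from .4 to .6 changes the active-only deps...
active_changed = deps_active_only("172.16.1.4") != deps_active_only("172.16.1.6")
# ...but not the all-mgrs deps, whichever mgr happens to be listed first.
all_stable = deps_all_mgrs(mgrs) == deps_all_mgrs(reversed(mgrs))
print(active_changed, all_stable)
```

<p>With the second form, the gateways would only need reconfiguring when a mgr is added to or removed from the cluster.</p>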
<p>[1] <a class="external" href="https://github.com/ceph/ceph/commit/cda82c98a32f51cb392fc51ba854bcae409567f8">https://github.com/ceph/ceph/commit/cda82c98a32f51cb392fc51ba854bcae409567f8</a> <br />[2] <a class="external" href="https://github.com/ceph/ceph/pull/48076/commits/7ad668158c76c498c08ca584a27d288de0ac1e3b">https://github.com/ceph/ceph/pull/48076/commits/7ad668158c76c498c08ca584a27d288de0ac1e3b</a></p> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2283362022-11-21T10:42:35ZDavid Heap
<ul></ul><p>Hi Mykola and Redouane</p>
<p>Thanks for looking into this - we initially thought the same regarding the commit in the [2] link (see Dan's opening entry above), but it doesn't look like that commit is in the code tagged in 17.2.5:</p>
<p><a class="external" href="https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/module.py#L2480">https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/module.py#L2480</a><br /><a class="external" href="https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/module.py#L2362">https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/module.py#L2362</a></p>
<p>nor is the new code in the containers running 17.2.5:</p>
<pre><code>
heapd@dh-ceph00-test:~$ sudo ceph version
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
heapd@dh-ceph00-test:~$ sudo podman ps | grep mgr
c85b4d1c7e76 quay.io/ceph/ceph@sha256:0560b16bec6e84345f29fb6693cd2430884e6efff16a95d5bdd0bb06d7661c45 -n mgr.dh-ceph00-... 9 days ago Up 9 days ago ...
heapd@dh-ceph00-test:~$ sudo podman exec c85b4d1c7e76 grep iscsi /usr/share/ceph/mgr/cephadm/module.py
from .services.iscsi import IscsiService
elif daemon_type == 'iscsi':
'iscsi': PlacementSpec(count=1),
def apply_iscsi(self, spec: ServiceSpec) -> str:
</code></pre>
<p>Agree that it probably shouldn't be reconfiguring and restarting on every mgr failover if the pool of mgr IPs is staying the same</p> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2318092023-02-19T17:52:05ZMykola Golubmgolub@suse.com
<ul></ul><p>Hi David,</p>
<p>Although I was not able to reproduce your issue, I think the commit [1] I mentioned in my previous comment may fix your issue, as it changes how deps is calculated. This change will be in 17.2.6.</p>
<p>Also I created a ticket [2] and PR [3] to change the mgr cephadm module so that it does not add the mgr ip to the trusted_ip_list if it's already there. In your case (when you already have the mgr ips in the list) it should avoid restarting the iscsi gateway on mgr failover.</p>
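<p>The idea can be sketched roughly as follows. This is an illustration only, not the code from the PR; `build_trusted_ip_list` is a hypothetical helper, and the sample list is taken from the trusted_ip_list reported earlier in this ticket.</p>

```python
from typing import List


def build_trusted_ip_list(configured: str, mgr_ip: str) -> str:
    """Append the mgr IP to a comma-separated list only if it is missing."""
    ips: List[str] = [ip.strip() for ip in configured.split(",") if ip.strip()]
    if mgr_ip not in ips:
        ips.append(mgr_ip)
    return ",".join(ips)


configured = "172.16.1.4,172.16.1.5,172.16.1.6,172.16.1.7,172.16.1.8"

# The result is identical whichever listed mgr is active, so a failover
# between them would not change the generated list.
on_mgr_6 = build_trusted_ip_list(configured, "172.16.1.6")
on_mgr_4 = build_trusted_ip_list(configured, "172.16.1.4")
print(on_mgr_6 == on_mgr_4 == configured)
```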
<p>[1] <a class="external" href="https://github.com/ceph/ceph/pull/48076/commits/7ad668158c76c498c08ca584a27d288de0ac1e3b">https://github.com/ceph/ceph/pull/48076/commits/7ad668158c76c498c08ca584a27d288de0ac1e3b</a><br />[2] <a class="external" href="https://tracker.ceph.com/issues/58792">https://tracker.ceph.com/issues/58792</a><br />[3] <a class="external" href="https://github.com/ceph/ceph/pull/50167">https://github.com/ceph/ceph/pull/50167</a></p> Orchestrator - Bug #57897: ceph mgr restart causes restart of all iscsi daemons in a loophttps://tracker.ceph.com/issues/57897?journal_id=2349272023-04-12T15:28:20ZDavid Heap
<ul></ul><p>Hi Mykola</p>
<p>Unfortunately the change in deps calculation in 17.2.6 didn't resolve the issue as the deps still change, but hopefully your deduplication PR will help once released - thanks for submitting that!</p>
<pre>
2023-04-12T16:15:29.180866+0100 mgr.dh-ceph01-test [DBG] Applying service iscsi.iscsi spec
2023-04-12T16:15:29.181770+0100 mgr.dh-ceph01-test [DBG] Combine hosts with existing daemons [&lt;DaemonDescription&gt;(iscsi.iscsi.dh-ceph00-test.asunqg)] + new hosts []
2023-04-12T16:15:29.326384+0100 mgr.dh-ceph01-test [DBG] iscsi.iscsi.dh-ceph00-test.asunqg deps ['10.10.15.181,10.10.15.182,10.10.15.183,10.10.15.184,10.10.15.185,10.10.15.182'] -> ['10.10.15.181,10.10.15.182,10.10.15.183,10.10.15.184,10.10.15.185,10.10.15.183']
2023-04-12T16:15:29.326585+0100 mgr.dh-ceph01-test [INF] Reconfiguring iscsi.iscsi.dh-ceph00-test.asunqg (dependencies changed)...
2023-04-12T16:15:29.336105+0100 mgr.dh-ceph01-test [INF] Reconfiguring daemon iscsi.iscsi.dh-ceph00-test.asunqg on dh-ceph00-test
</pre>
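<p>For anyone comparing their own logs, a small sketch that parses a deps line like the one above and shows which entries differ. The regular expression is an assumption based only on the log format shown in this ticket, and `parse_deps` is a hypothetical helper.</p>

```python
import re
from typing import List, Tuple

# Matches: deps ['a,b,c'] -> ['a,b,d'] as printed in the debug log above.
DEPS_RE = re.compile(r"deps \['([^']*)'\] -> \['([^']*)'\]")


def parse_deps(line: str) -> Tuple[List[str], List[str]]:
    """Return the (old, new) IP lists from a cephadm deps debug line."""
    match = DEPS_RE.search(line)
    if match is None:
        raise ValueError("no deps transition found in line")
    return match.group(1).split(","), match.group(2).split(",")


line = (
    "iscsi.iscsi.dh-ceph00-test.asunqg deps "
    "['10.10.15.181,10.10.15.182,10.10.15.183,10.10.15.184,10.10.15.185,10.10.15.182'] -> "
    "['10.10.15.181,10.10.15.182,10.10.15.183,10.10.15.184,10.10.15.185,10.10.15.183']"
)
old, new = parse_deps(line)

# Positions where the two lists disagree, and whether the unique IPs match.
changed = [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]
same_set = set(old) == set(new)
print(changed, same_set)
```

<p>Only the appended final entry differs, while the set of unique IPs is the same on both sides, which is the case the deduplication change is meant to cover.</p>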