Bug #65627
Status: Open
Centos 9 stream ceph container iscsi test failure
Description
Missing k8sevents module
While waiting for the mgr to start, we get this traceback message:
teuthology.log
2024-04-21T15:40:34.547 INFO:teuthology.orchestra.run.smithi017.stdout:mgr not available, waiting (1/15)...
2024-04-21T15:40:34.853 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: 2024-04-21T22:40:34.429+0000 7f6f44397040 -1 mgr[py] Module not found: 'k8sevents'
2024-04-21T15:40:34.853 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: 2024-04-21T22:40:34.429+0000 7f6f44397040 -1 mgr[py] Traceback (most recent call last):
2024-04-21T15:40:34.853 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: File "/usr/share/ceph/mgr/k8sevents/__init__.py", line 1, in <module>
2024-04-21T15:40:34.854 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: from .module import Module
2024-04-21T15:40:34.854 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: File "/usr/share/ceph/mgr/k8sevents/module.py", line 70, in <module>
2024-04-21T15:40:34.854 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: from kubernetes.client.models.v1_event import V1Event
2024-04-21T15:40:34.854 INFO:journalctl@ceph.mgr.x.smithi017.stdout:Apr 21 22:40:34 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: ModuleNotFoundError: No module named 'kubernetes.client.models.v1_event'
ceph-mgr.x.log:
2024-04-21T22:40:34.094+0000 7f6f44397040 4 mgr[py] Module influx reported that it cannot run: influxdb python module not found
2024-04-21T22:40:34.094+0000 7f6f44397040 1 mgr[py] Loading python module 'k8sevents'
2024-04-21T22:40:34.111+0000 7f6f44397040 10 mgr[py] Computed sys.path '/usr/share/ceph/mgr:/usr/local/lib64/python3.9/site-packages:/usr/local/lib/python3.9/site-packages:/lib64/python3.9/site-packages:/lib/python3.9/site-packages:lib64/python39.zip:/lib64/python3.9:/lib64/python3.9/lib-dynload'
2024-04-21T22:40:34.429+0000 7f6f44397040 -1 mgr[py] Module not found: 'k8sevents'
2024-04-21T22:40:34.429+0000 7f6f44397040 -1 mgr[py] Traceback (most recent call last):
  File "/usr/share/ceph/mgr/k8sevents/__init__.py", line 1, in <module>
    from .module import Module
  File "/usr/share/ceph/mgr/k8sevents/module.py", line 70, in <module>
    from kubernetes.client.models.v1_event import V1Event
ModuleNotFoundError: No module named 'kubernetes.client.models.v1_event'
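The traceback points at a hard import of `kubernetes.client.models.v1_event`, a path that newer kubernetes Python clients appear to have dropped (`V1Event` was renamed to `CoreV1Event`). A hedged sketch of a version-tolerant import guard, the kind of fix a module like this could use; the helper name and candidate list are illustrative, not the actual Ceph patch:

```python
import importlib


def import_attr(candidates):
    """Try (module_name, attribute) pairs in order; return the first one
    that imports cleanly. Raises ImportError if none resolve.

    This illustrates tolerating a rename across library versions, e.g.
    the kubernetes client's V1Event -> CoreV1Event move.
    """
    errors = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError("no candidate importable: " + "; ".join(errors))


# Hypothetical use in the mgr module (kubernetes package assumed present):
# V1Event = import_attr([
#     ("kubernetes.client.models.v1_event", "V1Event"),
#     ("kubernetes.client.models.core_v1_event", "CoreV1Event"),
# ])
```

With a guard like this, an older or newer client would resolve whichever symbol exists instead of failing module load outright.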
There is also a traceback in the mon log a bit after this, which indicates an SSL connection issue with mgr.x's HTTPS endpoint:
ceph-mon.a.log
2024-04-21T22:40:56.696+0000 7febf911f640 10 mon.a@0(leader).log v24 logging 2024-04-21T22:40:56.057598+0000 mgr.x (mgr.14118) 13 : cephadm [INF] [21/Apr/2024:22:40:56] ENGINE Serving on https://172.21.15.17:7150
2024-04-21T22:40:56.696+0000 7febf911f640 10 mon.a@0(leader).log v24 logging 2024-04-21T22:40:56.057736+0000 mgr.x (mgr.14118) 14 : cephadm [INF] [21/Apr/2024:22:40:56] ENGINE Bus STARTED
2024-04-21T22:40:56.696+0000 7febf911f640 10 mon.a@0(leader).log v24 logging 2024-04-21T22:40:56.059970+0000 mgr.x (mgr.14118) 15 : cephadm [ERR] [21/Apr/2024:22:40:56] ENGINE Error in HTTPServer.serve
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1823, in serve
    self._connections.run(self.expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 203, in run
    self._run(expiration_interval)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 246, in _run
    new_conn = self._from_server_socket(self.server.socket)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 300, in _from_server_socket
    s, ssl_env = self.server.ssl_adapter.wrap(s)
  File "/lib/python3.9/site-packages/cheroot/ssl/builtin.py", line 277, in wrap
    s = self.context.wrap_socket(
  File "/lib64/python3.9/ssl.py", line 501, in wrap_socket
    return self.sslsocket_class._create(
  File "/lib64/python3.9/ssl.py", line 1074, in _create
    self.do_handshake()
  File "/lib64/python3.9/ssl.py", line 1343, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:1129)
The mgr log keeps going for a while after this traceback, so this error does not seem to kill the manager. The teuthology test ends when the rest_api_delete.t test is run and passes. This appears to be the last workload, since teardown begins immediately afterward and all daemons are stopped.
2024-04-21T15:54:46.234 INFO:tasks.cram.client.2.smithi159.stdout:/home/ubuntu/cephtest/archive/cram.client.2/rest_api_delete.t: passed
2024-04-21T15:54:46.234 INFO:tasks.cram.client.2.smithi159.stdout:# Ran 1 tests, 0 skipped, 0 failed.
2024-04-21T15:54:46.353 INFO:journalctl@ceph.iscsi.iscsi.a.smithi017.stdout:Apr 21 22:54:45 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-a[47907]: debug (Gateway.load_config) successfully loaded existing target definition
2024-04-21T15:54:46.353 INFO:journalctl@ceph.iscsi.iscsi.a.smithi017.stdout:Apr 21 22:54:45 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-a[47907]: debug Removing target configuration
2024-04-21T15:54:46.353 INFO:journalctl@ceph.iscsi.iscsi.a.smithi017.stdout:Apr 21 22:54:46 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-a[47907]: debug ::ffff:172.21.15.159 - - [21/Apr/2024 22:54:46] "DELETE /api/_target/iqn.2003-01.com.redhat.iscsi-gw:ceph-gw HTTP/1.1" 200 -
2024-04-21T15:54:46.353 INFO:journalctl@ceph.iscsi.iscsi.a.smithi017.stdout:Apr 21 22:54:46 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-a[47907]: ::ffff:172.21.15.159 - - [21/Apr/2024 22:54:46] "DELETE /api/_target/iqn.2003-01.com.redhat.iscsi-gw:ceph-gw HTTP/1.1" 200 -
2024-04-21T15:54:46.660 INFO:journalctl@ceph.iscsi.iscsi.b.smithi159.stdout:Apr 21 22:54:46 smithi159 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-b[47144]: debug _disk update on smithi159, successful
2024-04-21T15:54:46.660 INFO:journalctl@ceph.iscsi.iscsi.b.smithi159.stdout:Apr 21 22:54:46 smithi159 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-b[47144]: debug ::ffff:127.0.0.1 - - [21/Apr/2024 22:54:46] "DELETE /api/disk/datapool/block1 HTTP/1.1" 200 -
2024-04-21T15:54:46.660 INFO:journalctl@ceph.iscsi.iscsi.b.smithi159.stdout:Apr 21 22:54:46 smithi159 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-b[47144]: ::ffff:127.0.0.1 - - [21/Apr/2024 22:54:46] "DELETE /api/disk/datapool/block1 HTTP/1.1" 200 -
2024-04-21T15:54:46.853 INFO:journalctl@ceph.iscsi.iscsi.a.smithi017.stdout:Apr 21 22:54:46 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-iscsi-iscsi-a[47907]: debug Change detected - internal 24 / xattr 26 refreshing
2024-04-21T15:54:47.159 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:54:46 smithi159 ceph-mon[26763]: pgmap v466: 4 pgs: 4 active+clean; 580 KiB data, 235 MiB used, 715 GiB / 715 GiB avail; 34 KiB/s rd, 2.3 KiB/s wr, 37 op/s
2024-04-21T15:54:47.181 INFO:journalctl@ceph.mon.b.smithi118.stdout:Apr 21 22:54:46 smithi118 ceph-mon[26396]: pgmap v466: 4 pgs: 4 active+clean; 580 KiB data, 235 MiB used, 715 GiB / 715 GiB avail; 34 KiB/s rd, 2.3 KiB/s wr, 37 op/s
2024-04-21T15:54:47.243 INFO:teuthology.task.install.util:Removing shipped files: /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits /usr/bin/stdin-killer...
2024-04-21T15:54:47.268 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:54:46 smithi017 ceph-mon[23490]: pgmap v466: 4 pgs: 4 active+clean; 580 KiB data, 235 MiB used, 715 GiB / 715 GiB avail; 34 KiB/s rd, 2.3 KiB/s wr, 37 op/s
2024-04-21T15:54:47.348 INFO:tasks.cephadm:Teardown begin
2024-04-21T15:54:47.424 INFO:tasks.cephadm:Cleaning up testdir ceph.* files...
2024-04-21T15:54:47.478 INFO:tasks.cephadm:Stopping all daemons...
Why did the test say it failed due to a MON_DOWN warning?
We start deploying mon.b (mon.a and mon.c are already deployed). At this point, mon.b is not in the monmap yet:
2024-04-21T15:42:02.105 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:01 smithi017 ceph-mon[23490]: Deploying daemon mon.b on smithi118
2024-04-21T15:42:02.155 INFO:teuthology.orchestra.run.smithi159.stdout:
2024-04-21T15:42:02.155 INFO:teuthology.orchestra.run.smithi159.stdout:{"epoch":2,"fsid":"fc21275e-002f-11ef-9c9f-ac1f6bc659f6","modified":"2024-04-21T22:41:56.715872Z","created":"2024-04-21T22:40:23.832411Z","min_mon_release":19,"min_mon_release_name":"squid","election_strategy":1,"disallowed_leaders: ":"","stretch_mode":false,"tiebreaker_mon":"","removed_ranks: ":"","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus","pacific","elector-pinging","quincy","reef","squid"],"optional":[]},"mons":[{"rank":0,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.17:3300","nonce":0},{"type":"v1","addr":"172.21.15.17:6789","nonce":0}]},"addr":"172.21.15.17:6789/0","public_addr":"172.21.15.17:6789/0","priority":0,"weight":0,"crush_location":"{}"},{"rank":1,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.159:3300","nonce":0},{"type":"v1","addr":"172.21.15.159:6789","nonce":0}]},"addr":"172.21.15.159:6789/0","public_addr":"172.21.15.159:6789/0","priority":0,"weight":0,"crush_location":"{}"}],"quorum":[0,1]}
2024-04-21T15:42:02.155 INFO:teuthology.orchestra.run.smithi159.stderr:dumped monmap epoch 2
mon.a and mon.c call for election:
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: mon.a calling monitor election
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: from='mgr.14150 172.21.15.17:0/3986839991' entity='mgr.x' cmd=[{"prefix": "mon metadata", "id": "c"}]: dispatch
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: pgmap v16: 0 pgs: ; 0 B data, 0 B used, 0 B / 0 B avail
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: from='mgr.14150 172.21.15.17:0/3986839991' entity='mgr.x' cmd=[{"prefix": "mon metadata", "id": "c"}]: dispatch
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: mon.c calling monitor election
mon.a becomes the leader and both mons are in quorum (mon.b is still deploying):
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: mon.a is new leader, mons a,c in quorum (ranks 0,1)
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: monmap epoch 2
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: fsid fc21275e-002f-11ef-9c9f-ac1f6bc659f6
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: last_changed 2024-04-21T22:41:56.715872+0000
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: created 2024-04-21T22:40:23.832411+0000
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: min_mon_release 19 (squid)
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: election_strategy: 1
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: 0: [v2:172.21.15.17:3300/0,v1:172.21.15.17:6789/0] mon.a
2024-04-21T15:42:02.156 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: 1: [v2:172.21.15.159:3300/0,v1:172.21.15.159:6789/0] mon.c
2024-04-21T15:42:02.157 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: fsmap
2024-04-21T15:42:02.157 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: osdmap e4: 0 total, 0 up, 0 in
2024-04-21T15:42:02.157 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: mgrmap e13: x(active, since 48s)
2024-04-21T15:42:02.157 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:01 smithi159 ceph-mon[26763]: overall HEALTH_OK
Then we wait for 3 mons in the monmap. Now, mon.b is present in the monmap:
2024-04-21T15:42:03.717 INFO:tasks.cephadm:Waiting for 3 mons in monmap...
2024-04-21T15:42:03.950 INFO:teuthology.orchestra.run.smithi159.stderr:Inferring config /var/lib/ceph/fc21275e-002f-11ef-9c9f-ac1f6bc659f6/mon.c/config
2024-04-21T15:42:04.072 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:03 smithi159 ceph-mon[26763]: pgmap v19: 0 pgs: ; 0 B data, 0 B used, 0 B / 0 B avail
2024-04-21T15:42:04.103 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:03 smithi017 ceph-mon[23490]: pgmap v19: 0 pgs: ; 0 B data, 0 B used, 0 B / 0 B avail
2024-04-21T15:42:05.682 INFO:journalctl@ceph.mon.b.smithi118.stdout:Apr 21 22:42:05 smithi118 systemd[1]: Started Ceph mon.b for fc21275e-002f-11ef-9c9f-ac1f6bc659f6.
2024-04-21T15:42:10.540 INFO:teuthology.orchestra.run.smithi159.stdout:
2024-04-21T15:42:10.540 INFO:teuthology.orchestra.run.smithi159.stdout:{"epoch":3,"fsid":"fc21275e-002f-11ef-9c9f-ac1f6bc659f6","modified":"2024-04-21T22:42:05.526437Z","created":"2024-04-21T22:40:23.832411Z","min_mon_release":19,"min_mon_release_name":"squid","election_strategy":1,"disallowed_leaders: ":"","stretch_mode":false,"tiebreaker_mon":"","removed_ranks: ":"","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus","pacific","elector-pinging","quincy","reef","squid"],"optional":[]},"mons":[{"rank":0,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.17:3300","nonce":0},{"type":"v1","addr":"172.21.15.17:6789","nonce":0}]},"addr":"172.21.15.17:6789/0","public_addr":"172.21.15.17:6789/0","priority":0,"weight":0,"crush_location":"{}"},{"rank":1,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.159:3300","nonce":0},{"type":"v1","addr":"172.21.15.159:6789","nonce":0}]},"addr":"172.21.15.159:6789/0","public_addr":"172.21.15.159:6789/0","priority":0,"weight":0,"crush_location":"{}"},{"rank":2,"name":"b","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.118:3300","nonce":0},{"type":"v1","addr":"172.21.15.118:6789","nonce":0}]},"addr":"172.21.15.118:6789/0","public_addr":"172.21.15.118:6789/0","priority":0,"weight":0,"crush_location":"{}"}],"quorum":[0,1]}
2024-04-21T15:42:10.540 INFO:teuthology.orchestra.run.smithi159.stderr:dumped monmap epoch 3
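The monmap dumps above are JSON: the "wait for 3 mons" step only needs the length of the `mons` array, while quorum membership is the separate `quorum` list of ranks. A minimal sketch of reading both (field names taken from the dump above; the helper name is mine):

```python
import json


def monmap_status(monmap_json):
    """Return (mon_count, quorum_names) from a JSON monmap dump."""
    m = json.loads(monmap_json)
    rank_to_name = {mon["rank"]: mon["name"] for mon in m["mons"]}
    quorum_names = [rank_to_name[r] for r in m["quorum"]]
    return len(m["mons"]), quorum_names


# Epoch-3 shape from the log: three mons, but quorum is still [0, 1] (a and c).
sample = json.dumps({
    "epoch": 3,
    "mons": [{"rank": 0, "name": "a"}, {"rank": 1, "name": "c"},
             {"rank": 2, "name": "b"}],
    "quorum": [0, 1],
})
print(monmap_status(sample))  # (3, ['a', 'c'])
```

This is exactly the state the log shows at epoch 3: three mons satisfy the "3 mons in monmap" wait even though mon.b has not yet joined quorum.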
mon.a and mon.c call for election:
2024-04-21T15:42:10.845 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: mon.a calling monitor election
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: mon.c calling monitor election
mon.a ends up as the leader, but only mon.a and mon.c are in quorum:
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: mon.a is new leader, mons a,c in quorum (ranks 0,1)
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: monmap epoch 3
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: fsid fc21275e-002f-11ef-9c9f-ac1f6bc659f6
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: last_changed 2024-04-21T22:42:05.526437+0000
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: created 2024-04-21T22:40:23.832411+0000
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: min_mon_release 19 (squid)
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: election_strategy: 1
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: 0: [v2:172.21.15.17:3300/0,v1:172.21.15.17:6789/0] mon.a
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: 1: [v2:172.21.15.159:3300/0,v1:172.21.15.159:6789/0] mon.c
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: 2: [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] mon.b
2024-04-21T15:42:10.846 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: fsmap
2024-04-21T15:42:10.847 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: osdmap e4: 0 total, 0 up, 0 in
2024-04-21T15:42:10.847 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: mgrmap e13: x(active, since 57s)
2024-04-21T15:42:10.847 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)
2024-04-21T15:42:10.847 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: Health detail: HEALTH_WARN 1/3 mons down, quorum a,c
2024-04-21T15:42:10.847 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: [WRN] MON_DOWN: 1/3 mons down, quorum a,c
2024-04-21T15:42:10.847 INFO:journalctl@ceph.mon.c.smithi159.stdout:Apr 21 22:42:10 smithi159 ceph-mon[26763]: mon.b (rank 2) addr [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] is down (out of quorum)
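The warning above follows mechanically from the monmap and quorum: the epoch-3 map has 3 mons but only ranks 0 and 1 (a, c) in quorum, so 1/3 is down. An illustrative reconstruction of how that health string is formed (not the actual mon code, just the arithmetic):

```python
def mon_down_check(total_mons, quorum_names):
    """Reproduce the shape of the MON_DOWN health message (illustrative)."""
    down = total_mons - len(quorum_names)
    if down == 0:
        return "HEALTH_OK"
    return (f"HEALTH_WARN {down}/{total_mons} mons down, "
            f"quorum {','.join(quorum_names)} (MON_DOWN)")


print(mon_down_check(3, ["a", "c"]))
# HEALTH_WARN 1/3 mons down, quorum a,c (MON_DOWN)
```

Once mon.b wins its election and joins, the quorum list covers all three mons and the check clears, matching the "Cluster is now healthy" entry later in the log.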
Next, mon.b calls for an election. This time ceph.conf is updated, all three monitors reach quorum, and the earlier health warning is cleared:
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: mon.b calling monitor election
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Updating smithi017:/etc/ceph/ceph.conf
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Updating smithi118:/etc/ceph/ceph.conf
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Updating smithi159:/etc/ceph/ceph.conf
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Updating smithi118:/var/lib/ceph/fc21275e-002f-11ef-9c9f-ac1f6bc659f6/config/ceph.conf
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Updating smithi017:/var/lib/ceph/fc21275e-002f-11ef-9c9f-ac1f6bc659f6/config/ceph.conf
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Updating smithi159:/var/lib/ceph/fc21275e-002f-11ef-9c9f-ac1f6bc659f6/config/ceph.conf
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Reconfiguring mon.a (unknown last config time)...
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Reconfiguring daemon mon.a on smithi017
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: mon.b calling monitor election
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: mon.c calling monitor election
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: mon.a calling monitor election
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: mon.a is new leader, mons a,c,b in quorum (ranks 0,1,2)
2024-04-21T15:42:13.654 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: monmap epoch 3
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: fsid fc21275e-002f-11ef-9c9f-ac1f6bc659f6
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: last_changed 2024-04-21T22:42:05.526437+0000
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: created 2024-04-21T22:40:23.832411+0000
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: min_mon_release 19 (squid)
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: election_strategy: 1
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: 0: [v2:172.21.15.17:3300/0,v1:172.21.15.17:6789/0] mon.a
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: 1: [v2:172.21.15.159:3300/0,v1:172.21.15.159:6789/0] mon.c
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: 2: [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] mon.b
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: fsmap
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: osdmap e4: 0 total, 0 up, 0 in
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: mgrmap e13: x(active, since 59s)
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Health check cleared: MON_DOWN (was: 1/3 mons down, quorum a,c)
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: Cluster is now healthy
2024-04-21T15:42:13.655 INFO:journalctl@ceph.mon.a.smithi017.stdout:Apr 21 22:42:13 smithi017 ceph-mon[23490]: overall HEALTH_OK
The next mentions of MON_DOWN come later, when the cluster log is checked for badness. This happens shortly after the final rest_api_delete.t test (from the first section) runs and teardown begins. The check does detect the MON_DOWN warning from earlier:
2024-04-21T15:56:24.613 INFO:tasks.cephadm:Checking cluster log for badness...
2024-04-21T15:56:24.643 INFO:teuthology.orchestra.run.smithi017.stdout:2024-04-21T22:42:10.535211+0000 mon.a (mon.0) 206 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)
2024-04-21T15:56:24.643 WARNING:tasks.cephadm:Found errors (ERR|WRN|SEC) in cluster log
2024-04-21T15:56:24.850 INFO:teuthology.orchestra.run.smithi017.stdout:2024-04-21T22:42:10.535211+0000 mon.a (mon.0) 206 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)
This is why the test failure reason is: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log
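The "checking cluster log for badness" step behaves like a grep for ERR/WRN/SEC cluster-log entries, failing the job unless every match is covered by the job's ignorelist. A simplified sketch of that filter (function name and regex handling are mine, mirroring the behavior described above, not cephadm's actual code):

```python
import re


def find_badness(cluster_log_lines, ignorelist):
    """Return [ERR]/[WRN]/[SEC] cluster-log lines not matched by any
    ignorelist regex; a non-empty result fails the run."""
    level_re = re.compile(r"\[(ERR|WRN|SEC)\]")
    ignore_res = [re.compile(pat) for pat in ignorelist]
    return [line for line in cluster_log_lines
            if level_re.search(line)
            and not any(r.search(line) for r in ignore_res)]


log = [
    "2024-04-21T22:42:10 mon.a (mon.0) 206 : cluster [WRN] Health check "
    "failed: 1/3 mons down, quorum a,c (MON_DOWN)",
    "2024-04-21T22:42:12 mon.a (mon.0) 252 : cluster [INF] Cluster is now healthy",
]
print(find_badness(log, []))            # the MON_DOWN line -> job fails
print(find_badness(log, ["MON_DOWN"]))  # [] -> job would pass
```

This is why the transient warning, even though it cleared seconds later, is enough to fail the run: the check scans history, not current health.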
The cluster log shows the same series of events as the teuthology log, just with timestamps a few minutes off. (Are cluster log entries delayed?)
2024-04-21T22:42:10.535211+0000 mon.a (mon.0) 206 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)
2024-04-21T22:42:10.538197+0000 mon.a (mon.0) 207 : cluster [WRN] Health detail: HEALTH_WARN 1/3 mons down, quorum a,c
2024-04-21T22:42:10.538216+0000 mon.a (mon.0) 208 : cluster [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum a,c
2024-04-21T22:42:10.538222+0000 mon.a (mon.0) 209 : cluster [WRN] mon.b (rank 2) addr [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] is down (out of quorum)
2024-04-21T22:42:07.527675+0000 mon.b (mon.2) 1 : cluster [INF] mon.b calling monitor election
2024-04-21T22:42:12.534479+0000 mon.b (mon.2) 2 : cluster [INF] mon.b calling monitor election
2024-04-21T22:42:12.534787+0000 mon.c (mon.1) 3 : cluster [INF] mon.c calling monitor election
2024-04-21T22:42:12.535024+0000 mon.a (mon.0) 237 : cluster [INF] mon.a calling monitor election
2024-04-21T22:42:12.536672+0000 mon.a (mon.0) 238 : cluster [INF] mon.a is new leader, mons a,c,b in quorum (ranks 0,1,2)
2024-04-21T22:42:12.540048+0000 mon.a (mon.0) 239 : cluster [DBG] monmap epoch 3
2024-04-21T22:42:12.540070+0000 mon.a (mon.0) 240 : cluster [DBG] fsid fc21275e-002f-11ef-9c9f-ac1f6bc659f6
2024-04-21T22:42:12.540086+0000 mon.a (mon.0) 241 : cluster [DBG] last_changed 2024-04-21T22:42:05.526437+0000
2024-04-21T22:42:12.540099+0000 mon.a (mon.0) 242 : cluster [DBG] created 2024-04-21T22:40:23.832411+0000
2024-04-21T22:42:12.540111+0000 mon.a (mon.0) 243 : cluster [DBG] min_mon_release 19 (squid)
2024-04-21T22:42:12.540124+0000 mon.a (mon.0) 244 : cluster [DBG] election_strategy: 1
2024-04-21T22:42:12.540137+0000 mon.a (mon.0) 245 : cluster [DBG] 0: [v2:172.21.15.17:3300/0,v1:172.21.15.17:6789/0] mon.a
2024-04-21T22:42:12.540150+0000 mon.a (mon.0) 246 : cluster [DBG] 1: [v2:172.21.15.159:3300/0,v1:172.21.15.159:6789/0] mon.c
2024-04-21T22:42:12.540163+0000 mon.a (mon.0) 247 : cluster [DBG] 2: [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] mon.b
2024-04-21T22:42:12.540858+0000 mon.a (mon.0) 248 : cluster [DBG] fsmap
2024-04-21T22:42:12.540892+0000 mon.a (mon.0) 249 : cluster [DBG] osdmap e4: 0 total, 0 up, 0 in
2024-04-21T22:42:12.541111+0000 mon.a (mon.0) 250 : cluster [DBG] mgrmap e13: x(active, since 59s)
2024-04-21T22:42:12.541400+0000 mon.a (mon.0) 251 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum a,c)
2024-04-21T22:42:12.541425+0000 mon.a (mon.0) 252 : cluster [INF] Cluster is now healthy
2024-04-21T22:42:12.544885+0000 mon.a (mon.0) 253 : cluster [INF] overall HEALTH_OK
Conclusion
There was a clear module-load failure in the mgr (k8sevents failed to import a symbol from the kubernetes Python package), but the error was not fatal to the mgr. The test summary said the test failed on a MON_DOWN warning, but that warning was only seen briefly because mon.b had not yet been added to ceph.conf while it was being deployed.
It is hard to say how the teuthology test would have failed if MON_DOWN had been in the ignorelist, but it might have been something like what was logged in the journal syslog:
journalctl-b0.gz
Apr 21 22:40:38 smithi017 ceph-fc21275e-002f-11ef-9c9f-ac1f6bc659f6-mgr-x[23720]: 2024-04-21T22:40:38.636+0000 7f6f44397040 -1 log_channel(cluster) log [ERR] : Failed to load ceph-mgr modules: k8sevents
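For reference, teuthology job YAML can suppress a transient warning like this via `log-ignorelist`. A sketch of the fragment (the exact suite file for this job is not shown in this ticket; the regex entry is an assumption about how the warning would be matched):

```yaml
# teuthology job fragment (sketch): tolerate the transient election-window warning
log-ignorelist:
  - MON_DOWN
  - \(MON_DOWN\)
```

With this in place, the "checking cluster log for badness" step would skip the matching [WRN] line, and any remaining failure would have to come from something else, such as the mgr module error above.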
Updated by Laura Flores 3 days ago
- Related to Bug #64981: MGR non-fatal crash during startup - k8sevents missing module - Centos 9 / 18.2.2 added
Updated by Laura Flores 3 days ago
Logs are in ~dmick/c9.iscsi.archive.tgz on the teuthology node.