Bug #53807


Dead jobs in rados/cephadm/smoke-roleless{...}: ingress jobs stuck

Added by Laura Flores over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Immediate
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/cephadm/smoke-roleless/{0-distro/centos_8.3_container_tools_3.0 0-nvme-loop 1-start 2-services/nfs-ingress 3-final}

Failure Reason: hit max job timeout

Jobs:
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598774
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598785
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599316
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599350

Earlier in the log:

2022-01-06T16:33:28.615 INFO:teuthology.task.ansible.out:
TASK [common : Check firewalld status] *****************************************

2022-01-06T16:33:28.617 INFO:teuthology.task.ansible.out:fatal: [smithi107.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
...ignoring

2022-01-06T16:33:28.638 INFO:teuthology.task.ansible.out:Thursday 06 January 2022  16:33:28 +0000 (0:00:00.260)       0:02:03.410 ******

Later in the log:

2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 ceph-mon[30485]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.data"}]: dispatch
2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 ceph-mon[30485]: pgmap v133: 33 pgs: 32 unknown, 1 active+clean; 577 KiB data, 47 MiB used, 715 GiB / 715 GiB avail
2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 ceph-mon[30485]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho'
2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 conmon[30462]: 2022-01-06T16:44:13.732+0000 7fa865de5700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2022-01-06T16:44:14.150 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho' cmd='[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]': finished
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: osdmap e43: 8 total, 8 up, 8 in
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.data"}]: dispatch
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: pgmap v133: 33 pgs: 32 unknown, 1 active+clean; 577 KiB data, 47 MiB used, 715 GiB / 715 GiB avail
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho'
2022-01-06T16:44:14.535 INFO:teuthology.run_tasks:Running task cephadm.apply...


Related issues 1 (0 open, 1 closed)

Has duplicate Orchestrator - Bug #53904: cephadm: ingress jobs stuck (Duplicate) - Melissa Li

Actions #1

Updated by Laura Flores over 2 years ago

Another similar scenario, which does not involve offline filesystems:

Description: rados/cephadm/smoke-roleless/{0-distro/rhel_8.4_container_tools_rhel8 0-nvme-loop 1-start 2-services/rgw-ingress 3-final}

Failure reason: hit max job timeout

Jobs:
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598830
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599155

TASK [common : Check firewalld status] *****************************************

2022-01-06T17:12:46.159 INFO:teuthology.task.ansible.out:fatal: [smithi158.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
...ignoring

2022-01-06T17:12:46.180 INFO:teuthology.task.ansible.out:Thursday 06 January 2022  17:12:46 +0000 (0:00:00.236)       0:03:15.016 ****** 

2022-01-06T17:12:46.208 INFO:teuthology.task.ansible.out:
TASK [common : Open nrpe port if firewalld enabled] ****************************
TASK [testnode : Stop and disable iptables] ************************************

2022-01-06T17:16:52.036 INFO:teuthology.task.ansible.out:fatal: [smithi179.front.sepia.ceph.com]: FAILED! => {"changed": false, "msg": "Could not find the requested service iptables: host"}
...ignoring

2022-01-06T17:16:52.056 INFO:teuthology.task.ansible.out:Thursday 06 January 2022  17:16:52 +0000 (0:00:00.307)       0:07:20.893 ****** 

2022-01-06T17:16:52.698 INFO:teuthology.task.ansible.out:
TASK [testnode : Enable SELinux] ***********************************************
Actions #2

Updated by Laura Flores over 2 years ago

And a third similar scenario where an offline filesystem leads to failed cephadm daemons (CEPHADM_FAILED_DAEMON):

Description: rados/cephadm/smoke-roleless/{0-distro/ubuntu_20.04 0-nvme-loop 1-start 2-services/nfs-ingress 3-final}

Failure reason: hit max job timeout

Jobs:
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599055
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599082

2022-01-06T21:57:30.302 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:30 smithi005 bash[11180]: audit 2022-01-06T21:57:29.036464+0000 mon.smithi005 (mon.0) 651 : audit [INF] from='mgr.14206 172.21.15.5:0/2764321477' entity='mgr.smithi005.fpqapy' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]: dispatch
2022-01-06T21:57:30.303 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:30 smithi005 bash[11180]: cluster 2022-01-06T21:57:29.125324+0000 mgr.smithi005.fpqapy (mgr.14206) 195 : cluster [DBG] pgmap v177: 1 pgs: 1 active+clean; 577 KiB data, 46 MiB used, 715 GiB / 715 GiB avail
2022-01-06T21:57:30.390 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:57:30 smithi031 bash[13845]: audit 2022-01-06T21:57:29.035515+0000 mgr.smithi005.fpqapy (mgr.14206) 194 : audit [DBG] from='client.14540 -' entity='client.admin' cmd=[{"prefix": "fs volume create", "name": "foofs", "target": ["mon-mgr", ""]}]: dispatch
2022-01-06T21:57:30.390 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:57:30 smithi031 bash[13845]: audit 2022-01-06T21:57:29.036464+0000 mon.smithi005 (mon.0) 651 : audit [INF] from='mgr.14206 172.21.15.5:0/2764321477' entity='mgr.smithi005.fpqapy' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]: dispatch
2022-01-06T21:57:30.390 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:57:30 smithi031 bash[13845]: cluster 2022-01-06T21:57:29.125324+0000 mgr.smithi005.fpqapy (mgr.14206) 195 : cluster [DBG] pgmap v177: 1 pgs: 1 active+clean; 577 KiB data, 46 MiB used, 715 GiB / 715 GiB avail
2022-01-06T21:57:31.136 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:31 smithi005 bash[11180]: debug 2022-01-06T21:57:31.045+0000 7f17651e8700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2022-01-06T21:57:31.136 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:31 smithi005 bash[11180]: audit 2022-01-06T21:57:30.036464+0000 mon.smithi005 (mon.0) 652 : audit [INF] from='mgr.14206 172.21.15.5:0/2764321477' entity='mgr.smithi005.fpqapy' cmd='[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]': finished
2022-01-06T21:58:40.526 INFO:teuthology.orchestra.run.smithi005.stdout:[{"placement": {"count": 1}, "service_name": "alertmanager", "service_type": "alertmanager", "status": {"created": "2022-01-06T21:51:00.275167Z", "last_refresh": "2022-01-06T21:58:38.559512Z", "ports": [9093, 9094], "running": 1, "size": 1}}, {"placement": {"host_pattern": "*"}, "service_name": "crash", "service_type": "crash", "status": {"created": "2022-01-06T21:50:50.887947Z", "last_refresh": "2022-01-06T21:58:37.072722Z", "running": 2, "size": 2}}, {"placement": {"count": 1}, "service_name": "grafana", "service_type": "grafana", "status": {"created": "2022-01-06T21:50:55.649076Z", "last_refresh": "2022-01-06T21:58:38.559719Z", "ports": [3000], "running": 1, "size": 1}}, {"events": ["2022-01-06T21:57:37.915361Z service:ingress.nfs.foo [INFO] \"service was created\""], "placement": {"count": 2}, "service_id": "nfs.foo", "service_name": "ingress.nfs.foo", "service_type": "ingress", "spec": {"backend_service": "nfs.foo", "frontend_port": 2049, "monitor_port": 9002, "virtual_ip": "10.0.31.5/16"}, "status": {"created": "2022-01-06T21:57:37.910852Z", "last_refresh": "2022-01-06T21:58:37.075244Z", "ports": [2049, 9002], "running": 3, "size": 4, "virtual_ip": "10.0.31.5/16"}}, {"events": ["2022-01-06T21:57:31.078479Z service:mds.foofs [INFO] \"service was created\""], "placement": {"count": 2}, "service_id": "foofs", "service_name": "mds.foofs", "service_type": "mds", "status": {"created": "2022-01-06T21:57:31.074689Z", "last_refresh": "2022-01-06T21:58:37.074755Z", "running": 2, "size": 2}}, {"placement": {"count": 2}, "service_name": "mgr", "service_type": "mgr", "status": {"created": "2022-01-06T21:50:48.757876Z", "last_refresh": "2022-01-06T21:58:37.073083Z", "running": 2, "size": 2}}, {"placement": {"count": 2, "hosts": ["smithi005:172.21.15.5=smithi005", "smithi031:172.21.15.31=smithi031"]}, "service_name": "mon", "service_type": "mon", "status": {"created": "2022-01-06T21:51:55.847078Z", "last_refresh": "2022-01-06T21:58:37.073332Z", "running": 2, "size": 2}}, {"events": ["2022-01-06T21:57:37.910505Z service:nfs.foo [INFO] \"service was created\""], "placement": {"count": 2}, "service_id": "foo", "service_name": "nfs.foo", "service_type": "nfs", "spec": {"port": 12049}, "status": {"created": "2022-01-06T21:57:37.905948Z", "last_refresh": "2022-01-06T21:58:37.075048Z", "ports": [12049], "running": 2, "size": 2}}, {"placement": {"host_pattern": "*"}, "service_name": "node-exporter", "service_type": "node-exporter", "status": {"created": "2022-01-06T21:50:58.088177Z", "last_refresh": "2022-01-06T21:58:37.073559Z", "ports": [9100], "running": 2, "size": 2}}, {"events": ["2022-01-06T21:53:27.529310Z service:osd.all-available-devices [INFO] \"service was created\""], "placement": {"host_pattern": "*"}, "service_id": "all-available-devices", "service_name": "osd.all-available-devices", "service_type": "osd", "spec": {"data_devices": {"all": true}, "filter_logic": "AND", "objectstore": "bluestore"}, "status": {"created": "2022-01-06T21:53:27.522984Z", "last_refresh": "2022-01-06T21:58:37.073802Z", "running": 8, "size": 8}}, {"placement": {"count": 1}, "service_name": "prometheus", "service_type": "prometheus", "status": {"created": "2022-01-06T21:50:53.020786Z", "last_refresh": "2022-01-06T21:58:38.559921Z", "ports": [9095], "running": 1, "size": 1}}]
2022-01-06T21:58:40.982 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:58:40 smithi031 bash[13845]: cluster 2022-01-06T21:58:39.567636+0000 mon.smithi005 (mon.0) 747 : cluster [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2022-01-06T21:58:41.053 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:58:40 smithi005 bash[11180]: cluster 2022-01-06T21:58:39.567636+0000 mon.smithi005 (mon.0) 747 : cluster [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2022-01-06T21:58:41.698 INFO:tasks.cephadm:nfs.foo has 2/2
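The `orch ls` JSON above already shows the stuck state: `ingress.nfs.foo` reports `"running": 3` against `"size": 4`, while every other service is fully deployed. A quick way to surface such services from that JSON (a hypothetical helper, shown only as a sketch; the `sample` excerpt below mirrors two entries from the output above):

```python
import json

def underdeployed(services_json: str):
    """Return service names whose running daemon count is below the target size."""
    return [
        s["service_name"]
        for s in json.loads(services_json)
        if s.get("status", {}).get("running", 0) < s.get("status", {}).get("size", 0)
    ]

# Minimal excerpt mirroring the orch ls output above:
sample = json.dumps([
    {"service_name": "ingress.nfs.foo", "status": {"running": 3, "size": 4}},
    {"service_name": "nfs.foo", "status": {"running": 2, "size": 2}},
])
print(underdeployed(sample))  # ['ingress.nfs.foo']
```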
Actions #3

Updated by Jeff Layton over 2 years ago

  • Assignee set to Venky Shankar

Looking at /a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599082/remote/smithi137/log/6ffb065c-6f3e-11ec-8c32-001a4aab830c

2022-01-06T22:36:04.416+0000 7ff834993900  0 set uid:gid to 167:167 (ceph:ceph)
2022-01-06T22:36:04.416+0000 7ff834993900  0 ceph version 17.0.0-9958-g09cb93b0 (09cb93b02b9e3e136a791fb4aa165fc1d446de8c) quincy (dev), process ceph-mds, pid 7
2022-01-06T22:36:04.416+0000 7ff834993900  1 main not setting numa affinity
2022-01-06T22:36:04.416+0000 7ff834993900  0 pidfile_write: ignore empty --pid-file
2022-01-06T22:36:04.419+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 2 from mon.0
2022-01-06T22:36:04.557+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 3 from mon.0
2022-01-06T22:36:04.557+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Monitors have assigned me to become a standby.
2022-01-06T22:36:04.561+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 4 from mon.0
2022-01-06T22:36:04.562+0000 7ff82abbc700  1 mds.0.4 handle_mds_map i am now mds.0.4
2022-01-06T22:36:04.562+0000 7ff82abbc700  1 mds.0.4 handle_mds_map state change up:boot --> up:creating
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x1
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x100
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x600
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x601
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x602
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x603
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x604
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x605
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x606
2022-01-06T22:36:04.563+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x607
2022-01-06T22:36:04.563+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x608
2022-01-06T22:36:04.563+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x609
2022-01-06T22:36:04.572+0000 7ff824bb0700  1 mds.0.4 creating_done
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 5 from mon.0
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 handle_mds_map i am now mds.0.4
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 handle_mds_map state change up:creating --> up:active
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 recovery_done -- successful recovery!
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 active_start
2022-01-06T22:36:10.561+0000 7ff8283b7700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2022-01-07T03:51:01.262+0000 7ff82c3bf700 -1 received  signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0

There's a more recent log too, but it's scrambled. The is_rank_lagging message may be significant here, but it looks like it started up, and was running until it was shut down (probably via systemd).

The other job (6599055) looks similar, there is just no log message about a SIGHUP.

Actions #4

Updated by Venky Shankar over 2 years ago

Jeff Layton wrote:

Looking at /a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599082/remote/smithi137/log/6ffb065c-6f3e-11ec-8c32-001a4aab830c

[...]

There's a more recent log too, but it's scrambled. The is_rank_lagging message may be significant here, but it looks like it started up, and was running until it was shut down (probably via systemd).

is_rank_lagging should be harmless - it's part of the metrics machinery in the MDS and does not affect the MDS's boot procedure.

The other job (6599055) looks similar, there is just no log message about a SIGHUP.

Actions #5

Updated by Venky Shankar over 2 years ago

Is this related to CephFS? Comment https://tracker.ceph.com/issues/53807#note-1 indicates this is being hit with rados jobs too.

Actions #6

Updated by Laura Flores over 2 years ago

  • Project changed from CephFS to Ceph
Actions #7

Updated by Laura Flores over 2 years ago

Moved this Tracker out of CephFS, as offline filesystems on this particular test appear even in successful runs.

Example successful run: /a/yuriw-2022-01-04_18:45:05-rados-wip-yuriw-master-1.1.22-distro-default-smithi/6595039

2022-01-04T21:05:24.444 INFO:journalctl@ceph.mon.smithi149.smithi149.stdout:Jan 04 21:05:23 smithi149 ceph-mon[29224]: from='mgr.14214 172.21.15.149:0/4025601137' entity='mgr.smithi149.axflmp' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.data"}]: dispatch
2022-01-04T21:05:24.445 INFO:journalctl@ceph.mon.smithi149.smithi149.stdout:Jan 04 21:05:23 smithi149 conmon[29200]: 2022-01-04T21:05:23.994+0000 7f2cefb7c700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2022-01-04T21:05:25.034 DEBUG:teuthology.orchestra.run.smithi149:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:7b5bbfea3dc99d59b2173c093177ae92f881f823 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 15c89698-6da1-11ec-8c32-001a4aab830c -- bash -c 'ceph nfs cluster create foo --ingress --virtual-ip 10.0.31.149/16 --port 2999'

Hidden Ansible output is also normal. The root cause must be something else:

TASK [common : Check firewalld status] *****************************************

2022-01-04T20:51:41.958 INFO:teuthology.task.ansible.out:fatal: [smithi149.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
...ignoring

2022-01-04T20:51:41.979 INFO:teuthology.task.ansible.out:Tuesday 04 January 2022  20:51:41 +0000 (0:00:00.081)       0:03:15.747 ******* 

2022-01-04T20:51:42.016 INFO:teuthology.task.ansible.out:
TASK [common : Open nrpe port if firewalld enabled] ****************************

Actions #8

Updated by Laura Flores over 2 years ago

  • Subject changed from Hidden ansible output and offline filesystem failures lead to dead jobs to Dead jobs in rados/cephadm/smoke-roleless{...}
Actions #9

Updated by Laura Flores over 2 years ago

  • Project changed from Ceph to Orchestrator
Actions #10

Updated by Aishwarya Mathuria over 2 years ago

/a/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/6614725
/a/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/6614681
/a/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/6614665

Actions #11

Updated by Sebastian Wagner about 2 years ago

  • Has duplicate Bug #53904: cephadm: ingress jobs stuck added
Actions #12

Updated by Sebastian Wagner about 2 years ago

  • Subject changed from Dead jobs in rados/cephadm/smoke-roleless{...} to Dead jobs in rados/cephadm/smoke-roleless{...}: ingress jobs stuck
Actions #13

Updated by Sebastian Wagner about 2 years ago

  • Priority changed from Normal to Immediate
Actions #14

Updated by Venky Shankar about 2 years ago

  • Assignee changed from Venky Shankar to Sebastian Wagner

Reassigning to cephadm lead.

Actions #15

Updated by Melissa Li about 2 years ago

  • Assignee changed from Sebastian Wagner to Melissa Li

On a teuthology node with the stuck job:

   {
        "style": "cephadm:v1",
        "name": "haproxy.nfs.foo.smithi086.rilsmn",
        "fsid": "677afccc-7d61-11ec-8c35-001a4aab830c",
        "systemd_unit": "ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn",
        "enabled": true,
        "state": "stopped",
        "service_name": "ingress.nfs.foo",
        "ports": [
            2999,
            9999
        ],
        "ip": null,
        "deployed_by": [
            "quay.ceph.io/ceph-ci/ceph@sha256:4f125c7c6b9f2347c45fc02cd9dac333ee5730d930fbbe70f27ae87ecb849842" 
        ],
        "rank": null,
        "rank_generation": null,
        "extra_container_args": null,
        "memory_request": null,
        "memory_limit": null,
        "container_id": null,
        "container_image_name": "docker.io/library/haproxy:2.3",
        "container_image_id": null,
        "container_image_digests": null,
        "version": null,
        "started": null,
        "created": "2022-01-24T22:10:42.670979Z",
        "deployed": "2022-01-24T22:10:41.647001Z",
        "configured": "2022-01-24T22:10:42.670979Z" 
    },

The haproxy logs:

[root@smithi086 cephtest]# ./cephadm logs --name haproxy.nfs.foo.smithi086.rilsmn | tee haproxy.log
Inferring fsid 677afccc-7d61-11ec-8c35-001a4aab830c
-- Logs begin at Mon 2022-01-24 21:56:24 UTC, end at Wed 2022-01-26 16:47:54 UTC. --
Jan 24 22:10:41 smithi086 systemd[1]: Starting Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c...
Jan 24 22:10:42 smithi086 conmon[57221]: [NOTICE] 023/221042 (7) : haproxy version is 2.3.17-d1c9119
Jan 24 22:10:42 smithi086 conmon[57221]: [NOTICE] 023/221042 (7) : path to executable is /usr/local/sbin/haproxy
Jan 24 22:10:42 smithi086 conmon[57221]: [ALERT] 023/221042 (7) : Starting frontend stats: cannot bind socket (Cannot assign requested address) [10.0.31.35:9999]
Jan 24 22:10:42 smithi086 conmon[57221]: [ALERT] 023/221042 (7) : Starting frontend frontend: cannot bind socket (Cannot assign requested address) [10.0.31.35:2999]
Jan 24 22:10:42 smithi086 conmon[57221]: [ALERT] 023/221042 (7) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.
Jan 24 22:10:42 smithi086 bash[57031]: 0dbc37f5c5cf924909892a20c3aa791436cae779913cbac45662cd51ffa60327
Jan 24 22:10:42 smithi086 systemd[1]: Started Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
Jan 24 22:10:43 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Main process exited, code=exited, status=1/FAILURE
Jan 24 22:10:43 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Failed with result 'exit-code'.
Jan 24 22:10:53 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Service RestartSec=10s expired, scheduling restart.
Jan 24 22:10:53 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Scheduled restart job, restart counter is at 1.
Jan 24 22:10:53 smithi086 systemd[1]: Stopped Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
Jan 24 22:10:53 smithi086 systemd[1]: Starting Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c...
Jan 24 22:10:54 smithi086 conmon[58096]: [NOTICE] 023/221054 (7) : New worker #1 (9) forked
Jan 24 22:10:54 smithi086 bash[57906]: 97bdb5da26b18f8ea05c497035a2c72d051cc831a3b0656b5446e439f1971ea7
Jan 24 22:10:54 smithi086 systemd[1]: Started Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
Jan 24 22:11:53 smithi086 systemd[1]: Stopping Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c...
Jan 24 22:11:54 smithi086 bash[65491]: Error: no container with name or ID ceph-677afccc-7d61-11ec-8c35-001a4aab830c-haproxy.nfs.foo.smithi086.rilsmn found: no such container
Jan 24 22:11:54 smithi086 conmon[58096]: [WARNING] 023/221154 (7) : Exiting Master process...
Jan 24 22:11:54 smithi086 conmon[58096]: [NOTICE] 023/221154 (7) : haproxy version is 2.3.17-d1c9119
Jan 24 22:11:54 smithi086 conmon[58096]: [NOTICE] 023/221154 (7) : path to executable is /usr/local/sbin/haproxy
Jan 24 22:11:54 smithi086 conmon[58096]: [ALERT] 023/221154 (7) : Current worker #1 (9) exited with code 143 (Terminated)
Jan 24 22:11:54 smithi086 conmon[58096]: [WARNING] 023/221154 (7) : All workers exited. Exiting... (0)
Jan 24 22:11:54 smithi086 bash[65491]: 97bdb5da26b18f8ea05c497035a2c72d051cc831a3b0656b5446e439f1971ea7
Jan 24 22:11:54 smithi086 bash[65491]: Error: no container with name or ID ceph-677afccc-7d61-11ec-8c35-001a4aab830c-haproxy.nfs.foo.smithi086.rilsmn found: no such container
Jan 24 22:11:54 smithi086 systemd[1]: Stopped Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
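The ALERT lines above are the crux: haproxy exits because it cannot bind its frontends to the virtual IP 10.0.31.35, presumably because the VIP had not (yet) been assigned to the host's interface (for the ingress service, keepalived normally holds it). This failure mode can be sketched in a few lines; the non-local address used here is hypothetical, not taken from the job:

```python
import errno
import socket

def try_bind(ip: str, port: int) -> int:
    """Return 0 if a TCP listener can bind to (ip, port), else the bind errno."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return 0
    except OSError as e:
        # Binding to an address not configured on any local interface fails
        # with EADDRNOTAVAIL ("Cannot assign requested address"), which is
        # exactly what haproxy reports before exiting.
        return e.errno
    finally:
        s.close()

print(try_bind("127.0.0.1", 0))        # 0: loopback is always local
print(try_bind("10.255.255.1", 2999))  # typically errno.EADDRNOTAVAIL (99)
```

(On Linux this can be relaxed with the `net.ipv4.ip_nonlocal_bind` sysctl, which is one reason such a bind succeeds once the VIP handling is fixed.)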
Actions #16

Updated by Guillaume Abrioux about 2 years ago

  • Status changed from New to In Progress
  • Assignee changed from Melissa Li to Guillaume Abrioux
Actions #17

Updated by Guillaume Abrioux about 2 years ago

  • Pull request ID set to 45014
Actions #18

Updated by Guillaume Abrioux about 2 years ago

  • Status changed from In Progress to Fix Under Review
Actions #19

Updated by Guillaume Abrioux about 2 years ago

  • Backport set to quincy,pacific,octopus
Actions #20

Updated by Laura Flores about 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #21

Updated by Adam King about 2 years ago

The fix is in pacific now via https://github.com/ceph/ceph/pull/44628. The quincy backport is in testing: https://github.com/ceph/ceph/pull/45038

Actions #23

Updated by Laura Flores about 2 years ago

  • Status changed from Pending Backport to Resolved