Bug #53807


Dead jobs in rados/cephadm/smoke-roleless{...}: ingress jobs stuck

Added by Laura Flores over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Immediate
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/cephadm/smoke-roleless/{0-distro/centos_8.3_container_tools_3.0 0-nvme-loop 1-start 2-services/nfs-ingress 3-final}

Failure Reason: hit max job timeout

Jobs:
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598774
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598785
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599316
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599350

Earlier in the log:

2022-01-06T16:33:28.615 INFO:teuthology.task.ansible.out:
TASK [common : Check firewalld status] *****************************************

2022-01-06T16:33:28.617 INFO:teuthology.task.ansible.out:fatal: [smithi107.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
...ignoring

2022-01-06T16:33:28.638 INFO:teuthology.task.ansible.out:Thursday 06 January 2022  16:33:28 +0000 (0:00:00.260)       0:02:03.410 ******

Later in the log:

2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 ceph-mon[30485]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.data"}]: dispatch
2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 ceph-mon[30485]: pgmap v133: 33 pgs: 32 unknown, 1 active+clean; 577 KiB data, 47 MiB used, 715 GiB / 715 GiB avail
2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 ceph-mon[30485]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho'
2022-01-06T16:44:13.943 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Jan 06 16:44:13 smithi107 conmon[30462]: 2022-01-06T16:44:13.732+0000 7fa865de5700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2022-01-06T16:44:14.150 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho' cmd='[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]': finished
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: osdmap e43: 8 total, 8 up, 8 in
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.data"}]: dispatch
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: pgmap v133: 33 pgs: 32 unknown, 1 active+clean; 577 KiB data, 47 MiB used, 715 GiB / 715 GiB avail
2022-01-06T16:44:14.151 INFO:journalctl@ceph.mon.smithi150.smithi150.stdout:Jan 06 16:44:13 smithi150 ceph-mon[37940]: from='mgr.14216 172.21.15.107:0/4007838189' entity='mgr.smithi107.tttsho'
2022-01-06T16:44:14.535 INFO:teuthology.run_tasks:Running task cephadm.apply...


Related issues 1 (0 open, 1 closed)

Has duplicate Orchestrator - Bug #53904: cephadm: ingress jobs stuck (Duplicate) - Melissa Li

Actions #1

Updated by Laura Flores over 2 years ago

Another similar scenario, which does not involve offline filesystems:

Description: rados/cephadm/smoke-roleless/{0-distro/rhel_8.4_container_tools_rhel8 0-nvme-loop 1-start 2-services/rgw-ingress 3-final}

Failure reason: hit max job timeout

Jobs:
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598830
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599155

TASK [common : Check firewalld status] *****************************************

2022-01-06T17:12:46.159 INFO:teuthology.task.ansible.out:fatal: [smithi158.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
...ignoring

2022-01-06T17:12:46.180 INFO:teuthology.task.ansible.out:Thursday 06 January 2022  17:12:46 +0000 (0:00:00.236)       0:03:15.016 ****** 

2022-01-06T17:12:46.208 INFO:teuthology.task.ansible.out:
TASK [common : Open nrpe port if firewalld enabled] ****************************
TASK [testnode : Stop and disable iptables] ************************************

2022-01-06T17:16:52.036 INFO:teuthology.task.ansible.out:fatal: [smithi179.front.sepia.ceph.com]: FAILED! => {"changed": false, "msg": "Could not find the requested service iptables: host"}
...ignoring

2022-01-06T17:16:52.056 INFO:teuthology.task.ansible.out:Thursday 06 January 2022  17:16:52 +0000 (0:00:00.307)       0:07:20.893 ****** 

2022-01-06T17:16:52.698 INFO:teuthology.task.ansible.out:
TASK [testnode : Enable SELinux] ***********************************************
Actions #2

Updated by Laura Flores over 2 years ago

And a third similar scenario where an offline filesystem leads to failed cephadm daemons (CEPHADM_FAILED_DAEMON):

Description: rados/cephadm/smoke-roleless/{0-distro/ubuntu_20.04 0-nvme-loop 1-start 2-services/nfs-ingress 3-final}

Failure reason: hit max job timeout

Jobs:
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599055
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599082

2022-01-06T21:57:30.302 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:30 smithi005 bash[11180]: audit 2022-01-06T21:57:29.036464+0000 mon.smithi005 (mon.0) 651 : audit [INF] from='mgr.14206 172.21.15.5:0/2764321477' entity='mgr.smithi005.fpqapy' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]: dispatch
2022-01-06T21:57:30.303 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:30 smithi005 bash[11180]: cluster 2022-01-06T21:57:29.125324+0000 mgr.smithi005.fpqapy (mgr.14206) 195 : cluster [DBG] pgmap v177: 1 pgs: 1 active+clean; 577 KiB data, 46 MiB used, 715 GiB / 715 GiB avail
2022-01-06T21:57:30.390 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:57:30 smithi031 bash[13845]: audit 2022-01-06T21:57:29.035515+0000 mgr.smithi005.fpqapy (mgr.14206) 194 : audit [DBG] from='client.14540 -' entity='client.admin' cmd=[{"prefix": "fs volume create", "name": "foofs", "target": ["mon-mgr", ""]}]: dispatch
2022-01-06T21:57:30.390 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:57:30 smithi031 bash[13845]: audit 2022-01-06T21:57:29.036464+0000 mon.smithi005 (mon.0) 651 : audit [INF] from='mgr.14206 172.21.15.5:0/2764321477' entity='mgr.smithi005.fpqapy' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]: dispatch
2022-01-06T21:57:30.390 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:57:30 smithi031 bash[13845]: cluster 2022-01-06T21:57:29.125324+0000 mgr.smithi005.fpqapy (mgr.14206) 195 : cluster [DBG] pgmap v177: 1 pgs: 1 active+clean; 577 KiB data, 46 MiB used, 715 GiB / 715 GiB avail
2022-01-06T21:57:31.136 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:31 smithi005 bash[11180]: debug 2022-01-06T21:57:31.045+0000 7f17651e8700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2022-01-06T21:57:31.136 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:57:31 smithi005 bash[11180]: audit 2022-01-06T21:57:30.036464+0000 mon.smithi005 (mon.0) 652 : audit [INF] from='mgr.14206 172.21.15.5:0/2764321477' entity='mgr.smithi005.fpqapy' cmd='[{"prefix": "osd pool create", "pool": "cephfs.foofs.meta"}]': finished
2022-01-06T21:58:40.526 INFO:teuthology.orchestra.run.smithi005.stdout:[{"placement": {"count": 1}, "service_name": "alertmanager", "service_type": "alertmanager", "status": {"created": "2022-01-06T21:51:00.275167Z", "last_refresh": "2022-01-06T21:58:38.559512Z", "ports": [9093, 9094], "running": 1, "size": 1}}, {"placement": {"host_pattern": "*"}, "service_name": "crash", "service_type": "crash", "status": {"created": "2022-01-06T21:50:50.887947Z", "last_refresh": "2022-01-06T21:58:37.072722Z", "running": 2, "size": 2}}, {"placement": {"count": 1}, "service_name": "grafana", "service_type": "grafana", "status": {"created": "2022-01-06T21:50:55.649076Z", "last_refresh": "2022-01-06T21:58:38.559719Z", "ports": [3000], "running": 1, "size": 1}}, {"events": ["2022-01-06T21:57:37.915361Z service:ingress.nfs.foo [INFO] \"service was created\""], "placement": {"count": 2}, "service_id": "nfs.foo", "service_name": "ingress.nfs.foo", "service_type": "ingress", "spec": {"backend_service": "nfs.foo", "frontend_port": 2049, "monitor_port": 9002, "virtual_ip": "10.0.31.5/16"}, "status": {"created": "2022-01-06T21:57:37.910852Z", "last_refresh": "2022-01-06T21:58:37.075244Z", "ports": [2049, 9002], "running": 3, "size": 4, "virtual_ip": "10.0.31.5/16"}}, {"events": ["2022-01-06T21:57:31.078479Z service:mds.foofs [INFO] \"service was created\""], "placement": {"count": 2}, "service_id": "foofs", "service_name": "mds.foofs", "service_type": "mds", "status": {"created": "2022-01-06T21:57:31.074689Z", "last_refresh": "2022-01-06T21:58:37.074755Z", "running": 2, "size": 2}}, {"placement": {"count": 2}, "service_name": "mgr", "service_type": "mgr", "status": {"created": "2022-01-06T21:50:48.757876Z", "last_refresh": "2022-01-06T21:58:37.073083Z", "running": 2, "size": 2}}, {"placement": {"count": 2, "hosts": ["smithi005:172.21.15.5=smithi005", "smithi031:172.21.15.31=smithi031"]}, "service_name": "mon", "service_type": "mon", "status": {"created": "2022-01-06T21:51:55.847078Z", "last_refresh": "2022-01-06T21:58:37.073332Z", "running": 2, "size": 2}}, {"events": ["2022-01-06T21:57:37.910505Z service:nfs.foo [INFO] \"service was created\""], "placement": {"count": 2}, "service_id": "foo", "service_name": "nfs.foo", "service_type": "nfs", "spec": {"port": 12049}, "status": {"created": "2022-01-06T21:57:37.905948Z", "last_refresh": "2022-01-06T21:58:37.075048Z", "ports": [12049], "running": 2, "size": 2}}, {"placement": {"host_pattern": "*"}, "service_name": "node-exporter", "service_type": "node-exporter", "status": {"created": "2022-01-06T21:50:58.088177Z", "last_refresh": "2022-01-06T21:58:37.073559Z", "ports": [9100], "running": 2, "size": 2}}, {"events": ["2022-01-06T21:53:27.529310Z service:osd.all-available-devices [INFO] \"service was created\""], "placement": {"host_pattern": "*"}, "service_id": "all-available-devices", "service_name": "osd.all-available-devices", "service_type": "osd", "spec": {"data_devices": {"all": true}, "filter_logic": "AND", "objectstore": "bluestore"}, "status": {"created": "2022-01-06T21:53:27.522984Z", "last_refresh": "2022-01-06T21:58:37.073802Z", "running": 8, "size": 8}}, {"placement": {"count": 1}, "service_name": "prometheus", "service_type": "prometheus", "status": {"created": "2022-01-06T21:50:53.020786Z", "last_refresh": "2022-01-06T21:58:38.559921Z", "ports": [9095], "running": 1, "size": 1}}]
2022-01-06T21:58:40.982 INFO:journalctl@ceph.mon.smithi031.smithi031.stdout:Jan 06 21:58:40 smithi031 bash[13845]: cluster 2022-01-06T21:58:39.567636+0000 mon.smithi005 (mon.0) 747 : cluster [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2022-01-06T21:58:41.053 INFO:journalctl@ceph.mon.smithi005.smithi005.stdout:Jan 06 21:58:40 smithi005 bash[11180]: cluster 2022-01-06T21:58:39.567636+0000 mon.smithi005 (mon.0) 747 : cluster [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2022-01-06T21:58:41.698 INFO:tasks.cephadm:nfs.foo has 2/2
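The `orch ls` JSON above already shows the stuck state: `ingress.nfs.foo` reports `"running": 3` against `"size": 4`, while every other service is fully deployed. A quick way to surface such services from that JSON (a hypothetical helper, shown only as a sketch; the `sample` excerpt below mirrors two entries from the output above):

```python
import json

def underdeployed(services_json: str):
    """Return service names whose running daemon count is below the target size."""
    return [
        s["service_name"]
        for s in json.loads(services_json)
        if s.get("status", {}).get("running", 0) < s.get("status", {}).get("size", 0)
    ]

# Minimal excerpt mirroring the orch ls output above:
sample = json.dumps([
    {"service_name": "ingress.nfs.foo", "status": {"running": 3, "size": 4}},
    {"service_name": "nfs.foo", "status": {"running": 2, "size": 2}},
])
print(underdeployed(sample))  # ['ingress.nfs.foo']
```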
Actions #3

Updated by Jeff Layton over 2 years ago

  • Assignee set to Venky Shankar

Looking at /a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599082/remote/smithi137/log/6ffb065c-6f3e-11ec-8c32-001a4aab830c

2022-01-06T22:36:04.416+0000 7ff834993900  0 set uid:gid to 167:167 (ceph:ceph)
2022-01-06T22:36:04.416+0000 7ff834993900  0 ceph version 17.0.0-9958-g09cb93b0 (09cb93b02b9e3e136a791fb4aa165fc1d446de8c) quincy (dev), process ceph-mds, pid 7
2022-01-06T22:36:04.416+0000 7ff834993900  1 main not setting numa affinity
2022-01-06T22:36:04.416+0000 7ff834993900  0 pidfile_write: ignore empty --pid-file
2022-01-06T22:36:04.419+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 2 from mon.0
2022-01-06T22:36:04.557+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 3 from mon.0
2022-01-06T22:36:04.557+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Monitors have assigned me to become a standby.
2022-01-06T22:36:04.561+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 4 from mon.0
2022-01-06T22:36:04.562+0000 7ff82abbc700  1 mds.0.4 handle_mds_map i am now mds.0.4
2022-01-06T22:36:04.562+0000 7ff82abbc700  1 mds.0.4 handle_mds_map state change up:boot --> up:creating
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x1
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x100
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x600
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x601
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x602
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x603
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x604
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x605
2022-01-06T22:36:04.562+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x606
2022-01-06T22:36:04.563+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x607
2022-01-06T22:36:04.563+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x608
2022-01-06T22:36:04.563+0000 7ff82abbc700  0 mds.0.cache creating system inode with ino:0x609
2022-01-06T22:36:04.572+0000 7ff824bb0700  1 mds.0.4 creating_done
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.foofs.smithi137.gkwdeh Updating MDS map to version 5 from mon.0
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 handle_mds_map i am now mds.0.4
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 handle_mds_map state change up:creating --> up:active
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 recovery_done -- successful recovery!
2022-01-06T22:36:05.565+0000 7ff82abbc700  1 mds.0.4 active_start
2022-01-06T22:36:10.561+0000 7ff8283b7700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2022-01-07T03:51:01.262+0000 7ff82c3bf700 -1 received  signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0

There's a more recent log too, but it's scrambled. The is_rank_lagging message may be significant here, but it looks like it started up, and was running until it was shut down (probably via systemd).

The other job (6599055) looks similar, there is just no log message about a SIGHUP.

Actions #4

Updated by Venky Shankar over 2 years ago

Jeff Layton wrote:

Looking at /a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6599082/remote/smithi137/log/6ffb065c-6f3e-11ec-8c32-001a4aab830c

[...]

There's a more recent log too, but it's scrambled. The is_rank_lagging message may be significant here, but it looks like it started up, and was running until it was shut down (probably via systemd).

is_rank_lagging should be harmless - it's part of the metrics machinery in the MDS and does not affect the MDS's boot procedure.

The other job (6599055) looks similar, there is just no log message about a SIGHUP.

Actions #5

Updated by Venky Shankar over 2 years ago

Is this related to CephFS? Comment https://tracker.ceph.com/issues/53807#note-1 indicates this is being hit with rados jobs too.

Actions #6

Updated by Laura Flores over 2 years ago

  • Project changed from CephFS to Ceph
Actions #7

Updated by Laura Flores over 2 years ago

Moved this Tracker out of CephFS, as offline filesystems on this particular test appear even in successful runs.

Example successful run: /a/yuriw-2022-01-04_18:45:05-rados-wip-yuriw-master-1.1.22-distro-default-smithi/6595039

2022-01-04T21:05:24.444 INFO:journalctl@ceph.mon.smithi149.smithi149.stdout:Jan 04 21:05:23 smithi149 ceph-mon[29224]: from='mgr.14214 172.21.15.149:0/4025601137' entity='mgr.smithi149.axflmp' cmd=[{"prefix": "osd pool create", "pool": "cephfs.foofs.data"}]: dispatch
2022-01-04T21:05:24.445 INFO:journalctl@ceph.mon.smithi149.smithi149.stdout:Jan 04 21:05:23 smithi149 conmon[29200]: 2022-01-04T21:05:23.994+0000 7f2cefb7c700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2022-01-04T21:05:25.034 DEBUG:teuthology.orchestra.run.smithi149:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:7b5bbfea3dc99d59b2173c093177ae92f881f823 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 15c89698-6da1-11ec-8c32-001a4aab830c -- bash -c 'ceph nfs cluster create foo --ingress --virtual-ip 10.0.31.149/16 --port 2999'

Hidden Ansible output is also normal. The root cause must be something else:

TASK [common : Check firewalld status] *****************************************

2022-01-04T20:51:41.958 INFO:teuthology.task.ansible.out:fatal: [smithi149.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
...ignoring

2022-01-04T20:51:41.979 INFO:teuthology.task.ansible.out:Tuesday 04 January 2022  20:51:41 +0000 (0:00:00.081)       0:03:15.747 ******* 

2022-01-04T20:51:42.016 INFO:teuthology.task.ansible.out:
TASK [common : Open nrpe port if firewalld enabled] ****************************

Actions #8

Updated by Laura Flores over 2 years ago

  • Subject changed from Hidden ansible output and offline filesystem failures lead to dead jobs to Dead jobs in rados/cephadm/smoke-roleless{...}
Actions #9

Updated by Laura Flores over 2 years ago

  • Project changed from Ceph to Orchestrator
Actions #10

Updated by Aishwarya Mathuria over 2 years ago

/a/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/6614725
/a/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/6614681
/a/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/6614665

Actions #11

Updated by Sebastian Wagner about 2 years ago

  • Has duplicate Bug #53904: cephadm: ingress jobs stuck added
Actions #12

Updated by Sebastian Wagner about 2 years ago

  • Subject changed from Dead jobs in rados/cephadm/smoke-roleless{...} to Dead jobs in rados/cephadm/smoke-roleless{...}: ingress jobs stuck
Actions #13

Updated by Sebastian Wagner about 2 years ago

  • Priority changed from Normal to Immediate
Actions #14

Updated by Venky Shankar about 2 years ago

  • Assignee changed from Venky Shankar to Sebastian Wagner

Reassigning to cephadm lead.

Actions #15

Updated by Melissa Li about 2 years ago

  • Assignee changed from Sebastian Wagner to Melissa Li

On a teuthology node with the stuck job:

   {
        "style": "cephadm:v1",
        "name": "haproxy.nfs.foo.smithi086.rilsmn",
        "fsid": "677afccc-7d61-11ec-8c35-001a4aab830c",
        "systemd_unit": "ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn",
        "enabled": true,
        "state": "stopped",
        "service_name": "ingress.nfs.foo",
        "ports": [
            2999,
            9999
        ],
        "ip": null,
        "deployed_by": [
            "quay.ceph.io/ceph-ci/ceph@sha256:4f125c7c6b9f2347c45fc02cd9dac333ee5730d930fbbe70f27ae87ecb849842" 
        ],
        "rank": null,
        "rank_generation": null,
        "extra_container_args": null,
        "memory_request": null,
        "memory_limit": null,
        "container_id": null,
        "container_image_name": "docker.io/library/haproxy:2.3",
        "container_image_id": null,
        "container_image_digests": null,
        "version": null,
        "started": null,
        "created": "2022-01-24T22:10:42.670979Z",
        "deployed": "2022-01-24T22:10:41.647001Z",
        "configured": "2022-01-24T22:10:42.670979Z" 
    },

The haproxy logs:

[root@smithi086 cephtest]# ./cephadm logs --name haproxy.nfs.foo.smithi086.rilsmn | tee haproxy.log
Inferring fsid 677afccc-7d61-11ec-8c35-001a4aab830c
-- Logs begin at Mon 2022-01-24 21:56:24 UTC, end at Wed 2022-01-26 16:47:54 UTC. --
Jan 24 22:10:41 smithi086 systemd[1]: Starting Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c...
Jan 24 22:10:42 smithi086 conmon[57221]: [NOTICE] 023/221042 (7) : haproxy version is 2.3.17-d1c9119
Jan 24 22:10:42 smithi086 conmon[57221]: [NOTICE] 023/221042 (7) : path to executable is /usr/local/sbin/haproxy
Jan 24 22:10:42 smithi086 conmon[57221]: [ALERT] 023/221042 (7) : Starting frontend stats: cannot bind socket (Cannot assign requested address) [10.0.31.35:9999]
Jan 24 22:10:42 smithi086 conmon[57221]: [ALERT] 023/221042 (7) : Starting frontend frontend: cannot bind socket (Cannot assign requested address) [10.0.31.35:2999]
Jan 24 22:10:42 smithi086 conmon[57221]: [ALERT] 023/221042 (7) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.
Jan 24 22:10:42 smithi086 bash[57031]: 0dbc37f5c5cf924909892a20c3aa791436cae779913cbac45662cd51ffa60327
Jan 24 22:10:42 smithi086 systemd[1]: Started Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
Jan 24 22:10:43 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Main process exited, code=exited, status=1/FAILURE
Jan 24 22:10:43 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Failed with result 'exit-code'.
Jan 24 22:10:53 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Service RestartSec=10s expired, scheduling restart.
Jan 24 22:10:53 smithi086 systemd[1]: ceph-677afccc-7d61-11ec-8c35-001a4aab830c@haproxy.nfs.foo.smithi086.rilsmn.service: Scheduled restart job, restart counter is at 1.
Jan 24 22:10:53 smithi086 systemd[1]: Stopped Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
Jan 24 22:10:53 smithi086 systemd[1]: Starting Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c...
Jan 24 22:10:54 smithi086 conmon[58096]: [NOTICE] 023/221054 (7) : New worker #1 (9) forked
Jan 24 22:10:54 smithi086 bash[57906]: 97bdb5da26b18f8ea05c497035a2c72d051cc831a3b0656b5446e439f1971ea7
Jan 24 22:10:54 smithi086 systemd[1]: Started Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
Jan 24 22:11:53 smithi086 systemd[1]: Stopping Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c...
Jan 24 22:11:54 smithi086 bash[65491]: Error: no container with name or ID ceph-677afccc-7d61-11ec-8c35-001a4aab830c-haproxy.nfs.foo.smithi086.rilsmn found: no such container
Jan 24 22:11:54 smithi086 conmon[58096]: [WARNING] 023/221154 (7) : Exiting Master process...
Jan 24 22:11:54 smithi086 conmon[58096]: [NOTICE] 023/221154 (7) : haproxy version is 2.3.17-d1c9119
Jan 24 22:11:54 smithi086 conmon[58096]: [NOTICE] 023/221154 (7) : path to executable is /usr/local/sbin/haproxy
Jan 24 22:11:54 smithi086 conmon[58096]: [ALERT] 023/221154 (7) : Current worker #1 (9) exited with code 143 (Terminated)
Jan 24 22:11:54 smithi086 conmon[58096]: [WARNING] 023/221154 (7) : All workers exited. Exiting... (0)
Jan 24 22:11:54 smithi086 bash[65491]: 97bdb5da26b18f8ea05c497035a2c72d051cc831a3b0656b5446e439f1971ea7
Jan 24 22:11:54 smithi086 bash[65491]: Error: no container with name or ID ceph-677afccc-7d61-11ec-8c35-001a4aab830c-haproxy.nfs.foo.smithi086.rilsmn found: no such container
Jan 24 22:11:54 smithi086 systemd[1]: Stopped Ceph haproxy.nfs.foo.smithi086.rilsmn for 677afccc-7d61-11ec-8c35-001a4aab830c.
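The ALERT lines above are the crux: haproxy exits because it cannot bind its frontends to the virtual IP 10.0.31.35, presumably because the VIP had not (yet) been assigned to the host's interface (for the ingress service, keepalived normally holds it). This failure mode can be sketched in a few lines; the non-local address used here is hypothetical, not taken from the job:

```python
import errno
import socket

def try_bind(ip: str, port: int) -> int:
    """Return 0 if a TCP listener can bind to (ip, port), else the bind errno."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return 0
    except OSError as e:
        # Binding to an address not configured on any local interface fails
        # with EADDRNOTAVAIL ("Cannot assign requested address"), which is
        # exactly what haproxy reports before exiting.
        return e.errno
    finally:
        s.close()

print(try_bind("127.0.0.1", 0))        # 0: loopback is always local
print(try_bind("10.255.255.1", 2999))  # typically errno.EADDRNOTAVAIL (99)
```

(On Linux this can be relaxed with the `net.ipv4.ip_nonlocal_bind` sysctl, which is one reason such a bind succeeds once the VIP handling is fixed.)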
Actions #16

Updated by Guillaume Abrioux about 2 years ago

  • Status changed from New to In Progress
  • Assignee changed from Melissa Li to Guillaume Abrioux
Actions #17

Updated by Guillaume Abrioux about 2 years ago

  • Pull request ID set to 45014
Actions #18

Updated by Guillaume Abrioux about 2 years ago

  • Status changed from In Progress to Fix Under Review
Actions #19

Updated by Guillaume Abrioux about 2 years ago

  • Backport set to quincy,pacific,octopus
Actions #20

Updated by Laura Flores about 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #21

Updated by Adam King about 2 years ago

The fix is in pacific now via https://github.com/ceph/ceph/pull/44628. The quincy backport is in testing: https://github.com/ceph/ceph/pull/45038

Actions #23

Updated by Laura Flores about 2 years ago

  • Status changed from Pending Backport to Resolved