Bug #62698

closed

qa: fsstress.sh fails with error code 124

Added by Rishabh Dave 8 months ago. Updated 7 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): qa-suite
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/rishabh-2023-08-25_06:38:25-fs-wip-rishabh-2023aug3-b5-testing-default-smithi/7379296

The workunit halted around the time the following OSD-related messages appear in the log. There are no core dumps.

2023-08-25T12:23:07.870 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:23:07 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:23:07.552+0000 7fd64ffbf700 -1 osd.5 47 heartbeat_check: no reply from 172.21.15.39:6806 osd.0 ever on either front or back, first ping sent 2023-08-25T12:14:12.001505+0000 (oldest deadline 2023-08-25T12:14:32.001505+0000)
2023-08-25T12:23:07.871 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:23:07 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:23:07.552+0000 7fd64ffbf700 -1 osd.5 47 heartbeat_check: no reply from 172.21.15.39:6814 osd.1 ever on either front or back, first ping sent 2023-08-25T12:14:12.001505+0000 (oldest deadline 2023-08-25T12:14:32.001505+0000)
2023-08-25T12:23:07.871 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:23:07 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:23:07.552+0000 7fd64ffbf700 -1 osd.5 47 heartbeat_check: no reply from 172.21.15.39:6822 osd.2 ever on either front or back, first ping sent 2023-08-25T12:14:12.001505+0000 (oldest deadline 2023-08-25T12:14:32.001505+0000)
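
For context on the error code in the title: exit status 124 is what GNU `timeout` reports when the wrapped command exceeds its time limit. Assuming the workunit ran under such a wrapper (the report above does not say so explicitly), a minimal sketch mapping the exit statuses documented for GNU `timeout` to their meaning could look like this. The helper name `explain_exit_status` is hypothetical.

```python
# Exit statuses documented for GNU coreutils `timeout`.
# Assumption: the workunit was launched through a `timeout` wrapper.
TIMEOUT_EXIT_MEANINGS = {
    124: "command timed out (and --preserve-status was not given)",
    125: "the timeout command itself failed",
    126: "command was found but could not be invoked",
    127: "command could not be found",
    137: "command (or timeout itself) was sent SIGKILL (128 + 9)",
}


def explain_exit_status(code: int) -> str:
    """Return a short explanation for an exit status seen from `timeout`."""
    # Any other value is simply the exit status of the wrapped command.
    return TIMEOUT_EXIT_MEANINGS.get(
        code, "exit status of the wrapped command itself"
    )


print(explain_exit_status(124))
```

Read this way, 124 would mean fsstress.sh hung until the time limit expired rather than failing on its own, which fits a workunit stalled on unreachable OSDs.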
Actions #1

Updated by Rishabh Dave 8 months ago

  • Component(FS) qa-suite added
  • Labels (FS) qa, qa-failure added
Actions #2

Updated by Radoslaw Zarzynski 8 months ago

These messages mean that not even a single network heartbeat exchange succeeded between osd.5 and any of osd.0, osd.1, or osd.2. This usually indicates a serious malfunction of the underlying network.

osd.5 (...) no reply from 172.21.15.39:6806 osd.0 ever
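
To check quickly whether the silent peers share a host, the heartbeat lines can be grouped by peer address. The following is a minimal sketch, not part of teuthology or Ceph; the function name `silent_peers_by_host` and the regular expression are assumptions based only on the message format quoted in this issue.

```python
import re
from collections import defaultdict

# Matches the message portion of a heartbeat_check line as quoted in this
# issue, e.g. "osd.5 47 heartbeat_check: no reply from 172.21.15.39:6806 osd.0 ever"
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) (?P<peer>osd\.\d+) ever"
)


def silent_peers_by_host(lines):
    """Map each peer IP to the sorted list of OSDs that never answered a ping."""
    peers = defaultdict(set)
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            peers[match.group("ip")].add(match.group("peer"))
    return {ip: sorted(osds) for ip, osds in peers.items()}


# Message portions copied from the log lines in the description.
sample = [
    "osd.5 47 heartbeat_check: no reply from 172.21.15.39:6806 osd.0 ever on either front or back",
    "osd.5 47 heartbeat_check: no reply from 172.21.15.39:6814 osd.1 ever on either front or back",
    "osd.5 47 heartbeat_check: no reply from 172.21.15.39:6822 osd.2 ever on either front or back",
]

print(silent_peers_by_host(sample))
# {'172.21.15.39': ['osd.0', 'osd.1', 'osd.2']}
```

In this excerpt every silent peer sits on 172.21.15.39, which is consistent with a host-level or network-level problem rather than a single faulty OSD.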

This looks like a lab issue. I'm seeing networking problems in other services as well:

2023-08-25T12:44:11.249 ERROR:teuthology.run_tasks:Manager failed: internal.sudo
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/task/internal/__init__.py", line 452, in sudo
    yield
...
OSError: [Errno 113] No route to host
2023-08-25T12:45:51.113 INFO:teuthology.orchestra.remote:Trying to reconnect to host 'ubuntu@smithi039.front.sepia.ceph.com'
2023-08-25T12:45:51.114 DEBUG:teuthology.orchestra.connection:{'hostname': 'smithi039.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2023-08-25T12:45:51.184 DEBUG:teuthology.orchestra.remote:[Errno None] Unable to connect to port 22 on 172.21.15.39
2023-08-25T12:45:51.185 WARNING:teuthology.contextutil:'reconnect to {self.shortname}' reached maximum tries (5) after waiting for 30 seconds
Actions #3

Updated by Rishabh Dave 8 months ago

Copying the following log entries on behalf of Radoslaw:

2023-08-25T12:45:54.900 INFO:journalctl@ceph.osd.4.smithi194.stdout:Aug 25 12:45:54 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-4[120521]: debug 2023-08-25T12:45:54.642+0000 7fba78b62700 -1 osd.4 47 heartbeat_check: no reply from 172.21.15.39:6806 osd.0 ever on either front or back, first ping sent 2023-08-25T12:35:35.838249+0000 (oldest deadline 2023-08-25T12:35:55.838249+0000)
2023-08-25T12:45:54.900 INFO:journalctl@ceph.osd.4.smithi194.stdout:Aug 25 12:45:54 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-4[120521]: debug 2023-08-25T12:45:54.642+0000 7fba78b62700 -1 osd.4 47 heartbeat_check: no reply from 172.21.15.39:6814 osd.1 ever on either front or back, first ping sent 2023-08-25T12:35:35.838249+0000 (oldest deadline 2023-08-25T12:35:55.838249+0000)
2023-08-25T12:45:54.900 INFO:journalctl@ceph.osd.4.smithi194.stdout:Aug 25 12:45:54 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-4[120521]: debug 2023-08-25T12:45:54.642+0000 7fba78b62700 -1 osd.4 47 heartbeat_check: no reply from 172.21.15.39:6822 osd.2 ever on either front or back, first ping sent 2023-08-25T12:35:35.838249+0000 (oldest deadline 2023-08-25T12:35:55.838249+0000)
2023-08-25T12:45:50.479 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:45:50 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:45:50.348+0000 7fd64ffbf700 -1 osd.5 47 heartbeat_check: no reply from 172.21.15.39:6806 osd.0 ever on either front or back, first ping sent 2023-08-25T12:35:23.925703+0000 (oldest deadline 2023-08-25T12:35:43.925703+0000)
2023-08-25T12:45:50.479 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:45:50 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:45:50.348+0000 7fd64ffbf700 -1 osd.5 47 heartbeat_check: no reply from 172.21.15.39:6814 osd.1 ever on either front or back, first ping sent 2023-08-25T12:35:23.925703+0000 (oldest deadline 2023-08-25T12:35:43.925703+0000)
2023-08-25T12:45:50.479 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:45:50 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:45:50.348+0000 7fd64ffbf700 -1 osd.5 47 heartbeat_check: no reply from 172.21.15.39:6822 osd.2 ever on either front or back, first ping sent 2023-08-25T12:35:23.925703+0000 (oldest deadline 2023-08-25T12:35:43.925703+0000)
2023-08-25T12:45:50.480 INFO:journalctl@ceph.osd.5.smithi194.stdout:Aug 25 12:45:50 smithi194 ceph-e9f3b94a-4327-11ee-9b3d-001a4aab830c-osd-5[125517]: debug 2023-08-25T12:45:50.372+0000 7fd64267a700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-08-25T11:45:50.373551+0000)
Actions #4

Updated by Venky Shankar 7 months ago

Rishabh, have you seen this in any of your very recent runs?

Actions #5

Updated by Venky Shankar 7 months ago

  • Status changed from New to Can't reproduce

Rishabh, please reopen if this issue is seen again.
