Bug #55258

closed

lots of "heartbeat_check: no reply from X.X.X.X" in OSD logs

Added by Venky Shankar about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Seeing this in the upgrade suite for CephFS, and it seems to be happening frequently: https://pulpito.ceph.com/vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177/

I think this is causing the test to fail: the workunit (fsstress in this case) stops making progress, so the job hits its timeout (3h in this case). From the log:

2022-04-09T21:46:00.585 INFO:tasks.workunit.client.1.smithi043.stdout:4/348: creat dc/d17/f77 x:0 0 0
2022-04-09T21:46:00.588 INFO:tasks.workunit.client.1.smithi043.stdout:7/360: dwrite d4/d8/d4c/f4a [0,4194304] 0
2022-04-09T21:46:00.594 INFO:tasks.workunit.client.1.smithi043.stdout:7/361: mkdir d4/d8/da/d23/d6b/d6c/d6e 0
2022-04-09T21:46:00.594 INFO:tasks.workunit.client.1.smithi043.stdout:7/362: chown d4/d36 3151 1
2022-04-09T21:46:21.136 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6806 osd.0 since back 2022-04-09T21:45:55.338212+0000 front 2022-04-09T21:46:09.319056+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:21.137 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6814 osd.1 since back 2022-04-09T21:45:55.327589+0000 front 2022-04-09T21:46:08.218398+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:21.137 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6822 osd.2 since back 2022-04-09T21:45:55.337931+0000 front 2022-04-09T21:46:00.617663+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:22.136 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:21 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:21.790+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6806 osd.0 since back 2022-04-09T21:45:55.338212+0000 front 2022-04-09T21:46:09.319056+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
...
...
...

Happens (mostly) with fs:upgrade, but not always. Also, this test does not thrash the OSDs, so I am not sure why these messages are showing up.
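
To see which OSD pairs are affected and how stale the last reply was, the heartbeat_check lines can be tallied with a small script. This is only a sketch: the regular expression is inferred from the four sample lines quoted above, not from the Ceph source, and the function names (parse_heartbeat_line, count_missed_replies) are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Field layout assumed from the sample log lines in this report:
#   osd.4 43 heartbeat_check: no reply from <addr> osd.N since back <ts> front <ts> (oldest deadline <ts>)
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\S+) (?P<peer>osd\.\d+) "
    r"since back (?P<back>\S+) front (?P<front>\S+) "
    r"\(oldest deadline (?P<deadline>[^)]+)\)"
)


def parse_ts(text):
    """Parse timestamps such as 2022-04-09T21:45:55.338212+0000."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def parse_heartbeat_line(line):
    """Return a dict of fields for a heartbeat_check line, or None for any other line."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    for key in ("back", "front", "deadline"):
        record[key] = parse_ts(record[key])
    return record


def count_missed_replies(lines):
    """Count heartbeat_check messages per (reporting OSD, silent peer) pair."""
    counts = Counter()
    for line in lines:
        record = parse_heartbeat_line(line)
        if record is not None:
            counts[(record["reporter"], record["peer"])] += 1
    return counts
```

Feeding the teuthology log through count_missed_replies (for example, `count_missed_replies(open("teuthology.log"))`, where the file name is a placeholder) gives a per-pair count; the difference between the deadline and back timestamps shows how long the peer had been silent on the back network when the message was logged.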


Related issues 1 (1 open, 0 closed)

Related to Linux kernel client - Bug #64471: kernel: upgrades from quincy/v18.2.[01]/reef to main|squid fail with kernel oops (Status: New, Assignee: Xiubo Li)
