Bug #55258
lots of "heartbeat_check: no reply from X.X.X.X" in OSD logs
Status:
closed
% Done:
0%
Regression:
No
Severity:
2 - major
Description
Seeing this in the upgrade suite for CephFS, and it seems to happen frequently: https://pulpito.ceph.com/vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177/
I think this is causing the test to fail: the workunit (fsstress in this case) stops making progress, so the job eventually hits its timeout (3h in this case). From the log:
2022-04-09T21:46:00.585 INFO:tasks.workunit.client.1.smithi043.stdout:4/348: creat dc/d17/f77 x:0 0 0
2022-04-09T21:46:00.588 INFO:tasks.workunit.client.1.smithi043.stdout:7/360: dwrite d4/d8/d4c/f4a [0,4194304] 0
2022-04-09T21:46:00.594 INFO:tasks.workunit.client.1.smithi043.stdout:7/361: mkdir d4/d8/da/d23/d6b/d6c/d6e 0
2022-04-09T21:46:00.594 INFO:tasks.workunit.client.1.smithi043.stdout:7/362: chown d4/d36 3151 1
2022-04-09T21:46:21.136 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6806 osd.0 since back 2022-04-09T21:45:55.338212+0000 front 2022-04-09T21:46:09.319056+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:21.137 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6814 osd.1 since back 2022-04-09T21:45:55.327589+0000 front 2022-04-09T21:46:08.218398+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:21.137 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6822 osd.2 since back 2022-04-09T21:45:55.337931+0000 front 2022-04-09T21:46:00.617663+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:22.136 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:21 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:21.790+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6806 osd.0 since back 2022-04-09T21:45:55.338212+0000 front 2022-04-09T21:46:09.319056+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
...
This happens mostly, but not only, with fs:upgrade. The job does not thrash the OSDs, so I am not sure why these messages are showing up.
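To get a rough idea of how widespread the messages are in a given job log, something like the following could tally them per reporting OSD and unresponsive peer. This is only a sketch: the regular expression is derived from the heartbeat_check lines quoted above and may not match other Ceph releases, and the names `HEARTBEAT_RE` and `count_missed_heartbeats` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this ticket (an assumption,
# not an official format): "<reporter> <epoch> heartbeat_check: no reply
# from <addr> <peer> since back <ts> front <ts>".
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\S+) (?P<peer>osd\.\d+) "
    r"since back (?P<back>\S+) front (?P<front>\S+)"
)


def count_missed_heartbeats(lines):
    """Count heartbeat_check complaints per (reporting OSD, silent peer)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match.group("reporter"), match.group("peer"))] += 1
    return counts


# Example use against a downloaded job log (the path is hypothetical):
#
#     with open("teuthology.log") as log:
#         for (reporter, peer), n in count_missed_heartbeats(log).most_common():
#             print(f"{reporter} -> {peer}: {n}")
```

If a single reporter complains about every peer on another host, as osd.4 does about osd.0, osd.1 and osd.2 on 172.21.15.5 in the excerpt, that may point at the network path between the hosts rather than at an individual OSD, though the log alone does not confirm this.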