Project

General

Profile

Bug #55258

lots of "heartbeat_check: no reply from X.X.X.X" in OSD logs

Added by Venky Shankar 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Seeing this in upgrade suite for CephFS and seems to be happening frequently: https://pulpito.ceph.com/vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177/

I think this is causing the test to fail since the workunit (fsstress in this case) does not make any progress, thereby timing out hitting the job timout (3h in this case). From the log:

2022-04-09T21:46:00.585 INFO:tasks.workunit.client.1.smithi043.stdout:4/348: creat dc/d17/f77 x:0 0 0
2022-04-09T21:46:00.588 INFO:tasks.workunit.client.1.smithi043.stdout:7/360: dwrite d4/d8/d4c/f4a [0,4194304] 0
2022-04-09T21:46:00.594 INFO:tasks.workunit.client.1.smithi043.stdout:7/361: mkdir d4/d8/da/d23/d6b/d6c/d6e 0
2022-04-09T21:46:00.594 INFO:tasks.workunit.client.1.smithi043.stdout:7/362: chown d4/d36 3151 1
2022-04-09T21:46:21.136 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6806 osd.0 since back 2022-04-09T21:45:55.338212+0000 front 2022-04-09T21:46:09.319056+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:21.137 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6814 osd.1 since back 2022-04-09T21:45:55.327589+0000 front 2022-04-09T21:46:08.218398+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:21.137 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:20 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:20.741+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6822 osd.2 since back 2022-04-09T21:45:55.337931+0000 front 2022-04-09T21:46:00.617663+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
2022-04-09T21:46:22.136 INFO:journalctl@ceph.osd.4.smithi043.stdout:Apr 09 21:46:21 smithi043 ceph-481700d4-b84d-11ec-8c37-001a4aab830c-osd.4[35355]: debug 2022-04-09T21:46:21.790+0000 7fb1f5bf8700 -1 osd.4 43 heartbeat_check: no reply from 172.21.15.5:6806 osd.0 since back 2022-04-09T21:45:55.338212+0000 front 2022-04-09T21:46:09.319056+0000 (oldest deadline 2022-04-09T21:46:20.617069+0000)
...
...
...

Happens (mostly) with fs:upgrade, but not always. Also, this does not involve thrashing the OSDs, so not sure why such messages are showing up.

History

#1 Updated by Neha Ojha 7 months ago

  • Status changed from New to Need More Info

I am not seeing any daemon logs for vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177, in fact nothing for smithi043.

2022-04-09T21:23:32.099 INFO:teuthology.task.internal:roles: ubuntu@smithi005.front.sepia.ceph.com - ['host.a', 'client.0', 'osd.0', 'osd.1', 'osd.2']
2022-04-09T21:23:32.099 INFO:teuthology.task.internal:roles: ubuntu@smithi043.front.sepia.ceph.com - ['host.b', 'client.1', 'osd.3', 'osd.4', 'osd.5']
nojha@teuthology:/ceph/teuthology-archive/vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177/remote$ ls -lrt
total 0
drwxrwxr-x 1 teuthworker teuthworker 1 Apr 10 01:02 smithi005
nojha@teuthology:/ceph/teuthology-archive/vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177/remote/smithi005$ ls -lrt
total 0
drwxrwxr-x 1 teuthworker teuthworker 0 Apr 10 01:02 crash

Do you have a run in which the daemon logs are available? When did you start seeing this issue?

#2 Updated by Venky Shankar 7 months ago

Neha Ojha wrote:

I am not seeing any daemon logs for vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177, in fact nothing for smithi043.

[...]

[...]

[...]

Do you have a run in which the daemon logs are available? When did you start seeing this issue?

Hmmm. none of the job that fail due to this issue have any logs. Latest:

https://pulpito.ceph.com/vshankar-2022-04-25_04:05:41-fs-wip-vshankar-testing-20220420-113951-testing-default-smithi/6804360/
https://pulpito.ceph.com/vshankar-2022-04-25_04:05:41-fs-wip-vshankar-testing-20220420-113951-testing-default-smithi/6804400/

In my last test run, I hit this frequently with fs:upgrade. I wonder what's happening. Let me try with a disto kernel and see if I can narrow down the scope of the problem. (since its easily reproducible lately).

Cheers,
Venky

#3 Updated by Venky Shankar 7 months ago

  • Project changed from RADOS to Linux kernel client
  • Category deleted (Correctness/Safety)
  • Assignee changed from Neha Ojha to Jeff Layton

Venky Shankar wrote:

Neha Ojha wrote:

I am not seeing any daemon logs for vshankar-2022-04-09_12:55:41-fs-wip-vshankar-testing-55110-20220408-203242-testing-default-smithi/6784177, in fact nothing for smithi043.

[...]

[...]

[...]

Do you have a run in which the daemon logs are available? When did you start seeing this issue?

Hmmm. none of the job that fail due to this issue have any logs. Latest:

https://pulpito.ceph.com/vshankar-2022-04-25_04:05:41-fs-wip-vshankar-testing-20220420-113951-testing-default-smithi/6804360/
https://pulpito.ceph.com/vshankar-2022-04-25_04:05:41-fs-wip-vshankar-testing-20220420-113951-testing-default-smithi/6804400/

In my last test run, I hit this frequently with fs:upgrade. I wonder what's happening. Let me try with a disto kernel and see if I can narrow down the scope of the problem. (since its easily reproducible lately).

No failures seen with distro kernel - https://pulpito.ceph.com/vshankar-2022-04-26_15:19:12-fs-wip-vshankar-testing-20220420-113951-distro-default-smithi/

Sorry for the noise, Neha. This seems to be happening with testing kernels and not an issue with OSD/RADOS.

#5 Updated by Jeff Layton 7 months ago

We're not carrying much in the testing kernel currently other than cephfs patches. We have a small libceph patch, but it's trivial and I wouldn't expect any issues from it. It may be that this issue is unrelated to ceph, and is a problem in the larger kernel (networking issue of some sort maybe).

Once the current test run finishes, we can cut a branch that has all of the current patches based on top of a v5.17 kernel and see if the problem persists.

#6 Updated by Jeff Layton 7 months ago

I think this is probably unrelated to anything in the ceph patch pile. I see this in one the failed tests:

[  759.163883] ------------[ cut here ]------------
[  759.168666] NETDEV WATCHDOG: enp3s0f1 (ixgbe): transmit queue 7 timed out
[  759.175595] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:530 dev_watchdog+0x20f/0x250
[  759.184005] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mei_me lpc_ich i2c_i801 joydev i2c_smbus mei mfd_core ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ixgbe igb mdio ptp crc32c_intel i2c_algo_bit pps_core nvme dca nvme_core fuse
[  759.238840] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G S                5.18.0-rc2-ceph-g444283043271 #1
[  759.248268] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[  759.255780] RIP: 0010:dev_watchdog+0x20f/0x250
[  759.260321] Code: 00 e9 4d ff ff ff 48 89 df c6 05 0f 9c f6 00 01 e8 c6 fb f9 ff 44 89 e9 48 89 de 48 c7 c7 d8 de 41 82 48 89 c2 e8 ab 1f 19 00 <0f> 0b e9 2f ff ff ff e8 15 0c 1e 00 85 c0 74 b5 80 3d d8 94 f6 00
[  759.279231] RSP: 0018:ffffc9000003ce80 EFLAGS: 00010282
[  759.284545] RAX: 0000000000000000 RBX: ffff88817b0c0000 RCX: 0000000000000103
[  759.291771] RDX: 0000000000000103 RSI: ffffffff823932ad RDI: 00000000ffffffff
[  759.298999] RBP: ffff88817b0c0508 R08: 0000000000000000 R09: c0000000ffffdfff
[  759.306226] R10: 0000000000000001 R11: ffffc9000003cca0 R12: ffff88817b0c0420
[  759.313460] R13: 0000000000000007 R14: ffff88817b0c0508 R15: ffff88885fc5d480
[  759.320692] FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
[  759.328902] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  759.334751] CR2: 0000558f8488d1c0 CR3: 0000000002612003 CR4: 00000000003706e0
[  759.341980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  759.349205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  759.356433] Call Trace:
[  759.358974]  <IRQ>
[  759.361083]  ? mq_change_real_num_tx+0xd0/0xd0
[  759.365623]  call_timer_fn+0xa1/0x2c0
[  759.369385]  ? mq_change_real_num_tx+0xd0/0xd0
[  759.373927]  run_timer_softirq+0x4ff/0x520
[  759.378125]  ? lock_is_held_type+0xe3/0x140
[  759.382406]  __do_softirq+0xdc/0x483
[  759.386079]  irq_exit_rcu+0xdc/0x120
[  759.389751]  sysvec_apic_timer_interrupt+0x9e/0xc0
[  759.394632]  </IRQ>
[  759.396816]  <TASK>
[  759.399001]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  759.404234] RIP: 0010:cpuidle_enter_state+0x10a/0x4b0
[  759.409383] Code: dc e3 8f ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 43 03 00 00 31 ff e8 41 58 96 ff e8 dc fb 9d ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d
[  759.428290] RSP: 0018:ffffc900000fbe88 EFLAGS: 00000202
[  759.433612] RAX: 000000000050736d RBX: 0000000000000004 RCX: 0000000000000000
[  759.440843] RDX: 0000000000000000 RSI: ffffffff8240dec7 RDI: ffffffff823a6717
[  759.448069] RBP: ffffe8ffffc43a00 R08: 0000000000000001 R09: 0000000000000001
[  759.455343] R10: 000003c9cd8cd7ea R11: ffff88885fc6c1c4 R12: 000000b0c1b170ce
[  759.462568] R13: ffffffff827ed480 R14: 0000000000000004 R15: 0000000000000000
[  759.469807]  ? cpuidle_enter_state+0x104/0x4b0
[  759.474350]  cpuidle_enter+0x29/0x40
[  759.478025]  do_idle+0x245/0x260
[  759.481350]  cpu_startup_entry+0x19/0x20
[  759.485361]  start_secondary+0x109/0x140
[  759.489377]  secondary_startup_64_no_verify+0xd5/0xdb
[  759.494529]  </TASK>
[  759.496806] irq event stamp: 5273298
[  759.500466] hardirqs last  enabled at (5273308): [<ffffffff8116078e>] __up_console_sem+0x5e/0x70
[  759.509365] hardirqs last disabled at (5273317): [<ffffffff81160773>] __up_console_sem+0x43/0x70
[  759.525933] softirqs last  enabled at (5272412): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[  759.534657] softirqs last disabled at (5272431): [<ffffffff810e49bc>] irq_exit_rcu+0xdc/0x120
[  759.543297] ---[ end trace 0000000000000000 ]--

...which looks like a problem down in the network driver layer at first glannce.

Note that the above kernel is based on v5.18-rc2. The testing kernel has since been rebased onto -rc4 so it's possible that this is already fixed.

#7 Updated by Jeff Layton 7 months ago

One of the tests failed with a different softlockup with -rc4 based kernel:

[ 1293.900278] libceph: mon1 (1)172.21.15.150:6789 session established
[ 1293.907749] libceph: client24369 fsid 4a163edc-c60c-11ec-8c39-001a4aab830c
[ 1426.027049] watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [fsstress:118723]
[ 1426.030049] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [fsstress:118729]
[ 1426.032049] watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [fsstress:118724]
[ 1426.032054] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass i2c_i801 joydev lpc_ich crct10dif_pclmul ipmi_ssif crc32_pclmul ghash_clmulni_intel i2c_smbus mei_me wmi mei mfd_core acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ioatdma sunrpc crc32c_intel ixgbe igb mdio nvme ptp i2c_algo_bit pps_core nvme_core dca fuse
[ 1426.032100] irq event stamp: 812770
[ 1426.032101] hardirqs last  enabled at (812769): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1426.032105] hardirqs last disabled at (812770): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[ 1426.032109] softirqs last  enabled at (764054): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[ 1426.032111] softirqs last disabled at (764025): [<ffffffff810e49ec>] irq_exit_rcu+0xdc/0x120
[ 1426.032114] CPU: 7 PID: 118724 Comm: fsstress Tainted: G S                5.18.0-rc4-ceph-g122455b02775 #1
[ 1426.032116] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[ 1426.032117] RIP: 0010:lock_acquire+0x107/0x2f0
[ 1426.032121] Code: c4 20 e8 1c f4 98 00 b8 ff ff ff ff 65 0f c1 05 07 85 ec 7e 83 f8 01 0f 85 d3 01 00 00 48 83 3c 24 00 74 06 fb 0f 1f 44 00 00 <48> 8b 44 24 30 65 48 33 04 25 28 00 00 00 0f 85 d0 01 00 00 48 83
[ 1426.032123] RSP: 0018:ffffc90002a4bd48 EFLAGS: 00000206
[ 1426.032125] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[ 1426.032126] RDX: 6afe868d15c10576 RSI: ffffffff823332fd RDI: ffffffff823a6baf
[ 1426.032127] RBP: ffff88810c5e24c8 R08: 0000000000000c79 R09: 0000000000000000
[ 1426.032128] R10: 0000000000000001 R11: 6afe868d15c10576 R12: 0000000000000000
[ 1426.032129] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1426.032130] FS:  00007f3f51e07500(0000) GS:ffff88885fdc0000(0000) knlGS:0000000000000000
[ 1426.032131] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1426.032132] CR2: 00005565c3f0a000 CR3: 000000028234c003 CR4: 00000000001706e0
[ 1426.032134] Call Trace:
[ 1426.032135]  <TASK>
[ 1426.032138]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[ 1426.032156]  ? lock_release+0x13d/0x2b0
[ 1426.032161]  _raw_spin_lock+0x2f/0x40
[ 1426.032164]  ? flush_dirty_session_caps+0x82/0x150 [ceph]
[ 1426.032177]  flush_dirty_session_caps+0x82/0x150 [ceph]
[ 1426.032191]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[ 1426.032204]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[ 1426.032219]  ? __x64_sys_tee+0xc0/0xc0
[ 1426.032222]  ceph_mdsc_sync+0x66/0x690 [ceph]
[ 1426.032235]  ? lock_release+0x13d/0x2b0
[ 1426.032241]  ? up_read+0x17/0x220
[ 1426.032246]  ? __x64_sys_tee+0xc0/0xc0
[ 1426.032249]  ceph_sync_fs+0x2c/0x100 [ceph]
[ 1426.032259]  iterate_supers+0x93/0xe0
[ 1426.032264]  ksys_sync+0x60/0xa0
[ 1426.032267]  __do_sys_sync+0xa/0x10
[ 1426.032270]  do_syscall_64+0x34/0x80
[ 1426.032273]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1426.032276] RIP: 0033:0x7f3f516397cb
[ 1426.032279] Code: 73 01 c3 48 8b 0d bd 56 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 56 38 00 f7 d8 64 89 01 48
[ 1426.032280] RSP: 002b:00007fffd7170978 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[ 1426.032282] RAX: ffffffffffffffda RBX: 0000000000405c5a RCX: 00007f3f516397cb
[ 1426.032283] RDX: 000000002ef5ffa5 RSI: 000000002ef5ffa5 RDI: 00000000000000fd
[ 1426.032285] RBP: 00007fffd7170990 R08: 00007f3f519bf210 R09: 00007f3f519bf260
[ 1426.032286] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000401760
[ 1426.032287] R13: 00007fffd7170cd0 R14: 0000000000000000 R15: 0000000000000000
[ 1426.032297]  </TASK>

...and there is a bunch of other interleaved stack traces after this that are hard to read.

The above stack is stuck trying to acquire the mdsc->cap_dirty_lock. We have had some changes in the sync/fsync codepaths recently, but I don't see how they'd cause a softlockup like this. This is probably fallout from something else.

We may need to bisect to track this down. What we may want to do first is back down v5.17 and see if it passes this. If it does, then we can try backing out all of the ceph patches from an -rc4 kernel that went in since v5.17.

#8 Updated by Jeff Layton 7 months ago

I tried to run a bisect today, but I think I ran afoul of some of the randomness in teuthology. Apparently even when you specify a -k option to teuthology-suite, you sometimes still get distro kernels installed, which makes it hard to do a consistent test for a bisect. It came up inconclusive. I think I did get enough bad test results with commits before the ceph patch merge to conclude that this has little to do with ceph, and is something else broken in the kernel.

I also tried running the test on gibba, on a hunch that the problem was hw-specific, but teuthology never acquired machines for it. Not sure what's wrong there.

At that point I just decided to go for a Hail Mary, and rebase ceph-testing onto the latest master branch to see how it does. That seems to have passed on a first test run. I'm queueing another now. If it behaves, then what may be best is to rebase testing on -rc5 sometime next week.

#9 Updated by Jeff Layton 7 months ago

The command I was using to test, per Venky's suggestion:

teuthology-suite  --machine-type smithi --email <<<REDACTED>>> -p 50 --suite fs:thrash -k wip-55258 --force-priority --subset $((RANDOM % 32))/32 -c wip-vshankar-testing1-20220427-145314 -l 8

#10 Updated by Jeff Layton 7 months ago

I did another run last night with 16 jobs on the kernel based on current mainline. Half of them failed, but I don't see any evidence that it's due to kernel issues. Venky if you can take a look and confirm, that would be great.

https://pulpito.ceph.com/jlayton-2022-04-29_23:36:52-fs:thrash-wip-vshankar-testing1-20220427-145314-wip-55258-default-smithi/

If you do confirm it, then we probably just need to rebase the kernel onto -rc5 early next week.

#11 Updated by Jeff Layton 7 months ago

  • Status changed from Need More Info to Resolved

Another run, this time with testing kernel based on -rc5. Everything basically went OK. One of the tests failed due to a deep scrub problem (unrelated to client), and the second failed due to bug #55421. I think this is now resolved. Please reopen if you're still seeing issues with the latest testing kernel.

https://pulpito.ceph.com/jlayton-2022-05-02_14:27:24-fs:thrash-master-testing-default-smithi/

Also available in: Atom PDF