Bug #50281
qa: untar_snap_rm timeout
Status: Closed
Description
2021-04-09T08:46:37.116 INFO:tasks.ceph.mds.c.smithi035.stderr:2021-04-09T08:46:37.111+0000 7eff48f48700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-04-09T07:46:37.114769+0000)
2021-04-09T08:46:37.253 INFO:tasks.ceph.mds.a.smithi035.stderr:2021-04-09T08:46:37.251+0000 7f9488bec700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-04-09T07:46:37.251787+0000)
2021-04-09T08:46:37.257 INFO:tasks.ceph.osd.4.smithi172.stderr:2021-04-09T08:46:37.251+0000 7f14d39e0700 -1 osd.4 103 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.4:171295 3.3 3.f2b68aab (undecoded) ondisk+retry+write+known_if_redirected+full_force e79)
2021-04-09T08:46:37.753 INFO:tasks.ceph.osd.6.smithi172.stderr:2021-04-09T08:46:37.747+0000 7f5ed2972700 -1 osd.6 103 get_health_metrics reporting 114 slow ops, oldest is osd_op(mds.0.4:171360 4.5 4.17c5e6c5 (undecoded) ondisk+write+known_if_redirected+full_force e79)
2021-04-09T08:46:38.116 INFO:tasks.ceph.mds.c.smithi035.stderr:2021-04-09T08:46:38.111+0000 7eff48f48700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-04-09T07:46:38.114959+0000)
2021-04-09T08:46:38.254 INFO:tasks.ceph.mds.a.smithi035.stderr:2021-04-09T08:46:38.251+0000 7f9488bec700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-04-09T07:46:38.252015+0000)
2021-04-09T08:46:38.281 INFO:tasks.ceph.osd.4.smithi172.stderr:2021-04-09T08:46:38.275+0000 7f14d39e0700 -1 osd.4 103 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.4:171295 3.3 3.f2b68aab (undecoded) ondisk+retry+write+known_if_redirected+full_force e79)
2021-04-09T08:46:38.687 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-04-09T08:46:38.740 INFO:tasks.ceph.osd.6.smithi172.stderr:2021-04-09T08:46:38.735+0000 7f5ed2972700 -1 osd.6 103 get_health_metrics reporting 114 slow ops, oldest is osd_op(mds.0.4:171360 4.5 4.17c5e6c5 (undecoded) ondisk+write+known_if_redirected+full_force e79)
2021-04-09T08:46:39.127 INFO:tasks.ceph.mds.c.smithi035.stderr:2021-04-09T08:46:39.123+0000 7eff48f48700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-04-09T07:46:39.125356+0000)
2021-04-09T08:46:39.213 INFO:tasks.workunit.client.0.smithi035.stdout:'.snap/k/linux
2021-04-09T08:46:39.214 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2021-04-09T08:46:39.215 DEBUG:teuthology.orchestra.run.smithi035:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2021-04-09T08:46:39.254 INFO:tasks.ceph.mds.a.smithi035.stderr:2021-04-09T08:46:39.247+0000 7f9488bec700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-04-09T07:46:39.252176+0000)
2021-04-09T08:46:39.327 INFO:tasks.ceph.osd.4.smithi172.stderr:2021-04-09T08:46:39.323+0000 7f14d39e0700 -1 osd.4 103 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.4:171295 3.3 3.f2b68aab (undecoded) ondisk+retry+write+known_if_redirected+full_force e79)
2021-04-09T08:46:39.465 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_48c4592f88f808feaabf260b70177e37c037a379/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_48c4592f88f808feaabf260b70177e37c037a379/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_48c4592f88f808feaabf260b70177e37c037a379/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/snaps/untar_snap_rm.sh) on smithi035 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=48c4592f88f808feaabf260b70177e37c037a379 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/snaps/untar_snap_rm.sh'
From: /ceph/teuthology-archive/pdonnell-2021-04-08_22:42:24-fs-wip-pdonnell-testing-20210408.192301-distro-basic-smithi/6030440/teuthology.log
Cursory glance indicates it's just caused by a storm of slow ops. Not sure why this is new though.
The testing-branch kernel client might be related.
Updated by Patrick Donnelly about 3 years ago
/ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047654/teuthology.log
testing kernel again
Updated by Patrick Donnelly about 3 years ago
- Status changed from New to Triaged
- Assignee set to Jeff Layton
Updated by Jeff Layton about 3 years ago
Looks like the client kernel hit a general protection fault (GPF). The faulting address appears to be a poisoned list value, so this may be a use-after-free:
smithi035 login: [ 2896.447494] general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] SMP PTI
[ 2896.457547] CPU: 0 PID: 27465 Comm: kworker/0:3 Tainted: G W 5.12.0-rc4-ceph-ga390f0b08638 #1
[ 2896.467435] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
[ 2896.474970] Workqueue: events delayed_work [ceph]
[ 2896.479757] RIP: 0010:rb_erase+0x1a/0x310
[ 2896.483828] Code: fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8b 4f 10 4c 8b 47 08 48 85 c9 0f 84 f1 01 00 00 4d 85 c0 0f 84 57 02 00 00 <49> 8b 40 10 4c 89 c2 48 85 c0 75 0b e9 af 01 00 00 48 89 c2 48 89
[ 2896.502661] RSP: 0000:ffffa94e8868fda8 EFLAGS: 00010286
[ 2896.507941] RAX: ffff97a18585ae58 RBX: ffff97a18585ae58 RCX: dead000000000122
[ 2896.515130] RDX: 0000000000000001 RSI: ffff97a1c5cd9ae8 RDI: ffff97a18585ae40
[ 2896.522323] RBP: ffff97a1c5cd9000 R08: dead000000000100 R09: 0000000000000002
[ 2896.529507] R10: 0000000000000000 R11: ffff97a3c202b230 R12: ffffa94e8868fdb8
[ 2896.536693] R13: 000000010008c149 R14: ffff97a1c5cd9af0 R15: ffff97a1c48b4000
[ 2896.543877] FS: 0000000000000000(0000) GS:ffff97a8dfc00000(0000) knlGS:0000000000000000
[ 2896.552017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2896.557810] CR2: 000055dfa60ea260 CR3: 000000014414e002 CR4: 00000000001706f0
[ 2896.564997] Call Trace:
[ 2896.567516]  ceph_trim_snapid_map+0x86/0x1a0 [ceph]
[ 2896.572486]  delayed_work+0x1db/0x220 [ceph]
[ 2896.576843]  process_one_work+0x238/0x550
[ 2896.580924]  ? process_one_work+0x550/0x550
[ 2896.585178]  worker_thread+0x50/0x3b0
[ 2896.588914]  ? process_one_work+0x550/0x550
[ 2896.593171]  kthread+0x117/0x150
[ 2896.596469]  ? kthread_insert_work_sanity_check+0x60/0x60
[ 2896.601943]  ret_from_fork+0x1f/0x30
[ 2896.605592] Modules linked in: ceph libceph fscache netfs veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc xfs libcrc32c rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm configfs ib_core overlay intel_rapl_msr intel_rapl_common ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd joydev mei_me mei ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad kvm_intel kvm irqbypass sch_fq_codel scsi_transport_iscsi lp parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid igb ixgbe i2c_algo_bit crc32_pclmul i2c_i801 dca i2c_smbus nvme ahci ptp libahci lpc_ich nvme_core pps_core mdio
[ 2896.682869] ---[ end trace 4e28fbd490b7bb11 ]---
[ 2898.342946] RIP: 0010:rb_erase+0x1a/0x310
[ 2898.347043] Code: fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8b 4f 10 4c 8b 47 08 48 85 c9 0f 84 f1 01 00 00 4d 85 c0 0f 84 57 02 00 00 <49> 8b 40 10 4c 89 c2 48 85 c0 75 0b e9 af 01 00 00 48 89 c2 48 89
[ 2898.365941] RSP: 0000:ffffa94e8868fda8 EFLAGS: 00010286
[ 2898.371249] RAX: ffff97a18585ae58 RBX: ffff97a18585ae58 RCX: dead000000000122
[ 2898.378470] RDX: 0000000000000001 RSI: ffff97a1c5cd9ae8 RDI: ffff97a18585ae40
[ 2898.385691] RBP: ffff97a1c5cd9000 R08: dead000000000100 R09: 0000000000000002
[ 2898.392910] R10: 0000000000000000 R11: ffff97a3c202b230 R12: ffffa94e8868fdb8
[ 2898.400132] R13: 000000010008c149 R14: ffff97a1c5cd9af0 R15: ffff97a1c48b4000
[ 2898.407346] FS: 0000000000000000(0000) GS:ffff97a8dfc00000(0000) knlGS:0000000000000000
[ 2898.415543] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2898.421368] CR2: 000055dfa60ea260 CR3: 000000014414e002 CR4: 00000000001706f0
[ 2898.428598] note: kworker/0:3[27465] exited with preempt_count 1
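The registers back up the poisoned-list-value reading: the faulting address and R08 hold 0xdead000000000100 and RCX holds 0xdead000000000122, which match the kernel's list poison constants (with the default CONFIG_ILLEGAL_POINTER_VALUE on x86_64). A sketch of those definitions, simplified from include/linux/poison.h:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified from include/linux/poison.h. list_del() writes these into
 * entry->next and entry->prev, so a later dereference of the deleted
 * entry faults on a non-canonical address instead of silently corrupting
 * memory -- exactly the kind of fault seen in the oops above. */
#define POISON_POINTER_DELTA 0xdead000000000000UL  /* CONFIG_ILLEGAL_POINTER_VALUE on x86_64 */
#define LIST_POISON1 ((uintptr_t)(0x100 + POISON_POINTER_DELTA))  /* written to ->next */
#define LIST_POISON2 ((uintptr_t)(0x122 + POISON_POINTER_DELTA))  /* written to ->prev */
```

So rb_erase() was handed a node whose neighboring list_head had already been through list_del(), pointing at a use-after-free.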
Patrick, are there other instances of this failure? It might be good to check the client kernel logs on those and see if they have similar oopses in them.
Updated by Patrick Donnelly about 3 years ago
/ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047654
Updated by Jeff Layton about 3 years ago
Same problem there:
smithi029 login: [ 2106.147389] general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] SMP PTI
[ 2106.157445] CPU: 2 PID: 26286 Comm: kworker/2:0 Tainted: G W 5.12.0-rc4-ceph-g0bcfc80ef004 #1
[ 2106.167332] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
[ 2106.174868] Workqueue: events delayed_work [ceph]
[ 2106.179644] RIP: 0010:rb_erase+0x1a/0x310
[ 2106.183716] Code: fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8b 4f 10 4c 8b 47 08 48 85 c9 0f 84 f1 01 00 00 4d 85 c0 0f 84 57 02 00 00 <49> 8b 40 10 4c 89 c2 48 85 c0 75 0b e9 af 01 00 00 48 89 c2 48 89
[ 2106.202549] RSP: 0000:ffffbc4505983da8 EFLAGS: 00010286
[ 2106.207830] RAX: ffffa0b2cd4782b8 RBX: ffffa0b2cd4782b8 RCX: dead000000000122
[ 2106.215019] RDX: 0000000000000001 RSI: ffffa0b2c1b2eae8 RDI: ffffa0b2cd4782a0
[ 2106.222214] RBP: ffffa0b2c1b2e000 R08: dead000000000100 R09: 0000000000000002
[ 2106.229406] R10: 0000000000000000 R11: ffffa0b307440970 R12: ffffbc4505983db8
[ 2106.236600] R13: 000000010005bd8a R14: ffffa0b2c1b2eaf0 R15: ffffa0b2deda8000
[ 2106.243794] FS: 0000000000000000(0000) GS:ffffa0ba1fc80000(0000) knlGS:0000000000000000
[ 2106.251942] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2106.257742] CR2: 000056522205d030 CR3: 000000014551e005 CR4: 00000000001706e0
[ 2106.264933] Call Trace:
[ 2106.267454]  ceph_trim_snapid_map+0x86/0x1a0 [ceph]
[ 2106.272428]  delayed_work+0x1db/0x220 [ceph]
[ 2106.276794]  process_one_work+0x238/0x550
[ 2106.280885]  ? process_one_work+0x550/0x550
[ 2106.285147]  worker_thread+0x50/0x3b0
[ 2106.288882]  ? process_one_work+0x550/0x550
[ 2106.293137]  kthread+0x117/0x150
[ 2106.296436]  ? kthread_insert_work_sanity_check+0x60/0x60
[ 2106.301912]  ret_from_fork+0x1f/0x30
[ 2106.305566] Modules linked in: ceph libceph fscache netfs veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc xfs libcrc32c rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm configfs ib_core overlay intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd joydev cryptd mei_me ioatdma mei wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad kvm_intel kvm irqbypass sch_fq_codel scsi_transport_iscsi lp parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid igb i2c_algo_bit ixgbe crc32_pclmul i2c_i801 nvme dca ahci i2c_smbus ptp libahci lpc_ich nvme_core pps_core mdio
[ 2106.382825] ---[ end trace 1774ba2170907b2f ]---
One issue I see: ceph_put_snapid_map uses RB_EMPTY_NODE() during cleanup to tell whether the map is still in the tree, but ceph_trim_snapid_map doesn't call RB_CLEAR_NODE() after an rb_erase(). I think this means we can end up putting the entry back onto the LRU after it has already been moved to the (on-stack) to_free list, if the last put lands at an inopportune time. I'm still going over the code to verify that theory, though.
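To illustrate the convention at play, here is a userspace mock (not the real <linux/rbtree.h>): an rb_node is marked "not in any tree" by pointing its parent field at itself, and rb_erase() deliberately does not restore that marker -- only RB_CLEAR_NODE() does. So an erased node can still look "in the tree" to a concurrent RB_EMPTY_NODE() check:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mock of the kernel rbtree node-emptiness convention
 * (illustrative only, not the real <linux/rbtree.h>). */
struct rb_node {
    uintptr_t __rb_parent_color;  /* parent pointer; low bits hold the color */
};

/* Mark a node as "not in any tree": its parent field points at itself. */
static void RB_CLEAR_NODE(struct rb_node *node)
{
    node->__rb_parent_color = (uintptr_t)node;
}

static int RB_EMPTY_NODE(const struct rb_node *node)
{
    return node->__rb_parent_color == (uintptr_t)node;
}

/* rb_erase() unlinks the node (tree surgery elided here) but leaves the
 * node's own parent field stale -- it does NOT call RB_CLEAR_NODE(). */
static void rb_erase(struct rb_node *node)
{
    (void)node;
}
```

Under this convention, a snapid_map that the trim path has already rb_erase()'d and moved to to_free still fails the RB_EMPTY_NODE() test in the put path, which would let the final put re-add it to the LRU.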
Updated by Jeff Layton about 3 years ago
Ok, I think I see the issue. We had a patch that converted some objects, including the snapid_maps, from atomic_t to refcount_t. That patch is broken, though: it replaced an atomic_inc_return() with a refcount_inc()/refcount_read() pair, and that's not atomic enough for this codepath:
https://lore.kernel.org/ceph-devel/20210324102625.112640-1-yejune.deng@gmail.com/
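A minimal userspace sketch of why that conversion is racy, using C11 atomics as stand-ins for the kernel's atomic_t/refcount_t (function names here are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>

/* Original pattern: one indivisible operation. The returned value is the
 * count produced by *this* increment, so comparing it against a specific
 * value (e.g. "did we take the first reference?") is reliable. */
static int inc_return(atomic_int *r)
{
    return atomic_fetch_add(r, 1) + 1;   /* like atomic_inc_return() */
}

/* Broken conversion: two separate operations. Another thread can change
 * the count in the window between them, so the value read may have
 * nothing to do with our own increment -- the caller can then draw the
 * wrong conclusion about the object's state (e.g. whether it should be
 * pulled off, or put back on, the LRU). */
static int inc_then_read(atomic_int *r)
{
    atomic_fetch_add(r, 1);              /* like refcount_inc() */
    return atomic_load(r);               /* like refcount_read() */
}
```

Single-threaded the two behave identically, which is presumably why the regression slipped through review; only under concurrent puts/gets on the snapid_map does the window between the two operations matter.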
I've reverted the patch for now, but I'll leave this open for a bit until we can verify that this is really fixed.
Updated by Patrick Donnelly about 3 years ago
- Status changed from Triaged to Resolved
Considering this resolved -- thanks Jeff.
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #50826: kceph: stock RHEL kernel hangs on snaptests with mon|osd thrashers added