Bug #56393

open

failed to complete snap trimming before timeout

Added by Laura Flores almost 2 years ago. Updated 1 day ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/osd-dispatch-delay objectstore/filestore-xfs rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/morepggrow thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}

/a/yuriw-2022-06-21_16:28:27-rados-wip-yuri4-testing-2022-06-21-0704-pacific-distro-default-smithi/6889462

Last pg map before Assertion:

{
  "pgs_by_state": [
    {
      "state_name": "active+clean",
      "count": 537
    },
    {
      "state_name": "recovering+undersized+remapped+peered",
      "count": 5
    },
    {
      "state_name": "active+clean+snaptrim",
      "count": 2
    },
    {
      "state_name": "undersized+remapped+backfilling+peered",
      "count": 1
    }
  ],
  "num_pgs": 545,
  "num_pools": 3,
  "num_objects": 200,
  "data_bytes": 419090944,
  "bytes_used": 11029893120,
  "bytes_avail": 1140348088320,
  "bytes_total": 1151377981440,
  "inactive_pgs_ratio": 0.011009174399077892,
  "misplaced_objects": 40,
  "misplaced_total": 1200,
  "misplaced_ratio": 0.03333333333333333
}
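The two `active+clean+snaptrim` PGs in the map above are what the test is waiting on. A minimal sketch (hypothetical helper name, assuming the JSON shape shown) of checking a pg summary for pending snap trimming:

```python
import json

def snaptrim_pending(pg_summary_json):
    """Return the number of PGs still in any snaptrim-related state."""
    summary = json.loads(pg_summary_json)
    return sum(
        entry["count"]
        for entry in summary["pgs_by_state"]
        if "snaptrim" in entry["state_name"]
    )
```

With the pg map above this would report 2 PGs still trimming.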

2022-06-21T17:59:05.224 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
  File "/home/teuthworker/src/github.com_ceph_ceph-c_6ba5460a27104908466809fcb14c544d5dfc6522/qa/tasks/rados.py", line 268, in thread
    manager.wait_snap_trimming_complete(pool);
  File "/home/teuthworker/src/github.com_ceph_ceph-c_6ba5460a27104908466809fcb14c544d5dfc6522/qa/tasks/ceph_manager.py", line 2444, in wait_snap_trimming_complete
    'failed to complete snap trimming before timeout'
AssertionError: failed to complete snap trimming before timeout
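The assertion comes from a poll-until-empty loop with a hard deadline. A toy sketch of that pattern (hypothetical names and timeout value; the real code is `wait_snap_trimming_complete` in qa/tasks/ceph_manager.py):

```python
import time

FATAL_TIMEOUT = 15 * 60  # hypothetical value; the real one lives in ceph_manager.py

def wait_snap_trimming_complete(still_trimming, poll_interval=15):
    """Poll until no PGs report snap trimming, or fail hard after FATAL_TIMEOUT.

    `still_trimming` is a hypothetical callable returning the number of PGs
    still in a snaptrim state.
    """
    start = time.time()
    while still_trimming() > 0:
        now = time.time()
        assert (now - start) < FATAL_TIMEOUT, \
            'failed to complete snap trimming before timeout'
        time.sleep(poll_interval)
```

If the trim queue never drains (as in the snaptrim_error case discussed later in this thread), the assertion fires with exactly the message in the traceback.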

Related issues (2): 1 open, 1 closed

Related to RADOS - Bug #19783: upgrade tests failing with "AssertionError: failed to complete snap trimming before timeout" (New, 04/26/2017)

Related to teuthology - Bug #59127: Job that normally complete much sooner last almost 12 hours (Can't reproduce)
Actions #1

Updated by Laura Flores almost 2 years ago

  • Related to Bug #19783: upgrade tests failing with "AssertionError: failed to complete snap trimming before timeout" added
Actions #2

Updated by Radoslaw Zarzynski almost 2 years ago

  • Assignee set to Ronen Friedman

Could it be scrub related?

Actions #3

Updated by Laura Flores about 1 year ago

  • Translation missing: en.field_tag_list set to test-failure

/a/yuriw-2023-03-14_21:33:13-upgrade:octopus-x-quincy-release-distro-default-smithi/7208213

2023-03-15T15:50:11.256 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
  File "/home/teuthworker/src/github.com_ceph_ceph-c_714f8ff94ab1a8a5b10ea54247535614e53b7234/qa/tasks/rados.py", line 277, in thread
    manager.wait_snap_trimming_complete(pool);
  File "/home/teuthworker/src/github.com_ceph_ceph-c_714f8ff94ab1a8a5b10ea54247535614e53b7234/qa/tasks/ceph_manager.py", line 2474, in wait_snap_trimming_complete
    assert (now - start) < FATAL_TIMEOUT, \
AssertionError: failed to complete snap trimming before timeout

Actions #4

Updated by Laura Flores about 1 year ago

  • Backport changed from pacific to pacific,quincy
Actions #5

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-15_21:14:59-upgrade:pacific-x-quincy-release-distro-default-smithi/7209137

Actions #6

Updated by Laura Flores about 1 year ago

https://pulpito.ceph.com/yuriw-2023-03-15_23:30:58-rados-wip-yuri4-testing-2023-03-15-1418-distro-default-smithi/7209724/

Taking a closer look at these failures, the recent instances look related to https://tracker.ceph.com/issues/59127. The job runs for 10 hours before the test fails from the timeout, while upgrade/parallel tests normally take about an hour and a half to complete.

See https://pulpito.ceph.com/yuriw-2023-01-20_17:26:16-rados-main-distro-default-smithi/7131367/ for reference.

Actions #7

Updated by Laura Flores about 1 year ago

  • Related to Bug #59127: Job that normally complete much sooner last almost 12 hours added
Actions #8

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7207193

Actions #9

Updated by Laura Flores about 1 year ago

  • Subject changed from thrash-erasure-code-big: failed to complete snap trimming before timeout to failed to complete snap trimming before timeout
Actions #10

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7207042

Actions #11

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231364

From rados/thrash-erasure-code-overwrites

Actions #12

Updated by Radoslaw Zarzynski about 1 year ago

Bump up.

Actions #13

Updated by Radoslaw Zarzynski 12 months ago

Bump up.

Actions #14

Updated by Sridhar Seshasayee about 1 month ago

/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587430
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587823

Actions #15

Updated by Aishwarya Mathuria about 1 month ago

/a/yuriw-2024-03-13_19:25:03-rados-wip-yuri6-testing-2024-03-12-0858-distro-default-smithi/7597884
/a/yuriw-2024-03-13_19:25:03-rados-wip-yuri6-testing-2024-03-12-0858-distro-default-smithi/7598303

Actions #16

Updated by Aishwarya Mathuria about 1 month ago

/a/yuriw-2024-03-15_19:59:43-rados-wip-yuri6-testing-2024-03-15-0709-distro-default-smithi/7603381/

ctive+clean+snaptrim trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] get_object_context: obc(5:cad852a3:::smithi17965979-25:head rwstate(none n=0 w=0)) oi: 5:cad852a3:::smithi17965979-25:head(409'6476 client.15042.0:4155 dirty|data_digest s 3899392 uv 6046 dd 2ee4cdec alloc_hint [0 0 0]) exists: 1 ssc(5:cad852a3:::smithi17965979-25:snapdir snapset: 100=[]:{f0=[ee,ed],100=[100,fe]} ref: 3 registered: 1 exists: 1)
2024-03-16T14:37:43.062+0000 7fdac05d2700 -1 log_channel(cluster) log [ERR] : No clone_snaps in snapset 100=[]:{f0=[ee,ed],100=[100,fe]} for object 5:cad852a3:::smithi17965979-25:ec
2024-03-16T14:37:43.062+0000 7fdac05d2700 10 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] SnapTrimmer state<Trimming/AwaitAsyncWork>: Snaptrim error=-2
2024-03-16T14:37:43.062+0000 7fdac05d2700 20 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] exit Trimming/AwaitAsyncWork
2024-03-16T14:37:43.062+0000 7fdac05d2700 20 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] exit Trimming
2024-03-16T14:37:43.062+0000 7fdac05d2700 20 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] PeeringState::prepare_stats_for_publish reporting purged_snaps []
2024-03-16T14:37:43.062+0000 7fdac05d2700 15 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] PeeringState::prepare_stats_for_publish publish_stats_to_osd 409:5999
2024-03-16T14:37:43.062+0000 7fdac05d2700 20 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] enter NotTrimming
2024-03-16T14:37:43.062+0000 7fdac05d2700 10 osd.0 pg_epoch: 409 pg[5.3s0( v 409'6478 (398'1300,409'6478] local-lis/les=397/398 n=10 ec=72/72 lis/c=397/397 les/c/f=398/398/0 sis=397) [0,3,6]p0(0) r=0 lpr=397 crt=409'6478 lcod 409'6476 mlcod 409'6476 active+clean+snaptrim_error trimq=[e9~2,ed~2,f2~2,f5~9,ff~1]] snap_trimmer complete
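In the log above, `get_object_context` finds a snapset with no `clone_snaps` entry for the object, the trimmer's `Trimming/AwaitAsyncWork` state fails with error=-2 (ENOENT), and the PG falls back to `NotTrimming` with `snaptrim_error` set while `trimq` is left unchanged, so the queue never drains and the teuthology wait eventually asserts. A toy Python sketch of that transition (hypothetical names; the real code is the C++ SnapTrimmer state machine in PrimaryLogPG):

```python
import errno

class PGStub:
    """Toy model of the snaptrim error transition seen in the log."""
    def __init__(self, trimq):
        self.trimq = list(trimq)   # snap intervals still to trim
        self.state = "NotTrimming"
        self.snaptrim_error = False

    def trim_next(self, clone_snaps_for):
        """Try to trim the head of trimq; clone_snaps_for is a hypothetical
        lookup returning the clones for a snap, or None if missing."""
        self.state = "Trimming/AwaitAsyncWork"
        snap = self.trimq[0]
        if clone_snaps_for(snap) is None:
            # "No clone_snaps in snapset ... for object": Snaptrim error=-2
            self.snaptrim_error = True
            self.state = "NotTrimming"
            return -errno.ENOENT
        self.trimq.pop(0)          # success: interval consumed
        self.state = "NotTrimming"
        return 0
```

The key point the sketch mirrors: on error the interval stays at the head of `trimq`, so every retry hits the same missing-clone_snaps object and trimming can never complete.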

Actions #17

Updated by Radoslaw Zarzynski about 1 month ago

  • Assignee changed from Ronen Friedman to Matan Breizman

Hi Matan,
would you mind taking a look? Not a high priority.

Actions #18

Updated by Matan Breizman 25 days ago

Radoslaw Zarzynski wrote:

Hi Matan,
would you mind taking a look? Not a high priority.

I suspect that the new occurrences of this tracker are caused by https://tracker.ceph.com/issues/64917 (awaiting merge).

Note:
The log line "get_next_objects_to_trim::New Clone-Objects were added to Snap ea after trimming was started" is printed over 1k times, which is also suspicious.
This can happen when we have key corruption ("// On removal tolerate missing key corruption").
The tracker attached here is a SnapMapper-related key-corruption issue.

We should probably wait until the fix for https://tracker.ceph.com/issues/64917 is merged and see whether this issue appears again.
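The repetition count can be checked directly against an OSD log; a small sketch (hypothetical log path, substring taken from the message above):

```python
def count_log_occurrences(path, needle):
    """Count lines in a log file containing the given message fragment."""
    with open(path) as f:
        return sum(1 for line in f if needle in line)

# e.g. count_log_occurrences("osd.0.log",
#     "New Clone-Objects were added to Snap")
```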

Actions #19

Updated by Radoslaw Zarzynski 24 days ago

I will merge the PR mentioned by Matan above.

Actions #20

Updated by Laura Flores 17 days ago

/a/yuriw-2024-03-24_22:19:24-rados-wip-yuri10-testing-2024-03-24-1159-distro-default-smithi/7620687

Actions #21

Updated by Radoslaw Zarzynski 10 days ago

The PR was merged on March 25th; the run is from the 24th, so the hope has not died (yet) ;-).

Actions #22

Updated by Aishwarya Mathuria 1 day ago

/a/yuriw-2024-04-09_14:35:50-rados-wip-yuri5-testing-2024-03-21-0833-distro-default-smithi/7648606
/a/yuriw-2024-04-09_14:35:50-rados-wip-yuri5-testing-2024-03-21-0833-distro-default-smithi/7648759
