Bug #59127

closed

Jobs that normally complete much sooner last almost 12 hours

Added by Laura Flores about 1 year ago. Updated 9 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Some jobs that normally complete much sooner are lasting almost 12 hours. This occurs across branches, so it's unlikely to be a Ceph regression.

In the rados suite, mostly cephadm and thrash-old-clients jobs are affected.

Example: https://pulpito.ceph.com/yuriw-2023-03-17_23:38:21-rados-reef-distro-default-smithi/7212164/
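
For triage, one quick way to surface these outliers is to compare each job's runtime against the run's median. The sketch below is illustrative only: the job IDs and durations are made up, and in practice the numbers would come from the pulpito/paddles data for a run like the one linked above.

```python
# Illustrative sketch: flag jobs whose runtime is far above the run's median.
# Job IDs and durations below are made up; real values would come from the
# pulpito/paddles data for the run linked above.
from statistics import median


def flag_slow_jobs(durations, factor=5):
    """Return jobs whose duration exceeds `factor` times the median duration."""
    typical = median(durations.values())
    return {job: d for job, d in durations.items() if d > factor * typical}


if __name__ == "__main__":
    sample = {                             # seconds; purely illustrative
        "7212164": 11 * 3600 + 50 * 60,    # the ~12 h outlier in this report
        "7212165": 2 * 3600,
        "7212166": 1 * 3600 + 40 * 60,
    }
    for job, secs in flag_slow_jobs(sample).items():
        print(f"job {job}: {secs / 3600:.1f} h, well above the run's median")
```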


Related issues 6 (5 open, 1 closed)

Related to teuthology - Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to (Closed)

Related to Infrastructure - Bug #59123: Timeout opening channel (New)

Related to RADOS - Bug #56393: failed to complete snap trimming before timeout (New) - Matan Breizman

Related to Infrastructure - Bug #59282: OSError: [Errno 107] Transport endpoint is not connected (New)

Related to RADOS - Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs (New)

Related to RADOS - Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours (New)
#1

Updated by Laura Flores about 1 year ago

  • Related to Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to added
#2

Updated by Ronen Friedman about 1 year ago

A suggestion by Mark Kogan: could it be that the networking configuration was changed such that cluster data is routed through the "external" network instead of the multi-gigabit one? That would match the scale of the disruption.
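
One way to check that theory would be to ask the kernel, from a test node, which interface it would use to reach a peer node. The snippet below is a diagnostic sketch rather than part of the report: the peer hostname is a placeholder, and it simply shells out to `ip route get`.

```python
# Diagnostic sketch for the routing theory above: ask the kernel which
# interface it would use to reach a peer test node. The hostname below is a
# placeholder; run this from one node against another and check whether the
# reported device is the "external" interface or the multi-gigabit one.
import socket
import subprocess

PEER = "smithi001.front.sepia.ceph.com"  # placeholder peer node


def route_to(host):
    addr = socket.gethostbyname(host)
    # `ip route get` reports the route (and device) the kernel would pick.
    result = subprocess.run(["ip", "route", "get", addr],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(route_to(PEER))
```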

#3

Updated by Laura Flores about 1 year ago

  • Related to Bug #59123: Timeout opening channel added
#4

Updated by Kamoltat (Junior) Sirivadhna about 1 year ago

Analysis of slowness in `task/progress`.

Good run (0:26:53):

/a/yuriw-2023-03-21_00:35:27-rados-main-distro-default-smithi/7214888/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-zlib} supported-random-distro$/{centos_8} tasks/progress}

Bad run (8:33:56):

/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213210/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-lz4} supported-random-distro$/{centos_8} tasks/progress}

Analysis: https://docs.google.com/document/d/1KgMGNAK0kSWxyxC5axd2qTsLdZJJVgc-swSKgVdTvAU/edit#heading=h.umis5id5f357

Summary:

From comparing the logs between the two runs, it is safe to say that in the bad run everything in Ceph is simply slower, from starting up the OSDs to shutting down at the end of the run. task/progress contains 5 tests; in the bad run each test takes around 1 hour to finish, while in the good run each takes only about 5 minutes.

According to the logs, the bad run is roughly 7 times slower starting all the OSDs and 12 times slower performing the operations in each test. Logging by itself takes 6 times longer (6 seconds to log a config in the bad run versus under 1 second in the good run).

I suspect this is a network issue.
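
For reference, the comparison described above can be scripted roughly as follows: take the timestamp of a start marker and an end marker in each teuthology.log and diff them. This is only a sketch; the timestamp regex, file paths, and marker strings are assumptions about the log layout, not quotes from these runs.

```python
# Rough sketch of the good-vs-bad comparison: measure elapsed time between two
# log markers in each teuthology.log. The timestamp format and the marker
# strings are assumptions about the log layout, not exact quotes from these runs.
import re
from datetime import datetime

TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def first_timestamp(path, marker):
    """Return the timestamp of the first log line containing `marker`."""
    with open(path) as log:
        for line in log:
            if marker in line:
                match = TS.match(line)
                if match:
                    return datetime.fromisoformat(match.group(1))
    return None


def phase_duration(path, start_marker, end_marker):
    start = first_timestamp(path, start_marker)
    end = first_timestamp(path, end_marker)
    return end - start if start and end else None


if __name__ == "__main__":
    # Placeholder paths and markers; adjust to the actual log contents.
    for label, path in (("good", "good/teuthology.log"),
                        ("bad", "bad/teuthology.log")):
        print(label, phase_duration(path, "Running task ceph", "healthy"))
```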

#5

Updated by Laura Flores about 1 year ago

  • Related to Bug #56393: failed to complete snap trimming before timeout added
#6

Updated by Laura Flores about 1 year ago

  • Related to Bug #59282: OSError: [Errno 107] Transport endpoint is not connected added
#7

Updated by Laura Flores about 1 year ago

  • Related to Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs added
#8

Updated by Laura Flores about 1 year ago

  • Related to Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours added
#9

Updated by Zack Cerza 9 months ago

  • Status changed from New to Can't reproduce

If this pops up again, we can reopen and take advantage of Junior's investigation.
