Bug #59127

closed

Jobs that normally complete much sooner last almost 12 hours

Added by Laura Flores about 1 year ago. Updated 9 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Some jobs that normally complete much sooner are lasting almost 12 hours. This occurs across branches, so it's unlikely to be a Ceph regression.

In the rados suite, mostly cephadm and thrash-old-clients jobs are affected.

Example: https://pulpito.ceph.com/yuriw-2023-03-17_23:38:21-rados-reef-distro-default-smithi/7212164/
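
For triage, one quick way to surface these outliers is to compare each job's runtime against the run's median. The sketch below is illustrative only: the job IDs and durations are made up, and in practice the numbers would come from the pulpito/paddles data for a run like the one linked above.

```python
# Illustrative sketch: flag jobs whose runtime is far above the run's median.
# Job IDs and durations below are made up; real values would come from the
# pulpito/paddles data for the run linked above.
from statistics import median


def flag_slow_jobs(durations, factor=5):
    """Return jobs whose duration exceeds `factor` times the median duration."""
    typical = median(durations.values())
    return {job: d for job, d in durations.items() if d > factor * typical}


if __name__ == "__main__":
    sample = {                             # seconds; purely illustrative
        "7212164": 11 * 3600 + 50 * 60,    # the ~12 h outlier in this report
        "7212165": 2 * 3600,
        "7212166": 1 * 3600 + 40 * 60,
    }
    for job, secs in flag_slow_jobs(sample).items():
        print(f"job {job}: {secs / 3600:.1f} h, well above the run's median")
```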


Related issues 6 (5 open, 1 closed)

Related to teuthology - Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to (Closed)

Related to Infrastructure - Bug #59123: Timeout opening channel (New)

Related to RADOS - Bug #56393: failed to complete snap trimming before timeout (New) - Matan Breizman

Related to Infrastructure - Bug #59282: OSError: [Errno 107] Transport endpoint is not connected (New)

Related to RADOS - Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs (New)

Related to RADOS - Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours (New)
#1

Updated by Laura Flores about 1 year ago

  • Related to Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to added
#2

Updated by Ronen Friedman about 1 year ago

A suggestion by Mark Kogan: could it be that the networking configuration was changed such that cluster data is routed through the "external" network instead of the multi-gigabit one? That would match the scale of the disruption.
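
One way to check that theory would be to ask the kernel, from a test node, which interface it would use to reach a peer node. The snippet below is a diagnostic sketch rather than part of the report: the peer hostname is a placeholder, and it simply shells out to `ip route get`.

```python
# Diagnostic sketch for the routing theory above: ask the kernel which
# interface it would use to reach a peer test node. The hostname below is a
# placeholder; run this from one node against another and check whether the
# reported device is the "external" interface or the multi-gigabit one.
import socket
import subprocess

PEER = "smithi001.front.sepia.ceph.com"  # placeholder peer node


def route_to(host):
    addr = socket.gethostbyname(host)
    # `ip route get` reports the route (and device) the kernel would pick.
    result = subprocess.run(["ip", "route", "get", addr],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(route_to(PEER))
```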

#3

Updated by Laura Flores about 1 year ago

  • Related to Bug #59123: Timeout opening channel added
#4

Updated by Kamoltat (Junior) Sirivadhna about 1 year ago

Analysis of slowness in `task/progress`.

Good run (0:26:53):

/a/yuriw-2023-03-21_00:35:27-rados-main-distro-default-smithi/7214888/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-zlib} supported-random-distro$/{centos_8} tasks/progress}

Bad run (8:33:56):

/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213210/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-lz4} supported-random-distro$/{centos_8} tasks/progress}

Analysis: https://docs.google.com/document/d/1KgMGNAK0kSWxyxC5axd2qTsLdZJJVgc-swSKgVdTvAU/edit#heading=h.umis5id5f357

Summary:

From comparing the logs between the two runs, it is safe to say that in the bad run everything in Ceph is simply slower, from starting up the OSDs to shutting down at the end of the run. task/progress contains 5 tests; in the bad run each test takes around 1 hour to finish, while in the good run each takes only about 5 minutes.

According to the logs, the bad run is roughly 7 times slower starting all the OSDs and 12 times slower performing the operations in each test. Logging by itself takes 6 times longer (6 seconds to log a config in the bad run versus under 1 second in the good run).

I suspect this is a network issue.
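
For reference, the comparison described above can be scripted roughly as follows: take the timestamp of a start marker and an end marker in each teuthology.log and diff them. This is only a sketch; the timestamp regex, file paths, and marker strings are assumptions about the log layout, not quotes from these runs.

```python
# Rough sketch of the good-vs-bad comparison: measure elapsed time between two
# log markers in each teuthology.log. The timestamp format and the marker
# strings are assumptions about the log layout, not exact quotes from these runs.
import re
from datetime import datetime

TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def first_timestamp(path, marker):
    """Return the timestamp of the first log line containing `marker`."""
    with open(path) as log:
        for line in log:
            if marker in line:
                match = TS.match(line)
                if match:
                    return datetime.fromisoformat(match.group(1))
    return None


def phase_duration(path, start_marker, end_marker):
    start = first_timestamp(path, start_marker)
    end = first_timestamp(path, end_marker)
    return end - start if start and end else None


if __name__ == "__main__":
    # Placeholder paths and markers; adjust to the actual log contents.
    for label, path in (("good", "good/teuthology.log"),
                        ("bad", "bad/teuthology.log")):
        print(label, phase_duration(path, "Running task ceph", "healthy"))
```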

#5

Updated by Laura Flores about 1 year ago

  • Related to Bug #56393: failed to complete snap trimming before timeout added
#6

Updated by Laura Flores about 1 year ago

  • Related to Bug #59282: OSError: [Errno 107] Transport endpoint is not connected added
#7

Updated by Laura Flores about 1 year ago

  • Related to Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs added
#8

Updated by Laura Flores about 1 year ago

  • Related to Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours added
#9

Updated by Zack Cerza 9 months ago

  • Status changed from New to Can't reproduce

If this pops up again, we can reopen and take advantage of Junior's investigation.
