Bug #59127 (closed)
Jobs that normally complete much sooner last almost 12 hours
Description
Some jobs that normally complete much sooner are lasting almost 12 hours. This occurs across branches, so it is unlikely to be a Ceph regression.
In the rados suite, mostly cephadm and thrash-old-clients jobs are affected.
Example: https://pulpito.ceph.com/yuriw-2023-03-17_23:38:21-rados-reef-distro-default-smithi/7212164/
Updated by Laura Flores about 1 year ago
- Related to Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to added
Updated by Ronen Friedman about 1 year ago
A suggestion by Mark Kogan: could it be that the networking configuration was changed such that cluster data is routed through the "external" network instead of the multi-gig one? That would match the scale of the disruption.
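As a quick sanity check for that hypothesis, one could ask the kernel which interface traffic to a cluster peer would egress through; if it resolves to the slow "external" interface instead of the multi-gig one, the routing theory holds. This is a minimal sketch, not part of teuthology; the peer IP below is a placeholder.

```python
#!/usr/bin/env python3
# Sketch: report which interface the kernel routes traffic to a peer through.
# Shells out to the standard `ip route get` command.
import subprocess

def egress_interface(peer_ip: str) -> str:
    """Return the interface `ip route get` picks for peer_ip."""
    out = subprocess.run(
        ["ip", "route", "get", peer_ip],
        capture_output=True, text=True, check=True,
    ).stdout
    # Output looks like: "172.21.15.1 dev eth1 src 172.21.15.2 uid 1000 ..."
    tokens = out.split()
    return tokens[tokens.index("dev") + 1]

if __name__ == "__main__":
    # Placeholder peer; in practice use another test node's cluster IP.
    print(egress_interface("172.21.15.1"))
```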
Updated by Laura Flores about 1 year ago
- Related to Bug #59123: Timeout opening channel added
Updated by Kamoltat (Junior) Sirivadhna about 1 year ago
Analysis of slowness in `tasks/progress`.
Good run (0:26:53):
/a/yuriw-2023-03-21_00:35:27-rados-main-distro-default-smithi/7214888/
Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-zlib} supported-random-distro$/{centos_8} tasks/progress}
Bad run (8:33:56):
/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213210/
Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-lz4} supported-random-distro$/{centos_8} tasks/progress}
Analysis: https://docs.google.com/document/d/1KgMGNAK0kSWxyxC5axd2qTsLdZJJVgc-swSKgVdTvAU/edit#heading=h.umis5id5f357
Summary:
From comparing the logs of the two runs, it is safe to say that in the bad run everything in Ceph is simply slower, from starting up the OSDs to shutting down at the end of the run. tasks/progress contains 5 tests; in the bad run each test takes around 1 hour to finish, while in the good run each takes only about 5 minutes.
According to the logs, the bad run is roughly 7 times slower starting all the OSDs and about 12 times slower performing the operations in each test. Logging by itself takes 6 times longer (6 seconds to log a config, where the good run takes under 1 second).
I suspect this is a network issue.
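To make that kind of comparison repeatable, one could measure the elapsed time between two log markers in the good and bad runs. This is a minimal sketch, not the script used for the analysis above; the timestamp regex, file paths, and marker strings are assumptions and would need to match the actual teuthology log format.

```python
#!/usr/bin/env python3
# Sketch: compare elapsed time between two log markers in two runs.
# Assumes ISO-style timestamps like "2023-03-18T01:02:03.456" at the
# start of each line; adjust TS_RE to the actual log layout.
import re
from datetime import datetime

TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)")

def first_timestamp(path: str, marker: str) -> datetime:
    """Return the timestamp of the first line containing `marker`."""
    with open(path, errors="replace") as f:
        for line in f:
            if marker in line:
                m = TS_RE.search(line)
                if m:
                    return datetime.fromisoformat(m.group(1))
    raise ValueError(f"{marker!r} not found in {path}")

def elapsed(path: str, start_marker: str, end_marker: str) -> float:
    """Seconds between the first occurrences of the two markers."""
    return (first_timestamp(path, end_marker)
            - first_timestamp(path, start_marker)).total_seconds()

if __name__ == "__main__":
    # Placeholder paths and markers; substitute real teuthology.log paths
    # and log lines that bracket the step being timed.
    for run in ("good_run/teuthology.log", "bad_run/teuthology.log"):
        print(run, elapsed(run, "Starting osd.0", "progress tests complete"))
```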
Updated by Laura Flores about 1 year ago
- Related to Bug #56393: failed to complete snap trimming before timeout added
Updated by Laura Flores about 1 year ago
- Related to Bug #59282: OSError: [Errno 107] Transport endpoint is not connected added
Updated by Laura Flores about 1 year ago
- Related to Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs added
Updated by Laura Flores about 1 year ago
- Related to Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours added
Updated by Zack Cerza 9 months ago
- Status changed from New to Can't reproduce
If this pops up again, we can reopen it and take advantage of Junior's investigation.