Bug #49259 (closed)

test_rados_api tests timeout with cephadm (plus extremely large OSD logs)

Added by Sebastian Wagner about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: Brad Hubbard
% Done: 0%
Regression: No
Severity: 3 - minor
ceph-qa-suite: rados

Description

swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874513

rados:cephadm/with-work/{distro/ubuntu_20.04 fixed-2 mode/packaged mon_election/classic msgr/async start tasks/rados_api_tests}

The tests actually complete:

grep -v -e 'slow request osd_op' -e 'heartbeat_check' -e get_health_metrics -e 'have slow ops' -e ' pgmap v' -e 'journalctl@ceph.mon' -e 'tcmu-runner' teuthology.log | less
2021-02-11T13:11:48.564 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [       OK ] LibRadosAioEC.ExecuteClassPP (7016 ms)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [ RUN      ] LibRadosAioEC.OmapPP
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [       OK ] LibRadosAioEC.OmapPP (7016 ms)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [ RUN      ] LibRadosAioEC.MultiWritePP
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [       OK ] LibRadosAioEC.MultiWritePP (7027 ms)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [----------] 20 tests from LibRadosAioEC (139516 ms total)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp:
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [----------] Global test environment tear-down
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [==========] 53 tests from 4 test suites ran. (294146 ms total)
2021-02-11T13:11:48.566 INFO:tasks.workunit.client.0.smithi091.stdout:               api_aio_pp: [  PASSED  ] 53 tests.
2021-02-11T16:04:13.814 INFO:journalctl@ceph.prometheus.a.smithi110.stdout:Feb 11 16:04:13 smithi110 bash[29017]: level=info ts=2021-02-11T16:04:13.510Z caller=compact.go:495 component=tsdb msg="write block" mint=1613048653399 maxt=1613052000000 ulid=01EY8XBWFN9THM46AQ56K0YJ9V duration=80.839364ms
2021-02-11T16:04:13.814 INFO:journalctl@ceph.prometheus.a.smithi110.stdout:Feb 11 16:04:13 smithi110 bash[29017]: level=info ts=2021-02-11T16:04:13.523Z caller=head.go:662 component=tsdb msg="Head GC completed" duration=2.564364ms
2021-02-11T16:06:54.424 INFO:tasks.workunit.client.0.smithi091.stderr:++ cleanup
2021-02-11T16:06:54.424 INFO:tasks.workunit.client.0.smithi091.stderr:++ pkill -P 35759
2021-02-11T16:06:54.462 INFO:tasks.workunit.client.0.smithi091.stderr:++ true
2021-02-11T16:06:54.494 DEBUG:teuthology.orchestra.run:got remote process result: 124
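
Exit status 124 is what timeout(1) returns when it kills the command it supervises, so this looks like the workunit being killed at its time limit rather than failing on its own. For illustration:

$ timeout 1 sleep 5; echo $?   # timeout kills sleep and reports 124
124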

But there are lots of slow ops:

2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.c.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[17322]: cluster 2021-02-11T16:06:53.847115+0000 osd.0 (osd.0) 72781 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=7 ondisk+retry+write+known_if_redirected e350) initiated 2021-02-11T15:07:05.776639+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173794+0000 osd.2 (osd.2) 18309 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=2 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:07:35.487913+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173821+0000 osd.2 (osd.2) 18310 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:22:35.490704+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173861+0000 osd.2 (osd.2) 18311 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=1 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T14:52:35.486234+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173928+0000 osd.2 (osd.2) 18312 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=4 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:37:35.490076+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173954+0000 osd.2 (osd.2) 18313 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:52:35.490534+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173977+0000 osd.2 (osd.2) 18314 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] ondisk+read+known_if_redirected e350) initiated 2021-02-11T14:36:22.874727+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.846990+0000 osd.0 (osd.0) 72771 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e341) initiated 2021-02-11T13:52:05.770047+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.847012+0000 osd.0 (osd.0) 72772 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected e341) initiated 2021-02-11T13:37:05.766073+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.847025+0000 osd.0 (osd.0) 72773 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=4 ondisk+retry+write+known_if_redirected e341) initiated 2021-02-11T14:22:05.770189+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.847036+0000 osd.0 (osd.0) 72774 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RE
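
A rough way to tally which OSDs are emitting these warnings (a sketch run against the teuthology.log quoted above; each line matches 'osd.N' twice, which doesn't matter for a relative count):

$ grep 'slow request osd_op' teuthology.log | grep -o 'osd\.[0-9]*' | sort | uniq -c | sort -rn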

lots of uncommon health warnings:

2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001835+0000 mon.a (mon.0) 3849 : cluster [WRN] [WRN] CACHE_POOL_NO_HIT_SET: 1 cache pools are missing hit_sets
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001854+0000 mon.a (mon.0) 3850 : cluster [WRN]     pool 'test-rados-api-smithi091-35848-11' with cache_mode writeback needs hit_set_type to be set but it is not
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001874+0000 mon.a (mon.0) 3851 : cluster [WRN] [WRN] POOL_APP_NOT_ENABLED: 1 pool(s) do not have an application enabled
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001888+0000 mon.a (mon.0) 3852 : cluster [WRN]     application not enabled on pool 'ceph_test_rados_api_asio'
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001901+0000 mon.a (mon.0) 3853 : cluster [WRN]     use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications

Related issues (2: 0 open, 2 closed)

Related to RADOS - Bug #39039: mon connection reset, command not resent (Duplicate; assignee Sage Weil)

Related to Messengers - Bug #49237: segv in AsyncConnection::_stop() (Resolved)
Actions #1

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
  • ceph-qa-suite rados added
Actions #2

Updated by Neha Ojha about 3 years ago

  • Subject changed from slow ops on cephadm/test_rados_api tests to test_rados_api tests timeout with cephadm
  • Priority changed from Normal to Urgent

/a/swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874516/

Seems like this is where it is stuck:


Running main() from gmock_main.cc
[==========] Running 16 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from LibRadosWatchNotifyECPP
[ RUN      ] LibRadosWatchNotifyECPP.WatchNotify
notify
[       OK ] LibRadosWatchNotifyECPP.WatchNotify (2197 ms)
[ RUN      ] LibRadosWatchNotifyECPP.WatchNotifyTimeout
[       OK ] LibRadosWatchNotifyECPP.WatchNotifyTimeout (11 ms)
[----------] 2 tests from LibRadosWatchNotifyECPP (2208 ms total)

[----------] 14 tests from LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0
notify
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0 (7 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1
notify
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1 (3751 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/0
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/0 (636 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/1
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/1 (15 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/0
handle_notify cookie 93945391327952 notify_id 347892350976 notifier_gid 15045
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/0 (9 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/1
handle_notify cookie 93945391346960 notify_id 347892350976 notifier_gid 15045
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/1 (30 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/0
handle_notify cookie 93945391347984 notify_id 347892350977 notifier_gid 15045
[       OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/0 (34 ms)
[ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
Actions #3

Updated by Neha Ojha about 3 years ago

/a/swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874516 shows the same problem as https://tracker.ceph.com/issues/49259#note-2

Actions #4

Updated by Brad Hubbard about 3 years ago

From swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874513 there's an op that has been delayed for more than an hour. This does not look like a problem with the test itself; it goes deeper than that:

2021-02-11T16:07:16.719+0000 7fb9cabc3700 0 log_channel(cluster) log [WRN] : slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=7 ondisk+retry+write+known_if_redirected e350) initiated 2021-02-11T15:07:05.776639+0000 currently delayed

Trying to trawl through the massive logs to get more information.
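
For example, a sketch like the following pulls out just the history of the stuck op quoted above, without paging through everything (adjust the path and compression suffix to whatever the archive actually contains; zgrep handles both plain and gzipped files):

$ zgrep -a 'client.24646.0:2078' ceph-osd.0.log | head -50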

Actions #5

Updated by Brad Hubbard about 3 years ago

The stuck op is a copy of 16:8744f7fc:test-rados-api-smithi091-35842-17::big:head to 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head (a copy invoked by one of the API tests). In order to do that we need to read 16:8744f7fc:test-rados-api-smithi091-35842-17::big:head, but we seem to encounter some sort of error while doing so.

handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-02-11T13:07:40.778+0000 7fb9d144d700  5 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending message m=0x55cbe43d3400 seq=1 osd_op(osd.0.8:1 16.1 16:8744f7fc:test-rados-api-smithi091-35842-17::big:head [copy-get max 8388608 in=29b] snapc 0=[] ondisk+read+known_if_redirected e93) v8
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending m=0x55cbe43d3400 seq=1 src=osd.0 off=0
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 364 remaining bytes 0
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending 0x55cbe43d3400 done.
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2021-02-11T13:07:40.778+0000 7fb9d144d700  1 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 103
2021-02-11T13:07:40.778+0000 7fb9d144d700  1 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
2021-02-11T13:07:40.778+0000 7fb9d144d700  1 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0)._fault
2021-02-11T13:07:40.778+0000 7fb9d144d700  2 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0)._fault on lossy channel, failing
2021-02-11T13:07:40.778+0000 7fb9d144d700  1 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-02-11T13:07:40.778+0000 7fb9d144d700  5 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).reset_recv_state
2021-02-11T13:07:40.778+0000 7fb9d144d700  5 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).reset_security
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 unknown :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).discard_out_queue started
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 EpollDriver.del_event del event fd=103 cur_mask=1 delmask=3 to 8
2021-02-11T13:07:40.778+0000 7fb9b7b26700  1 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 unknown :-1 s=STATE_CLOSED l=1).mark_down
2021-02-11T13:07:40.778+0000 7fb9b7b26700  1 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 unknown :-1 s=CLOSED pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).stop

These are the sizes of the osd log files.

osd.0 5202195739
osd.1 160654481078
osd.2 4587026710
osd.3 289661288927
osd.4 4027220224
osd.5 2770255457
osd.6 2016671190
osd.7 1333761182

Yes, osds 1 and 3 are 150G and 270G respectively. I suspect the answer to this lies in those logfiles and whatever is causing the bulk of those logs. At the moment I'm having difficulty just getting into them.

I think there's probably some recurring network issue here (whether it's infrastructure- or code-based).

[edit] I suspect those uncompressed sizes for the compressed files are wrong: I'm using 'gzip -l' because they are too big to uncompress anywhere (I keep running out of space). Something odd going on here. [/edit]
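
If those numbers come from 'gzip -l', they are indeed unreliable for anything over 4 GiB: the gzip trailer stores the uncompressed size modulo 2^32, and 'gzip -l' simply reports that field. Streaming through zcat sidesteps both the 4 GiB wrap and the disk-space problem:

$ zcat ceph-osd.1.log.gz | wc -c   # true uncompressed size, no temp space needed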

Actions #6

Updated by Brad Hubbard about 3 years ago

$ pwd
/a/swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874516
$ find . -name ceph-osd.*.log -exec ls -lh {} \;
-rw-r--r-- 1 teuthworker teuthworker 4.2G Feb 11 21:10 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.0.log
-rw-r--r-- 1 teuthworker teuthworker 4.8G Feb 11 21:11 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.1.log
-rw-r--r-- 1 teuthworker teuthworker 181G Feb 11 21:09 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.2.log
-rw-r--r-- 1 teuthworker teuthworker 501G Feb 11 20:20 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.3.log

I'd say these jobs are probably a large contributor to the out-of-space issues seen recently.

Actions #7

Updated by Brad Hubbard about 3 years ago

https://tracker.ceph.com/issues/39039 ?

Maybe it would be worth a try to see if disabling cephx improves the situation? Early days yet though, probably needs a little more confirmation.
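
(For reference, on a non-cephadm cluster that would look roughly like the sketch below; as the next comment notes, cephadm has no supported way to do this.)

$ ceph config set global auth_cluster_required none
$ ceph config set global auth_service_required none
$ ceph config set global auth_client_required none
# then restart all daemons for the change to take effect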

Actions #8

Updated by Sebastian Wagner about 3 years ago

  • Related to Bug #39039: mon connection reset, command not resent added
Actions #9

Updated by Sebastian Wagner about 3 years ago

Brad Hubbard wrote:

https://tracker.ceph.com/issues/39039 ?

Maybe it would be worth a try to see if disabling cephx improves the situation? Early days yet though, probably needs a little more confirmation.

Sorry, but there is no way to disable cephx in cephadm. :-(

Actions #10

Updated by Brad Hubbard about 3 years ago

This ran for nearly 24 hours.

$ find . -name ceph-osd.\*.log\*
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.0.log
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.1.log
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.2.log
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.3.log
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.4.log.gz
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.5.log.gz
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.6.log.gz
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.7.log.gz

$ find . -name ceph-osd.\*.log\* -exec zgrep -c "handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted" {} \;
1593
6355436
1203
6355434
91
4530896
4530883
95
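
The -exec zgrep -c form above prints counts without filenames; a variant like this sketch pairs each count with its file so the big offenders stand out:

$ find . -name 'ceph-osd.*.log*' -exec sh -c 'printf "%s %s\n" "$(zgrep -ac "read frame preamble failed" "$1")" "$1"' _ {} \;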
Actions #12

Updated by Brad Hubbard about 3 years ago

Managed to reproduce this with osd logs that are large but still manageable.

On the first osd, just before the slow ops begin, we see this:

2021-02-15T00:47:46.620+0000 7f7bed847700  5 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).send_client_ident sending identification: addrs=172.21.15.132:0/7 target=v2:172.21.15.132:6818/7 gid=1 global_seq=4 features_supported=3f01cfbb7ffdffff features_required=800000000001000 flags=1 cookie=0
2021-02-15T00:47:46.620+0000 7f7bed847700 10 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 159 remaining bytes 0
2021-02-15T00:47:46.620+0000 7f7bed847700 20 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).read_frame
2021-02-15T00:47:46.620+0000 7f7bed847700 20 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read start len=32
2021-02-15T00:47:46.620+0000 7f7bed847700 20 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
2021-02-15T00:47:46.620+0000 7f7bed847700 20 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2021-02-15T00:47:46.620+0000 7f7bed847700  1 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 101
2021-02-15T00:47:46.620+0000 7f7bed847700  1 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-15T00:47:46.620+0000 7f7bed847700 20 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
2021-02-15T00:47:46.620+0000 7f7bed847700  1 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)

On the osd that message was sent to, we see:

2021-02-15T00:47:46.620+0000 7fc589235700  5 --2- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >>  conn(0x55c70f406c00 0x55c71c2f3400 crc :-1 s=SESSION_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_client_ident received client identification: addrs=172.21.15.132:0/7 target=v2:172.21.15.132:6818/7 gid=1 global_seq=4 features_supported=3f01cfbb7ffdffff features_required=800000000001000 flags=1 cookie=0
2021-02-15T00:47:46.620+0000 7fc589235700 10 -- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> 172.21.15.132:0/7 conn(0x55c70f406c00 msgr2=0x55c71c2f3400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._infer_target_addr 172.21.15.132:0/7 -> 172.21.15.132:0/7
2021-02-15T00:47:46.620+0000 7fc589235700 20 --2- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> 172.21.15.132:0/7 conn(0x55c70f406c00 0x55c71c2f3400 crc :-1 s=SESSION_ACCEPTING pgs=4 cs=0 l=1 rev1=1 rx=0 tx=0).handle_existing_connection existing=0x55c71d278c00
2021-02-15T00:47:46.620+0000 7fc589235700  1 --2- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> 172.21.15.132:0/7 conn(0x55c70f406c00 0x55c71c2f3400 crc :-1 s=SESSION_ACCEPTING pgs=4 cs=0 l=1 rev1=1 rx=0 tx=0).handle_existing_connection this is a stale connection, peer_global_seq=4 existing->peer_global_seq=272, stopping this connection.

Not sure why this peer_global_seq mismatch is happening.

Actions #13

Updated by Sebastian Wagner about 3 years ago

  • Subject changed from test_rados_api tests timeout with cephadm to test_rados_api tests timeout with cephadm (plus extremely large OSD logs)
Actions #14

Updated by Sebastian Wagner about 3 years ago

sage: this is related to thrashing and only happens within cephadm. non-cephadm is not affected

Actions #15

Updated by Brad Hubbard about 3 years ago

$ grep -a -c "is a lossy channel. Stopping existing in favor of this connection" ceph-osd.2.log 
1383836

$ grep -a -c "_fault on lossy channel, failing" ceph-osd.2.log 
692000

Wondering if this could be indicative of some container-specific networking issue?

Actions #16

Updated by Neha Ojha about 3 years ago

Sebastian Wagner wrote:

sage: this is related to thrashing and only happens within cephadm. non-cephadm is not affected

These tests do not involve any thrashing; https://tracker.ceph.com/issues/49237 did.

Actions #17

Updated by Sage Weil about 3 years ago

  • Related to Bug #49237: segv in AsyncConnection::_stop() added
Actions #18

Updated by Neha Ojha about 3 years ago

From IRC:

[12:15:10]  <neha> SebastianW, badone, sage: I am no longer seeing https://tracker.ceph.com/issues/49259 on current master, makes me think this was also related to https://github.com/ceph/ceph/pull/39482

https://pulpito.ceph.com/nojha-2021-02-16_18:27:18-rados:cephadm-master-distro-basic-gibba/

Actions #19

Updated by Brad Hubbard about 3 years ago

  • Status changed from New to Need More Info
  • Assignee set to Brad Hubbard
  • Priority changed from Urgent to Normal
Actions #20

Updated by Sebastian Wagner about 3 years ago

Turned out to be caused by https://github.com/ceph/ceph/pull/39530.

Actions #21

Updated by Sebastian Wagner about 3 years ago

  • Status changed from Need More Info to Resolved
Actions #22

Updated by Sebastian Wagner about 3 years ago

2021-02-18T01:28:45.951 INFO:teuthology.orchestra.run.smithi081.stderr:Inferring config /var/lib/ceph/8cedab8a-7187-11eb-900d-001a4aab830c/mon.a/config
2021-02-18T01:28:45.952 INFO:teuthology.orchestra.run.smithi081.stderr:Inferring config /var/lib/ceph/8cedab8a-7187-11eb-900d-001a4aab830c/mon.a/config
2021-02-18T01:28:45.993 INFO:journalctl@ceph.osd.7.smithi194.stdout:Feb 18 01:28:45 smithi194 conmon[45912]: debug 2021-02-18T01:28:45.768+0000 7f9786f3d700 -1 received  signal: Hangup from /dev/init -- /usr/bin/ceph-osd -n osd.7 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug   (PID: 1) UID: 0
2021-02-18T01:28:45.994 INFO:journalctl@ceph.mon.b.smithi194.stdout:Feb 18 01:28:45 smithi194 conmon[30455]: cluster 2021-02-18T01:28:44.899499+0000 mon.a (mon.0) 712 :

https://pulpito.ceph.com/mgfritch-2021-02-18_00:56:34-rados:cephadm-wip-mgfritch-testing-2021-02-17-1029-distro-basic-smithi/5891705/
