Bug #49259
Closed
test_rados_api tests timeout with cephadm (plus extremely large OSD logs)
Description
swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874513
rados:cephadm/with-work/{distro/ubuntu_20.04 fixed-2 mode/packaged mon_election/classic msgr/async start tasks/rados_api_tests}
The tests actually complete:
grep -v -e 'slow request osd_op' -e 'heartbeat_check' -e get_health_metrics -e 'have slow ops' -e ' pgmap v' -e 'journalctl@ceph.mon' -e 'tcmu-runner' teuthology.log | less

2021-02-11T13:11:48.564 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [ OK ] LibRadosAioEC.ExecuteClassPP (7016 ms)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [ RUN ] LibRadosAioEC.OmapPP
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [ OK ] LibRadosAioEC.OmapPP (7016 ms)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [ RUN ] LibRadosAioEC.MultiWritePP
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [ OK ] LibRadosAioEC.MultiWritePP (7027 ms)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [----------] 20 tests from LibRadosAioEC (139516 ms total)
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp:
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [----------] Global test environment tear-down
2021-02-11T13:11:48.565 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [==========] 53 tests from 4 test suites ran. (294146 ms total)
2021-02-11T13:11:48.566 INFO:tasks.workunit.client.0.smithi091.stdout: api_aio_pp: [ PASSED ] 53 tests.
2021-02-11T16:04:13.814 INFO:journalctl@ceph.prometheus.a.smithi110.stdout:Feb 11 16:04:13 smithi110 bash[29017]: level=info ts=2021-02-11T16:04:13.510Z caller=compact.go:495 component=tsdb msg="write block" mint=1613048653399 maxt=1613052000000 ulid=01EY8XBWFN9THM46AQ56K0YJ9V duration=80.839364ms
2021-02-11T16:04:13.814 INFO:journalctl@ceph.prometheus.a.smithi110.stdout:Feb 11 16:04:13 smithi110 bash[29017]: level=info ts=2021-02-11T16:04:13.523Z caller=head.go:662 component=tsdb msg="Head GC completed" duration=2.564364ms
2021-02-11T16:06:54.424 INFO:tasks.workunit.client.0.smithi091.stderr:++ cleanup
2021-02-11T16:06:54.424 INFO:tasks.workunit.client.0.smithi091.stderr:++ pkill -P 35759
2021-02-11T16:06:54.462 INFO:tasks.workunit.client.0.smithi091.stderr:++ true
2021-02-11T16:06:54.494 DEBUG:teuthology.orchestra.run:got remote process result: 124
but lots of slow ops:
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.c.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[17322]: cluster 2021-02-11T16:06:53.847115+0000 osd.0 (osd.0) 72781 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=7 ondisk+retry+write+known_if_redirected e350) initiated 2021-02-11T15:07:05.776639+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173794+0000 osd.2 (osd.2) 18309 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=2 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:07:35.487913+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173821+0000 osd.2 (osd.2) 18310 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:22:35.490704+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173861+0000 osd.2 (osd.2) 18311 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=1 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T14:52:35.486234+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173928+0000 osd.2 (osd.2) 18312 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=4 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:37:35.490076+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173954+0000 osd.2 (osd.2) 18313 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e350) initiated 2021-02-11T15:52:35.490534+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:52.173977+0000 osd.2 (osd.2) 18314 : cluster [WRN] slow request osd_op(client.37922.0:257 174.19 174:9c1d23e5:test-rados-api-smithi091-35848-12::bam:head [read 0~1] snapc 0=[] ondisk+read+known_if_redirected e350) initiated 2021-02-11T14:36:22.874727+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.846990+0000 osd.0 (osd.0) 72771 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e341) initiated 2021-02-11T13:52:05.770047+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.847012+0000 osd.0 (osd.0) 72772 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected e341) initiated 2021-02-11T13:37:05.766073+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.847025+0000 osd.0 (osd.0) 72773 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=4 ondisk+retry+write+known_if_redirected e341) initiated 2021-02-11T14:22:05.770189+0000 currently delayed
2021-02-11T16:06:54.144 INFO:journalctl@ceph.mon.a.smithi091.stdout:Feb 11 16:06:53 smithi091 bash[13455]: cluster 2021-02-11T16:06:53.847036+0000 osd.0 (osd.0) 72774 : cluster [WRN] slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RE
lots of uncommon health warnings:
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001835+0000 mon.a (mon.0) 3849 : cluster [WRN] [WRN] CACHE_POOL_NO_HIT_SET: 1 cache pools are missing hit_sets
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001854+0000 mon.a (mon.0) 3850 : cluster [WRN] pool 'test-rados-api-smithi091-35848-11' with cache_mode writeback needs hit_set_type to be set but it is not
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001874+0000 mon.a (mon.0) 3851 : cluster [WRN] [WRN] POOL_APP_NOT_ENABLED: 1 pool(s) do not have an application enabled
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001888+0000 mon.a (mon.0) 3852 : cluster [WRN] application not enabled on pool 'ceph_test_rados_api_asio'
2021-02-11T15:30:15.764 INFO:journalctl@ceph.mon.b.smithi110.stdout:Feb 11 15:30:01 smithi110 bash[13423]: cluster 2021-02-11T15:30:00.001901+0000 mon.a (mon.0) 3853 : cluster [WRN] use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications
Updated by Sebastian Wagner about 3 years ago
- Description updated (diff)
- ceph-qa-suite rados added
Updated by Neha Ojha about 3 years ago
- Subject changed from slow ops on cephadm/test_rados_api tests to test_rados_api tests timeout with cephadm
- Priority changed from Normal to Urgent
/a/swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874516/
Seems like this is where it is stuck:
Running main() from gmock_main.cc
[==========] Running 16 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from LibRadosWatchNotifyECPP
[ RUN ] LibRadosWatchNotifyECPP.WatchNotify
notify
[ OK ] LibRadosWatchNotifyECPP.WatchNotify (2197 ms)
[ RUN ] LibRadosWatchNotifyECPP.WatchNotifyTimeout
[ OK ] LibRadosWatchNotifyECPP.WatchNotifyTimeout (11 ms)
[----------] 2 tests from LibRadosWatchNotifyECPP (2208 ms total)
[----------] 14 tests from LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0
notify
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0 (7 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1
notify
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1 (3751 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/0
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/0 (636 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/1
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotifyTimeout/1 (15 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/0
handle_notify cookie 93945391327952 notify_id 347892350976 notifier_gid 15045
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/0 (9 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/1
handle_notify cookie 93945391346960 notify_id 347892350976 notifier_gid 15045
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2/1 (30 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/0
handle_notify cookie 93945391347984 notify_id 347892350977 notifier_gid 15045
[ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/0 (34 ms)
[ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
Updated by Neha Ojha about 3 years ago
/a/swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874516 shows the same problem as https://tracker.ceph.com/issues/49259#note-2
Updated by Brad Hubbard about 3 years ago
From swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874513: this does not look like a problem with the test itself; it goes deeper than that. The following op, for example, has been delayed for more than an hour:
2021-02-11T16:07:16.719+0000 7fb9cabc3700 0 log_channel(cluster) log [WRN] : slow request osd_op(client.24646.0:2078 16.14 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head [copy-from ver 0 in=74b] snapc 0=[] RETRY=7 ondisk+retry+write+known_if_redirected e350) initiated 2021-02-11T15:07:05.776639+0000 currently delayed
Trying to trawl through the massive logs to get more information.
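One way to narrow the trawl (a sketch only): pre-filter each OSD log down to the stuck op's reqid, client.24646.0:2078 from the slow-request warning above, assuming the per-OSD logs sit in the current directory under their usual ceph-osd.N.log names.

```shell
# Extract just the lines mentioning the stuck op's reqid into small
# per-OSD trace files, so its lifecycle can be followed without paging
# through the full multi-GB logs. -a forces text mode in case a log
# contains binary debris.
for f in ceph-osd.*.log; do
  grep -a 'client.24646.0:2078' "$f" > "trace.$f.txt"
done
```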
Updated by Brad Hubbard about 3 years ago
The stuck op is a copy of 16:8744f7fc:test-rados-api-smithi091-35842-17::big:head to 16:2b70cbe7:test-rados-api-smithi091-35842-17::big.copy:head (a copy invoked by one of the API tests). In order to do that we need to read 16:8744f7fc:test-rados-api-smithi091-35842-17::big:head, but we seem to encounter some sort of error while doing so.
handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-02-11T13:07:40.778+0000 7fb9d144d700 5 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending message m=0x55cbe43d3400 seq=1 osd_op(osd.0.8:1 16.1 16:8744f7fc:test-rados-api-smithi091-35842-17::big:head [copy-get max 8388608 in=29b] snapc 0=[] ondisk+read+known_if_redirected e93) v8
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending m=0x55cbe43d3400 seq=1 src=osd.0 off=0
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 364 remaining bytes 0
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).write_message sending 0x55cbe43d3400 done.
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2021-02-11T13:07:40.778+0000 7fb9d144d700 1 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 103
2021-02-11T13:07:40.778+0000 7fb9d144d700 1 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
2021-02-11T13:07:40.778+0000 7fb9d144d700 1 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0)._fault
2021-02-11T13:07:40.778+0000 7fb9d144d700 2 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0)._fault on lossy channel, failing
2021-02-11T13:07:40.778+0000 7fb9d144d700 1 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-02-11T13:07:40.778+0000 7fb9d144d700 5 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).reset_recv_state
2021-02-11T13:07:40.778+0000 7fb9d144d700 5 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 crc :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).reset_security
2021-02-11T13:07:40.778+0000 7fb9d144d700 10 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 unknown :-1 s=READY pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).discard_out_queue started
2021-02-11T13:07:40.778+0000 7fb9d144d700 20 EpollDriver.del_event del event fd=103 cur_mask=1 delmask=3 to 8
2021-02-11T13:07:40.778+0000 7fb9b7b26700 1 -- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 msgr2=0x55cbe4d19400 unknown :-1 s=STATE_CLOSED l=1).mark_down
2021-02-11T13:07:40.778+0000 7fb9b7b26700 1 --2- 172.21.15.91:0/6 >> [v2:172.21.15.91:6818/6,v1:172.21.15.91:6819/6] conn(0x55cbe48a6000 0x55cbe4d19400 unknown :-1 s=CLOSED pgs=55 cs=0 l=1 rev1=1 rx=0 tx=0).stop
These are the sizes of the osd log files.
osd.0 5202195739
osd.1 160654481078
osd.2 4587026710
osd.3 289661288927
osd.4 4027220224
osd.5 2770255457
osd.6 2016671190
osd.7 1333761182
Yes, osds 1 and 3 are 150G and 270G respectively. I suspect the answer to this lies in those logfiles and whatever is causing the bulk of those logs. At the moment I'm having difficulty just getting into them.
I think there's probably some recurring network issue here (whether it's infrastructure or code based).
[edit] I suspect those uncompressed sizes for the compressed files are wrong, since I'm using 'gzip -l' (the files are too big to uncompress anywhere; I keep running out of space). Something odd is going on here. [/edit]
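For what it's worth, `gzip -l` reads the 32-bit ISIZE field from the gzip trailer (RFC 1952), which stores the uncompressed size modulo 2^32, so it under-reports anything larger than 4 GiB. A sketch of a workaround that streams through zcat instead, assuming the compressed logs sit in the current directory under their usual names:

```shell
# gzip -l trusts the trailer's 32-bit ISIZE (size mod 4 GiB); counting the
# bytes coming out of zcat gives the true uncompressed size without needing
# scratch space to hold the decompressed file.
for f in ceph-osd.*.log.gz; do
  printf '%s %s\n' "$f" "$(zcat "$f" | wc -c)"
done
```

This is slow for multi-hundred-GB logs, but at least the numbers it produces are trustworthy.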
Updated by Brad Hubbard about 3 years ago
$ pwd
/a/swagner-2021-02-11_10:31:33-rados:cephadm-wip-swagner-testing-2021-02-09-1126-distro-basic-smithi/5874516
$ find . -name ceph-osd.*.log -exec ls -lh {} \;
-rw-r--r-- 1 teuthworker teuthworker 4.2G Feb 11 21:10 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.0.log
-rw-r--r-- 1 teuthworker teuthworker 4.8G Feb 11 21:11 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.1.log
-rw-r--r-- 1 teuthworker teuthworker 181G Feb 11 21:09 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.2.log
-rw-r--r-- 1 teuthworker teuthworker 501G Feb 11 20:20 ./remote/ubuntu@smithi040.front.sepia.ceph.com/log/7dfbf29c-6c6b-11eb-8ff4-001a4aab830c/ceph-osd.3.log
I'd say these jobs are probably a large contributor to the out of space issues seen recently.
Updated by Brad Hubbard about 3 years ago
https://tracker.ceph.com/issues/39039 ?
Maybe it would be worth a try to see if disabling cephx improves the situation? Early days yet though, probably needs a little more confirmation.
Updated by Sebastian Wagner about 3 years ago
- Related to Bug #39039: mon connection reset, command not resent added
Updated by Sebastian Wagner about 3 years ago
Brad Hubbard wrote:
https://tracker.ceph.com/issues/39039 ?
Maybe it would be worth a try to see if disabling cephx improves the situation? Early days yet though, probably needs a little more confirmation.
sorry, but there is no way to disable cephx in cephadm. :-(
Updated by Brad Hubbard about 3 years ago
This ran for nearly 24 hours.
$ find . -name ceph-osd.\*.log\*
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.0.log
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.1.log
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.2.log
./remote/ubuntu@smithi091.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.3.log
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.4.log.gz
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.5.log.gz
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.6.log.gz
./remote/ubuntu@smithi110.front.sepia.ceph.com/log/671ee654-6c68-11eb-8ff4-001a4aab830c/ceph-osd.7.log.gz
$ find . -name ceph-osd.\*.log\* -exec zgrep -c "handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted" {} \;
1593
6355436
1203
6355434
91
4530896
4530883
95
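The bare `-exec zgrep -c ... \;` output above prints only counts, which are hard to map back to individual OSDs. A small variant (a sketch; it uses a shortened match string rather than the full preamble message) that prints each file name next to its count:

```shell
# Pair each OSD log with its match count. zgrep transparently handles both
# plain and .gz files, and -a forces text mode in case a log contains
# binary debris that would otherwise suppress grep's output.
find . -name 'ceph-osd.*.log*' | while read -r f; do
  printf '%s %s\n' "$f" "$(zgrep -a -c 'read frame preamble failed' "$f")"
done
```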
Updated by Brad Hubbard about 3 years ago
Managed to reproduce this with some manageable large osd logs.
On the first OSD, just before the slow ops begin, we see this:
2021-02-15T00:47:46.620+0000 7f7bed847700 5 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).send_client_ident sending identification: addrs=172.21.15.132:0/7 target=v2:172.21.15.132:6818/7 gid=1 global_seq=4 features_supported=3f01cfbb7ffdffff features_required=800000000001000 flags=1 cookie=0
2021-02-15T00:47:46.620+0000 7f7bed847700 10 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 159 remaining bytes 0
2021-02-15T00:47:46.620+0000 7f7bed847700 20 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).read_frame
2021-02-15T00:47:46.620+0000 7f7bed847700 20 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read start len=32
2021-02-15T00:47:46.620+0000 7f7bed847700 20 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).process
2021-02-15T00:47:46.620+0000 7f7bed847700 20 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read continue len=32
2021-02-15T00:47:46.620+0000 7f7bed847700 1 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 101
2021-02-15T00:47:46.620+0000 7f7bed847700 1 -- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 msgr2=0x5560e64a4000 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-15T00:47:46.620+0000 7f7bed847700 20 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main r=-1
2021-02-15T00:47:46.620+0000 7f7bed847700 1 --2- 172.21.15.132:0/7 >> [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] conn(0x5560ea52bc00 0x5560e64a4000 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
On the OSD that message was sent to, we see:
2021-02-15T00:47:46.620+0000 7fc589235700 5 --2- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> conn(0x55c70f406c00 0x55c71c2f3400 crc :-1 s=SESSION_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_client_ident received client identification: addrs=172.21.15.132:0/7 target=v2:172.21.15.132:6818/7 gid=1 global_seq=4 features_supported=3f01cfbb7ffdffff features_required=800000000001000 flags=1 cookie=0
2021-02-15T00:47:46.620+0000 7fc589235700 10 -- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> 172.21.15.132:0/7 conn(0x55c70f406c00 msgr2=0x55c71c2f3400 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._infer_target_addr 172.21.15.132:0/7 -> 172.21.15.132:0/7
2021-02-15T00:47:46.620+0000 7fc589235700 20 --2- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> 172.21.15.132:0/7 conn(0x55c70f406c00 0x55c71c2f3400 crc :-1 s=SESSION_ACCEPTING pgs=4 cs=0 l=1 rev1=1 rx=0 tx=0).handle_existing_connection existing=0x55c71d278c00
2021-02-15T00:47:46.620+0000 7fc589235700 1 --2- [v2:172.21.15.132:6818/7,v1:172.21.15.132:6819/7] >> 172.21.15.132:0/7 conn(0x55c70f406c00 0x55c71c2f3400 crc :-1 s=SESSION_ACCEPTING pgs=4 cs=0 l=1 rev1=1 rx=0 tx=0).handle_existing_connection this is a stale connection, peer_global_seq=4 existing->peer_global_seq=272, stopping this connection.
Not sure why this peer_global_seq mismatch is happening.
Updated by Sebastian Wagner about 3 years ago
- Subject changed from test_rados_api tests timeout with cephadm to test_rados_api tests timeout with cephadm (plus extremely large OSD logs)
Updated by Sebastian Wagner about 3 years ago
sage: this is related to thrashing and only happens within cephadm. non-cephadm is not affected
Updated by Brad Hubbard about 3 years ago
$ grep -a -c "is a lossy channel. Stopping existing in favor of this connection" ceph-osd.2.log
1383836
$ grep -a -c "_fault on lossy channel, failing" ceph-osd.2.log
692000
Wondering if this could be indicative of some container-specific networking issue?
Updated by Neha Ojha about 3 years ago
Sebastian Wagner wrote:
sage: this is related to thrashing and only happens within cephadm. non-cephadm is not affected
These tests do not involve any thrashing; https://tracker.ceph.com/issues/49237 did.
Updated by Sage Weil about 3 years ago
- Related to Bug #49237: segv in AsyncConnection::_stop() added
Updated by Neha Ojha about 3 years ago
From IRC:
[12:15:10] <neha> SebastianW, badone, sage: I am no longer seeing https://tracker.ceph.com/issues/49259 on current master, makes me think this was also related to https://github.com/ceph/ceph/pull/39482
https://pulpito.ceph.com/nojha-2021-02-16_18:27:18-rados:cephadm-master-distro-basic-gibba/
Updated by Brad Hubbard about 3 years ago
- Status changed from New to Need More Info
- Assignee set to Brad Hubbard
- Priority changed from Urgent to Normal
Updated by Sebastian Wagner about 3 years ago
Turned out to be caused by https://github.com/ceph/ceph/pull/39530.
Updated by Sebastian Wagner about 3 years ago
- Status changed from Need More Info to Resolved
Updated by Sebastian Wagner about 3 years ago
2021-02-18T01:28:45.951 INFO:teuthology.orchestra.run.smithi081.stderr:Inferring config /var/lib/ceph/8cedab8a-7187-11eb-900d-001a4aab830c/mon.a/config
2021-02-18T01:28:45.952 INFO:teuthology.orchestra.run.smithi081.stderr:Inferring config /var/lib/ceph/8cedab8a-7187-11eb-900d-001a4aab830c/mon.a/config
2021-02-18T01:28:45.993 INFO:journalctl@ceph.osd.7.smithi194.stdout:Feb 18 01:28:45 smithi194 conmon[45912]: debug 2021-02-18T01:28:45.768+0000 7f9786f3d700 -1 received signal: Hangup from /dev/init -- /usr/bin/ceph-osd -n osd.7 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug (PID: 1) UID: 0
2021-02-18T01:28:45.994 INFO:journalctl@ceph.mon.b.smithi194.stdout:Feb 18 01:28:45 smithi194 conmon[30455]: cluster 2021-02-18T01:28:44.899499+0000 mon.a (mon.0) 712 :