Bug #46180
closedqa: Scrubbing terminated -- not all pgs were active and clean.
0%
Description
Seeing several test failures in the rgw suite:
2020-06-24T06:55:00.513 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.0'] to go clean and/or [] to split/merge 2020-06-24T06:55:20.514 ERROR:teuthology.contextutil:Saw exception from nested tasks Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1830, in task healthy(ctx=ctx, config=dict(cluster=config['cluster'])) File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1419, in healthy manager.wait_for_clean() File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph_manager.py", line 2518, in wait_for_clean 'wait_for_clean: failed before timeout expired' AssertionError: wait_for_clean: failed before timeout expired During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 33, in nested yield vars File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1839, in task osd_scrub_pgs(ctx, config) File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1209, in osd_scrub_pgs raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.") RuntimeError: Scrubbing terminated -- not all pgs were active and clean.
Updated by Sridhar Seshasayee almost 4 years ago
Saw the same error during this run:
http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/
Job ID: 5176265
Failure Reason:
Scrubbing terminated -- not all pgs were active and clean.
Updated by Patrick Donnelly almost 4 years ago
- Has duplicate Bug #46211: qa: pools stuck in creating added
Updated by Patrick Donnelly almost 4 years ago
- Priority changed from Normal to Immediate
- Target version set to v16.0.0
Updated by Neha Ojha almost 4 years ago
The common thing in all of these is that the tests are all failing while running the ceph task, no thrashing or anything going on at all at this time.
From /a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176265
This is where we start seeing PGs getting stuck:
2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() 8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, 297 GiB / 300 GiB avail
Just before that there seem to be issues in the msgr which is leading to reap_dead start
2020-06-24T21:05:50.181+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).ready entity=client.4179 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0 2020-06-24T21:05:50.199+0000 7fa4227c8700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] <== osd.0 v2:172.21.15.205:6806/20452 6 ==== pg_stats(1 pgs tid 0 v 0) v2 ==== 1199+0+0 (secure 0 0 0) 0x5647f6de0900 con 0x5647f6d47400 2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42 2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 2020-06-24T21:05:50.200+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted) 2020-06-24T21:05:50.200+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).stop 2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] reap_dead start
Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added
Updated by Ilya Dryomov almost 4 years ago
- Category set to Correctness/Safety
- Assignee set to Ilya Dryomov
- Component(RADOS) Messenger added
This is a msgr2.1 issue.
Updated by Ilya Dryomov almost 4 years ago
I think it has to do with reconnect handling and how connections are reused.
This part of ProtocolV2 is pretty fragile, evidenced by steadily accumulating workarounds for invalid memory use issues during msgr2.0 development and after (the last just in March). Most likely what happens is FrameAssembler is_rev1 state gets lost and a 2.0 frame ends up being assembled while the peer is expecting a 2.1 frame. I'll confirm and put out a fix ASAP.
Updated by Neha Ojha almost 4 years ago
- Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added
Updated by Neha Ojha almost 4 years ago
- Related to Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added
Updated by Neha Ojha almost 4 years ago
Here's a reliable reproducer for the issue:
-s rados/singleton-nomsgr -c master --filter 'all/health-warnings rados' -N 20
Updated by Ilya Dryomov almost 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 35816
Updated by Ilya Dryomov almost 4 years ago
- Status changed from Fix Under Review to Resolved
Will be cherry-picked into https://github.com/ceph/ceph/pull/35720 and https://github.com/ceph/ceph/pull/35733.
Updated by Ilya Dryomov almost 4 years ago
- Related to deleted (Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) )
Updated by Ilya Dryomov almost 4 years ago
- Has duplicate Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added
Updated by Neha Ojha almost 4 years ago
- Related to deleted (Bug #46225: Health check failed: 1 osds down (OSD_DOWN))
Updated by Neha Ojha almost 4 years ago
- Has duplicate Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added
Updated by Neha Ojha almost 4 years ago
- Related to deleted (Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY)
Updated by Neha Ojha almost 4 years ago
- Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added