Bug #46180

qa: Scrubbing terminated -- not all pgs were active and clean.

Added by Casey Bodley 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Ilya Dryomov
Category:
Correctness/Safety
Target version:
v16.0.0
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Messenger
Pull request ID:
35816
Crash signature:

Description

Seeing several test failures in the rgw suite:

2020-06-24T06:55:00.513 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.0'] to go clean and/or [] to split/merge
2020-06-24T06:55:20.514 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1830, in task
    healthy(ctx=ctx, config=dict(cluster=config['cluster']))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1419, in healthy
    manager.wait_for_clean()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph_manager.py", line 2518, in wait_for_clean
    'wait_for_clean: failed before timeout expired'
AssertionError: wait_for_clean: failed before timeout expired

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1839, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1209, in osd_scrub_pgs
    raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.

http://pulpito.ceph.com/yuvalif-2020-06-23_14:40:15-rgw-wip-yuval-test-35331-35155-distro-basic-smithi/
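
For context, the check that raises these errors is essentially a poll over PG states: the healthy/wait_for_clean step keeps asking the cluster whether every PG has reached active+clean and gives up after a timeout. A minimal sketch of that kind of check follows (hypothetical helpers, not the actual qa/tasks/ceph_manager.py code; the exact JSON fields of 'ceph status' can vary between releases):

# Minimal sketch of a wait_for_clean-style check (hypothetical, not the
# actual teuthology ceph_manager code). It polls 'ceph status' and raises
# once the timeout expires with PGs still not active+clean.
import json
import subprocess
import time

def all_pgs_active_clean():
    out = subprocess.check_output(['ceph', 'status', '--format=json'])
    pgmap = json.loads(out)['pgmap']
    clean = sum(s['count'] for s in pgmap.get('pgs_by_state', [])
                if s['state_name'] == 'active+clean')
    return clean == pgmap['num_pgs']

def wait_for_clean(timeout=300, interval=5):
    deadline = time.time() + timeout
    while not all_pgs_active_clean():
        if time.time() > deadline:
            # Mirrors the assertion seen in the traceback above.
            raise AssertionError('wait_for_clean: failed before timeout expired')
        time.sleep(interval)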


Related issues

Related to RADOS - Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY Duplicate
Duplicated by RADOS - Bug #46211: qa: pools stuck in creating Duplicate
Duplicated by RADOS - Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) Duplicate
Duplicated by RADOS - Bug #46225: Health check failed: 1 osds down (OSD_DOWN) Duplicate

History

#1 Updated by Sridhar Seshasayee 4 months ago

Saw the same error during this run:
http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/

Job ID: 5176265

Failure Reason:
Scrubbing terminated -- not all pgs were active and clean.

#2 Updated by Patrick Donnelly 4 months ago

  • Duplicated by Bug #46211: qa: pools stuck in creating added

#3 Updated by Patrick Donnelly 4 months ago

  • Priority changed from Normal to Immediate
  • Target version set to v16.0.0

#4 Updated by Neha Ojha 4 months ago

The common thread in all of these is that the tests fail while running the ceph task; there is no thrashing or anything else going on at this point.

From /a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176265

This is where we start seeing PGs getting stuck:

2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() 8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, 297 GiB / 300 GiB avail

Just before that there seem to be issues in the msgr, which lead to a reap_dead start:

2020-06-24T21:05:50.181+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).ready entity=client.4179 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2020-06-24T21:05:50.199+0000 7fa4227c8700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] <== osd.0 v2:172.21.15.205:6806/20452 6 ==== pg_stats(1 pgs tid 0 v 0) v2 ==== 1199+0+0 (secure 0 0 0) 0x5647f6de0900 con 0x5647f6d47400
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-06-24T21:05:50.200+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2020-06-24T21:05:50.200+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).stop
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] reap_dead start
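
When triaging similar runs, a quick way to spot this pattern in a mgr or OSD log is to pull out the preamble read failures together with the reap_dead lines, e.g. with a throwaway helper like the one below (assuming the log format shown above):

# Throwaway triage helper (not part of teuthology): print the msgr lines
# that mark this failure mode -- frame preamble read failures followed by
# connection reaping -- from a ceph daemon log given on the command line.
import re
import sys

PATTERNS = re.compile(
    r'handle_read_frame_preamble_main read frame preamble failed'
    r'|read_until read failed'
    r'|reap_dead start')

with open(sys.argv[1]) as log:
    for line in log:
        if PATTERNS.search(line):
            print(line.rstrip())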

#5 Updated by Sebastian Wagner 4 months ago

  • Related to Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added

#6 Updated by Ilya Dryomov 4 months ago

  • Category set to Correctness/Safety
  • Assignee set to Ilya Dryomov
  • Component(RADOS) Messenger added

This is a msgr2.1 issue.

#7 Updated by Ilya Dryomov 4 months ago

I think it has to do with reconnect handling and how connections are reused.

This part of ProtocolV2 is pretty fragile, as evidenced by the steadily accumulating workarounds for invalid memory use issues during msgr2.0 development and after (the most recent just in March). Most likely what happens is that the FrameAssembler is_rev1 state gets lost and a msgr2.0 frame ends up being assembled while the peer is expecting a msgr2.1 frame. I'll confirm and put out a fix ASAP.
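
As an illustration of the suspected failure mode (an abstract model, not the actual FrameAssembler code): both peers negotiate a revision flag at handshake time, and both frame building and frame parsing depend on it. If one side silently drops back to the old revision after a reconnect or session reset while its peer still expects the new one, the peer's preamble parse fails and the connection is torn down, matching the "read frame preamble failed" lines in the logs above.

# Abstract model of the suspected bug (illustration only, not Ceph code):
# both peers must agree on the frame revision; if one side loses its
# 'rev1' flag across a session reset, the other side cannot parse the
# preamble it receives.
import struct

def build_preamble(rev1, num_segments):
    # Pretend rev1 preambles carry an extra field the old format lacks.
    if rev1:
        return struct.pack('<BBH', 2, num_segments, 0xABCD)  # tag, segs, crc
    return struct.pack('<BB', 1, num_segments)                # tag, segs

def parse_preamble(rev1, data):
    fmt = '<BBH' if rev1 else '<BB'
    if len(data) != struct.calcsize(fmt):
        raise ValueError('read frame preamble failed')
    return struct.unpack(fmt, data)

# Handshake agreed on rev1 on both sides...
peer_expects_rev1 = True
# ...but a reset on the sender clobbers its copy of the flag:
sender_rev1 = False

try:
    parse_preamble(peer_expects_rev1, build_preamble(sender_rev1, 1))
except ValueError as e:
    print(e)  # -> read frame preamble failed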

#8 Updated by Neha Ojha 4 months ago

  • Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added

#9 Updated by Neha Ojha 4 months ago

  • Related to Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added

#10 Updated by Neha Ojha 4 months ago

Here's a reliable reproducer for the issue:

-s rados/singleton-nomsgr -c master --filter 'all/health-warnings rados' -N 20
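
(For reference, these are teuthology-suite arguments, roughly: -s picks the suite, -c the ceph branch, --filter narrows the run to the matching job description, and -N repeats each matching job 20 times.)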

https://pulpito.ceph.com/nojha-2020-06-26_17:24:37-rados:singleton-nomsgr-master-distro-basic-smithi/

#11 Updated by Ilya Dryomov 4 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35816

#12 Updated by Ilya Dryomov 4 months ago

  • Status changed from Fix Under Review to Resolved

#13 Updated by Ilya Dryomov 4 months ago

  • Related to deleted (Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) )

#14 Updated by Ilya Dryomov 4 months ago

  • Duplicated by Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added

#15 Updated by Neha Ojha 4 months ago

  • Related to deleted (Bug #46225: Health check failed: 1 osds down (OSD_DOWN))

#16 Updated by Neha Ojha 4 months ago

  • Duplicated by Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added

#17 Updated by Neha Ojha 4 months ago

  • Related to deleted (Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY)

#18 Updated by Neha Ojha 4 months ago

  • Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added
