Bug #46180

closed

qa: Scrubbing terminated -- not all pgs were active and clean.

Added by Casey Bodley almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Messenger
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seeing several test failures in the rgw suite:

2020-06-24T06:55:00.513 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.0'] to go clean and/or [] to split/merge
2020-06-24T06:55:20.514 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1830, in task
    healthy(ctx=ctx, config=dict(cluster=config['cluster']))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1419, in healthy
    manager.wait_for_clean()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph_manager.py", line 2518, in wait_for_clean
    'wait_for_clean: failed before timeout expired'
AssertionError: wait_for_clean: failed before timeout expired

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1839, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1209, in osd_scrub_pgs
    raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.

http://pulpito.ceph.com/yuvalif-2020-06-23_14:40:15-rgw-wip-yuval-test-35331-35155-distro-basic-smithi/
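For context, the check that fails here boils down to polling the PG states until everything is active+clean, then giving up after a timeout. Below is a minimal sketch of that pattern, assuming a hypothetical get_pg_states() helper that returns a mapping of pgid to state string; this is illustrative only, not the actual teuthology ceph_manager.wait_for_clean code.

import time

def wait_for_clean(get_pg_states, timeout=300, interval=5):
    """Poll until every PG reports active+clean, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        states = get_pg_states()  # e.g. {'2.0': 'creating+peering', '1.0': 'active+clean'}
        not_clean = [pgid for pgid, st in states.items()
                     if 'active' not in st or 'clean' not in st]
        if not not_clean:
            return
        if time.monotonic() > deadline:
            raise AssertionError(
                'wait_for_clean: failed before timeout expired, '
                'waiting on %s' % not_clean)
        time.sleep(interval)

In the failures above, PG 2.0 never leaves creating+peering, so the loop hits the timeout and the subsequent scrub step raises the "Scrubbing terminated" error.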


Related issues 4 (0 open, 4 closed)

Related to RADOS - Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY (Duplicate)
Has duplicate RADOS - Bug #46211: qa: pools stuck in creating (Duplicate)
Has duplicate RADOS - Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) (Duplicate)
Has duplicate RADOS - Bug #46225: Health check failed: 1 osds down (OSD_DOWN) (Duplicate)
Actions #1

Updated by Sridhar Seshasayee almost 4 years ago

Saw the same error during this run:
http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/

Job ID: 5176265

Failure Reason:
Scrubbing terminated -- not all pgs were active and clean.

Actions #2

Updated by Patrick Donnelly almost 4 years ago

  • Has duplicate Bug #46211: qa: pools stuck in creating added
Actions #3

Updated by Patrick Donnelly almost 4 years ago

  • Priority changed from Normal to Immediate
  • Target version set to v16.0.0
Actions #4

Updated by Neha Ojha almost 4 years ago

The common thread in all of these failures is that the tests fail while running the ceph task; there is no thrashing or anything else going on at this time.

From /a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176265

This is where we start seeing PGs getting stuck:

2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() 8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, 297 GiB / 300 GiB avail

Just before that, there seem to be issues in the msgr, which lead to a reap_dead start:

2020-06-24T21:05:50.181+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).ready entity=client.4179 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2020-06-24T21:05:50.199+0000 7fa4227c8700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] <== osd.0 v2:172.21.15.205:6806/20452 6 ==== pg_stats(1 pgs tid 0 v 0) v2 ==== 1199+0+0 (secure 0 0 0) 0x5647f6de0900 con 0x5647f6d47400
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-06-24T21:05:50.200+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2020-06-24T21:05:50.200+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).stop
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] reap_dead start
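To spot this sequence in other runs, one way is to scan the daemon log for the same read failure followed by reap_dead. A rough sketch follows; the patterns are lifted from the excerpt above, and the log path is whatever run log is being inspected.

import re
import sys

# Print messenger lines of interest so the failing connection can be
# correlated with the PGs that get stuck shortly afterwards.
PATTERNS = [
    r'read_bulk peer close file descriptor',
    r'read_until read failed',
    r'handle_read_frame_preamble_main read frame preamble failed',
    r'\.stop\b',
    r'reap_dead start',
]

def scan(path):
    regexes = [re.compile(p) for p in PATTERNS]
    with open(path) as f:
        for line in f:
            if any(r.search(line) for r in regexes):
                print(line.rstrip())

if __name__ == '__main__':
    scan(sys.argv[1])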
Actions #5

Updated by Sebastian Wagner almost 4 years ago

  • Related to Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added
Actions #6

Updated by Ilya Dryomov almost 4 years ago

  • Category set to Correctness/Safety
  • Assignee set to Ilya Dryomov
  • Component(RADOS) Messenger added

This is a msgr2.1 issue.

Actions #7

Updated by Ilya Dryomov almost 4 years ago

I think it has to do with reconnect handling and how connections are reused.

This part of ProtocolV2 is pretty fragile, as evidenced by the steadily accumulating workarounds for invalid memory use issues during msgr2.0 development and afterwards (the last one just in March). Most likely what happens is that the FrameAssembler is_rev1 state gets lost and a msgr2.0 frame ends up being assembled while the peer is expecting a msgr2.1 frame. I'll confirm and put out a fix ASAP.
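To illustrate the class of bug being described (a toy sketch, not the actual ProtocolV2/FrameAssembler code; all names below are hypothetical): if the per-connection revision flag is dropped when a connection is reused for a reconnect, the local side falls back to msgr2.0 framing while the peer still expects msgr2.1 frames and fails to parse the preamble, much like the read failure in the log above.

class FrameAssemblerSketch:
    """Toy stand-in for a per-connection frame assembler. Hypothetical."""

    def __init__(self):
        self.is_rev1 = False  # negotiated during the handshake

    def negotiate(self, peer_supports_rev1):
        self.is_rev1 = peer_supports_rev1

    def reset_for_reuse(self):
        # Bug pattern: a full reset on connection reuse silently drops
        # the negotiated revision instead of preserving it.
        self.is_rev1 = False

    def assemble(self, payload):
        # msgr2.1 ("rev1") and msgr2.0 ("rev0") frames are laid out
        # differently on the wire, so both sides must agree on the revision.
        return ('rev1', payload) if self.is_rev1 else ('rev0', payload)


asm = FrameAssemblerSketch()
asm.negotiate(peer_supports_rev1=True)
assert asm.assemble(b'ping')[0] == 'rev1'

# After a reconnect that reuses the connection state:
asm.reset_for_reuse()
assert asm.assemble(b'ping')[0] == 'rev0'  # peer still expects rev1 -> parse failure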

Actions #8

Updated by Neha Ojha almost 4 years ago

  • Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added
Actions #9

Updated by Neha Ojha almost 4 years ago

  • Related to Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added
Actions #10

Updated by Neha Ojha almost 4 years ago

Here's a reliable reproducer for the issue:

-s rados/singleton-nomsgr -c master --filter 'all/health-warnings rados' -N 20

https://pulpito.ceph.com/nojha-2020-06-26_17:24:37-rados:singleton-nomsgr-master-distro-basic-smithi/
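Those look like teuthology-suite arguments (-s suite, -c ceph branch, --filter, -N number of runs); assuming that reading is correct, a hedged sketch of driving the reproducer from a script:

import subprocess

# Illustrative only: the exact invocation may differ per environment.
cmd = [
    'teuthology-suite',
    '-s', 'rados/singleton-nomsgr',
    '-c', 'master',
    '--filter', 'all/health-warnings rados',
    '-N', '20',
]
subprocess.run(cmd, check=True)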

Actions #11

Updated by Ilya Dryomov almost 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35816
Actions #12

Updated by Ilya Dryomov almost 4 years ago

  • Status changed from Fix Under Review to Resolved
Actions #13

Updated by Ilya Dryomov almost 4 years ago

  • Related to deleted (Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) )
Actions #14

Updated by Ilya Dryomov almost 4 years ago

  • Has duplicate Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added
Actions #15

Updated by Neha Ojha almost 4 years ago

  • Related to deleted (Bug #46225: Health check failed: 1 osds down (OSD_DOWN))
Actions #16

Updated by Neha Ojha almost 4 years ago

  • Has duplicate Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added
Actions #17

Updated by Neha Ojha almost 4 years ago

  • Related to deleted (Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY)
Actions #18

Updated by Neha Ojha almost 4 years ago

  • Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added