Bug #51074

standalone/osd-rep-recov-eio.sh: TEST_rep_read_unfound failed with "Bad data after primary repair" error.

Added by Sridhar Seshasayee almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Observed on Master:
/a/sseshasa-2021-06-01_08:27:04-rados-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/6145022

2021-06-01T10:42:28.259 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:458: TEST_rep_read_unfound:  wait
2021-06-01T10:42:28.259 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:460: TEST_rep_read_unfound:  cmp td/osd-rep-recov-eio.sh/ORIGINAL td/osd-rep-recov-eio.sh/tmp
2021-06-01T10:42:28.260 INFO:tasks.workunit.client.0.smithi050.stderr:cmp: EOF on td/osd-rep-recov-eio.sh/tmp which is empty
2021-06-01T10:42:28.260 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:462: TEST_rep_read_unfound:  echo 'Bad data after primary repair'
2021-06-01T10:42:28.260 INFO:tasks.workunit.client.0.smithi050.stdout:Bad data after primary repair
2021-06-01T10:42:28.262 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:463: TEST_rep_read_unfound:  return 1
2021-06-01T10:42:28.262 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:42: run:  return 1
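
For context, a minimal sketch of the check that fails above, reconstructed from the trace: the object's original payload is kept in ORIGINAL, the object is read back into tmp, and the two files must compare equal. In this run tmp came back empty, so cmp hit EOF and the test returned 1. The function name, the rados invocation, and the pool/object names below are hypothetical; only the wait/cmp/echo/return sequence is taken from the log.

    # Hedged sketch of the failing check; only the wait / cmp / echo / return
    # sequence comes from the trace, the rest is assumed.
    check_read_after_repair() {
        local dir=td/osd-rep-recov-eio.sh
        rados -p testpool get testobj "$dir"/tmp &    # read the object back (assumed)
        wait                                          # script line 458 in the trace
        if ! cmp "$dir"/ORIGINAL "$dir"/tmp ; then    # line 460: tmp came back empty
            echo 'Bad data after primary repair'      # line 462
            return 1                                  # line 463
        fi
    }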

History

#1 Updated by Kefu Chai almost 3 years ago

/a/kchai-2021-06-05_13:57:48-rados-master-distro-basic-smithi/6154221/

#2 Updated by Kefu Chai almost 3 years ago

  • Priority changed from Normal to High

#3 Updated by Kefu Chai almost 3 years ago

Not able to reproduce this issue locally. Bisecting:

0331281e8a74d0b744cdcede1db24e7fea4656fc https://pulpito.ceph.com/kchai-2021-06-05_16:12:26-rados-master-distro-basic-smithi/6154521/ passed
1df55c23786a11c0bcf508e71b8c242c2d295166 https://pulpito.ceph.com/kchai-2021-06-06_03:12:45-rados-master-distro-basic-smithi/6155182/ passed
1b312db5054665fe0a6f241898a05497d4a25a9b https://pulpito.ceph.com/kchai-2021-06-06_03:18:42-rados-master-distro-basic-smithi/6155187/ passed
5871240363e9912ece5e9bc9a02673ba6ba5ef8d https://pulpito.ceph.com/kchai-2021-06-06_05:24:03-rados-master-distro-basic-smithi/6155769/ passed
11252f61171d7038a7a4aedb35e11ad502bb2a6a https://pulpito.ceph.com/kchai-2021-06-06_06:50:40-rados-11252f61171d7038a7a4aedb35e11ad502bb2a6a-bisect-0-kefu-distro-basic-smithi/6155794/ failed
7025d5081354ce33088aee8449de417c9dceee85 https://pulpito.ceph.com/kchai-2021-06-06_12:07:52-rados-7025d508135-9367f137be7-bisect-1-kefu-distro-basic-smithi/6156450/ passed
e9ac37d4240d7b13021e58dea60691f038130da1 https://pulpito.ceph.com/kchai-2021-06-07_01:49:44-rados-7025d508135-e9ac37d4240-bisect-2-kefu-distro-basic-smithi/6156976/ passed
a53592e48450dc9c81134f0407a984bca49c7fae https://pulpito.ceph.com/kchai-2021-06-07_06:17:55-rados-a53592e4845-bisect-3-kefu-distro-basic-smithi/6157539/ passed
328271d587d099e78dcd020c17e7465043c1bb6b https://pulpito.ceph.com/kchai-2021-06-07_12:41:36-rados-328271d587d-bisect-4-kefu-distro-basic-smithi/6157611/ failed
db6c995ba6ea7d19642955acf8d117d3267e9632 https://pulpito.ceph.com/kchai-2021-06-08_01:44:46-rados-db6c995ba6e-bisect-5-kefu-distro-basic-smithi/6158445/ failed
f69e6f6702da55d58deb8379944d0a8b30b3384b https://pulpito.ceph.com/kchai-2021-06-08_06:33:58-rados-0edaff5d3c0-without-pr-41308-kefu-distro-basic-smithi/6159596/ passed
6ab8dbe666350a7377b35b715b28fe83cc8492c2 https://pulpito.ceph.com/kchai-2021-06-08_03:31:41-rados-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/6158710/ failed

Here, 6ab8dbe666350a7377b35b715b28fe83cc8492c2 is f69e6f6702da55d58deb8379944d0a8b30b3384b plus https://github.com/ceph/ceph/pull/41308.
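
For reference, the same search could be expressed as an ordinary git bisect session; a minimal sketch, assuming a local build tree where the standalone suite is driven via qa/run-standalone.sh (the actual investigation used the scheduled teuthology runs linked above):

    # Hedged sketch: the manual pass/fail runs above as a local git bisect.
    # Commit hashes are taken from the list (first failed run as "bad", first
    # passed run as "good"); the build and test commands are assumptions about
    # a local checkout, not what was run on smithi.
    git bisect start 11252f61171d7038a7a4aedb35e11ad502bb2a6a 0331281e8a74d0b744cdcede1db24e7fea4656fc
    git bisect run sh -c 'cd build && ninja -j"$(nproc)" && ../qa/run-standalone.sh osd-rep-recov-eio.sh'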

#4 Updated by Kefu Chai almost 3 years ago

  • Status changed from New to Triaged
  • Regression changed from No to Yes

#5 Updated by Kefu Chai almost 3 years ago

  • Pull request ID set to 41754

#6 Updated by Sridhar Seshasayee almost 3 years ago

Raised PR https://github.com/ceph/ceph/pull/41782 to address the test failure.

Please see the latest update to https://tracker.ceph.com/issues/51076 for the other failure.
The test with the 'wpq' scheduler resulted in the same failure, so it doesn't look like
a regression caused by https://github.com/ceph/ceph/pull/41308.

Therefore, considering the above findings, I think a revert of the changes in
https://github.com/ceph/ceph/pull/41308 may not be required. I do agree that the backport
of these changes can be put on hold until all the standalone tests are fixed with the
mclock scheduler and there is a way to prevent running 'osd bench' on every OSD restart.
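
As a point of reference for that kind of cross-check, a minimal sketch of forcing the 'wpq' scheduler when reproducing the failure (osd_op_queue is the standard scheduler option; how the OSDs are restarted is left out, since it depends on the deployment or test harness):

    # Hedged sketch: switch the op queue scheduler to cross-check whether a
    # failure is mclock-specific, as described for issue 51076 above.
    ceph config set osd osd_op_queue wpq    # or: mclock_scheduler
    # osd_op_queue only takes effect on OSD start, so restart the OSDs afterwards
    # (the restart mechanism depends on the deployment / test harness).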

#7 Updated by Kefu Chai almost 3 years ago

Sridhar, please read https://tracker.ceph.com/issues/51074#note-3. Those are my findings from the last 3 days.

#8 Updated by Sridhar Seshasayee almost 3 years ago

Kefu, yes, I did read the update and your effort to find the commit(s) that caused the regression in the standalone test. I understand that finding regressions is never easy, and thanks for bringing it up. If the rule is to completely remove the offending commits and resubmit them afresh after fixing the regression, I am fine with following that process.

From my point of view the cause of the regression is understood, and the PR I raised a few moments ago
(https://github.com/ceph/ceph/pull/41782) is an effort to fix it. I have also highlighted through testing
that the issue in the other tracker, https://tracker.ceph.com/issues/51076, is not a regression. If these
findings are not sufficient to prevent the revert of my original commits, I am fine with that, but I would
like to know why that is the case.

Please let me know your thoughts.

#9 Updated by Neha Ojha almost 3 years ago

  • Status changed from Triaged to Pending Backport
  • Pull request ID changed from 41754 to 41782

marking Pending Backport, needs to be included with https://github.com/ceph/ceph/pull/41731

#10 Updated by Loïc Dachary over 2 years ago

  • Backport set to pacific

I assume there needs to be at least a backport to pacific and populated the Backport field accordingly. Feel free to revert if that was a mistake.

#11 Updated by Backport Bot over 2 years ago

  • Copied to Backport #51859: pacific: standalone/osd-rep-recov-eio.sh: TEST_rep_read_unfound failed with "Bad data after primary repair" error. added

#12 Updated by Sridhar Seshasayee over 2 years ago

  • Related to Backport #51117: pacific: osd: Run osd bench test to override default max osd capacity for mclock. added

#13 Updated by Sridhar Seshasayee over 2 years ago

  • Status changed from Pending Backport to Resolved

This doesn't need to be backported to pacific. The reason is that the mclock_scheduler will not be made the default in pacific. Moving this to Resolved and additionally closing the associated backport tracker.

#14 Updated by Sridhar Seshasayee over 2 years ago

  • Related to deleted (Backport #51117: pacific: osd: Run osd bench test to override default max osd capacity for mclock.)

#15 Updated by Sridhar Seshasayee over 2 years ago

  • Copied to deleted (Backport #51859: pacific: standalone/osd-rep-recov-eio.sh: TEST_rep_read_unfound failed with "Bad data after primary repair" error.)

#16 Updated by Sridhar Seshasayee over 2 years ago

  • Backport deleted (pacific)
