Bug #51074
standalone/osd-rep-recov-eio.sh: TEST_rep_read_unfound failed with "Bad data after primary repair" error.
Description
Observed on Master:
/a/sseshasa-2021-06-01_08:27:04-rados-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/6145022
2021-06-01T10:42:28.259 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:458: TEST_rep_read_unfound: wait
2021-06-01T10:42:28.259 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:460: TEST_rep_read_unfound: cmp td/osd-rep-recov-eio.sh/ORIGINAL td/osd-rep-recov-eio.sh/tmp
2021-06-01T10:42:28.260 INFO:tasks.workunit.client.0.smithi050.stderr:cmp: EOF on td/osd-rep-recov-eio.sh/tmp which is empty
2021-06-01T10:42:28.260 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:462: TEST_rep_read_unfound: echo 'Bad data after primary repair'
2021-06-01T10:42:28.260 INFO:tasks.workunit.client.0.smithi050.stdout:Bad data after primary repair
2021-06-01T10:42:28.262 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:463: TEST_rep_read_unfound: return 1
2021-06-01T10:42:28.262 INFO:tasks.workunit.client.0.smithi050.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-rep-recov-eio.sh:42: run: return 1
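For readers unfamiliar with the check that fails here, the log shows a cmp of the ORIGINAL file against the read-back tmp file, followed by the echo and return 1. A minimal sketch of that pattern is below; it is not the actual test code. The helper name check_read_back, the temporary directory, and the sample payload are assumptions made only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the Ceph tree): mirrors the
# cmp / echo / return 1 sequence visible in the log above.
check_read_back() {
    local original=$1 readback=$2
    if ! cmp "$original" "$readback"; then
        echo 'Bad data after primary repair'
        return 1
    fi
}

# Assumed scratch directory and payload, for illustration only.
workdir=$(mktemp -d)
printf 'payload\n' > "$workdir/ORIGINAL"
: > "$workdir/tmp"    # empty read-back file, as in the failing run

# Empty read-back: cmp reports EOF on the empty file and the check fails.
empty_status=0
check_read_back "$workdir/ORIGINAL" "$workdir/tmp" 2>/dev/null || empty_status=$?

# Matching read-back: the check passes.
cp "$workdir/ORIGINAL" "$workdir/tmp"
match_status=0
check_read_back "$workdir/ORIGINAL" "$workdir/tmp" || match_status=$?

echo "empty=$empty_status match=$match_status"
rm -rf "$workdir"
```

In the failing run the tmp file was empty, so the comparison could not succeed regardless of the object's contents; the sketch only reproduces that symptom, not its cause.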
History
#1 Updated by Kefu Chai almost 3 years ago
/a/kchai-2021-06-05_13:57:48-rados-master-distro-basic-smithi/6154221/
#2 Updated by Kefu Chai almost 3 years ago
- Priority changed from Normal to High
#3 Updated by Kefu Chai almost 3 years ago
not able to reproduce this issue locally. bisecting:
in which, 6ab8dbe666350a7377b35b715b28fe83cc8492c2 is f69e6f6702da55d58deb8379944d0a8b30b3384b + https://github.com/ceph/ceph/pull/41308
#4 Updated by Kefu Chai almost 3 years ago
- Status changed from New to Triaged
- Regression changed from No to Yes
#5 Updated by Kefu Chai almost 3 years ago
- Pull request ID set to 41754
#6 Updated by Sridhar Seshasayee almost 3 years ago
Raised PR https://github.com/ceph/ceph/pull/41782 to address the test failure.
Please see latest update to https://tracker.ceph.com/issues/51076 for the other failure.
The test with 'wpq' scheduler resulted in the same failure and so it doesn't look like
a regression caused by https://github.com/ceph/ceph/pull/41308.
Therefore, I think a revert of the changes in https://github.com/ceph/ceph/pull/41308
may not be required considering the above findings. I do agree that backport of these
changes can be put on hold until all the standalone tests are fixed with the mclock
scheduler and there is a way to prevent running 'osd bench' on every osd restart.
#7 Updated by Kefu Chai almost 3 years ago
Sridhar, please read https://tracker.ceph.com/issues/51074#note-3. That's my finding from the last 3 days.
#8 Updated by Sridhar Seshasayee almost 3 years ago
Kefu, yes, I did read the update and your effort to find the commit(s) that caused the regression in the standalone test. I understand that finding regressions is never easy, and thanks for bringing it up. If the rule is to completely remove the offending commits and resubmit them afresh after fixing the regression, I am fine to follow that process.
From my point of view the cause of the regression is understood and the PR I raised a few moments ago
(https://github.com/ceph/ceph/pull/41782) is an effort to fix the regression. I have also highlighted through testing
that the issue in the other tracker https://tracker.ceph.com/issues/51076 is not a regression. If these findings are
not sufficient to prevent the revert of my original commits, I am fine with that. But I would like to know why that
is the case?
Please let me know your thoughts.
#9 Updated by Neha Ojha almost 3 years ago
- Status changed from Triaged to Pending Backport
- Pull request ID changed from 41754 to 41782
marking Pending Backport, needs to be included with https://github.com/ceph/ceph/pull/41731
#10 Updated by Loïc Dachary over 2 years ago
- Backport set to pacific
I assume there needs to be at least a backport to pacific and populated the Backport field accordingly. Feel free to revert if that was a mistake.
#11 Updated by Backport Bot over 2 years ago
- Copied to Backport #51859: pacific: standalone/osd-rep-recov-eio.sh: TEST_rep_read_unfound failed with "Bad data after primary repair" error. added
#12 Updated by Sridhar Seshasayee over 2 years ago
- Related to Backport #51117: pacific: osd: Run osd bench test to override default max osd capacity for mclock. added
#13 Updated by Sridhar Seshasayee over 2 years ago
- Status changed from Pending Backport to Resolved
This doesn't need to be backported to pacific. The reason is that the mclock_scheduler will not be made default for pacific. Moving this to Resolved and additionally closing the associated backport tracker.
#14 Updated by Sridhar Seshasayee over 2 years ago
- Related to deleted (Backport #51117: pacific: osd: Run osd bench test to override default max osd capacity for mclock.)
#15 Updated by Sridhar Seshasayee over 2 years ago
- Copied to deleted (Backport #51859: pacific: standalone/osd-rep-recov-eio.sh: TEST_rep_read_unfound failed with "Bad data after primary repair" error.)
#16 Updated by Sridhar Seshasayee over 2 years ago
- Backport deleted (pacific)