Project

General

Profile

Bug #43807

osd-backfill-recovery-log.sh fails

Added by Sage Weil 8 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

2020-01-24T10:16:33.646 INFO:tasks.workunit.client.0.smithi134.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:82: _common_test:  jq '.pg_log_t.log | length' td/osd-backfill-recovery-log/result.log
2020-01-24T10:16:33.646 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:82: _common_test:  LOGLEN=2
2020-01-24T10:16:33.647 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:83: _common_test:  '[' 2 '!=' 1 ']'
2020-01-24T10:16:33.647 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:84: _common_test:  echo 'FAILED: Wrong log length got 2 (expected 1)'
2020-01-24T10:16:33.647 INFO:tasks.workunit.client.0.smithi134.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:85: _common_test:  expr 0 + 1
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:85: _common_test:  ERRORS=1
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:87: _common_test:  jq '.pg_log_t.dups | length' td/osd-backfill-recovery-log/result.log
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:87: _common_test:  DUPSLEN=8
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:88: _common_test:  '[' 8 '!=' 9 ']'
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:89: _common_test:  echo 'FAILED: Wrong dups length got 8 (expected 9)'
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:90: _common_test:  expr 1 + 1
2020-01-24T10:16:33.648 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:90: _common_test:  ERRORS=2
2020-01-24T10:16:33.649 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:92: _common_test:  grep 'copy_up_to\|copy_after' td/osd-backfill-recovery-log/osd.0.log td/osd-backfill-recovery-log/osd.1.log td/osd-backfill-recovery-log/osd.2.log td/osd-backfill-recovery-log/osd.3.log td/osd-backfill-recovery-log/osd.4.log td/osd-backfill-recovery-log/osd.5.log
2020-01-24T10:16:33.649 INFO:tasks.workunit.client.0.smithi134.stdout:FAILED: Wrong log length got 2 (expected 1)
2020-01-24T10:16:33.649 INFO:tasks.workunit.client.0.smithi134.stdout:FAILED: Wrong dups length got 8 (expected 9)
...
2020-01-24T10:16:33.661 INFO:tasks.workunit.client.0.smithi134.stdout:TEST FAILED
2020-01-24T10:16:33.661 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:93: _common_test:  rm -f td/osd-backfill-recovery-log/result.log
2020-01-24T10:16:33.661 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:94: _common_test:  '[' 2 '!=' 0 ']'
2020-01-24T10:16:33.661 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:95: _common_test:  echo TEST FAILED
2020-01-24T10:16:33.661 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:96: _common_test:  return 1
2020-01-24T10:16:33.661 INFO:tasks.workunit.client.0.smithi134.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:33: run:  return 1

/a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4698015

Related issues

Copied to RADOS - Backport #44847: octopus: osd-backfill-recovery-log.sh fails Resolved

History

#1 Updated by Josh Durgin 8 months ago

The test needs to be updated due to https://github.com/ceph/ceph/pull/32683 - anything else that sets the log lengths (like the rados suite facet) needs updating too. If the tests set osd_target_pg_log_entries_per_osd they can get a similar effect.

#2 Updated by Sage Weil 8 months ago

/a/sage-2020-01-24_13:15:58-rados-wip-sage2-testing-2020-01-23-1953-distro-basic-smithi/4701051

#3 Updated by Neha Ojha 8 months ago

  • Project changed from Ceph to RADOS

#4 Updated by Neha Ojha 8 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 32851

#5 Updated by Neha Ojha 8 months ago

  • Status changed from Fix Under Review to In Progress

#6 Updated by Neha Ojha 8 months ago

  • Status changed from In Progress to Fix Under Review
  • Assignee set to Neha Ojha

#7 Updated by Sage Weil 8 months ago

//a/sage-2020-01-24_23:29:53-rados-wip-sage2-testing-2020-01-24-1408-distro-basic-smithi/4703160

#8 Updated by Sage Weil 8 months ago

  • Target version set to v15.0.0

#9 Updated by Sage Weil 8 months ago

  • Status changed from Fix Under Review to Resolved

#10 Updated by Sage Weil 6 months ago

/a/sage-2020-03-17_13:59:54-rados-wip-sage-testing-2020-03-17-0740-distro-basic-smithi/4863239

#11 Updated by Neha Ojha 6 months ago

  • Status changed from Resolved to New

Note that this is a resurrection of the same failure with different symptoms

/a/sage-2020-03-17_13:59:54-rados-wip-sage-testing-2020-03-17-0740-distro-basic-smithi/4863239

TEST_backfill_log_1

2020-03-17T16:17:27.682 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:82: _common_test:  LOGLEN=50
2020-03-17T16:17:27.682 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:83: _common_test:  '[' 50 '!=' 2 ']'
2020-03-17T16:17:27.683 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:84: _common_test:  echo 'FAILED: Wrong log length got 50 (expected 2)'
2020-03-17T16:17:27.683 INFO:tasks.workunit.client.0.smithi014.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:85: _common_test:  expr 0 + 1
2020-03-17T16:17:27.683 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:85: _common_test:  ERRORS=1
2020-03-17T16:17:27.683 INFO:tasks.workunit.client.0.smithi014.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:87: _common_test:  jq '.pg_log_t.dups | length' td/osd-backfill-recovery-log/result.log
2020-03-17T16:17:27.683 INFO:tasks.workunit.client.0.smithi014.stdout:FAILED: Wrong log length got 50 (expected 2)
2020-03-17T16:17:27.684 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:87: _common_test:  DUPSLEN=7
2020-03-17T16:17:27.684 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:88: _common_test:  '[' 7 '!=' 8 ']'
2020-03-17T16:17:27.685 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:89: _common_test:  echo 'FAILED: Wrong dups length got 7 (expected 8)'                                                                                                                                                                                                      2020-03-17T16:17:27.685 INFO:tasks.workunit.client.0.smithi014.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:90: _common_test:  expr 1 + 1
2020-03-17T16:17:27.685 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:90: _common_test:  ERRORS=2
2020-03-17T16:17:27.685 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:92: _common_test:  grep 'copy_up_to\|copy_after' td/osd-backfill-recovery-log/osd.0.log td/osd-backfill-recovery-log/osd.1.log td/osd-backfill-recovery-log/osd.2.log td/osd-backfill-recovery-log/osd.3.log td/osd-backfill-recovery-log/osd.4.log td/osd-backfill-recovery-log/osd.5.log
2020-03-17T16:17:27.685 INFO:tasks.workunit.client.0.smithi014.stdout:FAILED: Wrong dups length got 7 (expected 8)

/a/nojha-2020-03-16_17:35:35-rados:standalone-master-distro-basic-smithi/4860657/

TEST_backfill_log_2

2020-03-16T18:53:54.721 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:82: _common_test:  LOGLEN=50
2020-03-16T18:53:54.721 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:83: _common_test:  '[' 50 '!=' 2 ']'
2020-03-16T18:53:54.721 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:84: _common_test:  echo 'FAILED: Wrong log length got 50 (expected 2)'    
2020-03-16T18:53:54.721 INFO:tasks.workunit.client.0.smithi001.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:85: _common_test:  expr 0 + 1
2020-03-16T18:53:54.721 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:85: _common_test:  ERRORS=1
2020-03-16T18:53:54.721 INFO:tasks.workunit.client.0.smithi001.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:87: _common_test:  jq '.pg_log_t.dups | length' td/osd-backfill-recovery-log/result.log
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:87: _common_test:  DUPSLEN=100
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:88: _common_test:  '[' 100 '!=' 148 ']'
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:89: _common_test:  echo 'FAILED: Wrong dups length got 100 (expected 148)'
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:90: _common_test:  expr 1 + 1
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:90: _common_test:  ERRORS=2
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:92: _common_test:  grep 'copy_up_to\|copy_after' td/osd-backfill-recovery-log/osd.0.log td/osd-backfill-recovery-log/osd.1.log td/osd-backfill-recovery-log/osd.2.log td/osd-backfill-recovery-log/osd.3.log td/osd-backfill-recovery-log/osd.4.log td/osd-backfill-recovery-log/osd.5.log
2020-03-16T18:53:54.722 INFO:tasks.workunit.client.0.smithi001.stdout:FAILED: Wrong log length got 50 (expected 2)
2020-03-16T18:53:54.723 INFO:tasks.workunit.client.0.smithi001.stdout:FAILED: Wrong dups length got 100 (expected 148)

#12 Updated by Neha Ojha 6 months ago

/a/sage-2020-03-17_13:59:54-rados-wip-sage-testing-2020-03-17-0740-distro-basic-smithi/4863239

Comparing a failed TEST_backfill_log_1 test with one that passed, I see that the "newprimary" is different in both cases. Note that the test marks osd.0,1,2 out and waits for clean.

PASSED test

"up":[4,3,5],"acting":[4,3,5]

../qa/standalone/osd/osd-backfill-recovery-log.sh:78: _common_test:  newprimary=4

FAILED test

2020-03-17T16:17:26.220 INFO:tasks.workunit.client.0.smithi014.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-recovery-log.sh:77: _common_test:  newprimary=1

Not sure why osd.1 still remained the primary here.

#13 Updated by Neha Ojha 6 months ago

  • Status changed from New to In Progress

#14 Updated by Neha Ojha 6 months ago

Neha Ojha wrote:

/a/sage-2020-03-17_13:59:54-rados-wip-sage-testing-2020-03-17-0740-distro-basic-smithi/4863239

Comparing a failed TEST_backfill_log_1 test with one that passed, I see that the "newprimary" is different in both cases. Note that the test marks osd.0,1,2 out and waits for clean.

PASSED test

"up":[4,3,5],"acting":[4,3,5]

[...]

FAILED test

[...]

Not sure why osd.1 still remained the primary here.

Before wait_for_clean

2020-03-22T02:40:05.571 INFO:tasks.workunit.client.0.smithi001.stdout:{"pg_ready":true,"pg_stats":[{"pgid":"1.0","version":"34'143","reported_seq":"153","reported_epoch":"34","state":"active+clean","last_fresh":"2020-03-22T02:40:01.191973+0000","last_change":"2020-03-22T02:39:49.653126+0000","last_active":"2020-03-22T02:40:01.191973+0000","last_peered":"2020-03-22T02:40:01.191973+0000","last_clean":"2020-03-22T02:40:01.191973+0000","last_became_active":"2020-03-22T02:39:49.652216+0000","last_became_peered":"2020-03-22T02:39:49.652216+0000","last_unstale":"2020-03-22T02:40:01.191973+0000","last_undegraded":"2020-03-22T02:40:01.191973+0000","last_fullsized":"2020-03-22T02:40:01.191973+0000","mapping_epoch":32,"log_start":"34'100","ondisk_log_start":"34'100","created":32,"last_epoch_clean":33,"parent":"0.0","parent_split_bits":0,"last_scrub":"0'0","last_scrub_stamp":"2020-03-22T02:39:49.018635+0000","last_deep_scrub":"0'0","last_deep_scrub_stamp":"2020-03-22T02:39:49.018635+0000","last_clean_scrub_stamp":"2020-03-22T02:39:49.018635+0000","log_size":43,"ondisk_log_size":43,"stats_invalid":false,"dirty_stats_invalid":false,"omap_stats_invalid":false,"hitset_stats_invalid":false,"hitset_bytes_stats_invalid":false,"pin_stats_invalid":false,"manifest_stats_invalid":false,"snaptrimq_len":0,"stat_sum":{"num_bytes":974116,"num_objects":143,"num_object_clones":0,"num_object_copies":429,"num_objects_missing_on_primary":0,"num_objects_missing":0,"num_objects_degraded":0,"num_objects_misplaced":0,"num_objects_unfound":0,"num_objects_dirty":143,"num_whiteouts":0,"num_read":0,"num_read_kb":0,"num_write":143,"num_write_kb":1001,"num_scrub_errors":0,"num_shallow_scrub_errors":0,"num_deep_scrub_errors":0,"num_objects_recovered":0,"num_bytes_recovered":0,"num_keys_recovered":0,"num_objects_omap":0,"num_objects_hit_set_archive":0,"num_bytes_hit_set_archive":0,"num_flush":0,"num_flush_kb":0,"num_evict":0,"num_evict_kb":0,"num_promote":0,"num_flush_mode_high":0,"num_flush_mode_low":0,"num_evict_mode_some":0,"num_evict_mode_full":0,"num_objects_pinned":0,"num_legacy_snapsets":0,"num_large_omap_objects":0,"num_objects_manifest":0,"num_omap_bytes":0,"num_omap_keys":0,"num_objects_repaired":0},"up":[1,0,2],"acting":[1,0,2],"avail_no_missing":[],"object_location_counts":[],"blocked_by":[],"up_primary":1,"acting_primary":1,"purged_snaps":[]}]}

After wait_for_clean

2020-03-22T02:40:10.670 INFO:tasks.workunit.client.0.smithi001.stdout:{"pg_ready":true,"pg_stats":[{"pgid":"1.0","version":"34'150","reported_seq":"160","reported_epoch":"34","state":"active+clean","last_fresh":"2020-03-22T02:40:01.734859+0000","last_change":"2020-03-22T02:39:49.653126+0000","last_active":"2020-03-22T02:40:01.734859+0000","last_peered":"2020-03-22T02:40:01.734859+0000","last_clean":"2020-03-22T02:40:01.734859+0000","last_became_active":"2020-03-22T02:39:49.652216+0000","last_became_peered":"2020-03-22T02:39:49.652216+0000","last_unstale":"2020-03-22T02:40:01.734859+0000","last_undegraded":"2020-03-22T02:40:01.734859+0000","last_fullsized":"2020-03-22T02:40:01.734859+0000","mapping_epoch":32,"log_start":"34'100","ondisk_log_start":"34'100","created":32,"last_epoch_clean":33,"parent":"0.0","parent_split_bits":0,"last_scrub":"0'0","last_scrub_stamp":"2020-03-22T02:39:49.018635+0000","last_deep_scrub":"0'0","last_deep_scrub_stamp":"2020-03-22T02:39:49.018635+0000","last_clean_scrub_stamp":"2020-03-22T02:39:49.018635+0000","log_size":50,"ondisk_log_size":50,"stats_invalid":false,"dirty_stats_invalid":false,"omap_stats_invalid":false,"hitset_stats_invalid":false,"hitset_bytes_stats_invalid":false,"pin_stats_invalid":false,"manifest_stats_invalid":false,"snaptrimq_len":0,"stat_sum":{"num_bytes":1021800,"num_objects":150,"num_object_clones":0,"num_object_copies":450,"num_objects_missing_on_primary":0,"num_objects_missing":0,"num_objects_degraded":0,"num_objects_misplaced":0,"num_objects_unfound":0,"num_objects_dirty":150,"num_whiteouts":0,"num_read":0,"num_read_kb":0,"num_write":150,"num_write_kb":1050,"num_scrub_errors":0,"num_shallow_scrub_errors":0,"num_deep_scrub_errors":0,"num_objects_recovered":0,"num_bytes_recovered":0,"num_keys_recovered":0,"num_objects_omap":0,"num_objects_hit_set_archive":0,"num_bytes_hit_set_archive":0,"num_flush":0,"num_flush_kb":0,"num_evict":0,"num_evict_kb":0,"num_promote":0,"num_flush_mode_high":0,"num_flush_mode_low":0,"num_evict_mode_some":0,"num_evict_mode_full":0,"num_objects_pinned":0,"num_legacy_snapsets":0,"num_large_omap_objects":0,"num_objects_manifest":0,"num_omap_bytes":0,"num_omap_keys":0,"num_objects_repaired":0},"up":[1,0,2],"acting":[1,0,2],"avail_no_missing":[],"object_location_counts":[],"blocked_by":[],"up_primary":1,"acting_primary":1,"purged_snaps":[]}]}

Note that the "reported_epoch" remained the same. This is because mgr_stats_period is 5 seconds, and we have not fetched the latest stats yet.

#15 Updated by Neha Ojha 6 months ago

  • Status changed from In Progress to Fix Under Review

#16 Updated by Sage Weil 6 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to octopus

#18 Updated by Nathan Cutler 6 months ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (octopus)
$ git branch --contains b208177
  master
* octopus

#19 Updated by Neha Ojha 6 months ago

  • Status changed from Resolved to Pending Backport

#20 Updated by Nathan Cutler 6 months ago

  • Backport set to octopus

#21 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #44847: octopus: osd-backfill-recovery-log.sh fails added

#22 Updated by Nathan Cutler 6 months ago

Neha Ojha wrote:

Nathan, we need https://github.com/ceph/ceph/pull/34126 as well - See https://tracker.ceph.com/issues/43807#note-15

Oops! Sorry! Got it now, and octopus PR is open for review.

#23 Updated by Nathan Cutler 6 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF