Project

General

Profile

Actions

Bug #44518

closed

osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout

Added by Sage Weil about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-03-09T03:36:55.899 INFO:tasks.workunit.client.0.smithi037.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress:  jq -r '.pgmap | .recovering_keys_per_sec + .recovering_bytes_per_sec + .recovering_objects_per_sec'
2020-03-09T03:36:56.374 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress:  local progress=null
2020-03-09T03:36:56.375 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1277: get_is_making_recovery_progress:  test null '!=' null
2020-03-09T03:36:56.375 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1601: wait_for_clean:  ((  13 >= 13  ))
2020-03-09T03:36:56.375 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1602: wait_for_clean:  ceph report
2020-03-09T03:36:56.843 INFO:tasks.workunit.client.0.smithi037.stdout:{
...
2020-03-09T03:36:56.846 INFO:tasks.workunit.client.0.smithi037.stdout:                        "message": "pg 1.0 is stuck undersized for 118s, current state active+undersized+degraded+remapped+backfilling, last acting [1,0]" 
...
2020-03-09T03:36:57.073 INFO:tasks.workunit.client.0.smithi037.stdout:}
2020-03-09T03:36:57.077 INFO:tasks.workunit.client.0.smithi037.stderr:report 54303303
2020-03-09T03:36:57.077 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1603: wait_for_clean:  return 1
2020-03-09T03:36:57.077 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:359: TEST_backfill_out2:  return 1
2020-03-09T03:36:57.078 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:37: run:  return 1

/a/sage-2020-03-09_01:44:37-rados:standalone-wip-sage2-testing-2020-03-08-1456-distro-basic-smithi/4838491
/a/sage-2020-03-09_01:44:37-rados:standalone-wip-sage2-testing-2020-03-08-1456-distro-basic-smithi/4838500

Related issues 3 (0 open3 closed)

Related to RADOS - Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin)ResolvedDavid Zafman

Actions
Copied to RADOS - Backport #44685: octopus: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeoutResolvedNathan CutlerActions
Copied to RADOS - Backport #44686: nautilus: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeoutResolvedNathan CutlerActions
Actions #1

Updated by Sage Weil about 4 years ago

  • Project changed from Ceph to RADOS
Actions #3

Updated by David Zafman about 4 years ago

  • Related to Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) added
Actions #4

Updated by Sage Weil about 4 years ago

slightly different versoin of this:

2020-03-10T20:11:07.201 INFO:tasks.workunit.client.0.smithi018.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress:  jq -r '.pgmap | .recovering_keys_per_sec + .recovering_bytes_per_sec + .recovering_objects_per_sec'
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress:  local progress=null
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1277: get_is_making_recovery_progress:  test null '!=' null
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1601: wait_for_clean:  ((  13 >= 13  ))
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1602: wait_for_clean:  ceph report
...
2020-03-10T20:11:08.412 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1603: wait_for_clean:  return 1
2020-03-10T20:11:08.412 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:304: TEST_backfill_down_out:  return 1
2020-03-10T20:11:08.412 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:37: run:  return 1

/a/sage-2020-03-10_16:51:17-rados-wip-sage3-testing-2020-03-10-1037-distro-basic-smithi/4844181

Actions #5

Updated by David Zafman about 4 years ago

  • Related to deleted (Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin))
Actions #6

Updated by David Zafman about 4 years ago

  • Related to Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) added
Actions #7

Updated by Neha Ojha about 4 years ago

Comparing a passed test with a failed one:

PASSED - note the PG mapping [2,4,3]/[1,0] backfill=[2,3,4]

2020-03-13T19:34:30.763+0000 7f6a43ffa700 20 osd.1 43 _dispatch 0x5600f561c680 command(tid 2: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-03-13T19:34:30.903+0000 7f6a35fbe700 10 osd.1 44 do_recovery starting 1 pg[1.0( v 36'200 (36'100,36'200] local-lis/les=42/43 n=200 ec=32/32 lis/c=42/35 les/c/f=43/36/0 sis=42) [2,4,3]/[1,0] backfill=[2,3,4] r=0 lpr=42 pi=[35,42)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]

FAILED - note the PG mapping [1,2,4]/[1,0] backfill=[2,4]

2020-03-09T03:36:52.373+0000 7fe5ec156700 20 osd.1 44 OSD::ms_dispatch: command(tid 3: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-03-09T03:36:56.819+0000 7fe5ddf62700 10 osd.1 pg_epoch: 44 pg[1.0( v 37'200 (37'100,37'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/35 les/c/f=44/37/0 sis=43) [1,2,4]/[1,0] backfill=[2,4] r=0 lpr=43 pi=[35,43)/2 crt=37'200 lcod 37'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] do_peering_event: epoch_sent: 43 epoch_requested: 43 RenewLease

This is what the test expects

# [1, 0] -> [2, 3, 4]
# degraded 500 -> 0
# misplaced 1000 -> 0
# state: active+undersized+degraded+remapped+backfilling

# PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE                                           STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
# 1.0         500                  0      500       1000       0     0 100      100 active+undersized+degraded+remapped+backfilling 2017-10-27 09:55:50.375722  23'500   27:553 [2,4,3]          2  [1,0]              1        0'0 2017-10-27 09:55:10.230919             0'0 2017-10-27 09:55:10.230919
function TEST_backfill_out2() {

So I think osd.1 being in the up set for the failed test is the problem.

Actions #8

Updated by Neha Ojha about 4 years ago

/a/sage-2020-03-09_01:44:37-rados:standalone-wip-sage2-testing-2020-03-08-1456-distro-basic-smithi/4838500

queue_recovery() gets called in on_active_actmap() when

  if (recovery_state.is_peered() &&
      !recovery_state.is_clean() &&
      !recovery_state.get_osdmap()->test_flag(CEPH_OSDMAP_NOBACKFILL) &&
      (!recovery_state.get_osdmap()->test_flag(CEPH_OSDMAP_NOREBALANCE) ||
       recovery_state.is_degraded())) {
    queue_recovery();
  }

On osd.1, the last osd_map epoch change seen was at epoch 44

2020-03-09T03:36:48.746+0000 7fe5f9009700  1 -- [v2:127.0.0.1:6812/484647,v1:127.0.0.1:6813/484647] --> [v2:127.0.0.1:6820/485002,v1:127.0.0.1:6821/485002] -- osd_map(44..44 src has 1..44) v4 -- 0x55e0201aa8c0 con 0x55e01ff8a000

From the mon logs, osd.1 was marked out at epoch 45

2020-03-09T03:36:48.944+0000 7fc997417700  5 --2- v2:127.0.0.1:7114/0 >> 127.0.0.1:0/3123918133 conn(0x55e4a6b94000 0x55e4a7450480 crc :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x55e4a6d7ba00 seq=7 mon_command_ack([{"prefix": "osd out", "ids": ["osd.1"]}]=0 marked out osd.1.  v45) v1

Since this map change is not propagated to the osd, which is why queue_recovery() never gets called.

Actions #9

Updated by Neha Ojha about 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Neha Ojha
Actions #10

Updated by Neha Ojha about 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 34047
Actions #11

Updated by Sage Weil about 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to octopus,nautilus
Actions #12

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #44685: octopus: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added
Actions #13

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #44686: nautilus: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added
Actions #14

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF