Bug #12181


test: indep mapping fails because an osd is down

Added by Loïc Dachary almost 9 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running test-erasure-code.sh, the mapping of pg 2.7 fails:

$ gzip -d < /tmp/bad-report.txt.gz | jq '.pgmap.pg_stats[] | select(.state != "active+clean") | [.pgid, .acting]'
[
  "2.7",
  [
    2147483647,
    0,
    4
  ]
]
because osd.6 is missing from the acting set (the 2147483647 entry is CRUSH_ITEM_NONE, i.e. no OSD could be mapped to that slot), as shown by a report from a good run of the same test:
$ gzip -d < /tmp/good-report.txt.gz | jq '.pgmap.pg_stats[] | select(.pgid == "2.7") | .acting'
[
  6,
  0,
  4
]

and the osd map shows osd.6 as down and out:
$ gzip -d < /tmp/bad-report.txt.gz | jq '.osdmap.osds[] | select(.osd == 6)'
{
  "osd": 6,
  "uuid": "913da64e-3527-4d06-9441-62e8d1145356",
  "up": 0,
  "in": 0,
  "weight": 0,
  "primary_affinity": 1,
  "last_clean_begin": 0,
  "last_clean_end": 0,
  "up_from": 26,
  "up_thru": 0,
  "down_at": 28,
  "lost_at": 0,
  "public_addr": "127.0.0.1:6889/29036",
  "cluster_addr": "127.0.0.1:6890/29036",
  "heartbeat_back_addr": "127.0.0.1:6891/29036",
  "heartbeat_front_addr": "127.0.0.1:6892/29036",
  "state": [
    "autoout",
    "exists" 
  ]
}
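
The same report can also be scanned for any other OSD that is not both up and in; a minimal sketch using the same jq approach, with field names taken from the excerpt above:

$ gzip -d < /tmp/bad-report.txt.gz | jq '.osdmap.osds[] | select(.up == 0 or .in == 0) | .osd'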

Nothing in bad.log.gz explains why osd.6 failed. It could just be the host running the test that failed, although dmesg did not show any sign of memory starvation or disk trouble.


Files

bad.log.gz (59.1 KB) - bad.log - Loïc Dachary, 06/27/2015 04:29 PM
bad-report.txt.gz (7.29 KB) - ceph report for the bad run - Loïc Dachary, 06/27/2015 04:29 PM
good-report.txt.gz (7.28 KB) - ceph report for the good run - Loïc Dachary, 06/27/2015 04:29 PM
Actions #1

Updated by Loïc Dachary almost 9 years ago

Close as cannot reproduce two months from now.

Actions #2

Updated by Loïc Dachary almost 9 years ago

The fact that osd.6 is down explains the bad mapping: it will eventually be replaced by another osd, but the test does not wait long enough for that to happen. This behavior can be reproduced by modifying the test as follows:

...
create_erasure_coded_pool ecpool || return 1
kill_daemons $dir TERM osd.6
wait_for_clean || return 1

The modified test then fails with the same mapping error:
$ ceph pg map 2.7
osdmap e60 pg 2.7 (2.7) -> up [2147483647,0,4] acting [2147483647,0,4]

If one waits 5 minutes after killing osd.6, the mapping succeeds:
$ ceph pg map 2.7
osdmap e60 pg 2.7 (2.7) -> up [9,0,4] acting [9,0,4]
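
Rather than sleeping a fixed 5 minutes, the test could poll until the empty slot is filled. A minimal sketch, assuming only the ceph CLI used above; wait_for_pg_mapping is a hypothetical helper, not part of ceph-helpers.sh, and the 300 second timeout is an arbitrary choice matching the delay observed here:

wait_for_pg_mapping() {
    local pgid=$1
    local timeout=300  # seconds; arbitrary, roughly the delay observed above
    local i
    for ((i = 0; i < timeout; i++)); do
        # 2147483647 (CRUSH_ITEM_NONE) in the mapping means a slot is still empty
        ceph pg map $pgid | grep -q 2147483647 || return 0
        sleep 1
    done
    return 1
}

# hypothetical usage in the test: wait_for_pg_mapping 2.7 || return 1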

Actions #3

Updated by Samuel Just almost 9 years ago

Did osd.6 die of something bug-worthy? Looks like a timing problem with the test?

Actions #4

Updated by Loïc Dachary almost 9 years ago

I would be surprised if it were a timing issue: when the test activates an osd, it waits until the osd is reported to be up.

make check does not collect, save, or report osd crashes, so we do not know why it crashed. The bot should save the testdir for forensic analysis. There has not been much incentive to do this so far, probably because make check is 99.99% deterministic and repeating a failure is almost always a matter of running make check locally.
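
A hypothetical wrapper along those lines; the script path, the testdir location, and the archive name are assumptions, not how make check or the bot is wired today:

# run the test and, on failure, preserve its working directory for forensic analysis
testdir=testdir/test-erasure-code                     # assumed location of the test's working directory
if ! bash src/test/erasure-code/test-erasure-code.sh; then
    archive=/tmp/test-erasure-code-$(date +%s).tar.gz
    tar czf "$archive" "$testdir"                     # keep logs and osd data
    echo "test failed, testdir saved in $archive" >&2
    exit 1
fi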

Actions #5

Updated by Loïc Dachary over 8 years ago

  • Status changed from Need More Info to Can't reproduce

Has not shown up in months (according to the make check failure logs).
