Bug #12181


test: indep mapping fails because an osd is down

Added by Loïc Dachary almost 9 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running test-erasure-code.sh, the mapping of pg 2.7 fails:

$ gzip -d < /tmp/bad-report.txt.gz | jq '.pgmap.pg_stats[] | select(.state != "active+clean") | [.pgid, .acting]'
[
  "2.7",
  [
    2147483647,
    0,
    4
  ]
]
because osd.6 is missing from the acting set (the 2147483647 entry is CRUSH_ITEM_NONE, i.e. no OSD could be mapped to that slot), as shown by a report from a good run of the same test:
$ gzip -d < /tmp/good-report.txt.gz | jq '.pgmap.pg_stats[] | select(.pgid == "2.7") | .acting'
[
  6,
  0,
  4
]

and the osd map shows osd.6 as down and out:
$ gzip -d < /tmp/bad-report.txt.gz | jq '.osdmap.osds[] | select(.osd == 6)'
{
  "osd": 6,
  "uuid": "913da64e-3527-4d06-9441-62e8d1145356",
  "up": 0,
  "in": 0,
  "weight": 0,
  "primary_affinity": 1,
  "last_clean_begin": 0,
  "last_clean_end": 0,
  "up_from": 26,
  "up_thru": 0,
  "down_at": 28,
  "lost_at": 0,
  "public_addr": "127.0.0.1:6889/29036",
  "cluster_addr": "127.0.0.1:6890/29036",
  "heartbeat_back_addr": "127.0.0.1:6891/29036",
  "heartbeat_front_addr": "127.0.0.1:6892/29036",
  "state": [
    "autoout",
    "exists" 
  ]
}
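
The same report can also be scanned for any other OSD that is not both up and in; a minimal sketch using the same jq approach, with field names taken from the excerpt above:

$ gzip -d < /tmp/bad-report.txt.gz | jq '.osdmap.osds[] | select(.up == 0 or .in == 0) | .osd'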

Nothing in bad.log.gz explains why osd.6 failed. It could just be the host running the test that failed, although dmesg did not show any sign of memory starvation or disk trouble.


Files

bad.log.gz (59.1 KB) - bad.log - Loïc Dachary, 06/27/2015 04:29 PM
bad-report.txt.gz (7.29 KB) - ceph report for the bad run - Loïc Dachary, 06/27/2015 04:29 PM
good-report.txt.gz (7.28 KB) - ceph report for the good run - Loïc Dachary, 06/27/2015 04:29 PM
Actions #1

Updated by Loïc Dachary almost 9 years ago

Close as cannot reproduce two months from now.

Actions #2

Updated by Loïc Dachary almost 9 years ago

The fact that osd.6 is down explains the bad mapping: it will eventually be replaced by another osd, but the test does not wait long enough for that to happen. This behavior can be reproduced by modifying the test as follows:

...
create_erasure_coded_pool ecpool || return 1
kill_daemons $dir TERM osd.6
wait_for_clean || return 1

The modified test then fails with the same mapping error:
$ ceph pg map 2.7
osdmap e60 pg 2.7 (2.7) -> up [2147483647,0,4] acting [2147483647,0,4]

If one waits 5 minutes after killing osd.6, the mapping succeeds:
$ ceph pg map 2.7
osdmap e60 pg 2.7 (2.7) -> up [9,0,4] acting [9,0,4]
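
Rather than sleeping a fixed 5 minutes, the test could poll until the empty slot is filled. A minimal sketch, assuming only the ceph CLI used above; wait_for_pg_mapping is a hypothetical helper, not part of ceph-helpers.sh, and the 300 second timeout is an arbitrary choice matching the delay observed here:

wait_for_pg_mapping() {
    local pgid=$1
    local timeout=300  # seconds; arbitrary, roughly the delay observed above
    local i
    for ((i = 0; i < timeout; i++)); do
        # 2147483647 (CRUSH_ITEM_NONE) in the mapping means a slot is still empty
        ceph pg map $pgid | grep -q 2147483647 || return 0
        sleep 1
    done
    return 1
}

# hypothetical usage in the test: wait_for_pg_mapping 2.7 || return 1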

Actions #3

Updated by Samuel Just almost 9 years ago

Did osd.6 die of something bug-worthy? Looks like a timing problem with the test?

Actions #4

Updated by Loïc Dachary almost 9 years ago

I would be surprised if it were a timing issue: when the test activates an osd, it waits until the osd is reported to be up.

make check does not collect, save, or report osd crashes, so we do not know why it crashed. The bot should save the testdir for forensic analysis. There has not been much incentive to do this so far, probably because make check is 99.99% deterministic and repeating a failure is almost always a matter of running make check locally.
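
A hypothetical wrapper along those lines; the script path, the testdir location, and the archive name are assumptions, not how make check or the bot is wired today:

# run the test and, on failure, preserve its working directory for forensic analysis
testdir=testdir/test-erasure-code                     # assumed location of the test's working directory
if ! bash src/test/erasure-code/test-erasure-code.sh; then
    archive=/tmp/test-erasure-code-$(date +%s).tar.gz
    tar czf "$archive" "$testdir"                     # keep logs and osd data
    echo "test failed, testdir saved in $archive" >&2
    exit 1
fi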

Actions #5

Updated by Loïc Dachary over 8 years ago

  • Status changed from Need More Info to Can't reproduce

Has not shown up in months (according to the make check failure logs).
