Bug #12181: test: indep mapping fails because an osd is down
Status: Closed
Description
When running test-erasure-code.sh the mapping of pg 2.7 fails:
<pre>
$ gzip -d < /tmp/bad-report.txt.gz | jq '.pgmap.pg_stats[] | select(.state != "active+clean") | [.pgid, .acting]'
[ "2.7", [ 2147483647, 0, 4 ] ]
</pre>
because it misses osd.6, as shown by a report from a good run of the same test:
<pre>
$ gzip -d < /tmp/good-report.txt.gz | jq '.pgmap.pg_stats[] | select(.pgid == "2.7") | .acting'
[ 6, 0, 4 ]
</pre>
and the osdmap shows it as down and out:
<pre>
$ gzip -d < /tmp/bad-report.txt.gz | jq '.osdmap.osds[] | select(.osd == 6)'
{
  "osd": 6,
  "uuid": "913da64e-3527-4d06-9441-62e8d1145356",
  "up": 0,
  "in": 0,
  "weight": 0,
  "primary_affinity": 1,
  "last_clean_begin": 0,
  "last_clean_end": 0,
  "up_from": 26,
  "up_thru": 0,
  "down_at": 28,
  "lost_at": 0,
  "public_addr": "127.0.0.1:6889/29036",
  "cluster_addr": "127.0.0.1:6890/29036",
  "heartbeat_back_addr": "127.0.0.1:6891/29036",
  "heartbeat_front_addr": "127.0.0.1:6892/29036",
  "state": [ "autoout", "exists" ]
}
</pre>
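For context: 2147483647 is 0x7fffffff, CRUSH_ITEM_NONE, which Ceph prints when CRUSH could not fill a slot of an erasure-coded PG's acting set with any OSD. A minimal, self-contained check for this condition (hypothetical helper, not part of test-erasure-code.sh; the acting set is hard-coded from the bad report):

```shell
# 2147483647 == CRUSH_ITEM_NONE: an unfilled slot in the acting set.
# The acting set below is hard-coded from the bad report above.
acting='[2147483647,0,4]'
case "$acting" in
    *2147483647*) result="incomplete acting set: $acting" ;;
    *)            result="fully mapped: $acting" ;;
esac
echo "$result"
```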
Nothing in bad.log.gz explains why osd.6 failed. It could simply be a problem with the host running the test, although dmesg showed no sign of memory starvation or disk trouble.
Updated by Loïc Dachary almost 9 years ago
Close as "can't reproduce" if it does not show up again within two months.
Updated by Loïc Dachary almost 9 years ago
The fact that osd.6 is down explains the bad mapping: it will eventually be replaced by another osd, but the test does not wait long enough for that to happen. This behavior can be reproduced by adding
<pre>
...
create_erasure_coded_pool ecpool || return 1
kill_daemons $dir TERM osd.6
wait_for_clean || return 1
</pre>
and it will fail with the same error:
<pre>
$ ceph pg map 2.7
osdmap e60 pg 2.7 (2.7) -> up [2147483647,0,4] acting [2147483647,0,4]
</pre>
After waiting five minutes following the kill of osd.6, the mapping succeeds:
<pre>
$ ceph pg map 2.7
osdmap e60 pg 2.7 (2.7) -> up [9,0,4] acting [9,0,4]
</pre>
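The wait described above can be sketched as a poll loop. The real loop would run `ceph pg map 2.7` against the cluster and sleep between polls; here a hard-coded mock stands in (the "heals on the third poll" behavior is an assumption to keep the sketch self-contained):

```shell
# Sketch: poll until the acting set of pg 2.7 no longer contains
# CRUSH_ITEM_NONE (2147483647). The mapping output is mocked; assume
# it heals on the third poll once another osd takes over for osd.6.
attempt=0
while :; do
    attempt=$((attempt + 1))
    if [ "$attempt" -lt 3 ]; then
        out="up [2147483647,0,4] acting [2147483647,0,4]"  # still unmapped
    else
        out="up [9,0,4] acting [9,0,4]"                    # slot filled again
    fi
    case "$out" in
        *2147483647*) continue ;;  # real loop would sleep here before retrying
        *)            break ;;
    esac
done
echo "mapping healed after $attempt polls"
```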
Updated by Samuel Just almost 9 years ago
Did osd.6 die of something bug-worthy? This looks like a timing problem with the test.
Updated by Loïc Dachary almost 9 years ago
I would be surprised if it were a timing issue: when the test activates an osd, it waits until that osd is reported to be up.
make check does not collect, save, or report osd crashes, therefore we do not know why it crashed. The bot should save the testdir for forensic analysis. There has not been much incentive to do this so far, probably because make check is 99.99% deterministic and repeating a failure is almost always a matter of running make check locally.
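What "save the testdir" could look like, as a rough sketch only (none of this exists in make check today; the directory, log file, and status variable are all stand-ins):

```shell
# Hypothetical sketch: archive the test directory when a run fails,
# so an osd crash can be analysed after the fact.
dir=$(mktemp -d)                    # stand-in for the make check testdir
echo "fake osd.6 log" > "$dir/osd.6.log"
test_status=1                       # pretend the test run failed
saved=""
if [ "$test_status" -ne 0 ]; then
    tar czf "$dir.tgz" -C "$dir" .  # keep a compressed copy for forensics
    saved="$dir.tgz"
fi
echo "saved ${saved:-nothing} for forensic analysis"
rm -rf "$dir"
```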
Updated by Loïc Dachary over 8 years ago
- Status changed from Need More Info to Can't reproduce
Did not show up in months (according to the make check failure logs).