Bug #7014 (closed): rados: stuck degraded, possibly related to acting_backfill changes

Added by Samuel Just over 10 years ago. Updated over 10 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

End of ceph.log:
2013-12-15 23:51:19.781079 mon.0 10.214.131.3:6789/0 2397 : [INF] pgmap v1364: 213 pgs: 212 active+clean, 1 active+degraded; 2658 MB data, 729 MB used, 928 GB / 931 GB avail
2013-12-15 23:51:19.944286 mon.0 10.214.131.3:6789/0 2398 : [INF] osdmap e950: 6 osds: 5 up, 2 in
2013-12-15 23:51:20.061608 mon.0 10.214.131.3:6789/0 2399 : [INF] pgmap v1365: 213 pgs: 212 active+clean, 1 active+degraded; 2658 MB data, 729 MB used, 928 GB / 931 GB avail
2013-12-15 23:51:21.153941 mon.0 10.214.131.3:6789/0 2400 : [INF] osdmap e951: 6 osds: 5 up, 2 in
2013-12-15 23:51:21.305460 mon.0 10.214.131.3:6789/0 2401 : [INF] pgmap v1366: 202 pgs: 201 active+clean, 1 active+degraded; 23058 bytes data, 729 MB used, 928 GB / 931 GB avail
2013-12-15 23:51:22.384573 mon.0 10.214.131.3:6789/0 2402 : [INF] pgmap v1367: 202 pgs: 201 active+clean, 1 active+degraded; 23058 bytes data, 736 MB used, 928 GB / 931 GB avail
...
2013-12-16 00:09:59.981507 mon.0 10.214.131.3:6789/0 2434 : [INF] pgmap v1399: 202 pgs: 201 active+clean, 1 active+degraded; 23058 bytes data, 693 MB used, 928 GB / 931 GB avail
2013-12-16 00:10:23.917536 mon.0 10.214.131.3:6789/0 2435 : [INF] pgmap v1400: 202 pgs: 201 active+clean, 1 active+degraded; 23058 bytes data, 693 MB used, 928 GB / 931 GB avail

5/6 osds were up. The down osd appears to have been taken down by the min_size test rather than by a crash. I suggest grabbing the latest osdmap from the mon store to determine how the pgs were mapped to start with. I suspect a pg_temp mapping was lingering for one of the pgs.

ubuntu@teuthology:/a/teuthology-2013-12-15_23:00:15-rados-master-testing-basic-plana/4634/remote
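
A rough sketch of commands that could answer the mapping question on a live cluster (the pgid below is a placeholder; for this archived run the osdmap would instead have to be extracted from the mon store under the run directory above):

# Grab the current osdmap and see where the stuck PG maps:
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-pg 0.7    # 0.7 is a placeholder pgid

# Lingering pg_temp mappings show up in the osdmap dump:
ceph osd dump | grep pg_temp

# The PG's own view of its up/acting sets:
ceph pg 0.7 query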

#1

Updated by Samuel Just over 10 years ago

Another option would be to reproduce with logging. If you catch it before it gets cleaned up, it should be pretty obvious what's going on.
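
If reproducing, turning up OSD debug logging along these lines (a sketch of the usual levels; adjust the OSD ids to the cluster) should make the peering decisions visible:

# Inject debug levels into a running OSD (repeat per OSD):
ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'

# Or set them in ceph.conf under [osd] before the run:
#   debug osd = 20
#   debug ms = 1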

#2

Updated by David Zafman over 10 years ago

  • Status changed from New to Can't reproduce

This might have been fixed by the fix for #6905, which increases the timeout in suites/rados/thrash/thrashers/mapgap.yaml.
