Bug #2373
osd: last_epoch_clean > last_epoch_started
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ubuntu@plana53:/tmp/cephtest$ binary/usr/local/bin/ceph pg 0.0 query 2>/dev/null { "state": "down+peering", "up": [ 1], "acting": [ 1], "info": { "pgid": "0.0", "last_update": "9'2", "last_complete": "9'2", "log_tail": "0'0", "last_backfill": "MAX", "purged_snaps": "[]", "history": { "epoch_created": 1, "last_epoch_started": 6, "last_epoch_clean": 9, "last_epoch_split": 6, "same_up_since": 14, "same_interval_since": 14, "same_primary_since": 14, "last_scrub": "0'0", "last_scrub_stamp": "2012-05-01 02:04:27.558440"}, "stats": { "version": "9'2", "reported": "14'15", "state": "down+peering", "last_fresh": "2012-05-01 22:38:57.753776", "last_change": "2012-05-01 02:05:03.311013", "last_active": "2012-05-01 02:05:01.929705", "last_clean": "2012-05-01 02:05:01.929705", "last_unstale": "2012-05-01 22:38:57.753776", "mapping_epoch": 13, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 1, "last_epoch_clean": 1, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "0'0", "last_scrub_stamp": "2012-05-01 02:04:27.558440", "log_size": 226, "ondisk_log_size": 226, "stat_sum": { "num_bytes": 0, "num_objects": 0, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_degraded": 0, "num_objects_unfound": 0, "num_read": 0, "num_read_kb": 0, "num_write": 2, "num_write_kb": 1}, "stat_cat_sum": {}, "up": [ 1], "acting": [ 1]}, "empty": 0, "dne": 0, "incomplete": 0}, "recovery_state": [ { "name": "Started\/Primary\/Peering\/GetInfo", "enter_time": "2012-05-01 22:38:56.641997", "requested_info_from": []}, { "name": "Started\/Primary\/Peering", "enter_time": "2012-05-01 22:38:56.641987", "probing_osds": [ 0, 1, 2], "blocked": "peering is blocked due to down osds", "down_osds_we_would_probe": [ 0, 2], "peering_blocked_by": [ { "osd": 0, "current_lost_at": 0, "comment": "starting or marking this osd lost may let us proceed"}, { "osd": 2, "current_lost_at": 0, "comment": "starting or marking this osd lost may let us proceed"}]}, { "name": "Started", "enter_time": "2012-05-01 22:38:56.641935"}]} 2012-05-01 22:38:56.641903 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] handle_activate_map 2012-05-01 22:38:56.641914 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] update_heartbeat_peers 1 unchanged 2012-05-01 22:38:56.641924 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] exit Reset 1.127091 2 0.000000 2012-05-01 22:38:56.641935 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Started 2012-05-01 22:38:56.641945 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Start 2012-05-01 22:38:56.641955 7ffa235e6700 1 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] state<Start>: transitioning to Primary 2012-05-01 22:38:56.641966 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] exit Start 0.000020 0 0.000000 2012-05-01 22:38:56.641977 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Started/Primary 2012-05-01 22:38:56.641987 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Started/Primary/Peering 2012-05-01 22:38:56.641997 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo 2012-05-01 22:38:56.642007 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] generate_past_intervals already have past intervals back to 9 2012-05-01 22:38:56.642018 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior interval(13-13 []/[]) 2012-05-01 22:38:56.642029 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior interval(10-12 [1]/[1] maybe_went_rw) 2012-05-01 22:38:56.642040 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior interval(8-9 [0,1]/[0,1] maybe_went_rw) 2012-05-01 22:38:56.642051 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior prior osd.0 is down 2012-05-01 22:38:56.642061 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior interval(7-7 [0]/[0]) 2012-05-01 22:38:56.642071 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior interval(5-6 [0,2]/[0,2] maybe_went_rw) 2012-05-01 22:38:56.642082 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior prior osd.0 is down 2012-05-01 22:38:56.642092 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior prior osd.2 is down 2012-05-01 22:38:56.642101 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior possibly went active+rw, none up; including down osds 2012-05-01 22:38:56.642112 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] PriorSet: build_prior final: probe 0,1,2 down 0,2 blocked_by {0=0,2=0} pg_down 2012-05-01 22:38:56.642124 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] up_thru 10 < same_since 14, must notify monitor 2012-05-01 22:38:56.642135 7ffa235e6700 10 osd.1 14 queue_want_up_thru want 14 <= queued 14, currently 10 2012-05-01 22:38:56.642138 7ffa235e6700 15 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] update_stats 14'14 2012-05-01 22:38:56.642148 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] state<Started/Primary/Peering/GetInfo>: not querying info from down osd.0 2012-05-01 22:38:56.642159 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] state<Started/Primary/Peering/GetInfo>: not querying info from down osd.2 generate_past_intervals goes back to last_epoch_clean. build_priors go back to last_epoch_started. in this case, we were overly conservative with updating last_epoch_started, and it came back to bite us... - chef: null - ceph: log-whitelist: - objects unfound and apparently lost - lost_unfound: null ubuntu@teuthology:/a/nightly_coverage_2012-05-01-a/836
History
#1 Updated by Sage Weil almost 12 years ago
the problem is that we set the CLEAN state bit and report that to the monitor before last_epoch_started has propagated to peers.
#2 Updated by Sage Weil almost 12 years ago
- Status changed from 12 to Fix Under Review
- Priority changed from High to Urgent
wip-osd-peering
#3 Updated by Sage Weil almost 12 years ago
- Status changed from Fix Under Review to Resolved