Project

General

Profile

Bug #2373

osd: last_epoch_clean > last_epoch_started

Added by Sage Weil over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

ubuntu@plana53:/tmp/cephtest$ binary/usr/local/bin/ceph pg 0.0 query 2>/dev/null 

{ "state": "down+peering",
  "up": [
        1],
  "acting": [
        1],
  "info": { "pgid": "0.0",
      "last_update": "9'2",
      "last_complete": "9'2",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 6,
          "last_epoch_clean": 9,
          "last_epoch_split": 6,
          "same_up_since": 14,
          "same_interval_since": 14,
          "same_primary_since": 14,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2012-05-01 02:04:27.558440"},
      "stats": { "version": "9'2",
          "reported": "14'15",
          "state": "down+peering",
          "last_fresh": "2012-05-01 22:38:57.753776",
          "last_change": "2012-05-01 02:05:03.311013",
          "last_active": "2012-05-01 02:05:01.929705",
          "last_clean": "2012-05-01 02:05:01.929705",
          "last_unstale": "2012-05-01 22:38:57.753776",
          "mapping_epoch": 13,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 1,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2012-05-01 02:04:27.558440",
          "log_size": 226,
          "ondisk_log_size": 226,
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 2,
              "num_write_kb": 1},
          "stat_cat_sum": {},
          "up": [
                1],
          "acting": [
                1]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0},
  "recovery_state": [
        { "name": "Started\/Primary\/Peering\/GetInfo",
          "enter_time": "2012-05-01 22:38:56.641997",
          "requested_info_from": []},
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2012-05-01 22:38:56.641987",
          "probing_osds": [
                0,
                1,
                2],
          "blocked": "peering is blocked due to down osds",
          "down_osds_we_would_probe": [
                0,
                2],
          "peering_blocked_by": [
                { "osd": 0,
                  "current_lost_at": 0,
                  "comment": "starting or marking this osd lost may let us proceed"},
                { "osd": 2,
                  "current_lost_at": 0,
                  "comment": "starting or marking this osd lost may let us proceed"}]},
        { "name": "Started",
          "enter_time": "2012-05-01 22:38:56.641935"}]}

2012-05-01 22:38:56.641903 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] handle_activate_map 
2012-05-01 22:38:56.641914 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] update_heartbeat_peers 1 unchanged
2012-05-01 22:38:56.641924 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] exit Reset 1.127091 2 0.000000
2012-05-01 22:38:56.641935 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Started
2012-05-01 22:38:56.641945 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Start
2012-05-01 22:38:56.641955 7ffa235e6700  1 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] state<Start>: transitioning to Primary
2012-05-01 22:38:56.641966 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] exit Start 0.000020 0 0.000000
2012-05-01 22:38:56.641977 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Started/Primary
2012-05-01 22:38:56.641987 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 inactive] enter Started/Primary/Peering
2012-05-01 22:38:56.641997 7ffa235e6700 20 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
2012-05-01 22:38:56.642007 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering] generate_past_intervals already have past intervals back to 9
2012-05-01 22:38:56.642018 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior interval(13-13 []/[])
2012-05-01 22:38:56.642029 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior interval(10-12 [1]/[1] maybe_went_rw)
2012-05-01 22:38:56.642040 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior interval(8-9 [0,1]/[0,1] maybe_went_rw)
2012-05-01 22:38:56.642051 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior  prior osd.0 is down
2012-05-01 22:38:56.642061 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior interval(7-7 [0]/[0])
2012-05-01 22:38:56.642071 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior interval(5-6 [0,2]/[0,2] maybe_went_rw)
2012-05-01 22:38:56.642082 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior  prior osd.0 is down
2012-05-01 22:38:56.642092 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior  prior osd.2 is down
2012-05-01 22:38:56.642101 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior  possibly went active+rw, none up; including down osds
2012-05-01 22:38:56.642112 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 peering]  PriorSet: build_prior final: probe 0,1,2 down 0,2 blocked_by {0=0,2=0} pg_down
2012-05-01 22:38:56.642124 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] up_thru 10 < same_since 14, must notify monitor
2012-05-01 22:38:56.642135 7ffa235e6700 10 osd.1 14 queue_want_up_thru want 14 <= queued 14, currently 10
2012-05-01 22:38:56.642138 7ffa235e6700 15 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] update_stats 14'14
2012-05-01 22:38:56.642148 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] state<Started/Primary/Peering/GetInfo>:  not querying info from down osd.0
2012-05-01 22:38:56.642159 7ffa235e6700 10 osd.1 14 pg[0.0( v 9'2 (0'0,9'2] n=0 ec=1 les/c 6/9 14/14/14) [1] r=0 lpr=14 pi=5-13/5 lcod 9'1 mlcod 0'0 down+peering] state<Started/Primary/Peering/GetInfo>:  not querying info from down osd.2

generate_past_intervals goes back to last_epoch_clean.  build_priors go back to last_epoch_started.  in this case, we were overly conservative with updating last_epoch_started, and it came back to bite us...

- chef: null
- ceph:
    log-whitelist:
    - objects unfound and apparently lost
- lost_unfound: null
ubuntu@teuthology:/a/nightly_coverage_2012-05-01-a/836

History

#1 Updated by Sage Weil over 8 years ago

the problem is that we set the CLEAN state bit and report that to the monitor before last_epoch_started has propagated to peers.

#2 Updated by Sage Weil over 8 years ago

  • Status changed from 12 to Fix Under Review
  • Priority changed from High to Urgent

wip-osd-peering

#3 Updated by Sage Weil over 8 years ago

  • Status changed from Fix Under Review to Resolved

Also available in: Atom PDF