Project

General

Profile

Actions

Bug #9752

closed

acting in past intervals contains primary and up_primary (looks like duplicates but is a display problem only)

Added by Loïc Dachary over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

80%

Source:
other
Tags:
Backport:
giant, firefly
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a 0.80.6 in the context of http://tracker.ceph.com/issues/9750 the following showed up (the full output can be found at http://tracker.ceph.com/issues/9750#note-10).

$ ceph pg 57.7d query
...
                { "first": 10934,
                  "last": 10936,
                  "maybe_went_rw": 0,
                  "up": [
                        3],
                  "acting": [
                        3,
                        3,
                        3]},

It does not look like it could be a display bug because the code is straightforward . There has been many actions on the cluster which started as 0.80.5.
It's quite impossible to tell at which point this problem showed up for the first time and it's equally impossible to retrace the history of the cluster. I'm filling this bug just in case it shows up again but I don't see what kind of diagnostic is possible at this point. From the bash history I found that a monmap was injected a few weeks ago but did not find traces of other inject that could be disruptive.


Files

log.txt (47.2 KB) log.txt pg query with duplicates Loïc Dachary, 10/22/2014 10:49 AM
q.txt (23.6 KB) q.txt ceph pg 57.7d query (from #9750) Loïc Dachary, 10/28/2014 11:48 PM
osdmap.10818 (14.8 KB) osdmap.10818 Loïc Dachary, 10/30/2014 01:44 PM
osdmap.10831 (14.6 KB) osdmap.10831 Loïc Dachary, 10/30/2014 01:44 PM
osdmap.10832 (14.6 KB) osdmap.10832 Loïc Dachary, 10/30/2014 01:44 PM
osdmap.10858 (14.2 KB) osdmap.10858 Loïc Dachary, 10/30/2014 01:44 PM
osdmap.10859 (14.2 KB) osdmap.10859 Loïc Dachary, 10/30/2014 01:44 PM
osdmap.10882 (14.1 KB) osdmap.10882 Loïc Dachary, 10/30/2014 01:44 PM
osd.0.log.gz (143 KB) osd.0.log.gz logs from the primary osd for pg 0.2 Loïc Dachary, 10/30/2014 02:49 PM

Related issues 1 (0 open1 closed)

Related to Ceph - Bug #9750: pg incompleteWon't Fix10/12/2014

Actions
Actions #1

Updated by Samuel Just over 9 years ago

  • Assignee set to Loïc Dachary
Actions #2

Updated by David Zafman over 9 years ago

"kit" on #ceph was in a situation of having incomplete pg. They sent the pg query output and it showed strange past_intervals information as in this bug:

{ "first": 8194,
"last": 8194,
"maybe_went_rw": 0,
"up": [
13,
15],
"acting": [
13,
15,
13,
13]},

http://pastebin.com/raw.php?i=AJ3RMjz6

Actions #3

Updated by Loïc Dachary over 9 years ago

Full logs from pastebin to survive expiration.

Actions #4

Updated by Greg Farnum over 9 years ago

See the thread "[ceph-users] Troubleshooting Incomplete PGs" for another instance of this (and there are several more on the list archives if you want to search).
In that case, it's also reporting "-1" for some active slots, which is also of interest.

Actions #5

Updated by Samuel Just over 9 years ago

Actually, that thread is the same instance as david's.

Actions #6

Updated by Samuel Just over 9 years ago

  • Priority changed from Normal to Urgent
Actions #7

Updated by Samuel Just over 9 years ago

First thing we want to get is an osdmap from the misbehaving epoch.

Loic: you can get the osdmap for a particular epoch with something like ceph osd getmap <epoch> -o <outfile>

You'd probably want epoch 10934 in this case.

Actions #8

Updated by Loïc Dachary over 9 years ago

It is unfortunately gone

root@g3:~# ceph osd getmap 10934 > /tmp/osdmap.10934
Error ENOENT: there is no map for epoch 10934

Actions #9

Updated by David Zafman over 9 years ago

I happen to notice the issue because I happen to look at this guys pastebin. I didn't interact with him at all. Not sure if he ever posted his information to ceph-users or ceph-devel. This might have been on Oct 19.

3:47:15 PM kitz: I've got a number of PGs that are "incomplete" and I'm not sure what to do about them. Can anyone help?
3:56:18 PM lincolnb: kitz: do a 'ceph pg [pgid] query'
3:56:39 PM lincolnb: it might give some suggestions near the end
3:56:45 PM lincolnb: like what OSDs its trying to probe
3:57:08 PM lincolnb: all of those OSDs should be up and in
3:57:34 PM kitz: http://pastebin.com/raw.php?i=AJ3RMjz6
3:58:35 PM lincolnb: im looking at the very bottom, "probing_osds"
3:58:42 PM kitz: all of my OSDs are up+in but some are doing a backfill. I don't think that's related though...
3:58:45 PM lincolnb: are osds 10,13,15,and 25 up ?
3:59:00 PM kitz: yeah, why so many? with this pool size=2
3:59:27 PM lincolnb: im not an expert but i think those are the potential locations for the pg to be if an osd goes down
3:59:58 PM lincolnb: like, the acting set is [13,15] but if 13 goes down it would put the pg on [10,15] (something like that)
4:00:07 PM lincolnb: im sure one of the experts can correct me :)
4:00:19 PM kitz: I was having trouble with 10 earlier...
4:00:30 PM lincolnb: ahh
4:02:34 PM kitz: But what can I do to recover the PG?
4:04:32 PM lincolnb: to be honest, i am not 100% sure. i had the same problem, but i ended up completely deleting and recreating the pg, but thats guaranteed data loss.. so you probably don't want to do that.
4:07:11 PM lincolnb: you might be able to get a more helpful response if you post to the problem to the ceph users list and attach the query
4:09:54 PM kitz: lincolnb: thanks

Actions #10

Updated by Loïc Dachary over 9 years ago

Actions #11

Updated by Loïc Dachary over 9 years ago

Could it be that the the acting vector size is not reset when ceph osd pool set size 4 is followed by ceph osd pool set size 2 ?
Sam says he looked into this but found that acting is rebuilt one osd at a time and it is unlikely to be the source of the problem.

Actions #12

Updated by Loïc Dachary over 9 years ago

root@g3:~# ceph osd getmap 11523 > /tmp/osdmap.11523
Error ENOENT: there is no map for epoch 11523
root@g3:~# ceph osd getmap 11522 > /tmp/osdmap.11522
Error ENOENT: there is no map for epoch 11522

Updated by Loïc Dachary over 9 years ago

Fortunately I saved an entire osd directory from which I was able to extract osdmaps with duplicates related to q.txt

Actions #14

Updated by Loïc Dachary over 9 years ago

$ for osdmap in /tmp/osdmap.108* ; do ./osdmaptool --test-map-pg 57.7d $osdmap ; done
./osdmaptool: osdmap file '/tmp/osdmap.10818'
 parsed '57.7d' -> 57.7d
57.7d raw ([11,3], p11) up ([11,3], p11) acting ([3,6], p3)
./osdmaptool: osdmap file '/tmp/osdmap.10831'
 parsed '57.7d' -> 57.7d
57.7d raw ([11,3], p11) up ([11,3], p11) acting ([3,6], p3)
./osdmaptool: osdmap file '/tmp/osdmap.10832'
 parsed '57.7d' -> 57.7d
57.7d raw ([11,3], p11) up ([11,3], p11) acting ([11,3], p11)
./osdmaptool: osdmap file '/tmp/osdmap.10858'
 parsed '57.7d' -> 57.7d
57.7d raw ([11,3], p11) up ([11,3], p11) acting ([11,3], p11)
./osdmaptool: osdmap file '/tmp/osdmap.10859'
 parsed '57.7d' -> 57.7d
57.7d raw ([11,3], p11) up ([3], p3) acting ([3], p3)
./osdmaptool: osdmap file '/tmp/osdmap.10882'
 parsed '57.7d' -> 57.7d
57.7d raw ([11,3], p11) up ([3], p3) acting ([3], p3)
                { "first": 10818,
                  "last": 10831,
                  "maybe_went_rw": 1,
                  "up": [
                        11,
                        3],
                  "acting": [
                        3,
                        6,
                        3,
                        11]},
                { "first": 10832,
                  "last": 10858,
                  "maybe_went_rw": 1,
                  "up": [
                        11,
                        3],
                  "acting": [
                        11,
                        3,
                        11,
                        11]},
                { "first": 10859,
                  "last": 10882,
                  "maybe_went_rw": 0,
                  "up": [
                        3],
                  "acting": [
                        3,
                        3,
                        3]},
Actions #15

Updated by Loïc Dachary over 9 years ago

On a cluster running from sources with

# ceph version 0.87-578-g105d30a (105d30a6d3546f26b0e052e0e97fff1c7a50b54c)
$ MON=1 OSD=3 ./vstart.sh -n -l mon osd

the following shows during peering, immediately after it is started
$ ceph pg 0.2 query 
{ "state": "peering",
  "snap_trimq": "[]",
  "epoch": 10,
  "up": [
        0,
        1,
        2],
  "acting": [
        0,
        1,
        2],
  "actingbackfill": [
        "0",
        "1",
        "2"],
  "info": { "pgid": "0.2",
      "last_update": "0'0",
      "last_complete": "0'0",
      "log_tail": "0'0",
      "last_user_version": 0,
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 9,
          "last_epoch_clean": 9,
          "last_epoch_split": 0,
          "same_up_since": 10,
          "same_interval_since": 10,
          "same_primary_since": 5,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2014-10-30 22:33:15.919997",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2014-10-30 22:33:15.919997",
          "last_clean_scrub_stamp": "0.000000"},
      "stats": { "version": "0'0",
          "reported_seq": "23",
          "reported_epoch": "10",
          "state": "peering",
          "last_fresh": "2014-10-30 22:33:27.623073",
          "last_change": "2014-10-30 22:33:27.576033",
          "last_active": "2014-10-30 22:33:23.544610",
          "last_clean": "0.000000",
          "last_became_active": "0.000000",
          "last_unstale": "2014-10-30 22:33:27.623073",
          "last_undegraded": "2014-10-30 22:33:27.623073",
          "last_fullsized": "2014-10-30 22:33:27.623073",
          "mapping_epoch": 8,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 9,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2014-10-30 22:33:15.919997",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2014-10-30 22:33:15.919997",
          "last_clean_scrub_stamp": "0.000000",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_misplaced": 0,
              "num_objects_unfound": 0,
              "num_objects_dirty": 0,
              "num_whiteouts": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0,
              "num_objects_omap": 0,
              "num_objects_hit_set_archive": 0,
              "num_bytes_hit_set_archive": 0},
          "stat_cat_sum": {},
          "up": [
                0,
                1,
                2],
          "acting": [
                0,
                1,
                2],
          "blocked_by": [],
          "up_primary": 0,
          "acting_primary": 0},
      "empty": 1,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 9,
      "hit_set_history": { "current_last_update": "0'0",
          "current_last_stamp": "0.000000",
          "current_info": { "begin": "0.000000",
              "end": "0.000000",
              "version": "0'0"},
          "history": []}},
  "peer_info": [
        { "peer": "1",
          "pgid": "0.2",
          "last_update": "0'0",
          "last_complete": "0'0",
          "log_tail": "0'0",
          "last_user_version": 0,
          "last_backfill": "MAX",
          "purged_snaps": "[]",
          "history": { "epoch_created": 1,
              "last_epoch_started": 9,
              "last_epoch_clean": 9,
              "last_epoch_split": 0,
              "same_up_since": 10,
              "same_interval_since": 10,
              "same_primary_since": 5,
              "last_scrub": "0'0",
              "last_scrub_stamp": "2014-10-30 22:33:15.919997",
              "last_deep_scrub": "0'0",
              "last_deep_scrub_stamp": "2014-10-30 22:33:15.919997",
              "last_clean_scrub_stamp": "0.000000"},
          "stats": { "version": "0'0",
              "reported_seq": "15",
              "reported_epoch": "9",
              "state": "peering",
              "last_fresh": "2014-10-30 22:33:22.580789",
              "last_change": "2014-10-30 22:33:21.195947",
              "last_active": "2014-10-30 22:33:19.939091",
              "last_clean": "0.000000",
              "last_became_active": "0.000000",
              "last_unstale": "2014-10-30 22:33:22.580789",
              "last_undegraded": "2014-10-30 22:33:22.580789",
              "last_fullsized": "2014-10-30 22:33:22.580789",
              "mapping_epoch": 8,
              "log_start": "0'0",
              "ondisk_log_start": "0'0",
              "created": 1,
              "last_epoch_clean": 6,
              "parent": "0.0",
              "parent_split_bits": 0,
              "last_scrub": "0'0",
              "last_scrub_stamp": "2014-10-30 22:33:15.919997",
              "last_deep_scrub": "0'0",
              "last_deep_scrub_stamp": "2014-10-30 22:33:15.919997",
              "last_clean_scrub_stamp": "0.000000",
              "log_size": 0,
              "ondisk_log_size": 0,
              "stats_invalid": "0",
              "stat_sum": { "num_bytes": 0,
                  "num_objects": 0,
                  "num_object_clones": 0,
                  "num_object_copies": 0,
                  "num_objects_missing_on_primary": 0,
                  "num_objects_degraded": 0,
                  "num_objects_misplaced": 0,
                  "num_objects_unfound": 0,
                  "num_objects_dirty": 0,
                  "num_whiteouts": 0,
                  "num_read": 0,
                  "num_read_kb": 0,
                  "num_write": 0,
                  "num_write_kb": 0,
                  "num_scrub_errors": 0,
                  "num_shallow_scrub_errors": 0,
                  "num_deep_scrub_errors": 0,
                  "num_objects_recovered": 0,
                  "num_bytes_recovered": 0,
                  "num_keys_recovered": 0,
                  "num_objects_omap": 0,
                  "num_objects_hit_set_archive": 0,
                  "num_bytes_hit_set_archive": 0},
              "stat_cat_sum": {},
              "up": [
                    0,
                    1,
                    2],
              "acting": [
                    0,
                    1,
                    2],
              "blocked_by": [],
              "up_primary": 0,
              "acting_primary": 0},
          "empty": 1,
          "dne": 0,
          "incomplete": 0,
          "last_epoch_started": 9,
          "hit_set_history": { "current_last_update": "0'0",
              "current_last_stamp": "0.000000",
              "current_info": { "begin": "0.000000",
                  "end": "0.000000",
                  "version": "0'0"},
              "history": []}},
        { "peer": "2",
          "pgid": "0.2",
          "last_update": "0'0",
          "last_complete": "0'0",
          "log_tail": "0'0",
          "last_user_version": 0,
          "last_backfill": "MAX",
          "purged_snaps": "[]",
          "history": { "epoch_created": 0,
              "last_epoch_started": 0,
              "last_epoch_clean": 0,
              "last_epoch_split": 0,
              "same_up_since": 0,
              "same_interval_since": 0,
              "same_primary_since": 0,
              "last_scrub": "0'0",
              "last_scrub_stamp": "0.000000",
              "last_deep_scrub": "0'0",
              "last_deep_scrub_stamp": "0.000000",
              "last_clean_scrub_stamp": "0.000000"},
          "stats": { "version": "0'0",
              "reported_seq": "0",
              "reported_epoch": "0",
              "state": "inactive",
              "last_fresh": "0.000000",
              "last_change": "0.000000",
              "last_active": "0.000000",
              "last_clean": "0.000000",
              "last_became_active": "0.000000",
              "last_unstale": "0.000000",
              "last_undegraded": "0.000000",
              "last_fullsized": "0.000000",
              "mapping_epoch": 0,
              "log_start": "0'0",
              "ondisk_log_start": "0'0",
              "created": 0,
              "last_epoch_clean": 0,
              "parent": "0.0",
              "parent_split_bits": 0,
              "last_scrub": "0'0",
              "last_scrub_stamp": "0.000000",
              "last_deep_scrub": "0'0",
              "last_deep_scrub_stamp": "0.000000",
              "last_clean_scrub_stamp": "0.000000",
              "log_size": 0,
              "ondisk_log_size": 0,
              "stats_invalid": "0",
              "stat_sum": { "num_bytes": 0,
                  "num_objects": 0,
                  "num_object_clones": 0,
                  "num_object_copies": 0,
                  "num_objects_missing_on_primary": 0,
                  "num_objects_degraded": 0,
                  "num_objects_misplaced": 0,
                  "num_objects_unfound": 0,
                  "num_objects_dirty": 0,
                  "num_whiteouts": 0,
                  "num_read": 0,
                  "num_read_kb": 0,
                  "num_write": 0,
                  "num_write_kb": 0,
                  "num_scrub_errors": 0,
                  "num_shallow_scrub_errors": 0,
                  "num_deep_scrub_errors": 0,
                  "num_objects_recovered": 0,
                  "num_bytes_recovered": 0,
                  "num_keys_recovered": 0,
                  "num_objects_omap": 0,
                  "num_objects_hit_set_archive": 0,
                  "num_bytes_hit_set_archive": 0},
              "stat_cat_sum": {},
              "up": [],
              "acting": [],
              "blocked_by": [],
              "up_primary": -1,
              "acting_primary": -1},
          "empty": 1,
          "dne": 1,
          "incomplete": 0,
          "last_epoch_started": 0,
          "hit_set_history": { "current_last_update": "0'0",
              "current_last_stamp": "0.000000",
              "current_info": { "begin": "0.000000",
                  "end": "0.000000",
                  "version": "0'0"},
              "history": []}}],
  "recovery_state": [
        { "name": "Started\/Primary\/Peering\/WaitUpThru",
          "enter_time": "2014-10-30 22:33:27.623078",
          "comment": "waiting for osdmap to reflect a new up_thru for this osd"},
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2014-10-30 22:33:27.575916",
          "past_intervals": [
                { "first": 8,
                  "last": 9,
                  "maybe_went_rw": 1,
                  "up": [
                        0,
                        1],
                  "acting": [
                        0,
                        1,
                        0,
                        0]}],
          "probing_osds": [
                "0",
                "1",
                "2"],
          "down_osds_we_would_probe": [],
          "peering_blocked_by": []},
        { "name": "Started",
          "enter_time": "2014-10-30 22:33:27.575605"}],
  "agent_state": {}}

Actions #17

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to Fix Under Review
  • % Done changed from 0 to 80
Actions #18

Updated by Loïc Dachary over 9 years ago

  • Backport set to giant, firefly
Actions #19

Updated by Loïc Dachary over 9 years ago

  • Subject changed from acting in past intervals contains duplicates to acting in past intervals contains primary and up_primary (looks like duplicates but is a display problem only)
Actions #20

Updated by Greg Farnum over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #21

Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Fix Under Review
Actions #22

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF