Bug #48417 (closed): unfound EC objects in sepia's LRC after upgrade

Added by Josh Durgin over 3 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 100%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[ERR] PG_DAMAGED: Possible data damage: 25 pgs recovery_unfound
    pg 119.c is active+recovery_unfound+undersized+degraded+remapped, acting [83,104,2,2147483647,25,146], 63 unfound
    pg 119.16 is active+recovery_unfound+undersized+degraded+remapped, acting [63,2147483647,94,69,13,75], 65 unfound
    pg 119.1d is active+recovery_unfound+undersized+degraded+remapped, acting [2147483647,110,10,13,32,128], 67 unfound
    pg 119.31 is active+recovery_unfound+undersized+degraded+remapped, acting [56,22,0,4,2147483647,147], 69 unfound
    pg 119.55 is active+recovery_unfound+undersized+degraded+remapped, acting [144,2,10,2147483647,149,140], 44 unfound
    pg 119.9d is active+recovery_unfound+undersized+degraded+remapped, acting [5,3,2147483647,144,69,103], 55 unfound
    pg 119.a6 is active+recovery_unfound+undersized+degraded+remapped, acting [95,92,125,2147483647,22,23], 56 unfound
    pg 119.b9 is active+recovery_unfound+undersized+degraded+remapped, acting [61,146,56,22,73,2147483647], 73 unfound
    pg 119.c3 is active+recovery_unfound+undersized+degraded+remapped, acting [32,3,112,63,1,2147483647], 72 unfound
    pg 119.df is active+recovery_unfound+undersized+degraded+remapped, acting [0,2147483647,139,10,124,144], 52 unfound
    pg 119.f8 is active+recovery_unfound+undersized+degraded+remapped, acting [5,2147483647,145,61,0,1], 66 unfound
    pg 119.101 is active+recovery_unfound+undersized+degraded+remapped, acting [2147483647,2,21,85,61,1], 51 unfound
    pg 119.153 is active+recovery_unfound+undersized+degraded+remapped, acting [13,121,2147483647,87,90,25], 67 unfound
    pg 119.158 is active+recovery_unfound+undersized+degraded+remapped, acting [90,39,1,0,63,2147483647], 56 unfound
    pg 119.16c is active+recovery_unfound+undersized+degraded+remapped, acting [83,13,124,105,10,2147483647], 53 unfound
    pg 119.1a9 is active+recovery_unfound+undersized+degraded+remapped, acting [113,22,2147483647,56,26,85], 42 unfound
    pg 119.1da is active+recovery_unfound+undersized+degraded+remapped, acting [131,3,103,25,2147483647,105], 59 unfound
    pg 119.1dc is active+recovery_unfound+undersized+degraded+remapped, acting [26,69,28,139,2147483647,138], 51 unfound
    pg 119.1e8 is active+recovery_unfound+undersized+degraded+remapped, acting [0,40,56,2147483647,143,25], 58 unfound
    pg 119.252 is active+recovery_unfound+undersized+degraded+remapped, acting [23,4,69,141,144,2147483647], 2 unfound
    pg 119.2e2 is active+recovery_unfound+undersized+degraded+remapped, acting [2147483647,28,85,69,0,9], 61 unfound
    pg 119.303 is active+recovery_unfound+undersized+degraded+remapped, acting [13,29,1,2147483647,104,148], 70 unfound
    pg 119.32d is active+recovery_unfound+undersized+degraded+remapped, acting [148,3,145,2147483647,10,132], 55 unfound
    pg 119.33c is active+recovery_unfound+undersized+degraded+remapped, acting [138,145,2147483647,69,23,39], 64 unfound
    pg 119.385 is active+recovery_unfound+undersized+degraded+remapped, acting [121,124,61,69,2147483647,2], 68 unfound

These PGs are all part of a 4+2 CephFS data pool, so EC overwrites are in use.
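
For reference, a 4+2 data pool with EC overwrites enabled is typically set up along the following lines (the profile name, pool name, and PG counts here are hypothetical, not the actual sepia LRC values):

$ ceph osd erasure-code-profile set ec42profile k=4 m=2 crush-failure-domain=host
$ ceph osd pool create cephfs_ec_data 1024 1024 erasure ec42profile
$ ceph osd pool set cephfs_ec_data allow_ec_overwrites true
$ ceph fs add_data_pool cephfs cephfs_ec_data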

There does not seem to be a common OSD missing the objects; the unfound objects appear to be spread across various OSDs and PGs in this pool.
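
One way to confirm this (a rough sketch; the PG IDs are taken from the listing above, and list_unfound assumes a reasonably recent release) is to dump the unfound objects and their missing shards for a few of the affected PGs and compare:

$ ceph health detail | grep recovery_unfound
$ ceph pg 119.385 list_unfound
$ ceph pg 119.c list_unfound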

Examining one PG, 119.385, shows varying last_update and log_tail values that do not go back far enough to roll back:

$ ceph pg 119.385 query | egrep "\"last_update|last_complete|log_tail|\"peer" 
        "last_update": "8294786'781862",
        "last_complete": "8294786'781862",
        "log_tail": "8282428'778930",
    "peer_info": [
            "peer": "1(3)",
            "last_update": "7910997'642207",
            "last_complete": "7910997'642207",
            "log_tail": "7896032'640062",
            "peer": "2(5)",
            "last_update": "8294786'781862",
            "last_complete": "0'0",
            "log_tail": "8282880'779012",
            "peer": "7(1)",
            "last_update": "7935235'645745",
            "last_complete": "7935235'645745",
            "log_tail": "7924442'643515",
            "peer": "9(3)",
            "last_update": "7932524'644986",
            "last_complete": "7932524'644986",
            "log_tail": "7913924'642474",
            "peer": "10(3)",
            "last_update": "7932540'644988",
            "last_complete": "7932540'644988",
            "log_tail": "7913924'642474",
            "peer": "13(5)",
            "last_update": "7936103'646007",
            "last_complete": "7936103'646007",
            "log_tail": "7927639'643715",
            "peer": "14(1)",
            "last_update": "8078679'702826",
            "last_complete": "8078679'702826",
            "log_tail": "8071492'700187",
            "peer": "20(1)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "23(5)",
            "last_update": "8078679'702826",
            "last_complete": "8078679'702826",
            "log_tail": "8071492'700187",
            "peer": "25(3)",
            "last_update": "7910997'642207",
            "last_complete": "7910997'642207",
            "log_tail": "7896032'640062",
            "peer": "31(4)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "36(4)",
            "last_update": "7942013'651879",
            "last_complete": "7942013'651879",
            "log_tail": "7937936'649586",
            "peer": "61(2)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "65(4)",
            "last_update": "7941969'651868",
            "last_complete": "7941969'651868",
            "log_tail": "7937936'649586",
            "peer": "69(3)",
            "last_update": "8294786'781862",
            "last_complete": "0'0",
            "log_tail": "8282880'779012",
            "peer": "73(5)",
            "last_update": "7924910'643554",
            "last_complete": "7924910'643554",
            "log_tail": "7901040'641127",
            "peer": "81(4)",
            "last_update": "7941969'651868",
            "last_complete": "7941969'651868",
            "log_tail": "7937936'649586",
            "peer": "94(4)",
            "last_update": "7941920'651866",
            "last_complete": "7941920'651866",
            "log_tail": "7937936'649586",
            "peer": "122(5)",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "peer": "124(1)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "145(5)",
            "last_update": "7936058'645952",
            "last_complete": "7936058'645952",
            "log_tail": "7927639'643715",
                "peer_backfill_info": [],
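
The gap is visible directly in these values: the primary's log_tail (8282428'778930) is already far ahead of the last_update of several peers (for example 1(3) and 25(3) at 7910997'642207), so those shards cannot be brought up to date from the primary's log alone. A quick way to pull out just these fields, assuming the usual top-level "info"/"peer_info" layout of the pg query JSON, is something like:

$ ceph pg 119.385 query | jq '{log_tail: .info.log_tail, peers: [.peer_info[] | {peer, last_update, log_tail}]}'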

Restarting the primary (osd.121) with debug_osd=30, debug_ms=1, and debug_bluestore=10 enabled captures the peering process, which results in no change to the cluster state: log uploaded via ceph-post-file as a257d82a-86de-424e-9964-86d029c87e59
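
The exact commands are not recorded in the ticket, but setting these debug levels, restarting the OSD, and capturing the log typically looks roughly like this (OSD id from above; log path per the usual defaults):

$ ceph config set osd.121 debug_osd 30
$ ceph config set osd.121 debug_ms 1
$ ceph config set osd.121 debug_bluestore 10
$ systemctl restart ceph-osd@121
$ ceph-post-file /var/log/ceph/ceph-osd.121.log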


Subtasks 3 (0 open, 3 closed)

Bug #48609: osd/PGLog: don’t fast-forward can_rollback_to during merge_log if the log isn’t extended (Closed, Deepika Upadhyay)

Bug #48611: osd: Delay sending info to new backfill peer resetting last_backfill until backfill actually starts (Resolved, Deepika Upadhyay)

Bug #48613: Reproduce https://tracker.ceph.com/issues/48417 (Resolved, Deepika Upadhyay)
