Bug #48417 (closed)

unfound EC objects in sepia's LRC after upgrade

Added by Josh Durgin over 3 years ago. Updated over 2 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
100%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[ERR] PG_DAMAGED: Possible data damage: 25 pgs recovery_unfound
    pg 119.c is active+recovery_unfound+undersized+degraded+remapped, acting [83,104,2,2147483647,25,146], 63 unfound
    pg 119.16 is active+recovery_unfound+undersized+degraded+remapped, acting [63,2147483647,94,69,13,75], 65 unfound
    pg 119.1d is active+recovery_unfound+undersized+degraded+remapped, acting [2147483647,110,10,13,32,128], 67 unfound
    pg 119.31 is active+recovery_unfound+undersized+degraded+remapped, acting [56,22,0,4,2147483647,147], 69 unfound
    pg 119.55 is active+recovery_unfound+undersized+degraded+remapped, acting [144,2,10,2147483647,149,140], 44 unfound
    pg 119.9d is active+recovery_unfound+undersized+degraded+remapped, acting [5,3,2147483647,144,69,103], 55 unfound
    pg 119.a6 is active+recovery_unfound+undersized+degraded+remapped, acting [95,92,125,2147483647,22,23], 56 unfound
    pg 119.b9 is active+recovery_unfound+undersized+degraded+remapped, acting [61,146,56,22,73,2147483647], 73 unfound
    pg 119.c3 is active+recovery_unfound+undersized+degraded+remapped, acting [32,3,112,63,1,2147483647], 72 unfound
    pg 119.df is active+recovery_unfound+undersized+degraded+remapped, acting [0,2147483647,139,10,124,144], 52 unfound
    pg 119.f8 is active+recovery_unfound+undersized+degraded+remapped, acting [5,2147483647,145,61,0,1], 66 unfound
    pg 119.101 is active+recovery_unfound+undersized+degraded+remapped, acting [2147483647,2,21,85,61,1], 51 unfound
    pg 119.153 is active+recovery_unfound+undersized+degraded+remapped, acting [13,121,2147483647,87,90,25], 67 unfound
    pg 119.158 is active+recovery_unfound+undersized+degraded+remapped, acting [90,39,1,0,63,2147483647], 56 unfound
    pg 119.16c is active+recovery_unfound+undersized+degraded+remapped, acting [83,13,124,105,10,2147483647], 53 unfound
    pg 119.1a9 is active+recovery_unfound+undersized+degraded+remapped, acting [113,22,2147483647,56,26,85], 42 unfound
    pg 119.1da is active+recovery_unfound+undersized+degraded+remapped, acting [131,3,103,25,2147483647,105], 59 unfound
    pg 119.1dc is active+recovery_unfound+undersized+degraded+remapped, acting [26,69,28,139,2147483647,138], 51 unfound
    pg 119.1e8 is active+recovery_unfound+undersized+degraded+remapped, acting [0,40,56,2147483647,143,25], 58 unfound
    pg 119.252 is active+recovery_unfound+undersized+degraded+remapped, acting [23,4,69,141,144,2147483647], 2 unfound
    pg 119.2e2 is active+recovery_unfound+undersized+degraded+remapped, acting [2147483647,28,85,69,0,9], 61 unfound
    pg 119.303 is active+recovery_unfound+undersized+degraded+remapped, acting [13,29,1,2147483647,104,148], 70 unfound
    pg 119.32d is active+recovery_unfound+undersized+degraded+remapped, acting [148,3,145,2147483647,10,132], 55 unfound
    pg 119.33c is active+recovery_unfound+undersized+degraded+remapped, acting [138,145,2147483647,69,23,39], 64 unfound
    pg 119.385 is active+recovery_unfound+undersized+degraded+remapped, acting [121,124,61,69,2147483647,2], 68 unfound

These PGs are all part of a 4+2 EC CephFS data pool, so EC overwrites are in use.

There does not seem to be a single common OSD missing the objects; the unfound objects are spread across various OSDs and PGs in this pool.
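
A quick way to confirm that is to tally how often each OSD id appears in the acting sets above (2147483647 is the placeholder for an absent shard). A minimal sketch, assuming the health output is saved to a hypothetical health.txt:

#!/usr/bin/env python3
# Tally OSD ids across the acting sets in the `ceph health detail` output
# above to check whether a single OSD is common to every recovery_unfound PG.
# Assumes the health output is saved to health.txt (hypothetical path).
import re
from collections import Counter

NONE_SHARD = 2147483647   # placeholder for an absent shard

counts = Counter()
pgs = 0
with open("health.txt") as f:
    for line in f:
        m = re.search(r"acting \[([0-9,]+)\]", line)
        if not m:
            continue
        pgs += 1
        for osd in map(int, m.group(1).split(",")):
            if osd != NONE_SHARD:
                counts[osd] += 1

print("%d PGs scanned" % pgs)
for osd, n in counts.most_common(10):
    print("osd.%d appears in %d acting sets" % (osd, n))
# No OSD appearing in all 25 acting sets -> no single common OSD.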

Examining one pg, 119.385, there are varying last_update and log_tail values, and the log tails do not go back far enough to roll back (a parsing sketch follows the query output below):

$ ceph pg 119.385 query | egrep "\"last_update|last_complete|log_tail|\"peer" 
        "last_update": "8294786'781862",
        "last_complete": "8294786'781862",
        "log_tail": "8282428'778930",
    "peer_info": [
            "peer": "1(3)",
            "last_update": "7910997'642207",
            "last_complete": "7910997'642207",
            "log_tail": "7896032'640062",
            "peer": "2(5)",
            "last_update": "8294786'781862",
            "last_complete": "0'0",
            "log_tail": "8282880'779012",
            "peer": "7(1)",
            "last_update": "7935235'645745",
            "last_complete": "7935235'645745",
            "log_tail": "7924442'643515",
            "peer": "9(3)",
            "last_update": "7932524'644986",
            "last_complete": "7932524'644986",
            "log_tail": "7913924'642474",
            "peer": "10(3)",
            "last_update": "7932540'644988",
            "last_complete": "7932540'644988",
            "log_tail": "7913924'642474",
            "peer": "13(5)",
            "last_update": "7936103'646007",
            "last_complete": "7936103'646007",
            "log_tail": "7927639'643715",
            "peer": "14(1)",
            "last_update": "8078679'702826",
            "last_complete": "8078679'702826",
            "log_tail": "8071492'700187",
            "peer": "20(1)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "23(5)",
            "last_update": "8078679'702826",
            "last_complete": "8078679'702826",
            "log_tail": "8071492'700187",
            "peer": "25(3)",
            "last_update": "7910997'642207",
            "last_complete": "7910997'642207",
            "log_tail": "7896032'640062",
            "peer": "31(4)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "36(4)",
            "last_update": "7942013'651879",
            "last_complete": "7942013'651879",
            "log_tail": "7937936'649586",
            "peer": "61(2)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "65(4)",
            "last_update": "7941969'651868",
            "last_complete": "7941969'651868",
            "log_tail": "7937936'649586",
            "peer": "69(3)",
            "last_update": "8294786'781862",
            "last_complete": "0'0",
            "log_tail": "8282880'779012",
            "peer": "73(5)",
            "last_update": "7924910'643554",
            "last_complete": "7924910'643554",
            "log_tail": "7901040'641127",
            "peer": "81(4)",
            "last_update": "7941969'651868",
            "last_complete": "7941969'651868",
            "log_tail": "7937936'649586",
            "peer": "94(4)",
            "last_update": "7941920'651866",
            "last_complete": "7941920'651866",
            "log_tail": "7937936'649586",
            "peer": "122(5)",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "peer": "124(1)",
            "last_update": "8294786'781862",
            "last_complete": "8294786'781862",
            "log_tail": "8282880'779012",
            "peer": "145(5)",
            "last_update": "7936058'645952",
            "last_complete": "7936058'645952",
            "log_tail": "7927639'643715",
                "peer_backfill_info": [],

Restarting the primary (osd.121) with debug_osd=30, debug_ms=1, and debug_bluestore=10 enabled shows the peering process, which results in no change to the cluster state; logs uploaded via ceph-post-file: a257d82a-86de-424e-9964-86d029c87e59


Subtasks 3 (0 open, 3 closed)

Bug #48609: osd/PGLog: don’t fast-forward can_rollback_to during merge_log if the log isn’t extended (Closed; Deepika Upadhyay)

Bug #48611: osd: Delay sending info to new backfill peer resetting last_backfill until backfill actually starts (Resolved; Deepika Upadhyay)

Bug #48613: Reproduce https://tracker.ceph.com/issues/48417 (Resolved; Deepika Upadhyay)

#1

Updated by Josh Durgin over 3 years ago

  • Description updated (diff)
#2

Updated by Neha Ojha over 3 years ago

shard 122(5) looks suspicious: the pg query output shows it as not queried, yet all_unfound_are_queried_or_lost reports it was queried, and discover_all_missing does report it (a query-parsing sketch follows the log below).

2020-12-01T20:04:33.397+0000 7fd2d384e700 10 osd.121 pg_epoch: 8293941 pg[119.385s0( v 8293928'781618 (8280506'778630,8293928'781618] local-lis/les=8080280/8080281 n=54305 ec=5864462/5696301 lis/c=8080280/7905412 les/c/f=8080281/7905429/8076205 sis=8080280) [121,20,61,69,31,2]/[121,124,61,69,NONE,2]p121(0) backfill=[20(1),31(4)] r=0 lpr=8080280 pi=[7905412,8080280)/6 crt=8293928'781618 lcod 8293920'781617 mlcod 0'0 active+recovery_unfound+undersized+degraded+remapped mbc={0={(1+0)=68},1={(0+1)=68},2={(1+0)=68},3={(0+0)=68},4={(0+0)=68},5={(0+0)=68}}] all_unfound_are_queried_or_lost all of might_have_unfound 1(3),2(5),7(1),9(3),10(3),13(5),14(1),20(1),23(5),25(3),31(4),36(4),61(2),65(4),69(3),73(5),81(4),94(4),122(5),124(1),145(5) have been queried or are marked lost
"might_have_unfound": [
                {
                    "osd": "122(5)",
                    "status": "not queried" 
                },
2020-12-01T20:04:33.397+0000 7fd2d384e700 20 osd.121 pg_epoch: 8293941 pg[119.385s0( v 8293928'781618 (8280506'778630,8293928'781618] local-lis/les=8080280/8080281 n=54305 ec=5864462/5696301 lis/c=8080280/7905412 les/c/f=8080281/7905429/8076205 sis=8080280) [121,20,61,69,31,2]/[121,124,61,69,NONE,2]p121(0) backfill=[20(1),31(4)] r=0 lpr=8080280 pi=[7905412,8080280)/6 crt=8293928'781618 lcod 8293920'781617 mlcod 0'0 active+recovery_unfound+undersized+degraded+remapped mbc={0={(1+0)=68},1={(0+1)=68},2={(1+0)=68},3={(0+0)=68},4={(0+0)=68},5={(0+0)=68}}] update_calc_stats no peer_missing found for 122(5)
2020-12-01T20:04:33.397+0000 7fd2d384e700 20 osd.121 pg_epoch: 8293941 pg[119.385s0( v 8293928'781618 (8280506'778630,8293928'781618] local-lis/les=8080280/8080281 n=54305 ec=5864462/5696301 lis/c=8080280/7905412 les/c/f=8080281/7905429/8076205 sis=8080280) [121,20,61,69,31,2]/[121,124,61,69,NONE,2]p121(0) backfill=[20(1),31(4)] r=0 lpr=8080280 pi=[7905412,8080280)/6 crt=8293928'781618 lcod 8293920'781617 mlcod 0'0 active+recovery_unfound+undersized+degraded+remapped mbc={0={(1+0)=68},1={(0+1)=68},2={(1+0)=68},3={(0+0)=68},4={(0+0)=68},5={(0+0)=68}}] update_calc_stats shard 122(5) objects 0 missing 54305
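
For cross-checking, the shards the primary still reports as unqueried can be pulled straight from the query output. A minimal sketch, assuming might_have_unfound sits under recovery_state as in the snippet above:

#!/usr/bin/env python3
# List the might_have_unfound entries that `ceph pg 119.385 query` still
# reports with a status other than "already probed", to cross-check the
# all_unfound_are_queried_or_lost log line above.
import json
import subprocess

q = json.loads(subprocess.check_output(["ceph", "pg", "119.385", "query"]))

for state in q.get("recovery_state", []):
    for entry in state.get("might_have_unfound", []):
        if entry["status"] != "already probed":
            print("%s: %s" % (entry["osd"], entry["status"]))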

still digging

#3

Updated by Samuel Just over 3 years ago

[NVM, used the command wrong]

#4

Updated by Samuel Just over 3 years ago

I spot checked an object in 119.385

sjust@reesi002:~$ sudo ceph pg 119.385 list_unfound | head -n 30

{
    "num_missing": 68,
    "num_unfound": 68,
    "objects": [
        {
            "oid": {
                "oid": "10006fc22f8.00000000",
                "key": "",
                "snapid": -2,
                "hash": 1542341509,
                "max": 0,
                "pool": 119,
                "namespace": "" 
            },
            "need": "7936015'645867",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
            "locations": [
                "61(2)",
                "121(0)",
                "124(1)" 
            ]
        },

It's actually present on osd 69:

sjust@mira060:~$ sudo ./cephadm shell --name osd.69
Inferring fsid 28f7427e-5558-4ffd-ae1a-51ec3042759a
Inferring config /var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/osd.69/config
Using recent ceph image quay.ceph.io/ceph-ci/ceph:a7b62340ad27b26642bb54ec916a97c7e1f07a34
root@mira060:/# cd /var/lib/ceph/osd/ceph-69/
root@mira060:/var/lib/ceph/osd/ceph-69# ceph-objectstore-tool --data-path . --op list --pg 119.385 | less
root@mira060:/var/lib/ceph/osd/ceph-69# ceph-objectstore-tool --data-path . --op list --pg 119.385 | grep 10006fc22f8.00000000
["119.385s3",{"oid":"10006fc22f8.00000000","key":"","snapid":-2,"hash":1542341509,"max":0,"pool":119,"namespace":"","shard_id":3,"max":0}]

#5

Updated by Samuel Just over 3 years ago

Unfortunately, it's present but at an older version:

root@mira060:/var/lib/ceph/osd/ceph-69# ceph-objectstore-tool --data-path . '["119.385s3",{"oid":"10006fc22f8.00000000","key":"","snapid":-2,"hash":1542341509,"max":0,"pool":119,"namespace":"","shard_id":3,"max":0}]' dump      
{
    "id": {
        "oid": "10006fc22f8.00000000",
        "key": "",
        "snapid": -2,
        "hash": 1542341509,
        "max": 0,
        "pool": 119,
        "namespace": "",
        "shard_id": 3,
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "10006fc22f8.00000000",
            "key": "",
            "snapid": -2,
            "hash": 1542341509,
            "max": 0,
            "pool": 119,
            "namespace": "" 
        },
        "version": "7935315'645757",
...
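
The comparison is just a lexicographic compare of the (epoch, version) pair: the PG needs 7936015'645867 but this shard holds 7935315'645757, an older write. A minimal sketch of that check:

#!/usr/bin/env python3
# Eversions order by epoch first, then by version within the epoch. Compare
# the version the PG needs (from `ceph pg 119.385 list_unfound`) with the
# version found on osd.69's shard (from the ceph-objectstore-tool dump above).

def ev(s):
    epoch, version = s.split("'")
    return int(epoch), int(version)

need = ev("7936015'645867")   # "need" reported by list_unfound
have = ev("7935315'645757")   # "version" in the on-disk object_info

print("on-disk copy is older than needed:", have < need)   # True
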
#6

Updated by Samuel Just over 3 years ago

[Wrong, used the wrong shard field]

#7

Updated by Samuel Just over 3 years ago

osd.122 doesn't seem to have a 119.385 instance.

#8

Updated by Samuel Just over 3 years ago

My current working assumption is that PeeringState::activate is willing to backfill a non-contiguous replica which happens to have a necessary copy of an unreadable object. This wasn't possible before durable missing sets and trimming past last_complete, but it's possible now:

acting [0,1,2,3,NULL,NULL] write to version 20
acting [0,1,2,NULL,4,NULL] write to version 30, tail 10
acting [0,1,2,3,NULL,NULL] write to version 60, tail 40
acting [0,1,2,3,4(backfill),NULL]

This sequence is possible because, in the last interval, osd.3 will simply have 10 missing-set entries which will end up trimmed, leaving the pg with unfound objects that become permanently unfound once osd.4 is marked backfill.
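
A toy model of that bookkeeping (not Ceph code, just the arithmetic): osd.3's missed writes become durable missing-set entries, the log is trimmed past them, and osd.4, the only peer holding those writes, is later judged non-contiguous and marked backfill, so the objects stay unfound.

#!/usr/bin/env python3
# Toy model (not Ceph code) of the sequence above, tracking only each
# shard's last_update and the primary's log window.

def log_contiguous(peer_last_update, primary_log_tail):
    # A peer can be recovered from the log only if its log overlaps ours;
    # otherwise activate() marks it backfill.
    return peer_last_update >= primary_log_tail

# interval 1: acting [0,1,2,3,-,-], writes up to version 20
# interval 2: acting [0,1,2,-,4,-], writes up to 30 (osd.3 misses 21..30)
# interval 3: acting [0,1,2,3,-,-], writes up to 60, log trimmed to tail 40
last_update = {0: 60, 1: 60, 2: 60, 3: 60, 4: 30}
primary_log_tail = 40

# osd.3 keeps durable missing-set entries for 21..30, but those versions
# are now older than the log tail, so the data must come from a peer.
missing_on_osd3 = list(range(21, 31))
print("osd.3 missing:", missing_on_osd3, "log tail:", primary_log_tail)

# interval 4: osd.4 (which holds 21..30) rejoins, but its log is
# non-contiguous, so it is marked backfill and its copies are never used.
print("osd.4 contiguous:", log_contiguous(last_update[4], primary_log_tail))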

#9

Updated by Samuel Just over 3 years ago

If this guess is correct, one or more of the following 8 osds should have a copy of {"oid":"10006fc22f8.00000000","key":"","snapid":-2,"hash":1542341509,"max":0,"pool":119,"namespace":"","shard_id":?,"max":0} (each can be checked with ceph-objectstore-tool; a helper sketch follows the list below):

        "peer": "31(4)", 
            "last_update": "8322636'789214",
            "last_complete": "8322636'789214",
            "log_tail": "8304626'786030",
            "last_backfill": "MIN",
        "peer": "36(4)",
            "last_update": "7942013'651879",
            "last_complete": "7942013'651879",
            "log_tail": "7937936'649586",
            "last_backfill": "MIN",
        "peer": "65(4)",
            "last_update": "7941969'651868",
            "last_complete": "7941969'651868",
            "log_tail": "7937936'649586",
            "last_backfill": "MIN",
        "peer": "81(4)",
            "last_update": "7941969'651868",
            "last_complete": "7941969'651868",
            "log_tail": "7937936'649586",
            "last_backfill": "MIN",
        "peer": "94(4)",
            "last_update": "7941920'651866",
            "last_complete": "7941920'651866",
            "log_tail": "7937936'649586",
            "last_backfill": "MIN",

        "peer": "13(5)", <-- apparently empty
            "last_update": "7936103'646007",
            "last_complete": "7936103'646007",
            "log_tail": "7927639'643715",
            "last_backfill": "MIN",
        "peer": "23(5)", <-- apparently empty
            "last_update": "8078679'702826",
            "last_complete": "8078679'702826",
            "log_tail": "8071492'700187",
            "last_backfill": "MIN",
        "peer": "145(5)", <-- apparently empty
            "last_update": "7936058'645952",
            "last_complete": "7936058'645952",
            "log_tail": "7927639'643715",
            "last_backfill": "MIN",
#10

Updated by Samuel Just over 3 years ago

Yeah, osd.31 has a copy:

root@mira093:/var/lib/ceph/osd/ceph-31# ceph-objectstore-tool --data-path . '["119.385s4",{"oid":"10006fc22f8.00000000","key":"","snapid":-2,"hash":1542341509,"max":0,"pool":119,"namespace":"","shard_id":4,"max":0}]' dump
{
    "id": {
        "oid": "10006fc22f8.00000000",
        "key": "",
        "snapid": -2,
        "hash": 1542341509,
        "max": 0,
        "pool": 119,
        "namespace": "",
        "shard_id": 4,
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "10006fc22f8.00000000",
            "key": "",
            "snapid": -2,
            "hash": 1542341509,
            "max": 0,
            "pool": 119,
            "namespace": "" 
        },
        "version": "7936015'645867",
        "prior_version": "7935315'645757",
...

#11

Updated by Samuel Just over 3 years ago

Ok, with osd.31 we have 4 copies of the correct version of this object. However, min_size is 5, so one would assume that we should have a 5th copy somewhere. Here's a somewhat more complicated sequence that could explain it using recovery-below-min-size:

epoch 9 acting [0,1,2,3,4,5] 0,1,2,3,4,5 commit up to 9'90
epoch 10 acting [0,1,2,3,4,NULL] 0,1,2,3 start at commit up to 10'110 but 4 only commits to 10'100
epoch 12 peered [0,1,2,NULL,4,6(backfill)] recovery below min_size, 4 is fast-forwarded to 10'110 with 10 missing objects, 6 has an empty pg created with last_backfill=MIN, recovery_unfound
epoch 13 acting [0,1,2,NULL,4,5] logs are contiguous, osds recover everything but the 10 objects missing on 5 and write up to 13'1000
epoch 14 acting [0,1,2,3(backfill),NULL,5] osd.3 has the 4th copies of the objects from 10'100~10'110, but gets marked backfill during activate() because its log is non-contiguous

Nvm, we'd roll back in epoch 11 above, and recovery below min_size would not set last_epoch_started, so we should still be unable to perform writes until we have min_size copies of prior-interval objects. Still thinking about that part, but I think the activate() bug above is still the primary problem.

#12

Updated by Samuel Just over 3 years ago

proc_master_log -> merge_log -> log.roll_forward_to -> advance_can_rollback_to

I think recovery below min_size (peered) is clearing can_rollback_to. That probably explains how osd.69's copy didn't end up authoritative.
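
A toy illustration (not the real PGLog code) of why that matters: once can_rollback_to is advanced past a divergent entry, that entry can no longer be rolled back, so an older-but-present shard like osd.69's can't be made authoritative.

#!/usr/bin/env python3
# Toy model (not the real PGLog code): entries newer than can_rollback_to
# still carry rollback info; roll_forward_to / advance_can_rollback_to
# discards that ability up to the given version.

class ToyLog:
    def __init__(self, can_rollback_to):
        self.can_rollback_to = can_rollback_to      # (epoch, version)

    def can_rollback(self, ver):
        return ver > self.can_rollback_to

    def roll_forward_to(self, ver):
        # proc_master_log -> merge_log -> log.roll_forward_to
        #                 -> advance_can_rollback_to
        self.can_rollback_to = max(self.can_rollback_to, ver)

# osd.69's shard sits at 7935315'645757; the divergent write above it is
# rollbackable only while can_rollback_to stays below it.
log = ToyLog(can_rollback_to=(7935315, 645757))
print(log.can_rollback((7936015, 645867)))    # True: could still roll back

# If merge_log fast-forwards can_rollback_to even though no new entries
# were added, the divergent write loses its rollback info:
log.roll_forward_to((7936015, 645867))
print(log.can_rollback((7936015, 645867)))    # False: stuck unfound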

#13

Updated by Samuel Just over 3 years ago

epoch 100 [0, 1, 2, 3 , 4, 5 ] (100'950, 100'1000]
epoch 110 [0, 1, 2, 3 , 4, N ] 0,1,2,3 (100'960, 110'1020] 4 (100'955, 110'1010]
epoch 111 [0, 1, 2, 3 , N, 6b] (100'960, 110'1020] 6 empty, 0,1,2,3 crt = 110'1020
epoch 112 [0, 1, 2, N , 4, 5 ] (112'1050, 112'2000] 10 objects unfound
epoch 113 [0, 1, 2, 3b, 4, 5 ] 3 has been marked backfill despite having the only copy

#14

Updated by Neha Ojha about 3 years ago

2021-02-05T05:08:40.959 INFO:tasks.ceph.ceph_manager.ceph:PG 3.1a is not active+clean
2021-02-05T05:08:40.959 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.1a', 'version': "854'670", 'reported_seq': '3258', 'reported_epoch': '1214', 'state': 'active+recovery_unfound+undersized+degraded+remapped', 'last_fresh': '2021-02-05T05:08:33.464519+0000', 'last_change': '2021-02-05T04:48:31.561421+0000', 'last_active': '2021-02-05T05:08:33.464519+0000', 'last_peered': '2021-02-05T05:08:33.464519+0000', 'last_clean': '2021-02-05T04:46:59.032407+0000', 'last_became_active': '2021-02-05T04:48:28.963999+0000', 'last_became_peered': '2021-02-05T04:48:28.963999+0000', 'last_unstale': '2021-02-05T05:08:33.464519+0000', 'last_undegraded': '2021-02-05T04:48:28.917627+0000', 'last_fullsized': '2021-02-05T04:48:28.915212+0000', 'mapping_epoch': 542, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 248, 'last_epoch_clean': 419, 'parent': '0.0', 'parent_split_bits': 5, 'last_scrub': "337'483", 'last_scrub_stamp': '2021-02-05T04:45:01.365816+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-02-05T04:39:32.894141+0000', 'last_clean_scrub_stamp': '2021-02-05T04:45:01.365816+0000', 'log_size': 670, 'ondisk_log_size': 670, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 209, 'stat_sum': {'num_bytes': 41926656, 'num_objects': 22, 'num_object_clones': 20, 'num_object_copies': 132, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 6, 'num_objects_misplaced': 0, 'num_objects_unfound': 2, 'num_objects_dirty': 22, 'num_whiteouts': 0, 'num_read': 70, 'num_read_kb': 76186, 'num_write': 118, 'num_write_kb': 30480, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 107, 'num_bytes_recovered': 113426432, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [11, 1, 10, 5, 0, 9], 'acting': [11, 2147483647, 2147483647, 5, 0, 9], 'avail_no_missing': ['11(0)', '5(3)', '7(1)', '8(2)', '9(5)'], 'object_location_counts': [{'shards': '0(4),5(3),9(5),11(0)', 'objects': 20}, {'shards': '5(3),9(5),11(0)', 'objects': 2}], 'blocked_by': [], 'up_primary': 11, 'acting_primary': 11, 'purged_snaps': []}
2021-02-05T05:08:40.960 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 116, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 1195, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 2585, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

2021-02-05T05:08:40.960 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 1070, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 116, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 1195, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph_74275226ac79999bfd40e683dc9a1309e76033bf/qa/tasks/ceph_manager.py", line 2585, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

/a/teuthology-2021-02-03_03:31:04-rados-pacific-distro-basic-smithi/5852263 - with logs

#16

Updated by Samuel Just about 3 years ago

I think this has the same root cause as https://tracker.ceph.com/issues/48613. Accepting IO below min_size I think could plausibly account for all of the above.

#17

Updated by Nathan Cutler about 3 years ago

Samuel Just wrote:

I think this has the same root cause as https://tracker.ceph.com/issues/48613. Accepting IO below min_size I think could plausibly account for all of the above.

OK, so close this as a duplicate of #48613, which was fixed by https://github.com/ceph/ceph/pull/40572?

#18

Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Duplicate