Project

General

Profile

Actions

Bug #61594

open

recovery_ops_reserved leak -- pg stuck in state recovering

Added by Samuel Just 11 months ago. Updated 11 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Original state:

# ceph -s
  cluster:
    id:     0fc51c26-02dd-11ee-82fa-000af7995d6c
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 229496390/1146156225 objects degraded (20.023%), 3660 pgs degraded, 3660 pgs undersized

  services:
    mon: 3 daemons, quorum f28-h21-000-r630,f28-h23-000-r630,f28-h22-000-r630 (age 28h)
    mgr: f28-h22-000-r630.whjflb(active, since 28h), standbys: f28-h23-000-r630.zctfhz, f28-h21-000-r630.opssdk
    osd: 192 osds: 192 up (since 14h), 192 in (since 15h); 4003 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 8 daemons active (8 hosts, 1 zones)

  data:
    pools:   7 pools, 4769 pgs
    objects: 191.03M objects, 9.7 TiB
    usage:   28 TiB used, 333 TiB / 361 TiB avail
    pgs:     229496390/1146156225 objects degraded (20.023%)
             49389321/1146156225 objects misplaced (4.309%)
             3652 active+undersized+degraded+remapped+backfill_wait
             765  active+clean
             343  active+remapped+backfill_wait
             7    active+recovery_wait+undersized+degraded+remapped
             1    active+recovery_wait
             1    active+recovering+undersized+degraded+remapped

pg dump:

# ceph pg dump | grep recovering
7.fd       47052                   0         2          0        0  2625982464            0           0  1486      1486     active+recovering+undersized+degraded+remapped  2023-06-05T03:27:30.378744+0000    2721'51980   2794:163727      [36,104,12,98,37,159]          36       [36,104,12,98,23,NONE]              36          0'0  2023-06-04T13:49:14.163307+0000              0'0  2023-06-04T13:49:14.163307+0000              0                    0  queued for scrub                                                           0                0

dump_op_pq_state on osd.36 (primary)

# ceph --admin-daemon /var/run/ceph/0fc51c26-02dd-11ee-82fa-000af7995d6c/ceph-osd.36.asok dump_op_pq_state
{
    "OSD:ShardedOpWQ:0": {
        "queue_sizes": {
            "immediate": 0,
            "scheduler": 0
        },
        "mClockClients": {
            "client_count": 0,
            "clients": "{ PriorityQueue:: HEAPS-EMPTY }" 
        },
        "mClockQueues": {
            "queues": "RESER:LIMIT:READY:" 
        }
    },
    "OSD:ShardedOpWQ:1": {
        "queue_sizes": {
            "immediate": 0,
            "scheduler": 0
        },
        "mClockClients": {
            "client_count": 0,
            "clients": "{ PriorityQueue:: HEAPS-EMPTY }" 
        },
        "mClockQueues": {
            "queues": "RESER:LIMIT:READY:" 
        }
    },
    "OSD:ShardedOpWQ:2": {
        "queue_sizes": {
            "immediate": 0,
            "scheduler": 0
        },
        "mClockClients": {
            "client_count": 0,
            "clients": "{ PriorityQueue:: HEAPS-EMPTY }" 
        },
        "mClockQueues": {
            "queues": "RESER:LIMIT:READY:" 
        }
    },
    "OSD:ShardedOpWQ:3": {
        "queue_sizes": {
            "immediate": 0,
            "scheduler": 0
        },
        "mClockClients": {
            "client_count": 15,
            "clients": "{ PriorityQueue::  { client:{ class_id: 3 client_id: 19145 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 19145 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8446 p:732788.9572 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 19210 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 19210 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8379 p:985550.2676 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 19235 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 19235 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7961 p:601269.3723 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 19290 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 19290 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3795 p:476374.6644 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 19340 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 19340 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7979 p:633339.5302 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 28926 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 28926 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3236 p:590459.0414 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37759 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37759 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3253 p:923097.1278 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37794 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37794 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8433 p:605618.9912 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37804 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37804 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8466 p:238305.3975 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37839 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37839 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8365 p:340187.5934 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37854 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37854 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7976 p:703405.5547 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37889 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 37909 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 37909 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7992 p:718490.5572 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 38583 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 38583 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8426 p:854006.4917 l:0.0000 } req_count:0 top_req:none } }  { client:{ class_id: 3 client_id: 38603 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 3 client_id: 38603 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8447 p:599985.7752 l:0.0000 } req_count:0 top_req:none } } { reservation_top:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none } } { ready_top:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none } } { limit_top:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none } } }" 
        },
        "mClockQueues": {
            "queues": "RESER:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19235 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7961 p:601269.3723 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19290 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3795 p:476374.6644 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37854 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7976 p:703405.5547 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37759 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3253 p:923097.1278 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37839 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8365 p:340187.5934 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 38603 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8447 p:599985.7752 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 28926 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3236 p:590459.0414 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 38583 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8426 p:854006.4917 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19210 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8379 p:985550.2676 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37804 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8466 p:238305.3975 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19145 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8446 p:732788.9572 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37909 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7992 p:718490.5572 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19340 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7979 p:633339.5302 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37794 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8433 p:605618.9912 l:0.0000 } req_count:0 top_req:none }LIMIT:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19235 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7961 p:601269.3723 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19290 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3795 p:476374.6644 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37854 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7976 p:703405.5547 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37804 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8466 p:238305.3975 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 38603 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8447 p:599985.7752 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 28926 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3236 p:590459.0414 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19340 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7979 p:633339.5302 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37759 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3253 p:923097.1278 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 38583 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8426 p:854006.4917 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19210 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8379 p:985550.2676 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37909 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7992 p:718490.5572 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19145 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8446 p:732788.9572 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37794 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8433 p:605618.9912 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37839 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8365 p:340187.5934 l:0.0000 } req_count:0 top_req:none }READY:{ ClientRec:: client:{ class_id: 3 client_id: 37889 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3778 p:221586.1626 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19235 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7961 p:601269.3723 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19290 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989233.3795 p:476374.6644 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37854 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.7976 p:703405.5547 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37804 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8466 p:238305.3975 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 28926 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3236 p:590459.0414 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 38603 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8447 p:599985.7752 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19145 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8446 p:732788.9572 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 38583 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8426 p:854006.4917 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37909 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7992 p:718490.5572 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19210 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8379 p:985550.2676 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37759 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989224.3253 p:923097.1278 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37839 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989223.8365 p:340187.5934 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 19340 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:988849.7979 p:633339.5302 l:0.0000 } req_count:0 top_req:none }, { ClientRec:: client:{ class_id: 3 client_id: 37794 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:989217.8433 p:605618.9912 l:0.0000 } req_count:0 top_req:none }" 
        }
    },
    "OSD:ShardedOpWQ:4": {
        "queue_sizes": {
            "immediate": 0,
            "scheduler": 0
        },
        "mClockClients": {
            "client_count": 0,
            "clients": "{ PriorityQueue:: HEAPS-EMPTY }" 
        },
        "mClockQueues": {
            "queues": "RESER:LIMIT:READY:" 
        }
    }
}

Odd that we are enumerating all clients and that they are clustered on a single shard, but unrelated to this bug (see TODO).

PG query is attached, but doesn't show anything interesting, no unfound objects.

ceph osd down 36 causes the pg to repeer and end up in the same state.

Enabled debug_osd = 20 on 36, marked it down (twice, actually), captured log (attached gzipped log with 7.fds0 grepped out).

# zgrep recovery_ops_reserved ceph-osd.36.log.gz 
2023-06-05T20:40:33.376+0000 7fbd0d668640 10 osd.36 2866 release_reserved_pushes(0), recovery_ops_reserved 3 -> 3
2023-06-05T20:40:33.796+0000 7fbd0d668640 10 osd.36 2867 release_reserved_pushes(0), recovery_ops_reserved 3 -> 3
...
2023-06-05T20:57:00.185+0000 7fbd0d668640 10 osd.36 2919 release_reserved_pushes(0), recovery_ops_reserved 3 -> 3
2023-06-05T20:57:01.263+0000 7fbd0d668640 10 osd.36 2920 release_reserved_pushes(0), recovery_ops_reserved 3 -> 3
# zgrep _recover_now ceph-osd.36.log.gz 
2023-06-05T20:39:58.289+0000 7fbd1867e640 15 osd.36 2865 _recover_now active 0 + reserved 3 >= max 3
2023-06-05T20:39:59.243+0000 7fbd1867e640 15 osd.36 2865 _recover_now active 0 + reserved 3 >= max 3
2023-06-05T20:40:00.229+0000 7fbd1867e640 15 osd.36 2865 _recover_now active 0 + reserved 3 >= max 3
...
2023-06-05T20:57:14.875+0000 7fbd1867e640 15 osd.36 2920 _recover_now active 0 + reserved 3 >= max 3
2023-06-05T20:57:15.905+0000 7fbd1867e640 15 osd.36 2920 _recover_now active 0 + reserved 3 >= max 3
2023-06-05T20:57:16.920+0000 7fbd1867e640 15 osd.36 2920 _recover_now active 0 + reserved 3 >= max 3

There are no cases where recovery_ops_reserved is below 3 in the log. Presumably, the leak happened prior to the logs being enabled.


Files

7.fd_query.out (78.5 KB) 7.fd_query.out Samuel Just, 06/06/2023 12:06 AM
ceph-osd.36-7.fds0.out.gz (157 KB) ceph-osd.36-7.fds0.out.gz Samuel Just, 06/06/2023 12:09 AM
Actions #1

Updated by Samuel Just 11 months ago

Actions #2

Updated by Samuel Just 11 months ago

Actions #3

Updated by Samuel Just 11 months ago

  • Description updated (diff)
Actions #4

Updated by Radoslaw Zarzynski 11 months ago

bump this up for the next scrub

Actions

Also available in: Atom PDF