Bug #37568 (closed): CephFS remove snapshot result in slow ops

Added by Francois Legrand over 5 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee: Zheng Yan
Category: -
Target version: v14.0.0
% Done: 0%
Source: Community (dev)
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Pull request ID: 25481
Crash signature (v1):
Crash signature (v2):

Description

Hello,
I have a Ceph Mimic cluster with CephFS.
I created a few snapshots (mkdir .snap/test, etc.) in different directories. So far so good.
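For context, CephFS snapshots are managed entirely through the hidden .snap directory inside the filesystem; a minimal sketch (the mount point and directory names below are hypothetical examples, not from this cluster):

    # create a snapshot of a directory (paths are hypothetical)
    mkdir /mnt/cephfs/mydir/.snap/test

    # list the snapshots that exist for that directory
    ls /mnt/cephfs/mydir/.snap

    # remove the snapshot again -- the step that triggers the slow ops below
    rmdir /mnt/cephfs/mydir/.snap/test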
But when I delete the snapshots (rmdir .snap/test, etc.), the cluster gets into a warning state:

    $ ceph -w
      cluster:
        id: 2fbbf089-a846-4c09-90bc-1dd9bd7af30f
        health: HEALTH_WARN
                3 slow ops, oldest one blocked for 11415 sec, mon.lpnceph01 has slow ops
      ...
    2018-12-06 16:54:56.356518 mon.lpnceph-mon01 [WRN] Health check update: 3 slow ops, oldest one blocked for 11410 sec, mon.lpnceph01 has slow ops (SLOW_OPS)
    2018-12-06 16:55:05.856294 mon.lpnceph-mon01 [WRN] Health check update: 3 slow ops, oldest one blocked for 11415 sec, mon.lpnceph01 has slow ops (SLOW_OPS)
    2018-12-06 16:55:10.856657 mon.lpnceph-mon01 [WRN] Health check update: 3 slow ops, oldest one blocked for 11425 sec, mon.lpnceph01 has slow ops (SLOW_OPS)
It's obviously related to the removal of snapshots, as the monitor's op queue shows:
    $ ceph daemon mon.lpnceph01 ops
    {
        "ops": [
            {
                "description": "remove_snaps({28=[3,4]} v0)",
                "initiated_at": "2018-12-06 13:44:41.396039",
                "age": 14549.148016,
                "duration": 14549.148028,
                "type_data": {
                    "events": [
                        { "time": "2018-12-06 13:44:41.396039", "event": "initiated" },
                        { "time": "2018-12-06 13:44:41.396039", "event": "header_read" },
                        { "time": "2018-12-06 13:44:41.396042", "event": "throttled" },
                        { "time": "2018-12-06 13:44:41.396089", "event": "all_read" },
                        { "time": "2018-12-06 13:44:41.396186", "event": "dispatched" },
                        { "time": "2018-12-06 13:44:41.396190", "event": "mon:_ms_dispatch" },
                        { "time": "2018-12-06 13:44:41.396191", "event": "mon:dispatch_op" },
                        { "time": "2018-12-06 13:44:41.396192", "event": "psvc:dispatch" },
                        { "time": "2018-12-06 13:44:41.396205", "event": "osdmap:preprocess_query" },
                        { "time": "2018-12-06 13:44:41.396214", "event": "osdmap:preprocess_remove_snaps" },
                        { "time": "2018-12-06 13:44:41.396220", "event": "forward_request_leader" },
                        { "time": "2018-12-06 13:44:41.396258", "event": "forwarded" }
                    ],
                    "info": {
                        "seq": 250448,
                        "src_is_mon": false,
                        "source": "mds.0 xxx.xxx.xxx.xxx:6800/2790459226",
                        "forwarded_to_leader": true
                    }
                }
            },
    ...
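Incidentally, the stuck entries can be pulled out of that dump directly; a small sketch, assuming jq is installed on the mon host (the 60-second threshold is an arbitrary example):

    # print only ops older than 60 seconds, with their description and age
    ceph daemon mon.lpnceph01 ops | jq '.ops[] | select(.age > 60) | {description, age}'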

I tried adding the following lines to ceph.conf:

    [osd]
    osd snap trim sleep = 0.6

as suggested in http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031227.html,
but it doesn't solve the problem.
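For reference, the same setting can also be pushed to running OSDs without touching ceph.conf; a minimal sketch using the standard injectargs mechanism (the value simply mirrors the config line above):

    # apply to all OSDs at runtime; no daemon restart needed
    ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 0.6'

    # spot-check the value on one daemon (osd.0 is an arbitrary example)
    ceph daemon osd.0 config get osd_snap_trim_sleep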

I had to restart the service:

    systemctl restart

to get the cluster back to a healthy status.
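Presumably the restarted unit was the monitor daemon; a hypothetical reconstruction (the unit name is a guess based on the mon id in the logs above, not stated in the report):

    # hypothetical: mon id taken from the log excerpts, unit name is a guess
    systemctl restart ceph-mon@lpnceph01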


Related issues (4 total: 0 open, 4 closed)

Has duplicate: Ceph - Bug #37782: Snapshot removal hangs (Duplicate, 01/03/2019)
Has duplicate: CephFS - Bug #24088: mon: slow remove_snaps op reported in cluster health log (Duplicate, Zheng Yan, 05/10/2018)
Copied to: Ceph - Backport #37693: mimic: CephFS remove snapshot result in slow ops (Resolved, Prashant D)
Copied to: Ceph - Backport #37694: luminous: CephFS remove snapshot result in slow ops (Resolved, Prashant D)
#1

Updated by Patrick Donnelly over 5 years ago

  • Subject changed from Cephfs remove snapshot result in slow ops to CephFS remove snapshot result in slow ops
  • Assignee set to Zheng Yan
  • Priority changed from Normal to High
  • Target version set to v14.0.0
#2

Updated by Zheng Yan over 5 years ago

  • Project changed from CephFS to Ceph
  • Category deleted (89)
  • Status changed from New to Fix Under Review
  • Backport set to mimic,luminous
  • Pull request ID set to 37568
#3

Updated by Zheng Yan over 5 years ago

  • Pull request ID changed from 37568 to 25481
#4

Updated by Patrick Donnelly over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #37693: mimic: CephFS remove snapshot result in slow ops added
#6

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #37694: luminous: CephFS remove snapshot result in slow ops added
#7

Updated by Patrick Donnelly over 5 years ago

  • Has duplicate Bug #37782: Snapshot removal hangs added
#8

Updated by Nathan Cutler about 5 years ago

  • Status changed from Pending Backport to Resolved
#9

Updated by Patrick Donnelly over 4 years ago

  • Has duplicate Bug #24088: mon: slow remove_snaps op reported in cluster health log added
#10

Updated by Janek Bevendorff about 3 years ago

I can reproduce this on 15.2.8. I have 30 PGs in the active+clean+snaptrim state and about 1500-2500 slow ops. This happens regularly.
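As a quick way to check for the same symptom, the PGs currently trimming and the related health detail can be listed (a sketch; the state filter for ceph pg ls is available on these releases):

    # list PGs currently in the snaptrim state
    ceph pg ls snaptrim

    # full health detail, including any SLOW_OPS entries
    ceph health detail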

#11

Updated by Patrick Donnelly about 3 years ago

Janek Bevendorff wrote:

    I can reproduce this on 15.2.8. I have 30 PGs in the active+clean+snaptrim state and about 1500-2500 slow ops. This happens regularly.

This may be unrelated. Can you create a new tracker ticket with logs, etc.?

#12

Updated by Janek Bevendorff about 3 years ago

I'm actually no longer sure we're really having an issue here. If I find anything, I'll open a new issue.

