Support #38156

MDS Behind on trimming but using no CPU or disk IO.

Added by Michael Jones over 5 years ago. Updated over 5 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

I have a cluster with three nodes.

Mimir: MDS, MON, MGR
Fenrir: MDS, MON, MGR, 8 OSDs
Hoenir: MDS, MON, MGR, 8 OSDs
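
In case it's useful, this layout can be confirmed from any of the nodes with the standard status commands (nothing here is specific to my setup):

# Overall cluster, daemon, and PG status:
ceph -s
# Which host each OSD sits under:
ceph osd tree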

Ceph health detail tells me:

ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; Reduced data availability: 86 pgs inactive; Degraded data redundancy: 360875/6923320 objects degraded (5.212%), 80 pgs degraded, 86 pgs undersized
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsmimir(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 494 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdsmimir(mds.0): 1 slow requests are blocked > 30 secs
MDS_TRIM 1 MDSs behind on trimming
mdsmimir(mds.0): Behind on trimming (1806/128) max_segments: 128, num_segments: 1806
PG_AVAILABILITY Reduced data availability: 86 pgs inactive

What's notable is that it was already at 1806 segments to trim 12 hours ago, and it is still there now. When I restart the MDS process on Mimir, another MDS quickly starts using a lot of CPU, climbs to the same 1806 segments, and then drops back to using no CPU at all.

Could something be stuck?
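
If admin-socket dumps would help, these are what I can pull on the MDS host (assuming the daemon is named after the host, so mds.mimir):

# Requests currently blocked inside the MDS:
ceph daemon mds.mimir dump_ops_in_flight
# Outstanding RADOS operations the MDS is waiting on; these should point at whatever is stuck:
ceph daemon mds.mimir objecter_requests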

I have 86 inactive pgs, but none of my OSDs are offline, and since I created this cluster I have not had any drive failures.
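
For the inactive PGs, I can also collect the following (the PG id in the query line is only a placeholder):

# List the PGs stuck inactive:
ceph pg dump_stuck inactive
# Query one stuck PG for its state and acting set (1.2f is a placeholder id):
ceph pg 1.2f query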

My workload was using rsync to transfer several TB of data into the cluster through the ceph-fuse client.
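
Roughly like this, with illustrative paths rather than my real ones:

rsync -a /local/data/ /mnt/cephfs/data/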

Attached are logs from each machine, captured after rebooting them and letting them run for a while.

What information can I provide that will help figure this out?

The contents of my ceph.conf file are as follows:

[global]
fsid = 07cb5105-68ea-4f1c-bace-a2be0baae5fa
cluster = ceph
ms bind ipv6 = true
public network = fda8:0941:2491:1699::/64
cluster network = fdd7:d94b:3c2e:b69f::/64

##
# For version 0.55 and beyond, you must explicitly enable
# or disable authentication with "auth" entries in [global].
##
auth client required = cephx
auth service required = cephx
auth cluster required = cephx

[mon]
mon initial members = hoenir fenrir mimir
mon host = hoenir fenrir mimir
mon addr = fda8:0941:2491:1699:75ec:3651:86c3:2e88 fda8:0941:2491:1699:0b45:a2e6:1383:2b98 fda8:0941:2491:1699:60fa:e622:8345:2162

[mon.hoenir]
host = hoenir
addr = fda8:0941:2491:1699:75ec:3651:86c3:2e88

[mon.fenrir]
host = fenrir
addr = fda8:0941:2491:1699:0b45:a2e6:1383:2b98

[mon.mimir]
host = mimir
addr = fda8:0941:2491:1699:60fa:e622:8345:2162

[osd]
osd pool default size = 1
osd pool default min size = 1
osd crush chooseleaf type = 0

[mds]

[mgr]
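
For reference, in case the file above is not what the daemons actually loaded, the effective values can be checked at runtime (the daemon name below is just one of mine):

# Runtime value of a setting via the admin socket:
ceph daemon mon.mimir config show | grep chooseleaf
# Replication settings the pools actually use:
ceph osd pool ls detail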


Files

ceph.tar.xz (800 KB), Michael Jones, 02/03/2019 08:18 AM