Support #38156

MDS Behind on trimming but using no CPU or disk IO.

Added by Michael Jones 8 months ago. Updated 8 months ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 02/03/2019
Due date:
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

I have a cluster with three nodes.

Mimir: MDS, MON, MGR
Fenrir: MDS, MON, MGR, 8 OSDs
Hoenir: MDS, MON, MGR, 8 OSDs

Ceph health detail tells me:

ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; Reduced data availability: 86 pgs inactive; Degraded data redundancy: 360875/6923320 objects degraded (5.212%), 80 pgs degraded, 86 pgs undersized
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsmimir(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 494 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdsmimir(mds.0): 1 slow requests are blocked > 30 secs
MDS_TRIM 1 MDSs behind on trimming
mdsmimir(mds.0): Behind on trimming (1806/128) max_segments: 128, num_segments: 1806
PG_AVAILABILITY Reduced data availability: 86 pgs inactive
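To put numbers on the output above, here is a minimal sketch that pulls the segment counts out of the MDS_TRIM detail line and recomputes the degraded-object percentage. The `parse_trim` helper and the regular expression are illustrative assumptions based only on the line format pasted here, not a Ceph API.

```python
import re

# Line copied from the `ceph health detail` output above.
line = ("mdsmimir(mds.0): Behind on trimming (1806/128) "
        "max_segments: 128, num_segments: 1806")


def parse_trim(text):
    """Return (num_segments, max_segments) from an MDS_TRIM detail line.

    Hypothetical helper: it assumes the "(num/max)" layout shown in this
    report and returns None when the line does not match.
    """
    match = re.search(r"Behind on trimming \((\d+)/(\d+)\)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


num_segments, max_segments = parse_trim(line)
backlog = num_segments - max_segments   # segments above the limit
ratio = num_segments / max_segments     # how far over the limit the journal is

# Degraded object counts copied from the HEALTH_WARN summary line.
degraded, total = 360875, 6923320
degraded_pct = 100 * degraded / total

print(f"{num_segments} segments against a limit of {max_segments}: "
      f"{backlog} over, {ratio:.1f}x the limit")
print(f"{degraded_pct:.3f}% of objects degraded")
```

Running the same parse on output captured at different times would show whether the segment count is actually frozen at 1806 or merely changing slowly.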

What's notable about this is that it was at 1806 segments to trim 12 hours ago, and when I restart the Mimir MDS process, another MDS quickly starts using a lot of CPU, gets up to 1806 segments to trim, and then goes down to no CPU usage.

Could something be stuck?

I have 86 inactive pgs, but none of my OSDs are offline, and since I created this cluster I have not had any drive failures.

My workload was to use rsync to transfer several TBs of data to the cluster via the ceph-fuse module.

Attached are the logs from each machine after a reboot and letting them run for a while.

What information can I provide that will help figure this out?

The contents of my ceph.conf file are as follows:

[global]
fsid = 07cb5105-68ea-4f1c-bace-a2be0baae5fa
cluster = ceph
ms bind ipv6 = true
public network = fda8:0941:2491:1699::/64
cluster network = fdd7:d94b:3c2e:b69f::/64
##
# For version 0.55 and beyond, you must explicitly enable
# or disable authentication with "auth" entries in [global].
##
auth client required = cephx
auth service required = cephx
auth cluster required = cephx

[mon]
mon initial members = hoenir fenrir mimir
mon host = hoenir fenrir mimir
mon addr = fda8:0941:2491:1699:75ec:3651:86c3:2e88 fda8:0941:2491:1699:0b45:a2e6:1383:2b98 fda8:0941:2491:1699:60fa:e622:8345:2162

[mon.hoenir]
host = hoenir
addr = fda8:0941:2491:1699:75ec:3651:86c3:2e88

[mon.fenrir]
host = fenrir
addr = fda8:0941:2491:1699:0b45:a2e6:1383:2b98

[mon.mimir]
host = mimir
addr = fda8:0941:2491:1699:60fa:e622:8345:2162

[osd]
osd pool default size = 1
osd pool default min size = 1
osd crush chooseleaf type = 0

[mds]

[mgr]

ceph.tar.xz (800 KB) Michael Jones, 02/03/2019 08:18 AM

History

#1 Updated by Michael Jones 8 months ago

Ceph version information

mimir ~ # emerge --info ceph
Portage 2.3.51 (python 3.6.5-final-0, default/linux/amd64/17.0, gcc-7.3.0, glibc-2.27-r6, 4.18.14-gentoo x86_64)
=================================================================
System Settings
=================================================================
System uname: Linux-4.18.14-gentoo-x86_64-AMD_E-350D_APU_with_Radeon-tm-_HD_Graphics-with-gentoo-2.6
KiB Mem: 16133864 total, 8246988 free
KiB Swap: 0 total, 0 free
Timestamp of repository gentoo: Thu, 31 Jan 2019 23:27:13 +0000
Head commit of repository gentoo: 331e9c150449199020dc862c26506e56560dcf6d

Head commit of repository jonesmz-public-overlay: 63a835bf1bc732c8b88ae8f5900f3ef68b984c71

Head commit of repository steam-overlay: 11a4ca7b45e9d9bdf26508bfaa67e2b769b91dae

sh bash 4.4_p23-r1
ld GNU ld (Gentoo 2.30 p5) 2.30.0
distcc 3.2rc1 x86_64-pc-linux-gnu [disabled]
app-shells/bash: 4.4_p23-r1::gentoo
dev-lang/perl: 5.26.2::gentoo
dev-lang/python: 2.7.15::gentoo, 3.6.5::gentoo
dev-util/cmake: 3.9.6::gentoo
dev-util/pkgconfig: 0.29.2::gentoo
sys-apps/baselayout: 2.6-r1::gentoo
sys-apps/sandbox: 2.13::gentoo
sys-devel/autoconf: 2.69-r4::gentoo
sys-devel/automake: 1.11.6-r3::gentoo, 1.16.1-r1::gentoo
sys-devel/binutils: 2.30-r4::gentoo
sys-devel/gcc: 7.3.0-r3::gentoo
sys-devel/gcc-config: 2.0::gentoo
sys-devel/libtool: 2.4.6-r3::gentoo
sys-devel/make: 4.2.1-r4::gentoo
sys-kernel/linux-headers: 4.14-r1::gentoo (virtual/os-headers)
sys-libs/glibc: 2.27-r6::gentoo
Repositories:

gentoo
location: /usr/portage
sync-type: git
sync-uri: git://anongit.gentoo.org/repo/sync/gentoo.git
priority: -1000

jonesmz-public-overlay
location: /usr/portage-overlays/jonesmz-public-overlay
sync-type: git
sync-uri: https://github.com/jonesmz/gentoo-overlay.git
masters: gentoo

steam-overlay
location: /usr/portage-overlays/steam-overlay
sync-type: git
sync-uri: https://github.com/anyc/steam-overlay.git
masters: gentoo
priority: 50

Installed sets: @pc-base-system, @portage
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe -march=x86-64 -mtune=generic -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -pipe -O2 -pipe -march=x86-64 -mtune=generic -O2 -pipe"
DISTDIR="/usr/portage-distfiles"
EMERGE_DEFAULT_OPTS=" --jobs --keep-going --newuse --deep --backtrack=3000 --complete-graph --with-bdeps=y --usepkg"
ENV_UNSET="DBUS_SESSION_BUS_ADDRESS DISPLAY GOBIN PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs buildpkg clean-logs compress-build-logs compressdebug config-protect-if-modified distlocks ebuild-locks fixlafiles installsources merge-sync multilib-strict news nostrip parallel-fetch parallel-install preserve-libs protect-owned sandbox sfperms split-elog split-log strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="en_US.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
LINGUAS="en en_US"
MAKEOPTS="-j1"
PKGDIR="/usr/portage-packages"
PORTAGE_COMPRESS="xz"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_TMPDIR="/var/tmp"
USE="acl amd64 avahi btrfs bzip2 clang crypt cxx dbus gd gudev hardened iconv ipv6 libtirpc lm_sensors multilib ncurses nls nptl openmp pam pcre pie python readline samba seccomp ssl ssp systemd threads udev udisks unicode v4l xattr xtpax zeroconf zlib" ABI_X86="64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core authz_host dir mime unixd socache_shmcb info log_config" CALLIGRA_FEATURES="karbon sheets words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="mmx sse sse2 mmxext" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock isync itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 timing tsip tripmate tnt ublox ubx" GRUB_PLATFORMS="coreboot efi-64 emu qemu pc" INPUT_DEVICES="libinput" KERNEL="linux" L10N="en en-US" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" NETBEANS_MODULES="apisupport cnd groovy gsf harness ide identity j2ee java mobility nb php profiler soa visualweb webcommon websvccommon xml" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-6 php7-1" POSTGRES_TARGETS="postgres9_5 postgres10" PYTHON_SINGLE_TARGET="python3_6" PYTHON_TARGETS="python2_7 python3_6" QEMU_SOFTMMU_TARGETS="arm aarch64 x86_64" QEMU_USER_TARGETS="arm aarch64 x86_64" RUBY_TARGETS="ruby24" USERLAND="GNU" VIDEO_CARDS="r600 radeon radeonsi amdgpu vesa modesetting fbdev qxl" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset: CC, CPPFLAGS, CTARGET, CXX, INSTALL_MASK, LC_ALL, PORTAGE_BINHOST, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

=================================================================
Package Settings
=================================================================

sys-cluster/ceph-13.2.4::gentoo was built with the following:
USE="cephfs fuse mgr radosgw ssl systemd tcmalloc -babeltrace -dpdk -jemalloc -ldap -lttng (-static-libs) (-system-boost) -test -xfs -zfs" ABI_X86="(64)" CPU_FLAGS_X86="sse sse2 -sse3 -sse4_1 -sse4_2 -ssse3" PYTHON_TARGETS="python2_7 python3_6 -python3_4 -python3_5"

mimir ~ #

#2 Updated by Patrick Donnelly 8 months ago

  • Tracker changed from Bug to Support
  • Status changed from New to Rejected

Please try ceph-users for help.

(From a very cursory glance at the information you provided, the problem is that you have pgs inactive.)

#3 Updated by Michael Jones 8 months ago

Please recognize a bug for what it is. I'm not asking for help. I'm trying to help you.

Why would PGs be inactive when all 16 OSDs are online, and are the original OSDs for the cluster?

Your software gives no indication of what might be wrong. Let's work together to improve it.

Or do you think it's acceptable for Ceph to say it's behind on trimming for 24 hours with 1806 segments, and no CPU usage?

Maybe I'm mistaken and that's normal behavior? If so, why does it result in a health warning?

Either there should be no health warning, or Ceph should explain what's actually wrong.
