Support #22566

Some OSDs remain at 100% CPU after upgrade jewel => luminous (v12.2.2), while others work

Added by David Casier about 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

I have some OSDs that stay at 100% CPU during startup, without any debug info in the logs:

2018-01-04 14:47:04.089343 7f48ebc5ad00  0 set uid:gid to 167:167 (ceph:ceph)
2018-01-04 14:47:04.089385 7f48ebc5ad00  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 16655
2018-01-04 14:47:04.096238 7f48ebc5ad00  0 pidfile_write: ignore empty --pid-file
2018-01-04 14:47:04.171870 7f48ebc5ad00  0 load: jerasure load: lrc load: isa
2018-01-04 14:47:04.172462 7f48ebc5ad00  0 filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342)
2018-01-04 14:47:04.173798 7f48ebc5ad00  0 filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342)
2018-01-04 14:47:04.174567 7f48ebc5ad00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2018-01-04 14:47:04.174590 7f48ebc5ad00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2018-01-04 14:47:04.174593 7f48ebc5ad00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: splice() is disabled via 'filestore splice' config option
2018-01-04 14:47:04.217075 7f48ebc5ad00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2018-01-04 14:47:04.217217 7f48ebc5ad00  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature: extsize is disabled by conf
2018-01-04 14:47:04.218392 7f48ebc5ad00  0 filestore(/var/lib/ceph/osd/ceph-11) start omap initiation
2018-01-04 14:47:04.223828 7f48ebc5ad00  1 leveldb: Recovering log #101117
2018-01-04 14:47:04.225093 7f48ebc5ad00  1 leveldb: Level-0 table #101119: started
2018-01-04 14:47:04.257471 7f48ebc5ad00  1 leveldb: Level-0 table #101119: 34419 bytes OK
2018-01-04 14:47:04.348955 7f48ebc5ad00  1 leveldb: Delete type=3 #101115

2018-01-04 14:47:04.349266 7f48ebc5ad00  1 leveldb: Delete type=0 #101117

2018-01-04 14:47:04.364568 7f48ebc5ad00  0 filestore(/var/lib/ceph/osd/ceph-11) mount(1757): enabling WRITEAHEAD journal mode: checkpoint is not enabled
2018-01-04 14:47:04.389939 7f48ebc5ad00  1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 27: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1

I have attached an strace of the process (it shows futex timeouts).
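To make a trace like this easier to read, the calls can be tallied by syscall and error code. The following is only a minimal sketch: it assumes plain-text strace output (optionally with a pid prefix and a `-t`/`-tt` timestamp), and the function name `summarize_strace` is hypothetical, not part of any Ceph tooling.

```python
import re
from collections import Counter

# Matches a completed strace line, for example:
#   16655 futex(0x7f48, FUTEX_WAIT_PRIVATE, 2, {0, 1000}) = -1 ETIMEDOUT (Connection timed out)
# The pid prefix ("16655 " or "[pid 16655] ") and the timestamp are optional.
STRACE_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?"
    r"(?P<syscall>\w+)\(.*\)\s+=\s+(?P<ret>-?\d+|0x[0-9a-f]+|\?)"
    r"(?:\s+(?P<errno>E[A-Z0-9]+))?"
)


def summarize_strace(lines):
    """Count completed syscalls, keyed by (syscall name, errno name or "ok").

    Lines that do not look like a completed call ("<unfinished ...>",
    "<... resumed>", signals, exit notices) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = STRACE_LINE.match(line.strip())
        if match:
            counts[(match.group("syscall"), match.group("errno") or "ok")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_strace.py ceph-osd.11.strace
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for (name, result), n in summarize_strace(handle).most_common(10):
                print(f"{n:8d}  {name}  {result}")
```

A large count for `("futex", "ETIMEDOUT")` only shows threads waking up from timed waits. It does not identify which thread is burning CPU.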

Environment:

CentOS 7.4, up to date

Linux sm02 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

glibc-2.17-196

What else do you need in order to diagnose this better? ceph-qa-suite?

ceph-osd.11.strace.zip (81 KB) David Casier, 01/04/2018 01:44 PM

History

#1 Updated by Patrick Donnelly about 6 years ago

  • Project changed from Ceph to RADOS

#2 Updated by Josh Durgin about 6 years ago

  • Tracker changed from Bug to Support

This is likely the one-time startup cost of accounting for a bug in omap, where the OSD has to scan the whole omap db and remove bad entries. Do these OSDs have large current/omap directories? If not, could you get a log with:

debug osd = 20
debug leveldb = 20
debug filestore = 20
debug ms = 1
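To check whether the affected OSDs have large current/omap directories, something like the following could be run on the OSD host. It is a rough sketch that assumes the filestore layout seen in the log above (`/var/lib/ceph/osd/ceph-<id>/current/omap`). The function names are made up for illustration.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all files below path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file may vanish while we walk, e.g. during a compaction.
                pass
    return total


def omap_sizes(osd_root="/var/lib/ceph/osd"):
    """Map each OSD directory name under osd_root to its current/omap size in bytes.

    OSD directories without a current/omap subdirectory are left out.
    """
    sizes = {}
    if not os.path.isdir(osd_root):
        return sizes
    for entry in sorted(os.listdir(osd_root)):
        omap = os.path.join(osd_root, entry, "current", "omap")
        if os.path.isdir(omap):
            sizes[entry] = dir_size_bytes(omap)
    return sizes


if __name__ == "__main__":
    for osd, size in omap_sizes().items():
        print(f"{osd}: {size / 2**20:.1f} MiB")
```

Comparing the sizes for the OSDs that spin at 100% CPU against those that start normally should show whether the omap scan is a plausible explanation.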
