Bug #22543

OSDs cannot start after shutdown, killed by OOM killer during PGs load

Added by Volodymyr Blokhin over 3 years ago. Updated over 3 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

After a shutdown, none of the OSDs can start. During the load_pgs stage the ceph-osd process consumes all available virtual memory (RAM + swap), so the OOM killer terminates it.

root@osd001:~# dpkg -l | grep -i ceph
ii ceph-base 12.2.2-1xenial amd64 common ceph daemon libraries and management tools
ii ceph-common 12.2.2-1xenial amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 12.2.2-1xenial amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 12.2.2-1xenial amd64 metadata server for the ceph distributed file system
ii ceph-osd 12.2.2-1xenial amd64 OSD server for the ceph storage system
ii libcephfs2 12.2.2-1xenial amd64 Ceph distributed file system client library
ii python-cephfs 12.2.2-1xenial amd64 Python 2 libraries for the Ceph libcephfs library
ii python-rados 12.2.2-1xenial amd64 Python 2 libraries for the Ceph librados library
ii python-rbd 12.2.2-1xenial amd64 Python 2 libraries for the Ceph librbd library
ii python-rgw 12.2.2-1xenial amd64 Python 2 libraries for the Ceph librgw library
root@osd001:~# apt-cache policy ceph-osd
ceph-osd:
Installed: 12.2.2-1xenial
Candidate: 12.2.2-1xenial
Version table:
*** 12.2.2-1xenial 1100
1100 https://download.ceph.com/debian-luminous xenial/main amd64 Packages
100 /var/lib/dpkg/status

OSDs_lsblk.txt View - osd hdds list and bluestore partition size (1.63 KB) Volodymyr Blokhin, 12/26/2017 05:56 PM

osd2_perf_dump.txt View - ceph daemon osd.2 perf dump (21.5 KB) Volodymyr Blokhin, 12/26/2017 05:56 PM

osd2_ceph_conf.txt View - cat /etc/ceph/ceph.conf on osd node (399 Bytes) Volodymyr Blokhin, 12/26/2017 05:56 PM

osd2_config_show.txt View - ceph daemon osd.2 config show (55.1 KB) Volodymyr Blokhin, 12/26/2017 05:56 PM

cmn01_ceph_conf.txt View - cat /etc/ceph/ceph.conf on monitor node (598 Bytes) Volodymyr Blokhin, 12/26/2017 05:56 PM

OSD_RAM_usage.png View - grafana mem usage monitoring from one of osd nodes (44.3 KB) Volodymyr Blokhin, 12/26/2017 05:56 PM

ceph_status.txt View - ceph status output (645 Bytes) Volodymyr Blokhin, 12/26/2017 05:56 PM

ceph_osd2_dump_mempools.txt View - ceph daemon osd.2 dump_mempools (1.6 KB) Volodymyr Blokhin, 12/26/2017 05:56 PM

ceph_osd_tree.txt View - ceph osd tree (499 Bytes) Volodymyr Blokhin, 12/26/2017 05:56 PM

ceph_osd_dump.txt View - ceph osd dump (2.62 KB) Volodymyr Blokhin, 12/26/2017 05:56 PM

up_and_fail_cycle_osd2_log.txt View - /var/log/ceph/ceph-osd.N.log (305 KB) Volodymyr Blokhin, 12/26/2017 06:02 PM

History

#2 Updated by Sage Weil over 3 years ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to High

The mempool dump shows 58GB (!) of pg logs. Can you restart the osd with 'debug bluestore = 20' so we can see if it is reading real, valid log entries?

Thanks!
sage
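For reference, the logging Sage asks for has to be set before the OSD's next start attempt, since the daemon dies during load. A minimal sketch of a ceph.conf fragment on the OSD node (section placement and the systemd unit name are assumptions based on a typical Luminous deployment; osd.2 is just the example daemon from the attachments):

```ini
; /etc/ceph/ceph.conf on the OSD node (sketch)
[osd]
    debug bluestore = 20
; then restart the affected daemon, e.g.:
;   systemctl restart ceph-osd@2
```

With this in place, the BlueStore log entries land in /var/log/ceph/ceph-osd.2.log on the next start attempt.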

#3 Updated by Volodymyr Blokhin over 3 years ago

Sage,

Unfortunately we could not wait that long and re-deployed the Ceph cluster on 12/30/2017.
We managed to start ceph-osd (PG load finished) by adding 100 GB of swap to each OSD node.
But the PGs never came online (we waited 36 hours) and we had to re-deploy the cluster.

Sage Weil wrote:

The mempool dump shows 58GB (!) of pg logs. Can you restart the osd with 'debug bluestore = 20' so we can see if it is reading real, valid log entries?

Thanks!
sage

#4 Updated by Sage Weil over 3 years ago

  • Status changed from Need More Info to Can't reproduce
