Bug #22543
OSDs cannot start after shutdown; killed by OOM killer during PG load
Description
Hello,
After a shutdown, none of the OSDs can start. During the load_pgs stage, the ceph-osd process consumes all available virtual memory (RAM + swap), so the OOM killer kills it.
@root@osd001:~# dpkg -l | grep -i ceph
ii ceph-base 12.2.2-1xenial amd64 common ceph daemon libraries and management tools
ii ceph-common 12.2.2-1xenial amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 12.2.2-1xenial amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 12.2.2-1xenial amd64 metadata server for the ceph distributed file system
ii ceph-osd 12.2.2-1xenial amd64 OSD server for the ceph storage system
ii libcephfs2 12.2.2-1xenial amd64 Ceph distributed file system client library
ii python-cephfs 12.2.2-1xenial amd64 Python 2 libraries for the Ceph libcephfs library
ii python-rados 12.2.2-1xenial amd64 Python 2 libraries for the Ceph librados library
ii python-rbd 12.2.2-1xenial amd64 Python 2 libraries for the Ceph librbd library
ii python-rgw 12.2.2-1xenial amd64 Python 2 libraries for the Ceph librgw library
root@osd001:~# apt-cache policy ceph-osd
ceph-osd:
  Installed: 12.2.2-1xenial
  Candidate: 12.2.2-1xenial
  Version table:
 *** 12.2.2-1xenial 1100
       1100 https://download.ceph.com/debian-luminous xenial/main amd64 Packages
        100 /var/lib/dpkg/status@
History
#1 Updated by Volodymyr Blokhin over 6 years ago
- File up_and_fail_cycle_osd2_log.txt added
#2 Updated by Sage Weil about 6 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to High
The mempool dump shows 58GB (!) of pg logs. Can you restart the osd with 'debug bluestore = 20' so we can see if it is reading real, valid log entries?
Thanks!
sage
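For anyone reproducing this, the requested debug level could be applied to a single OSD roughly as sketched below. This is a minimal sketch, not a confirmed procedure from this ticket: the helper name osd_debug_fragment is hypothetical, and OSD id 2 is an assumption taken from the attached log file name. The function only prints a ceph.conf fragment and changes nothing on the system.

```shell
#!/bin/sh
# Hypothetical helper: print a ceph.conf fragment that raises BlueStore
# debug logging for one OSD. It only writes to stdout.
osd_debug_fragment() {
    osd_id="$1"        # OSD id (assumption: 2, from the attached log name)
    level="${2:-20}"   # debug level; 20 is the value requested above
    printf '[osd.%s]\n' "$osd_id"
    printf 'debug bluestore = %s\n' "$level"
}

# Show the fragment for the assumed OSD id.
osd_debug_fragment 2
```

After reviewing the output, it can be appended to the node's ceph.conf (commonly /etc/ceph/ceph.conf on a package install, which is an assumption here) before restarting that OSD. Level 20 logging is very verbose, so the log partition may need free space.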
#3 Updated by Volodymyr Blokhin about 6 years ago
Sage,
Unfortunately, we could not wait that long and redeployed the Ceph cluster on 12/30/2017.
We managed to start ceph-osd (PG load finished) by adding 100 GB of swap on each OSD node.
However, the PGs never came online (we waited 36 hours), so we had to redeploy the cluster.
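The swap workaround mentioned above can be sketched as follows. The report does not say how the swap was added, so the use of a swap file, the path /swapfile, and the helper name swap_file_commands are all assumptions. The function only prints the commands (a dry run) and executes none of them. As noted above, the extra swap let the PG load finish but did not bring the PGs online.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands that would create and
# enable a swap file. Nothing is executed by this function.
swap_file_commands() {
    path="$1"      # swap file path (assumption: /swapfile)
    size_gb="$2"   # size in GB (100 in the report above)
    printf 'fallocate -l %sG %s\n' "$size_gb" "$path"
    printf 'chmod 600 %s\n' "$path"
    printf 'mkswap %s\n' "$path"
    printf 'swapon %s\n' "$path"
}

# Print the command list for a 100 GB swap file.
swap_file_commands /swapfile 100
```

The printed commands would need to be run as root after review. On some filesystems a file created with fallocate may be rejected by swapon, in which case writing the file with dd is the usual alternative.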
Sage Weil wrote:
The mempool dump shows 58GB (!) of pg logs. Can you restart the osd with 'debug bluestore = 20' so we can see if it is reading real, valid log entries?
Thanks!
sage
#4 Updated by Sage Weil about 6 years ago
- Status changed from Need More Info to Can't reproduce