Bug #10554

init: ceph-osd (...) main process (...) killed by ABRT signal - erasure-coded cluster with more than 238 OSDs

Added by Mohamed Pakkeer over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are building an erasure-coded cluster with 15 nodes, each with 36 drives. We are able to create a healthy erasure-coded cluster with up to 238 OSDs. When we try to add the 239th OSD, the cluster starts malfunctioning: OSDs fail at random and are restarted automatically, so we are never able to reach an active+clean state. As soon as the 239th OSD is added, CPU usage on every node goes to 100% (all cores).
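
For reference, the OSDs were added incrementally. The exact deployment tooling is not recorded in this report, but a typical Giant-era sequence for bringing up one additional OSD with ceph-deploy (node and device names below are placeholders) looks like this:

# Hypothetical example only: node15 and /dev/sdak are placeholders.
ceph-deploy disk zap node15:/dev/sdak
ceph-deploy osd create node15:/dev/sdak
# On Ubuntu with upstart the new daemon is started automatically,
# or it can be started by hand:
sudo start ceph-osd id=238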

We have configured the erasure-coded profile as follows:
directory=/usr/lib/ceph/erasure-code
k=10
m=3
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van
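
For completeness, a profile like the one above and an erasure-coded pool using it can be created with the following commands (the profile and pool names here are placeholders; the actual names are not shown in this report):

# Hypothetical re-creation of the profile listed above.
ceph osd erasure-code-profile set ec-10-3 \
    plugin=jerasure k=10 m=3 \
    technique=reed_sol_van \
    ruleset-failure-domain=host
# Create an erasure-coded pool using that profile;
# 8192 PGs matches the pgmap shown further below.
ceph osd pool create ecpool 8192 8192 erasure ec-10-3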

All the storage nodes are running the Giant release:
ceph --version
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

ceph.conf

fsid = c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
public network = 10.1.x.0/21
cluster network = 10.1.x.0/21
mon_initial_members = master01
mon_host = 10.1.x.231
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Checking dmesg on each storage node shows messages like the following:
[348483.517102] init: ceph-osd (ceph/238) main process (7171) killed by ABRT signal
[348483.517125] init: ceph-osd (ceph/238) main process ended, respawning
[348483.534798] init: ceph-osd (ceph/208) main process (7123) killed by ABRT signal
[348483.534823] init: ceph-osd (ceph/208) main process ended, respawning
[348483.556807] init: ceph-osd (ceph/163) main process (7530) killed by ABRT signal
[348483.556878] init: ceph-osd (ceph/163) main process ended, respawning
[348483.565578] init: ceph-osd (ceph/133) main process (7271) killed by ABRT signal
[348483.565603] init: ceph-osd (ceph/133) main process ended, respawning
[348483.594211] init: ceph-osd (ceph/193) main process (7290) killed by ABRT signal
[348483.594233] init: ceph-osd (ceph/193) main process ended, respawning
[348483.601238] init: ceph-osd (ceph/28) main process (7431) killed by ABRT signal
[348483.601257] init: ceph-osd (ceph/28) main process ended, respawning
[348483.614195] init: ceph-osd (ceph/13) main process (7263) killed by ABRT signal
[348483.614216] init: ceph-osd (ceph/13) main process ended, respawning
[348483.636125] init: ceph-osd (ceph/118) main process (6974) killed by ABRT signal
[348483.636173] init: ceph-osd (ceph/118) main process ended, respawning
[348484.083792] init: ceph-osd (ceph/73) main process (7360) killed by ABRT signal
[348484.083810] init: ceph-osd (ceph/73) main process ended, respawning
[348484.181915] init: ceph-osd (ceph/103) main process (6628) killed by ABRT signal
[348484.181940] init: ceph-osd (ceph/103) main process ended, respawning
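
dmesg only shows that the daemons were killed by SIGABRT; the actual abort reason (for example an assertion failure) should appear in the per-OSD log. Assuming the default Ubuntu log location, it can be pulled out with something like:

# Look for the assert/abort backtrace of one of the aborted OSDs.
grep -B5 -A30 -E 'FAILED assert|Aborted|terminate' /var/log/ceph/ceph-osd.238.log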

admin@mon:~$ ceph status

cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 8113 pgs peering; 13 pgs stale; 8192 pgs stuck inactive; 8192 pgs stuck unclean; 1 requests are blocked > 32 sec
monmap e1: 1 mons at {master01=10.1.x.231:6789/0}, election epoch 2, quorum 0 master01
osdmap e25059: 239 osds: 239 up, 239 in
pgmap v84871: 8192 pgs, 2 pools, 0 bytes data, 0 objects
1943 GB used, 866 TB / 868 TB avail
79 inactive
8088 peering
12 remapped+peering
13 stale+peering

A later ceph status (osdmap e25060) shows OSDs being marked down:

qubevaultadmin@qubevaultdrmon:~$ ceph status
cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 8123 pgs peering; 229 pgs stale; 8192 pgs stuck inactive; 8192 pgs stuck unclean; 6/239 in osds are down
monmap e1: 1 mons at {master01=10.1.x.231:6789/0}, election epoch 2, quorum 0 master01
osdmap e25060: 239 osds: 233 up, 239 in
pgmap v84876: 8192 pgs, 2 pools, 0 bytes data, 0 objects
1943 GB used, 866 TB / 868 TB avail
61 inactive
7890 peering
12 remapped+peering
8 stale
221 stale+peering
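
With nearly all PGs stuck peering and OSDs flapping, the usual follow-up diagnostics are the health detail, the stuck-PG dump and the OSD tree (standard ceph CLI commands, shown here for reference):

ceph health detail
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph osd tree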