Bug #10554

init: ceph-osd (...) main process (...) killed by ABRT signal - erasure-coded cluster with more than 238 OSDs

Added by Mohamed Pakkeer over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are building an erasure-coded cluster with 15 nodes, each with 36 drives. We are able to create a healthy erasure-coded cluster with up to 238 OSDs. When we try to add the 239th OSD, the cluster starts malfunctioning: OSDs fail at random and are restarted automatically, so we are never able to reach an active+clean state. As soon as the 239th OSD is added, CPU usage on every node goes to 100% (all cores).
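
For reference, the OSDs were added incrementally. The exact deployment tooling is not recorded in this report, but a typical Giant-era sequence for bringing up one additional OSD with ceph-deploy (node and device names below are placeholders) looks like this:

# Hypothetical example only: node15 and /dev/sdak are placeholders.
ceph-deploy disk zap node15:/dev/sdak
ceph-deploy osd create node15:/dev/sdak
# On Ubuntu with upstart the new daemon is started automatically,
# or it can be started by hand:
sudo start ceph-osd id=238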

We have configured the erasure-coded profile as follows:
directory=/usr/lib/ceph/erasure-code
k=10
m=3
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van
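
For completeness, a profile like the one above and an erasure-coded pool using it can be created with the following commands (the profile and pool names here are placeholders; the actual names are not shown in this report):

# Hypothetical re-creation of the profile listed above.
ceph osd erasure-code-profile set ec-10-3 \
    plugin=jerasure k=10 m=3 \
    technique=reed_sol_van \
    ruleset-failure-domain=host
# Create an erasure-coded pool using that profile;
# 8192 PGs matches the pgmap shown further below.
ceph osd pool create ecpool 8192 8192 erasure ec-10-3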

All the storage nodes are running the Giant release:
ceph --version
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

ceph.conf

fsid = c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
public network = 10.1.x.0/21
cluster network = 10.1.x.0/21
mon_initial_members = master01
mon_host = 10.1.x.231
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Checking dmesg on each storage node shows messages like the following:
[348483.517102] init: ceph-osd (ceph/238) main process (7171) killed by ABRT signal
[348483.517125] init: ceph-osd (ceph/238) main process ended, respawning
[348483.534798] init: ceph-osd (ceph/208) main process (7123) killed by ABRT signal
[348483.534823] init: ceph-osd (ceph/208) main process ended, respawning
[348483.556807] init: ceph-osd (ceph/163) main process (7530) killed by ABRT signal
[348483.556878] init: ceph-osd (ceph/163) main process ended, respawning
[348483.565578] init: ceph-osd (ceph/133) main process (7271) killed by ABRT signal
[348483.565603] init: ceph-osd (ceph/133) main process ended, respawning
[348483.594211] init: ceph-osd (ceph/193) main process (7290) killed by ABRT signal
[348483.594233] init: ceph-osd (ceph/193) main process ended, respawning
[348483.601238] init: ceph-osd (ceph/28) main process (7431) killed by ABRT signal
[348483.601257] init: ceph-osd (ceph/28) main process ended, respawning
[348483.614195] init: ceph-osd (ceph/13) main process (7263) killed by ABRT signal
[348483.614216] init: ceph-osd (ceph/13) main process ended, respawning
[348483.636125] init: ceph-osd (ceph/118) main process (6974) killed by ABRT signal
[348483.636173] init: ceph-osd (ceph/118) main process ended, respawning
[348484.083792] init: ceph-osd (ceph/73) main process (7360) killed by ABRT signal
[348484.083810] init: ceph-osd (ceph/73) main process ended, respawning
[348484.181915] init: ceph-osd (ceph/103) main process (6628) killed by ABRT signal
[348484.181940] init: ceph-osd (ceph/103) main process ended, respawning
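
dmesg only shows that the daemons were killed by SIGABRT; the actual abort reason (for example an assertion failure) should appear in the per-OSD log. Assuming the default Ubuntu log location, it can be pulled out with something like:

# Look for the assert/abort backtrace of one of the aborted OSDs.
grep -B5 -A30 -E 'FAILED assert|Aborted|terminate' /var/log/ceph/ceph-osd.238.log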

admin@mon:~$ ceph status

cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 8113 pgs peering; 13 pgs stale; 8192 pgs stuck inactive; 8192 pgs stuck unclean; 1 requests are blocked > 32 sec
monmap e1: 1 mons at {master01=10.1.x.231:6789/0}, election epoch 2, quorum 0 master01
osdmap e25059: 239 osds: 239 up, 239 in
pgmap v84871: 8192 pgs, 2 pools, 0 bytes data, 0 objects
1943 GB used, 866 TB / 868 TB avail
79 inactive
8088 peering
12 remapped+peering
13 stale+peering

A later ceph status (osdmap e25060) shows OSDs being marked down:

qubevaultadmin@qubevaultdrmon:~$ ceph status
cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 8123 pgs peering; 229 pgs stale; 8192 pgs stuck inactive; 8192 pgs stuck unclean; 6/239 in osds are down
monmap e1: 1 mons at {master01=10.1.x.231:6789/0}, election epoch 2, quorum 0 master01
osdmap e25060: 239 osds: 233 up, 239 in
pgmap v84876: 8192 pgs, 2 pools, 0 bytes data, 0 objects
1943 GB used, 866 TB / 868 TB avail
61 inactive
7890 peering
12 remapped+peering
8 stale
221 stale+peering
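
With nearly all PGs stuck peering and OSDs flapping, the usual follow-up diagnostics are the health detail, the stuck-PG dump and the OSD tree (standard ceph CLI commands, shown here for reference):

ceph health detail
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph osd tree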