Bug #10554

closed

init: ceph-osd (...) main process (...) killed by ABRT signal - erasure-coded cluster with more than 238 OSDs

Added by Mohamed Pakkeer over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are building an erasure-coded cluster with 15 nodes, each holding 36 drives. We were able to create a healthy erasure-coded cluster with up to 238 OSDs. When we try to add the 239th OSD, the cluster starts malfunctioning: random OSDs fail and restart automatically, so the cluster never reaches the active+clean state. When we add the 239th OSD, CPU usage on all nodes also goes to 100% (all cores).

We have configured the erasure-coded profile as follows:
directory=/usr/lib/ceph/erasure-code
k=10
m=3
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

All the storage nodes are running the Giant release:
ceph --version
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

ceph.conf

fsid = c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
public network = 10.1.x.0/21
cluster network = 10.1.x.0/21
mon_initial_members = master01
mon_host = 10.1.x.231
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Checking dmesg on each storage node shows messages like the following:
[348483.517102] init: ceph-osd (ceph/238) main process (7171) killed by ABRT signal
[348483.517125] init: ceph-osd (ceph/238) main process ended, respawning
[348483.534798] init: ceph-osd (ceph/208) main process (7123) killed by ABRT signal
[348483.534823] init: ceph-osd (ceph/208) main process ended, respawning
[348483.556807] init: ceph-osd (ceph/163) main process (7530) killed by ABRT signal
[348483.556878] init: ceph-osd (ceph/163) main process ended, respawning
[348483.565578] init: ceph-osd (ceph/133) main process (7271) killed by ABRT signal
[348483.565603] init: ceph-osd (ceph/133) main process ended, respawning
[348483.594211] init: ceph-osd (ceph/193) main process (7290) killed by ABRT signal
[348483.594233] init: ceph-osd (ceph/193) main process ended, respawning
[348483.601238] init: ceph-osd (ceph/28) main process (7431) killed by ABRT signal
[348483.601257] init: ceph-osd (ceph/28) main process ended, respawning
[348483.614195] init: ceph-osd (ceph/13) main process (7263) killed by ABRT signal
[348483.614216] init: ceph-osd (ceph/13) main process ended, respawning
[348483.636125] init: ceph-osd (ceph/118) main process (6974) killed by ABRT signal
[348483.636173] init: ceph-osd (ceph/118) main process ended, respawning
[348484.083792] init: ceph-osd (ceph/73) main process (7360) killed by ABRT signal
[348484.083810] init: ceph-osd (ceph/73) main process ended, respawning
[348484.181915] init: ceph-osd (ceph/103) main process (6628) killed by ABRT signal
[348484.181940] init: ceph-osd (ceph/103) main process ended, respawning

admin@mon:~$ ceph status

cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 8113 pgs peering; 13 pgs stale; 8192 pgs stuck inactive; 8192 pgs stuck unclean; 1 requests are blocked > 32 sec
monmap e1: 1 mons at {master01=10.1.x.231:6789/0}, election epoch 2, quorum 0 master01
osdmap e25059: 239 osds: 239 up, 239 in
pgmap v84871: 8192 pgs, 2 pools, 0 bytes data, 0 objects
1943 GB used, 866 TB / 868 TB avail
79 inactive
8088 peering
12 remapped+peering
13 stale+peering
qubevaultadmin@qubevaultdrmon:~$ ceph status
cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 8123 pgs peering; 229 pgs stale; 8192 pgs stuck inactive; 8192 pgs stuck unclean; 6/239 in osds are down
monmap e1: 1 mons at {master01=10.1.x.231:6789/0}, election epoch 2, quorum 0 master01
osdmap e25060: 239 osds: 233 up, 239 in
pgmap v84876: 8192 pgs, 2 pools, 0 bytes data, 0 objects
1943 GB used, 866 TB / 868 TB avail
61 inactive
7890 peering
12 remapped+peering
8 stale
221 stale+peering
Actions #1

Updated by Mohamed Pakkeer over 9 years ago

We tried to create osd.239 on node15 using different disks (sdq, sdr, sds, etc.) and got the same error as reported above. However, we were able to create osd.239 on node01 using disk sdr; it was created and joined the cluster. We then experimented on the cluster to find the root cause of this issue. Our findings:

1. We can create any number of OSDs on a node, provided the node has not been restarted.
2. After adding those OSDs, the process list showed all of them running properly. In this case we spun up 18 OSDs on node01; all were added successfully and their services were up and running.
3. We then restarted node01. After the restart only 15 OSDs were up; three random OSDs had failed, and restarting them manually also failed.

Is this a limit on the number of OSDs that can run on a node, or is there a hardware limitation?
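One thing worth checking (a sketch; default values vary by distribution and are assumptions here) is whether the node is hitting kernel task limits rather than any Ceph-imposed per-node OSD limit:

```shell
# Inspect the limits that bound how many threads all OSDs on a node can create.
cat /proc/sys/kernel/threads-max   # system-wide thread limit
cat /proc/sys/kernel/pid_max       # max task IDs (each thread consumes one)
ulimit -u                          # per-user process/thread limit
ulimit -n                          # open file descriptor limit
```

If any of these are near their defaults, 36 OSDs on one node can plausibly exhaust them once peering fans out.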

Reference log files

admin@node001:~$ ps aux | grep ceph-osd | grep -v grep
root 8466 6.7 2.0 3296324 680620 ? Ssl Jan19 266:57 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
root 10035 6.5 2.1 3220940 712304 ? Ssl Jan19 258:33 /usr/bin/ceph-osd --cluster=ceph -i 15 -f
root 12135 6.6 1.6 3345196 558880 ? Ssl Jan19 258:44 /usr/bin/ceph-osd --cluster=ceph -i 30 -f
root 14509 6.3 1.8 3109896 619204 ? Ssl Jan19 247:54 /usr/bin/ceph-osd --cluster=ceph -i 45 -f
root 17330 6.7 2.7 3224284 912472 ? Ssl Jan19 259:34 /usr/bin/ceph-osd --cluster=ceph -i 60 -f
root 20581 6.6 2.3 3292748 775904 ? Ssl Jan19 255:49 /usr/bin/ceph-osd --cluster=ceph -i 75 -f
root 24294 6.8 2.8 3345440 929988 ? Ssl Jan19 259:36 /usr/bin/ceph-osd --cluster=ceph -i 90 -f
root 32046 7.5 1.7 3193472 579556 ? Ssl Jan20 219:45 /usr/bin/ceph-osd --cluster=ceph -i 105 -f
root 36603 7.5 1.6 3293052 539712 ? Ssl Jan20 218:27 /usr/bin/ceph-osd --cluster=ceph -i 120 -f
root 42025 7.3 1.9 3213856 650036 ? Ssl Jan20 210:09 /usr/bin/ceph-osd --cluster=ceph -i 135 -f
root 49642 7.2 2.3 3175960 759916 ? Ssl Jan20 204:37 /usr/bin/ceph-osd --cluster=ceph -i 150 -f
root 57083 7.5 2.6 3288400 870124 ? Ssl Jan20 209:57 /usr/bin/ceph-osd --cluster=ceph -i 165 -f
root 65135 7.5 2.3 3267840 759120 ? Ssl Jan20 208:40 /usr/bin/ceph-osd --cluster=ceph -i 180 -f
root 584144 7.0 1.8 3185576 613084 ? Ssl Jan20 181:48 /usr/bin/ceph-osd --cluster=ceph -i 195 -f
root 599242 7.3 2.4 3215968 818408 ? Ssl Jan20 182:56 /usr/bin/ceph-osd --cluster=ceph -i 210 -f
root 622612 7.7 2.2 3265440 754752 ? Ssl Jan21 90:56 /usr/bin/ceph-osd --cluster=ceph -i 225 -f
root 667257 7.2 1.9 3164396 634876 ? Ssl Jan21 77:12 /usr/bin/ceph-osd --cluster=ceph -i 239 -f
root 773074 10.5 2.1 3000308 705660 ? Ssl 11:54 0:29 /usr/bin/ceph-osd --cluster=ceph -i 242 -f

admin@n001:~$ ps aux | grep ceph-osd | grep -v grep | wc -l
18

Dmesg Output Before Restart ===========================

[112995.460331] sdp: sdp2
[112998.679376] sdp: sdp1 sdp2
[113014.281911] XFS (sdp1): Mounting Filesystem
[113014.373332] XFS (sdp1): Ending clean mount
[113015.925112] sdp: sdp1 sdp2
[113016.358019] XFS (sdp1): Mounting Filesystem
[113016.446963] XFS (sdp1): Ending clean mount
[186540.561041] init: tty1 main process ended, respawning
[191360.894419] sdq: sdq2
[191364.413665] sdq: sdq1 sdq2
[191379.715942] XFS (sdq1): Mounting Filesystem
[191379.806936] XFS (sdq1): Ending clean mount
[191381.325716] sdq: sdq1 sdq2
[191381.784634] XFS (sdq1): Mounting Filesystem
[191381.879493] XFS (sdq1): Ending clean mount
[195490.818151] init: tty1 main process ended, respawning
[198249.132166] sdr: sdr2
[198252.776392] sdr: sdr1 sdr2
[198268.520732] XFS (sdr1): Mounting Filesystem
[198268.617230] XFS (sdr1): Ending clean mount
[198270.172251] sdr: sdr1 sdr2
[198270.608196] XFS (sdr1): Mounting Filesystem
[198270.709385] XFS (sdr1): Ending clean mount
[261497.743447] sds: sds2
[261501.679607] sds: sds1 sds2
[261517.140467] XFS (sds1): Mounting Filesystem
[261517.230067] XFS (sds1): Ending clean mount
[261518.725220] sds: sds1 sds2
[261519.183435] XFS (sds1): Mounting Filesystem
[261519.281399] XFS (sds1): Ending clean mount

Ceph health status after node01 was restarted

admin@mon:~/$ ceph status
cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
health HEALTH_WARN 7426 pgs degraded; 7426 pgs stuck unclean; 7426 pgs undersized; 18/242 in osds are down
monmap e1: 1 mons at {mon01=10.1.x.231:6789/0}, election epoch 2, quorum 0 mon01
osdmap e2820: 243 osds: 224 up, 242 in
pgmap v24300: 8256 pgs, 3 pools, 0 bytes data, 0 objects
90441 MB used, 878 TB / 878 TB avail
830 active+clean
7426 active+undersized+degraded

Dmesg output after node01 restarted ===================================

[ 181.003165] init: ceph-osd (ceph/30) main process (23865) killed by ABRT signal
[ 181.003205] init: ceph-osd (ceph/30) main process ended, respawning
[ 181.369735] init: ceph-osd (ceph/90) main process (23980) killed by ABRT signal
[ 181.369764] init: ceph-osd (ceph/90) main process ended, respawning
[ 181.587322] init: ceph-osd (ceph/150) main process (24169) killed by ABRT signal
[ 181.587349] init: ceph-osd (ceph/150) main process ended, respawning
[ 186.156452] Core dump to |/usr/share/apport/apport 9444 6 0 9444 pipe failed
[ 186.263515] init: ceph-osd (ceph/30) main process (9444) killed by ABRT signal
[ 186.263533] init: ceph-osd (ceph/30) respawning too fast, stopped
[ 186.504402] init: ceph-osd (ceph/90) main process (10254) killed by ABRT signal
[ 186.504424] init: ceph-osd (ceph/90) respawning too fast, stopped
[ 186.513154] init: ceph-osd (ceph/150) main process (10737) killed by ABRT signal
[ 186.513177] init: ceph-osd (ceph/150) respawning too fast, stopped

Process status =============

root 2972 10.6 1.4 2816492 485720 ? Ssl 12:06 0:29 /usr/bin/ceph-osd --cluster=ceph -i 180 -f
root 3645 9.3 1.3 2812424 448392 ? Ssl 12:04 0:35 /usr/bin/ceph-osd --cluster=ceph -i 242 -f
root 8829 16.8 1.3 2818572 453024 ? Ssl 12:09 0:12 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
root 12569 10.0 1.7 2804536 577996 ? Ssl 12:05 0:31 /usr/bin/ceph-osd --cluster=ceph -i 225 -f
root 13244 10.0 1.5 2689132 512540 ? Ssl 12:05 0:30 /usr/bin/ceph-osd --cluster=ceph -i 45 -f
root 13274 10.3 1.5 2831964 498988 ? Ssl 12:05 0:32 /usr/bin/ceph-osd --cluster=ceph -i 165 -f
root 14527 9.3 0.9 2738536 306504 ? Ssl 12:04 0:33 /usr/bin/ceph-osd --cluster=ceph -i 195 -f
root 19708 10.4 1.9 2782644 638912 ? Ssl 12:05 0:30 /usr/bin/ceph-osd --cluster=ceph -i 15 -f
root 20618 10.5 1.8 2748448 616728 ? Ssl 12:06 0:29 /usr/bin/ceph-osd --cluster=ceph -i 105 -f
root 20816 10.5 1.8 2764724 621260 ? Ssl 12:06 0:29 /usr/bin/ceph-osd --cluster=ceph -i 135 -f
root 20966 11.1 1.8 2829512 609796 ? Ssl 12:06 0:31 /usr/bin/ceph-osd --cluster=ceph -i 75 -f
root 20999 10.9 1.8 2783360 611624 ? Ssl 12:06 0:30 /usr/bin/ceph-osd --cluster=ceph -i 60 -f
root 24671 9.8 1.3 2779608 442104 ? Ssl 12:05 0:32 /usr/bin/ceph-osd --cluster=ceph -i 210 -f
root 25514 9.5 1.1 2738412 391508 ? Ssl 12:05 0:31 /usr/bin/ceph-osd --cluster=ceph -i 239 -f
root 29626 10.7 1.4 2830200 464104 ? Ssl 12:06 0:28 /usr/bin/ceph-osd --cluster=ceph -i 120 -f

admin@n001:~$ ps aux | grep ceph-osd | grep -v grep | wc -l
15

Actions #2

Updated by Mohamed Pakkeer over 9 years ago

Failed osd log on node 01

ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0xa9767a]
2: (()+0x10340) [0x7feaa038d340]
3: (gsignal()+0x39) [0x7fea9e82bf79]
4: (abort()+0x148) [0x7fea9e82f388]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fea9f1376b5]
6: (()+0x5e836) [0x7fea9f135836]
7: (()+0x5e863) [0x7fea9f135863]
8: (()+0x5eaa2) [0x7fea9f135aa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb82508]
10: (Thread::create(unsigned long)+0x8a) [0xb632fa]
11: (Pipe::connect()+0x2f93) [0xc49e83]
12: (Pipe::writer()+0x4d1) [0xc4bb51]
13: (Pipe::Writer::entry()+0xd) [0xc4f4dd]
14: (()+0x8182) [0x7feaa0385182]
15: (clone()+0x6d) [0x7fea9e8f030d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer

Actions #3

Updated by Mohamed Pakkeer over 9 years ago

It seems the OSD cannot create the number of threads it needs to run:

2015-01-22 13:50:16.631942 7f4b5df62700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4b5df62700 time 2015-01-22 13:50:16.626052
common/Thread.cc: 129: FAILED assert(ret == 0)

ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb8231b]
2: (Thread::create(unsigned long)+0x8a) [0xb632fa]
3: (SimpleMessenger::connect_rank(entity_addr_t const&, int, PipeConnection*, Message*)+0x17c) [0xb5a23c]
4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x428) [0xb5aa68]
5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1e2) [0x6628e2]
6: (OSD::compat_must_dispatch_immediately(PG*)+0x147) [0x66a937]
7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x439) [0x699069]
8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6eea68]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb734a6]
10: (ThreadPool::WorkThread::entry()+0x10) [0xb74550]
11: (()+0x8182) [0x7f4b79864182]
12: (clone()+0x6d) [0x7f4b77dcf30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-12> 2015-01-22 13:50:16.625821 7f4b5df62700 5 osd.225 pg_epoch: 2977 pg[4.fa2s7( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2794) [232,188,116,155,126,177,32,225,214,69,240,59,220] r=7 lpr=2951 pi=2180-2949/102 crt=0'0 inactive NOTIFY] exit Reset 0.506593 27 0.000130
-11> 2015-01-22 13:50:16.625877 7f4b5df62700 5 osd.225 pg_epoch: 2977 pg[4.fa2s7( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2794) [232,188,116,155,126,177,32,225,214,69,240,59,220] r=7 lpr=2951 pi=2180-2949/102 crt=0'0 inactive NOTIFY] enter Started
-10> 2015-01-22 13:50:16.625895 7f4b5df62700 5 osd.225 pg_epoch: 2977 pg[4.fa2s7( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2794) [232,188,116,155,126,177,32,225,214,69,240,59,220] r=7 lpr=2951 pi=2180-2949/102 crt=0'0 inactive NOTIFY] enter Start
-9> 2015-01-22 13:50:16.625910 7f4b5df62700 1 osd.225 pg_epoch: 2977 pg[4.fa2s7( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2794) [232,188,116,155,126,177,32,225,214,69,240,59,220] r=7 lpr=2951 pi=2180-2949/102 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
-8> 2015-01-22 13:50:16.625927 7f4b5df62700 5 osd.225 pg_epoch: 2977 pg[4.fa2s7( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2794) [232,188,116,155,126,177,32,225,214,69,240,59,220] r=7 lpr=2951 pi=2180-2949/102 crt=0'0 inactive NOTIFY] exit Start 0.000033 0 0.000000
-7> 2015-01-22 13:50:16.625946 7f4b5df62700 5 osd.225 pg_epoch: 2977 pg[4.fa2s7( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2794) [232,188,116,155,126,177,32,225,214,69,240,59,220] r=7 lpr=2951 pi=2180-2949/102 crt=0'0 inactive NOTIFY] enter Started/Stray
-6> 2015-01-22 13:50:16.626025 7f4b5e763700 5 osd.225 pg_epoch: 2977 pg[4.f91s3( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2796) [73,108,101,225,59,172,129,100,196,143,171,147,125] r=3 lpr=2951 pi=2180-2949/104 crt=0'0 inactive NOTIFY] exit Reset 0.507300 27 0.000132
-5> 2015-01-22 13:50:16.626057 7f4b5e763700 5 osd.225 pg_epoch: 2977 pg[4.f91s3( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2796) [73,108,101,225,59,172,129,100,196,143,171,147,125] r=3 lpr=2951 pi=2180-2949/104 crt=0'0 inactive NOTIFY] enter Started
-4> 2015-01-22 13:50:16.626073 7f4b5e763700 5 osd.225 pg_epoch: 2977 pg[4.f91s3( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2796) [73,108,101,225,59,172,129,100,196,143,171,147,125] r=3 lpr=2951 pi=2180-2949/104 crt=0'0 inactive NOTIFY] enter Start
-3> 2015-01-22 13:50:16.626086 7f4b5e763700 1 osd.225 pg_epoch: 2977 pg[4.f91s3( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2796) [73,108,101,225,59,172,129,100,196,143,171,147,125] r=3 lpr=2951 pi=2180-2949/104 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
-2> 2015-01-22 13:50:16.626102 7f4b5e763700 5 osd.225 pg_epoch: 2977 pg[4.f91s3( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2796) [73,108,101,225,59,172,129,100,196,143,171,147,125] r=3 lpr=2951 pi=2180-2949/104 crt=0'0 inactive NOTIFY] exit Start 0.000030 0 0.000000
-1> 2015-01-22 13:50:16.626118 7f4b5e763700 5 osd.225 pg_epoch: 2977 pg[4.f91s3( empty local-les=2951 n=0 ec=30 les/c 2951/2951 2950/2950/2796) [73,108,101,225,59,172,129,100,196,143,171,147,125] r=3 lpr=2951 pi=2180-2949/104 crt=0'0 inactive NOTIFY] enter Started/Stray
0> 2015-01-22 13:50:16.631942 7f4b5df62700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4b5df62700 time 2015-01-22 13:50:16.626052
common/Thread.cc: 129: FAILED assert(ret == 0)

ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb8231b]
2: (Thread::create(unsigned long)+0x8a) [0xb632fa]
3: (SimpleMessenger::connect_rank(entity_addr_t const&, int, PipeConnection*, Message*)+0x17c) [0xb5a23c]
4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x428) [0xb5aa68]
5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1e2) [0x6628e2]
6: (OSD::compat_must_dispatch_immediately(PG*)+0x147) [0x66a937]
7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x439) [0x699069]
8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6eea68]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb734a6]
10: (ThreadPool::WorkThread::entry()+0x10) [0xb74550]
11: (()+0x8182) [0x7f4b79864182]
12: (clone()+0x6d) [0x7f4b77dcf30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.225.log
--- end dump of recent events ---
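The FAILED assert(ret == 0) in Thread::create() means pthread_create() returned non-zero, typically EAGAIN when the process or system thread limit is exhausted. A rough back-of-envelope (all figures are assumptions: every OSD opens a Pipe to each of the 238 peer OSDs during peering, and SimpleMessenger runs two threads per Pipe) shows how 36 OSDs per node can blow past a default limit:

```shell
# Back-of-envelope messenger thread count per node; all inputs are assumptions.
osds_per_node=36
peer_osds=238
threads_per_pipe=2          # Pipe::reader + Pipe::writer
echo $(( osds_per_node * peer_osds * threads_per_pipe ))   # prints 17136
```

That is on the order of 17,000 messenger threads per node before work queues, heartbeats, and FileStore threads are counted, which can exceed a 32k pid_max or a default `ulimit -u`.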

Actions #4

Updated by Mohamed Pakkeer over 9 years ago

After increasing the open file limit from 1024 to 64536 on node01, the random OSDs were no longer stopped; instead they kept respawning. We then started seeing high load on all nodes.

Reference thread: http://t79823.file-systems-ceph-user.file-systemstalk.us/expanding-pg-s-of-an-erasure-coded-pool-t79823.html
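Raising the kernel task limits alongside the file limit may also be needed on nodes this dense; a hypothetical /etc/sysctl.conf fragment (values are illustrative only, applied with `sysctl -p` or on reboot):

```
# illustrative values only
kernel.pid_max = 4194303
kernel.threads-max = 999999
```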

[ 5442.489613] init: ceph-osd (ceph/135) main process (31954) killed by ABRT signal
[ 5442.489658] init: ceph-osd (ceph/135) main process ended, respawning
[ 5448.515972] Core dump to |/usr/share/apport/apport 20491 6 0 20491 pipe failed
[ 5448.631510] init: ceph-osd (ceph/135) main process (20491) killed by ABRT signal
[ 5448.631533] init: ceph-osd (ceph/135) main process ended, respawning
[ 5454.809084] Core dump to |/usr/share/apport/apport 30103 6 0 30103 pipe failed
[ 5454.925793] init: ceph-osd (ceph/135) main process (30103) killed by ABRT signal
[ 5454.925847] init: ceph-osd (ceph/135) main process ended, respawning
[ 5460.528190] Core dump to |/usr/share/apport/apport 7839 6 0 7839 pipe failed
[ 5460.700986] init: ceph-osd (ceph/150) main process (7839) killed by ABRT signal
[ 5460.701013] init: ceph-osd (ceph/150) main process ended, respawning
[ 5461.174573] init: ceph-osd (ceph/165) main process (512) killed by ABRT signal
[ 5461.174614] init: ceph-osd (ceph/165) main process ended, respawning
[ 5461.320446] init: ceph-osd (ceph/135) main process (681) killed by ABRT signal
[ 5461.320475] init: ceph-osd (ceph/135) main process ended, respawning
[ 5461.339027] init: ceph-osd (ceph/0) main process (27677) killed by ABRT signal
[ 5461.339078] init: ceph-osd (ceph/0) main process ended, respawning
[ 5461.383469] init: ceph-osd (ceph/60) main process (22350) killed by ABRT signal
[ 5461.383490] init: ceph-osd (ceph/60) main process ended, respawning
[ 5461.463003] init: ceph-osd (ceph/45) main process (27465) killed by ABRT signal
[ 5461.463024] init: ceph-osd (ceph/45) main process ended, respawning
[ 5461.476679] init: ceph-osd (ceph/195) main process (15354) killed by ABRT signal
[ 5461.476707] init: ceph-osd (ceph/195) main process ended, respawning
[ 5461.814522] init: ceph-osd (ceph/180) main process (12833) killed by ABRT signal
[ 5461.814538] init: ceph-osd (ceph/180) main process ended, respawning
[ 5470.396822] Core dump to |/usr/share/apport/apport 8673 6 0 8673 pipe failed
[ 5470.556346] init: ceph-osd (ceph/45) main process (8673) killed by ABRT signal
[ 5470.556367] init: ceph-osd (ceph/45) main process ended, respawning
[ 5471.085784] init: ceph-osd (ceph/135) main process (8632) killed by ABRT signal
[ 5471.085802] init: ceph-osd (ceph/135) main process ended, respawning

Log of OSD 135
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0xa9767a]
2: (()+0x10340) [0x7fe68ccd6340]
3: (gsignal()+0x39) [0x7fe68b174f79]
4: (abort()+0x148) [0x7fe68b178388]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe68ba806b5]
6: (()+0x5e836) [0x7fe68ba7e836]
7: (()+0x5e863) [0x7fe68ba7e863]
8: (()+0x5eaa2) [0x7fe68ba7eaa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb82508]
10: (Thread::create(unsigned long)+0x8a) [0xb632fa]
11: (Pipe::accept()+0x31e2) [0xc425b2]
12: (Pipe::reader()+0x193d) [0xc467cd]
13: (Pipe::Reader::entry()+0xd) [0xc4f4bd]
14: (()+0x8182) [0x7fe68ccce182]
15: (clone()+0x6d) [0x7fe68b23930d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2015-01-22 18:41:50.848028 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.570s1( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [187,135,132,21,236,103,123,130,77,128,106,34,209] r=1 lpr=5646 pi=2796-5645/705 crt=0'0 inactive NOTIFY] exit Reset 0.077371 3 0.000156
2015-01-22 18:41:50.848062 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.570s1( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [187,135,132,21,236,103,123,130,77,128,106,34,209] r=1 lpr=5646 pi=2796-5645/705 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.848084 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.570s1( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [187,135,132,21,236,103,123,130,77,128,106,34,209] r=1 lpr=5646 pi=2796-5645/705 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.848098 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[4.570s1( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [187,135,132,21,236,103,123,130,77,128,106,34,209] r=1 lpr=5646 pi=2796-5645/705 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.848117 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.570s1( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [187,135,132,21,236,103,123,130,77,128,106,34,209] r=1 lpr=5646 pi=2796-5645/705 crt=0'0 inactive NOTIFY] exit Start 0.000032 0 0.000000
2015-01-22 18:41:50.848134 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.570s1( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [187,135,132,21,236,103,123,130,77,128,106,34,209] r=1 lpr=5646 pi=2796-5645/705 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.852164 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5643/5643/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5643 pi=323-5642/911 crt=0'0 inactive NOTIFY] exit Started/Stray 2.410075 3 0.000862
2015-01-22 18:41:50.852200 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5643/5643/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5643 pi=323-5642/911 crt=0'0 inactive NOTIFY] exit Started 2.410186 0 0.000000
2015-01-22 18:41:50.852216 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5643/5643/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5643 pi=323-5642/911 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.853999 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5646 pi=323-5645/913 crt=0'0 inactive NOTIFY] exit Reset 0.001784 3 0.000175
2015-01-22 18:41:50.854028 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5646 pi=323-5645/913 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.854042 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5646 pi=323-5645/913 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.854055 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5646 pi=323-5645/913 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.854072 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5646 pi=323-5645/913 crt=0'0 inactive NOTIFY] exit Start 0.000031 0 0.000000
2015-01-22 18:41:50.854090 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.8aas7( empty local-les=5414 n=0 ec=30 les/c 5414/5420 5646/5646/2795) [54,121,88,217,170,194,115,135,51,101,183,203,237] r=7 lpr=5646 pi=323-5645/913 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.860118 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5643 pi=396-5642/926 crt=0'0 inactive NOTIFY] exit Started/Stray 2.413711 3 0.000883
2015-01-22 18:41:50.860151 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5643 pi=396-5642/926 crt=0'0 inactive NOTIFY] exit Started 2.413823 0 0.000000
2015-01-22 18:41:50.860169 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5643 pi=396-5642/926 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.861765 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5646 pi=396-5645/928 crt=0'0 inactive NOTIFY] exit Reset 0.001595 3 0.000181
2015-01-22 18:41:50.861793 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5646 pi=396-5645/928 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.861809 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5646 pi=396-5645/928 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.861821 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5646 pi=396-5645/928 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.861838 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5646 pi=396-5645/928 crt=0'0 inactive NOTIFY] exit Start 0.000028 0 0.000000
2015-01-22 18:41:50.861853 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.8aas1( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2794) [24,135,104,220,127,16,92,192,171,11,125,3,154] r=1 lpr=5646 pi=396-5645/928 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.877865 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5643/5643/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5643 pi=771-5642/897 crt=0'0 inactive NOTIFY] exit Started/Stray 2.430529 3 0.000832
2015-01-22 18:41:50.877903 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5643/5643/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5643 pi=771-5642/897 crt=0'0 inactive NOTIFY] exit Started 2.430642 0 0.000000
2015-01-22 18:41:50.877933 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5643/5643/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5643 pi=771-5642/897 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.880057 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5646/5646/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5646 pi=771-5645/899 crt=0'0 inactive NOTIFY] exit Reset 0.002124 3 0.000193
2015-01-22 18:41:50.880085 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5646/5646/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5646 pi=771-5645/899 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.880099 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5646/5646/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5646 pi=771-5645/899 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.880112 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5646/5646/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5646 pi=771-5645/899 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.880129 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5646/5646/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5646 pi=771-5645/899 crt=0'0 inactive NOTIFY] exit Start 0.000029 0 0.000000
2015-01-22 18:41:50.880145 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.56es6( empty local-les=5420 n=0 ec=30 les/c 5420/5420 5646/5646/2214) [231,154,12,183,16,193,135,69,119,56,212,82,128] r=6 lpr=5646 pi=771-5645/899 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.883896 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5643/5643/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5643 pi=2240-5642/742 crt=0'0 inactive NOTIFY] exit Started/Stray 2.435663 3 0.000795
2015-01-22 18:41:50.883926 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5643/5643/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5643 pi=2240-5642/742 crt=0'0 inactive NOTIFY] exit Started 2.435776 0 0.000000
2015-01-22 18:41:50.883943 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5643/5643/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5643 pi=2240-5642/742 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.885044 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5646/5646/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5646 pi=2240-5645/744 crt=0'0 inactive NOTIFY] exit Reset 0.001100 3 0.000184
2015-01-22 18:41:50.885070 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5646/5646/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5646 pi=2240-5645/744 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.885085 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5646/5646/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5646 pi=2240-5645/744 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.885098 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5646/5646/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5646 pi=2240-5645/744 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.885114 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5646/5646/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5646 pi=2240-5645/744 crt=0'0 inactive NOTIFY] exit Start 0.000029 0 0.000000
2015-01-22 18:41:50.885128 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.23bs12( empty local-les=5415 n=0 ec=30 les/c 5415/5420 5646/5646/2795) [17,236,3,232,149,1,235,38,147,178,111,80,135] r=12 lpr=5646 pi=2240-5645/744 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.887455 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5643/5643/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5643 pi=743-5642/910 crt=0'0 inactive NOTIFY] exit Started/Stray 2.435250 3 0.000838
2015-01-22 18:41:50.887484 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5643/5643/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5643 pi=743-5642/910 crt=0'0 inactive NOTIFY] exit Started 2.435360 0 0.000000
2015-01-22 18:41:50.887500 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5643/5643/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5643 pi=743-5642/910 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.887550 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.f33s11( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2789) [67,27,78,131,220,151,5,32,89,99,88,135,154] r=11 lpr=5646 pi=1066-5645/888 crt=0'0 inactive NOTIFY] exit Reset 0.118327 3 0.000422
2015-01-22 18:41:50.887582 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.f33s11( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2789) [67,27,78,131,220,151,5,32,89,99,88,135,154] r=11 lpr=5646 pi=1066-5645/888 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.887598 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.f33s11( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2789) [67,27,78,131,220,151,5,32,89,99,88,135,154] r=11 lpr=5646 pi=1066-5645/888 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.887613 7fe6705c7700 1 osd.135 pg_epoch: 5647 pg[3.f33s11( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2789) [67,27,78,131,220,151,5,32,89,99,88,135,154] r=11 lpr=5646 pi=1066-5645/888 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.887631 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.f33s11( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2789) [67,27,78,131,220,151,5,32,89,99,88,135,154] r=11 lpr=5646 pi=1066-5645/888 crt=0'0 inactive NOTIFY] exit Start 0.000032 0 0.000000
2015-01-22 18:41:50.887648 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.f33s11( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2789) [67,27,78,131,220,151,5,32,89,99,88,135,154] r=11 lpr=5646 pi=1066-5645/888 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.888701 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5646/5646/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5646 pi=743-5645/912 crt=0'0 inactive NOTIFY] exit Reset 0.001202 3 0.000303
2015-01-22 18:41:50.888726 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5646/5646/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5646 pi=743-5645/912 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.888740 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5646/5646/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5646 pi=743-5645/912 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.888753 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5646/5646/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5646 pi=743-5645/912 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.888768 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5646/5646/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5646 pi=743-5645/912 crt=0'0 inactive NOTIFY] exit Start 0.000028 0 0.000000
2015-01-22 18:41:50.888784 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.564s7( empty local-les=5414 n=0 ec=30 les/c 5414/5415 5646/5646/1946) [71,182,193,83,115,170,187,135,171,72,29,19,76] r=7 lpr=5646 pi=743-5645/912 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.891274 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5643 pi=771-5642/916 crt=0'0 inactive NOTIFY] exit Started/Stray 2.438148 3 0.000831
2015-01-22 18:41:50.891301 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5643 pi=771-5642/916 crt=0'0 inactive NOTIFY] exit Started 2.438258 0 0.000000
2015-01-22 18:41:50.891318 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5643 pi=771-5642/916 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.892560 7fe6705c7700 5 osd.135 pg_epoch: 5645 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5643 pi=758-5642/894 crt=0'0 inactive NOTIFY] exit Started/Stray 2.454031 3 0.000790
2015-01-22 18:41:50.893224 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5646 pi=771-5645/918 crt=0'0 inactive NOTIFY] exit Reset 0.001906 3 0.000349
2015-01-22 18:41:50.893249 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5646 pi=771-5645/918 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.893265 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5646 pi=771-5645/918 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.893278 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5646 pi=771-5645/918 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.893305 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5646 pi=771-5645/918 crt=0'0 inactive NOTIFY] exit Start 0.000041 0 0.000000
2015-01-22 18:41:50.893322 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[3.f0as3( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [139,197,233,135,219,202,171,56,44,193,57,61,155] r=3 lpr=5646 pi=771-5645/918 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.892590 7fe6705c7700 5 osd.135 pg_epoch: 5645 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5643 pi=758-5642/894 crt=0'0 inactive NOTIFY] exit Started 2.454142 0 0.000000
2015-01-22 18:41:50.893880 7fe6705c7700 5 osd.135 pg_epoch: 5645 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5643/5643/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5643 pi=758-5642/894 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.895980 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5646 pi=758-5645/896 crt=0'0 inactive NOTIFY] exit Reset 0.002100 3 0.001424
2015-01-22 18:41:50.896008 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5646 pi=758-5645/896 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.896026 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5646 pi=758-5645/896 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.896040 7fe6705c7700 1 osd.135 pg_epoch: 5647 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5646 pi=758-5645/896 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.896057 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5646 pi=758-5645/896 crt=0'0 inactive NOTIFY] exit Start 0.000031 0 0.000000
2015-01-22 18:41:50.896074 7fe6705c7700 5 osd.135 pg_epoch: 5647 pg[3.25bs5( empty local-les=5414 n=0 ec=29 les/c 5414/5415 5646/5646/2795) [65,130,159,196,33,135,163,209,94,71,7,98,212] r=5 lpr=5646 pi=758-5645/896 crt=0'0 inactive NOTIFY] enter Started/Stray
2015-01-22 18:41:50.896641 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5643/5643/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5643 pi=1098-5642/936 crt=0'0 inactive NOTIFY] exit Started/Stray 2.438902 3 0.000843
2015-01-22 18:41:50.896672 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5643/5643/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5643 pi=1098-5642/936 crt=0'0 inactive NOTIFY] exit Started 2.439013 0 0.000000
2015-01-22 18:41:50.896688 7fe670dc8700 5 osd.135 pg_epoch: 5645 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5643/5643/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5643 pi=1098-5642/936 crt=0'0 inactive NOTIFY] enter Reset
2015-01-22 18:41:50.897826 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5646/5646/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5646 pi=1098-5645/938 crt=0'0 inactive NOTIFY] exit Reset 0.001138 3 0.000155
2015-01-22 18:41:50.897851 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5646/5646/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5646 pi=1098-5645/938 crt=0'0 inactive NOTIFY] enter Started
2015-01-22 18:41:50.897865 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5646/5646/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5646 pi=1098-5645/938 crt=0'0 inactive NOTIFY] enter Start
2015-01-22 18:41:50.897878 7fe670dc8700 1 osd.135 pg_epoch: 5647 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5646/5646/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5646 pi=1098-5645/938 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2015-01-22 18:41:50.897894 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5646/5646/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5646 pi=1098-5645/938 crt=0'0 inactive NOTIFY] exit Start 0.000029 0 0.000000
2015-01-22 18:41:50.897910 7fe670dc8700 5 osd.135 pg_epoch: 5647 pg[4.895s9( empty local-les=5414 n=0 ec=30 les/c 5414/5418 5646/5646/2790) [159,57,74,230,190,17,81,221,67,135,63,49,98] r=9 lpr=5646 pi=1098-5645/938 crt=0'0 inactive NOTIFY] enter Started/Stray
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.135.log

Actions #5

Updated by Yuri Weinstein over 9 years ago

See same issue in v0.80.8.

Run: http://pulpito.ceph.com/teuthology-2015-01-21_19:13:02-upgrade:dumpling-x-firefly-distro-basic-vps/
Job: ['716482']

2015-01-22T06:32:02.596 INFO:tasks.rados.rados.0.vpm062.stdout:3938: oids not in use 500
2015-01-22T06:32:02.596 INFO:tasks.rados.rados.0.vpm062.stdout:Snapping
2015-01-22T06:32:09.392 INFO:teuthology.orchestra.run.vpm084.stderr:Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fb19cbbf700 time 2015-01-22 14:32:09.322333
2015-01-22T06:32:09.392 INFO:teuthology.orchestra.run.vpm084.stderr:common/Thread.cc: 129: FAILED assert(ret == 0)
2015-01-22T06:32:09.393 INFO:teuthology.orchestra.run.vpm084.stderr: ceph version 0.80.8-18-g8ba48d1 (8ba48d10c252b28cde5b4da1286421db12b57cc2)
2015-01-22T06:32:09.393 INFO:teuthology.orchestra.run.vpm084.stderr: 1: (()+0x35584f) [0x7fb19e44984f]
2015-01-22T06:32:09.393 INFO:teuthology.orchestra.run.vpm084.stderr: 2: (SafeTimer::init()+0x8e) [0x7fb19e42324e]
2015-01-22T06:32:09.393 INFO:teuthology.orchestra.run.vpm084.stderr: 3: (MonClient::init()+0x2b7) [0x7fb19e491b57]
2015-01-22T06:32:09.393 INFO:teuthology.orchestra.run.vpm084.stderr: 4: (librados::RadosClient::connect()+0x4a5) [0x7fb19e383cf5]
2015-01-22T06:32:09.394 INFO:teuthology.orchestra.run.vpm084.stderr: 5: (ffi_call_unix64()+0x4c) [0x7fb19ffb0834]
2015-01-22T06:32:09.394 INFO:teuthology.orchestra.run.vpm084.stderr: 6: (ffi_call()+0x3f1) [0x7fb19ffb02af]
2015-01-22T06:32:09.394 INFO:teuthology.orchestra.run.vpm084.stderr: 7: (_ctypes_callproc()+0x4d1) [0x7fb19ff9dfd1]
2015-01-22T06:32:09.394 INFO:teuthology.orchestra.run.vpm084.stderr: 8: (()+0xd742) [0x7fb19ff9e742]
2015-01-22T06:32:09.394 INFO:teuthology.orchestra.run.vpm084.stderr: 9: (PyEval_EvalFrameEx()+0x1c92) [0x4adf82]
2015-01-22T06:32:09.394 INFO:teuthology.orchestra.run.vpm084.stderr: 10: (PyEval_EvalFrameEx()+0xaf0) [0x4acde0]
2015-01-22T06:32:09.395 INFO:teuthology.orchestra.run.vpm084.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #6

Updated by Loïc Dachary over 9 years ago

It looks like both issues are related to resource starvation. Could it be that the maximum number of processes/threads should be set to a higher number?

Actions #7

Updated by Mohamed Pakkeer over 9 years ago

This issue was resolved after increasing the open-file limit and the maximum number of threads.

Config reference:

/etc/security/limits.conf

root hard nofile 65535

sysctl -w kernel.pid_max=4194303
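Note that `sysctl -w` takes effect only until the next reboot, and limits.conf entries apply only to new sessions. A sketch of persisting both settings, using the values the reporter applied (tune them to the node's OSD count; 4194303 is the maximum pid_max a 64-bit kernel accepts):

```
# /etc/security/limits.conf -- per-process open-file ceiling for the
# user running the OSD daemons
root hard nofile 65535

# /etc/sysctl.conf -- applied at boot by the sysctl init machinery
kernel.pid_max = 4194303
```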

Actions #8

Updated by Samuel Just about 9 years ago

  • Status changed from New to Resolved
Actions #9

Updated by Sage Weil about 9 years ago

pid_max and the open files ulimit are the two limits to watch out for, yep!
