
Bug #10988

Multiple OSD getting mark_down: common/Thread.cc: 128: FAILED assert(ret == 0)

Added by karan singh about 9 years ago. Updated about 9 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
osd mark_down
Backport:
0.80.7
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This weekend I saw some weird behaviour in my cluster: more than 50% of the OSDs are down and out. The problem occurred when I increased the pg_num and pgp_num values for a pool.

The cluster is almost hung.

# ceph -s
2015-03-02 18:41:03.308460 7feb1affd700  1 monclient(hunting): found mon.pouta-s01
2015-03-02 18:41:03.308537 7feb21bac700  5 monclient: authenticate success, global_id 17199
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 4764 pgs degraded; 1397 pgs down; 1401 pgs peering; 1423 pgs stale; 1401 pgs stuck inactive; 1423 pgs stuck stale; 9230 pgs stuck unclean; 7 requests are blocked > 32 sec; recovery 4899/30477 objects degraded (16.074%)
     monmap e3: 3 mons at {pouta-s01=10.xxx.xx.1:6789/0,pouta-s02=10.xxx.xx.2:6789/0,pouta-s03=10.xxx.xx.3:6789/0}, election epoch 22, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
     osdmap e3979: 240 osds: 105 up, 105 in
      pgmap v24883: 17408 pgs, 13 pools, 41533 MB data, 10159 objects
            164 GB used, 381 TB / 381 TB avail
            4899/30477 objects degraded (16.074%)
                   6 stale+active+clean
                 502 active
                   1 peering
                   8 stale+down+remapped+peering
                1072 active+degraded+remapped
                8171 active+clean
                  55 down+remapped+peering
                1079 stale+active+degraded
                  94 stale+active+remapped
                 152 stale+down+peering
                2541 active+degraded
                   1 active+clean+replay
                2460 active+remapped
                1182 down+peering
                   9 stale+active
                   3 stale+peering
                  72 stale+active+degraded+remapped
recovery io 66096 kB/s, 16 objects/s
#

After increasing the debug level on the OSDs, I found the messages below on multiple OSDs:


--- begin dump of recent events ---

   -17> 2015-03-02 17:22:12.096104 7fb790400700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fb790400700 time 2015-03-02 17:22:12.092732
common/Thread.cc: 128: FAILED assert(ret == 0)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (Thread::create(unsigned long)+0x8a) [0xaf36fa]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae7a1a]
 3: (Accepter::entry()+0x265) [0xb5bb65]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -16> 2015-03-02 17:22:12.111211 7fb790c01700  1 -- 10.100.50.1:6908/17254 <== osd.159 10.100.50.4:0/21437 118 ==== osd_ping(ping e3527 stamp 2015-03-02 17:22:12.106135) v2 ==== 47+0+0 (2059412944 0 0) 0xbf16200 con 0x9d502c0
   -15> 2015-03-02 17:22:12.111246 7fb790c01700  1 -- 10.100.50.1:6908/17254 --> 10.100.50.4:0/21437 -- osd_ping(ping_reply e3527 stamp 2015-03-02 17:22:12.106135) v2 -- ?+0 0xcc10700 con 0x9d502c0
   -14> 2015-03-02 17:22:12.112992 7fb78fbff700  1 -- 10.100.50.1:6892/17254 <== osd.159 10.100.50.4:0/21437 118 ==== osd_ping(ping e3527 stamp 2015-03-02 17:22:12.106135) v2 ==== 47+0+0 (2059412944 0 0) 0xbef96c0 con 0x9de2520
   -13> 2015-03-02 17:22:12.164890 7fb74bc48700  1 -- 10.100.50.1:6870/17254 >> :/0 pipe(0xa20df00 sd=509 :6870 s=0 pgs=0 cs=0 l=0 c=0x3fa4200).accept sd=509 10.100.50.2:52081/0
   -12> 2015-03-02 17:22:12.175014 7fb74bc48700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fb74bc48700 time 2015-03-02 17:22:12.174041
common/Thread.cc: 128: FAILED assert(ret == 0)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (Thread::create(unsigned long)+0x8a) [0xaf36fa]
 2: (Pipe::accept()+0x4ac5) [0xb47a85]
 3: (Pipe::reader()+0x1bae) [0xb4a8ce]
 4: (Pipe::Reader::entry()+0xd) [0xb4cdad]
 5: /lib64/libpthread.so.0() [0x3c8a6079d1]
 6: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -11> 2015-03-02 17:22:12.228677 7fb792404700  1 -- 10.100.50.1:6870/17254 <== osd.49 10.100.50.1:6873/29869 457 ==== osd_map(3528..3528 src has 1256..3528) v3 ==== 2603+0+0 (3804348554 0 0) 0xc3e5340 con 0x96644c0
   -10> 2015-03-02 17:22:12.228856 7fb792404700  3 osd.4 3527 handle_osd_map epochs [3528,3528], i have 3527, src has [1256,3528]
    -9> 2015-03-02 17:22:12.236390 7fb792404700  1 -- 10.100.50.1:6870/17254 mark_down 10.100.50.2:6944/6477 -- 0x92b5f00
    -8> 2015-03-02 17:22:12.594614 7fb744fdd700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6854/17443 pipe(0x944be80 sd=268 :56120 s=2 pgs=114 cs=1 l=0 c=0xab6d280).fault with nothing to send, going to standby
    -7> 2015-03-02 17:22:12.657433 7fb73df6d700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6817/1851 pipe(0xa20be80 sd=135 :6870 s=2 pgs=208 cs=1 l=0 c=0xf228dc0).fault with nothing to send, going to standby
    -6> 2015-03-02 17:22:12.664885 7fb793406700  1 -- 10.100.50.1:6821/17254 <== mon.1 10.100.50.2:6789/0 8 ==== osd_map(3528..3528 src has 1256..3528) v3 ==== 2603+0+0 (3804348554 0 0) 0xc3e7980 con 0x408d6a0
    -5> 2015-03-02 17:22:12.730904 7fb77c244700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6809/45513 pipe(0x92b7800 sd=235 :47560 s=2 pgs=137 cs=1 l=0 c=0x9352520).fault with nothing to send, going to standby
    -4> 2015-03-02 17:22:12.883314 7fb7521ad700  1 -- 10.100.50.1:6870/17254 >> :/0 pipe(0x92b0c80 sd=71 :6870 s=0 pgs=0 cs=0 l=0 c=0x3fa2940).accept sd=71 10.100.50.2:52810/0
    -3> 2015-03-02 17:22:12.903263 7fb7636c1700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6815/54102 pipe(0x9977300 sd=256 :6870 s=2 pgs=48 cs=1 l=0 c=0xa712ec0).fault with nothing to send, going to standby
    -2> 2015-03-02 17:22:12.940200 7fb76bc46700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6848/26104 pipe(0x92b6e00 sd=653 :6870 s=2 pgs=17 cs=1 l=0 c=0x9282100).fault with nothing to send, going to standby
    -1> 2015-03-02 17:22:12.944702 7fb7521ad700  1 -- 10.100.50.1:6870/17254 >> :/0 pipe(0x92b6900 sd=71 :6870 s=0 pgs=0 cs=0 l=0 c=0x3fa3860).accept sd=71 10.100.50.4:35590/0
     0> 2015-03-02 17:22:12.979390 7fb775ee8700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6849/6761 pipe(0x944e680 sd=290 :6870 s=2 pgs=134 cs=1 l=0 c=0x9661080).fault with nothing to send, going to standby

  • At the time of recovery I observed CPU, memory, and network; everything seemed normal.
  • I tried restarting all the OSDs:
     service ceph restart osd -a
    After a few minutes, multiple OSDs again started to go down and out.

ceph version 0.80.7
CentOS 6.5
3.17.2-1.el6.elrepo.x86_64

Could you please suggest how to fix this?

History

#1 Updated by Kefu Chai about 9 years ago

common/Thread.cc: 128: FAILED assert(ret == 0)

It turns out that Thread::try_create() either failed to malloc() or failed to pthread_create().

At the time of recovery observed cpu,memory and network , everything seems normal

So it is not likely out of memory. Note that the errno is printed to stderr before the osd daemon kills itself.

Karan, could you launch one of the OSDs with the -f or the -d option, so that it can print the errno to stderr?
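For example, a minimal way to do that (the OSD id 4 here is illustrative; -d runs the daemon in the foreground and sends its log to stderr):

    # stop the init-managed instance first
    service ceph stop osd.4
    # run the same OSD in the foreground; the pthread_create errno should
    # now reach the terminal just before the assert fires
    ceph-osd -d -i 4 -c /etc/ceph/ceph.conf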

#2 Updated by karan singh about 9 years ago

Hello Kefu

Thanks for your time to look at this problem

Description: more than 50% of the OSDs are down and not coming back up.

I have been troubleshooting and found some interesting things:

  • To solve this problem I upgraded to the latest stable release, 0.80.8, but no luck.
  • OSD processes are failing to fork, throwing memory allocation errors even though the servers have over 50 GB of free memory. This happened on multiple OSD nodes.
  • I tried restarting the OSD nodes, but no luck.
  • All the monitor nodes are up, even though the ceph -s, ceph osd tree, and service osd restart commands are taking a hell of a lot of time. (I understand this could be due to recovery, but I have set the norecover flag.)

I am sharing all the logs that I found at the Dropbox link:
[[https://www.dropbox.com/sh/pwup6qalbw57b6x/AAAKbTkhpp9MK9qvB6m2LKbFa?dl=0]]

  • Logs related to the memory leak
  • service ceph restart osd logs with debugging enabled (debug ms = 20)
  • service ceph restart osd (restart fails for some OSDs due to timeout)
  • ceph osd tree and ceph -s logs

I would request you to have a look and provide your suggestions to resolve this problem.

#3 Updated by Kefu Chai about 9 years ago

Hi Karan, I checked the Ceph start script, which should be /etc/rc.d/init.d/ceph on CentOS. It only sets the number of allowed open files if max_open_files is specified in ceph.conf. But in your case, I guess the call that failed is pthread_create(3), which can fail with EAGAIN when any of several limits is hit:

       EAGAIN A system-imposed limit on the number of threads was encountered. There are a number of
              limits that may trigger this error: the RLIMIT_NPROC soft resource limit (set via
              setrlimit(2)), which limits the number of processes and threads for a real user ID,
              was reached; the kernel's system-wide limit on the number of processes and threads,
              /proc/sys/kernel/threads-max, was reached (see proc(5)); or the maximum number of
              PIDs, /proc/sys/kernel/pid_max, was reached (see proc(5)).

You might want to check these limits as well.
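A quick way to inspect each of those limits on an OSD node (a read-only sketch):

    # RLIMIT_NPROC: per-user limit on processes/threads
    ulimit -u
    # kernel's system-wide thread limit
    cat /proc/sys/kernel/threads-max
    # maximum number of PIDs (on Linux every thread consumes a PID)
    cat /proc/sys/kernel/pid_max
    # threads currently alive system-wide, for comparison
    ps -eLf | wc -l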

So I'd suggest launching the osd with "restart on core dump = true" in your ceph.conf to see the errno. It will run the osd daemon using ceph-run and pass -f to the underlying osd binary.
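In ceph.conf that would look like this (a sketch; assuming the option goes in the [osd] section):

    [osd]
        restart on core dump = true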

Or you can change the ceph start script to add ulimit -T unlimited somewhere in the start) block, like:

@@ -302,7 +302,9 @@

            [ -n "$wrap" ] && runmode="-f &" && runarg="-f" 
            [ -n "$max_open_files" ] && files="ulimit -n $max_open_files;" 
+        ulimit -n unlimited
+        ulimit -T unlimited

            if [ -n "$SYSTEMD_RUN" ]; then
                cmd="$SYSTEMD_RUN -r bash -c '$files $cmd --cluster $cluster -f'" 
            else

If the error is due to the limit on the number of threads your process can create, this patch will fix it. It also raises the limit on the number of fds.

#4 Updated by Sage Weil about 9 years ago

This is usually ulimit -n or /proc/sys/kernel/pid_max ... have you checked both of those?

#5 Updated by karan singh about 9 years ago

Hi Sage / Kefu

Unfortunately, none of the tricks has worked so far; the cluster is still dead.

Here are some more observations

  • I tried increasing ulimit and other tunables and cleanly restarted all the nodes, but nothing good happened.
  • While starting the OSDs a few of them start, but most of them crash, generating core dump files.

System output:

[root@XXX-s03 ~]# ulimit -n
65535
[root@XXX-s03 ~]# cat /proc/sys/kernel/pid_max
65536
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# cat /proc/sys/kernel/threads-max
1550216
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 775108
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# ceph -v
ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# cat /etc/redhat-release
CentOS release 6.5 (Final)
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# uname -r
3.17.2-1.el6.elrepo.x86_64
[root@XXX-s03 ~]#

Output from service ceph start osd:

=== osd.232 ===
Starting Ceph osd.232 on XXX-s03...already running
=== osd.235 ===
Starting Ceph osd.235 on XXX-s03...already running
ERROR:ceph-disk:Failed to activate
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: Error: error executing ceph-conf: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: [Errno 12] Cannot allocate memory
ceph-disk: Error: One or more partitions failed to activate
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:        193864      27902     165962          0         69      11176
-/+ buffers/cache:      16656     177208
Swap:         5023          0       5023
[root@XXX-s03 ~]#
[root@XXX-s03 ~]#
[root@XXX-s03 ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 169879216  71040 11463552    0    0     9    35   61   33  1  2 96  0  0
[root@XXX-s03 ~]#

Output from tail -f /var/log/messages:


Mar  5 21:57:03 XXX-s03 abrt[6707]: Not saving repeating crash in '/usr/bin/ceph-osd'
Mar  5 21:57:03 XXX-s03 abrt[15533]: Not saving repeating crash in '/usr/bin/ceph-osd'
Mar  5 21:57:03 XXX-s03 abrt[15593]: Not saving repeating crash in '/usr/bin/ceph-osd'
Mar  5 21:57:03 XXX-s03 kernel: Pid 38780(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 52341(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 37124(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 52529(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 40133(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 58578(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 39533(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 38208(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 39747(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 33251(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 34544(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 36846(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 57863(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 29903(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 37556(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 61609(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 38025(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 29861(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 26424(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 34995(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 35207(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 7912(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 52841(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:03 XXX-s03 kernel: Pid 51481(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump
Mar  5 21:57:18 XXX-s03 abrt[15533]: Saved core dump of pid 14107 to core.14107 (1781747712 bytes)
Mar  5 21:57:20 XXX-s03 abrt[15593]: Saved core dump of pid 617 to core.617 (2011574272 bytes)
Mar  5 21:57:20 XXX-s03 abrt[6707]: Saved core dump of pid 63153 to core.63153 (1946578944 bytes)
Mar  5 21:57:22 XXX-s03 abrt[5258]: Saved core dump of pid 38442 (/usr/bin/ceph-osd) to /var/spool/abrt/ccpp-2015-03-05-21:57:03-38442 (1843965952 bytes)
Mar  5 21:57:22 XXX-s03 abrtd: Directory 'ccpp-2015-03-05-21:57:03-38442' creation detected
Mar  5 21:57:22 XXX-s03 abrtd: Package 'ceph' isn't signed with proper key
Mar  5 21:57:22 XXX-s03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-03-05-21:57:03-38442' exited with 1
Mar  5 21:57:22 XXX-s03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-03-05-21:57:03-38442'
Mar  5 22:03:52 XXX-s03 kernel: perf interrupt took too long (65424 > 62500), lowering kernel.perf_event_max_sample_rate to 2000

Output from another OSD node:


=== osd.200 ===
Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fb85a59f760 time 2015-03-05 16:51:48.371919
common/Thread.cc: 129: FAILED assert(ret == 0)
 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0x4de92a]
 2: (CephContext::CephContext(unsigned int)+0xba) [0x4e380a]
 3: (common_preinit(CephInitParameters const&, code_environment_t, int)+0x45) [0x4cb225]
 4: (global_pre_init(std::vector<char const*, std::allocator<char const*> >*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int)+0xaf) [0x4b273f]
 5: (main()+0x1f6) [0x4aec76]
 6: (__libc_start_main()+0xfd) [0x3594e1ed5d]
 7: /usr/bin/ceph-conf() [0x4ae3c9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory
/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory
/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory
/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory
/etc/init.d/ceph: line 362: --cluster: command not found
/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory
/etc/init.d/ceph: fork: Cannot allocate memory
/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory
timeout: fork system call failed: Cannot allocate memory
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.200 --keyring= osd crush create-or-move -- 200 1 '
[root@XXX-s02 ~]#

#6 Updated by karan singh about 9 years ago

Output of ceph -s:

[root@pouta-s01 ~]# ceph -s
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 375 pgs degraded; 778 pgs down; 8692 pgs peering; 9001 pgs stuck inactive; 9485 pgs stuck unclean; recovery 261/31125 objects degraded (0.839%); nodown,noout flag(s) set
     monmap e3: 3 mons at {XXX-s01=xx.xx.xx.1:6789/0,XXX-s02=xx.xx.xx.1:6789/0,XXX-s03=xx.xx.xx.1:6789/0}, election epoch 1260, quorum 0,1,2 XXX-s01,XXX-s02,XXX-s03
     osdmap e24197: 240 osds: 239 up, 239 in
            flags nodown,noout
      pgmap v55409: 17408 pgs, 13 pools, 42405 MB data, 10375 objects
            4834 GB used, 863 TB / 868 TB avail
            261/31125 objects degraded (0.839%)
                 273 inactive
                   6 down+remapped+peering
                7849 active+clean
                7174 peering
                   1 active+degraded+remapped
                 772 down+peering
                   1 remapped
                 111 active+remapped
                 740 remapped+peering
                  33 replay
                   2 degraded
                  74 active+clean+replay
                 372 active+degraded
[root@pouta-s01 ~]#

Note: these values are misleading, as the nodown and noout flags are set and the OSDs are actually down.

#7 Updated by Kefu Chai about 9 years ago

Karan, could you also check /proc/sys/vm/max_map_count? And by the way, are you running ceph-osd on bare metal or on a VPS?

---
The following is my analysis.

The interesting thing is:

ceph-disk: [Errno 12] Cannot allocate memory

ceph-disk is a Python script. It tried to fork a process to execute ceph-conf, but the fork failed with ENOMEM. I think this has nothing to do with ceph-conf itself, because the command() method never got a chance to exec(3).

/usr/lib64/ceph/ceph_common.sh: fork: Cannot allocate memory

ceph_common.sh also uses ceph-conf to read settings, and:

/etc/init.d/ceph: fork: Cannot allocate memory

while the limits themselves look fine:

data seg size           (kbytes, -d) unlimited
max memory size         (kbytes, -m) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited

And apparently the system has enough free memory available, per the output of free and vmstat. So my wild guess is that the memory-map limit is too small for ceph-disk and ceph-osd.

And

Mar  5 21:57:03 XXX-s03 kernel: Pid 51481(ceph-osd) over core_pipe_limit
Mar  5 21:57:03 XXX-s03 kernel: Skipping core dump

is probably fine, as the size of a core dump can be fairly large. The largest core dump found in your log file,

Mar 5 21:57:20 XXX-s03 abrt[15593]: Saved core dump of pid 617 to core.617 (2011574272 bytes)

is about 1.87 GB.
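Back to the memory-map guess: a quick way to check it on one of your nodes (a read-only sketch; the pgrep pick of a ceph-osd pid is illustrative):

    # per-process limit on the number of memory mappings
    cat /proc/sys/vm/max_map_count
    # overcommit policy: 0 = heuristic, 1 = always, 2 = strict accounting
    cat /proc/sys/vm/overcommit_memory
    # mappings currently used by the oldest running ceph-osd
    wc -l < /proc/$(pgrep -o ceph-osd)/maps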

#8 Updated by karan singh about 9 years ago

Hi Kefu

Thanks for your observations; here is the output:

[root@XXX-s04 ~]# cat /proc/sys/vm/max_map_count
65530
[root@XXX-s03 ~]#

Note: the max_map_count value is the same (65530) on all the Ceph nodes, FYI.

Is there anything else you think I should check to get this fixed?

#9 Updated by Kefu Chai about 9 years ago

If you are using an OpenVZ host, malloc can fail even while free shows that you have enough memory. That's another possible reason, I guess...
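A quick way to verify the host type (a sketch; virt-what prints nothing on bare metal, and /proc/user_beancounters exists only under OpenVZ):

    # empty output means bare metal
    virt-what
    # present only inside an OpenVZ container
    ls /proc/user_beancounters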

#10 Updated by Sage Weil about 9 years ago

  • Priority changed from Immediate to Urgent

#11 Updated by karan singh about 9 years ago

None of the Ceph nodes uses any kind of virtualization. Ceph is installed and running on standard HP x86 servers with CentOS.

Things were working normally before I hit this problem, and there is still no resolution.

Even today I tried starting the OSD services one by one with a delay of 15 minutes in between; I was able to start only 160 OSDs out of 240. However, just 30 minutes later another 50 OSDs went down, and the cluster came back down to 110 OSDs :-(

Any other pointers to this problem would be appreciated.

#12 Updated by karan singh about 9 years ago

Troubleshooting this issue further today.

I rebooted my OSD node with the stock 2.6.32 CentOS 6 kernel, just to find out whether this problem is due to the underlying kernel:

[root@XXX-s01 ceph]# uname -r
2.6.32-431.el6.x86_64
[root@XXX-s01 ceph]#

Unfortunately, the problem is still there: Ceph cannot recover, and OSDs keep going down. Now it looks like the problem is with the CEPH CODE BASE.

Is there any Ceph developer who can work with me to fix this?

     0> 2015-03-09 11:48:39.160224 7fd63146c700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd63146c700 time 2015-03-09 11:48:39.157697
common/Thread.cc: 129: FAILED assert(ret == 0)
[root@XXX-s01 ceph]# tail -500  ceph-osd.4.log | grep -v heartbeat_check | grep -v osd_ping
  -245> 2015-03-09 11:48:38.147621 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.157 10.XXX.XX.3:6911/14701 209 ==== osd_map(26583..26583 src has 5446..26583) v3 ==== 2027+0+0 (2791851435 0 0) 0xbdcc5c0 con 0xc589b80
  -244> 2015-03-09 11:48:38.147880 7fd664ecc700  3 osd.4 26582 handle_osd_map epochs [26583,26583], i have 26582, src has [5446,26583]
  -239> 2015-03-09 11:48:38.165698 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.157 10.XXX.XX.3:6911/14701 210 ==== pg_notify(2.55f(16) epoch 26583) v5 ==== 1446+0+0 (4099753173 0 0) 0x1124d400 con 0xc589b80
  -238> 2015-03-09 11:48:38.165780 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.147573, event: header_read, request: pg_notify(2.55f(16) epoch 26583) v5
  -237> 2015-03-09 11:48:38.165818 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.147575, event: throttled, request: pg_notify(2.55f(16) epoch 26583) v5
  -236> 2015-03-09 11:48:38.165901 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.147733, event: all_read, request: pg_notify(2.55f(16) epoch 26583) v5
  -235> 2015-03-09 11:48:38.165937 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.165777, event: dispatched, request: pg_notify(2.55f(16) epoch 26583) v5
  -234> 2015-03-09 11:48:38.165971 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.165971, event: waiting_for_osdmap, request: pg_notify(2.55f(16) epoch 26583) v5
  -233> 2015-03-09 11:48:38.166006 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.166006, event: started, request: pg_notify(2.55f(16) epoch 26583) v5
  -232> 2015-03-09 11:48:38.166061 7fd664ecc700  5 -- op tracker -- , seq: 3366, time: 2015-03-09 11:48:38.166061, event: done, request: pg_notify(2.55f(16) epoch 26583) v5
  -231> 2015-03-09 11:48:38.167770 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7061/57477 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xaa7e540 con 0xa52cd00
  -230> 2015-03-09 11:48:38.167649 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26582/26420) [95,99,157]/[99,157,4] r=2 lpr=26582 pi=25357-26581/32 crt=0'0 remapped NOTIFY] exit Started/Stray 0.956628 4 0.000508
  -229> 2015-03-09 11:48:38.167847 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26582/26420) [95,99,157]/[99,157,4] r=2 lpr=26582 pi=25357-26581/32 crt=0'0 remapped NOTIFY] exit Started 0.956929 0 0.000000
  -228> 2015-03-09 11:48:38.167855 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7061/57477 -- pg_notify(12.2f4(194) epoch 26583) v5 -- ?+0 0x10ad7a80 con 0xa52cd00
  -227> 2015-03-09 11:48:38.167868 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26582/26420) [95,99,157]/[99,157,4] r=2 lpr=26582 pi=25357-26581/32 crt=0'0 remapped NOTIFY] enter Reset
  -226> 2015-03-09 11:48:38.167979 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26583/26583) [95,99,157] r=-1 lpr=26583 pi=25357-26582/33 crt=0'0 inactive NOTIFY] exit Reset 0.000112 1 0.000295
  -225> 2015-03-09 11:48:38.167998 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26583/26583) [95,99,157] r=-1 lpr=26583 pi=25357-26582/33 crt=0'0 inactive NOTIFY] enter Started
  -224> 2015-03-09 11:48:38.168010 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26583/26583) [95,99,157] r=-1 lpr=26583 pi=25357-26582/33 crt=0'0 inactive NOTIFY] enter Start
  -223> 2015-03-09 11:48:38.168059 7fd65d6c0700  1 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26583/26583) [95,99,157] r=-1 lpr=26583 pi=25357-26582/33 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
  -222> 2015-03-09 11:48:38.168074 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26583/26583) [95,99,157] r=-1 lpr=26583 pi=25357-26582/33 crt=0'0 inactive NOTIFY] exit Start 0.000063 0 0.000000
  -221> 2015-03-09 11:48:38.168088 7fd65d6c0700  5 osd.4 pg_epoch: 26583 pg[8.33f( empty local-les=26571 n=0 ec=2331 les/c 26571/26571 26582/26583/26583) [95,99,157] r=-1 lpr=26583 pi=25357-26582/33 crt=0'0 inactive NOTIFY] enter Started/Stray
  -220> 2015-03-09 11:48:38.169380 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6866/44142 -- osd_map(26484..26583 src has 5446..26583) v3 -- ?+0 0x9a1c380 con 0xc66e300
  -219> 2015-03-09 11:48:38.169524 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6866/44142 -- pg_notify(8.33f(33) epoch 26583) v5 -- ?+0 0xe1d2f40 con 0xc66e300
  -218> 2015-03-09 11:48:38.169604 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.1:7036/17900 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xaa78900 con 0xa6f1a20
  -217> 2015-03-09 11:48:38.169631 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.1:7036/17900 -- pg_notify(8.11c(179) epoch 26583) v5 -- ?+0 0x10ad78c0 con 0xa6f1a20
  -214> 2015-03-09 11:48:38.170722 7fd6630c9700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:6916/17106 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xc1fcc80 con 0x95edc20
  -211> 2015-03-09 11:48:38.176803 7fd6630c9700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:6984/38385 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xc1fe780 con 0x757aaa0
  -208> 2015-03-09 11:48:38.177506 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:7018/19897 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xc1ef980 con 0x95e82c0
  -207> 2015-03-09 11:48:38.177618 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:7018/19897 -- pg_notify(12.2c5(291) epoch 26583) v5 -- ?+0 0xe448c40 con 0x95e82c0
  -204> 2015-03-09 11:48:38.188004 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:6822/9024 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xa2a4800 con 0x96e5120
  -203> 2015-03-09 11:48:38.207541 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:6822/9024 -- pg_notify(4.19c(242) epoch 26583) v5 -- ?+0 0xf51e580 con 0x96e5120
  -202> 2015-03-09 11:48:38.211653 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26451/26451/26451) [4] r=0 lpr=26451 crt=0'0 mlcod 0'0 active+degraded] exit Started/Primary/Active/Clean 1062.444396 391 0.023074
  -201> 2015-03-09 11:48:38.212077 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26451/26451/26451) [4] r=0 lpr=26451 crt=0'0 mlcod 0'0 active+degraded] exit Started/Primary/Active 1062.447046 0 0.000000
  -200> 2015-03-09 11:48:38.212174 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26451/26451/26451) [4] r=0 lpr=26451 crt=0'0 mlcod 0'0 active] exit Started/Primary 1063.417807 0 0.000000
  -199> 2015-03-09 11:48:38.212237 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26451/26451/26451) [4] r=0 lpr=26451 crt=0'0 mlcod 0'0 active] exit Started 1063.418135 0 0.000000
  -198> 2015-03-09 11:48:38.212288 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26451/26451/26451) [4] r=0 lpr=26451 crt=0'0 mlcod 0'0 active] enter Reset
  -197> 2015-03-09 11:48:38.212385 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26583/26583/26583) [98,4] r=1 lpr=26583 pi=26451-26582/1 crt=0'0 inactive NOTIFY] exit Reset 0.000097 1 0.000724
  -196> 2015-03-09 11:48:38.212469 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26583/26583/26583) [98,4] r=1 lpr=26583 pi=26451-26582/1 crt=0'0 inactive NOTIFY] enter Started
  -195> 2015-03-09 11:48:38.212520 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26583/26583/26583) [98,4] r=1 lpr=26583 pi=26451-26582/1 crt=0'0 inactive NOTIFY] enter Start
  -194> 2015-03-09 11:48:38.212557 7fd65e0c1700  1 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26583/26583/26583) [98,4] r=1 lpr=26583 pi=26451-26582/1 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
  -193> 2015-03-09 11:48:38.212596 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26583/26583/26583) [98,4] r=1 lpr=26583 pi=26451-26582/1 crt=0'0 inactive NOTIFY] exit Start 0.000076 0 0.000000
  -192> 2015-03-09 11:48:38.212637 7fd65e0c1700  5 osd.4 pg_epoch: 26583 pg[2.4e6( empty local-les=26452 n=0 ec=1 les/c 26452/26452 26583/26583/26583) [98,4] r=1 lpr=26583 pi=26451-26582/1 crt=0'0 inactive NOTIFY] enter Started/Stray
  -191> 2015-03-09 11:48:38.213649 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6871/45800 -- osd_map(26484..26583 src has 5446..26583) v3 -- ?+0 0xaa7b180 con 0xc669fa0
  -190> 2015-03-09 11:48:38.213807 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6871/45800 -- pg_notify(2.4e6(1) epoch 26583) v5 -- ?+0 0x10ad0700 con 0xc669fa0
  -189> 2015-03-09 11:48:38.216076 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.95 10.XXX.XX.4:6866/44142 1 ==== pg_query(8.33f epoch 26583) v3 ==== 144+0+0 (133574554 0 0) 0xa0aa760 con 0xc66e300
  -188> 2015-03-09 11:48:38.216166 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.187451, event: header_read, request: pg_query(8.33f epoch 26583) v3
  -187> 2015-03-09 11:48:38.216204 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.187455, event: throttled, request: pg_query(8.33f epoch 26583) v3
  -186> 2015-03-09 11:48:38.216232 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.187532, event: all_read, request: pg_query(8.33f epoch 26583) v3
  -185> 2015-03-09 11:48:38.216260 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.216160, event: dispatched, request: pg_query(8.33f epoch 26583) v3
  -184> 2015-03-09 11:48:38.216297 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.216297, event: waiting_for_osdmap, request: pg_query(8.33f epoch 26583) v3
  -183> 2015-03-09 11:48:38.216333 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.216333, event: started, request: pg_query(8.33f epoch 26583) v3
  -182> 2015-03-09 11:48:38.216378 7fd664ecc700  5 -- op tracker -- , seq: 3367, time: 2015-03-09 11:48:38.216378, event: done, request: pg_query(8.33f epoch 26583) v3
  -181> 2015-03-09 11:48:38.216431 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.23 10.XXX.XX.1:6836/16768 204 ==== osd_map(26583..26583 src has 5446..26583) v3 ==== 2027+0+0 (2791851435 0 0) 0xbdcd100 con 0x9a00f20
  -180> 2015-03-09 11:48:38.216488 7fd664ecc700  3 osd.4 26583 handle_osd_map epochs [26583,26583], i have 26583, src has [5446,26583]
  -179> 2015-03-09 11:48:38.216527 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.184 10.XXX.XX.3:6906/10319 210 ==== osd_map(26583..26583 src has 5446..26583) v3 ==== 2027+0+0 (2791851435 0 0) 0xc1f98c0 con 0xc58cfc0
  -178> 2015-03-09 11:48:38.216574 7fd664ecc700  3 osd.4 26583 handle_osd_map epochs [26583,26583], i have 26583, src has [5446,26583]
  -177> 2015-03-09 11:48:38.216611 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.23 10.XXX.XX.1:6836/16768 205 ==== pg_notify(2.55f(16) epoch 26583) v5 ==== 1446+0+0 (619452503 0 0) 0xe3baf40 con 0x9a00f20
  -176> 2015-03-09 11:48:38.216654 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.191344, event: header_read, request: pg_notify(2.55f(16) epoch 26583) v5
  -175> 2015-03-09 11:48:38.217064 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.191345, event: throttled, request: pg_notify(2.55f(16) epoch 26583) v5
  -174> 2015-03-09 11:48:38.217137 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.191441, event: all_read, request: pg_notify(2.55f(16) epoch 26583) v5
  -173> 2015-03-09 11:48:38.217172 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.216651, event: dispatched, request: pg_notify(2.55f(16) epoch 26583) v5
  -172> 2015-03-09 11:48:38.217219 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.217218, event: waiting_for_osdmap, request: pg_notify(2.55f(16) epoch 26583) v5
  -171> 2015-03-09 11:48:38.217255 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.217255, event: started, request: pg_notify(2.55f(16) epoch 26583) v5
  -170> 2015-03-09 11:48:38.217573 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7076/9863 -- osd_map(26583..26583 src has 5446..26583) v3 -- ?+0 0xa2a06c0 con 0xc66f7a0
  -169> 2015-03-09 11:48:38.217646 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7076/9863 -- pg_notify(1.706(233) epoch 26583) v5 -- ?+0 0xf51d940 con 0xc66f7a0
  -168> 2015-03-09 11:48:38.220704 7fd664ecc700  5 -- op tracker -- , seq: 3368, time: 2015-03-09 11:48:38.220704, event: done, request: pg_notify(2.55f(16) epoch 26583) v5
  -167> 2015-03-09 11:48:38.220830 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6866/44142 -- pg_notify(8.33f(33) epoch 26583) v5 -- ?+0 0x1101b800 con 0xc66e300
  -166> 2015-03-09 11:48:38.222512 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.184 10.XXX.XX.3:6906/10319 211 ==== pg_notify(2.55f(16) epoch 26583) v5 ==== 1446+0+0 (1379774543 0 0) 0xc7bf1c0 con 0xc58cfc0
  -165> 2015-03-09 11:48:38.222867 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.198794, event: header_read, request: pg_notify(2.55f(16) epoch 26583) v5
  -164> 2015-03-09 11:48:38.223220 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.198795, event: throttled, request: pg_notify(2.55f(16) epoch 26583) v5
  -163> 2015-03-09 11:48:38.223465 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.198902, event: all_read, request: pg_notify(2.55f(16) epoch 26583) v5
  -162> 2015-03-09 11:48:38.223951 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.222863, event: dispatched, request: pg_notify(2.55f(16) epoch 26583) v5
  -161> 2015-03-09 11:48:38.224139 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.224139, event: waiting_for_osdmap, request: pg_notify(2.55f(16) epoch 26583) v5
  -160> 2015-03-09 11:48:38.224362 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.224362, event: started, request: pg_notify(2.55f(16) epoch 26583) v5
  -159> 2015-03-09 11:48:38.224832 7fd664ecc700  5 -- op tracker -- , seq: 3369, time: 2015-03-09 11:48:38.224832, event: done, request: pg_notify(2.55f(16) epoch 26583) v5
  -158> 2015-03-09 11:48:38.225155 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.98 10.XXX.XX.4:6871/45800 1 ==== pg_query(2.4e6 epoch 26583) v3 ==== 144+0+0 (2534493726 0 0) 0xa0aaee0 con 0xc669fa0
  -157> 2015-03-09 11:48:38.225418 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.224898, event: header_read, request: pg_query(2.4e6 epoch 26583) v3
  -156> 2015-03-09 11:48:38.226051 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.224900, event: throttled, request: pg_query(2.4e6 epoch 26583) v3
  -155> 2015-03-09 11:48:38.226367 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.224967, event: all_read, request: pg_query(2.4e6 epoch 26583) v3
  -154> 2015-03-09 11:48:38.226534 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.225414, event: dispatched, request: pg_query(2.4e6 epoch 26583) v3
  -153> 2015-03-09 11:48:38.226817 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.226816, event: waiting_for_osdmap, request: pg_query(2.4e6 epoch 26583) v3
  -152> 2015-03-09 11:48:38.227257 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.227257, event: started, request: pg_query(2.4e6 epoch 26583) v3
  -151> 2015-03-09 11:48:38.227548 7fd664ecc700  5 -- op tracker -- , seq: 3370, time: 2015-03-09 11:48:38.227548, event: done, request: pg_query(2.4e6 epoch 26583) v3
  -150> 2015-03-09 11:48:38.227795 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6871/45800 -- pg_notify(2.4e6(1) epoch 26583) v5 -- ?+0 0x1101af40 con 0xc669fa0
  -145> 2015-03-09 11:48:38.257606 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.193 10.XXX.XX.3:7021/20901 322 ==== osd_map(26583..26583 src has 5446..26583) v3 ==== 2027+0+0 (2791851435 0 0) 0xd664a40 con 0x506f220
  -144> 2015-03-09 11:48:38.257707 7fd664ecc700  3 osd.4 26583 handle_osd_map epochs [26583,26583], i have 26583, src has [5446,26583]
  -143> 2015-03-09 11:48:38.260096 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.193 10.XXX.XX.3:7021/20901 323 ==== pg_notify(4.2f7(26) epoch 26583) v5 ==== 2084+0+0 (1842663399 0 0) 0xf4efe00 con 0x506f220
  -142> 2015-03-09 11:48:38.260180 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.259882, event: header_read, request: pg_notify(4.2f7(26) epoch 26583) v5
  -141> 2015-03-09 11:48:38.260147 7fd652f7c700  2 -- 10.XXX.XX.1:6897/37300 >> 10.XXX.XX.2:0/49594 pipe(0x97ce680 sd=185 :6897 s=2 pgs=2248 cs=1 l=1 c=0xa92a680).reader couldn't read tag, (0) Success
  -140> 2015-03-09 11:48:38.260255 7fd652f7c700  2 -- 10.XXX.XX.1:6897/37300 >> 10.XXX.XX.2:0/49594 pipe(0x97ce680 sd=185 :6897 s=2 pgs=2248 cs=1 l=1 c=0xa92a680).fault (0) Success
  -139> 2015-03-09 11:48:38.260579 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.259884, event: throttled, request: pg_notify(4.2f7(26) epoch 26583) v5
  -138> 2015-03-09 11:48:38.260638 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.259992, event: all_read, request: pg_notify(4.2f7(26) epoch 26583) v5
  -137> 2015-03-09 11:48:38.260672 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.260176, event: dispatched, request: pg_notify(4.2f7(26) epoch 26583) v5
  -136> 2015-03-09 11:48:38.260708 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.260708, event: waiting_for_osdmap, request: pg_notify(4.2f7(26) epoch 26583) v5
  -133> 2015-03-09 11:48:38.260747 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.260747, event: started, request: pg_notify(4.2f7(26) epoch 26583) v5
  -130> 2015-03-09 11:48:38.260927 7fd664ecc700  5 -- op tracker -- , seq: 3371, time: 2015-03-09 11:48:38.260926, event: done, request: pg_notify(4.2f7(26) epoch 26583) v5
  -129> 2015-03-09 11:48:38.262411 7fd65307d700  2 -- 10.XXX.XX.1:6898/37300 >> 10.XXX.XX.2:0/49594 pipe(0xa68b700 sd=186 :6898 s=2 pgs=2249 cs=1 l=1 c=0xa1206e0).reader couldn't read tag, (0) Success
  -128> 2015-03-09 11:48:38.262489 7fd65307d700  2 -- 10.XXX.XX.1:6898/37300 >> 10.XXX.XX.2:0/49594 pipe(0xa68b700 sd=186 :6898 s=2 pgs=2249 cs=1 l=1 c=0xa1206e0).fault (0) Success
  -127> 2015-03-09 11:48:38.265822 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.181 10.XXX.XX.3:7066/61722 12 ==== osd_map(26583..26583 src has 5446..26583) v3 ==== 2027+0+0 (2791851435 0 0) 0xaa7f080 con 0xa95c0a0
  -126> 2015-03-09 11:48:38.265895 7fd664ecc700  3 osd.4 26583 handle_osd_map epochs [26583,26583], i have 26583, src has [5446,26583]
  -125> 2015-03-09 11:48:38.269540 7fd63136b700  2 -- 10.XXX.XX.1:6897/37300 >> 10.XXX.XX.3:0/24000 pipe(0x965b700 sd=574 :6897 s=2 pgs=203 cs=1 l=1 c=0xa92bc80).reader couldn't read tag, (0) Success
  -124> 2015-03-09 11:48:38.269624 7fd63136b700  2 -- 10.XXX.XX.1:6897/37300 >> 10.XXX.XX.3:0/24000 pipe(0x965b700 sd=574 :6897 s=2 pgs=203 cs=1 l=1 c=0xa92bc80).fault (0) Success
  -121> 2015-03-09 11:48:38.271978 7fd63146c700  2 -- 10.XXX.XX.1:6898/37300 >> 10.XXX.XX.3:0/24000 pipe(0x38c2580 sd=197 :6898 s=2 pgs=204 cs=1 l=1 c=0xa123020).reader couldn't read tag, (0) Success
  -120> 2015-03-09 11:48:38.272086 7fd63146c700  2 -- 10.XXX.XX.1:6898/37300 >> 10.XXX.XX.3:0/24000 pipe(0x38c2580 sd=197 :6898 s=2 pgs=204 cs=1 l=1 c=0xa123020).fault (0) Success
   -97> 2015-03-09 11:48:38.357713 7fd675125700  5 osd.4 26583 tick
    -4> 2015-03-09 11:48:39.148773 7fd664ecc700  1 -- 10.XXX.XX.1:6896/37300 <== osd.157 10.XXX.XX.3:6911/14701 211 ==== osd_map(26584..26584 src has 5446..26584) v3 ==== 1312+0+0 (3135781505 0 0) 0xbdce300 con 0xc589b80
    -3> 2015-03-09 11:48:39.148921 7fd664ecc700  3 osd.4 26583 handle_osd_map epochs [26584,26584], i have 26583, src has [5446,26584]
    -2> 2015-03-09 11:48:39.157434 7fd63146c700  1 -- 10.XXX.XX.1:6898/37300 >> :/0 pipe(0xcb0a300 sd=185 :6898 s=0 pgs=0 cs=0 l=0 c=0xc512520).accept sd=185 10.XXX.XX.4:46642/0
    -1> 2015-03-09 11:48:39.157789 7fd630e66700  1 -- 10.XXX.XX.1:6897/37300 >> :/0 pipe(0xc5dda00 sd=197 :6897 s=0 pgs=0 cs=0 l=0 c=0xc635800).accept sd=197 10.XXX.XX.4:32894/0
     0> 2015-03-09 11:48:39.160224 7fd63146c700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd63146c700 time 2015-03-09 11:48:39.157697
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (Pipe::accept()+0x4ac5) [0xb48585]
 3: (Pipe::reader()+0x1bae) [0xb4b3ce]
 4: (Pipe::Reader::entry()+0xd) [0xb4d8ad]
 5: /lib64/libpthread.so.0() [0x3c8a6079d1]
 6: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.4.log
--- end dump of recent events ---
2015-03-09 11:48:39.162317 7fd630e66700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd630e66700 time 2015-03-09 11:48:39.157992
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (Pipe::accept()+0x4ac5) [0xb48585]
 3: (Pipe::reader()+0x1bae) [0xb4b3ce]
 4: (Pipe::Reader::entry()+0xd) [0xb4d8ad]
 5: /lib64/libpthread.so.0() [0x3c8a6079d1]
 6: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2015-03-09 11:48:39.170787 7fd6612c6700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd6612c6700 time 2015-03-09 11:48:39.169732
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
 3: (Accepter::entry()+0x265) [0xb5c635]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2015-03-09 11:48:39.182106 7fd6626c8700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd6626c8700 time 2015-03-09 11:48:39.180960
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
 3: (Accepter::entry()+0x265) [0xb5c635]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -22> 2015-03-09 11:48:39.161119 7fd661cc7700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:6901/8063 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xa2a69c0 con 0xc58cd00
   -21> 2015-03-09 11:48:39.162317 7fd630e66700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd630e66700 time 2015-03-09 11:48:39.157992
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (Pipe::accept()+0x4ac5) [0xb48585]
 3: (Pipe::reader()+0x1bae) [0xb4b3ce]
 4: (Pipe::Reader::entry()+0xd) [0xb4d8ad]
 5: /lib64/libpthread.so.0() [0x3c8a6079d1]
 6: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -18> 2015-03-09 11:48:39.170787 7fd6612c6700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd6612c6700 time 2015-03-09 11:48:39.169732
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
 3: (Accepter::entry()+0x265) [0xb5c635]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -17> 2015-03-09 11:48:39.182106 7fd6626c8700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd6626c8700 time 2015-03-09 11:48:39.180960
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
 3: (Accepter::entry()+0x265) [0xb5c635]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -13> 2015-03-09 11:48:39.210662 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:7018/19897 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xa2a45c0 con 0x95e82c0
   -12> 2015-03-09 11:48:39.210729 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:7018/19897 -- pg_notify(12.2c5(291) epoch 26584) v5 -- ?+0 0xf518380 con 0x95e82c0
   -11> 2015-03-09 11:48:39.218260 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6866/44142 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xaa7e780 con 0xc66e300
   -10> 2015-03-09 11:48:39.218310 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6866/44142 -- pg_notify(8.33f(33) epoch 26584) v5 -- ?+0 0x11018000 con 0xc66e300
    -9> 2015-03-09 11:48:39.244840 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7061/57477 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xaa7c140 con 0xa52cd00
    -8> 2015-03-09 11:48:39.244889 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7061/57477 -- pg_notify(12.2f4(194) epoch 26584) v5 -- ?+0 0xd249340 con 0xa52cd00
    -7> 2015-03-09 11:48:39.253518 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.1:7036/17900 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xaa78fc0 con 0xa6f1a20
    -6> 2015-03-09 11:48:39.253568 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.1:7036/17900 -- pg_notify(8.11c(179) epoch 26584) v5 -- ?+0 0xd24f000 con 0xa6f1a20
    -5> 2015-03-09 11:48:39.282716 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7076/9863 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xa2a4380 con 0xc66f7a0
    -4> 2015-03-09 11:48:39.282767 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.3:7076/9863 -- pg_notify(1.706(233) epoch 26584) v5 -- ?+0 0xf51d780 con 0xc66f7a0
    -3> 2015-03-09 11:48:39.298106 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:6822/9024 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xa2a4140 con 0x96e5120
    -2> 2015-03-09 11:48:39.298175 7fd65d6c0700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.2:6822/9024 -- pg_notify(4.19c(242) epoch 26584) v5 -- ?+0 0xf64b640 con 0x96e5120
    -1> 2015-03-09 11:48:39.329683 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6871/45800 -- osd_map(26584..26584 src has 5446..26584) v3 -- ?+0 0xaa78b40 con 0xc669fa0
     0> 2015-03-09 11:48:39.329737 7fd65e0c1700  1 -- 10.XXX.XX.1:6896/37300 --> 10.XXX.XX.4:6871/45800 -- pg_notify(2.4e6(1) epoch 26584) v5 -- ?+0 0xd248700 con 0xc669fa0
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.4.log
--- end dump of recent events ---
--- begin dump of recent events ---
--- logging levels ---
[root@XXX-s01 ceph]#

#13 Updated by Christian Eichelmann about 9 years ago

Hi Karan,

As I wrote on the mailing list, 65536 is too low for high-density hardware.

In our cluster, one OSD server has about 66,000 threads when idle (60 OSDs per server). The number of threads increases when you increase the number of placement groups in the cluster, which I think is what triggered your problem.

Set the "kernel.pid_max" sysctl to 4194303 (the maximum) and it should work.
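For example (a standard sysctl change; /etc/sysctl.conf is the usual persistent location on CentOS 6):

    # see how close the node already is to the limit
    ps -eLf | wc -l
    # apply immediately
    sysctl -w kernel.pid_max=4194303
    # persist across reboots
    echo 'kernel.pid_max = 4194303' >> /etc/sysctl.conf
    sysctl -p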

Regards,
Christian

#14 Updated by karan singh about 9 years ago

Thanks Christian, it worked with kernel.pid_max=4194303.

#15 Updated by Samuel Just about 9 years ago

  • Status changed from New to Rejected
