Bug #17958
Simultaneous start of ceph-osd processes causes cluster to get stuck in error state
Description
Hi.
I have a Ceph cluster consisting of 3 storage nodes (20 OSDs each) and 3 monitors.
root@storage-ceph-01z:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.6 (jessie)
Release:        8.6
Codename:       jessie
root@storage-ceph-01z:~# ceph -v
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
I have noticed that when multiple OSDs start simultaneously, they come up improperly. This leaves many PGs stuck in the peering and/or remapped+peering state for 15-20 minutes. Nothing happens: there is no recovery and no activity in the cluster. Some OSDs can't see the just-launched OSDs and their heartbeats fail, yet the cluster status shows that 60/60 OSDs are up and in. Some time later the cluster comes alive, recovery begins, and health becomes OK.
My osd tree looks like:
root@storage-ceph-01z:~# ceph osd tree
ID  WEIGHT    TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 109.10980 root default
-2  36.36993     host storage-ceph-01y
 0   1.81850         osd.0                   up  1.00000          1.00000
 3   1.81850         osd.3                   up  1.00000          1.00000
 5   1.81850         osd.5                   up  1.00000          1.00000
 8   1.81850         osd.8                   up  1.00000          1.00000
 9   1.81850         osd.9                   up  1.00000          1.00000
11   1.81850         osd.11                  up  1.00000          1.00000
14   1.81850         osd.14                  up  1.00000          1.00000
16   1.81850         osd.16                  up  1.00000          1.00000
19   1.81850         osd.19                  up  1.00000          1.00000
22   1.81850         osd.22                  up  1.00000          1.00000
24   1.81850         osd.24                  up  1.00000          1.00000
26   1.81850         osd.26                  up  1.00000          1.00000
29   1.81850         osd.29                  up  1.00000          1.00000
32   1.81850         osd.32                  up  1.00000          1.00000
35   1.81850         osd.35                  up  1.00000          1.00000
38   1.81850         osd.38                  up  1.00000          1.00000
41   1.81850         osd.41                  up  1.00000          1.00000
43   1.81850         osd.43                  up  1.00000          1.00000
45   1.81850         osd.45                  up  1.00000          1.00000
48   1.81850         osd.48                  up  1.00000          1.00000
-3  36.36993     host storage-ceph-01x
 1   1.81850         osd.1                   up  1.00000          1.00000
 2   1.81850         osd.2                   up  1.00000          1.00000
 4   1.81850         osd.4                   up  1.00000          1.00000
 7   1.81850         osd.7                   up  1.00000          1.00000
10   1.81850         osd.10                  up  1.00000          1.00000
12   1.81850         osd.12                  up  1.00000          1.00000
15   1.81850         osd.15                  up  1.00000          1.00000
17   1.81850         osd.17                  up  1.00000          1.00000
18   1.81850         osd.18                  up  1.00000          1.00000
21   1.81850         osd.21                  up  1.00000          1.00000
23   1.81850         osd.23                  up  1.00000          1.00000
25   1.81850         osd.25                  up  1.00000          1.00000
28   1.81850         osd.28                  up  1.00000          1.00000
31   1.81850         osd.31                  up  1.00000          1.00000
34   1.81850         osd.34                  up  1.00000          1.00000
37   1.81850         osd.37                  up  1.00000          1.00000
40   1.81850         osd.40                  up  1.00000          1.00000
42   1.81850         osd.42                  up  1.00000          1.00000
44   1.81850         osd.44                  up  1.00000          1.00000
47   1.81850         osd.47                  up  1.00000          1.00000
-4  36.36993     host storage-ceph-01z
 6   1.81850         osd.6                   up  1.00000          1.00000
13   1.81850         osd.13                  up  1.00000          1.00000
20   1.81850         osd.20                  up  1.00000          1.00000
27   1.81850         osd.27                  up  1.00000          1.00000
30   1.81850         osd.30                  up  1.00000          1.00000
33   1.81850         osd.33                  up  1.00000          1.00000
36   1.81850         osd.36                  up  1.00000          1.00000
39   1.81850         osd.39                  up  1.00000          1.00000
46   1.81850         osd.46                  up  1.00000          1.00000
49   1.81850         osd.49                  up  1.00000          1.00000
50   1.81850         osd.50                  up  1.00000          1.00000
51   1.81850         osd.51                  up  1.00000          1.00000
52   1.81850         osd.52                  up  1.00000          1.00000
53   1.81850         osd.53                  up  1.00000          1.00000
54   1.81850         osd.54                  up  1.00000          1.00000
55   1.81850         osd.55                  up  1.00000          1.00000
56   1.81850         osd.56                  up  1.00000          1.00000
57   1.81850         osd.57                  up  1.00000          1.00000
58   1.81850         osd.58                  up  1.00000          1.00000
59   1.81850         osd.59                  up  1.00000          1.00000
How to reproduce:
1. Check the health of your cluster:
root@storage-ceph-01z:~# ceph -s
    cluster 63b92a66-8523-4adc-9e3a-ee267e5be456
     health HEALTH_OK
     monmap e1: 3 mons at {hoster-kvm-07b=172.30.0.221:6789/0,hoster-kvm-08b=172.30.0.220:6789/0,hoster-kvm-08c=172.30.0.222:6789/0}
            election epoch 24, quorum 0,1,2 hoster-kvm-08b,hoster-kvm-07b,hoster-kvm-08c
     osdmap e1625: 60 osds: 60 up, 60 in
            flags sortbitwise
      pgmap v131106: 4096 pgs, 1 pools, 139 GB data, 38396 objects
            423 GB used, 108 TB / 109 TB avail
                4096 active+clean
2. Set the parameter mon osd down out interval to 60 in ceph.conf on the monitors for faster reproduction.
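For reference, the change can be sketched like this in ceph.conf on each monitor (placing it under [mon] is my assumption; [global] would also work):

```ini
[mon]
# Shorten how long a down OSD stays "in" before being marked out
# (default is 300 s), purely to speed up reproduction of the bug.
mon osd down out interval = 60
```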
3. Restart the monitors.
4. Stop all ceph-osd processes on one node:
root@storage-ceph-01z:~# systemctl stop ceph-osd.target
The cluster goes into a degraded state.
5. Wait until the downed OSDs are marked out of the cluster and recovery begins:
root@storage-ceph-01z:~# ceph -s
    cluster 63b92a66-8523-4adc-9e3a-ee267e5be456
     health HEALTH_WARN
            3971 pgs degraded
            4 pgs recovering
            12 pgs recovery_wait
            4050 pgs stuck unclean
            1485 pgs undersized
            recovery 71746/115188 objects degraded (62.286%)
            recovery 24513/115188 objects misplaced (21.281%)
     monmap e1: 3 mons at {hoster-kvm-07b=172.30.0.221:6789/0,hoster-kvm-08b=172.30.0.220:6789/0,hoster-kvm-08c=172.30.0.222:6789/0}
            election epoch 26, quorum 0,1,2 hoster-kvm-08b,hoster-kvm-07b,hoster-kvm-08c
     osdmap e1630: 60 osds: 40 up, 40 in; 4096 remapped pgs
            flags sortbitwise
      pgmap v131545: 4096 pgs, 1 pools, 139 GB data, 38396 objects
            286 GB used, 74197 GB / 74484 GB avail
            71746/115188 objects degraded (62.286%)
            24513/115188 objects misplaced (21.281%)
                1485 active+undersized+degraded
                1294 active+degraded+remapped
                1176 active+degraded
                  84 active
                  41 active+remapped
                   7 active+recovery_wait+degraded+remapped
                   5 active+recovery_wait+degraded
                   3 active+recovering+degraded
                   1 active+recovering+degraded+remapped
recovery io 360 MB/s, 96 objects/s
6. Then start all ceph-osd processes:
root@storage-ceph-01z:~# systemctl start ceph-osd.target
7. OSDs become UP and IN. Recovery continues.
8. But then recovery stops and a number of PGs get stuck:
root@storage-ceph-01z:~# ceph -s
    cluster 63b92a66-8523-4adc-9e3a-ee267e5be456
     health HEALTH_ERR
            53 pgs are stuck inactive for more than 300 seconds
            92 pgs peering
            53 pgs stuck inactive
     monmap e1: 3 mons at {hoster-kvm-07b=172.30.0.221:6789/0,hoster-kvm-08b=172.30.0.220:6789/0,hoster-kvm-08c=172.30.0.222:6789/0}
            election epoch 26, quorum 0,1,2 hoster-kvm-08b,hoster-kvm-07b,hoster-kvm-08c
     osdmap e1642: 60 osds: 60 up, 60 in; 38 remapped pgs
            flags sortbitwise
      pgmap v131577: 4096 pgs, 1 pools, 139 GB data, 38396 objects
            423 GB used, 108 TB / 109 TB avail
                4004 active+clean
                  54 peering
                  38 remapped+peering
ceph -w shows nothing (no new lines appear).
root@storage-ceph-01z:~# ceph osd pool stats
pool rbdkvm_sata id 2
  nothing is going on
9. I check the logs of OSDs on the other nodes and see that some OSDs can't ping the just-started OSDs.
X-node:
2016-11-18 17:31:35.886226 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.13 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:35.886247 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.30 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:35.886255 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.39 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:35.886262 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.49 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:36.886450 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.13 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
2016-11-18 17:31:36.886475 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.30 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
2016-11-18 17:31:36.886484 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.39 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
2016-11-18 17:31:36.886492 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.49 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
Y-node:
2016-11-18 17:32:01.819803 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:41.819796)
2016-11-18 17:32:02.611614 7f214b4cf700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:42.611528)
2016-11-18 17:32:02.819995 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:42.819988)
2016-11-18 17:32:03.820178 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:43.820170)
2016-11-18 17:32:04.820376 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:44.820354)
All "unreachable" OSDs are on Z-node:
root@storage-ceph-01y:/var/log/ceph# ceph osd find 20
{
"osd": 20,
"ip": "172.30.0.3:6838\/904375",
"crush_location": {
"host": "storage-ceph-01z",
"root": "default"
}
}
root@storage-ceph-01y:/var/log/ceph# ceph osd find 39
{
"osd": 39,
"ip": "172.30.0.3:6832\/904369",
"crush_location": {
"host": "storage-ceph-01z",
"root": "default"
}
}
root@storage-ceph-01y:/var/log/ceph# ceph osd find 49
{
"osd": 49,
"ip": "172.30.0.3:6830\/904336",
"crush_location": {
"host": "storage-ceph-01z",
"root": "default"
}
}
But these OSDs are running, and are UP and IN the cluster:
ceph 904336 2.0 0.3 1572552 180348 ? Ssl 17:19 0:18 /usr/bin/ceph-osd -f --cluster ceph --id 49 --setuser ceph --setgroup ceph
ceph 904350 2.0 0.3 1587096 194988 ? Ssl 17:19 0:19 /usr/bin/ceph-osd -f --cluster ceph --id 27 --setuser ceph --setgroup ceph
ceph 904369 2.0 0.3 1557940 185516 ? Ssl 17:19 0:18 /usr/bin/ceph-osd -f --cluster ceph --id 39 --setuser ceph --setgroup ceph
ceph 904375 2.2 0.4 1590648 206332 ? Ssl 17:19 0:20 /usr/bin/ceph-osd -f --cluster ceph --id 20 --setuser ceph --setgroup ceph
Here are typical examples of such OSDs' logs:
osd.20:
2016-11-18 17:19:22.661630 7f7788e2a700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6814/296626 pipe(0x55a506ae3400 sd=337 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147600).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.679902 7f7788c28700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6804/136046 pipe(0x55a506ae4800 sd=339 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147780).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:22.686109 7f7788a26700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6838/144524 pipe(0x55a507299400 sd=341 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147a80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.690474 7f7796702700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6828/301114 pipe(0x55a507298000 sd=228 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147900).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.777298 7f778831f700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6800/135772 pipe(0x55a50670e000 sd=257 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147d80).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:22.818229 7f778811d700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6834/302750 pipe(0x55a506388000 sd=299 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507292d00).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.021428 7f7788824700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6829/142225 pipe(0x55a50729a800 sd=268 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147c00).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:23.303791 7f778730f700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6840/145069 pipe(0x55a506208800 sd=344 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507292e80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.843760 7f77860fd700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6818/139814 pipe(0x55a507782000 sd=353 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507293000).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:27.867375 7f7784ce9700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6824/141442 pipe(0x55a507784800 sd=65 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507293300).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:32.890289 7f7785dfa700 0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6832/302180 pipe(0x55a507783400 sd=355 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507293180).accept connect_seq 0 vs existing 0 state wait
osd.49:
2016-11-18 17:19:22.807629 7f0773fad700 0 -- 172.30.1.3:6830/904336 >> 172.30.1.1:6832/302180 pipe(0x564c7e680800 sd=257 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae280).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.828839 7f07740ae700 0 -- 172.30.1.3:6830/904336 >> 172.30.1.2:6822/140921 pipe(0x564c7e368800 sd=252 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae100).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.902583 7f0773aa8700 0 -- 172.30.1.3:6830/904336 >> 172.30.1.1:6820/298383 pipe(0x564c7e67f400 sd=263 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae400).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.994463 7f07737a5700 0 -- 172.30.1.3:6830/904336 >> 172.30.1.1:6812/296199 pipe(0x564c7e5df400 sd=354 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae580).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:23.300791 7f0777cea700 0 -- 172.30.1.3:6830/904336 >> 172.30.1.2:6838/144524 pipe(0x564c7eb04800 sd=323 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7e919e00).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.826907 7f0772c9a700 0 -- 172.30.1.3:6830/904336 >> 172.30.1.2:6800/135772 pipe(0x564c7e53c800 sd=357 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae700).accept connect_seq 0 vs existing 0 state wait
osd.39:
2016-11-18 17:19:22.754105 7f4020f6e700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6824/299914 pipe(0x555d3dea6800 sd=291 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e1a0d00).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.848674 7f4020a69700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6831/142764 pipe(0x555d3ddbd400 sd=350 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e1a0b80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.933021 7f4020867700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6820/298383 pipe(0x555d3dd68000 sd=205 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3df87600).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:23.978576 7f401f756700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6820/140664 pipe(0x555d3eb3b400 sd=359 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e524400).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.979519 7f401f857700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6833/143475 pipe(0x555d3eb3a000 sd=358 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3dc32280).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:24.039784 7f401ea49700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6826/300579 pipe(0x555d3ece3400 sd=302 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e7d0480).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:27.867527 7f401f958700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6824/141442 pipe(0x555d3ece4800 sd=80 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e7d3d80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:32.890035 7f401f352700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6800/135772 pipe(0x555d3eb3c800 sd=360 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e524b80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:32.921769 7f401f251700 0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6832/302180 pipe(0x555d3ece2000 sd=361 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e7d0600).accept connect_seq 0 vs existing 0 state wait
10. The monitor log is a little more informative. Here we see the start of the problem:
2016-11-18 17:19:39.066910 7f8cd591f700 0 log_channel(cluster) log [INF] : pgmap v131567: 4096 pgs: 3945 active+clean, 43 remapped+peering, 10 active+recovery_wait+degraded, 86 peering, 7 activating, 5 active+recovering+degraded; 139 GB data, 424 GB used, 108 TB / 109 TB avail; 278/115188 objects degraded (0.241%); 5576 kB/s, 1 objects/s recovering
2016-11-18 17:19:40.280360 7f8cd591f700 0 log_channel(cluster) log [INF] : pgmap v131568: 4096 pgs: 4001 active+clean, 38 remapped+peering, 54 peering, 3 activating; 139 GB data, 423 GB used, 108 TB / 109 TB avail; 37566 kB/s, 8 objects/s recovering
2016-11-18 17:19:41.475118 7f8cd591f700 0 log_channel(cluster) log [INF] : pgmap v131569: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail; 37453 kB/s, 8 objects/s recovering
2016-11-18 17:19:41.976816 7f8cd2e17700 1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.20 172.30.0.3:6838/904375 from osd.21 172.30.0.1:6812/296199 is reporting failure:1
2016-11-18 17:19:41.976836 7f8cd2e17700 0 log_channel(cluster) log [DBG] : osd.20 172.30.0.3:6838/904375 reported failed by osd.21 172.30.0.1:6812/296199
2016-11-18 17:19:41.977010 7f8cd2e17700 1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.30 172.30.0.3:6820/904356 from osd.21 172.30.0.1:6812/296199 is reporting failure:1
2016-11-18 17:19:41.977026 7f8cd2e17700 0 log_channel(cluster) log [DBG] : osd.30 172.30.0.3:6820/904356 reported failed by osd.21 172.30.0.1:6812/296199
2016-11-18 17:19:42.643132 7f8cd591f700 0 log_channel(cluster) log [INF] : pgmap v131570: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail
2016-11-18 17:19:43.119723 7f8cd2e17700 1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.6 172.30.0.3:6814/904372 from osd.17 172.30.0.1:6806/294562 is reporting failure:1
2016-11-18 17:19:43.119740 7f8cd2e17700 0 log_channel(cluster) log [DBG] : osd.6 172.30.0.3:6814/904372 reported failed by osd.17 172.30.0.1:6806/294562
2016-11-18 17:19:43.800926 7f8cd591f700 0 log_channel(cluster) log [INF] : pgmap v131571: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail
<more lines reporting OSD failures>
Then we mark osd.20 down:
2016-11-18 17:20:06.139983 7f8cd3618700 1 mon.hoster-kvm-08b@0(leader).osd e1642 we have enough reporters to mark osd.20 down
2016-11-18 17:20:06.140012 7f8cd3618700 0 log_channel(cluster) log [INF] : osd.20 172.30.0.3:6838/904375 failed (2 reporters from different host after 22.187187 >= grace 20.000000)
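The condition in that log line can be sketched roughly like this. This is a simplified model, not the actual ceph-mon code; the values min_reporters=2 and grace=20.0 just mirror the "2 reporters from different host after 22.187187 >= grace 20.000000" message above.

```python
# Simplified model of the monitor's failure-report logic: an OSD is marked
# down once failure reports from at least `min_reporters` distinct hosts
# have been outstanding longer than the grace period.
def should_mark_down(first_report_times, now, grace=20.0, min_reporters=2):
    """first_report_times: reporter host -> time its first report arrived."""
    aged = [t for t in first_report_times.values() if now - t >= grace]
    return len(aged) >= min_reporters

# osd.20's case: two reporters from different hosts, oldest report ~22 s old.
reports = {"storage-ceph-01x": 0.0, "storage-ceph-01y": 1.5}
print(should_mark_down(reports, now=22.187))  # True
```

With only one reporter, or with reports younger than the grace period, the OSD is not marked down yet, which is why the failure reports accumulate for a while before the "we have enough reporters" line appears.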
But osdmap e1642 only changes 14 minutes later:
2016-11-18 17:34:20.821605 7f8cd2e17700 1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.50 172.30.0.3:6806/904318 from osd.31 172.30.0.1:6822/299270 is reporting failure:0
2016-11-18 17:34:20.821615 7f8cd2e17700 0 log_channel(cluster) log [DBG] : osd.50 172.30.0.3:6806/904318 failure report canceled by osd.31 172.30.0.1:6822/299270
2016-11-18 17:34:20.821884 7f8cd2e17700 1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.30 172.30.0.3:6820/904356 from osd.21 172.30.0.1:6812/296199 is reporting failure:0
2016-11-18 17:34:20.821894 7f8cd2e17700 0 log_channel(cluster) log [DBG] : osd.30 172.30.0.3:6820/904356 failure report canceled by osd.21 172.30.0.1:6812/296199
2016-11-18 17:34:20.855817 7f8cd2e17700 1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.6 172.30.0.3:6814/904372 from osd.17 172.30.0.1:6806/294562 is reporting failure:0
2016-11-18 17:34:20.855832 7f8cd2e17700 0 log_channel(cluster) log [DBG] : osd.6 172.30.0.3:6814/904372 failure report canceled by osd.17 172.30.0.1:6806/294562
2016-11-18 17:34:21.023832 7f8cd591f700 1 mon.hoster-kvm-08b@0(leader).osd e1643 e1643: 60 osds: 59 up, 60 in
2016-11-18 17:34:21.090024 7f8cd591f700 0 log_channel(cluster) log [INF] : osdmap e1643: 60 osds: 59 up, 60 in
2016-11-18 17:34:21.149598 7f8cd591f700 0 log_channel(cluster) log [INF] : pgmap v131755: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail
Then the recovery process proceeds normally and the cluster health returns to OK.
11. A pg query against PGs in this state shows that they are blocked by other OSDs, and so on.
I am attaching a ceph report taken at the moment of the problem.
This situation reproduces for me in 100% of cases.
SOLUTION:
If I start the OSDs serially, there is no problem: the cluster doesn't stop recovering, doesn't go into the ERR state, and after some time becomes HEALTH_OK.
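The serial start can be sketched like this (assuming the usual systemd template unit naming ceph-osd@<id>.service; the 5-second settle delay is an arbitrary choice of mine, not a tested value):

```python
# Sketch of the serial-start workaround. Assumes systemd units named
# ceph-osd@<id>.service; the 5 s delay between starts is an assumption.
import subprocess
import time

def serial_start(osd_ids, delay=5.0, dry_run=True):
    """Start each OSD unit in turn, waiting `delay` seconds between them.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = []
    for osd_id in osd_ids:
        cmd = ["systemctl", "start", f"ceph-osd@{osd_id}.service"]
        commands.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
            time.sleep(delay)
    return commands

# OSD ids of the Z-node, taken from the osd tree above.
for line in serial_start([6, 13, 20, 27, 30, 33, 36, 39, 46, 49]):
    print(line)
```

Run with dry_run=False on a storage node to actually start the units one by one instead of using ceph-osd.target.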
Possible cause:
On startup, the ceph-osd processes race each other for free ports.
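The suspicion can be illustrated with a toy model (pure illustration, not how ceph-osd actually allocates ports): each daemon scans upward from a base port and takes the first one it finds free. Started sequentially, the result is deterministic; started concurrently, two daemons can both see the same port as free, and the loser of the race has to rebind elsewhere, so peers that cached its first address end up with a stale one.

```python
# Toy model of a first-free-port scan; not the real ceph-osd code.
def pick_port(taken, base=6800):
    """Return the first port >= base that is not in `taken`."""
    port = base
    while port in taken:
        port += 1
    return port

# Sequential start: each daemon binds before the next one scans.
taken = set()
a = pick_port(taken); taken.add(a)
b = pick_port(taken); taken.add(b)
print(a, b)  # 6800 6801

# Racy start: both daemons scan before either bind becomes visible,
# so both choose the same port and one of them must rebind.
taken = set()
a_choice = pick_port(taken)
b_choice = pick_port(taken)
assert a_choice == b_choice   # the collision
taken.add(a_choice)           # daemon A wins the bind
b_final = pick_port(taken)    # daemon B retries on the next port
print(b_final)  # 6801
```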
Tell me if you need more information or if anything above is unclear.
History
#1 Updated by Nikita Shalnov over 7 years ago
Sorry, I didn't notice that I was in the rbd project. Can you move this bug to the Ceph project?
#2 Updated by Jason Dillaman over 7 years ago
- Project changed from rbd to Ceph
#3 Updated by Samuel Just over 7 years ago
- Status changed from New to Closed
It looks to me as though your cluster is just slow. You might want to post to ceph-users about ways to diagnose the bottleneck.
#4 Updated by Denis Pudov over 7 years ago
- File issue_17598.tar.xz added
Samuel Just wrote:
It looks to me as though your cluster is just slow. You might want to post to ceph-users about ways to diagnose the bottleneck.
When we tested our cluster on version 10.2.2, there were no such problems.
Starting the OSDs sequentially ensures there are no changes to the osdmap (the OSDs keep the same binding ports).
With a parallel OSD start, cluster health becomes OK only after the osdmap has changed like this:
--- osdmap_e1141.txt    2016-11-17 17:49:37.793888851 +0300
+++ osdmap_e1143.txt    2016-11-17 17:49:37.797888893 +0300
@@ -1,7 +1,7 @@
-epoch 1141
+epoch 1143
 fsid 63b92a66-8523-4adc-9e3a-ee267e5be456
 created 2016-11-16 11:37:15.441802
-modified 2016-11-17 15:42:10.457161
+modified 2016-11-17 15:56:56.769887
 flags sortbitwise
 pool 2 'rbdkvm_sata' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 339 flags hashpspool stripe_width 0 removed_snaps [1~3]
@@ -12,7 +12,7 @@
 osd.2 up in weight 1 up_from 934 up_t
 osd.3 up in weight 1 up_from 967 up_thru 1138 down_at 949 last_clean_interval [782,948) 172.30.0.2:6825/141442 172.30.1.2:6824/141442 172.30.1.2:6827/141442 172.30.0.2:6826/141442 exists,up 9b423230-bc8f-4fd0-b2be-128b6706e930
 osd.4 up in weight 1 up_from 944 up_thru 1138 down_at 917 last_clean_interval [797,916) 172.30.0.1:6836/303291 172.30.1.1:6836/303291 172.30.1.1:6837/303291 172.30.0.1:6837/303291 exists,up 52ad87d4-5869-48fe-a6fe-780358865862
 osd.5 up in weight 1 up_from 973 up_thru 1137 down_at 949 last_clean_interval [780,948) 172.30.0.2:6835/144524 172.30.1.2:6838/144524 172.30.1.2:6839/144524 172.30.0.2:6836/144524 exists,up 867a13e4-d7a0-46b9-8501-8a34732591c5
-osd.6 up in weight 1 up_from 1131 up_thru 1138 down_at 1128 last_clean_interval [1107,1127) 172.30.0.3:6825/277571 172.30.1.3:6828/277571 172.30.1.3:6829/277571 172.30.0.3:6826/277571 exists,up 50c996d4-55a5-421e-ae25-6be972075c54
+osd.6 up in weight 1 up_from 1131 up_thru 1142 down_at 1128 last_clean_interval [1107,1127) 172.30.0.3:6825/277571 172.30.1.3:6828/277571 172.30.1.3:6829/277571 172.30.0.3:6826/277571 exists,up 50c996d4-55a5-421e-ae25-6be972075c54
...