Bug #17958

Simultaneous start of ceph-osd processes causes the cluster to get stuck in an error state

Added by Nikita Shalnov over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.
I have a Ceph cluster consisting of 3 storage nodes (20 OSDs each) and 3 monitors.


root@storage-ceph-01z:~# lsb_release -a
No LSB modules are available.
Distributor ID:    Debian
Description:    Debian GNU/Linux 8.6 (jessie)
Release:    8.6
Codename:    jessie



root@storage-ceph-01z:~# ceph -v
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

I have noticed that starting multiple OSDs simultaneously does not work properly. Many PGs get stuck in the peering and/or remapped+peering state for 15-20 minutes. Nothing happens: there is no recovery and no activity in the cluster. Some OSDs cannot see the just-launched OSDs and their heartbeats fail, yet the cluster status shows that 60/60 OSDs are up and in. Some time later the cluster comes alive, recovery begins and the health becomes OK.

My osd tree looks like:


root@storage-ceph-01z:~# ceph osd tree
ID WEIGHT    TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 109.10980 root default                                                
-2  36.36993     host storage-ceph-01y                                   
 0   1.81850         osd.0                  up  1.00000          1.00000 
 3   1.81850         osd.3                  up  1.00000          1.00000 
 5   1.81850         osd.5                  up  1.00000          1.00000 
 8   1.81850         osd.8                  up  1.00000          1.00000 
 9   1.81850         osd.9                  up  1.00000          1.00000 
11   1.81850         osd.11                 up  1.00000          1.00000 
14   1.81850         osd.14                 up  1.00000          1.00000 
16   1.81850         osd.16                 up  1.00000          1.00000 
19   1.81850         osd.19                 up  1.00000          1.00000 
22   1.81850         osd.22                 up  1.00000          1.00000 
24   1.81850         osd.24                 up  1.00000          1.00000 
26   1.81850         osd.26                 up  1.00000          1.00000 
29   1.81850         osd.29                 up  1.00000          1.00000 
32   1.81850         osd.32                 up  1.00000          1.00000 
35   1.81850         osd.35                 up  1.00000          1.00000 
38   1.81850         osd.38                 up  1.00000          1.00000 
41   1.81850         osd.41                 up  1.00000          1.00000 
43   1.81850         osd.43                 up  1.00000          1.00000 
45   1.81850         osd.45                 up  1.00000          1.00000 
48   1.81850         osd.48                 up  1.00000          1.00000 
-3  36.36993     host storage-ceph-01x                                   
 1   1.81850         osd.1                  up  1.00000          1.00000 
 2   1.81850         osd.2                  up  1.00000          1.00000 
 4   1.81850         osd.4                  up  1.00000          1.00000 
 7   1.81850         osd.7                  up  1.00000          1.00000 
10   1.81850         osd.10                 up  1.00000          1.00000 
12   1.81850         osd.12                 up  1.00000          1.00000 
15   1.81850         osd.15                 up  1.00000          1.00000 
17   1.81850         osd.17                 up  1.00000          1.00000 
18   1.81850         osd.18                 up  1.00000          1.00000 
21   1.81850         osd.21                 up  1.00000          1.00000 
23   1.81850         osd.23                 up  1.00000          1.00000 
25   1.81850         osd.25                 up  1.00000          1.00000 
28   1.81850         osd.28                 up  1.00000          1.00000 
31   1.81850         osd.31                 up  1.00000          1.00000 
34   1.81850         osd.34                 up  1.00000          1.00000 
37   1.81850         osd.37                 up  1.00000          1.00000 
40   1.81850         osd.40                 up  1.00000          1.00000 
42   1.81850         osd.42                 up  1.00000          1.00000 
44   1.81850         osd.44                 up  1.00000          1.00000 
47   1.81850         osd.47                 up  1.00000          1.00000 
-4  36.36993     host storage-ceph-01z                                   
 6   1.81850         osd.6                  up  1.00000          1.00000 
13   1.81850         osd.13                 up  1.00000          1.00000 
20   1.81850         osd.20                 up  1.00000          1.00000 
27   1.81850         osd.27                 up  1.00000          1.00000 
30   1.81850         osd.30                 up  1.00000          1.00000 
33   1.81850         osd.33                 up  1.00000          1.00000 
36   1.81850         osd.36                 up  1.00000          1.00000 
39   1.81850         osd.39                 up  1.00000          1.00000 
46   1.81850         osd.46                 up  1.00000          1.00000 
49   1.81850         osd.49                 up  1.00000          1.00000 
50   1.81850         osd.50                 up  1.00000          1.00000 
51   1.81850         osd.51                 up  1.00000          1.00000 
52   1.81850         osd.52                 up  1.00000          1.00000 
53   1.81850         osd.53                 up  1.00000          1.00000 
54   1.81850         osd.54                 up  1.00000          1.00000 
55   1.81850         osd.55                 up  1.00000          1.00000 
56   1.81850         osd.56                 up  1.00000          1.00000 
57   1.81850         osd.57                 up  1.00000          1.00000 
58   1.81850         osd.58                 up  1.00000          1.00000 
59   1.81850         osd.59                 up  1.00000          1.00000 

How to reproduce:
1. Check the health of your cluster:


root@storage-ceph-01z:~# ceph -s
    cluster 63b92a66-8523-4adc-9e3a-ee267e5be456
     health HEALTH_OK
     monmap e1: 3 mons at {hoster-kvm-07b=172.30.0.221:6789/0,hoster-kvm-08b=172.30.0.220:6789/0,hoster-kvm-08c=172.30.0.222:6789/0}
            election epoch 24, quorum 0,1,2 hoster-kvm-08b,hoster-kvm-07b,hoster-kvm-08c
     osdmap e1625: 60 osds: 60 up, 60 in
            flags sortbitwise
      pgmap v131106: 4096 pgs, 1 pools, 139 GB data, 38396 objects
            423 GB used, 108 TB / 109 TB avail
                4096 active+clean


2. Set the parameter mon osd down out interval to 60 in ceph.conf on the monitors for faster reproduction (a hedged example follows step 3).
3. Restart the monitors.
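
A minimal sketch of steps 2-3, assuming the monitors run under systemd; the 60-second value is the one used in this report, and the injectargs line is an alternative that applies the setting at runtime instead of a restart:

# /etc/ceph/ceph.conf on each monitor host
[mon]
    mon osd down out interval = 60

root@hoster-kvm-08b:~# systemctl restart ceph-mon.target
# or, applied at runtime:
root@hoster-kvm-08b:~# ceph tell mon.* injectargs '--mon_osd_down_out_interval 60'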
4. Stop all ceph-osd processes on one node:

root@storage-ceph-01z:~# systemctl stop ceph-osd.target


The cluster goes into a degraded state.
5. Wait until the downed OSDs are marked out of the cluster and recovery begins.

root@storage-ceph-01z:~# ceph -s
    cluster 63b92a66-8523-4adc-9e3a-ee267e5be456
     health HEALTH_WARN
            3971 pgs degraded
            4 pgs recovering
            12 pgs recovery_wait
            4050 pgs stuck unclean
            1485 pgs undersized
            recovery 71746/115188 objects degraded (62.286%)
            recovery 24513/115188 objects misplaced (21.281%)
     monmap e1: 3 mons at {hoster-kvm-07b=172.30.0.221:6789/0,hoster-kvm-08b=172.30.0.220:6789/0,hoster-kvm-08c=172.30.0.222:6789/0}
            election epoch 26, quorum 0,1,2 hoster-kvm-08b,hoster-kvm-07b,hoster-kvm-08c
     osdmap e1630: 60 osds: 40 up, 40 in; 4096 remapped pgs
            flags sortbitwise
      pgmap v131545: 4096 pgs, 1 pools, 139 GB data, 38396 objects
            286 GB used, 74197 GB / 74484 GB avail
            71746/115188 objects degraded (62.286%)
            24513/115188 objects misplaced (21.281%)
                1485 active+undersized+degraded
                1294 active+degraded+remapped
                1176 active+degraded
                  84 active
                  41 active+remapped
                   7 active+recovery_wait+degraded+remapped
                   5 active+recovery_wait+degraded
                   3 active+recovering+degraded
                   1 active+recovering+degraded+remapped
recovery io 360 MB/s, 96 objects/s


6. Then start all ceph-osd processes:

root@storage-ceph-01z:~# systemctl start ceph-osd.target


7. The OSDs come back UP and IN, and recovery continues.
8. But then recovery stops and a number of PGs remain stuck:

root@storage-ceph-01z:~# ceph -s
    cluster 63b92a66-8523-4adc-9e3a-ee267e5be456
     health HEALTH_ERR
            53 pgs are stuck inactive for more than 300 seconds
            92 pgs peering
            53 pgs stuck inactive
     monmap e1: 3 mons at {hoster-kvm-07b=172.30.0.221:6789/0,hoster-kvm-08b=172.30.0.220:6789/0,hoster-kvm-08c=172.30.0.222:6789/0}
            election epoch 26, quorum 0,1,2 hoster-kvm-08b,hoster-kvm-07b,hoster-kvm-08c
     osdmap e1642: 60 osds: 60 up, 60 in; 38 remapped pgs
            flags sortbitwise
      pgmap v131577: 4096 pgs, 1 pools, 139 GB data, 38396 objects
            423 GB used, 108 TB / 109 TB avail
                4004 active+clean
                  54 peering
                  38 remapped+peering


ceph -w shows nothing (no new lines appear).

root@storage-ceph-01z:~# ceph osd pool stats
pool rbdkvm_sata id 2
  nothing is going on
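
A hedged way, not from the original report, to list exactly which PGs are stuck, using standard Ceph commands:

root@storage-ceph-01z:~# ceph health detail | grep -E 'peering|stuck'
root@storage-ceph-01z:~# ceph pg dump_stuck inactive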


9. I check the logs of OSDs on the other nodes and see that some OSDs cannot ping the just-started OSDs.

X-node:

2016-11-18 17:31:35.886226 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.13 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:35.886247 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.30 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:35.886255 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.39 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:35.886262 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.49 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:15.886221)
2016-11-18 17:31:36.886450 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.13 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
2016-11-18 17:31:36.886475 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.30 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
2016-11-18 17:31:36.886484 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.39 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)
2016-11-18 17:31:36.886492 7f7972f1b700 -1 osd.34 1642 heartbeat_check: no reply from osd.49 ever on either front or back, first ping sent 2016-11-18 17:19:25.080122 (cutoff 2016-11-18 17:31:16.886445)

Y-node:

2016-11-18 17:32:01.819803 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:41.819796)
2016-11-18 17:32:02.611614 7f214b4cf700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:42.611528)
2016-11-18 17:32:02.819995 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:42.819988)
2016-11-18 17:32:03.820178 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:43.820170)
2016-11-18 17:32:04.820376 7f216f1c7700 -1 osd.22 1642 heartbeat_check: no reply from osd.20 ever on either front or back, first ping sent 2016-11-18 17:19:23.111839 (cutoff 2016-11-18 17:31:44.820354)

All "unreachable" OSDs are on Z-node:


root@storage-ceph-01y:/var/log/ceph# ceph osd find 20
{
    "osd": 20,
    "ip": "172.30.0.3:6838\/904375",
    "crush_location": {
        "host": "storage-ceph-01z",
        "root": "default" 
    }
}
root@storage-ceph-01y:/var/log/ceph# ceph osd find 39
{
    "osd": 39,
    "ip": "172.30.0.3:6832\/904369",
    "crush_location": {
        "host": "storage-ceph-01z",
        "root": "default" 
    }
}
root@storage-ceph-01y:/var/log/ceph# ceph osd find 49
{
    "osd": 49,
    "ip": "172.30.0.3:6830\/904336",
    "crush_location": {
        "host": "storage-ceph-01z",
        "root": "default" 
    }
}

10. But these OSDs are running, and they are UP and IN the cluster.


ceph      904336  2.0  0.3 1572552 180348 ?      Ssl  17:19   0:18 /usr/bin/ceph-osd -f --cluster ceph --id 49 --setuser ceph --setgroup ceph
ceph      904350  2.0  0.3 1587096 194988 ?      Ssl  17:19   0:19 /usr/bin/ceph-osd -f --cluster ceph --id 27 --setuser ceph --setgroup ceph
ceph      904369  2.0  0.3 1557940 185516 ?      Ssl  17:19   0:18 /usr/bin/ceph-osd -f --cluster ceph --id 39 --setuser ceph --setgroup ceph
ceph      904375  2.2  0.4 1590648 206332 ?      Ssl  17:19   0:20 /usr/bin/ceph-osd -f --cluster ceph --id 20 --setuser ceph --setgroup ceph
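
As a hedged cross-check (not part of the original report), one can compare the ports a running OSD process is actually bound to with the address the monitors have recorded for it; PID 904375 is osd.20 from the ps output above:

root@storage-ceph-01z:~# ss -tlnp | grep 904375
root@storage-ceph-01z:~# ceph osd find 20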

Here are typical examples of such OSDs' logs:


osd.20:

2016-11-18 17:19:22.661630 7f7788e2a700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6814/296626 pipe(0x55a506ae3400 sd=337 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147600).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.679902 7f7788c28700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6804/136046 pipe(0x55a506ae4800 sd=339 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147780).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:22.686109 7f7788a26700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6838/144524 pipe(0x55a507299400 sd=341 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147a80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.690474 7f7796702700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6828/301114 pipe(0x55a507298000 sd=228 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147900).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.777298 7f778831f700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6800/135772 pipe(0x55a50670e000 sd=257 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147d80).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:22.818229 7f778811d700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6834/302750 pipe(0x55a506388000 sd=299 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507292d00).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.021428 7f7788824700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6829/142225 pipe(0x55a50729a800 sd=268 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507147c00).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:23.303791 7f778730f700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6840/145069 pipe(0x55a506208800 sd=344 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507292e80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.843760 7f77860fd700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6818/139814 pipe(0x55a507782000 sd=353 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507293000).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:27.867375 7f7784ce9700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.2:6824/141442 pipe(0x55a507784800 sd=65 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507293300).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:32.890289 7f7785dfa700  0 -- 172.30.1.3:6838/904375 >> 172.30.1.1:6832/302180 pipe(0x55a507783400 sd=355 :6838 s=0 pgs=0 cs=0 l=0 c=0x55a507293180).accept connect_seq 0 vs existing 0 state wait

osd.49:
2016-11-18 17:19:22.807629 7f0773fad700  0 -- 172.30.1.3:6830/904336 >> 172.30.1.1:6832/302180 pipe(0x564c7e680800 sd=257 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae280).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.828839 7f07740ae700  0 -- 172.30.1.3:6830/904336 >> 172.30.1.2:6822/140921 pipe(0x564c7e368800 sd=252 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae100).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.902583 7f0773aa8700  0 -- 172.30.1.3:6830/904336 >> 172.30.1.1:6820/298383 pipe(0x564c7e67f400 sd=263 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae400).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.994463 7f07737a5700  0 -- 172.30.1.3:6830/904336 >> 172.30.1.1:6812/296199 pipe(0x564c7e5df400 sd=354 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae580).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:23.300791 7f0777cea700  0 -- 172.30.1.3:6830/904336 >> 172.30.1.2:6838/144524 pipe(0x564c7eb04800 sd=323 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7e919e00).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.826907 7f0772c9a700  0 -- 172.30.1.3:6830/904336 >> 172.30.1.2:6800/135772 pipe(0x564c7e53c800 sd=357 :6830 s=0 pgs=0 cs=0 l=0 c=0x564c7f3ae700).accept connect_seq 0 vs existing 0 state wait

osd.39:
2016-11-18 17:19:22.754105 7f4020f6e700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6824/299914 pipe(0x555d3dea6800 sd=291 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e1a0d00).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.848674 7f4020a69700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6831/142764 pipe(0x555d3ddbd400 sd=350 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e1a0b80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:22.933021 7f4020867700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6820/298383 pipe(0x555d3dd68000 sd=205 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3df87600).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:23.978576 7f401f756700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6820/140664 pipe(0x555d3eb3b400 sd=359 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e524400).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:23.979519 7f401f857700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6833/143475 pipe(0x555d3eb3a000 sd=358 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3dc32280).accept connect_seq 0 vs existing 0 state connecting
2016-11-18 17:19:24.039784 7f401ea49700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6826/300579 pipe(0x555d3ece3400 sd=302 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e7d0480).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:27.867527 7f401f958700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6824/141442 pipe(0x555d3ece4800 sd=80 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e7d3d80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:32.890035 7f401f352700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.2:6800/135772 pipe(0x555d3eb3c800 sd=360 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e524b80).accept connect_seq 0 vs existing 0 state wait
2016-11-18 17:19:32.921769 7f401f251700  0 -- 172.30.1.3:6832/904369 >> 172.30.1.1:6832/302180 pipe(0x555d3ece2000 sd=361 :6832 s=0 pgs=0 cs=0 l=0 c=0x555d3e7d0600).accept connect_seq 0 vs existing 0 state wait

11. The monitor log is a little more informative. Here we see the start of the problem:


2016-11-18 17:19:39.066910 7f8cd591f700  0 log_channel(cluster) log [INF] : pgmap v131567: 4096 pgs: 3945 active+clean, 43 remapped+peering, 10 active+recovery_wait+degraded, 86 peering, 7 activating, 5 active+recovering+degraded; 139 GB data, 424 GB used, 108 TB / 109 TB avail; 278/115188 objects degraded (0.241%); 5576 kB/s, 1 objects/s recovering
2016-11-18 17:19:40.280360 7f8cd591f700  0 log_channel(cluster) log [INF] : pgmap v131568: 4096 pgs: 4001 active+clean, 38 remapped+peering, 54 peering, 3 activating; 139 GB data, 423 GB used, 108 TB / 109 TB avail; 37566 kB/s, 8 objects/s recovering
2016-11-18 17:19:41.475118 7f8cd591f700  0 log_channel(cluster) log [INF] : pgmap v131569: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail; 37453 kB/s, 8 objects/s recovering
2016-11-18 17:19:41.976816 7f8cd2e17700  1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.20 172.30.0.3:6838/904375 from osd.21 172.30.0.1:6812/296199 is reporting failure:1
2016-11-18 17:19:41.976836 7f8cd2e17700  0 log_channel(cluster) log [DBG] : osd.20 172.30.0.3:6838/904375 reported failed by osd.21 172.30.0.1:6812/296199
2016-11-18 17:19:41.977010 7f8cd2e17700  1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.30 172.30.0.3:6820/904356 from osd.21 172.30.0.1:6812/296199 is reporting failure:1
2016-11-18 17:19:41.977026 7f8cd2e17700  0 log_channel(cluster) log [DBG] : osd.30 172.30.0.3:6820/904356 reported failed by osd.21 172.30.0.1:6812/296199
2016-11-18 17:19:42.643132 7f8cd591f700  0 log_channel(cluster) log [INF] : pgmap v131570: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail
2016-11-18 17:19:43.119723 7f8cd2e17700  1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.6 172.30.0.3:6814/904372 from osd.17 172.30.0.1:6806/294562 is reporting failure:1
2016-11-18 17:19:43.119740 7f8cd2e17700  0 log_channel(cluster) log [DBG] : osd.6 172.30.0.3:6814/904372 reported failed by osd.17 172.30.0.1:6806/294562
2016-11-18 17:19:43.800926 7f8cd591f700  0 log_channel(cluster) log [INF] : pgmap v131571: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail
<more lines reporting failures of other OSDs>
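
A hedged way, not in the original report, to pull these failure reports out of the leader monitor's log, assuming the default log location and naming:

root@hoster-kvm-08b:~# grep -E 'reported failed|failure report canceled' /var/log/ceph/ceph-mon.hoster-kvm-08b.log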


Then we mark osd.20 down:


2016-11-18 17:20:06.139983 7f8cd3618700  1 mon.hoster-kvm-08b@0(leader).osd e1642  we have enough reporters to mark osd.20 down
2016-11-18 17:20:06.140012 7f8cd3618700  0 log_channel(cluster) log [INF] : osd.20 172.30.0.3:6838/904375 failed (2 reporters from different host after 22.187187 >= grace 20.000000)

But the osdmap (still e1642) only changes about 14 minutes later:


2016-11-18 17:34:20.821605 7f8cd2e17700  1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.50 172.30.0.3:6806/904318 from osd.31 172.30.0.1:6822/299270 is reporting failure:0
2016-11-18 17:34:20.821615 7f8cd2e17700  0 log_channel(cluster) log [DBG] : osd.50 172.30.0.3:6806/904318 failure report canceled by osd.31 172.30.0.1:6822/299270
2016-11-18 17:34:20.821884 7f8cd2e17700  1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.30 172.30.0.3:6820/904356 from osd.21 172.30.0.1:6812/296199 is reporting failure:0
2016-11-18 17:34:20.821894 7f8cd2e17700  0 log_channel(cluster) log [DBG] : osd.30 172.30.0.3:6820/904356 failure report canceled by osd.21 172.30.0.1:6812/296199
2016-11-18 17:34:20.855817 7f8cd2e17700  1 mon.hoster-kvm-08b@0(leader).osd e1642 prepare_failure osd.6 172.30.0.3:6814/904372 from osd.17 172.30.0.1:6806/294562 is reporting failure:0
2016-11-18 17:34:20.855832 7f8cd2e17700  0 log_channel(cluster) log [DBG] : osd.6 172.30.0.3:6814/904372 failure report canceled by osd.17 172.30.0.1:6806/294562
2016-11-18 17:34:21.023832 7f8cd591f700  1 mon.hoster-kvm-08b@0(leader).osd e1643 e1643: 60 osds: 59 up, 60 in
2016-11-18 17:34:21.090024 7f8cd591f700  0 log_channel(cluster) log [INF] : osdmap e1643: 60 osds: 59 up, 60 in
2016-11-18 17:34:21.149598 7f8cd591f700  0 log_channel(cluster) log [INF] : pgmap v131755: 4096 pgs: 4004 active+clean, 38 remapped+peering, 54 peering; 139 GB data, 423 GB used, 108 TB / 109 TB avail

Then the recovery process proceeds normally and the cluster health returns to OK.

12. A pg query on PGs in this state shows that they are blocked by other OSDs, and so on (a hedged example follows).
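
A hedged example of such a query; the PG ID 2.1a3 is hypothetical (chosen only because the pool ID here is 2), and the query output typically contains a blocked_by list naming the OSDs the PG is waiting on:

root@storage-ceph-01z:~# ceph pg 2.1a3 query | grep -A3 blocked_by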

I am attaching a ceph report taken at the moment of the problem.

This situation reproduces for me in 100% of cases.

SOLUTION:
If I start the OSDs one after another, there is no problem: the cluster does not stop recovering, does not go into the ERR state, and after some time its health becomes OK (a sketch of such a sequential start follows).
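
A minimal sketch of such a sequential start on one node, assuming the standard ceph-osd@<id> systemd units; the OSD IDs are the ones on storage-ceph-01z in this report, and the 10-second delay is an arbitrary example value:

root@storage-ceph-01z:~# for id in 6 13 20 27 30 33 36 39 46 49 50 51 52 53 54 55 56 57 58 59; do
>     systemctl start ceph-osd@${id}
>     sleep 10
> done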

Possible cause:
On startup, the ceph-osd processes "race" each other for free ports.

Tell me if you need more information or if anything above is unclear.

report.ceph.bz2 (379 KB) Nikita Shalnov, 11/18/2016 03:31 PM

issue_17598.tar.xz - PGs, ceph statuses, osdmaps (13.8 KB) Denis Pudov, 11/22/2016 10:15 AM

History

#1 Updated by Nikita Shalnov over 7 years ago

Sorry, I didn't notice that I was in the rbd project. Can you move this bug to the Ceph project?

#2 Updated by Jason Dillaman over 7 years ago

  • Project changed from rbd to Ceph

#3 Updated by Samuel Just over 7 years ago

  • Status changed from New to Closed

It looks to me as though your cluster is just slow. You might want to post to ceph-users about ways to diagnose the bottleneck.

#4 Updated by Denis Pudov over 7 years ago

Samuel Just wrote:

It looks to me as though your cluster is just slow. You might want to post to ceph-users about ways to diagnose the bottleneck.

When we tested our cluster on version 10.2.2, there were no such problems.

A sequential start of the OSDs ensures that there are no changes to the osdmap (the OSDs keep the same bound ports).

With a parallel start of the OSDs, the cluster health only becomes OK after the osdmap has changed, for example:

--- osdmap_e1141.txt    2016-11-17 17:49:37.793888851 +0300
+++ osdmap_e1143.txt    2016-11-17 17:49:37.797888893 +0300
@@ -1,7 +1,7 @@
-epoch 1141
+epoch 1143
 fsid 63b92a66-8523-4adc-9e3a-ee267e5be456
 created 2016-11-16 11:37:15.441802
-modified 2016-11-17 15:42:10.457161
+modified 2016-11-17 15:56:56.769887
 flags sortbitwise
 pool 2 'rbdkvm_sata' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 339 flags hashpspool stripe_width 0
        removed_snaps [1~3]
@@ -12,7 +12,7 @@ osd.2 up   in  weight 1 up_from 934 up_t
 osd.3 up   in  weight 1 up_from 967 up_thru 1138 down_at 949 last_clean_interval [782,948) 172.30.0.2:6825/141442 172.30.1.2:6824/141442 172.30.1.2:6827/141442 172.30.0.2:6826/141442 exists,up 9b423230-bc8f-4fd0-b2be-128b6706e930
 osd.4 up   in  weight 1 up_from 944 up_thru 1138 down_at 917 last_clean_interval [797,916) 172.30.0.1:6836/303291 172.30.1.1:6836/303291 172.30.1.1:6837/303291 172.30.0.1:6837/303291 exists,up 52ad87d4-5869-48fe-a6fe-780358865862
 osd.5 up   in  weight 1 up_from 973 up_thru 1137 down_at 949 last_clean_interval [780,948) 172.30.0.2:6835/144524 172.30.1.2:6838/144524 172.30.1.2:6839/144524 172.30.0.2:6836/144524 exists,up 867a13e4-d7a0-46b9-8501-8a34732591c5
-osd.6 up   in  weight 1 up_from 1131 up_thru 1138 down_at 1128 last_clean_interval [1107,1127) 172.30.0.3:6825/277571 172.30.1.3:6828/277571 172.30.1.3:6829/277571 172.30.0.3:6826/277571 exists,up 50c996d4-55a5-421e-ae25-6be972075c54
+osd.6 up   in  weight 1 up_from 1131 up_thru 1142 down_at 1128 last_clean_interval [1107,1127) 172.30.0.3:6825/277571 172.30.1.3:6828/277571 172.30.1.3:6829/277571 172.30.0.3:6826/277571 exists,up 50c996d4-55a5-421e-ae25-6be972075c54
...
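
For reference, a hedged sketch of how osdmap dumps like these can be extracted and compared; the epoch numbers are the ones from the diff above:

root@storage-ceph-01z:~# ceph osd getmap 1141 -o osdmap_e1141
root@storage-ceph-01z:~# ceph osd getmap 1143 -o osdmap_e1143
root@storage-ceph-01z:~# osdmaptool --print osdmap_e1141 > osdmap_e1141.txt
root@storage-ceph-01z:~# osdmaptool --print osdmap_e1143 > osdmap_e1143.txt
root@storage-ceph-01z:~# diff -u osdmap_e1141.txt osdmap_e1143.txt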
