Bug #13828

closed

OSD: wrong broken osd state due to disconnection of node backend network

Added by xie xingguo over 8 years ago. Updated over 8 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our cluster is made up of 4 nodes with 10 OSDs each. The topology of the
cluster is shown below:

[root@ceph242 ceph]# ceph osd tree
# id    weight  type name       up/down reweight
-1      66.43   root default
-4      10.92           host ceph0
6       0.91                    osd.6   up      1
7       0.91                    osd.7   up      1
8       0.91                    osd.8   up      1
9       0.91                    osd.9   up      1
10      0.91                    osd.10  up      1
19      0.91                    osd.19  up      1
20      0.91                    osd.20  up      1
21      0.91                    osd.21  up      1
22      0.91                    osd.22  up      1
23      0.91                    osd.23  up      1
24      1.82                    osd.24  up      1
-5      19.11           host ceph4
46      1.82                    osd.46  up      1
48      1.82                    osd.48  up      1
44      1.82                    osd.44  up      1
43      1.82                    osd.43  up      1
45      1.82                    osd.45  up      1
3       0.91                    osd.3   up      1
49      1.82                    osd.49  up      1
4       1.82                    osd.4   up      1
0       1.82                    osd.0   up      1
2       1.82                    osd.2   up      1
1       1.82                    osd.1   up      1
-6      18.2            host ceph243
28      1.82                    osd.28  up      1
15      1.82                    osd.15  up      1
17      1.82                    osd.17  up      1
11      1.82                    osd.11  up      1
12      1.82                    osd.12  up      1
16      1.82                    osd.16  up      1
14      1.82                    osd.14  up      1
13      1.82                    osd.13  up      1
18      1.82                    osd.18  up      1
32      1.82                    osd.32  up      1
-2      18.2            host ceph242
66      1.82                    osd.66  up      1
60      1.82                    osd.60  up      1
31      1.82                    osd.31  up      1
30      1.82                    osd.30  up      1
47      1.82                    osd.47  up      1
29      1.82                    osd.29  up      1
25      1.82                    osd.25  up      1
26      1.82                    osd.26  up      1
5       1.82                    osd.5   up      1
27      1.82                    osd.27  up      1 

When I isolated one of the nodes (namely ceph242) from the rest of the cluster by cutting off
its backend network connection, the cluster eventually settled into a stable state after some
transient jitter. However, the result was somewhat surprising and probably problematic,
as you can see below:
[root@ceph0 minion]# ceph  osd tree
# id    weight  type name       up/down reweight
-1      66.43   root default
-4      10.92           host ceph0
6       0.91                    osd.6   down    0
7       0.91                    osd.7   down    0
8       0.91                    osd.8   down    0
9       0.91                    osd.9   down    0
10      0.91                    osd.10  down    0
19      0.91                    osd.19  down    0
20      0.91                    osd.20  down    0
21      0.91                    osd.21  down    0
22      0.91                    osd.22  down    0
23      0.91                    osd.23  down    0
24      1.82                    osd.24  down    0
-5      19.11           host ceph4
46      1.82                    osd.46  down    0
48      1.82                    osd.48  down    0
44      1.82                    osd.44  down    0
43      1.82                    osd.43  down    0
45      1.82                    osd.45  down    0
3       0.91                    osd.3   down    0
49      1.82                    osd.49  down    1
4       1.82                    osd.4   down    0
0       1.82                    osd.0   down    0
2       1.82                    osd.2   down    0
1       1.82                    osd.1   down    0
-6      18.2            host ceph243
28      1.82                    osd.28  down    1
15      1.82                    osd.15  down    0
17      1.82                    osd.17  down    0
11      1.82                    osd.11  down    0
12      1.82                    osd.12  down    0
16      1.82                    osd.16  down    0
14      1.82                    osd.14  down    0
13      1.82                    osd.13  down    0
18      1.82                    osd.18  down    0
32      1.82                    osd.32  down    0
-2      18.2            host ceph242
66      1.82                    osd.66  up      1
60      1.82                    osd.60  up      1
31      1.82                    osd.31  up      1
30      1.82                    osd.30  up      1
47      1.82                    osd.47  up      1
29      1.82                    osd.29  up      1
25      1.82                    osd.25  up      1
26      1.82                    osd.26  up      1
5       1.82                    osd.5   up      1
27      1.82                    osd.27  up      1

All the OSDs located on the isolated node (ceph242) survived in the end, while
the rest unexpectedly went down instead, leaving the whole cluster completely inaccessible.

I suspect the above problem may be caused by flawed logic in OSD::_is_healthy(),
but I am not certain of this, since the problem is hard to reproduce: a handful of
attempts to trigger it again in our environment have all failed.
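To make the suspicion concrete, here is a minimal C++ sketch of the kind of heartbeat-based health check being questioned. The class and member names, the grace period, and the majority rule are illustrative assumptions for this report only, not the actual implementation of OSD::_is_healthy():

// Illustrative sketch only -- NOT the actual Ceph OSD::_is_healthy() code.
// It models the kind of check suspected above: an OSD judges its own health
// by how many of its heartbeat peers (reached over the cluster/backend
// network) have replied within a grace period.

#include <chrono>
#include <cstddef>
#include <map>

using Clock = std::chrono::steady_clock;

struct PeerHeartbeat {
  Clock::time_point last_rx;  // last heartbeat reply received from this peer
};

class OsdHealthModel {
public:
  explicit OsdHealthModel(std::chrono::seconds grace) : grace_(grace) {}

  // Record a heartbeat reply from a peer OSD.
  void note_heartbeat(int peer_osd) {
    peers_[peer_osd].last_rx = Clock::now();
  }

  // Hypothetical rule: "healthy" if at least half of the known heartbeat
  // peers have replied within the grace period.
  bool is_healthy() const {
    if (peers_.empty())
      return true;  // nothing to compare against yet
    const auto now = Clock::now();
    std::size_t alive = 0;
    for (const auto& entry : peers_) {
      if (now - entry.second.last_rx <= grace_)
        ++alive;
    }
    return alive * 2 >= peers_.size();
  }

private:
  std::chrono::seconds grace_;
  std::map<int, PeerHeartbeat> peers_;
};

int main() {
  // An OSD on the isolated node still hears from the peers sharing its local
  // backend segment, so it can keep looking "healthy" to itself even though
  // the rest of the cluster no longer hears from it at all.
  OsdHealthModel osd(std::chrono::seconds(20));
  osd.note_heartbeat(5);
  osd.note_heartbeat(25);
  return osd.is_healthy() ? 0 : 1;
}

Under a simplified model like this, cutting only the backend link makes both sides of the split see missing heartbeats; which side ends up marked down then depends on whose failure reports the monitors act on, which would be consistent with the inverted outcome shown above.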

Actions #1

Updated by David Zafman over 8 years ago

Where are the monitor(s) running? How many do you have configured? If there was a single ceph-mon running on ceph242 you might see this behavior.

Actions #2

Updated by xie xingguo over 8 years ago

David Zafman wrote:

Where are the monitor(s) running? How many do you have configured? If there was a single ceph-mon running on ceph242 you might see this behavior.

@David Zafman
We have 3 monitors, none of which is co-located with ceph242, as you can see below:
[root@ceph242 ~]# ceph -s
cluster bd9e23c5-da96-4908-a335-455445480e54
health HEALTH_ERR 355 pgs backfill; 45 pgs backfilling; 899 pgs degraded; 24 pgs inconsistent; 45 pgs recovering; 378 pgs recovery_wait; 899 pgs stuck degraded; 1747 pgs stuck unclean; 474 pgs stuck undersized; 474 pgs undersized; recovery 81926/1991461 objects degraded (4.114%); 335395/1991461 objects misplaced (16.842%); 281 scrub errors; pool zpool20151118 has too few pgs
monmap e26: 3 mons at {ceph240=100.100.100.240:6789/0,ceph243=100.100.100.243:6789/0,ceph244=100.100.100.244:6789/0}, election epoch 5466, quorum 0,1,2 ceph240,ceph243,ceph244

Actions #3

Updated by Kefu Chai over 8 years ago

  • Description updated (diff)
Actions #4

Updated by xie xingguo over 8 years ago

  • Status changed from New to Can't reproduce