Bug #19700
OSD remained up despite cluster network being inactive?
Status: Closed
Description
We have a Ceph cluster with a segregated cluster network for the OSDs to communicate with each other, and a "public" network for clients to talk to the cluster. The monitors are on the "public" network, and the OSDs talk to the monitors through that interface. We had an issue where the private network interface went down on one of our OSD nodes, but everything behaved as if things were normal, despite the fact that the OSD node couldn't talk to any other OSDs. The monitor reported the cluster as healthy, and 'ceph osd tree' showed the downed node as up.
We have 12 OSDs per node and 6 nodes in the cluster.
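For context, this kind of split is configured with the public/cluster network options in ceph.conf. A minimal sketch (the subnets below are placeholders, not our real addressing):

[global]
    # clients and monitors reach the cluster on this network
    public network = 10.0.0.0/24
    # OSD replication and OSD-to-OSD heartbeat traffic uses this network
    cluster network = 172.16.0.0/24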
Updated by Vasu Kulkarni about 7 years ago
One OSD was unable to communicate, and "ceph osd tree" showed the OSD as up.
Updated by Patrick McLean about 7 years ago
Here is the output of "ceph osd tree":
ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 576.00000 root default
 -2 192.00000     rack r2.XXXXXXXXXXXXXXXXXXX
 -5  96.00000         chassis ceph1.r2.XXXXXXXXXXXXXXXXXXX
-11  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg0
  1   8.00000                 osd.1      up  1.00000          1.00000
  7   8.00000                 osd.7      up  1.00000          1.00000
 15   8.00000                 osd.15     up  1.00000          1.00000
-18  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg1
 21   8.00000                 osd.21     up  1.00000          1.00000
 28   8.00000                 osd.28     up  1.00000          1.00000
 31   8.00000                 osd.31     up  1.00000          1.00000
-24  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg2
 37   8.00000                 osd.37     up  1.00000          1.00000
 45   8.00000                 osd.45     up  1.00000          1.00000
 51   8.00000                 osd.51     up  1.00000          1.00000
-30  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg3
 56   8.00000                 osd.56     up  1.00000          1.00000
 61   8.00000                 osd.61     up  1.00000          1.00000
 67   8.00000                 osd.67     up  1.00000          1.00000
 -7  96.00000         chassis ceph2.r2.XXXXXXXXXXXXXXXXXXX
-13  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg0
  0   8.00000                 osd.0      up  1.00000          1.00000
  6   8.00000                 osd.6      up  1.00000          1.00000
 12   8.00000                 osd.12     up  1.00000          1.00000
-19  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg1
 18   8.00000                 osd.18     up  1.00000          1.00000
 24   8.00000                 osd.24     up  1.00000          1.00000
 30   8.00000                 osd.30     up  1.00000          1.00000
-28  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg2
 38   8.00000                 osd.38     up  1.00000          1.00000
 46   8.00000                 osd.46     up  1.00000          1.00000
 52   8.00000                 osd.52     up  1.00000          1.00000
-33  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg3
 57   8.00000                 osd.57     up  1.00000          1.00000
 60   8.00000                 osd.60     up  1.00000          1.00000
 66   8.00000                 osd.66     up  1.00000          1.00000
 -3 192.00000     rack r1.XXXXXXXXXXXXXXXXXXX
 -6  96.00000         chassis ceph1.r1.XXXXXXXXXXXXXXXXXXX
-12  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg0
  2   8.00000                 osd.2      up  1.00000          1.00000
  9   8.00000                 osd.9      up  1.00000          1.00000
 16   8.00000                 osd.16     up  1.00000          1.00000
-17  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg1
 22   8.00000                 osd.22     up  1.00000          1.00000
 29   8.00000                 osd.29     up  1.00000          1.00000
 32   8.00000                 osd.32     up  1.00000          1.00000
-23  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg2
 36   8.00000                 osd.36     up  1.00000          1.00000
 44   8.00000                 osd.44     up  1.00000          1.00000
 50   8.00000                 osd.50     up  1.00000          1.00000
-29  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg3
 54   8.00000                 osd.54     up  1.00000          1.00000
 63   8.00000                 osd.63     up  1.00000          1.00000
 70   8.00000                 osd.70     up  1.00000          1.00000
 -8  96.00000         chassis ceph2.r1.XXXXXXXXXXXXXXXXXXX
-14  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg0
  3   8.00000                 osd.3      up  1.00000          1.00000
 11   8.00000                 osd.11     up  1.00000          1.00000
 14   8.00000                 osd.14     up  1.00000          1.00000
-22  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg1
 23   8.00000                 osd.23     up  1.00000          1.00000
 25   8.00000                 osd.25     up  1.00000          1.00000
 33   8.00000                 osd.33     up  1.00000          1.00000
-26  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg2
 40   8.00000                 osd.40     up  1.00000          1.00000
 43   8.00000                 osd.43     up  1.00000          1.00000
 49   8.00000                 osd.49     up  1.00000          1.00000
-31  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg3
 58   8.00000                 osd.58     up  1.00000          1.00000
 62   8.00000                 osd.62     up  1.00000          1.00000
 68   8.00000                 osd.68     up  1.00000          1.00000
 -4 192.00000     rack r3.XXXXXXXXXXXXXXXXXXX
 -9  96.00000         chassis ceph1.r3.XXXXXXXXXXXXXXXXXXX
-15  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg0
  5   8.00000                 osd.5      up  1.00000          1.00000
 10   8.00000                 osd.10     up  1.00000          1.00000
 13   8.00000                 osd.13     up  1.00000          1.00000
-20  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg1
 20   8.00000                 osd.20     up  1.00000          1.00000
 27   8.00000                 osd.27     up  1.00000          1.00000
 35   8.00000                 osd.35     up  1.00000          1.00000
-27  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg2
 39   8.00000                 osd.39     up  1.00000          1.00000
 42   8.00000                 osd.42     up  1.00000          1.00000
 48   8.00000                 osd.48     up  1.00000          1.00000
-34  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg3
 55   8.00000                 osd.55     up  1.00000          1.00000
 64   8.00000                 osd.64     up  1.00000          1.00000
 69   8.00000                 osd.69     up  1.00000          1.00000
-10  96.00000         chassis ceph2.r3.XXXXXXXXXXXXXXXXXXX
-16  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg0
  4   8.00000                 osd.4      up  1.00000          1.00000
  8   8.00000                 osd.8      up  1.00000          1.00000
 17   8.00000                 osd.17     up  1.00000          1.00000
-21  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg1
 19   8.00000                 osd.19     up  1.00000          1.00000
 26   8.00000                 osd.26     up  1.00000          1.00000
 34   8.00000                 osd.34     up  1.00000          1.00000
-25  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg2
 41   8.00000                 osd.41     up  1.00000          1.00000
 47   8.00000                 osd.47     up  1.00000          1.00000
 53   8.00000                 osd.53     up  1.00000          1.00000
-32  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg3
 59   8.00000                 osd.59     up  1.00000          1.00000
 65   8.00000                 osd.65     up  1.00000          1.00000
 71   8.00000                 osd.71     up  1.00000          1.00000
Updated by Patrick McLean about 7 years ago
Here is the output of "ip addr". Note that the "internal" interface is DOWN with NO-CARRIER:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: external: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 10.XXX.XXX.XXX/24 brd 10.XXX.XXX.255 scope global external
       valid_lft forever preferred_lft forever
    inet6 XXXX::XXXX:XXXX:XXXX:XXXX/64 scope link
       valid_lft forever preferred_lft forever
3: internal: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state DOWN group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 172.XXX.XXX.XXX/24 brd 172.XXX.XXX.255 scope global internal
       valid_lft forever preferred_lft forever
    inet6 XXXX::XXXX:XXXX:XXXX:XXXX/64 scope link
       valid_lft forever preferred_lft forever
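(As an aside, the carrier state can also be confirmed directly from sysfs; "internal" is the interface name from the output above:)

# prints 1 if the link has carrier, 0 if not
cat /sys/class/net/internal/carrier
# operational state as the kernel sees it ("down" in this case)
cat /sys/class/net/internal/operstate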
Updated by Patrick McLean about 7 years ago
Just for clarification: it was ceph2.r2 that was down. "chassis" is the physical node, and "host" is a subgroup of OSDs on the physical node.
Updated by Greg Farnum almost 7 years ago
- Subject changed from "OSD node reported as up when the cluster network is down, but the "public" network was up" to "OSD remained up despite cluster network being inactive?"
Was the cluster performing IO while this happened? Do your public and private networks perhaps route to each other?
Is it running 10.2.5? I see it tagged that way but no textual report of the version. :)
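(For a textual report, the version can be pulled straight from the cluster, e.g.:)

# version of the locally installed binaries
ceph -v
# versions reported by the running OSD daemons themselves
ceph tell osd.* version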
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Status changed from New to Need More Info
- Component(RADOS) Monitor, OSD added
Updated by Patrick McLean almost 7 years ago
The cluster does not need to be performing any IO for this to happen; normal peering and health checking are enough. The networks do not route to each other; they are completely separate networks on physically separate network cards.
We were initially seeing this with 10.2.5; we have since updated to 10.2.7 and are still seeing it.
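For what it's worth, reproducing it is just a matter of downing the cluster-network interface on one OSD node and watching the monitor's view. A sketch using our interface naming (adjust the interface name for your setup):

# on one OSD node: take the cluster-network interface down
ip link set internal down
# from any admin node: the OSDs on that node should be marked
# down, but the tree keeps showing them up
ceph osd tree
ceph health detail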
Updated by Greg Farnum almost 7 years ago
- Status changed from Need More Info to 12
- Priority changed from Normal to High
Sounds like we messed up the way cluster network heartbeating and the monitor's public network connection to the OSDs interact again...
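For reference, the pieces interacting here are the OSD-to-OSD heartbeats (which run over the cluster network) and the monitor's failure-report threshold. Roughly these settings, shown with what I believe are the jewel defaults (worth double-checking against the release):

[osd]
    # how often each OSD pings its heartbeat peers (seconds)
    osd heartbeat interval = 6
    # how long without a reply before a peer is reported down (seconds)
    osd heartbeat grace = 20
[mon]
    # how many distinct reporters before the mon marks an OSD down
    mon osd min down reporters = 2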
Updated by Sage Weil over 6 years ago
- Status changed from 12 to Need More Info
Patrick, can you still reproduce this?
Updated by Patrick McLean over 6 years ago
Yes, we can still reproduce this on 10.2.10. We have not yet updated to luminous.
Updated by Nathan Cutler over 6 years ago
- Status changed from Need More Info to 12
Updated by Nathan Cutler over 6 years ago
- Affected Versions v10.2.10 added
- Affected Versions deleted (v10.2.5)
Updated by Neha Ojha almost 4 years ago
- Status changed from New to Closed
Please reopen this bug if the issue is seen in nautilus or newer releases.