Bug #19700

OSD remained up despite cluster network being inactive?

Added by Patrick McLean almost 7 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor, OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a Ceph cluster with a segregated cluster network for the OSDs to communicate with each other, and a "public" network for clients to talk to the cluster. The monitors are on the "public" network, and the OSDs talk to the monitors through that interface. We had an issue where the private (cluster) network interface went down on one of our OSD nodes, but everything behaved as if things were normal, despite the fact that the OSDs on that node couldn't talk to any other OSDs. The monitor reported the cluster as healthy, and 'ceph osd tree' showed all of the OSDs on the downed node as up.

We have 12 OSDs per node and 6 nodes in the cluster.
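
For reference, the split is configured the usual way in ceph.conf; a minimal sketch with placeholder subnets (not our real addresses, which are masked in the output below):

[global]
    # network the monitors and clients use to reach the OSDs
    public network  = 10.0.0.0/24
    # network the OSDs use among themselves for replication and recovery
    cluster network = 172.16.0.0/24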

History

#1 Updated by Vasu Kulkarni almost 7 years ago

One OSD was unable to communicate, and "ceph osd tree" showed the OSD as up.

#2 Updated by Patrick McLean almost 7 years ago

Here is the output of "ceph osd tree"

ID  WEIGHT    TYPE NAME                                           UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 576.00000 root default   
 -2 192.00000     rack r2.XXXXXXXXXXXXXXXXXXX 
 -5  96.00000         chassis ceph1.r2.XXXXXXXXXXXXXXXXXXX   
-11  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg0 
  1   8.00000                 osd.1                                    up  1.00000          1.00000
  7   8.00000                 osd.7                                    up  1.00000          1.00000
 15   8.00000                 osd.15                                   up  1.00000          1.00000
-18  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg1 
 21   8.00000                 osd.21                                   up  1.00000          1.00000
 28   8.00000                 osd.28                                   up  1.00000          1.00000
 31   8.00000                 osd.31                                   up  1.00000          1.00000
-24  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg2 
 37   8.00000                 osd.37                                   up  1.00000          1.00000
 45   8.00000                 osd.45                                   up  1.00000          1.00000
 51   8.00000                 osd.51                                   up  1.00000          1.00000
-30  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg3   
 56   8.00000                 osd.56                                   up  1.00000          1.00000
 61   8.00000                 osd.61                                   up  1.00000          1.00000
 67   8.00000                 osd.67                                   up  1.00000          1.00000
 -7  96.00000         chassis ceph2.r2.XXXXXXXXXXXXXXXXXXX 
-13  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg0  
  0   8.00000                 osd.0                                    up  1.00000          1.00000
  6   8.00000                 osd.6                                    up  1.00000          1.00000
 12   8.00000                 osd.12                                   up  1.00000          1.00000
-19  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg1 
 18   8.00000                 osd.18                                   up  1.00000          1.00000
 24   8.00000                 osd.24                                   up  1.00000          1.00000
 30   8.00000                 osd.30                                   up  1.00000          1.00000
-28  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg2 
 38   8.00000                 osd.38                                   up  1.00000          1.00000
 46   8.00000                 osd.46                                   up  1.00000          1.00000
 52   8.00000                 osd.52                                   up  1.00000          1.00000
-33  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg3 
 57   8.00000                 osd.57                                   up  1.00000          1.00000
 60   8.00000                 osd.60                                   up  1.00000          1.00000
 66   8.00000                 osd.66                                   up  1.00000          1.00000
 -3 192.00000     rack r1.XXXXXXXXXXXXXXXXXXX 
 -6  96.00000         chassis ceph1.r1.XXXXXXXXXXXXXXXXXXX 
-12  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg0 
  2   8.00000                 osd.2                                    up  1.00000          1.00000 
  9   8.00000                 osd.9                                    up  1.00000          1.00000
 16   8.00000                 osd.16                                   up  1.00000          1.00000
-17  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg1 
 22   8.00000                 osd.22                                   up  1.00000          1.00000
 29   8.00000                 osd.29                                   up  1.00000          1.00000
 32   8.00000                 osd.32                                   up  1.00000          1.00000
-23  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg2 
 36   8.00000                 osd.36                                   up  1.00000          1.00000
 44   8.00000                 osd.44                                   up  1.00000          1.00000
 50   8.00000                 osd.50                                   up  1.00000          1.00000
-29  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg3 
 54   8.00000                 osd.54                                   up  1.00000          1.00000
 63   8.00000                 osd.63                                   up  1.00000          1.00000
 70   8.00000                 osd.70                                   up  1.00000          1.00000
 -8  96.00000         chassis ceph2.r1.XXXXXXXXXXXXXXXXXXX 
-14  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg0 
  3   8.00000                 osd.3                                    up  1.00000          1.00000
 11   8.00000                 osd.11                                   up  1.00000          1.00000
 14   8.00000                 osd.14                                   up  1.00000          1.00000
-22  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg1 
 23   8.00000                 osd.23                                   up  1.00000          1.00000
 25   8.00000                 osd.25                                   up  1.00000          1.00000
 33   8.00000                 osd.33                                   up  1.00000          1.00000
-26  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg2      
 40   8.00000                 osd.40                                   up  1.00000          1.00000
 43   8.00000                 osd.43                                   up  1.00000          1.00000
 49   8.00000                 osd.49                                   up  1.00000          1.00000
-31  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg3 
 58   8.00000                 osd.58                                   up  1.00000          1.00000
 62   8.00000                 osd.62                                   up  1.00000          1.00000
 68   8.00000                 osd.68                                   up  1.00000          1.00000
 -4 192.00000     rack r3.XXXXXXXXXXXXXXXXXXX 
 -9  96.00000         chassis ceph1.r3.XXXXXXXXXXXXXXXXXXX 
-15  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg0    
  5   8.00000                 osd.5                                    up  1.00000          1.00000
 10   8.00000                 osd.10                                   up  1.00000          1.00000
 13   8.00000                 osd.13                                   up  1.00000          1.00000
-20  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg1 
 20   8.00000                 osd.20                                   up  1.00000          1.00000
 27   8.00000                 osd.27                                   up  1.00000          1.00000
 35   8.00000                 osd.35                                   up  1.00000          1.00000
-27  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg2 
 39   8.00000                 osd.39                                   up  1.00000          1.00000
 42   8.00000                 osd.42                                   up  1.00000          1.00000
 48   8.00000                 osd.48                                   up  1.00000          1.00000
-34  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg3 
 55   8.00000                 osd.55                                   up  1.00000          1.00000
 64   8.00000                 osd.64                                   up  1.00000          1.00000
 69   8.00000                 osd.69                                   up  1.00000          1.00000
-10  96.00000         chassis ceph2.r3.XXXXXXXXXXXXXXXXXXX 
-16  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg0 
  4   8.00000                 osd.4                                    up  1.00000          1.00000
  8   8.00000                 osd.8                                    up  1.00000          1.00000
 17   8.00000                 osd.17                                   up  1.00000          1.00000
-21  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg1 
 19   8.00000                 osd.19                                   up  1.00000          1.00000
 26   8.00000                 osd.26                                   up  1.00000          1.00000
 34   8.00000                 osd.34                                   up  1.00000          1.00000
-25  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg2 
 41   8.00000                 osd.41                                   up  1.00000          1.00000
 47   8.00000                 osd.47                                   up  1.00000          1.00000
 53   8.00000                 osd.53                                   up  1.00000          1.00000
-32  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg3 
 59   8.00000                 osd.59                                   up  1.00000          1.00000
 65   8.00000                 osd.65                                   up  1.00000          1.00000
 71   8.00000                 osd.71                                   up  1.00000          1.00000

#3 Updated by Patrick McLean almost 7 years ago

Here is the output of "ip addr". Note that the "internal" interface is DOWN with NO-CARRIER:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: external: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX  brd ff:ff:ff:ff:ff:ff
    inet 10.XXX.XXX.XXX/24 brd 10.XXX.XXX.255 scope global external
       valid_lft forever preferred_lft forever
    inet6 XXXX::XXXX:XXXX:XXXX:XXXX/64 scope link 
       valid_lft forever preferred_lft forever
3: internal: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state DOWN group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 172.XXX.XXX.XXX/24 brd 172.XXX.XXX.255 scope global internal
       valid_lft forever preferred_lft forever
    inet6 XXXX::XXXX:XXXX:XXXX:XXXX/64 scope link
       valid_lft forever preferred_lft forever

#4 Updated by Patrick McLean almost 7 years ago

Just for clarification: it was ceph2.r2 that was down. In the tree above, "chassis" is the physical node and "host" is a subgroup on the physical node.

#5 Updated by Greg Farnum almost 7 years ago

  • Subject changed from OSD node reported as up when the cluster network is down, but the "public" network was up to OSD remained up despite cluster network being inactive?

Was the cluster performing IO while this happened? Do your public and private networks perhaps route to each other?
Is it running 10.2.5? I see it tagged that way but no textual report of the version. :)

#6 Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Status changed from New to Need More Info
  • Component(RADOS) Monitor, OSD added

#7 Updated by Patrick McLean almost 7 years ago

The cluster does not need to be performing any IO beyond normal peering and health checking; this will still happen. The networks do not route to each other; they are completely separate networks on physically separate network cards.

We were initially seeing this with 10.2.5; we have since updated to 10.2.7 and are still seeing it.
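
A rough sketch of how to trigger it (the interface name "internal" is taken from the "ip addr" output above and will differ on other setups):

# on one OSD node, drop the cluster-network link
ip link set internal down

# from an admin node, well past the heartbeat grace period:
ceph -s          # still reports HEALTH_OK
ceph osd tree    # still shows that node's OSDs as up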

#8 Updated by Greg Farnum almost 7 years ago

  • Status changed from Need More Info to 12
  • Priority changed from Normal to High

Sounds like we messed up the way cluster network heartbeating and the monitor's public network connection to the OSDs interact again...
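
For anyone digging into this, something like the following (a sketch, using jewel-era commands) should show the public/cluster addresses each OSD registered with the monitors and the heartbeat grace the daemon is actually using:

# public and cluster addresses the monitors have on record for each OSD
ceph osd dump | grep '^osd\.'

# on an OSD node, via the admin socket
ceph daemon osd.0 config get osd_heartbeat_grace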

#9 Updated by Sage Weil over 6 years ago

  • Status changed from 12 to Need More Info

Patrick, can you still reproduce this?

#10 Updated by Patrick McLean over 6 years ago

Yes, we can still reproduce this on 10.2.10. We have not updated to luminous yet.

#11 Updated by Nathan Cutler over 6 years ago

  • Status changed from Need More Info to 12

#12 Updated by Nathan Cutler over 6 years ago

  • Affected Versions v10.2.10 added
  • Affected Versions deleted (v10.2.5)

#13 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New

#14 Updated by Neha Ojha over 3 years ago

  • Status changed from New to Closed

Please reopen this bug if the issue is seen in nautilus or newer releases.
