Bug #19700
OSD remained up despite cluster network being inactive?
Status: Closed
Description
We have a Ceph cluster with a segregated cluster network for the OSDs to communicate with each other, and a "public" network for clients to talk to the cluster. The monitors are on the "public" network, and the OSDs talk to the monitors through that interface. We had an issue where the private network interface went down on one of our OSD nodes, but everything behaved as if things were normal, despite the fact that the OSD node couldn't talk to any other OSDs. The monitor reported the cluster as healthy, and 'ceph osd tree' showed the downed node as up.
We have 12 OSDs per node and 6 nodes in the cluster.
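For context, this kind of split is configured with the public/cluster network options in ceph.conf. A minimal sketch (the subnets below are placeholders, not our real addressing):

[global]
    # clients and monitors reach the cluster on this network
    public network = 10.0.0.0/24
    # OSD replication and OSD-to-OSD heartbeat traffic uses this network
    cluster network = 172.16.0.0/24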
Updated by Vasu Kulkarni about 7 years ago
One OSD was unable to communicate, and "ceph osd tree" showed the OSD as up.
Updated by Patrick McLean about 7 years ago
Here is the output of "ceph osd tree":
ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 576.00000 root default
 -2 192.00000     rack r2.XXXXXXXXXXXXXXXXXXX
 -5  96.00000         chassis ceph1.r2.XXXXXXXXXXXXXXXXXXX
-11  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg0
  1   8.00000                 osd.1      up  1.00000          1.00000
  7   8.00000                 osd.7      up  1.00000          1.00000
 15   8.00000                 osd.15     up  1.00000          1.00000
-18  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg1
 21   8.00000                 osd.21     up  1.00000          1.00000
 28   8.00000                 osd.28     up  1.00000          1.00000
 31   8.00000                 osd.31     up  1.00000          1.00000
-24  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg2
 37   8.00000                 osd.37     up  1.00000          1.00000
 45   8.00000                 osd.45     up  1.00000          1.00000
 51   8.00000                 osd.51     up  1.00000          1.00000
-30  24.00000             host ceph1.r2.XXXXXXXXXXXXXXXXXXX-cdg3
 56   8.00000                 osd.56     up  1.00000          1.00000
 61   8.00000                 osd.61     up  1.00000          1.00000
 67   8.00000                 osd.67     up  1.00000          1.00000
 -7  96.00000         chassis ceph2.r2.XXXXXXXXXXXXXXXXXXX
-13  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg0
  0   8.00000                 osd.0      up  1.00000          1.00000
  6   8.00000                 osd.6      up  1.00000          1.00000
 12   8.00000                 osd.12     up  1.00000          1.00000
-19  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg1
 18   8.00000                 osd.18     up  1.00000          1.00000
 24   8.00000                 osd.24     up  1.00000          1.00000
 30   8.00000                 osd.30     up  1.00000          1.00000
-28  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg2
 38   8.00000                 osd.38     up  1.00000          1.00000
 46   8.00000                 osd.46     up  1.00000          1.00000
 52   8.00000                 osd.52     up  1.00000          1.00000
-33  24.00000             host ceph2.r2.XXXXXXXXXXXXXXXXXXX-cdg3
 57   8.00000                 osd.57     up  1.00000          1.00000
 60   8.00000                 osd.60     up  1.00000          1.00000
 66   8.00000                 osd.66     up  1.00000          1.00000
 -3 192.00000     rack r1.XXXXXXXXXXXXXXXXXXX
 -6  96.00000         chassis ceph1.r1.XXXXXXXXXXXXXXXXXXX
-12  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg0
  2   8.00000                 osd.2      up  1.00000          1.00000
  9   8.00000                 osd.9      up  1.00000          1.00000
 16   8.00000                 osd.16     up  1.00000          1.00000
-17  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg1
 22   8.00000                 osd.22     up  1.00000          1.00000
 29   8.00000                 osd.29     up  1.00000          1.00000
 32   8.00000                 osd.32     up  1.00000          1.00000
-23  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg2
 36   8.00000                 osd.36     up  1.00000          1.00000
 44   8.00000                 osd.44     up  1.00000          1.00000
 50   8.00000                 osd.50     up  1.00000          1.00000
-29  24.00000             host ceph1.r1.XXXXXXXXXXXXXXXXXXX-cdg3
 54   8.00000                 osd.54     up  1.00000          1.00000
 63   8.00000                 osd.63     up  1.00000          1.00000
 70   8.00000                 osd.70     up  1.00000          1.00000
 -8  96.00000         chassis ceph2.r1.XXXXXXXXXXXXXXXXXXX
-14  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg0
  3   8.00000                 osd.3      up  1.00000          1.00000
 11   8.00000                 osd.11     up  1.00000          1.00000
 14   8.00000                 osd.14     up  1.00000          1.00000
-22  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg1
 23   8.00000                 osd.23     up  1.00000          1.00000
 25   8.00000                 osd.25     up  1.00000          1.00000
 33   8.00000                 osd.33     up  1.00000          1.00000
-26  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg2
 40   8.00000                 osd.40     up  1.00000          1.00000
 43   8.00000                 osd.43     up  1.00000          1.00000
 49   8.00000                 osd.49     up  1.00000          1.00000
-31  24.00000             host ceph2.r1.XXXXXXXXXXXXXXXXXXX-cdg3
 58   8.00000                 osd.58     up  1.00000          1.00000
 62   8.00000                 osd.62     up  1.00000          1.00000
 68   8.00000                 osd.68     up  1.00000          1.00000
 -4 192.00000     rack r3.XXXXXXXXXXXXXXXXXXX
 -9  96.00000         chassis ceph1.r3.XXXXXXXXXXXXXXXXXXX
-15  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg0
  5   8.00000                 osd.5      up  1.00000          1.00000
 10   8.00000                 osd.10     up  1.00000          1.00000
 13   8.00000                 osd.13     up  1.00000          1.00000
-20  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg1
 20   8.00000                 osd.20     up  1.00000          1.00000
 27   8.00000                 osd.27     up  1.00000          1.00000
 35   8.00000                 osd.35     up  1.00000          1.00000
-27  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg2
 39   8.00000                 osd.39     up  1.00000          1.00000
 42   8.00000                 osd.42     up  1.00000          1.00000
 48   8.00000                 osd.48     up  1.00000          1.00000
-34  24.00000             host ceph1.r3.XXXXXXXXXXXXXXXXXXX-cdg3
 55   8.00000                 osd.55     up  1.00000          1.00000
 64   8.00000                 osd.64     up  1.00000          1.00000
 69   8.00000                 osd.69     up  1.00000          1.00000
-10  96.00000         chassis ceph2.r3.XXXXXXXXXXXXXXXXXXX
-16  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg0
  4   8.00000                 osd.4      up  1.00000          1.00000
  8   8.00000                 osd.8      up  1.00000          1.00000
 17   8.00000                 osd.17     up  1.00000          1.00000
-21  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg1
 19   8.00000                 osd.19     up  1.00000          1.00000
 26   8.00000                 osd.26     up  1.00000          1.00000
 34   8.00000                 osd.34     up  1.00000          1.00000
-25  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg2
 41   8.00000                 osd.41     up  1.00000          1.00000
 47   8.00000                 osd.47     up  1.00000          1.00000
 53   8.00000                 osd.53     up  1.00000          1.00000
-32  24.00000             host ceph2.r3.XXXXXXXXXXXXXXXXXXX-cdg3
 59   8.00000                 osd.59     up  1.00000          1.00000
 65   8.00000                 osd.65     up  1.00000          1.00000
 71   8.00000                 osd.71     up  1.00000          1.00000
Updated by Patrick McLean about 7 years ago
Here is the output of "ip addr". Note that the "internal" interface is DOWN with NO-CARRIER:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: external: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 10.XXX.XXX.XXX/24 brd 10.XXX.XXX.255 scope global external
       valid_lft forever preferred_lft forever
    inet6 XXXX::XXXX:XXXX:XXXX:XXXX/64 scope link
       valid_lft forever preferred_lft forever
3: internal: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state DOWN group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 172.XXX.XXX.XXX/24 brd 172.XXX.XXX.255 scope global internal
       valid_lft forever preferred_lft forever
    inet6 XXXX::XXXX:XXXX:XXXX:XXXX/64 scope link
       valid_lft forever preferred_lft forever
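(As an aside, the carrier state can also be confirmed directly from sysfs; "internal" is the interface name from the output above:)

# prints 1 if the link has carrier, 0 if not
cat /sys/class/net/internal/carrier
# operational state as the kernel sees it ("down" in this case)
cat /sys/class/net/internal/operstate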
Updated by Patrick McLean about 7 years ago
Just for clarification: it was ceph2.r2 that was down. "chassis" is the physical node, and "host" is a subgroup of OSDs on the physical node.
Updated by Greg Farnum almost 7 years ago
- Subject changed from "OSD node reported as up when the cluster network is down, but the "public" network was up" to "OSD remained up despite cluster network being inactive?"
Was the cluster performing IO while this happened? Do your public and private networks perhaps route to each other?
Is it running 10.2.5? I see it tagged that way but no textual report of the version. :)
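(For a textual report, the version can be pulled straight from the cluster, e.g.:)

# version of the locally installed binaries
ceph -v
# versions reported by the running OSD daemons themselves
ceph tell osd.* version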
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Status changed from New to Need More Info
- Component(RADOS) Monitor, OSD added
Updated by Patrick McLean almost 7 years ago
The cluster does not need to be performing any IO for this to happen; normal peering and health checking are enough. The networks do not route to each other; they are completely separate networks on physically separate network cards.
We were initially seeing this with 10.2.5; we have since updated to 10.2.7 and are still seeing it.
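For what it's worth, reproducing it is just a matter of downing the cluster-network interface on one OSD node and watching the monitor's view. A sketch using our interface naming (adjust the interface name for your setup):

# on one OSD node: take the cluster-network interface down
ip link set internal down
# from any admin node: the OSDs on that node should be marked
# down, but the tree keeps showing them up
ceph osd tree
ceph health detail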
Updated by Greg Farnum almost 7 years ago
- Status changed from Need More Info to 12
- Priority changed from Normal to High
Sounds like we messed up the way cluster network heartbeating and the monitor's public network connection to the OSDs interact again...
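For reference, the pieces interacting here are the OSD-to-OSD heartbeats (which run over the cluster network) and the monitor's failure-report threshold. Roughly these settings, shown with what I believe are the jewel defaults (worth double-checking against the release):

[osd]
    # how often each OSD pings its heartbeat peers (seconds)
    osd heartbeat interval = 6
    # how long without a reply before a peer is reported down (seconds)
    osd heartbeat grace = 20
[mon]
    # how many distinct reporters before the mon marks an OSD down
    mon osd min down reporters = 2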
Updated by Sage Weil over 6 years ago
- Status changed from 12 to Need More Info
Patrick, can you still reproduce this?
Updated by Patrick McLean over 6 years ago
Yes, we can still reproduce this on 10.2.10. We have not yet updated to luminous.
Updated by Nathan Cutler over 6 years ago
- Status changed from Need More Info to 12
Updated by Nathan Cutler over 6 years ago
- Affected Versions v10.2.10 added
- Affected Versions deleted (v10.2.5)
Updated by Neha Ojha almost 4 years ago
- Status changed from New to Closed
Please reopen this bug if the issue is seen in nautilus or newer releases.