Bug #9215

Ceph Firefly 0.80.5 : OSD flapping too frequently

Added by karan singh almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I have not made any changes to my cluster, yet the OSDs have started flapping very frequently (within seconds). The flapping is so fast that it is difficult to track which OSD failed, and the next time a different OSD fails.

Here is the output of ceph osd tree | grep -i down, executed every 3 seconds. Each time, a different set of OSDs is down.


# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
359    2.73                osd.359    down    1
408    2.7                osd.408    down    1
#
#
#
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
334    2.73                osd.334    down    1
327    2.73                osd.327    down    1
366    2.73                osd.366    down    1
405    2.7                osd.405    down    1
#
#
#
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
11    2.73                osd.11    down    1
18    2.73                osd.18    down    1
327    2.73                osd.327    down    1
352    2.63                osd.352    down    1
310    2.73                osd.310    down    1
372    2.73                osd.372    down    1
394    2.63                osd.394    down    1
#
#
#
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
32    2.73                osd.32    down    1
337    2.73                osd.337    down    1
343    2.73                osd.343    down    1
353    2.73                osd.353    down    1
362    2.73                osd.362    down    1
301    2.73                osd.301    down    1
#
#
#
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
362    2.73                osd.362    down    1
#
#
#
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
47    2.73                osd.47    down    1
395    2.73                osd.395    down    1
403    2.7                osd.403    down    1
401    0.09                osd.401    down    1
#
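
Because the set of down OSDs changes on every run, it can help to tally how often each OSD shows up across the captured snapshots. The following is a minimal sketch, not part of the original report: it assumes the output was saved as text in the same layout as above, and the names `tally_down_osds`, `ROW`, and `sample` are hypothetical. The sample rows are copied from the first three snapshots above.

```python
import re
from collections import Counter

# Matches a data row such as "359    2.73    osd.359    down    1".
# Header lines start with "#" and therefore never match.
ROW = re.compile(r"^\s*(\d+)\s+[\d.]+\s+(osd\.\d+)\s+down\s+[\d.]+\s*$")


def tally_down_osds(captured: str) -> Counter:
    """Count, per OSD, how many captured snapshots listed it as down."""
    counts = Counter()
    seen = set()  # OSDs already counted in the current snapshot
    for line in captured.splitlines():
        if line.startswith("# ceph osd tree"):
            seen = set()  # a new snapshot begins at each command line
            continue
        match = ROW.match(line)
        if match and match.group(2) not in seen:
            seen.add(match.group(2))
            counts[match.group(2)] += 1
    return counts


# Rows taken from the first three snapshots in this report.
sample = """\
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
359    2.73                osd.359    down    1
408    2.7                osd.408    down    1
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
334    2.73                osd.334    down    1
327    2.73                osd.327    down    1
366    2.73                osd.366    down    1
405    2.7                osd.405    down    1
# ceph osd tree | grep -i down
# id    weight    type name    up/down    reweight
11    2.73                osd.11    down    1
18    2.73                osd.18    down    1
327    2.73                osd.327    down    1
"""

if __name__ == "__main__":
    # Print OSDs ordered by how many snapshots reported them down.
    for osd, hits in tally_down_osds(sample).most_common():
        print(osd, hits)
```

An OSD that appears in many snapshots is a candidate for closer inspection, while a flat distribution across many OSDs points toward a cluster-wide cause rather than a single failing disk or host.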

Every 2.0s: ceph -s    Mon Aug 25 11:42:53 2014

    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 6 pgs degraded; 33 pgs peering; 104 pgs stale; recovery 1/649224 objects degraded (0.000%); 1/409 in osds are down
     monmap e3: 3 mons at , election epoch 346, quorum 0,1,2 
     mdsmap e14: 1/1/1 up {0= =up:active}
*     osdmap e412466: 409 osds: 408 up, 409 in*
      pgmap v1163583: 30912 pgs, 22 pools, 7512 GB data, 298 kobjects
            38572 GB used, 1283 TB / 1320 TB avail
            1/649224 objects degraded (0.000%)
               30769 active+clean
                  33 peering
                 104 stale+active+clean
                   6 active+degraded

    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 22 pgs degraded; 48 pgs peering; 103 pgs stale; 3/409 in osds are down
     monmap e3: 3 mons at , election epoch 346, quorum 0,1,2 
     mdsmap e14: 1/1/1 up {0= =up:active}
  *   osdmap e412509: 409 osds: 406 up, 409 in*
      pgmap v1163641: 30912 pgs, 22 pools, 7512 GB data, 298 kobjects
            38576 GB used, 1283 TB / 1320 TB avail
               30749 active+clean
                  43 peering
                  93 stale+active+clean
                   5 stale+peering
                  17 active+degraded
                   5 stale+active+degraded

   cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 16 pgs degraded; 1 pgs incomplete; 66 pgs peering; 207 pgs stale; 4/409 in osds are down
     monmap e3: 3 mons at {}, election epoch 346, quorum 0,1,2 
     mdsmap e14: 1/1/1 up {0= =up:active}
    * osdmap e412541: 409 osds: 405 up, 409 in*
      pgmap v1163677: 30912 pgs, 22 pools, 7512 GB data, 298 kobjects
            38580 GB used, 1283 TB / 1320 TB avail
               30631 active+clean
                  58 peering
                 198 stale+active+clean
                   8 stale+peering
                  15 active+degraded
                   1 incomplete
                   1 stale+active+degraded
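
The osdmap lines in the three status dumps above show the map epoch climbing quickly while the number of up OSDs drops. As a rough illustration only, the sketch below extracts those lines and reports the change between consecutive samples. The helper names `osdmap_samples` and `epoch_deltas` are hypothetical, and the regular expression assumes the Firefly-era `ceph -s` line format shown in this report.

```python
import re

# Matches e.g. "osdmap e412466: 409 osds: 408 up, 409 in".
OSDMAP = re.compile(r"osdmap e(\d+): (\d+) osds: (\d+) up, (\d+) in")


def osdmap_samples(status_text: str):
    """Return (epoch, total, up, in) tuples for every osdmap line found."""
    return [tuple(int(g) for g in m.groups())
            for m in OSDMAP.finditer(status_text)]


def epoch_deltas(samples):
    """Return (epoch_change, up_change) between consecutive samples."""
    return [(cur[0] - prev[0], cur[2] - prev[2])
            for prev, cur in zip(samples, samples[1:])]


# The three osdmap lines from the status dumps in this report.
status_text = """\
osdmap e412466: 409 osds: 408 up, 409 in
osdmap e412509: 409 osds: 406 up, 409 in
osdmap e412541: 409 osds: 405 up, 409 in
"""

if __name__ == "__main__":
    samples = osdmap_samples(status_text)
    for (epochs, up_change), sample in zip(epoch_deltas(samples), samples[1:]):
        print(f"epoch e{sample[0]}: +{epochs} epochs, up OSDs changed by {up_change}")
```

A large jump in the epoch between samples means many osdmap updates were published in that interval, which is consistent with OSDs being repeatedly marked down and up.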

I have also captured the OSD log with debug osd = 20 for one of the OSDs (osd.394). This might help you locate the issue. The log is attached here.

osd.394-logs 2.rtf (4.2 MB) karan singh, 08/25/2014 01:43 AM

History

#1 Updated by karan singh almost 6 years ago

You can close this case; the problem was solved after applying the fix (0.80.5-1-gc4b77d2).

#2 Updated by Sage Weil almost 6 years ago

  • Status changed from New to Resolved

#3 Updated by Wang Qiang over 5 years ago

karan singh wrote:

You can close this case; the problem was solved after applying the fix (0.80.5-1-gc4b77d2).

May I know where I can find the fix (0.80.5-1-gc4b77d2)?
