Bug #9215
Ceph Firefly 0.80.5: OSD flapping too frequently
Status: Closed
Description
I have not made any changes to my cluster, yet OSDs have started flapping very frequently (within seconds). The flapping is so fast that it is difficult to track which OSD failed; the next time, a different OSD fails.
Here is the output of ceph osd tree | grep -i down, executed every 3 seconds; you can see that a different set of OSDs is down each time.
# ceph osd tree | grep -i down
# id    weight  type name      up/down reweight
359     2.73    osd.359        down    1
408     2.7     osd.408        down    1

# ceph osd tree | grep -i down
# id    weight  type name      up/down reweight
334     2.73    osd.334        down    1
327     2.73    osd.327        down    1
366     2.73    osd.366        down    1
405     2.7     osd.405        down    1

# ceph osd tree | grep -i down
# id    weight  type name      up/down reweight
11      2.73    osd.11         down    1
18      2.73    osd.18         down    1
327     2.73    osd.327        down    1
352     2.63    osd.352        down    1
310     2.73    osd.310        down    1
372     2.73    osd.372        down    1
394     2.63    osd.394        down    1

# ceph osd tree | grep -i down
# id    weight  type name      up/down reweight
32      2.73    osd.32         down    1
337     2.73    osd.337        down    1
343     2.73    osd.343        down    1
353     2.73    osd.353        down    1
362     2.73    osd.362        down    1
301     2.73    osd.301        down    1

# ceph osd tree | grep -i down
# id    weight  type name      up/down reweight
362     2.73    osd.362        down    1

# ceph osd tree | grep -i down
# id    weight  type name      up/down reweight
47      2.73    osd.47         down    1
395     2.73    osd.395        down    1
403     2.7     osd.403        down    1
401     0.09    osd.401        down    1

Every 2.0s: ceph -s                                 Mon Aug 25 11:42:53 2014

    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 6 pgs degraded; 33 pgs peering; 104 pgs stale; recovery 1/649224 objects degraded (0.000%); 1/409 in osds are down
     monmap e3: 3 mons at , election epoch 346, quorum 0,1,2
     mdsmap e14: 1/1/1 up {0= =up:active}
     osdmap e412466: 409 osds: 408 up, 409 in
      pgmap v1163583: 30912 pgs, 22 pools, 7512 GB data, 298 kobjects
            38572 GB used, 1283 TB / 1320 TB avail
            1/649224 objects degraded (0.000%)
               30769 active+clean
                  33 peering
                 104 stale+active+clean
                   6 active+degraded

    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 22 pgs degraded; 48 pgs peering; 103 pgs stale; 3/409 in osds are down
     monmap e3: 3 mons at , election epoch 346, quorum 0,1,2
     mdsmap e14: 1/1/1 up {0= =up:active}
     osdmap e412509: 409 osds: 406 up, 409 in
      pgmap v1163641: 30912 pgs, 22 pools, 7512 GB data, 298 kobjects
            38576 GB used, 1283 TB / 1320 TB avail
               30749 active+clean
                  43 peering
                  93 stale+active+clean
                   5 stale+peering
                  17 active+degraded
                   5 stale+active+degraded

    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 16 pgs degraded; 1 pgs incomplete; 66 pgs peering; 207 pgs stale; 4/409 in osds are down
     monmap e3: 3 mons at {}, election epoch 346, quorum 0,1,2
     mdsmap e14: 1/1/1 up {0= =up:active}
     osdmap e412541: 409 osds: 405 up, 409 in
      pgmap v1163677: 30912 pgs, 22 pools, 7512 GB data, 298 kobjects
            38580 GB used, 1283 TB / 1320 TB avail
               30631 active+clean
                  58 peering
                 198 stale+active+clean
                   8 stale+peering
                  15 active+degraded
                   1 incomplete
                   1 stale+active+degraded
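For reference, snapshots like these can be taken by polling the cluster on a short interval; a minimal sketch, assuming the 3-second interval described above (flapping-osds.log is a hypothetical file name):

    # One-off interactive view, refreshed every 3 seconds:
    watch -n 3 'ceph osd tree | grep -i down'

    # Or keep a timestamped record for later correlation with OSD logs:
    while true; do
        date
        ceph osd tree | grep -i down
        sleep 3
    done | tee flapping-osds.log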
I have also captured an OSD log with debug osd = 20 for one of the OSDs. This might help you in locating the issue. Logs are attached here.
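For anyone trying to capture the same data, the OSD debug level can be raised either persistently in ceph.conf or at runtime; a minimal sketch (osd.359 is just an example ID taken from the output above):

    # Persistent: add to the [osd] section of ceph.conf, then restart the daemon.
    [osd]
        debug osd = 20

    # Runtime: inject the setting into a running daemon without a restart.
    ceph tell osd.359 injectargs '--debug-osd 20'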
Updated by karan singh over 9 years ago
You can close this case; the problem has been solved after applying the fix (0.80.5-1-gc4b77d2).
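For anyone applying the same fix, a quick way to confirm which build each daemon is actually running is a sketch like the following (osd.0 is just an example ID):

    ceph --version           # version of the local ceph binaries
    ceph tell osd.0 version  # version the running osd.0 daemon reports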
Updated by Wang Qiang over 9 years ago
karan singh wrote:
You can close this case; the problem has been solved after applying the fix (0.80.5-1-gc4b77d2).
May I know where I can find the fix (0.80.5-1-gc4b77d2)?