Bug #23562

VDO OSD caused cluster to hang

Added by David Galloway almost 6 years ago. Updated almost 6 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I awoke to alerts that Apache, which serves teuthology logs on the Octo Long Running Cluster, was unresponsive.

Here was the output of ceph health detail at the time:

[root@reesi001 ~]# ceph health detail
HEALTH_ERR 1 MDSs report slow requests; 1 MDSs behind on trimming; 180685/13878854 objects misplaced (1.302%); Reduced data availability: 61 pgs inactive; Degraded data redundancy: 27/13878854 objects degraded (0.000%), 97 pgs unclean; 18 stuck requests are blocked > 4096 sec; too many PGs per OSD (240 > max 200); clock skew detected on mon.reesi002, mon.reesi003
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsreesi002(mds.0): 19 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
    mdsreesi002(mds.0): Behind on trimming (62/30)max_segments: 30, num_segments: 62
OBJECT_MISPLACED 180685/13878854 objects misplaced (1.302%)
PG_AVAILABILITY Reduced data availability: 61 pgs inactive
    pg 0.2f is stuck inactive for 69086.531699, current state activating, last acting [36,55,47]
    pg 0.49 is stuck inactive for 69082.161062, current state activating, last acting [16,55,36]
    pg 0.58 is stuck inactive for 69081.825816, current state activating, last acting [41,6,55]
    pg 0.70 is stuck inactive for 69087.520599, current state activating, last acting [49,55,39]
    pg 0.c7 is stuck inactive for 69086.536857, current state activating, last acting [36,33,55]
    pg 1.d is stuck inactive for 69087.563064, current state activating+remapped, last acting [44,36]
    pg 1.ea is stuck inactive for 69082.202892, current state activating+remapped, last acting [32,42]
    pg 3.21 is stuck inactive for 69087.525779, current state activating+remapped, last acting [52,36,46]
    pg 3.24 is stuck inactive for 69087.534145, current state activating+remapped, last acting [34,36,29]
    pg 3.28 is stuck inactive for 69083.506971, current state activating+remapped, last acting [52,35,32]
    pg 3.2e is stuck inactive for 69085.352589, current state activating+remapped, last acting [20,32,43]
    pg 3.42 is stuck inactive for 69085.329833, current state activating+remapped, last acting [4,45,32]
    pg 3.8e is stuck inactive for 69087.523738, current state activating+remapped, last acting [35,51,45]
    pg 3.b3 is stuck inactive for 69086.347952, current state activating+remapped, last acting [17,36,22]
    pg 3.c9 is stuck inactive for 69081.189643, current state activating+remapped, last acting [32,33,49]
    pg 3.f9 is stuck inactive for 69081.652400, current state activating+remapped, last acting [51,35,44]
    pg 3.108 is stuck inactive for 69087.531481, current state activating+remapped, last acting [40,4,41]
    pg 4.cc is stuck inactive for 69085.869941, current state activating, last acting [32,49,55]
    pg 5.3f is stuck inactive for 69084.305265, current state activating, last acting [15,31,55]
    pg 5.40 is stuck inactive for 69087.525576, current state activating, last acting [35,36,55]
    pg 5.f0 is stuck inactive for 69085.861911, current state activating, last acting [32,49,55]
    pg 6.28 is stuck inactive for 69087.548323, current state activating, last acting [33,55,38]
    pg 6.6e is stuck inactive for 69086.632833, current state activating, last acting [41,55,2]
    pg 6.96 is stuck inactive for 69085.861346, current state activating, last acting [32,55,45]
    pg 7.e8 is stuck inactive for 69087.501763, current state activating, last acting [1,55,38]
    pg 8.b9 is stuck inactive for 69081.285142, current state activating, last acting [41,55,19]
    pg 8.ef is stuck inactive for 69085.862704, current state activating, last acting [32,41,55]
    pg 9.39 is stuck inactive for 69081.908739, current state activating, last acting [41,55,52]
    pg 9.b3 is stuck inactive for 69086.632898, current state activating, last acting [41,39,55]
    pg 10.3 is stuck inactive for 69084.119800, current state activating, last acting [47,51,55]
    pg 10.a3 is stuck inactive for 69086.618442, current state activating, last acting [27,52,55]
    pg 10.c4 is stuck inactive for 69082.414980, current state activating, last acting [41,5,55]
    pg 11.25 is stuck inactive for 69085.868402, current state activating, last acting [32,55,8]
    pg 11.b9 is stuck inactive for 69087.475809, current state activating, last acting [43,55,45]
    pg 11.bc is stuck inactive for 69087.530531, current state activating, last acting [46,55,31]
    pg 11.c1 is stuck inactive for 69086.528787, current state activating, last acting [36,44,55]
    pg 12.5c is stuck inactive for 69086.543621, current state activating, last acting [36,55,43]
    pg 13.5f is stuck inactive for 69086.545275, current state activating, last acting [36,43,55]
    pg 13.79 is stuck inactive for 69086.529816, current state activating, last acting [36,25,55]
    pg 13.93 is stuck inactive for 69087.525420, current state activating, last acting [46,36,55]
    pg 13.9a is stuck inactive for 69086.529206, current state activating, last acting [36,31,55]
    pg 13.ba is stuck inactive for 69086.528491, current state activating, last acting [36,55,2]
    pg 14.3d is stuck inactive for 69085.870543, current state activating, last acting [32,55,7]
    pg 14.b6 is stuck inactive for 69087.542690, current state activating, last acting [4,42,55]
    pg 14.bb is stuck inactive for 69085.866308, current state activating, last acting [32,55,49]
    pg 14.e8 is stuck inactive for 69084.304540, current state activating, last acting [50,55,41]
    pg 15.6d is stuck inactive for 69086.866207, current state activating, last acting [51,41,55]
    pg 15.7e is stuck inactive for 69087.513151, current state activating, last acting [21,55,44]
    pg 15.c6 is stuck inactive for 69085.871274, current state activating, last acting [32,55,42]
    pg 16.93 is stuck inactive for 69080.139292, current state activating, last acting [36,55,48]
    pg 16.bb is stuck inactive for 69087.554115, current state activating, last acting [42,31,55]
PG_DEGRADED Degraded data redundancy: 27/13878854 objects degraded (0.000%), 97 pgs unclean
    pg 0.2f is stuck unclean for 69087.602309, current state activating, last acting [36,55,47]
    pg 0.49 is stuck unclean for 69085.318339, current state activating, last acting [16,55,36]
    pg 0.58 is stuck unclean for 69083.174286, current state activating, last acting [41,6,55]
    pg 0.70 is stuck unclean for 69088.179608, current state activating, last acting [49,55,39]
    pg 0.c7 is stuck unclean for 69087.604167, current state activating, last acting [36,33,55]
    pg 1.d is stuck unclean for 69088.178888, current state activating+remapped, last acting [44,36]
    pg 1.27 is stuck unclean for 73475.490211, current state active+remapped+backfill_wait, last acting [38,26]
    pg 1.73 is stuck unclean for 69088.568249, current state active+remapped+backfill_wait, last acting [37,41]
    pg 1.7e is stuck unclean for 69085.165903, current state active+remapped+backfill_wait, last acting [39,35]
    pg 1.80 is stuck unclean for 69088.184806, current state active+remapped+backfill_wait, last acting [45,41]
    pg 1.88 is stuck unclean for 89948.691231, current state active+remapped+backfill_wait, last acting [16,29]
    pg 1.c6 is stuck unclean for 69088.113388, current state active+remapped+backfill_wait, last acting [29,51]
    pg 1.ea is stuck unclean for 69087.683923, current state activating+remapped, last acting [32,42]
    pg 3.d is stuck unclean for 152885.114063, current state active+remapped+backfill_wait, last acting [35,47,41]
    pg 3.21 is stuck unclean for 69088.180058, current state activating+remapped, last acting [52,36,46]
    pg 3.24 is stuck unclean for 69088.189194, current state activating+remapped, last acting [34,36,29]
    pg 3.28 is stuck unclean for 69085.003690, current state activating+remapped, last acting [52,35,32]
    pg 3.2c is stuck unclean for 69088.185507, current state active+remapped+backfill_wait, last acting [40,31,44]
    pg 3.2e is stuck unclean for 69088.205024, current state activating+remapped, last acting [20,32,43]
    pg 3.2f is stuck unclean for 128965.109584, current state active+remapped+backfill_wait, last acting [30,40,50]
    pg 3.42 is stuck unclean for 69088.117329, current state activating+remapped, last acting [4,45,32]
    pg 3.66 is stuck unclean for 198419.974676, current state active+remapped+backfill_wait, last acting [23,32,43]
    pg 3.c9 is stuck unclean for 72579.226145, current state activating+remapped, last acting [32,33,49]
    pg 3.d2 is stuck unclean for 141820.284901, current state active+remapped+backfill_wait, last acting [16,38,47]
    pg 3.d3 is stuck unclean for 69088.395546, current state active+remapped+backfill_wait, last acting [47,17,16]
    pg 3.de is stuck unclean for 69085.332127, current state active+remapped+backfill_wait, last acting [47,44,39]
    pg 3.ea is stuck unclean for 69088.188802, current state active+remapped+backfill_wait, last acting [16,47,29]
    pg 3.f3 is stuck unclean for 69088.189187, current state active+remapped+backfill_wait, last acting [34,21,40]
    pg 3.f9 is stuck unclean for 69083.271362, current state activating+remapped, last acting [51,35,44]
    pg 4.cc is stuck unclean for 69087.684507, current state activating, last acting [32,49,55]
    pg 5.3f is stuck unclean for 69085.336617, current state activating, last acting [15,31,55]
    pg 5.40 is stuck unclean for 69088.195862, current state activating, last acting [35,36,55]
    pg 5.f0 is stuck unclean for 69086.819557, current state activating, last acting [32,49,55]
    pg 6.28 is stuck unclean for 69088.211527, current state activating, last acting [33,55,38]
    pg 6.6e is stuck unclean for 69087.877877, current state activating, last acting [41,55,2]
    pg 7.e8 is stuck unclean for 69088.106506, current state activating, last acting [1,55,38]
    pg 8.ef is stuck unclean for 69086.819963, current state activating, last acting [32,41,55]
    pg 9.39 is stuck unclean for 69084.840375, current state activating, last acting [41,55,52]
    pg 10.3 is stuck unclean for 69085.330155, current state activating, last acting [47,51,55]
    pg 10.c4 is stuck unclean for 69087.879928, current state activating, last acting [41,5,55]
    pg 11.25 is stuck unclean for 69087.684435, current state activating, last acting [32,55,8]
    pg 11.c1 is stuck unclean for 69087.601843, current state activating, last acting [36,44,55]
    pg 12.5c is stuck unclean for 69087.606443, current state activating, last acting [36,55,43]
    pg 13.5f is stuck unclean for 69087.606380, current state activating, last acting [36,43,55]
    pg 13.79 is stuck unclean for 69087.602192, current state activating, last acting [36,25,55]
    pg 14.3d is stuck unclean for 69087.684653, current state activating, last acting [32,55,7]
    pg 14.e8 is stuck unclean for 69084.746248, current state activating, last acting [50,55,41]
    pg 15.6d is stuck unclean for 69087.814585, current state activating, last acting [51,41,55]
    pg 15.7e is stuck unclean for 69088.109070, current state activating, last acting [21,55,44]
    pg 15.c6 is stuck unclean for 69087.685004, current state activating, last acting [32,55,42]
    pg 16.93 is stuck unclean for 69080.149981, current state activating, last acting [36,55,48]
REQUEST_STUCK 18 stuck requests are blocked > 4096 sec
    10 ops are blocked > 67108.9 sec
    4 ops are blocked > 33554.4 sec
    3 ops are blocked > 16777.2 sec
    1 ops are blocked > 4194.3 sec
    osds 32,35,36,40,44,52 have stuck requests > 67108.9 sec
TOO_MANY_PGS too many PGs per OSD (240 > max 200)
MON_CLOCK_SKEW clock skew detected on mon.reesi002, mon.reesi003
    mon.reesi002 addr 10.8.130.102:6789/0 clock skew 1.60458s > max 0.05s (latency 0.000638645s)
    mon.reesi003 addr 10.8.130.103:6789/0 clock skew 1.60776s > max 0.05s (latency 0.000610284s)

I resolved the mon clock skew. No joy.
I restarted OSD 32. No joy.
I stopped recovery and backfill. No joy.

On a hunch, I stopped the VDO OSD I added yesterday and the cluster recovered. I've attached the VDO OSD's log. Something went wrong around 00:20 or before then.
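One way to check a hunch like this against the health output is to count how often each OSD id appears in the `last acting` sets of the stuck PGs. The following is only a rough sketch, not what was actually run here: the function name `tally_acting_osds` and the `SAMPLE` constant are made up for illustration, and the regular expression assumes the line format shown in the `ceph health detail` output above. The sample lines are copied from that output.

```python
import re
from collections import Counter

# Assumed line format, taken from the health output in this report:
#   pg 0.2f is stuck inactive for 69086.531699, current state activating, last acting [36,55,47]
PG_LINE = re.compile(
    r"pg (?P<pgid>\S+) is stuck (?P<kind>\w+) for [\d.]+, "
    r"current state (?P<state>\S+), last acting \[(?P<acting>[\d,]+)\]"
)


def tally_acting_osds(health_detail, kind="inactive"):
    """Count how often each OSD id appears in the acting sets of stuck PGs.

    `kind` selects which "stuck <kind>" lines to look at (e.g. "inactive"
    or "unclean"). This helper is hypothetical, not part of Ceph.
    """
    counts = Counter()
    for line in health_detail.splitlines():
        match = PG_LINE.search(line)
        if match and match.group("kind") == kind:
            counts.update(int(osd) for osd in match.group("acting").split(","))
    return counts


# A few lines copied from the output above; paste the full text for a real check.
SAMPLE = """\
    pg 0.2f is stuck inactive for 69086.531699, current state activating, last acting [36,55,47]
    pg 0.49 is stuck inactive for 69082.161062, current state activating, last acting [16,55,36]
    pg 0.58 is stuck inactive for 69081.825816, current state activating, last acting [41,6,55]
    pg 1.d is stuck inactive for 69087.563064, current state activating+remapped, last acting [44,36]
    pg 1.27 is stuck unclean for 73475.490211, current state active+remapped+backfill_wait, last acting [38,26]
"""

if __name__ == "__main__":
    for osd, n in tally_acting_osds(SAMPLE).most_common(3):
        print(f"osd.{osd}: {n} stuck-inactive PGs")
```

An OSD that shows up in most of the stuck acting sets is a reasonable first suspect, though `last acting` alone does not show the up set of remapped PGs, so the tally is a hint rather than proof.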

ceph-osd.55.log (408 KB) David Galloway, 04/05/2018 12:23 PM

History

#1 Updated by Greg Farnum almost 6 years ago

  • Project changed from Ceph to RADOS
  • Assignee set to Sage Weil
