Bug #8660
pg in forever "down+peering" state
Status: Closed
Priority: Normal
% Done: 0%
Source: Community (user)
Severity: 2 - major
Description
On 0.80.1, one PG is somehow stuck in the "down+peering" state for a very long time (hours).
OSD.4 repeatedly logs:
2014-06-25 20:28:08.168237 osd.4 [WRN] 156 slow requests, 3 included below; oldest blocked for > 4273.847136 secs
2014-06-25 20:28:08.168243 osd.4 [WRN] slow request 3840.146110 seconds old, received at 2014-06-25 19:24:08.022030: osd_op(client.4977504.0:3520 10000117daf.00000002 [write 0~4194304] 16.906f7607 snapc 1=[] ondisk+write e41433) v4 currently waiting for blocked object
2014-06-25 20:28:08.168248 osd.4 [WRN] slow request 3840.116339 seconds old, received at 2014-06-25 19:24:08.051801: osd_op(osd.4.41279:5848 10000117daf.00000002@snapdir [list-snaps] 14.906f7607 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg
2014-06-25 20:28:08.168252 osd.4 [WRN] slow request 3840.116281 seconds old, received at 2014-06-25 19:24:08.051859: osd_op(osd.4.41279:5849 10000117daf.00000002 [copy-get max 8388608] 14.906f7607 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg
2014-06-25 20:28:09.168543 osd.4 [WRN] 156 slow requests, 3 included below; oldest blocked for > 4274.847444 secs
2014-06-25 20:28:09.168549 osd.4 [WRN] slow request 3840.240301 seconds old, received at 2014-06-25 19:24:08.928147: osd_op(client.4977504.0:3521 10000117daf.00000003 [write 0~4194304] 16.77a57787 snapc 1=[] ondisk+write e41433) v4 currently waiting for blocked object
2014-06-25 20:28:09.168554 osd.4 [WRN] slow request 3840.210786 seconds old, received at 2014-06-25 19:24:08.957662: osd_op(osd.4.41279:5853 10000117daf.00000003@snapdir [list-snaps] 14.77a57787 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg
2014-06-25 20:28:09.168559 osd.4 [WRN] slow request 3840.210729 seconds old, received at 2014-06-25 19:24:08.957719: osd_op(osd.4.41279:5854 10000117daf.00000003 [copy-get max 8388608] 14.77a57787 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg
2014-06-25 20:28:32.173212 osd.4 [WRN] 156 slow requests, 3 included below; oldest blocked for > 4297.852107 secs
2014-06-25 20:28:32.173219 osd.4 [WRN] slow request 3840.467853 seconds old, received at 2014-06-25 19:24:31.705258: osd_op(client.4977504.0:3578 10000117db7.00000010 [write 0~4194304] 16.944fdd87 snapc 1=[] ondisk+write e41433) v4 currently waiting for blocked object
2014-06-25 20:28:32.173224 osd.4 [WRN] slow request 3840.433519 seconds old, received at 2014-06-25 19:24:31.739592: osd_op(osd.4.41279:5970 10000117db7.00000010@snapdir [list-snaps] 14.944fdd87 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg
2014-06-25 20:28:32.173229 osd.4 [WRN] slow request 3840.433455 seconds old, received at 2014-06-25 19:24:31.739656: osd_op(osd.4.41279:5971 10000117db7.00000010 [copy-get max 8388608] 14.944fdd87 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg
Apparently, restarting the OSDs and MONs does not help.
# ceph health detail | grep "peering"
pg 14.7 is stuck inactive for 1556.618945, current state down+peering, last acting [6,3,2147483647,4]
pg 14.7 is stuck unclean for 242806.037018, current state down+peering, last acting [6,3,2147483647,4]
pg 14.7 is down+peering, acting [6,3,2147483647,4]
Any ideas?
Updated by Dmitry Smirnov almost 10 years ago
I've reset the "reweight" value for two OSDs to '1', and now
"sudo ceph pg map 14.7" shows:
osdmap e41511 pg 14.7 (14.7) -> up [6,3,2147483647,4] acting [2147483647,2147483647,2147483647,4]
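For readers puzzled by the huge OSD id above: 2147483647 is not a real OSD. It is CRUSH_ITEM_NONE (0x7fffffff, i.e. INT32_MAX, defined in Ceph's src/crush/crush.h), the placeholder CRUSH emits when it cannot find an OSD to fill a slot of an erasure-coded PG's up/acting set. A minimal Python sketch of that interpretation (the `live_osds` helper is hypothetical, not Ceph code):

```python
# CRUSH_ITEM_NONE marks an acting-set slot that CRUSH could not map to any OSD.
# Its value comes from Ceph's src/crush/crush.h.
CRUSH_ITEM_NONE = 0x7FFFFFFF  # == 2147483647 == 2**31 - 1

def live_osds(acting):
    """Return only the real OSD ids from an up/acting set (hypothetical helper)."""
    return [osd for osd in acting if osd != CRUSH_ITEM_NONE]

# The sets reported by "ceph pg map 14.7" above:
print(live_osds([6, 3, 2147483647, 4]))                    # up set: 3 real OSDs
print(live_osds([2147483647, 2147483647, 2147483647, 4]))  # acting set: 1 real OSD
```

So after the reweight change, only one shard of pg 14.7 is actually mapped to an OSD.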
Updated by Samuel Just almost 10 years ago
Attach an actual osdmap (ceph osd getmap -o /tmp/map)
Updated by Samuel Just almost 10 years ago
- Status changed from New to Closed
Also, this is almost certainly not a bug, but rather a consequence of #8643. We don't go active with fewer than M OSDs in an EC pool, by design.
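The rule behind this behavior can be sketched as follows: in a k+m erasure-coded pool, each PG has k+m acting-set slots, at least k shards are needed to reconstruct the data at all, and the PG will not go active (serve I/O) while fewer than the pool's min_size shards are available. A hedged sketch under those assumptions (`can_go_active` is a hypothetical illustration, not Ceph's actual peering code):

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # placeholder for an unmapped acting-set slot

def can_go_active(acting, min_size):
    """Hypothetical check: an EC PG stays down/inactive unless at least
    min_size of its k+m acting-set slots are backed by real OSDs."""
    available = sum(1 for osd in acting if osd != CRUSH_ITEM_NONE)
    return available >= min_size

# pg 14.7's acting set from this ticket; with a 4-slot (e.g. k=3, m=1)
# profile, min_size cannot be below k=3, so one real OSD is not enough:
print(can_go_active([2147483647, 2147483647, 2147483647, 4], min_size=3))  # False
```

This matches the observed state: with three of four slots unmapped, the PG remains down+peering no matter how often the daemons are restarted.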