Bug #8660 (Closed): pg in forever "down+peering" state

Added by Dmitry Smirnov almost 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 2 - major

Description

On 0.80.1, one PG has somehow been stuck in the "down+peering" state for a very long time (hours).

osd.4 repeatedly logs:

2014-06-25 20:28:08.168237 osd.4 [WRN] 156 slow requests, 3 included below; oldest blocked for > 4273.847136 secs 
2014-06-25 20:28:08.168243 osd.4 [WRN] slow request 3840.146110 seconds old, received at 2014-06-25 19:24:08.022030: osd_op(client.4977504.0:3520 10000117daf.00000002 [write 0~4194304] 16.906f7607 snapc 1=[] ondisk+write e41433) v4 currently waiting for blocked object 
2014-06-25 20:28:08.168248 osd.4 [WRN] slow request 3840.116339 seconds old, received at 2014-06-25 19:24:08.051801: osd_op(osd.4.41279:5848 10000117daf.00000002@snapdir [list-snaps] 14.906f7607 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg 
2014-06-25 20:28:08.168252 osd.4 [WRN] slow request 3840.116281 seconds old, received at 2014-06-25 19:24:08.051859: osd_op(osd.4.41279:5849 10000117daf.00000002 [copy-get max 8388608] 14.906f7607 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg 
2014-06-25 20:28:09.168543 osd.4 [WRN] 156 slow requests, 3 included below; oldest blocked for > 4274.847444 secs 
2014-06-25 20:28:09.168549 osd.4 [WRN] slow request 3840.240301 seconds old, received at 2014-06-25 19:24:08.928147: osd_op(client.4977504.0:3521 10000117daf.00000003 [write 0~4194304] 16.77a57787 snapc 1=[] ondisk+write e41433) v4 currently waiting for blocked object 
2014-06-25 20:28:09.168554 osd.4 [WRN] slow request 3840.210786 seconds old, received at 2014-06-25 19:24:08.957662: osd_op(osd.4.41279:5853 10000117daf.00000003@snapdir [list-snaps] 14.77a57787 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg 
2014-06-25 20:28:09.168559 osd.4 [WRN] slow request 3840.210729 seconds old, received at 2014-06-25 19:24:08.957719: osd_op(osd.4.41279:5854 10000117daf.00000003 [copy-get max 8388608] 14.77a57787 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg 
2014-06-25 20:28:32.173212 osd.4 [WRN] 156 slow requests, 3 included below; oldest blocked for > 4297.852107 secs 
2014-06-25 20:28:32.173219 osd.4 [WRN] slow request 3840.467853 seconds old, received at 2014-06-25 19:24:31.705258: osd_op(client.4977504.0:3578 10000117db7.00000010 [write 0~4194304] 16.944fdd87 snapc 1=[] ondisk+write e41433) v4 currently waiting for blocked object 
2014-06-25 20:28:32.173224 osd.4 [WRN] slow request 3840.433519 seconds old, received at 2014-06-25 19:24:31.739592: osd_op(osd.4.41279:5970 10000117db7.00000010@snapdir [list-snaps] 14.944fdd87 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg 
2014-06-25 20:28:32.173229 osd.4 [WRN] slow request 3840.433455 seconds old, received at 2014-06-25 19:24:31.739656: osd_op(osd.4.41279:5971 10000117db7.00000010 [copy-get max 8388608] 14.944fdd87 ack+read+ignore_cache+ignore_overlay+map_snap_clone e41433) v4 currently reached pg 

Apparently restarting OSDs and MONs does not help.
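
For reference, one way to dig into what these requests are blocked on is the OSD admin socket. A minimal sketch, assuming shell access to the node hosting osd.4 and the default socket path (the path is an assumption and may differ on this cluster):

# ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_ops_in_flight
# ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_historic_ops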

# ceph health detail | grep "peering" 

pg 14.7 is stuck inactive for 1556.618945, current state down+peering, last acting [6,3,2147483647,4]
pg 14.7 is stuck unclean for 242806.037018, current state down+peering, last acting [6,3,2147483647,4]
pg 14.7 is down+peering, acting [6,3,2147483647,4]

Any ideas?
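
For reference, 2147483647 is 0x7fffffff (CRUSH_ITEM_NONE), i.e. no OSD is currently mapped to that shard of the erasure-coded PG. A minimal sketch of further inspection, assuming the cluster is still reachable and pool 14 is the EC pool involved:

# ceph pg 14.7 query
# ceph pg dump_stuck inactive
# ceph osd tree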


Files

osdmap.txt (4.72 KB) - Dmitry Smirnov, 06/25/2014 04:56 AM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #8643: 0.80.1: OSD crash: osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to)) (Closed, Samuel Just, 06/22/2014)

#1 - Updated by Dmitry Smirnov almost 10 years ago

I've reset the "reweight" value for two OSDs to '1' and now
"sudo ceph pg map 14.7" shows:

osdmap e41511 pg 14.7 (14.7) -> up [6,3,2147483647,4] acting [2147483647,2147483647,2147483647,4]
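
For reference, the reweight reset described above would be along these lines, run once per affected OSD; the OSD ID below is a placeholder, not taken from this cluster:

# ceph osd reweight <osd-id> 1.0
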
#2 - Updated by Dmitry Smirnov almost 10 years ago

#3 - Updated by Samuel Just almost 10 years ago

Attach an actual osdmap (ceph osd getmap -o /tmp/map)
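
A minimal sketch of grabbing and decoding the map, assuming osdmaptool is installed alongside the ceph CLI:

# ceph osd getmap -o /tmp/map
# osdmaptool /tmp/map --print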

#4 - Updated by Samuel Just almost 10 years ago

  • Status changed from New to Closed

Also, this is almost certainly not a bug, but rather a consequence of #8643. We don't go active with fewer than M OSDs in an EC pool by design.
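
For context, a minimal sketch of checking how many shards (k+m) the pool uses and its min_size; the pool and profile names below are placeholders, not taken from this cluster:

# ceph osd erasure-code-profile ls
# ceph osd erasure-code-profile get <profile-name>
# ceph osd pool get <ec-pool-name> min_size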
