Bug #5517

closed

osd: stuck peering on cuttlefish

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Support
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Settings:
- paxos propose interval = 1
- debug ms = 1
- debug osd = 20
- debug mon = 20

Log:
11:45: start test
11:45: setting logging to debug levels
11:45: ceph osd down osd.1
11:46: slow requests begin to queue up
11:46: osd.1 comes back up
11:47: 10 other OSDs go down (see note 2 for the specific OSDs).
11:47: set "nodown" to prevent more OSDs from being marked down.
11:48: OSDs slowly coming back up.
11:53: all OSDs back up.
11:55: 17 PGs stuck inactive, peering (see note 1 for the PG IDs).
11:58: 17 PGs still stuck.
11:58: marking osd.227 down to kick the stuck peering.
11:59: all PGs OK.
11:59: no more slow requests.
11:59: reset all logging levels to defaults.
12:00: unset nodown.
12:00: test done

Only one OSD (osd.227) had to be marked down to clear the stuck peering, and only 17 PGs were affected. Fewer OSDs were marked down as well.

All of the marked-down OSDs were in two racks; the third rack was unaffected (it has no primaries, IIRC).

1) stuck PGs:
pg 16.2488 is stuck inactive for 520.607318, current state peering, last acting [282,227,114]
pg 16.5c41 is stuck inactive for 520.295324, current state peering, last acting [35,227,120]
pg 16.3209 is stuck inactive for 520.627714, current state peering, last acting [313,227,108]
pg 16.284 is stuck inactive for 519.933286, current state peering, last acting [55,227,335]
pg 16.1ce0 is stuck inactive for 520.166106, current state peering, last acting [6,227,113]
pg 14.286 is stuck inactive for 519.930437, current state peering, last acting [55,227,335]
pg 16.5565 is stuck inactive for 520.654965, current state peering, last acting [29,227,80]
pg 16.2521 is stuck inactive for 520.628269, current state peering, last acting [36,227,339]
pg 16.217b is stuck inactive for 520.610316, current state peering, last acting [36,227,138]
pg 16.30e is stuck inactive for 594.718988, current state peering, last acting [227,21,298]
pg 16.3358 is stuck inactive for 519.745224, current state peering, last acting [6,227,72]
pg 16.53a3 is stuck inactive for 520.183747, current state peering, last acting [6,227,332]
pg 16.3f59 is stuck inactive for 520.632270, current state peering, last acting [282,227,294]
pg 16.24ae is stuck inactive for 596.909301, current state peering, last acting [227,270,315]
pg 16.2f5b is stuck inactive for 596.899636, current state peering, last acting [227,18,336]
pg 16.286c is stuck inactive for 520.177838, current state peering, last acting [6,227,142]
pg 16.5e12 is stuck inactive for 690.029414, current state peering, last acting [227,18,110]
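
Every stuck PG listed above has osd.227 in its acting set, which is why kicking that one OSD cleared them. A minimal sketch of how that could be checked mechanically, assuming the line format shown in this report; the regex and the helper names `parse_stuck` and `osd_frequency` are illustrative, not part of any Ceph tool:

```python
import re
from collections import Counter

# Pattern for the "stuck inactive" lines as they appear in this report.
STUCK_RE = re.compile(
    r"pg (?P<pgid>\S+) is stuck inactive for (?P<secs>[\d.]+), "
    r"current state (?P<state>[^,]+), last acting \[(?P<acting>[\d,]+)\]"
)

def parse_stuck(lines):
    """Yield (pgid, seconds_stuck, acting_osds) for each matching line."""
    for line in lines:
        m = STUCK_RE.search(line)
        if m:
            acting = [int(x) for x in m.group("acting").split(",")]
            yield m.group("pgid"), float(m.group("secs")), acting

def osd_frequency(lines):
    """Count how many stuck PGs each OSD appears in."""
    counts = Counter()
    for _pgid, _secs, acting in parse_stuck(lines):
        counts.update(acting)
    return counts

# Three of the lines from the list above, used as sample input.
sample = [
    "pg 16.2488 is stuck inactive for 520.607318, current state peering, last acting [282,227,114]",
    "pg 16.30e is stuck inactive for 594.718988, current state peering, last acting [227,21,298]",
    "pg 16.5e12 is stuck inactive for 690.029414, current state peering, last acting [227,18,110]",
]

if __name__ == "__main__":
    for osd, n in osd_frequency(sample).most_common(3):
        print(f"osd.{osd}: in {n} stuck PG(s)")
```

Run against the full list, an OSD whose count equals the number of stuck PGs is the natural candidate to mark down.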

2) down OSDs (other than the administratively marked-down osd.1):
osd.227
osd.243
osd.91
osd.103
osd.172
osd.208
osd.246
osd.253
osd.101
osd.103
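
The list above has ten entries, matching the count in the log, but osd.103 appears twice, so it names only nine distinct OSDs. A small sketch for catching that kind of duplicate when transcribing such lists; the helper name `duplicates` is hypothetical:

```python
from collections import Counter

# The down-OSD list exactly as given in this report.
down = [
    "osd.227", "osd.243", "osd.91", "osd.103", "osd.172",
    "osd.208", "osd.246", "osd.253", "osd.101", "osd.103",
]

def duplicates(names):
    """Return the names that occur more than once, sorted."""
    return sorted(n for n, c in Counter(names).items() if c > 1)

if __name__ == "__main__":
    print("entries:", len(down))
    print("distinct:", len(set(down)))
    print("duplicated:", duplicates(down))
```

Whether the second osd.103 stands in for a different OSD cannot be determined from the report itself.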

logs are cephdropped, */*_peering-kick-test_2.tar.gz


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #5655: Slow requests for 1h30 "currently waiting for missing objects" (Resolved, 07/17/2013)
