Bug #9503 (closed)

Dumpling: removing many snapshots in a short time makes OSDs go berserk

Added by Florian Haas over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Back in March, Craig Lewis reported on the users list that several OSDs were going to 100% CPU for no apparent reason, spinning and not responding to heartbeats until they were marked down by their peers, only to reappear a few minutes later, putting the cluster in a state of perpetual degradation and recovery:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg08355.html

At the time, the user reported that removing all snapshots in the cluster and restarting the affected OSDs made the issue go away, so no one ever got to the bottom of it. That report was for Emperor, and I am now seeing the same problem on a Dumpling (0.67.10) cluster with 4 OSD nodes and a total of 92 OSDs (not exactly an optimal OSD-to-node ratio, but it shouldn't cause the cluster to go haywire the way it does).

We see the OSDs on the CPU-hogging hosts spinning and spewing these messages in the logs:

2014-09-16 23:38:52.383297 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55(unlocked)] enter Initial
2014-09-16 23:38:52.693380 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] exit Initial 0.310083 0 0.000000
2014-09-16 23:38:52.693393 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] enter Reset
2014-09-16 23:38:52.693952 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77(unlocked)] enter Initial
2014-09-16 23:38:52.963101 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] exit Initial 0.269148 0 0.000000
2014-09-16 23:38:52.963115 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] enter Reset

The OSDs then settle into a state where they log this (and are still at 100% CPU, getting randomly marked out and in):

2014-09-16 23:44:18.329467 7f59340e5700  5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
2014-09-16 23:44:21.231029 7f59340e5700  5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
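When an OSD is stuck in this state, the usual way to peek at what it is doing without restarting the daemon is its admin socket. A minimal sketch, using osd.85 from the logs above as the example and assuming the default socket path:

# in-flight ops, internal perf counters, and the snap-trim settings currently in effect
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok config show | grep snap_trim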

The first set of log messages superficially looks similar to issue #6565, which, per http://ceph.com/releases/v0-67-9-dumpling-released/, should have been fixed in 0.67.9. This system is running 0.67.10, which leads me to tentatively rule that bug out (unless the fix broke something).
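For completeness, one way to confirm what the OSDs are actually running (in case a stray older binary were still in place) is to compare the running daemon's version against the installed binary; osd.85 is again only the example daemon and the socket path assumes the default location:

# version reported by the running daemon vs. the installed binary
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok version
ceph-osd --version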

Just like the system in Craig's report, this one has lots of snapshots (RBD, all using image format 1, because the intended primary use case is kernel-mapped RBD volumes). Unlike Craig's system, it runs no radosgw.
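To give an idea of the trigger, the "removing many snapshots in a short time" from the title boils down to something like the following batch of rbd commands; the pool, image and snapshot names here are made up for illustration, and the exact tooling on this cluster may differ:

# remove a batch of old snapshots from a format-1 image in quick succession
rbd snap ls rbd/vm-disk-1
for snap in nightly-2014-09-01 nightly-2014-09-02 nightly-2014-09-03; do
    rbd snap rm rbd/vm-disk-1@$snap
done
# or, more aggressively, drop every snapshot on the image at once
rbd snap purge rbd/vm-disk-1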

Shutting down the OSDs and marking them out serves no purpose other than making the issue promptly reappear somewhere else (just as Craig reported in March).
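For reference, the shutdown/mark-out cycle mentioned above is just the standard procedure, sketched here with osd.85 standing in for whichever daemon happens to be spinning and assuming the sysvinit scripts shipped with Dumpling:

# stop the spinning OSD and take it out of data placement
service ceph stop osd.85
ceph osd out 85
# later, bring it back and let it rejoin
service ceph start osd.85
ceph osd in 85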

If I can provide any additional information to get to the bottom of this (as it's apparently unresolved at least in current Dumpling), please let me know.
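One knob that comes up in this context, and in the related ticket listed below, is osd_snap_trim_sleep. For reference, it can be adjusted on a running OSD without a restart; the value here is only a placeholder, and per that ticket it may well not help:

# inject a snap-trim delay into a running OSD (repeat per OSD as needed)
ceph tell osd.85 injectargs '--osd_snap_trim_sleep 0.05'
# to persist it across restarts, the equivalent ceph.conf entry would be:
#   [osd]
#       osd snap trim sleep = 0.05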


Related issues (1 total: 0 open, 1 closed)

Related to Ceph - Bug #9487: dumpling: snaptrimmer causes slow requests while backfilling, osd_snap_trim_sleep not helping (Resolved, assigned to Samuel Just, 09/16/2014)

