Bug #9503 (closed)

Dumpling: removing many snapshots in a short time makes OSDs go berserk

Added by Florian Haas over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Back in March, Craig Lewis reported on the users list that several OSDs were going to 100% CPU for no apparent reason, spinning and not responding to heartbeats until they were marked down by their peers, only to reappear a few minutes later, putting the cluster in a state of perpetual degradation and recovery:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg08355.html

At the time, the user reported that removing all snapshots in the cluster and restarting the affected OSDs made the issue go away, so no one ever got to the bottom of it. That report was for Emperor, and I am now seeing the same problem on a Dumpling (0.67.10) cluster with 4 OSD nodes and a total of 92 OSDs (not exactly an optimal OSD-to-node ratio, but it shouldn't cause the cluster to go haywire the way it does).

We see the OSDs on the CPU-hogging hosts spinning and spewing these messages in the logs:

2014-09-16 23:38:52.383297 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55(unlocked)] enter Initial
2014-09-16 23:38:52.693380 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] exit Initial 0.310083 0 0.000000
2014-09-16 23:38:52.693393 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] enter Reset
2014-09-16 23:38:52.693952 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77(unlocked)] enter Initial
2014-09-16 23:38:52.963101 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] exit Initial 0.269148 0 0.000000
2014-09-16 23:38:52.963115 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] enter Reset

The OSDs then settle into a state where they log this (and are still at 100% CPU, getting randomly marked out and in):

2014-09-16 23:44:18.329467 7f59340e5700  5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
2014-09-16 23:44:21.231029 7f59340e5700  5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
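When an OSD is stuck in this state, the usual way to peek at what it is doing without restarting the daemon is its admin socket. A minimal sketch, using osd.85 from the logs above as the example and assuming the default socket path:

# in-flight ops, internal perf counters, and the snap-trim settings currently in effect
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok config show | grep snap_trim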

The first set of log messages superficially looks similar to issue #6565, which, per http://ceph.com/releases/v0-67-9-dumpling-released/, should have been fixed in 0.67.9. This system is running 0.67.10, which leads me to tentatively rule that bug out (unless the fix broke something).
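For completeness, one way to confirm what the OSDs are actually running (in case a stray older binary were still in place) is to compare the running daemon's version against the installed binary; osd.85 is again only the example daemon and the socket path assumes the default location:

# version reported by the running daemon vs. the installed binary
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok version
ceph-osd --version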

Just like the system in Craig's report, this one has lots of snapshots (RBD, all using image format 1, because the intended primary use case is kernel-mapped RBD volumes). Unlike Craig's system, it runs no radosgw.
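To give an idea of the trigger, the "removing many snapshots in a short time" from the title boils down to something like the following batch of rbd commands; the pool, image and snapshot names here are made up for illustration, and the exact tooling on this cluster may differ:

# remove a batch of old snapshots from a format-1 image in quick succession
rbd snap ls rbd/vm-disk-1
for snap in nightly-2014-09-01 nightly-2014-09-02 nightly-2014-09-03; do
    rbd snap rm rbd/vm-disk-1@$snap
done
# or, more aggressively, drop every snapshot on the image at once
rbd snap purge rbd/vm-disk-1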

Shutting down the OSDs and marking them out serves no purpose other than making the issue promptly reappear somewhere else (just as Craig reported in March).
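For reference, the shutdown/mark-out cycle mentioned above is just the standard procedure, sketched here with osd.85 standing in for whichever daemon happens to be spinning and assuming the sysvinit scripts shipped with Dumpling:

# stop the spinning OSD and take it out of data placement
service ceph stop osd.85
ceph osd out 85
# later, bring it back and let it rejoin
service ceph start osd.85
ceph osd in 85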

If I can provide any additional information to get to the bottom of this (as it's apparently unresolved at least in current Dumpling), please let me know.
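One knob that comes up in this context, and in the related ticket listed below, is osd_snap_trim_sleep. For reference, it can be adjusted on a running OSD without a restart; the value here is only a placeholder, and per that ticket it may well not help:

# inject a snap-trim delay into a running OSD (repeat per OSD as needed)
ceph tell osd.85 injectargs '--osd_snap_trim_sleep 0.05'
# to persist it across restarts, the equivalent ceph.conf entry would be:
#   [osd]
#       osd snap trim sleep = 0.05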


Related issues (1 total: 0 open, 1 closed)

Related to Ceph - Bug #9487: dumpling: snaptrimmer causes slow requests while backfilling, osd_snap_trim_sleep not helping (Resolved, assigned to Samuel Just, 09/16/2014)

