Bug #9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk
Description
Back in March, there was a report from Craig Lewis on the users list that mentioned several OSDs going to 100% CPU for no apparent reason, spinning and not responding to heartbeats until they were marked down by their peers, only to reappear a few minutes later, putting the cluster in a state of perpetual degradation and recovery:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg08355.html
The user at the time reported that removing all snapshots in the cluster and restarting the affected OSDs made the issue go away. Thus, no-one ever got to the bottom of it. That report was for Emperor, and I am now seeing the same problem in a Dumpling (0.67.10) cluster with 4 OSD nodes and a total of 92 OSDs (not exactly an optimal OSD-to-node ratio, but it shouldn't cause the cluster to go haywire the way it does).
We see CPU-hogging OSDs spinning and spewing these messages in the logs:
2014-09-16 23:38:52.383297 7f59b5ff0780 5 osd.85 pg_epoch: 88246 pg[5.f55(unlocked)] enter Initial
2014-09-16 23:38:52.693380 7f59b5ff0780 5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] exit Initial 0.310083 0 0.000000
2014-09-16 23:38:52.693393 7f59b5ff0780 5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] enter Reset
2014-09-16 23:38:52.693952 7f59b5ff0780 5 osd.85 pg_epoch: 88246 pg[5.f77(unlocked)] enter Initial
2014-09-16 23:38:52.963101 7f59b5ff0780 5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] exit Initial 0.269148 0 0.000000
2014-09-16 23:38:52.963115 7f59b5ff0780 5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] enter Reset
The OSDs then settle into a state where they log this (and are still at 100%, getting randomly marked out and in):
2014-09-16 23:44:18.329467 7f59340e5700 5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
2014-09-16 23:44:21.231029 7f59340e5700 5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
The first set of log messages superficially looks similar to issue #6565, which, per http://ceph.com/releases/v0-67-9-dumpling-released/, should have been fixed in 0.67.9. This system is running 0.67.10, which leads me to tentatively rule that bug out (unless the fix broke something).
Just like the issue reported by Craig, this system has lots of snapshots (RBD, all using image format 1, because the intended primary use case is kernel-mapped RBD volumes). Unlike Craig's system, it runs no radosgw.
Shutting down the OSDs and marking them out serves no purpose other than making the issue promptly reappear somewhere else (just as Craig reported in March).
If I can provide any additional information to get to the bottom of this (as it's apparently unresolved at least in current Dumpling), please let me know.
Related issues
History
#1 Updated by Florian Haas about 9 years ago
- Description updated (diff)
#2 Updated by Florian Haas about 9 years ago
Added issue #9487 as possibly related.
#3 Updated by Sage Weil about 9 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to High
Hi Florian-
Can you generate some OSD logs (debug ms = 1, debug osd = 20) and attach them to the bug? The messages you quote are (in and of themselves) perfectly normal. (If there are too many of the first sort that would just imply too many PGs, but I'm not sure how many 'spewing' implies. :) Probably a single OSD is enough if it is pegging the CPU.
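(For reference, a hedged sketch of how such debug levels are usually captured, assuming admin access to the cluster; osd.85 is just the example OSD from the logs above. They can be injected at runtime or set in ceph.conf before restarting the daemon:

ceph tell osd.85 injectargs '--debug-osd 20 --debug-ms 1'

[osd]
    debug osd = 20
    debug ms = 1
)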
#4 Updated by Florian Haas about 9 years ago
Sage, I do have logs (debug osd=20, though not debug ms=1), but after the discussion with Dan on the -devel list, this does look very much like a dup of #9487. Should I attach them there?
#5 Updated by Sage Weil about 9 years ago
- Status changed from Need More Info to Duplicate
Florian Haas wrote:
Sage, I do have logs (debug osd=20, though not debug ms=1), but after the discussion with Dan on the -devel list, this does look very much like a dup of #9487. Should I attach them there?
Yeah, I agree. No need for the logs just yet; I should be able to reproduce this pretty easily.
#6 Updated by Florian Haas about 9 years ago
Sage, a log is at https://www.dropbox.com/s/f2xyx12y2zr7fid/ceph-osd.14.log.xz -- behold the awesomeness of xz; that file is a whopping 661M uncompressed (generated in just under 4 minutes of "debug osd=20").
Note that snap trim sleep is 30s, as you'll see in the one "snap_trimmer slept" line in there; this was deliberately cranked up to an insane level to see if it would make any meaningful difference to the problem at hand. It did not.
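(For context, the throttle referred to above is the osd snap trim sleep option, whose stock default is 0, i.e. no pause between trim operations. The 30-second value mentioned would correspond to a ceph.conf fragment roughly like the following; this is just an illustration of the setting, not a recommendation:

[osd]
    osd snap trim sleep = 30
)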
#7 Updated by Sage Weil about 9 years ago
- Status changed from Duplicate to Need More Info
- Assignee set to Sage Weil
- Priority changed from High to Urgent
#8 Updated by Sage Weil about 9 years ago
That log shows me PGs with a huge snap_trimq, which is very unfriendly to the snap trimmer. I've added Dan's patch on top of mine as it mitigates that issue. I think with that applied you should avoid the worst of the flapping. I set the default to 16.
But what I don't see in the log is how the PG comes to have that ginormous snap_trimq. Its log is (86457'233235,87838'238295], so it can't be pure log recovery unless there is some weirdness there. My guess is that this PG got backfilled prior to my wip-9487 patch, and now that it exists it has this huge set to churn through; Dan's patch should let it do so without thrashing too horribly.
Perhaps once the existing bungled PGs have gone through having their snaps expired and are in good condition, we can look to see if any new ones start behaving badly and make sure we have all the needed logs to understand why/how that happens (if it does). I'm cautiously optimistic, though, that this is addressed between the two patches...
#9 Updated by Florian Haas about 9 years ago
Thanks a lot! I'll report back once there is an update to share.
#10 Updated by Florian Haas about 9 years ago
OK, that seems to have done it. After installing the updated autobuild with Dan's patch and keeping the snap_trim limit at its default of 16, OSD CPU utilization immediately went back to normal, enabling OSDs to respond in time to ping requests. OSD flapping thus ceased instantly.
Trimming the remaining snaps also seems to have completed much faster than at the prior snail's pace.
Is there a global way to check if any snap trimming is still underway? Or can it be assumed that if no OSDs are peering, recovering, or backfilling anymore, that all snap trimming has been completed?
#11 Updated by Sage Weil about 9 years ago
- Status changed from Need More Info to Fix Under Review
#12 Updated by Florian Haas about 9 years ago
Whoa. While a cluster with this patch applied doesn't spin like crazy in snap_trim anymore, killing an OSD seems to immediately lead to incomplete PGs. Here's a pg query from one of them:
# ceph pg 5.6c query
{ "state": "incomplete",
  "epoch": 116183,
  "up": [38],
  "acting": [38],
  "info": { "pgid": "5.6c",
      "last_update": "86058'151386",
      "last_complete": "86058'151386",
      "log_tail": "75975'145637",
      "last_backfill": "MAX",
      "purged_snaps": "[1~1402,1405~1,1408~a3,14ad~1,14b0~95,1547~1,154a~9,1555~1,1558~a3,15fd~1,1600~a3,16a5~1,16a8~a3,174d~1,1750~a3,17f5~1,17f8~2a7,1aa1~1,1aa4~128e,2d33~1,2d36~1,2d38~a2,2ddb~1,2dde~1,2de0~a2,2e83~1,2e86~1,2e88~a2,2f2b~1,2f2e~1,2f30~a2,2fd3~1,2fd6~1,2fd8~a2,307b~1,307e~1,3080~a2,3123~1,3126~1,3128~a2,31cb~1,31ce~1,31d0~a1,3272~1,3275~a3,3319~1,331c~1,331e~a2,33c2~2,33c6~a2,346a~2,346e~a2,3512~2,3516~a2,35ba~2,35be~94,3654~2,3658~8,3662~2,3666~a2,370a~2,370e~a2,37b2~2,37b6~a2,385a~2,385e~43,38a2~5e,3902~2,3906~a2,39aa~2,39ae~64,3a13~2,3a16~8,3a1f~1,3a21~33,3a56~2,3a5a~9f,3afb~2,3aff~22,3b22~53,3b76~2b,3ba3~2,3ba7~6,3bae~9b,3c4b~2,3c4f~94,3ce5~2,3ce9~8,3cf3~2,3cf7~a2,3d9b~2,3d9f~a2,3e43~2,3e47~a2,3eeb~2,3eef~a2,3f93~2,3f97~a2,403b~2,403f~a2,40e3~2,40e7~a2,418b~2,418f~a2,4233~2,4237~a2,42db~2,42df~a2,4383~2,4387~a3,442b~2,442f~3d2,4805~a4,48ad~a4,4955~a4,49fd~a4,4aa5~a,4ab3~3,4aba~3,4ac1~3,4ac8~3,4acf~3,4ad6~3,4add~3,4ae4~3,4aeb~3,4af2~3,4af9~3,4b00~3,4b07~3,4b0e~3,4b15~3,4b1c~3,4b23~3,4b2a~3,4b31~3,4b38~3,4b3f~3,4b46~3,4b4d~3,4b54~3,4b5b~3,4b62~3,4b69~3,4b70~3,4b77~3,4b7e~3,4b85~3,4b8c~3,4b93~3,4b9a~3,4ba1~3,4ba8~3,4baf~3,4bb6~3,4bbd~3,4bc4~3,4bcb~3,4bd2~3,4bd9~3,4be0~3,4be7~3,4bee~3,4bf5~3,4bfc~3,4c03~3,4c0a~3,4c11~3,4c18~3,4c1f~3,4c26~3,4c2d~3,4c34~3,4c3b~3,4c42~3,4c49~3,4c50~3,4c57~3,4c5e~3,4c65~3,4c6c~3,4c73~3,4c7a~3,4c81~3,4c88~3,4c8f~3,4c96~3,4c9d~3,4ca4~3,4cab~3,4cb2~3,4cb9~3,4cc0~3,4cc7~3,4cce~3,4cd5~3,4cdc~3,4ce3~3,4cea~3,4cf1~3,4cf8~3,4cff~3,4d06~3,4d0d~3,4d14~3,4d1b~3,4d22~3,4d29~3,4d30~3,4d37~3,4d3e~3,4d45~3,4d4c~3,4d53~3,4d5a~3,4d61~3,4d68~3,4d6f~3,4d76~3,4d7d~3,4d84~3,4d8b~3,4d92~3,4d99~3,4da0~3,4da7~3,4dae~3,4db5~3,4dbc~3,4dc3~3,4dca~3,4dd1~3,4dd8~3,4ddf~3,4de6~3,4ded~3,4df4~3,4dfb~3,4e02~3,4e09~3,4e10~3,4e17~3,4e1e~3,4e25~3,4e2c~3,4e33~3,4e3a~3,4e41~3,4e48~3,4e4f~3,4e56~3,4e5d~3,4e64~3,4e6b~c5,4f31~1,4f33~1,4f36~cc,5003~1,5005~1,5008~f4,5100~3,5107~3,510e~f,5121~3,5128~16,5142~2f,5175~23,5199~3,519d~7,51a6~7,51ae~1,51b9~1760,691d~424,6d43~513,7258~1,725b~3,7263~1,7269~2,726d~1,7270~2,7274~1,7277~2,727b~1,727e~2,7282~1,7285~2,7289~1,728c~2,7290~1,7293~2,7297~1,729a~2,729e~1,72a1~2,72a5~1,72a8~2,72ac~1,72af~2,72b3~1,72b6~6,72be~3,72c5~1,72d0~1,72de~1,72e1~2,72e8~1,72ef~1,72f2~91]",
      "history": { "epoch_created": 2992, "last_epoch_started": 116069, "last_epoch_clean": 116069, "last_epoch_split": 0, "same_up_since": 116182, "same_interval_since": 116182, "same_primary_since": 116068, "last_scrub": "86058'151386", "last_scrub_stamp": "2014-09-22 17:42:25.962548", "last_deep_scrub": "86058'151386", "last_deep_scrub_stamp": "2014-09-22 17:42:25.962548", "last_clean_scrub_stamp": "2014-09-22 17:42:25.962548"},
      "stats": { "version": "86058'151386", "reported_seq": "623325", "reported_epoch": "116183", "state": "incomplete", "last_fresh": "2014-09-23 15:01:11.330263", "last_change": "2014-09-23 15:01:10.521895", "last_active": "2014-09-23 14:55:12.428301", "last_clean": "2014-09-23 14:55:12.428301", "last_became_active": "0.000000", "last_unstale": "2014-09-23 15:01:11.330263", "mapping_epoch": 116068, "log_start": "75975'145637", "ondisk_log_start": "75975'145637", "created": 2992, "last_epoch_clean": 116069, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "86058'151386", "last_scrub_stamp": "2014-09-22 17:42:25.962548", "last_deep_scrub": "86058'151386", "last_deep_scrub_stamp": "2014-09-22 17:42:25.962548", "last_clean_scrub_stamp": "2014-09-22 17:42:25.962548", "log_size": 5749, "ondisk_log_size": 5749, "stats_invalid": "0",
          "stat_sum": { "num_bytes": 1557106688, "num_objects": 601, "num_object_clones": 332, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_degraded": 0, "num_objects_unfound": 0, "num_read": 414452, "num_read_kb": 11250984, "num_write": 106491, "num_write_kb": 3138432, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 1, "num_bytes_recovered": 3252224, "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [38],
          "acting": [38]},
      "empty": 0, "dne": 0, "incomplete": 0, "last_epoch_started": 116069},
  "recovery_state": [
      { "name": "Started\/Primary\/Peering",
        "enter_time": "2014-09-23 15:01:10.521734",
        "past_intervals": [ { "first": 116068, "last": 116181, "maybe_went_rw": 1, "up": [38,18], "acting": [38,18]}],
        "probing_osds": [38],
        "down_osds_we_would_probe": [18],
        "peering_blocked_by": []},
      { "name": "Started",
        "enter_time": "2014-09-23 15:01:10.521677"}]}
In the instance where I saw this, the cluster was left untouched for 5 minutes, and after the mon osd down out interval expired, the cluster started recovering/backfilling normally. But I'm still concerned about the PGs being marked incomplete, albeit temporarily.
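(For context, the 5-minute wait mentioned above matches the stock default of the mon osd down out interval option, 300 seconds, which controls how long a down OSD is left before it is automatically marked out. As a plain illustration of the setting:

[mon]
    mon osd down out interval = 300
)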
Looks like something is fishy with the fix, still.
#13 Updated by Samuel Just about 9 years ago
https://github.com/ceph/ceph/pull/2525
The num_trimmed does not seem to be reset. I think you are not trimming at all after the initial trims.
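(To illustrate the failure mode being described, a minimal sketch only and not the actual Ceph code; the names SnapTrimmer, num_trimmed, max_trims, and snap_trimq just mirror the discussion. If the per-pass counter that enforces the trim cap is never reset, the cap is hit once and every later pass trims nothing:

// C++11 sketch of the bug class: a per-pass work counter that must be reset.
#include <deque>
#include <cstdio>

struct SnapTrimmer {
  std::deque<int> snap_trimq;   // snapshots queued for trimming
  int num_trimmed = 0;          // work done in the *current* pass
  int max_trims = 16;           // per-pass cap (the "limit ... default of 16")

  // One trim pass, e.g. run between client I/O so trimming can't hog the CPU.
  void trim_pass() {
    // The bug being described: without this reset, num_trimmed keeps its value
    // from earlier passes, the loop condition below is immediately false, and
    // no further snapshots are ever trimmed.
    num_trimmed = 0;            // the fix: start each pass from zero

    while (!snap_trimq.empty() && num_trimmed < max_trims) {
      snap_trimq.pop_front();   // stand-in for trimming one snapshot
      ++num_trimmed;
    }
    std::printf("trimmed %d, %zu left in queue\n",
                num_trimmed, snap_trimq.size());
  }
};

int main() {
  SnapTrimmer t;
  for (int i = 0; i < 40; ++i) t.snap_trimq.push_back(i);
  while (!t.snap_trimq.empty()) t.trim_pass();  // 40 snaps -> 3 capped passes
}
)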
#14 Updated by Samuel Just about 9 years ago
- Status changed from Fix Under Review to 12
#15 Updated by Florian Haas about 9 years ago
- Subject changed from Dumpling: running many RBDs with many snapshots makes OSDs go berserk to Dumpling: removing many snapshots in a short time makes OSDs go berserk
I'm not 100% sure, so I thought I'd ask: what's the exact reason for the PG being marked incomplete here? Is it the failure of a recovering/backfilled PG to reconstruct its snap_trim history, or something else?
#16 Updated by Sage Weil about 9 years ago
- Status changed from 12 to 7
- Assignee changed from Sage Weil to Samuel Just
fix is in wip-9487 and wip-sam-testing
#17 Updated by Florian Haas almost 9 years ago
Please disregard #15. I just fell victim to inaccurate documentation about the incomplete PG state.
Sam's hunch was correct in that Dan's original patch just ceased trimming after the initial batch. With the updated wip-9487-dumpling packages deployed, OSDs started getting busy snap trimming again, and several OSDs were at 100% CPU for several minutes, but never stopped responding to heartbeats in the process and slept according to osd snap trim sleep. After a few minutes, with snap trimming completed, OSDs were just humming along nicely.
So it looks like this is fixed now. Sam, please feel free to add my Tested-By to the commit.
Scratch that. I was still using Dan's old patch. wip-9487-dumpling seems to have disappeared from the repos in http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-notcmalloc/ref/, although there is now a disconcertingly named wip-9487-dumpling-forcehack. Could you fill me in on what's up with that?
#18 Updated by Samuel Just almost 9 years ago
- Status changed from 7 to Fix Under Review
#19 Updated by Dan van der Ster almost 9 years ago
Hi Sam,
I think this is fixed in master/giant, correct? Just a gentle reminder that we'd appreciate a backport in dumpling.
Cheers, Dan
#20 Updated by Florian Haas almost 9 years ago
I believe the fact that the commit message for 255b430a87201c7d0cf8f10a3c1e62cbe8dd2d93 said Backfill where it should say Backport may have caught the odd QA script off guard. :)
#21 Updated by Samuel Just almost 9 years ago
#22 Updated by Samuel Just almost 9 years ago
- Status changed from Fix Under Review to Pending Backport
#23 Updated by Christopher Thorjussen almost 9 years ago
Any news on the backport?
#24 Updated by Samuel Just almost 9 years ago
- Status changed from Pending Backport to Resolved
#25 Updated by Florian Haas almost 9 years ago
Sam, is it correct to assume that this was fixed for dumpling in 1be9476afb9f715502a14749dd44e08371535b54, and that it's pending release in 0.67.12?