Bug #9503

Dumpling: removing many snapshots in a short time makes OSDs go berserk

Added by Florian Haas over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Regression: No
Severity: 2 - major

Description

Back in March, there was a report from Craig Lewis on the users list that mentioned several OSDs going to 100% CPU for no apparent reason, spinning and not responding to heartbeats until they were marked down by their peers, only to reappear a few minutes later, putting the cluster into a state of perpetual degradation and recovery:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg08355.html

The user at the time reported that removing all snapshots in the cluster and restarting the affected OSDs made the issue go away. Thus, no-one ever got to the bottom of it. That report was for Emperor, and I am now seeing the same problem in a Dumpling (0.67.10) cluster with 4 OSD nodes and a total of 92 OSDs (not exactly an optimal OSD-to-node ratio, but it shouldn't cause the cluster to go haywire the way it does).

We see CPU-hogging hosts spinning and spewing these messages in the logs:

2014-09-16 23:38:52.383297 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55(unlocked)] enter Initial
2014-09-16 23:38:52.693380 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] exit Initial 0.310083 0 0.000000
2014-09-16 23:38:52.693393 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] enter Reset
2014-09-16 23:38:52.693952 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77(unlocked)] enter Initial
2014-09-16 23:38:52.963101 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] exit Initial 0.269148 0 0.000000
2014-09-16 23:38:52.963115 7f59b5ff0780  5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] enter Reset

The OSDs then settle into a state where they log this (and are still at 100% CPU, getting randomly marked out and in):

2014-09-16 23:44:18.329467 7f59340e5700  5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
2014-09-16 23:44:21.231029 7f59340e5700  5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])

The first set of log messages superficially looks similar to issue #6565, which, per http://ceph.com/releases/v0-67-9-dumpling-released/, should have been fixed in 0.67.9. This system is running 0.67.10, which leads me to tentatively rule that bug out (unless the fix broke something).
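
One quick sanity check on that assumption is to ask a running OSD directly which version it reports (a sketch; pick any OSD id):

# version reported by a running daemon
ceph tell osd.85 version
# locally installed package on each node
ceph --version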

Just like the issue reported by Craig, this system has lots of snapshots (RBD, all using image format 1, because the intended primary use case is kernel-mapped RBD volumes). Unlike Craig's system, it runs no radosgw.
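
For illustration, the snapshot churn here is roughly of the following shape (a minimal sketch; pool/image names are made up, and older rbd clients may spell the format flag as --format rather than --image-format):

# illustrative workload only; names are hypothetical
rbd create --size 10240 --image-format 1 rbd/vol1
for i in $(seq 1 100); do rbd snap create rbd/vol1@snap$i; done
# "removing many snapshots in a short time":
for i in $(seq 1 100); do rbd snap rm rbd/vol1@snap$i; done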

Shutting down the OSDs and marking them out serves no purpose other than making the issue promptly reappear somewhere else (just as Craig reported in March).

If I can provide any additional information to get to the bottom of this (as it's apparently unresolved at least in current Dumpling), please let me know.


Related issues

Related to Ceph - Bug #9487: dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping Resolved 09/16/2014

History

#1 Updated by Florian Haas over 9 years ago

  • Description updated (diff)

#2 Updated by Florian Haas over 9 years ago

Added issue #9487 as possibly related.

#3 Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to High

Hi Florian-

Can you generate some OSD logs (debug ms = 1, debug osd = 20) and attach them to the bug? The messages you quote are (in and of themselves) perfectly normal. (If there are too many of the first sort that would just imply too many PGs, but I'm not sure how many 'spewing' implies. :) Probably a single OSD is enough if it is pegging the CPU.
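
For reference, a sketch of how to bump those levels on a running OSD without restarting it (substitute the right id, or osd.\* for all OSDs):

ceph tell osd.85 injectargs '--debug-osd 20 --debug-ms 1'
# or persistently in ceph.conf under [osd], followed by a restart:
#   debug osd = 20
#   debug ms = 1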

#4 Updated by Florian Haas over 9 years ago

Sage, I do have logs (debug osd=20, though not debug ms=1), but after the discussion with Dan on the -devel list, this does look very much like a dup of #9487. Should I attach them there?

#5 Updated by Sage Weil over 9 years ago

  • Status changed from Need More Info to Duplicate

Florian Haas wrote:

Sage, I do have logs (debug osd=20, though not debug ms=1), but after the discussion with Dan on the -devel list, this does look very much like a dup of #9487. Should I attach them there?

Yeah I agree. No need for the logs just yet, I should be able to reproduce this pretty easily.

#6 Updated by Florian Haas over 9 years ago

Sage, a log is at https://www.dropbox.com/s/f2xyx12y2zr7fid/ceph-osd.14.log.xz -- behold the awesomeness of xz; that file is a whopping 661M uncompressed (generated in just under 4 minutes of "debug osd=20").

Note that snap trim sleep is 30s, as you'll see in the one "snap_trimmer slept" line in there; this was deliberately yanked up to an insane level to see if it would make any meaningful difference to the problem at hand. It did not.
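
For the record, the sleep was raised with something along these lines (a sketch; the setting can also live in ceph.conf):

ceph tell osd.\* injectargs '--osd-snap-trim-sleep 30'
# or in ceph.conf under [osd]:
#   osd snap trim sleep = 30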

#7 Updated by Sage Weil over 9 years ago

  • Status changed from Duplicate to Need More Info
  • Assignee set to Sage Weil
  • Priority changed from High to Urgent

#8 Updated by Sage Weil over 9 years ago

That log shows me PGs with a huge snap_trimq, which is very unfriendly to the snap trimmer. I've added Dan's patch on top of mine, as it mitigates that issue. I think with that applied you should avoid the worst of the flapping. I set the default to 16.

But what I don't see in the log is how the PG comes to have that ginormous snap_trimq. Its log is (86457'233235,87838'238295], so it can't be pure log recovery unless there is some weirdness there. My guess is that this PG got backfilled prior to my wip-9487 patch, and now that it exists it has this huge set to churn through; Dan's patch should let it do so without thrashing too horribly.

Perhaps once the existing bungled PGs have gone through being expired and are in good condition, we can look to see if any new ones start behaving badly and make sure we have all the needed logs to understand why/how that happens (if it does). I'm cautiously optimistic, though, that this is addressed between the two patches...

#9 Updated by Florian Haas over 9 years ago

Thanks a lot! I'll report back once there is an update to share.

#10 Updated by Florian Haas over 9 years ago

OK, that seems to have done it. After installing the updated autobuild with Dan's patch and keeping the snap_trim limit at its default of 16, OSD CPU utilization immediately went back to normal, enabling OSDs to respond in time to ping requests. OSD flapping thus ceased instantly.

Trimming the remaining snaps also seems to have completed much faster than at the prior snail's pace.

Is there a global way to check whether any snap trimming is still underway? Or can it be assumed that if no PGs are peering, recovering, or backfilling anymore, all snap trimming has completed?
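
One possible spot-check, assuming this build's pg query output exposes a snap_trimq field (not verified on Dumpling specifically):

# a PG with trimming still pending would show a non-empty snap_trimq
ceph pg 5.6c query | grep snap_trimq
# trim-related settings can also be inspected per OSD over the admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok config show | grep snap_trim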

#11 Updated by Sage Weil over 9 years ago

  • Status changed from Need More Info to Fix Under Review

#12 Updated by Florian Haas over 9 years ago

Whoa. While a cluster with this patch applied doesn't spin like crazy in snap_trim anymore, killing an OSD seems to immediately lead to incomplete PGs. Here's a pg query from one of them:

# ceph pg 5.6c query                                                                     
{ "state": "incomplete",                                                                                         
  "epoch": 116183,                                                                                               
  "up": [                                                                                                        
        38],                                                                                                     
  "acting": [                                                                                                    
        38],                                                                                                     
  "info": { "pgid": "5.6c",                                                                                      
      "last_update": "86058'151386",                                                                             
      "last_complete": "86058'151386",                                                                           
      "log_tail": "75975'145637",                                                                                
      "last_backfill": "MAX",                                                                                    
      "purged_snaps": "[1~1402,1405~1,1408~a3,14ad~1,14b0~95,1547~1,154a~9,1555~1,1558~a3,15fd~1,1600~a3,16a5~1,1
6a8~a3,174d~1,1750~a3,17f5~1,17f8~2a7,1aa1~1,1aa4~128e,2d33~1,2d36~1,2d38~a2,2ddb~1,2dde~1,2de0~a2,2e83~1,2e86~1,
2e88~a2,2f2b~1,2f2e~1,2f30~a2,2fd3~1,2fd6~1,2fd8~a2,307b~1,307e~1,3080~a2,3123~1,3126~1,3128~a2,31cb~1,31ce~1,31d
0~a1,3272~1,3275~a3,3319~1,331c~1,331e~a2,33c2~2,33c6~a2,346a~2,346e~a2,3512~2,3516~a2,35ba~2,35be~94,3654~2,3658
~8,3662~2,3666~a2,370a~2,370e~a2,37b2~2,37b6~a2,385a~2,385e~43,38a2~5e,3902~2,3906~a2,39aa~2,39ae~64,3a13~2,3a16~
8,3a1f~1,3a21~33,3a56~2,3a5a~9f,3afb~2,3aff~22,3b22~53,3b76~2b,3ba3~2,3ba7~6,3bae~9b,3c4b~2,3c4f~94,3ce5~2,3ce9~8
,3cf3~2,3cf7~a2,3d9b~2,3d9f~a2,3e43~2,3e47~a2,3eeb~2,3eef~a2,3f93~2,3f97~a2,403b~2,403f~a2,40e3~2,40e7~a2,418b~2,
418f~a2,4233~2,4237~a2,42db~2,42df~a2,4383~2,4387~a3,442b~2,442f~3d2,4805~a4,48ad~a4,4955~a4,49fd~a4,4aa5~a,4ab3~
3,4aba~3,4ac1~3,4ac8~3,4acf~3,4ad6~3,4add~3,4ae4~3,4aeb~3,4af2~3,4af9~3,4b00~3,4b07~3,4b0e~3,4b15~3,4b1c~3,4b23~3
,4b2a~3,4b31~3,4b38~3,4b3f~3,4b46~3,4b4d~3,4b54~3,4b5b~3,4b62~3,4b69~3,4b70~3,4b77~3,4b7e~3,4b85~3,4b8c~3,4b93~3,
4b9a~3,4ba1~3,4ba8~3,4baf~3,4bb6~3,4bbd~3,4bc4~3,4bcb~3,4bd2~3,4bd9~3,4be0~3,4be7~3,4bee~3,4bf5~3,4bfc~3,4c03~3,4
c0a~3,4c11~3,4c18~3,4c1f~3,4c26~3,4c2d~3,4c34~3,4c3b~3,4c42~3,4c49~3,4c50~3,4c57~3,4c5e~3,4c65~3,4c6c~3,4c73~3,4c
7a~3,4c81~3,4c88~3,4c8f~3,4c96~3,4c9d~3,4ca4~3,4cab~3,4cb2~3,4cb9~3,4cc0~3,4cc7~3,4cce~3,4cd5~3,4cdc~3,4ce3~3,4ce
a~3,4cf1~3,4cf8~3,4cff~3,4d06~3,4d0d~3,4d14~3,4d1b~3,4d22~3,4d29~3,4d30~3,4d37~3,4d3e~3,4d45~3,4d4c~3,4d53~3,4d5a
~3,4d61~3,4d68~3,4d6f~3,4d76~3,4d7d~3,4d84~3,4d8b~3,4d92~3,4d99~3,4da0~3,4da7~3,4dae~3,4db5~3,4dbc~3,4dc3~3,4dca~
3,4dd1~3,4dd8~3,4ddf~3,4de6~3,4ded~3,4df4~3,4dfb~3,4e02~3,4e09~3,4e10~3,4e17~3,4e1e~3,4e25~3,4e2c~3,4e33~3,4e3a~3
,4e41~3,4e48~3,4e4f~3,4e56~3,4e5d~3,4e64~3,4e6b~c5,4f31~1,4f33~1,4f36~cc,5003~1,5005~1,5008~f4,5100~3,5107~3,510e
~f,5121~3,5128~16,5142~2f,5175~23,5199~3,519d~7,51a6~7,51ae~1,51b9~1760,691d~424,6d43~513,7258~1,725b~3,7263~1,72
69~2,726d~1,7270~2,7274~1,7277~2,727b~1,727e~2,7282~1,7285~2,7289~1,728c~2,7290~1,7293~2,7297~1,729a~2,729e~1,72a
1~2,72a5~1,72a8~2,72ac~1,72af~2,72b3~1,72b6~6,72be~3,72c5~1,72d0~1,72de~1,72e1~2,72e8~1,72ef~1,72f2~91]",        
      "history": { "epoch_created": 2992,
          "last_epoch_started": 116069,
          "last_epoch_clean": 116069,
          "last_epoch_split": 0,
          "same_up_since": 116182,
          "same_interval_since": 116182,
          "same_primary_since": 116068,
          "last_scrub": "86058'151386",
          "last_scrub_stamp": "2014-09-22 17:42:25.962548",
          "last_deep_scrub": "86058'151386",
          "last_deep_scrub_stamp": "2014-09-22 17:42:25.962548",
          "last_clean_scrub_stamp": "2014-09-22 17:42:25.962548"},
      "stats": { "version": "86058'151386",
          "reported_seq": "623325",
          "reported_epoch": "116183",
          "state": "incomplete",
          "last_fresh": "2014-09-23 15:01:11.330263",
          "last_change": "2014-09-23 15:01:10.521895",
          "last_active": "2014-09-23 14:55:12.428301",
          "last_clean": "2014-09-23 14:55:12.428301",
          "last_became_active": "0.000000",
          "last_unstale": "2014-09-23 15:01:11.330263",
          "mapping_epoch": 116068,
          "log_start": "75975'145637",
          "ondisk_log_start": "75975'145637",
          "created": 2992,
          "last_epoch_clean": 116069,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "86058'151386",
          "last_scrub_stamp": "2014-09-22 17:42:25.962548",
          "last_deep_scrub": "86058'151386",
          "last_deep_scrub_stamp": "2014-09-22 17:42:25.962548",
          "last_clean_scrub_stamp": "2014-09-22 17:42:25.962548",
          "log_size": 5749,
          "ondisk_log_size": 5749,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 1557106688,
              "num_objects": 601,
              "num_object_clones": 332,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 414452,
              "num_read_kb": 11250984,
              "num_write": 106491,
              "num_write_kb": 3138432,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 1,
              "num_bytes_recovered": 3252224,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
                38],
          "acting": [
                38]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 116069},
  "recovery_state": [
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2014-09-23 15:01:10.521734",
          "past_intervals": [
                { "first": 116068,
                  "last": 116181,
                  "maybe_went_rw": 1,
                  "up": [
                        38,
                        18],
                  "acting": [
                        38,
                        18]}],
          "probing_osds": [
                38],
          "down_osds_we_would_probe": [
                18],
          "peering_blocked_by": []},
        { "name": "Started",
          "enter_time": "2014-09-23 15:01:10.521677"}]}

In the instance where I saw this, the cluster was left untouched for 5 minutes, and after the mon osd down out interval expired, it started recovering/backfilling normally. But I'm still concerned about the PGs being marked incomplete, albeit temporarily.

Looks like something is still fishy with the fix.
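
(The 5-minute window matches the default mon osd down out interval of 300 seconds; for reference, a sketch of how it could be raised while debugging, where mon.a is a placeholder monitor id:)

# in ceph.conf under [mon] (default: 300 seconds):
#   mon osd down out interval = 600
# or injected at runtime on each monitor:
ceph tell mon.a injectargs '--mon-osd-down-out-interval 600'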

#13 Updated by Samuel Just over 9 years ago

https://github.com/ceph/ceph/pull/2525

The num_trimmed does not seem to be reset. I think you are not trimming at all after the initial trims.

#14 Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to 12

#15 Updated by Florian Haas over 9 years ago

  • Subject changed from Dumpling: running many RBDs with many snapshots makes OSDs go berserk to Dumpling: removing many snapshots in a short time makes OSDs go berserk

I'm not 100% sure, so I thought I'd ask: what's the exact reason for the PG being marked incomplete here? Is it the failure of a recovering/backfilled PG to reconstruct its snap_trim history, or something else?

#16 Updated by Sage Weil over 9 years ago

  • Status changed from 12 to 7
  • Assignee changed from Sage Weil to Samuel Just

fix is in wip-9487 and wip-sam-testing

#17 Updated by Florian Haas over 9 years ago

Please disregard #15. I just fell victim to inaccurate documentation about the incomplete PG state.

Sam's hunch was correct in that Dan's original patch just ceased trimming after the initial batch. With the updated wip-9487-dumpling packages deployed, OSDs started getting busy snap trimming again, and several OSDs were at 100% CPU for several minutes, but never stopped responding to heartbeats in the process and slept according to osd snap trim sleep. After a few minutes, with snap trimming completed, OSDs were just humming along nicely.

So it looks like this is fixed now. Sam, please feel free to add my Tested-By to the commit.

Scratch that. I was still using Dan's old patch. wip-9487-dumpling seems to have disappeared from the repos in http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-notcmalloc/ref/, although there is now a disconcertingly named wip-9487-dumpling-forcehack. Could you fill me in on what's up with that?

#18 Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Fix Under Review

#19 Updated by Dan van der Ster over 9 years ago

Hi Sam,
I think this is fixed in master/giant, correct? Just a gentle reminder that we'd appreciate a backport in dumpling.
Cheers, Dan

#20 Updated by Florian Haas over 9 years ago

I believe the fact that the commit message for 255b430a87201c7d0cf8f10a3c1e62cbe8dd2d93 said Backfill where it should say Backport may have caught the odd QA script off guard. :)

#22 Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Pending Backport

#23 Updated by Christopher Thorjussen over 9 years ago

Any news on the backport?

#24 Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved

#25 Updated by Florian Haas over 9 years ago

Sam, is it correct to assume that this was fixed for dumpling in 1be9476afb9f715502a14749dd44e08371535b54, and that it's pending release in 0.67.12?
