Bug #7674 (closed): Cache pool configuration stalls

Added by Mark Nelson about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sage Weil
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In testing the following cache pool configuration:

ceph osd pool create cache 4096
ceph osd pool create base 4096
ceph osd pool set cache size 1
ceph osd pool set base size 3
ceph osd tier add base cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay base cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 8
ceph osd pool set cache hit_set_period 60
ceph osd pool set cache target_max_objects 5000

using the command:

/usr/bin/rados -c /tmp/cbt/ceph/ceph.conf -p base -b 4194304 bench 300 write --concurrent-ios 32 --no-cleanup

rados bench was able to write out 3650 objects and then stalled. Further attempts to write to the pool resulted in a small number of object writes (between 50 and 100) before promptly stalling again. After discussing with Greg, we suspect that 3650 objects may be enough for at least one PG to have hit its max target threshold, and that subsequent writes can complete a small number of objects before stalling on that same PG.
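
One way to sanity-check that suspicion (a sketch only, not something that was run here): read back the cache settings and compare per-PG object counts against the pool-wide target. It assumes the cache pool is pool 19, as in the op dumps below, that plain ceph pg dump output lists the pgid in the first column and the object count in the second, and that this release's ceph osd pool get exposes the cache settings.

# Confirm the tier relationship and cache settings actually took effect
ceph osd dump | grep -E "'(base|cache)'"
ceph osd pool get cache target_max_objects

# Pool-wide object counts for both pools
rados df

# Cache-pool PGs sorted by object count, highest first; if the agent's per-PG
# threshold works out to roughly target_max_objects / pg_num (5000 / 4096, i.e.
# just over 1 object per PG), any PG at or above that share is a stall candidate
ceph pg dump | awk '$1 ~ /^19\./ {print $2, $1}' | sort -rn | head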

For the last hour, ceph -w has reported 225 blocked requests, and according to the admin socket every OSD has ops in flight, all of which look similar to:

        { "description": "osd_op(client.4421.0:2917 benchmark_data_burnupiY_16185_object2916 [write 0~4194304] 19.68af5c2c ack+ondisk+write e158)",
          "received_at": "2014-03-10 15:04:26.258728",
          "age": "3679.915517",
          "duration": "0.016717",
          "type_data": [
                "reached pg",
                { "client": "client.4421",
                  "tid": 2917},
                [
                    { "time": "2014-03-10 15:04:26.275362",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-03-10 15:04:26.275445",
                      "event": "reached_pg"}]]},

Looking at the objecter_requests stats from the admin socket, I see the following for every OSD:

{ "ops": [],
  "linger_ops": [],
  "pool_ops": [],
  "pool_stat_ops": [],
  "statfs_ops": [],
  "command_ops": []}

During these tests I also had optracker logging enabled and saw many entries like the following, but I am not sure how to interpret them:

osd.0.log:2014-03-10 15:04:32.403352 7fc9a1c27700  5 -- op tracker -- , seq: 184939, time: 2014-03-10 15:04:32.403302, event: dispatched, request: osd_op(osd.0.4:434 benchmark_data_burnupiY_16185_object2332 [assert-version v2,copy-get max 8388608 ver 0] 19.ffcf2d8c ack+read+ignore_cache+ignore_overlay+flush e158) v4
osd.0.log:2014-03-10 15:04:32.403364 7fc9a1c27700  5 -- op tracker -- , seq: 184939, time: 2014-03-10 15:04:32.403363, event: waiting_for_osdmap, request: osd_op(osd.0.4:434 benchmark_data_burnupiY_16185_object2332 [assert-version v2,copy-get max 8388608 ver 0] 19.ffcf2d8c ack+read+ignore_cache+ignore_overlay+flush e158) v4
osd.0.log:2014-03-10 15:04:32.403465 7fc998c15700  5 -- op tracker -- , seq: 184939, time: 2014-03-10 15:04:32.403464, event: reached_pg, request: osd_op(osd.0.4:434 benchmark_data_burnupiY_16185_object2332 [assert-version v2,copy-get max 8388608 ver 0] 19.ffcf2d8c ack+read+ignore_cache+ignore_overlay+flush e158) v4
osd.0.log:2014-03-10 15:04:32.403526 7fc998c15700  5 -- op tracker -- , seq: 184939, time: 2014-03-10 15:04:32.403526, event: started, request: osd_op(osd.0.4:434 benchmark_data_burnupiY_16185_object2332 [assert-version v2,copy-get max 8388608 ver 0] 19.ffcf2d8c ack+read+ignore_cache+ignore_overlay+flush e158) v4
osd.0.log:2014-03-10 15:04:32.414554 7fc998c15700  5 -- op tracker -- , seq: 184939, time: 2014-03-10 15:04:32.414553, event: done, request: osd_op(osd.0.4:434 benchmark_data_burnupiY_16185_object2332 [assert-version v2,copy-get max 8388608 ver 0] 19.ffcf2d8c ack+read+ignore_cache+ignore_overlay+flush e158) v4
osd.0.log:2014-03-10 15:04:32.478231 7fc9a1c27700  5 -- op tracker -- , seq: 184944, time: 2014-03-10 15:04:32.478130, event: header_read, request: osd_op(osd.9.23:399 benchmark_data_burnupiY_16185_object2174 [assert-version v2,copy-get max 8388608 ver 0] 19.e3737f9f ack+read+ignore_cache+ignore_overlay+flush e158) v4
osd.0.log:2014-03-10 15:04:32.478272 7fc9a1c27700  5 -- op tracker -- , seq: 184944, time: 2014-03-10 15:04:32.478135, event: throttled, request: osd_op(osd.9.23:399 benchmark_data_burnupiY_16185_object2174 [assert-version v2,copy-get max 8388608 ver 0] 19.e3737f9f ack+read+ignore_cache+ignore_overlay+flush e158) v4
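
For completeness, the op tracker events above are logged at debug level 5; a minimal sketch for enabling that (the ceph.conf equivalent would be "debug optracker = 5" in the [osd] section):

# Enable op tracker logging on a single OSD via its admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set debug_optracker 5
# or inject it into all OSDs from a client node
ceph tell osd.* injectargs '--debug-optracker 5'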

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #7686: osd spinning in agent_work (Duplicate, 03/11/2014)

#1

Updated by Mark Nelson about 10 years ago

This was on 0.77-735-gd171418-1saucy, which is a couple of days old. I can retest on today's firefly build if there have been any significant changes since then.

#2

Updated by Ian Colle about 10 years ago

  • Assignee set to Sage Weil
#3

Updated by Sage Weil about 10 years ago

  • Status changed from New to Fix Under Review
#4

Updated by Sage Weil about 10 years ago

  • Status changed from Fix Under Review to Resolved