Bug #37747

open

slow requests are being shown on Luminous version while using BlueStore, and cluster capacity goes above 40%

Added by kobi ginon over 5 years ago. Updated over 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
luminous bluestore
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
we are seeing a regression in Luminous with BlueStore compared to the FileStore-based Jewel version
when the capacity of the cluster goes above 40%.
An example of the errors shown is:
HEALTH_WARN 7 slow requests are blocked > 32 sec
REQUEST_SLOW 7 slow requests are blocked > 32 sec
7 ops are blocked > 32.768 sec
osd.87 has blocked requests > 32.768 sec

and it fluctuates between different OSDs

Problem first analysis summary:
- it seems that some OSDs are getting much more data than others during the test to be written to disk
- the number of operations waiting in the queue ramps up for those OSDs reporting slow requests,
for example: --- overcloud-pl11sriovcompute-41 osd.87 "num_ops": 1329 (a sketch of how this can be sampled follows below)
- in the Jewel version with FileStore we could indeed see, with the same test, a ramp-up in the queues
of those OSDs, but they are cleared from the queue much faster and the numbers are around 25% or maybe less
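For reference, a minimal sketch (not part of the original report) of how the per-OSD data imbalance and queue depth can be sampled, assuming access to the admin socket on the host running osd.87:

# cluster-wide view of the blocked requests (matches the HEALTH_WARN above)
ceph health detail
# per-OSD utilization, to see which OSDs received much more data than others
ceph osd df
# on the OSD host: operations currently in flight; the JSON output contains
# the "num_ops" counter quoted above plus one entry per pending op
ceph daemon osd.87 dump_ops_in_flight
# recently completed slow ops with per-stage timestamps, to see where time was spent
ceph daemon osd.87 dump_historic_ops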

When we compared with the same setup on the previous Jewel version, we did not see that behavior at that capacity.

Details of the BlueStore setup ###############################
1. HP ProLiant BL460c Gen9
2. ceph version: ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
3. bluestore
4. number of controllers: 3, number of computes: 63
5. ceph configuration: hyper-converged
6. number of OSDs: 120
7. number of pools: 7
8. replication factor: 2
9. ceph df - the issue started at 42% capacity, and we continued testing, so it is now at a higher value
[root@overcloud-controller-0 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
109T 45086G 67467G 59.94
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
images 1 8040M 0.07 11104G 1033
metrics 2 111M 0 11104G 27208
vms 4 204G 1.81 11104G 54494
volumes 5 32911G 74.77 11104G 26470118
backups 8 0 0 11104G 0
manila_data 10 0 0 11104G 0
manila_metadata 11 2246 0 11104G 21

10. the volumes pool is the most occupied, so as part of trying to improve things we enlarged the number of placement groups for this pool (example commands for this follow the PG dump below).
Below is a dump of the number of PGs for the pools:
pool 1 'images' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 850 pgp_num 850 last_change 5271 lfor 0/1428 flags hashpspool stripe_width 0 application rbd
pool 2 'metrics' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 850 pgp_num 850 last_change 5271 lfor 0/1497 flags hashpspool stripe_width 0 application rbd
pool 4 'vms' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 850 pgp_num 850 last_change 5271 lfor 0/1639 flags hashpspool stripe_width 0 application rbd
pool 5 'volumes' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 2600 pgp_num 2600 last_change 5568 lfor 0/5564 flags hashpspool stripe_width 0 application rbd
pool 8 'backups' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5271 flags hashpspool stripe_width 0
pool 10 'manila_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5271 flags hashpspool stripe_width 0 application cephfs
pool 11 'manila_metadata' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5271 flags hashpspool stripe_width 0 application cephfs

The test that has been used by the customer is an IBM benchmark tool, but we could reproduce the issue with a simple rados benchmark
writing 8k objects from all the computes in parallel.

See the attached file showing the queues and the slow requests report during the issue: results_bench_18_10min
The test being run in parallel on all computes for 10 minutes is: rados -p volumes bench 600 write -b 8192 -t 32
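For completeness, a minimal sketch of how such a parallel run can be driven from a single node; it assumes passwordless SSH to the compute hosts and a hypothetical host list file computes.txt, and uses the rados invocation above (with --run-name so the clients do not collide on object names):

# computes.txt: one compute hostname per line (hypothetical file)
while read host; do
  ssh "$host" "rados -p volumes bench 600 write -b 8192 -t 32 --run-name bench-$host" \
      > "bench-$host.log" 2>&1 &
done < computes.txt
wait   # all clients write 8k objects for 600 s (10 minutes) in parallel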

The Comparison setup is: ##########################

1. HPE ProLiant DL360 Gen10
2. ceph version: ceph version 10.2.10-17.el7cp (9865b1b203321435cc7128257833dca28bd779aa)
3. filestore
4. number of controllers: 3, number of computes: 64
5. ceph configuration: hyper-converged
6. number of OSDs: 128
7. number of pools: 5
8. replication factor: 2
9. ceph df - 43% capacity
[root@overcloud-controller-0 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
185T 105T 81713G 43.06
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
metrics 1 1495M 0 35892G 126014
images 2 1451G 3.89 35892G 419835
backups 3 0 0 35892G 0
volumes 4 25900G 41.91 35892G 7109960
vms 5 13576G 27.44 35892G 3507618

10. Below is a dump of the number of PGs for the pools:
pool 1 'metrics' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2688 pgp_num 2688 last_change 543 flags hashpspool stripe_width 0
pool 2 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2688 pgp_num 2688 last_change 7110 flags hashpspool stripe_width 0
pool 3 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2688 pgp_num 2688 last_change 547 flags hashpspool stripe_width 0
pool 4 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2688 pgp_num 2688 last_change 994 flags hashpspool stripe_width 0
pool 5 'vms' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2688 pgp_num 2688 last_change 6915 flags hashpspool stripe_width 0

The same test was used on this setup: the customer's IBM benchmark tool, reproduced with the simple rados benchmark writing 8k objects from all the computes in parallel.


Files

results_bench_18_10min (725 KB) results_bench_18_10min this is from jewel - queues summary during benchmark test kobi ginon, 12/24/2018 10:34 AM
results_bench_19_10min (551 KB) results_bench_19_10min this is from luminous - queues summary benchmark test kobi ginon, 12/24/2018 10:35 AM
ceph_osd-sdb.zip (697 KB) ceph_osd-sdb.zip Log from osd.50 in debug level when the slow requests are happening kobi ginon, 12/25/2018 12:32 PM
Actions #4

Updated by kobi ginon over 5 years ago

Well, we do not see any traffic related to this bug, so we are just updating it to reflect the current trials (a rough sketch of the configuration changes follows at the end of this comment):
1. we tried to enlarge the PGs of the busiest pool (volumes) to 4096
Result: still slow requests
2. we changed the OSD configuration to the following (wpq was the default):
osd_op_queue = prioritized
Result: still slow requests
3. we tried to change the block.db size to 4% (i.e. roughly 4% of the data device size)
Note: the disks are HDDs and we had let Ceph use the default value, which is 1 GB.
We could not do this for all OSDs, as this setup has 120 OSDs,
so we changed those that had shown slow requests, and after recreation and stabilization, the new ones
still showed slow requests.
BTW: since these are HDDs, I assumed that the block.db, if it exists on the collocated disk, should not be used, but is kept as an option
for future use if someone puts in SSDs that can be utilized for it.
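A rough sketch of the configuration changes described in points 2 and 3 above; the ceph.conf fragment and the 4%-of-a-1TB-HDD sizing are illustrative assumptions, osd_op_queue needs an OSD restart to take effect, and bluestore_block_db_size only applies when an OSD is (re)created:

# /etc/ceph/ceph.conf (fragment) -- values shown for illustration only
[osd]
# point 2: switch the op scheduler away from the wpq default
osd_op_queue = prioritized
# point 3: size block.db at OSD re-creation time; ~4% of the data device,
# e.g. about 40 GiB for a 1 TB HDD, instead of the observed 1 GB default
bluestore_block_db_size = 42949672960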

Actions #5

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS

I've moved this into the RADOS tracker for now, but you will probably get more useful help on the ceph-users mailing list for tuning.
