Support #20356

Cluster health OK, but rados hangs

Added by Ivan Wong almost 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

One day I found that the VM I/O utilization was very high, but the Ceph load was not.

The cluster health is OK, and all PGs are active+clean.
When I run rados -p block ls (my RBD block pool), the command hangs partway through. The default logs do not record any specific information.
Restarting all OSDs solves the problem, but after running for a period of time it comes back.
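
When this happens again, it may help to see which requests are stuck before restarting the OSDs. A minimal diagnostic sketch, assuming the pool name "block" from above and a placeholder OSD id (osd.3); the debug options and admin-socket commands are standard Ceph client/OSD facilities, but verify them against your 10.2.x build:

# rerun the listing with client-side debug logging to see where it stalls
rados -p block ls --debug-ms 1 --debug-objecter 20 2> /tmp/rados-objecter.log

# on the OSD host that the log points at, inspect in-flight and recent slow ops
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops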

I tried upgrading Ceph from 10.2.5 to 10.2.7, but the problem persists.

It seems the problem appeared after I modified the configuration.
ceph.conf changes:
ms_type = async
ms_async_op_threads = 5
ms_dispatch_throttle_bytes = 104857600000

Finally, I rolled back the ceph config. Two days without problems.

ceph.conf:
[global]
fsid = e608c0d4-9e92-4e96-a4f1-a2c319dd8435
mon_initial_members = ip-10-26-67-10, ip-10-26-67-11, ip-10-26-67-12
mon_host = 10.26.67.10,10.26.67.11,10.26.67.12
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 3
osd pool min size = 1
osd pool default pg num = 2048
osd pool default pgp num = 2048

public network =10.26.0.0/16
cluster network =192.168.1.0/24
max open files = 131072
#message
#ms_type = async
#ms_async_op_threads = 5
#ms_dispatch_throttle_bytes = 104857600000
#osd crush update on start = false
mon_pg_warn_max_per_osd = 600

[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd mkfs type = xfs
osd mkfs options xfs = -f

filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 5000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000

journal max write bytes = 1073714824
journal max write entries = 1000
journal queue max ops = 3000
journal queue max bytes = 10485760000

osd max write size = 512
osd client message size cap = 2048
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,nodev,noexec,noatime,nodiratime,inode64"
osd recovery op priority = 4
osd recovery max active = 10
osd recovery max backfills = 4

[client]
rbd cache = true
rbd cache size = 67108864
rbd cache max dirty = 50331648
rbd cache target dirty= 33554432
rbd cache max dirty age = 1
rbd cache writethrough until flush= false

Thanks.

#1

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Closed

You've set it to allow nearly 100 GB of in-flight IO data. I don't know exactly why that would trigger an issue with listing but there's no way you can expect that to function correctly on any normal server. If it works without that config change, I think you've found the culprit!
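
For reference, the arithmetic behind "nearly 100 GB" (the 100 MB default for ms_dispatch_throttle_bytes is an assumption based on the Jewel-era documentation; verify against your own build, e.g. with ceph daemon osd.0 config show):

ms_dispatch_throttle_bytes = 104857600000   # value set above: 104857600000 / 1024^3 ≈ 97.66 GiB
ms_dispatch_throttle_bytes = 104857600      # usual default:   100 * 1024^2  = 100 MiB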
