Support #20391

open

cluster health is ok, but client IO hangs with 100% util on ssd disk

Added by XuHua Lin almost 7 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

  • cluster info:
    3 hosts, each with 1 mon and 4 OSDs (Intel S3500 SSD).
  • host info:
    
    [root@node2 ceph]# uname -a ; cat /etc/redhat-release ;ceph version
    Linux node2 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    CentOS Linux release 7.3.1611 (Core) 
    ceph version 12.0.3 (f2337d1b42fa49dbb0a93e4048a42762e3dffbbf)
    
    Filesystem               Size  Used Avail Use% Mounted on
    /dev/mapper/centos-root   50G  1.9G   49G   4% /
    devtmpfs                  63G     0   63G   0% /dev
    tmpfs                     63G     0   63G   0% /dev/shm
    tmpfs                     63G   26M   63G   1% /run
    tmpfs                     63G     0   63G   0% /sys/fs/cgroup
    /dev/sde1                497M  169M  328M  35% /boot
    /dev/mapper/centos-home  225G   33M  225G   1% /home
    tmpfs                     13G     0   13G   0% /run/user/0
    /dev/sda1                260G   26G  234G  10% /var/lib/ceph/osd/ceph-0
    /dev/sdb1                260G   16G  244G   6% /var/lib/ceph/osd/ceph-3
    /dev/sdc1                260G   17G  243G   7% /var/lib/ceph/osd/ceph-7
    /dev/sdd1                260G   28G  232G  11% /var/lib/ceph/osd/ceph-9 
    
    
  • ceph conf:

[global]
fsid = ed5467b9-a4a7-4a05-847f-907f8c256b44
max open files = 131072
mon initial members = node1,node2,node3
mon host = 172.17.141.219,172.17.141.220,172.17.141.221
public network = 172.17.141.0/24
cluster network = 10.1.1.0/24

[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor

[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime,largeio,inode64,swalloc
osd journal size = 20480

[client.rgw.node1]
host = node1
keyring = /var/lib/ceph/radosgw/ceph-rgw.node1/keyring
rgw socket path = /tmp/radosgw-node1.sock
log file = /var/log/ceph/ceph-rgw-node1.log
rgw data = /var/lib/ceph/radosgw/ceph-rgw.node1
rgw frontends = civetweb port=172.17.141.220:8080 num_threads=50
rgw resolve cname = False
[client.rgw.node2]
host = node2
keyring = /var/lib/ceph/radosgw/ceph-rgw.node2/keyring
rgw socket path = /tmp/radosgw-node2.sock
log file = /var/log/ceph/ceph-rgw-node2.log
rgw data = /var/lib/ceph/radosgw/ceph-rgw.node2
rgw frontends = civetweb port=172.17.141.220:8080 num_threads=50
rgw resolve cname = False
[client.rgw.node3]
host = node3
keyring = /var/lib/ceph/radosgw/ceph-rgw.node3/keyring
rgw socket path = /tmp/radosgw-node3.sock
log file = /var/log/ceph/ceph-rgw-node3.log
rgw data = /var/lib/ceph/radosgw/ceph-rgw.node3
rgw frontends = civetweb port=172.17.141.220:8080 num_threads=50
rgw resolve cname = False
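
The [osd] section sets a 20 GiB journal (osd journal size is given in MB). A quick way to see where each FileStore journal actually lives, and whether it shares a device with the OSD data, is the sketch below (a hedged sketch, assuming the OSDs were deployed with ceph-disk; the OSD ids 0/3/7/9 are the ones mounted on this host per the df output above):

    # On node2: where does each FileStore journal point?  A journal on the same
    # SSD as the data partition means journal and data writes share one device.
    for id in 0 3 7 9; do
        echo "osd.$id journal:"; ls -l /var/lib/ceph/osd/ceph-$id/journal
    done

    # ceph-disk (the deployment tool shipped with this release) also prints the
    # data/journal pairing directly:
    ceph-disk list
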
  • test client:
    fio (a more explicit form of the same command is sketched after it below)

fio --runtime=600 --time_based --direct=1 --thread --ioengine=libaio --numjobs=1 --group_reporting --bs=4k --rw=randwrite --size 10G --name=/dev/rbd0 --iodepth=128
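
For clarity, an equivalent but more explicit form of the same 4k randwrite test (a sketch assuming the RBD image is mapped at /dev/rbd0; in fio, --filename selects the target device while --name only labels the job):

    fio --name=rbd0-randwrite --filename=/dev/rbd0 \
        --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 --thread \
        --ioengine=libaio --direct=1 --group_reporting \
        --size=10G --runtime=600 --time_based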

  • problem:
    The fio test hangs for about 30-40 seconds.
    In iostat.log, at 06/23/2017 05:57:15, disk sdd shows 100% util and the fio IOPS drops to 0.
    How can this be fixed?
    The osd log at that time shows:
    2017-06-23 17:57:00.150802 7fc58fb84700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1498211820150795, "job": 32, "event": "flush_started", "num_memtables": 1, "num_entries": 320290, "num_deletes": 106800, "memory_usage": 65013968}
    2017-06-23 17:57:00.150806 7fc58fb84700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3/rpm/el7/BUILD/ceph-12.0.3/src/rocksdb/db/flush_job.cc:293] [default] [JOB 32] Level-0 flush table #148: started
    2017-06-23 17:57:00.353798 7fc58fb84700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1498211820353779, "cf_name": "default", "job": 32, "event": "table_file_creation", "file_number": 148, "file_size": 23161781, "table_properties": {"data_size": 22693540, "index_size": 467418, "filter_size": 0, "raw_key_size": 14533440, "raw_average_key_size": 67, "raw_value_size": 18831838, "raw_average_value_size": 88, "num_data_blocks": 5806, "num_entries": 213828, "filter_policy_name": "", "kDeletedKeys": "106800", "kMergeOperands": "0"}}
    2017-06-23 17:57:00.353816 7fc58fb84700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3/rpm/el7/BUILD/ceph-12.0.3/src/rocksdb/db/flush_job.cc:319] [default] [JOB 32] Level-0 flush table #148: 23161781 bytes OK
    2017-06-23 17:57:00.354871 7fc58fb84700  4 rocksdb: (Original Log Time 2017/06/23-17:57:00.353826) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3/rpm/el7/BUILD/ceph-12.0.3/src/rocksdb/db/memtable_list.cc:360] [default] Level-0 commit table #148 started
    2017-06-23 17:57:00.354876 7fc58fb84700  4 rocksdb: (Original Log Time 2017/06/23-17:57:00.354812) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3/rpm/el7/BUILD/ceph-12.0.3/src/rocksdb/db/memtable_list.cc:383] [default] Level-0 commit table #148: memtable #1 done
    2017-06-23 17:57:00.354878 7fc58fb84700  4 rocksdb: (Original Log Time 2017/06/23-17:57:00.354825) EVENT_LOG_v1 {"time_micros": 1498211820354819, "job": 32, "event": "flush_finished", "lsm_state": [2, 2, 0, 0, 0, 0, 0], "immutable_memtables": 0}
    2017-06-23 17:57:00.354881 7fc58fb84700  4 rocksdb: (Original Log Time 2017/06/23-17:57:00.354844) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3/rpm/el7/BUILD/ceph-12.0.3/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max bytes base 268435456 files[2 2 0 0 0 0 0] max score 0.50
    
    2017-06-23 17:57:00.354888 7fc58fb84700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3/rpm/el7/BUILD/ceph-12.0.3/src/rocksdb/db/db_impl_files.cc:388] [JOB 32] Try to delete WAL files size 59234319, prev total WAL file size 59635063, number of live WAL files 2.
    
    2017-06-23 17:57:30.068047 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:30.068417 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:30.486288 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:30.486300 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:30.995800 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:30.995850 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:31.181170 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:31.181207 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:31.659606 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:31.659623 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:31.878119 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:31.878174 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:32.410072 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:32.410105 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:32.507813 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:32.507990 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:32.973129 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:32.973412 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:33.049784 7fc5adbf8700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:33.322992 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:33.323021 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:34.989710 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:34.989759 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.281774 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.281825 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.623550 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.623672 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.696286 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.696340 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.760361 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.760477 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.782442 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.782485 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.968737 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:35.968754 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.008497 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.008682 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.196929 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.197071 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.387002 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.387027 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.510525 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.510536 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.724498 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:36.724526 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.178850 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.178878 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.460982 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.461081 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.673892 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.674053 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.709496 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:37.709539 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:38.049870 7fc5adbf8700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:38.879534 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:38.879573 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:39.090049 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:39.090071 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:39.697794 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:39.697867 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.069545 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.069553 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.483138 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.483380 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.611326 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.611375 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.798258 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:40.798415 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:41.087748 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:41.087765 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:41.561647 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:41.561668 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:41.770211 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:41.770225 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.024948 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.025187 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.270684 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.270714 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.374624 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.374652 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.525523 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.525581 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.980005 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:42.980123 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.049977 7fc5adbf8700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.190650 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.190695 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.371659 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.371671 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.609878 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:43.609909 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:44.111739 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:44.111769 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:44.298793 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:44.298927 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:44.583229 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 135 slow requests, 5 included below; oldest blocked for > 30.217186 secs
    2017-06-23 17:57:44.583241 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 30.216137 seconds old, received at 2017-06-23 17:57:14.366963: osd_repop(client.4385.1:19397234 0.103 e161/158 0:c09c748d:::rb.0.1126.41b71efb.000000001a7f:head v 161'32838) currently write_thread_in_journal_buffer
    2017-06-23 17:57:44.583246 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 30.214274 seconds old, received at 2017-06-23 17:57:14.368826: osd_repop(client.4385.1:19397319 0.199 e161/159 0:99b2dfc1:::rb.0.1126.41b71efb.00000000003d:head v 161'39101) currently commit_queued_for_journal_write
    2017-06-23 17:57:44.583250 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 30.213115 seconds old, received at 2017-06-23 17:57:14.369985: osd_repop(client.4385.1:19397350 0.4 e161/158 0:20775ff7:::rb.0.1126.41b71efb.000000000760:head v 161'43410) currently commit_queued_for_journal_write
    2017-06-23 17:57:44.583256 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 30.211471 seconds old, received at 2017-06-23 17:57:14.371629: osd_op(client.4385.1:19397420 0.ee1393b8 0:1dc9c877:::rb.0.1126.41b71efb.0000000003be:head [set-alloc-hint object_size 4194304 write_size 4194304,write 2105344~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently waiting for subops from 1,2
    2017-06-23 17:57:44.583265 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 30.207416 seconds old, received at 2017-06-23 17:57:14.375684: osd_op(client.4385.1:19397510 0.dc4daea6 0:6575b23b:::rb.0.1126.41b71efb.000000004c52:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3137536~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently queued_for_pg
    2017-06-23 17:57:45.491360 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:45.491440 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:45.583586 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 135 slow requests, 5 included below; oldest blocked for > 31.217534 secs
    2017-06-23 17:57:45.583595 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 31.216554 seconds old, received at 2017-06-23 17:57:14.366894: osd_op(client.4385.1:19397278 0.cb22385b 0:da1c44d3:::rb.0.1126.41b71efb.000000004478:head [set-alloc-hint object_size 4194304 write_size 4194304,write 2555904~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently sub_op_commit_rec from 8
    2017-06-23 17:57:45.583601 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 31.213418 seconds old, received at 2017-06-23 17:57:14.370030: osd_repop(client.4385.1:19397358 0.161 e161/160 0:86d6b60b:::rb.0.1126.41b71efb.000000003841:head v 161'40567) currently commit_queued_for_journal_write
    2017-06-23 17:57:45.583605 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 31.211710 seconds old, received at 2017-06-23 17:57:14.371738: osd_repop(client.4385.1:19397405 0.65 e161/160) currently queued_for_pg
    2017-06-23 17:57:45.583611 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 31.207684 seconds old, received at 2017-06-23 17:57:14.375764: osd_op(client.4385.1:19397511 0.f5e62e27 0:e47467af:::rb.0.1126.41b71efb.0000000042b9:head [set-alloc-hint object_size 4194304 write_size 4194304,write 2093056~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently queued_for_pg
    2017-06-23 17:57:45.583618 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 31.216443 seconds old, received at 2017-06-23 17:57:14.367005: osd_repop(client.4385.1:19397248 0.12 e161/160 0:487dd605:::rb.0.1126.41b71efb.0000000040ee:head v 161'33450) currently write_thread_in_journal_buffer
    2017-06-23 17:57:45.662422 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:45.662438 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:45.788199 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:45.788215 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.383742 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.384301 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.510339 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.510389 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.583808 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 136 slow requests, 5 included below; oldest blocked for > 32.217807 secs
    2017-06-23 17:57:46.583818 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 32.211946 seconds old, received at 2017-06-23 17:57:14.371775: osd_repop_reply(client.4385.1:19397374 0.157 e161/158) currently queued_for_pg
    2017-06-23 17:57:46.583822 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 32.207855 seconds old, received at 2017-06-23 17:57:14.375866: osd_repop(client.4385.1:19397508 0.98 e161/159) currently queued_for_pg
    2017-06-23 17:57:46.583835 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 32.216679 seconds old, received at 2017-06-23 17:57:14.367042: osd_repop(client.4385.1:19397258 0.148 e161/158 0:129db012:::rb.0.1126.41b71efb.000000005916:head v 161'36087) currently write_thread_in_journal_buffer
    2017-06-23 17:57:46.583841 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 32.214817 seconds old, received at 2017-06-23 17:57:14.368904: osd_repop(client.4385.1:19397322 0.139 e161/160 0:9ce1644e:::rb.0.1126.41b71efb.000000004093:head v 161'37011) currently commit_queued_for_journal_write
    2017-06-23 17:57:46.583845 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 32.211900 seconds old, received at 2017-06-23 17:57:14.371821: osd_repop(client.4385.1:19397413 0.1 e161/158) currently queued_for_pg
    2017-06-23 17:57:46.762842 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.762890 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.888847 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:46.888910 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:47.075332 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:47.075515 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:47.584043 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 136 slow requests, 5 included below; oldest blocked for > 33.218037 secs
    2017-06-23 17:57:47.584061 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 33.208030 seconds old, received at 2017-06-23 17:57:14.375921: osd_repop(client.4385.1:19397506 0.c9 e161/158) currently queued_for_pg
    2017-06-23 17:57:47.584067 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 33.215020 seconds old, received at 2017-06-23 17:57:14.368931: osd_repop(client.4385.1:19397300 0.1a6 e161/160 0:65e98bed:::rb.0.1126.41b71efb.000000001df8:head v 161'30571) currently commit_queued_for_journal_write
    2017-06-23 17:57:47.584071 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 33.213694 seconds old, received at 2017-06-23 17:57:14.370257: osd_repop(client.4385.1:19397360 0.dc e161/159 0:3b49b009:::rb.0.1126.41b71efb.000000002001:head v 161'38194) currently commit_queued_for_journal_write
    2017-06-23 17:57:47.584076 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 33.211887 seconds old, received at 2017-06-23 17:57:14.372064: osd_repop_reply(client.4385.1:19397404 0.1c0 e161/160) currently queued_for_pg
    2017-06-23 17:57:47.584086 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 33.207462 seconds old, received at 2017-06-23 17:57:14.376488: osd_repop(client.4385.1:19397521 0.139 e161/160) currently queued_for_pg
    2017-06-23 17:57:47.826345 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:47.826394 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.050087 7fc5adbf8700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.280660 7fc5afb6d700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.280788 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.399393 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.399449 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.584316 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 136 slow requests, 5 included below; oldest blocked for > 34.218277 secs
    2017-06-23 17:57:48.584326 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 34.213798 seconds old, received at 2017-06-23 17:57:14.370393: osd_repop(client.4385.1:19397335 0.15d e161/160 0:ba916a03:::rb.0.1126.41b71efb.0000000023ab:head v 161'43861) currently started
    2017-06-23 17:57:48.584331 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 34.212021 seconds old, received at 2017-06-23 17:57:14.372170: osd_op(client.4385.1:19397427 0.cde20ee5 0:a77047b3:::rb.0.1126.41b71efb.000000002f1a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1126400~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently queued_for_pg
    2017-06-23 17:57:48.584336 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 34.207696 seconds old, received at 2017-06-23 17:57:14.376495: osd_repop(client.4385.1:19397523 0.98 e161/159) currently queued_for_pg
    2017-06-23 17:57:48.584340 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 34.215209 seconds old, received at 2017-06-23 17:57:14.368982: osd_repop(client.4385.1:19397308 0.103 e161/158 0:c0d5ce5b:::rb.0.1126.41b71efb.000000004877:head v 161'32839) currently commit_queued_for_journal_write
    2017-06-23 17:57:48.584357 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 34.213758 seconds old, received at 2017-06-23 17:57:14.370434: osd_repop(client.4385.1:19397338 0.11 e161/159 0:883057a0:::rb.0.1126.41b71efb.000000000f8c:head v 161'32688) currently started
    2017-06-23 17:57:48.775927 7fc5af36c700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.775941 7fc5aeb6b700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 15
    2017-06-23 17:57:48.781356 7fc5af36c700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.12:6808/1090333 conn(0x7fc5d704a800 :-1 s=STATE_OPEN pgs=55 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.781867 7fc5aeb6b700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.11:6802/3673486 conn(0x7fc5d7bd4000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.782263 7fc5af36c700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.12:6800/90330 conn(0x7fc5e05ff800 :-1 s=STATE_OPEN pgs=14 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.782936 7fc5afb6d700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.11:6800/3673479 conn(0x7fc5de22f000 :-1 s=STATE_OPEN pgs=2 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.783035 7fc5afb6d700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.11:6806/3673484 conn(0x7fc5c0730800 :-1 s=STATE_OPEN pgs=5 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.783087 7fc5af36c700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.13:6805/3371268 conn(0x7fc5de22d800 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.783461 7fc5aeb6b700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.11:6804/3673476 conn(0x7fc5d7224000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.785013 7fc5af36c700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.13:6803/3371274 conn(0x7fc5d7605000 :-1 s=STATE_OPEN pgs=1 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.786302 7fc5af36c700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.13:6806/3371276 conn(0x7fc5d71ed000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:48.787066 7fc5aeb6b700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.13:6801/3371278 conn(0x7fc5de230800 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:49.275352 7fc5aeb6b700  0 -- 10.1.1.12:6806/1089579 >> 10.1.1.12:6804/90332 conn(0x7fc5e16f0000 :-1 s=STATE_OPEN pgs=9 cs=1 l=0).fault with nothing to send, going to standby
    2017-06-23 17:57:49.584611 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 136 slow requests, 5 included below; oldest blocked for > 35.218603 secs
    2017-06-23 17:57:49.584623 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 35.212281 seconds old, received at 2017-06-23 17:57:14.372237: osd_repop(client.4385.1:19397423 0.90 e161/158) currently queued_for_pg
    2017-06-23 17:57:49.584632 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 35.207878 seconds old, received at 2017-06-23 17:57:14.376640: osd_repop(client.4385.1:19397524 0.161 e161/160) currently queued_for_pg
    2017-06-23 17:57:49.584657 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 35.212243 seconds old, received at 2017-06-23 17:57:14.372275: osd_repop_reply(client.4385.1:19397404 0.1c0 e161/160) currently queued_for_pg
    2017-06-23 17:57:49.584668 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 35.207420 seconds old, received at 2017-06-23 17:57:14.377098: osd_op(client.4385.1:19397542 0.8c49cce7 0:e7339231:::rb.0.1126.41b71efb.0000000000f1:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3522560~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently queued_for_pg
    2017-06-23 17:57:49.584674 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 35.217294 seconds old, received at 2017-06-23 17:57:14.367223: osd_repop(client.4385.1:19397243 0.85 e161/160 0:a1746f02:::rb.0.1126.41b71efb.000000000b5f:head v 161'36763) currently write_thread_in_journal_buffer
    2017-06-23 17:57:50.584932 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : 136 slow requests, 5 included below; oldest blocked for > 36.218888 secs
    2017-06-23 17:57:50.584941 7fc5ab3bb700  0 log_channel(cluster) log [WRN] : slow request 36.215737 seconds old, received at 2017-06-23 17:57:14.369065: osd_op(client.4385.1:19397340 0.603d872e 0:74e1bc06:::rb.0.1126.41b71efb.00000000033b:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1282048~4096] snapc 0=[] ondisk+write+known_if_redirected e161) currently sub_op_commit_rec from 5
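
For anyone reproducing this, a sketch of standard commands that can capture the stall while it happens (osd.9 and sdd are taken from the host listing above; these are stock ceph admin-socket and sysstat tools):

    # per-disk utilisation at 1s resolution (how the 100% util on sdd shows up)
    iostat -x 1

    # ops currently stuck inside the slow OSD, with per-stage timestamps
    ceph daemon osd.9 dump_ops_in_flight
    ceph daemon osd.9 dump_historic_ops

    # commit/apply latency of every OSD, to see whether only one disk is slow
    ceph osd perf

    # cluster-wide summary of the blocked requests shown in the log above
    ceph health detail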
    

Files

iostat.log (201 KB) iostat.log XuHua Lin, 06/23/2017 10:38 AM
ceph-osd.9.log (807 KB) ceph-osd.9.log XuHua Lin, 06/23/2017 10:40 AM
