Bug #49116

Write I/O continuously at high occupancy

Added by liu yongqing 23 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Development
Tags:
14.2.2
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Ceph's status is healthy. No other processes are running on the server; only Ceph is running. However, Ceph's I/O usage is very high: write I/O has reached 60 MB/s. The I/O drops when the MDS is turned off.

mds's log
```
debug 2021-02-03 01:50:06.802 7f26f125d700 1 mds.0.273579 Evicting (and blacklisting) client session 11211836 (172.76.43.172:0/2716102586)
debug 2021-02-03 01:50:06.802 7f26f125d700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 11211836 (172.76.43.172:0/2716102586)
debug 2021-02-03 02:01:16.815 7f26f125d700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 34.258067 secs
debug 2021-02-03 02:01:16.815 7f26f125d700 0 log_channel(cluster) log [WRN] : slow request 34.258066 seconds old, received at 2021-02-03 02:00:42.557504: client_request(client.11054100:41355457 getattr Xs #0x1000157d8ac 2021-02-03 02:00:42.557059 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting
debug 2021-02-03 02:01:16.815 7f26f125d700 0 log_channel(cluster) log [WRN] : slow request 34.201902 seconds old, received at 2021-02-03 02:00:42.613668: client_request(client.3477490:60044653 getattr pAsLsXsFs #0x1000157d8ac 2021-02-03 02:00:42.612749 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting
debug 2021-02-03 02:01:19.331 7f26f3a62700 1 mds.myfs-a Updating MDS map to version 273587 from mon.1
```
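
As a quick diagnostic aid (not part of Ceph itself), the slow-request ages in log lines like the ones above can be extracted with a small script; `slow_request_ages` and its regex are hypothetical helpers for illustration:

```python
import re

# Hypothetical helper: pull the age (in seconds) of each "slow request"
# warning out of MDS log text, so the oldest blocked request stands out.
SLOW_RE = re.compile(r"slow request ([\d.]+) seconds old")

def slow_request_ages(log_text):
    """Return the ages, in seconds, of all slow requests found in the log."""
    return [float(age) for age in SLOW_RE.findall(log_text)]

log = (
    "slow request 34.258066 seconds old, received at 2021-02-03 02:00:42.557504\n"
    "slow request 34.201902 seconds old, received at 2021-02-03 02:00:42.613668\n"
)
ages = slow_request_ages(log)
print(max(ages))  # age of the oldest blocked request found in the log
```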

mon's log
```
cluster 2021-02-03 02:28:13.127823 mgr.a (mgr.11169429) 166757 : cluster [DBG] pgmap v167807: 1000 pgs: 1000 active+clean; 776 GiB data, 2.3 TiB used, 3.1 TiB / 5.4 TiB avail; 293 KiB/s rd, 66 MiB/s wr, 177 op/s
cluster 2021-02-03 02:28:15.129717 mgr.a (mgr.11169429) 166758 : cluster [DBG] pgmap v167808: 1000 pgs: 1000 active+clean; 776 GiB data, 2.3 TiB used, 3.1 TiB / 5.4 TiB avail; 252 KiB/s rd, 60 MiB/s wr, 163 op/s
cluster 2021-02-03 02:28:17.132390 mgr.a (mgr.11169429) 166759 : cluster [DBG] pgmap v167809: 1000 pgs: 1000 active+clean; 776 GiB data, 2.3 TiB used, 3.1 TiB / 5.4 TiB avail; 249 KiB/s rd, 79 MiB/s wr, 166 op/s
```
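
The pgmap lines above can also be mined for the cluster-wide write rate over time; a hypothetical sketch (the sample lines are abbreviated from the log above):

```python
import re

# Hypothetical sketch: extract the write rate from pgmap log lines like the
# ones above, to watch how the cluster-wide write load evolves over time.
WR_RE = re.compile(r"([\d.]+) MiB/s wr")

pgmap_lines = [
    "pgmap v167807: ... 293 KiB/s rd, 66 MiB/s wr, 177 op/s",
    "pgmap v167808: ... 252 KiB/s rd, 60 MiB/s wr, 163 op/s",
    "pgmap v167809: ... 249 KiB/s rd, 79 MiB/s wr, 166 op/s",
]
wr_rates = [float(WR_RE.search(line).group(1)) for line in pgmap_lines]
print(sum(wr_rates) / len(wr_rates))  # average write rate in MiB/s
```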
mgr's log
```
debug 2021-02-03 02:31:37.501 7f4817bb3700 0 mgr[prometheus] failed listing pool cinder: [errno 2] error opening pool 'cinder'
debug 2021-02-03 02:31:37.501 7f4817bb3700 0 mgr[prometheus] failed listing pool glance: [errno 2] error opening pool 'glance'
debug 2021-02-03 02:31:37.502 7f4817bb3700 0 mgr[prometheus] failed listing pool nova: [errno 2] error opening pool 'nova'
::ffff:172.20.51.19 - - [03/Feb/2021:02:31:37] "GET /metrics HTTP/1.1" 200 94093 "" "Prometheus/2.9.2"
::ffff:172.20.1.1 - - [03/Feb/2021:02:31:38] "GET /api/summary HTTP/1.1" 200 249 "https://172.20.51.17:8443/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0"
debug 2021-02-03 02:31:39.348 7f4821e9a700 0 log_channel(cluster) log [DBG] : pgmap v167910: 1000 pgs: 1000 active+clean; 775 GiB data, 2.3 TiB used, 3.1 TiB / 5.4 TiB avail; 417 KiB/s rd, 83 MiB/s wr, 419 op/s
```

ceph's status
```
# ceph -s
  cluster:
    id:     01e525c3-75ac-4e60-8080-b46f3579cab3
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            527 pgs not deep-scrubbed in time
            386 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum a,c,d (age 6d)
    mgr: a(active, since 3d)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 3d), 3 in (since 6d)
         flags noscrub,nodeep-scrub
    rgw: 1 daemon active (prometheus.store.a)

  data:
    pools:   10 pools, 1000 pgs
    objects: 2.98M objects, 776 GiB
    usage:   2.3 TiB used, 3.1 TiB / 5.4 TiB avail
    pgs:     1000 active+clean

  io:
    client: 286 KiB/s rd, 69 MiB/s wr, 3 op/s rd, 182 op/s wr
```
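
One back-of-the-envelope check worth making from the `ceph -s` output: the implied average client write size. All numbers below come from the report; the calculation itself is just an illustration:

```python
# Back-of-the-envelope check using the client io figures from `ceph -s` above
# (69 MiB/s write throughput at 182 write op/s).
wr_bytes_per_s = 69 * 1024 * 1024   # 69 MiB/s client writes
wr_ops_per_s = 182                  # 182 write op/s

avg_write_kib = wr_bytes_per_s / wr_ops_per_s / 1024
print(round(avg_write_kib))  # implied average write size, in KiB per op
```

An implied average write of several hundred KiB per op points at large data writes rather than a storm of small metadata updates, which may help narrow down which client is generating the load.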

Environment:

OS (e.g. from /etc/os-release): CentOS Linux release 7.6.1810 (Core)
Kernel (e.g. uname -a): 3.10.0-1127.el7.x86_64
Cloud provider or hardware configuration: 40c 256G
Rook version (use rook version inside of a Rook Pod): v1.0.6
Storage backend version (e.g. for ceph do ceph -v): 14.2.1
Kubernetes version (use kubectl version): v1.11.10
Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

Through iotop, the high write rates of the ceph-osd threads can be seen:
```
23605 be/4 root 0.00 B/s 69.86 M/s 0.00 % 0.00 % ceph-osd --foreground --id 1 --osd-uuid 9a3bfe74-1169-4a~cluster ceph --default-log-to-file false [journal_write]
23608 be/4 root 0.00 B/s 34.80 M/s 0.00 % 0.00 % ceph-osd --foreground --id 1 --osd-uuid 9a3bfe74-1169-4a~-cluster ceph --default-log-to-file false [tp_fstore_op]
23609 be/4 root 0.00 B/s 34.73 M/s 0.00 % 0.00 % ceph-osd --foreground --id 1 --osd-uuid 9a3bfe74-1169-4a~-cluster ceph --default-log-to-file false [tp_fstore_op]
```
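
A rough tally of the rates above (a hypothetical parsing sketch; the iotop lines are abbreviated):

```python
import re

# Hypothetical sketch: sum the per-thread "M/s" DISK WRITE rates from the
# iotop output above to estimate total ceph-osd disk write throughput.
RATE_RE = re.compile(r"([\d.]+) M/s")

iotop_lines = (
    "23605 be/4 root 0.00 B/s 69.86 M/s 0.00 % 0.00 % ceph-osd ... [journal_write]\n"
    "23608 be/4 root 0.00 B/s 34.80 M/s 0.00 % 0.00 % ceph-osd ... [tp_fstore_op]\n"
    "23609 be/4 root 0.00 B/s 34.73 M/s 0.00 % 0.00 % ceph-osd ... [tp_fstore_op]\n"
)
total = sum(float(rate) for rate in RATE_RE.findall(iotop_lines))
print(round(total, 2))  # total disk write rate across the listed threads, in M/s
```

Note that the `journal_write` rate roughly equals the combined `tp_fstore_op` rate, which is consistent with FileStore writing each object twice (once to the journal, then to the object store), so OSD disk writes of roughly twice the client write rate are expected on FileStore-backed OSDs.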

What caused Ceph's I/O to be so high?
