
Bug #48028

ceph-mon always suffers lots of slow ops since v14.2.9

Added by Yao Ning 3 months ago. Updated 3 months ago.

Status: New
Priority: High
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

root@worker-2:~# docker exec ceph-mon-worker-2 ceph -s
  cluster:
    id:     299a04ba-dd3e-43a7-af17-628190cf742f
    health: HEALTH_WARN
            41 slow ops, oldest one blocked for 38274 sec, mon.worker-2 has slow ops

  services:
    mon: 3 daemons, quorum worker-2,worker-12,worker-22 (age 17m)
    mgr: worker-22(active, since 11d), standbys: worker-12, worker-2
    mds: cephfs:1 {0=worker-22=up:active} 2 up:standby
    osd: 300 osds: 300 up (since 56m), 300 in (since 11d)

  task status:
    scrub status:
        mds.worker-22: idle

  data:
    pools:   2 pools, 16384 pgs
    objects: 78.68M objects, 300 TiB
    usage:   645 TiB used, 1.5 PiB / 2.2 PiB avail
    pgs:     16384 active+clean

  io:
    client: 5.1 KiB/s rd, 315 MiB/s wr, 1 op/s rd, 188 op/s wr
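
The individual blocked ops can be dumped through the monitor's admin socket for closer inspection; a minimal sketch, assuming the containerized mon shown above:

root@worker-2:~# docker exec ceph-mon-worker-2 ceph daemon mon.worker-2 ops

The op trace quoted in comment #1 below appears to be a single entry from such a dump.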

History

#1 Updated by Yao Ning 3 months ago

{
    "description": "osd_failure(failed timeout osd.79 [v2:172.168.0.12:6878/46988,v1:172.168.0.12:6879/46988] for 145sec e89966 v89966)",
    "initiated_at": "2020-10-28 06:00:45.449880",
    "age": 33713.186917503001,
    "duration": 30.000211802999999,
    "type_data": {
        "events": [
            { "time": "2020-10-28 06:00:45.449880", "event": "initiated" },
            { "time": "2020-10-28 06:00:45.449880", "event": "header_read" },
            { "time": "0.000000", "event": "throttled" },
            { "time": "0.000000", "event": "all_read" },
            { "time": "0.000000", "event": "dispatched" },
            { "time": "2020-10-28 06:00:45.449973", "event": "mon:_ms_dispatch" },
            { "time": "2020-10-28 06:00:45.449974", "event": "mon:dispatch_op" },
            { "time": "2020-10-28 06:00:45.449974", "event": "psvc:dispatch" },
            { "time": "2020-10-28 06:00:45.449991", "event": "osdmap:preprocess_query" },
            { "time": "2020-10-28 06:00:45.449993", "event": "osdmap:preprocess_failure" },
            { "time": "2020-10-28 06:00:45.450004", "event": "osdmap:prepare_update" },
            { "time": "2020-10-28 06:00:45.450005", "event": "osdmap:prepare_failure" },
            { "time": "2020-10-28 06:00:45.450018", "event": "no_reply: send routed request" },
            { "time": "2020-10-28 06:01:15.449819", "event": "no_reply: send routed request" },
            { "time": "2020-10-28 06:01:15.450092", "event": "done" }
        ],
        "info": {
            "seq": 16028221,
            "src_is_mon": false,
            "source": "osd.107 v2:172.168.0.5:6881/12793",
            "forwarded_to_leader": false
        }
    }
}

As the op dump above shows, this happens frequently during osd_failure handling, and the slow ops are never cleared automatically; they only go away after we restart the ceph-mon daemon.
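
A minimal sketch of that restart workaround, assuming the containerized deployment shown above (a systemd-managed mon would use systemctl restart ceph-mon@worker-2 instead):

root@worker-2:~# docker restart ceph-mon-worker-2
root@worker-2:~# docker exec ceph-mon-worker-2 ceph -s    # SLOW_OPS warning clears once the mon rejoins quorum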

#2 Updated by Yao Ning 3 months ago

In the leader mon log:

2020-10-29 03:07:14.980 7f2255491700 0 log_channel(cluster) log [WRN] : Health check update: 41 slow ops, oldest one blocked for 422147 sec, mon.overcloud-controller-0 has slow ops (SLOW_OPS)
2020-10-29 03:07:14.988 7f2255491700 -1 mon.overcloud-controller-0@0(leader) e8 get_health_metrics reporting 41 slow ops, oldest is osd_failure(failed timeout osd.24 [v2:10.0.213.181:6804/4392,v1:10.0.213.181:6806/4392] for 11sec e343763 v343763)
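
To check whether the stuck ops on the leader are all osd_failure messages, the in-flight op dump can be filtered on its description field; a minimal sketch, run from inside the leader's mon container (mon name taken from the log lines above):

ceph daemon mon.overcloud-controller-0 ops | grep -c osd_failure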
