Bug #48028
ceph-mon always suffers from lots of slow ops since v14.2.9
Description
root@worker-2:~# docker exec ceph-mon-worker-2 ceph -s
cluster:
id: 299a04ba-dd3e-43a7-af17-628190cf742f
health: HEALTH_WARN
41 slow ops, oldest one blocked for 38274 sec, mon.worker-2 has slow ops
services:
mon: 3 daemons, quorum worker-2,worker-12,worker-22 (age 17m)
mgr: worker-22(active, since 11d), standbys: worker-12, worker-2
mds: cephfs:1 {0=worker-22=up:active} 2 up:standby
osd: 300 osds: 300 up (since 56m), 300 in (since 11d)
task status:
scrub status:
mds.worker-22: idle
data:
pools: 2 pools, 16384 pgs
objects: 78.68M objects, 300 TiB
usage: 645 TiB used, 1.5 PiB / 2.2 PiB avail
pgs: 16384 active+clean
io:
client: 5.1 KiB/s rd, 315 MiB/s wr, 1 op/s rd, 188 op/s wr
History
#1 Updated by Yao Ning 3 months ago
Yao Ning wrote:
root@worker-2:~# docker exec ceph-mon-worker-2 ceph -s
cluster:
id: 299a04ba-dd3e-43a7-af17-628190cf742f
health: HEALTH_WARN
41 slow ops, oldest one blocked for 38274 sec, mon.worker-2 has slow ops
services:
mon: 3 daemons, quorum worker-2,worker-12,worker-22 (age 17m)
mgr: worker-22(active, since 11d), standbys: worker-12, worker-2
mds: cephfs:1 {0=worker-22=up:active} 2 up:standby
osd: 300 osds: 300 up (since 56m), 300 in (since 11d)
task status:
scrub status:
mds.worker-22: idle
data:
pools: 2 pools, 16384 pgs
objects: 78.68M objects, 300 TiB
usage: 645 TiB used, 1.5 PiB / 2.2 PiB avail
pgs: 16384 active+clean
io:
client: 5.1 KiB/s rd, 315 MiB/s wr, 1 op/s rd, 188 op/s wr
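The blocked op below was dumped from the monitor's op tracker; on this deployment it can presumably be reproduced with something along these lines (container and mon names taken from the status output above):

root@worker-2:~# docker exec ceph-mon-worker-2 ceph daemon mon.worker-2 ops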
{
"description": "osd_failure(failed timeout osd.79 [v2:172.168.0.12:6878/46988,v1:172.168.0.12:6879/46988] for 145sec e89966 v89966)",
"initiated_at": "2020-10-28 06:00:45.449880",
"age": 33713.186917503001,
"duration": 30.000211802999999,
"type_data": {
"events": [
{
"time": "2020-10-28 06:00:45.449880",
"event": "initiated"
},
{
"time": "2020-10-28 06:00:45.449880",
"event": "header_read"
},
{
"time": "0.000000",
"event": "throttled"
},
{
"time": "0.000000",
"event": "all_read"
},
{
"time": "0.000000",
"event": "dispatched"
},
{
"time": "2020-10-28 06:00:45.449973",
"event": "mon:_ms_dispatch"
},
{
"time": "2020-10-28 06:00:45.449974",
"event": "mon:dispatch_op"
},
{
"time": "2020-10-28 06:00:45.449974",
"event": "psvc:dispatch"
},
{
"time": "2020-10-28 06:00:45.449991",
"event": "osdmap:preprocess_query"
},
{
"time": "2020-10-28 06:00:45.449993",
"event": "osdmap:preprocess_failure"
},
{
"time": "2020-10-28 06:00:45.450004",
"event": "osdmap:prepare_update"
},
{
"time": "2020-10-28 06:00:45.450005",
"event": "osdmap:prepare_failure"
},
{
"time": "2020-10-28 06:00:45.450018",
"event": "no_reply: send routed request"
},
{
"time": "2020-10-28 06:01:15.449819",
"event": "no_reply: send routed request"
},
{
"time": "2020-10-28 06:01:15.450092",
"event": "done"
}
],
"info": {
"seq": 16028221,
"src_is_mon": false,
"source": "osd.107 v2:172.168.0.5:6881/12793",
"forwarded_to_leader": false
we find that, it is frequently occurs during osd_failures. and the slow ops cannot be auto recovered until we restart the ceph monitor daemon
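For reference, the only workaround we have found is to restart the affected monitor; in this containerized setup that is roughly the following (the systemd form is only a hypothetical equivalent for non-containerized hosts):

root@worker-2:~# docker restart ceph-mon-worker-2
root@worker-2:~# systemctl restart ceph-mon@worker-2   # non-container hosts, hypothetical unit name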
#2 Updated by Yao Ning 3 months ago
Yao Ning wrote:
root@worker-2:~# docker exec ceph-mon-worker-2 ceph -s
cluster:
id: 299a04ba-dd3e-43a7-af17-628190cf742f
health: HEALTH_WARN
41 slow ops, oldest one blocked for 38274 sec, mon.worker-2 has slow ops
services:
mon: 3 daemons, quorum worker-2,worker-12,worker-22 (age 17m)
mgr: worker-22(active, since 11d), standbys: worker-12, worker-2
mds: cephfs:1 {0=worker-22=up:active} 2 up:standby
osd: 300 osds: 300 up (since 56m), 300 in (since 11d)
task status:
scrub status:
mds.worker-22: idle
data:
pools: 2 pools, 16384 pgs
objects: 78.68M objects, 300 TiB
usage: 645 TiB used, 1.5 PiB / 2.2 PiB avail
pgs: 16384 active+clean
io:
client: 5.1 KiB/s rd, 315 MiB/s wr, 1 op/s rd, 188 op/s wr
In the leader mon log:
2020-10-29 03:07:14.980 7f2255491700 0 log_channel(cluster) log [WRN] : Health check update: 41 slow ops, oldest one blocked for 422147 sec, mon.overcloud-controller-0 has slow ops (SLOW_OPS)
2020-10-29 03:07:14.988 7f2255491700 -1 mon.overcloud-controller-0@0(leader) e8 get_health_metrics reporting 41 slow ops, oldest is osd_failure(failed timeout osd.24 [v2:10.0.213.181:6804/4392,v1:10.0.213.181:6806/4392] for 11sec e343763 v343763)
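The SLOW_OPS health check shown above can also be inspected from the CLI; a quick way to see the detail, including which monitor is reporting it (using the same container as in the description), is:

root@worker-2:~# docker exec ceph-mon-worker-2 ceph health detail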