Project

General

Profile

Actions

Bug #59104

closed

All access to rgw took over 100 seconds for a long time

Added by Yuma Ogami about 1 year ago. Updated 8 months ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All access to rgw took over 100 seconds for more than 10 minutes. I encountered this problem several times.
At that time, ceph status looks as following:

```
cybozu@stage0-boot-0:~$ kubectl exec n ceph-object-store deploy/rook-ceph-tools - ceph -s
cluster:
id: b52d5f3d-ba14-442e-a089-0bca47b83758
health: HEALTH_WARN
80 pgs not deep-scrubbed in time
79 pgs not scrubbed in time
17 slow ops, oldest one blocked for 856 sec, daemons [osd.0,osd.1,osd.17,osd.2] have slow ops.

services:
mon: 3 daemons, quorum b,f,g (age 70m)
mgr: a(active, since 11h), standbys: b
osd: 24 osds: 24 up (since 12m), 24 in (since 14m); 15 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)
data:
pools: 9 pools, 336 pgs
objects: 104.09M objects, 7.9 TiB
usage: 25 TiB used, 114 TiB / 139 TiB avail
pgs: 10043255/312255237 objects misplaced (3.216%)
291 active+clean
26 active+recovering
15 active+remapped+backfilling
3 active+recovering+laggy
1 active+clean+laggy
io:
client: 70 KiB/s rd, 19 KiB/s wr, 66 op/s rd, 16 op/s wr
recovery: 2.5 MiB/s, 29.44k keys/s, 27 objects/s
```

Here is the latency of healthcheck (i.e. GET "/swift/healthcheck" request).

```
...
2023-03-15 12:49:22
debug 2023-03-15T03:49:22.364+0000 7fed946e0700 1 beast: 0x7fee4416a710: 10.69.3.9 - - [15/Mar/2023:03:49:22.363 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000sShow context
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 1 beast: 0x7ff83c2ed710: 10.69.1.134 - - [15/Mar/2023:03:49:23.400 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000s
2023-03-15 12:49:31
debug 2023-03-15T03:49:31.934+0000 7f5cdfe4f700 1 beast: 0x7f5dd85f3710: 10.69.2.83 - - [15/Mar/2023:03:49:31.934 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000s
2023-03-15 12:52:20
debug 2023-03-15T03:52:20.396+0000 7ff7b8f61700 1 beast: 0x7ff8305f3710: 10.69.1.134 - - [15/Mar/2023:03:49:33.402 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=166.993789673s
2023-03-15 12:52:20
debug 2023-03-15T03:52:20.396+0000 7ff757e9f700 1 beast: 0x7ff818597710: 10.69.1.134 - - [15/Mar/2023:03:49:54.402 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=145.994049072s
2023-03-15 12:52:20
debug 2023-03-15T03:52:20.396+0000 7ff751692700 1 beast: 0x7ff8240e9710: 10.69.1.134 - - [15/Mar/2023:03:49:53.401 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=146.994079590s
2023-03-15 12:52:20
debug 2023-03-15T03:52:20.396+0000 7ff7cc788700 1 beast: 0x7ff818414710: 10.69.1.134 - - [15/Mar/2023:03:50:13.400 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=126.995399475s
...
```

It looks fine before 03:52:20. But after that, the latency got too bad. Other requests like PUT took long time too.

When healthcheck took long time, the log looked different from the normal case.

  • normal case

```
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 1 ====== starting new request req=0x7ff83c2ed710 =====
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s initializing for trans_id = tx00000973bdb8c58a8c2c8-0064114043-663ba9a-ceph-object-store-0
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s getting op 0
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check verifying requester
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check normalizing buckets and tenants
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check init permissions
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check recalculating target
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check reading permissions
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check init op
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check verifying op mask
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check verifying op permissions
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check verifying op params
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check pre-executing
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check check rate limiting
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check executing
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check completing
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check op status=0
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 2 req 10897545119182406344 0.000000000s swift:get_health_check http status=200
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 1 ====== req done req=0x7ff83c2ed710 op status=0 http_status=200 latency=0.000000000s ======
2023-03-15 12:49:23
debug 2023-03-15T03:49:23.401+0000 7ff73a664700 1 beast: 0x7ff83c2ed710: 10.69.1.134 - - [15/Mar/2023:03:49:23.400 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000s
```

  • problematic case

```
2023-03-15 12:50:33
debug 2023-03-15T03:50:33.402+0000 7ff70be07700 1 ====== starting new request req=0x7ff8d80d1710 =====
2023-03-15 12:50:33
debug 2023-03-15T03:50:33.402+0000 7ff70be07700 2 req 10795295110652223112 0.000000000s initializing for trans_id = tx0000095d097b50befde88-0064114089-663ba9a-ceph-object-store-0
2023-03-15 12:50:33
debug 2023-03-15T03:50:33.402+0000 7ff70be07700 2 req 10795295110652223112 0.000000000s getting op 0
2023-03-15 12:50:37
debug 2023-03-15T03:50:37.801+0000 7ff8dbfe7700 2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
2023-03-15 12:50:37
debug 2023-03-15T03:50:37.801+0000 7ff8dbfe7700 2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: pruning old generations
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check verifying requester
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check normalizing buckets and tenants
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check init permissions
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff7c0770700 2 req 1034626586565520860 63.837211609s s3:get_obj verifying requester
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check recalculating target
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check reading permissions
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check init op
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check verifying op mask
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check verifying op permissions
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check verifying op params
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check pre-executing
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check check rate limiting
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check executing
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff737e5f700 2 req 14124238834200714628 64.880249023s swift:get_health_check completing
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff788700700 2 req 14745705746987820772 63.338191986s s3:get_obj verifying requester
2023-03-15 12:50:38
debug 2023-03-15T03:50:38.282+0000 7ff753e97700 2 req 6474228846771742968 62.464164734s s3:get_obj verifying requester
...
```

  • environment information:

- ceph: 17.2.5
- rook: 1.10.7

  • attached files

- rgw-all.log.xz: the combined log of all rgw daemons (12:40:00 - 11:53:00)
- rgw-{0,1,2}.log.xz: the log of each rgw daemon


Files

rgw-all.log.xz (501 KB) rgw-all.log.xz Yuma Ogami, 03/20/2023 06:47 AM
rgw-0.log.xz (137 KB) rgw-0.log.xz Yuma Ogami, 03/20/2023 06:51 AM
rgw-1.log.xz (171 KB) rgw-1.log.xz Yuma Ogami, 03/20/2023 06:51 AM
rgw-2.log.xz (137 KB) rgw-2.log.xz Yuma Ogami, 03/20/2023 06:51 AM
Actions

Also available in: Atom PDF