Hi, I took over Yuma's work.
We succeeded in reproducing this problem. Here is the additional input.
Problem
Accessing /swift/healthcheck took a very long time after restarting an index OSD.
Rook regards an RGW daemon as unreachable if accessing /swift/healthcheck fails, or takes more than 1 s, three times in a row. In our case, all three RGW daemons were regarded as down.
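For reference, the endpoint's latency can be observed from outside with a plain HTTP client. A minimal sketch (10.64.19.41 is the RGW address that appears in the logs below; adjust it to your environment):

# Measure the end-to-end latency of /swift/healthcheck.
# In the healthy case this returns in well under a second.
curl -s -o /dev/null -w 'http=%{http_code} total=%{time_total}s\n' \
  http://10.64.19.41/swift/healthcheck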
Our considerations
We suspect that the step where RGW stats the Lua script objects is what takes so long.
We found that RGW accesses the RGW log pool while loading the Lua script, even though we don't use any Lua scripts of our own.
Since the trigger of this phenomenon is the restart of an index OSD, we guess that the response to the above stat request is delayed by other work the OSD performs after it enters the UP state.
At this time, we do not know which task in the OSD is taking the time.
Note that CPU, memory, disk, and network all have sufficient capacity.
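To check this hypothesis by hand, the suspected requests can be issued directly. A sketch, assuming the pool and object names taken from the RGW log sample below (in the healthy case both calls return, or fail with ENOENT, almost instantly):

# Stat the Lua script objects that RGW reads for every request.
# Pool/object names are taken from our RGW log sample below.
time rados -p ceph-object-store-0.rgw.log stat script.prerequest.
time rados -p ceph-object-store-0.rgw.log stat script.postrequest.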
How to reproduce
Restart an index OSD, as sketched below.
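Under Rook, restarting an OSD means deleting its pod and letting the operator recreate it. A sketch, assuming the default rook-ceph namespace and Rook's standard pod labels:

# Restart osd.0 by deleting its pod; the Rook operator recreates it.
kubectl -n rook-ceph delete pod -l app=rook-ceph-osd,ceph-osd-id=0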
Cluster configuration
- `ceph -s`
cluster:
id: b52d5f3d-ba14-442e-a089-0bca47b83758
health: HEALTH_WARN
797 large omap objects
504 pgs not deep-scrubbed in time
504 pgs not scrubbed in time
services:
mon: 3 daemons, quorum f,g,i (age 3d)
mgr: a(active, since 3d), standbys: b
osd: 24 osds: 24 up (since 3d), 24 in (since 4d); 71 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)
data:
pools: 9 pools, 712 pgs
objects: 139.46M objects, 12 TiB
usage: 40 TiB used, 100 TiB / 139 TiB avail
pgs: 21257886/418382853 objects misplaced (5.081%)
641 active+clean
71 active+remapped+backfilling
io:
client: 39 MiB/s rd, 358 KiB/s wr, 5.03k op/s rd, 32 op/s wr
recovery: 2.6 MiB/s, 32 objects/s
- `ceph osd tree`
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 139.32303 root default
-8 44.16434 zone rack0
-19 0.25000 host 10-69-0-19
10 ssd 0.25000 osd.10 up 1.00000 1.00000
-73 7.27739 host 10-69-0-24
23 hdd 7.27739 osd.23 up 1.00000 1.00000
-34 7.27739 host 10-69-0-26
7 hdd 7.27739 osd.7 up 1.00000 1.00000
-22 7.27739 host 10-69-0-27
3 hdd 7.27739 osd.3 up 1.00000 1.00000
-58 7.27739 host 10-69-0-28
12 hdd 7.27739 osd.12 up 1.00000 1.00000
-61 7.27739 host 10-69-0-29
15 hdd 7.27739 osd.15 up 1.00000 1.00000
-49 7.27739 host 10-69-0-30
13 hdd 7.27739 osd.13 up 1.00000 1.00000
-7 0.25000 host 10-69-0-8
1 ssd 0.25000 osd.1 up 1.00000 1.00000
-4 44.16434 zone rack1
-3 0.25000 host 10-69-0-200
0 ssd 0.25000 osd.0 up 1.00000 1.00000
-11 0.25000 host 10-69-0-202
2 ssd 0.25000 osd.2 up 1.00000 1.00000
-46 7.27739 host 10-69-0-214
9 hdd 7.27739 osd.9 up 1.00000 1.00000
-67 7.27739 host 10-69-0-215
19 hdd 7.27739 osd.19 up 1.00000 1.00000
-40 7.27739 host 10-69-0-216
4 hdd 7.27739 osd.4 up 1.00000 1.00000
-55 7.27739 host 10-69-0-218
8 hdd 7.27739 osd.8 up 1.00000 1.00000
-79 7.27739 host 10-69-0-221
22 hdd 7.27739 osd.22 up 1.00000 1.00000
-37 7.27739 host 10-69-0-223
6 hdd 7.27739 osd.6 up 1.00000 1.00000
-26 29.60956 zone rack2
-25 0.25000 host 10-69-1-133
17 ssd 0.25000 osd.17 up 1.00000 1.00000
-85 0.25000 host 10-69-1-134
18 ssd 0.25000 osd.18 up 1.00000 1.00000
-31 0 host 10-69-1-147
-76 0 host 10-69-1-151
-82 0 host 10-69-1-152
-64 7.27739 host 10-69-1-154
14 hdd 7.27739 osd.14 up 1.00000 1.00000
-88 7.27739 host 10-69-1-155
16 hdd 7.27739 osd.16 up 1.00000 1.00000
-52 7.27739 host 10-69-1-156
11 hdd 7.27739 osd.11 up 1.00000 1.00000
-70 0 host 10-69-1-157
-43 7.27739 host 10-69-1-158
5 hdd 7.27739 osd.5 up 1.00000 1.00000
-98 10.69240 zone rack3
-97 10.69240 host 10-69-2-94
21 hdd 10.69240 osd.21 up 1.00000 1.00000
-92 10.69240 zone rack4
-91 0 host 10-69-3-23
-103 10.69240 host 10-69-3-31
20 hdd 10.69240 osd.20 up 1.00000 1.00000
osd.0, 1, 2, 10, 17, and 18 (ssd) are index OSDs; the log pool also uses these OSDs. The other OSDs (hdd) are data OSDs.
There is a large omap warning because we have deliberately not resharded the bucket indexes, in order to avoid the following bug:
https://tracker.ceph.com/issues/57770
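For reference, the per-shard object counts behind the large omap warning can be inspected with the standard admin command:

# Report objects per shard and resharding recommendations per bucket.
radosgw-admin bucket limit check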
Software environment
- ceph: 17.2.5
- rook: 1.10.7
Reproduction steps
1. Restart OSD.0 around 2023 May 11, 04:06 ~ 04:09
2. Restart OSD.1 around 2023 May 11, 04:18 ~ 04:20
3. Restart OSD.1 around 2023 May 11, 04:20 ~ 04:21
(We restarted OSD.1 twice by mistake.)
This problem didn't happen after restarting OSD.0, but it did happen after restarting OSD.1.
Log samples
- index OSD's logs (osd.0)
debug 2023-05-11T04:22:02.936+0000 7f6543198700 10 osd.0 94781 dequeue_op 0x55cf90711080 finish
debug 2023-05-11T04:22:02.889+0000 7f6543198700 7 osd.0 pg_epoch: 94781 pg[5.63( v 94778'10303508 (94715'10300462,94778'10303508] local-lis/les=94780/94781 n=15 ec=24663/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [1,0,17] r=1 lpr=94780 pi=[94770,94780)/1 luod=0'0 crt=94778'10303508 lcod 94778'10303507 mlcod 0'0 active mbc={}] build_push_op 5:c6fb0402:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.728:head v 94778'10303490 size 0 recovery_info: ObjectRecoveryInfo(5:c6fb0402:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.728:head@94778'10303490, size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:{}, object_exist: 1)
debug 2023-05-11T04:22:02.889+0000 7f6543198700 10 osd.0 pg_epoch: 94781 pg[5.63( v 94778'10303508 (94715'10300462,94778'10303508] local-lis/les=94780/94781 n=15 ec=24663/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [1,0,17] r=1 lpr=94780 pi=[94770,94780)/1 luod=0'0 crt=94778'10303508 lcod 94778'10303507 mlcod 0'0 active mbc={}] _handle_message: 0x55cf90711080
debug 2023-05-11T04:22:02.889+0000 7f6543198700 10 osd.0 94781 dequeue_op 0x55cf90711080 prio 196 cost 8389608 latency 0.000049 MOSDPGPull(5.63 e94781/94780 cost 8389608) v3 pg pg[5.63( v 94778'10303508 (94715'10300462,94778'10303508] local-lis/les=94780/94781 n=15 ec=24663/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [1,0,17] r=1 lpr=94780 pi=[94770,94780)/1 luod=0'0 crt=94778'10303508 lcod 94778'10303507 mlcod 0'0 active mbc={}]
debug 2023-05-11T04:22:02.889+0000 7f656445a700 15 osd.0 94781 enqueue_op 0x55cf90711080 prio 196 type 106 cost 8389608 latency 0.000015 epoch 94781 MOSDPGPull(5.63 e94781/94780 cost 8389608) v3
debug 2023-05-11T04:22:02.888+0000 7f6541995700 10 log is not dirty
debug 2023-05-11T04:22:02.888+0000 7f6541995700 10 osd.0 pg_epoch: 94781 pg[3.6( v 94766'4152 (49145'1437,94766'4152] local-lis/les=94780/94781 n=27 ec=62/62 lis/c=94780/94780 les/c/f=94781/94781/0 sis=94780) [1,17,0] r=2 lpr=94780 luod=0'0 crt=94766'4152 lcod 0'0 mlcod 0'0 active mbc={}] proc_lease pg_lease(ru 84.334770203s ub 93.439544678s int 16.000000000s)
debug 2023-05-11T04:22:02.888+0000 7f6541995700 10 osd.0 pg_epoch: 94781 pg[3.6( v 94766'4152 (49145'1437,94766'4152] local-lis/les=94780/94781 n=27 ec=62/62 lis/c=94780/94780 les/c/f=94781/94781/0 sis=94780) [1,17,0] r=2 lpr=94780 luod=0'0 crt=94766'4152 lcod 0'0 mlcod 0'0 active mbc={}] do_peering_event: epoch_sent: 94781 epoch_requested: 94781 MLease epoch 94781 from osd.1 pg_lease(ru 84.334770203s ub 93.439544678s int 16.000000000s)
debug 2023-05-11T04:22:02.888+0000 7f656445a700 15 osd.0 94781 enqueue_peering_evt 3.6 epoch_sent: 94781 epoch_requested: 94781 MLease epoch 94781 from osd.1 pg_lease(ru 84.334770203s ub 93.439544678s int 16.000000000s)
debug 2023-05-11T04:22:02.886+0000 7f65479a1700 7 osd.0 pg_epoch: 94781 pg[5.6a( v 94778'20615431 (94718'20612409,94778'20615431] local-lis/les=94780/94781 n=12 ec=24663/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [1,0,18] r=1 lpr=94780 pi=[94770,94780)/1 luod=0'0 crt=94778'20615431 lcod 94778'20615430 mlcod 0'0 active mbc={}] build_push_op 5:56d41e18:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.169:head v 94773'20615413 size 0 recovery_info: ObjectRecoveryInfo(5:56d41e18:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.169:head@94773'20615413, size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:{}, object_exist: 1)
debug 2023-05-11T04:22:02.886+0000 7f6543999700 7 osd.0 pg_epoch: 94781 pg[5.2( v 94778'19603630 (94668'19600598,94778'19603630] local-lis/les=94780/94781 n=6 ec=74/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [0,18,1] r=0 lpr=94780 pi=[94770,94780)/1 crt=94778'19603630 lcod 94778'19603629 mlcod 0'0 active+recovering+degraded rops=2 mbc={255={(2+0)=3}}] build_push_op 5:40b5d906:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.495:head v 94773'19603618 size 0 recovery_info: ObjectRecoveryInfo(5:40b5d906:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.495:head@94773'19603618, size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:{}, object_exist: 1)
debug 2023-05-11T04:22:02.886+0000 7f6543999700 10 osd.0 pg_epoch: 94781 pg[5.2( v 94778'19603630 (94668'19600598,94778'19603630] local-lis/les=94780/94781 n=6 ec=74/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [0,18,1] r=0 lpr=94780 pi=[94770,94780)/1 crt=94778'19603630 lcod 94778'19603629 mlcod 0'0 active+recovering+degraded rops=2 mbc={255={(2+0)=3}}] pushing more from, 0 of []
debug 2023-05-11T04:22:02.886+0000 7f65479a1700 10 osd.0 pg_epoch: 94781 pg[5.6a( v 94778'20615431 (94718'20612409,94778'20615431] local-lis/les=94780/94781 n=12 ec=24663/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [1,0,18] r=1 lpr=94780 pi=[94770,94780)/1 luod=0'0 crt=94778'20615431 lcod 94778'20615430 mlcod 0'0 active mbc={}] _handle_message: 0x55cf90ddfa20
debug 2023-05-11T04:22:02.886+0000 7f6543999700 10 osd.0 pg_epoch: 94781 pg[5.2( v 94778'19603630 (94668'19600598,94778'19603630] local-lis/les=94780/94781 n=6 ec=74/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [0,18,1] r=0 lpr=94780 pi=[94770,94780)/1 crt=94778'19603630 lcod 94778'19603629 mlcod 0'0 active+recovering+degraded rops=2 mbc={255={(2+0)=3}}] _handle_message: 0x55cf892ce000
debug 2023-05-11T04:22:02.886+0000 7f65479a1700 10 osd.0 94781 dequeue_op 0x55cf90ddfa20 prio 3 cost 8389608 latency 0.000028 MOSDPGPull(5.6a e94781/94780 cost 8389608) v3 pg pg[5.6a( v 94778'20615431 (94718'20612409,94778'20615431] local-lis/les=94780/94781 n=12 ec=24663/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [1,0,18] r=1 lpr=94780 pi=[94770,94780)/1 luod=0'0 crt=94778'20615431 lcod 94778'20615430 mlcod 0'0 active mbc={}]
debug 2023-05-11T04:22:02.886+0000 7f6543999700 10 osd.0 94781 dequeue_op 0x55cf892ce000 prio 196 cost 8389608 latency 0.000078 MOSDPGPushReply(5.2 94781/94780 [PushReplyOp(5:40b5d906:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.495:head)]) v3 pg pg[5.2( v 94778'19603630 (94668'19600598,94778'19603630] local-lis/les=94780/94781 n=6 ec=74/74 lis/c=94780/94770 les/c/f=94781/94771/0 sis=94780) [0,18,1] r=0 lpr=94780 pi=[94770,94780)/1 crt=94778'19603630 lcod 94778'19603629 mlcod 0'0 active+recovering+degraded rops=2 mbc={255={(2+0)=3}}]
debug 2023-05-11T04:22:02.886+0000 7f656445a700 15 osd.0 94781 enqueue_op 0x55cf90ddfa20 prio 3 type 106 cost 8389608 latency 0.000014 epoch 94781 MOSDPGPull(5.6a e94781/94780 cost 8389608) v3
debug 2023-05-11T04:22:02.886+0000 7f656445a700 15 osd.0 94781 enqueue_op 0x55cf892ce000 prio 196 type 107 cost 8389608 latency 0.000024 epoch 94781 MOSDPGPushReply(5.2 94781/94780 [PushReplyOp(5:40b5d906:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.495:head)]) v3
debug 2023-05-11T04:22:02.846+0000 7f655c1d9700 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'ceph-object-store-0.rgw.buckets.index' : 1 ])
debug 2023-05-11T04:22:02.846+0000 7f655c1d9700 -1 osd.0 94781 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.116406765.0:115378 5.15 5:a91b7908:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.194:head [call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=349b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e94781)
debug 2023-05-11T04:22:02.846+0000 7f655c1d9700 10 osd.0 94781 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
debug 2023-05-11T04:22:02.846+0000 7f655c1d9700 10 osd.0 94781 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target 25 obj/sec or 5 MiB/sec
debug 2023-05-11T04:22:02.846+0000 7f655c1d9700 10 osd.0 94781 tick_without_osd_lock
debug 2023-05-11T04:22:02.726+0000 7f65579d0700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1683778922726846, "cf_name": "p-0", "job": 419, "event": "table_file_creation", "file_number": 167071, "file_size": 69614045, "file_checksum": "", "file_checksum_func_name": "Unknown", "table_properties": {"data_size": 67112593, "index_size": 1016223, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 1, "index_value_is_delta_encoded": 1, "filter_size": 1484229, "raw_key_size": 58857540, "raw_average_key_size": 99, "raw_value_size": 31256491, "raw_average_value_size": 52, "num_data_blocks": 17041, "num_entries": 593654, "num_deletions": 469256, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "p-0", "column_family_id": 4, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1681352923, "oldest_key_time": 0, "file_creation_time": 1683778919, "db_id": "feb9ef26-123a-44a9-bef7-17fc4d0f6538", "db_session_id": "I4V7FERQ3A7SUFU4JCOW"}}
debug 2023-05-11T04:22:02.726+0000 7f65579d0700 4 rocksdb: [db/compaction/compaction_job.cc:1516] [p-0] [JOB 419] Generated table #167071: 593654 keys, 69614045 bytes
debug 2023-05-11T04:22:02.685+0000 7f65609e2700 10 osd.0 94781 do_waiters -- finish
debug 2023-05-11T04:22:02.685+0000 7f65609e2700 10 osd.0 94781 do_waiters -- start
debug 2023-05-11T04:22:02.685+0000 7f65609e2700 10 osd.0 94781 tick
debug 2023-05-11T04:22:02.159+0000 7f6542196700 10 log is not dirty
debug 2023-05-11T04:22:02.159+0000 7f6542196700 10 osd.0 pg_epoch: 94781 pg[5.55( v 94778'11153516 (94710'11150497,94778'11153516] local-lis/les=94770/94771 n=18 ec=24663/74 lis/c=94770/94770 les/c/f=94771/94771/0 sis=94770) [18,0,10] r=1 lpr=94770 luod=0'0 crt=94778'11153514 mlcod 94778'11153503 active mbc={}] proc_lease pg_lease(ru 1296569.875000000s ub 1296577.875000000s int 16.000000000s)
debug 2023-05-11T04:22:02.159+0000 7f6542196700 10 osd.0 pg_epoch: 94781 pg[5.55( v 94778'11153516 (94710'11150497,94778'11153516] local-lis/les=94770/94771 n=18 ec=24663/74 lis/c=94770/94770 les/c/f=94771/94771/0 sis=94770) [18,0,10] r=1 lpr=94770 luod=0'0 crt=94778'11153514 mlcod 94778'11153503 active mbc={}] do_peering_event: epoch_sent: 94781 epoch_requested: 94781 MLease epoch 94781 from osd.18 pg_lease(ru 1296569.875000000s ub 1296577.875000000s int 16.000000000s)
debug 2023-05-11T04:22:02.159+0000 7f656445a700 15 osd.0 94781 enqueue_peering_evt 5.55 epoch_sent: 94781 epoch_requested: 94781 MLease epoch 94781 from osd.18 pg_lease(ru 1296569.875000000s ub 1296577.875000000s int 16.000000000s)
debug 2023-05-11T04:22:02.116+0000 7f654619e700 10 log is not dirty
debug 2023-05-11T04:22:02.116+0000 7f654619e700 10 osd.0 pg_epoch: 94781 pg[5.4d( v 94778'12874484 (94746'12871482,94778'12874484] local-lis/les=94770/94771 n=16 ec=24663/74 lis/c=94770/94770 les/c/f=94771/94771/0 sis=94770) [17,10,0] r=2 lpr=94770 luod=0'0 lua=94778'12874480 crt=94778'12874484 lcod 94778'12874483 mlcod 94778'12874482 active mbc={}] proc_lease pg_lease(ru 1297597.875000000s ub 1297606.000000000s int 16.000000000s)
debug 2023-05-11T04:22:02.116+0000 7f654619e700 10 osd.0 pg_epoch: 94781 pg[5.4d( v 94778'12874484 (94746'12871482,94778'12874484] local-lis/les=94770/94771 n=16 ec=24663/74 lis/c=94770/94770 les/c/f=94771/94771/0 sis=94770) [17,10,0] r=2 lpr=94770 luod=0'0 lua=94778'12874480 crt=94778'12874484 lcod 94778'12874483 mlcod 94778'12874482 active mbc={}] do_peering_event: epoch_sent: 94781 epoch_requested: 94781 MLease epoch 94781 from osd.17 pg_lease(ru 1297597.875000000s ub 1297606.000000000s int 16.000000000s)
debug 2023-05-11T04:22:02.116+0000 7f6564c5b700 15 osd.0 94781 enqueue_peering_evt 5.4d epoch_sent: 94781 epoch_requested: 94781 MLease epoch 94781 from osd.17 pg_lease(ru 1297597.875000000s ub 1297606.000000000s int 16.000000000s)
debug 2023-05-11T04:22:02.111+0000 7f654619e700 10 log is not dirty
- RGW's log
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 2 req 17177950262429030953 0.000000000s initializing for trans_id = tx00000ee6456c041691a29-00645c6d7c-6efb411-ceph-object-store-0
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 10 req 17177950262429030953 0.000000000s rgw api priority: s3=8 s3website=7
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 10 req 17177950262429030953 0.000000000s host=10.64.19.41
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 20 req 17177950262429030953 0.000000000s subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 20 req 17177950262429030953 0.000000000s final domain/bucket subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0 s->info.domain= s->info.request_uri=/swift/healthcheck
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 10 req 17177950262429030953 0.000000000s handler=28RGWHandler_SWIFT_HealthCheck
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 2 req 17177950262429030953 0.000000000s getting op 0
debug 2023-05-11T04:22:20.831+0000 7f4b76ec1700 20 req 17177950262429030953 0.000000000s get_system_obj_state: rctx=0x7f4cf04f0770 obj=ceph-object-store-0.rgw.log:script.prerequest. state=0x7f48d4002670 s->prefetch_data=0
#### Took a long time here ####
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 10 req 17177950262429030953 177.471832275s swift:get_health_check scheduling with throttler client=3 cost=1
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 10 req 17177950262429030953 177.471832275s swift:get_health_check op=32RGWGetHealthCheck_ObjStore_SWIFT
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check verifying requester
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check normalizing buckets and tenants
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check init permissions
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check recalculating target
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check reading permissions
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check init op
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check verifying op mask
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 20 req 17177950262429030953 177.471832275s swift:get_health_check required_mask= 1 user.op_mask=7
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check verifying op permissions
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check verifying op params
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check pre-executing
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check check rate limiting
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check executing
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 2 req 17177950262429030953 177.471832275s swift:get_health_check completing
debug 2023-05-11T04:25:18.303+0000 7f4b035da700 20 req 17177950262429030953 177.471832275s get_system_obj_state: rctx=0x7f4cf04f0770 obj=ceph-object-store-0.rgw.log:script.postrequest. state=0x7f48d4002670 s->prefetch_data=0
#### Took a long time here ####
debug 2023-05-11T04:25:44.271+0000 7f4b59686700 2 req 17177950262429030953 203.440002441s swift:get_health_check op status=0
debug 2023-05-11T04:25:44.271+0000 7f4b59686700 2 req 17177950262429030953 203.440002441s swift:get_health_check http status=200
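The stalls stand out when a single request id is followed through the log. A grep/awk sketch, assuming the log file name rgw.log and the whitespace-separated fields shown above:

# Print the timestamp (field 2) and the elapsed-time field (field 7)
# for one request id, to locate where the request stalls.
grep 'req 17177950262429030953' rgw.log | awk '{ print $2, $7 }'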
- ops logs: we ran `ceph daemon osd.X 'dump_historic_ops dump_ops_in_flight ops'` once per 30 seconds during the slow healthcheck (a sketch of the collection loop follows the sample below).
{
"description": "osd_op(client.116406765.0:93517 5.16 5:69274e6f:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.64431775.1.186:head [call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=349b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e94776)",
"initiated_at": "2023-05-11T04:19:23.626475+0000",
"age": 200.544154045,
"duration": 16.333650203000001,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.116406765",
"client_addr": "10.64.14.90:0/1590675739",
"tid": 93517
},
"events": [
{
"event": "initiated",
"time": "2023-05-11T04:19:23.626475+0000",
"duration": 0
},
{
"event": "throttled",
"time": "2023-05-11T04:19:23.626475+0000",
"duration": 0
},
{
"event": "header_read",
"time": "2023-05-11T04:19:23.626475+0000",
"duration": 4294967295.9999995
},
{
"event": "all_read",
"time": "2023-05-11T04:19:23.626479+0000",
"duration": 3.5779999999999999e-06
},
{
"event": "dispatched",
"time": "2023-05-11T04:19:23.626482+0000",
"duration": 3.4319999999999999e-06
},
{
"event": "queued_for_pg",
"time": "2023-05-11T04:19:23.626487+0000",
"duration": 5.361e-06
},
{
"event": "reached_pg",
"time": "2023-05-11T04:19:23.626547+0000",
"duration": 5.9861000000000002e-05
},
{
"event": "waiting for degraded object",
"time": "2023-05-11T04:19:23.626603+0000",
"duration": 5.5455999999999999e-05
},
{
"event": "reached_pg",
"time": "2023-05-11T04:19:38.942878+0000",
"duration": 15.316275408999999
},
{
"event": "waiting for peered",
"time": "2023-05-11T04:19:38.942885+0000",
"duration": 6.404e-06
},
{
"event": "reached_pg",
"time": "2023-05-11T04:19:39.949108+0000",
"duration": 1.00622308
},
{
"event": "started",
"time": "2023-05-11T04:19:39.949196+0000",
"duration": 8.8764000000000006e-05
},
{
"event": "waiting for subops from 18",
"time": "2023-05-11T04:19:39.951455+0000",
"duration": 0.0022586799999999999
},
{
"event": "op_commit",
"time": "2023-05-11T04:19:39.952883+0000",
"duration": 0.0014283939999999999
},
{
"event": "sub_op_commit_rec",
"time": "2023-05-11T04:19:39.960050+0000",
"duration": 0.0071662039999999998
},
{
"event": "commit_sent",
"time": "2023-05-11T04:19:39.960093+0000",
"duration": 4.3247000000000001e-05
},
{
"event": "done",
"time": "2023-05-11T04:19:39.960126+0000",
"duration": 3.2756999999999998e-05
}
]
}
},
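A minimal sketch of such a collection loop (the OSD id list and output names are examples; `ceph daemon` has to run where each OSD's admin socket lives, e.g. via `kubectl exec` under Rook):

# Dump op state from every index OSD once per 30 seconds.
while true; do
  ts=$(date +%s)
  for id in 0 1 2 10 17 18; do
    ceph daemon osd.$id dump_historic_ops  > historic.osd$id.$ts.json
    ceph daemon osd.$id dump_ops_in_flight > inflight.osd$id.$ts.json
    ceph daemon osd.$id ops                > ops.osd$id.$ts.json
  done
  sleep 30
done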
In the above logs, the log level is 15 for the index OSD daemons and 20 for the RGW daemons. We couldn't set the OSD daemons' log level to 20 since, due to a limitation of our system, these daemons crash after an index OSD restart at that level.
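For completeness, the levels above correspond to settings of roughly this form (a sketch using the standard `ceph config` interface; the subsystem choice is ours):

# Raise OSD logging to 15 and RGW logging to 20.
ceph config set osd debug_osd 15/15
ceph config set client.rgw debug_rgw 20/20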
We have also captured the following logs:
- Full logs of all index OSDs.
- Full logs of all RGWs.
- Complete results of `ceph daemon osd.X 'dump_historic_ops dump_ops_in_flight ops'`.
However, we haven't attached them to this message because they are large. Please let us know if you need them.