Project

General

Profile

Feature #10198

PG removal occupy the disk thread several hours

Added by Zhi Zhang about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

We found an issue after we enable scrubbing/deep-scrubbing when doing recovering. The phenomenon is that all the radosgw threads get hung gradually in few hours, so all the requests get 500 error eventually.

After we turn on scrubing/deep-scrubibing, there are some .rgw.bucket.index pool's PGs starting to scrub/deep-scrub. But during the
whole time of scrubbing those objects, they can't be written. And for our case, those objects here are bucket index objects, so all the writes to those buckets will be blocked at updating bucket index info.

chunky scrub will exit temporarily to wait for replicas to build scrub map. When replicas finishes the job, it will send sub_op_scrub_map OP back to primary OSD. This OP will be processed by op_tp firstly and then queued into scrub_wq, so disk_tp can continue the left work.

Our problem is that after sub_op_scrub_map OP is processed and queued into scrub_wq, disk_tp doesn't continue to process for about 1.5 hrs. So all the writes to those buckets become slow requests, radosgw threads start to hang, eventually spread to all the radosgw threads.

Once chunk scrub exits, disk_tp turns to process other events. For our case, it is processing pg removing event of EC pool. Unfortunately it
takes very long time to process such event, because in pg removing, it seems to loop all the objects under each directory, but 32 to 64 one time.

One simple solution is to use different thread pool to process those different events which might impact each other?

History

#1 Updated by Zhi Zhang about 6 years ago

scrub exits to wait for replicas to build scrub map:
------------------------------------------------
2014-11-27 04:52:11.336699 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] starting a
new chunky scrub
2014-11-27 04:52:11.336735 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] scrub start
2014-11-27 04:52:11.336758 7f288658e700 10
filestore(/home/y/var/lib/ceph/osd/ceph-469) collection_list_partial:
4.65b_head
2014-11-27 04:52:11.336778 7f288658e700 20 _collection_list_partial
0//0//-1/0/0 5-25 ls.size 0
...
2014-11-27 04:52:11.337033 7f288658e700 20  prefixes
B5608CEF,B5624762,B563D6FD,B56830C7,B5689025,B56AA0EE,B56B1F95,B56B2A1D,B5E35DE5,B5E5B3E6,B5E86E85,B5E8A3D5,B5EC883E,B5ECADEE,B5ED8453,B5EEFABE
2014-11-27 04:52:11.337101 7f288658e700 20
filestore(/home/y/var/lib/ceph/osd/ceph-469) objects:
[fec8065b/.dir.default.506102.4121/head//4,2674265b/.dir.default.1177618.958/head//4,df6d365b/.dir.default.1177618.284/head//4,7c03865b/.dir.default.1150130.745/head//4,5209865b/.dir.default.1177609.149/head//4,ee0aa65b/.dir.default.579855.7389/head//4,59f1b65b/.dir.default.1177618.1303/head//4,d1a2b65b/.dir.default.1150124.690/head//4,5ed53e5b/.dir.default.1150127.1797/head//4,6e3b5e5b/.dir.default.341342.2359/head//4,58e68e5b/.dir.default.1177618.297/head//4,5d3a8e5b/.dir.default.506102.318/head//4,e388ce5b/.dir.default.1177612.690/head//4,eedace5b/.dir.default.1177612.21/head//4,3548de5b/.dir.default.7291.5020/head//4,ebafee5b/.dir.default.29329.1440/head//4]
2014-11-27 04:52:11.339286 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] scrub 
requesting scrubmap from osd.99
2014-11-27 04:52:11.339339 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] scrub 
requesting scrubmap from osd.459
2014-11-27 04:52:11.339367 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
build_scrub_map
2014-11-27 04:52:11.339379 7f288658e700 10
filestore(/home/y/var/lib/ceph/osd/ceph-469) collection_list_partial:
4.65b_head
2014-11-27 04:52:11.339384 7f288658e700 20 _collection_list_partial 0//0//-1
32-64 ls.size 0
2014-11-27 04:52:11.339588 7f288658e700 20  prefixes
B5608CEF,B5624762,B563D6FD,B56830C7,B5689025,B56AA0EE,B56B1F95,B56B2A1D,B5E35DE5,B5E5B3E6,B5E86E85,B5E8A3D5,B5EC883E,B5ECADEE,B5ED8453,B5EEFABE
2014-11-27 04:52:11.339654 7f288658e700 20
filestore(/home/y/var/lib/ceph/osd/ceph-469) objects:
[fec8065b/.dir.default.506102.4121/head//4,2674265b/.dir.default.1177618.958/head//4,df6d365b/.dir.default.1177618.284/head//4,7c03865b/.dir.default.1150130.745/head//4,5209865b/.dir.default.1177609.149/head//4,ee0aa65b/.dir.default.579855.7389/head//4,59f1b65b/.dir.default.1177618.1303/head//4,d1a2b65b/.dir.default.1150124.690/head//4,5ed53e5b/.dir.default.1150127.1797/head//4,6e3b5e5b/.dir.default.341342.2359/head//4,58e68e5b/.dir.default.1177618.297/head//4,5d3a8e5b/.dir.default.506102.318/head//4,e388ce5b/.dir.default.1177612.690/head//4,eedace5b/.dir.default.1177612.21/head//4,3548de5b/.dir.default.7291.5020/head//4,ebafee5b/.dir.default.29329.1440/head//4]
2014-11-27 04:52:11.339710 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] _scan_list
scanning 16 objects deeply
...
2014-11-27 04:52:21.859629 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
build_scrub_map_chunk done.
2014-11-27 04:52:21.859720 7f288658e700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] wait for
replicas to build scrub map
------------------------------------------------

osd_sub_scrub_map OP comes back quickly:
------------------------------------------------
2014-11-27 04:52:21.893323 7f2888592700 10 osd.469 43799 dequeue_op 0x1730f4a0
prio 127 cost 6420 latency 7.496690 osd_sub_op(unknown.0.0:0 4.65b 0//0//-1
[scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 pg pg[4.65b( v 43775'1003452
(43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c 43311/43311
43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450 lcod
43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
...
2014-11-27 04:52:21.893511 7f2888592700 10 osd.469 pg_epoch: 43797 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] map version
is 43775'1003452
...
2014-11-27 04:52:21.893534 7f2888592700 10 osd.469 43799 dequeue_op 0x1730f4a0
finish
2014-11-27 04:52:21.903364 7f2889594700 10 osd.469 43799 dequeue_op 0x117d6d20
prio 127 cost 6420 latency 6.399532 osd_sub_op(unknown.0.0:0 4.65b 0//0//-1
[scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 pg pg[4.65b( v 43775'1003452
(43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c 43311/43311
43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450 lcod
43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
...
2014-11-27 04:52:21.903423 7f2889594700 10 osd.469 pg_epoch: 43799 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]  got 459
scrub map
...
014-11-27 04:52:21.903483 7f2889594700 10 osd.469 pg_epoch: 43799 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] map version
is 43775'1003452
2014-11-27 04:52:21.903495 7f2889594700 10 osd.469 43799 dequeue_op 0x117d6d20
finish
...
------------------------------------------------

disk_tp to process pg removing event:
------------------------------------------------
2014-11-27 04:52:11.336758 7f288658e700 10
filestore(/home/y/var/lib/ceph/osd/ceph-469) collection_list_partial:
4.65b_head
2014-11-27 04:52:11.336778 7f288658e700 20 _collection_list_partial
0//0//-1/0/0 5-25 ls.size 0
2014-11-27 04:52:11.337033 7f288658e700 20  prefixes
B5608CEF,B5624762,B563D6FD,B56830C7,B5689025,B56AA0EE,B56B1F95,B56B2A1D,B5E35DE5,B5E5B3E6,B5E86E85,B5E8A3D5,B5EC883E,B5ECADEE,B5ED8453,B5EEFABE
...
2014-11-27 06:30:04.060618 7f288658e700 10
filestore(/home/y/var/lib/ceph/osd/ceph-469) collection_list_partial:
3.264s0_head
2014-11-27 06:30:04.060629 7f288658e700 20 _collection_list_partial
3f78e264/default.1177618.14_mon03c008.cos.gq1.yahoo.com_fe4a94032b486ec449788c1cc6265024/head//3/18446744073709551615/0
32-64 ls.size 0
2014-11-27 06:30:04.060688 7f288658e700 20  prefixes 4
2014-11-27 06:30:04.060709 7f288658e700 20  prefixes 46
2014-11-27 06:30:04.060726 7f288658e700 20  prefixes 462
2014-11-27 06:30:04.060758 7f288658e700 20  prefixes 462E
2014-11-27 06:30:04.060810 7f288658e700 20  prefixes
462E8,462E9,462EA,462EB,462EC,462ED,462EE,462EF
...
------------------------------------------------

disk_up continues to process scrub:
------------------------------------------------
2014-11-27 06:34:46.196360 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
scrub_compare_maps has maps, analyzing
2014-11-27 06:34:46.196475 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] scrub 
comparing replica scrub maps
2014-11-27 06:34:46.196488 7f288658e700  2 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] scrub  
osd.469 has 16 items
2014-11-27 06:34:46.196500 7f288658e700  2 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] scrub
replica 99 has 16 items
...
2014-11-27 06:34:46.196655 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
be_select_auth_object: selecting osd 459 for obj
fec8065b/.dir.default.506102.4121/head//4
2014-11-27 06:34:46.196669 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
be_select_auth_object: selecting osd 469 for obj
fec8065b/.dir.default.506102.4121/head//4
...
2014-11-27 06:34:46.197471 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
be_select_auth_object: selecting osd 469 for obj
ebafee5b/.dir.default.29329.1440/head//4
2014-11-27 06:34:46.197494 7f288658e700  2 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
2014-11-27 06:34:46.197506 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] _scrub
2014-11-27 06:34:46.197580 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] _scrub
(deep-scrub) finish
2014-11-27 06:34:46.198070 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep] deep-scrub
got 16/16 objects, 0/0 clones, 16/16 dirty, 16/16 omap, 0/0 hit_set_archive,
0/0 bytes.
2014-11-27 06:34:46.198089 7f288658e700 10 osd.469 pg_epoch: 43935 pg[4.65b( v
43775'1003452 (43547'1000451,43775'1003452] local-les=43311 n=16 ec=457 les/c
43311/43311 43260/43310/43310) [469,459,99] r=0 lpr=43310 crt=43769'1003450
lcod 43775'1003451 mlcod 43775'1003451 active+clean+scrubbing+deep]
process_inconsistent() checking authoritative
2014-11-27 06:34:46.198102 7f288658e700  0 log [INF] : 4.65b deep-scrub ok
------------------------------------------------

#2 Updated by Samuel Just about 6 years ago

  • Project changed from Ceph to rgw

Radosgw creates unbounded size index objects, will change eventually.

#3 Updated by Guang Yang about 6 years ago

Samuel Just wrote:

Radosgw creates unbounded size index objects, will change eventually.

Hi Sam,
It is more like a general problem at OSD side rather than a rgw problem. In short, the problem is that a single object could be non-writable for a couple of hours because the thread taking responsibility to mark it writable was stuck doing pg remove (which is time consuming), not because it is really not writable.

I am thinking about two possibilities to fix:
1. Pull the time critical section into op thread (make the object as writable)
2. Use a dedicated thread pool to handle pg removing (and other time consuming work load)

#4 Updated by Samuel Just about 6 years ago

In that case, two things:
1) move scrubbing into the OpWQ (I'm working on that one)
2) restructure pg removal to only remove N objects a time and requeue (at the front of the removal queue)

#5 Updated by Guang Yang about 6 years ago

  • Subject changed from Radosgw threads get hung when hitting a paticular pattern at peering, remapping and deep-scrubbing to PG removal occupy the disk thread several hours

#6 Updated by Sage Weil almost 6 years ago

  • Project changed from rgw to Ceph

#7 Updated by Guang Yang almost 6 years ago

Samuel Just wrote:

In that case, two things:
2) restructure pg removal to only remove N objects a time and requeue (at the front of the removal queue)

https://github.com/ceph/ceph/pull/3564

#8 Updated by Sage Weil almost 6 years ago

  • Priority changed from Normal to High

#9 Updated by Samuel Just almost 6 years ago

  • Tracker changed from Bug to Feature
  • Target version set to v0.94

#10 Updated by Samuel Just almost 6 years ago

  • Target version changed from v0.94 to v0.95

#11 Updated by Sage Weil almost 6 years ago

  • Target version changed from v0.95 to v9.0.2

#12 Updated by Samuel Just over 5 years ago

  • Status changed from New to Resolved

Also available in: Atom PDF