Bug #49823

rgw gc object leak when gc omap set entry failed with a large omap value

Added by dovefi Z almost 2 years ago. Updated 6 months ago.


Description

HOW TO REPRODUCE

1. Upload a large file of about 1.3TB; the object name is about 100 characters long.

s3cmd put database/173727306/20210115/mysql_newsnapshot_g0_200037_20210115023303.tar.gz s3://large-bucket

2. delete object

s3cmd del s3://large-bucket/database/173727306/20210115/mysql_newsnapshot_g0_200037_20210115023303.tar.gz 

3. the gc list is empty

radosgw-admin gc list --include-all
[]

4. rados df

$ rados df
POOL_NAME                  USED    OBJECTS CLONES COPIES     MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS    RD      WR_OPS    WR
default.rgw.buckets.data   1.3TiB   340787      0 1022361                  0       0        0  80887124 58.3GiB 207278635  183TiB


So the objects have leaked. We finally found the reason; below is the OSD log:
do_op msg data len 128255528 > osd_max_write_size 94371840 on osd_op(client.2308860.0:7839785 60.7b 60:deff95d7:::gc.27:head [call rgw.gc_set_entry] snapc 0=[] ondisk+write+known_if_redirected e15159) v8

The omap value is too large for the osd_max_write_size limit. The problem can be worked around by increasing the parameter value, but what if the file gets bigger and bigger? It may be better to save the object manifest in the gc omap value.
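The numbers in the log line above can be checked with a little arithmetic. This is only a sketch: it assumes osd_max_write_size is configured in megabytes (a value of 90 gives the 94371840-byte limit shown in the log) and it uses the object count from the rados df output as a rough stand-in for the number of chain entries. The per-entry figure is derived, not measured.

```python
MIB = 1 << 20

msg_len = 128255528      # "msg data len" from the OSD log line
limit_bytes = 94371840   # osd_max_write_size as reported in the same line
tail_objects = 340787    # OBJECTS column of rados df (approximate entry count)

# The limit expressed in MB, assuming the option is configured in megabytes.
limit_mb = limit_bytes // MIB

# Smallest whole-MB setting that would have let this one message through.
needed_mb = -(-msg_len // MIB)  # ceiling division

# Rough size of one gc chain entry for a ~100-character object name.
bytes_per_entry = msg_len / tail_objects

print(f"current limit: {limit_mb} MB, needed: {needed_mb} MB")
print(f"about {bytes_per_entry:.0f} bytes per chain entry")
```

Because the message grows linearly with the number of tail objects, any fixed limit is eventually exceeded by a large enough object, which is the concern raised above.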


Related issues

Related to rgw - Bug #52711: Deleting a bucket with large MPU (1.4tb or more) object does not cleanup rgw.data pool (Duplicate)
Related to rgw - Bug #53585: RGW Garbage collector leads to slow ops and osd down when removing large object (New)
Copied to rgw - Backport #56405: octopus: rgw gc object leak when gc omap set entry failed with a large omap value (Rejected)
Copied to rgw - Backport #56406: quincy: rgw gc object leak when gc omap set entry failed with a large omap value (In Progress)
Copied to rgw - Backport #56407: pacific: rgw gc object leak when gc omap set entry failed with a large omap value (Resolved)

History

#1 Updated by Matt Benjamin almost 2 years ago

  • Priority changed from Normal to High

#2 Updated by Matt Benjamin almost 2 years ago

I think we want to avoid writing to omap, but some change is needed.

Matt

#3 Updated by dovefi Z almost 2 years ago

I tested deleting a 5TB file whose name is about 1024 characters long. The operation crashed the OSD, and the OSD could not come back up.

#4 Updated by dovefi Z almost 2 years ago

the osd log

     -5> 2021-03-17 11:10:47.796270 7fc5e2b19700  1 -- 10.191.24.41:6847/2387749 <== osd.64 10.191.24.46:0/4136504 6 ==== osd_ping(ping e15789 stamp 2021-03-17 11:10:47.796207) v4 ==== 2004+0+0 (2049075826 0 0) 0x5563b05dc400 con 0x5563b047c000
    -4> 2021-03-17 11:10:47.796278 7fc5e2b19700  1 -- 10.191.24.41:6847/2387749 --> 10.191.24.46:0/4136504 -- osd_ping(ping_reply e15789 stamp 2021-03-17 11:10:47.796207) v4 -- 0x556464db8e00 con 0
    -3> 2021-03-17 11:10:47.796288 7fc5e2b19700 20 osd.2 15789 share_map_peer 0x5563b0578000 already has epoch 15789
    -2> 2021-03-17 11:10:47.805190 7fc5bf2a6700 20 osd.2 op_wq(4) _process empty q, waiting
    -1> 2021-03-17 11:10:47.830107 7fc5c02a8700 20 osd.2 op_wq(2) _process empty q, waiting
     0> 2021-03-17 11:10:47.893074 7fc5e1b17700 -1 *** Caught signal (Aborted) **
 in thread 7fc5e1b17700 thread_name:msgr-worker-2

 ceph version 12.2.12.1 (731179e60fe566d6183973cba26786a88b30f9e2) luminous (stable)
 1: (()+0xa59c94) [0x556370bafc94]
 2: (()+0x110e0) [0x7fc5e5ec30e0]
 3: (gsignal()+0xcf) [0x7fc5e4e8afff]
 4: (abort()+0x16a) [0x7fc5e4e8c42a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fc5e57a30ad]
 6: (()+0x8f066) [0x7fc5e57a1066]
 7: (()+0x8f0b1) [0x7fc5e57a10b1]
 8: (()+0xb9e9e) [0x7fc5e57cbe9e]
 9: (()+0x74a4) [0x7fc5e5eb94a4]
 10: (clone()+0x3f) [0x7fc5e4f40d0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /home/ceph/log/ceph-osd.2.log
--- end dump of recent events ---
2021-03-17 11:10:47.907064 7fc5e2318700 20 osd.2 15789 share_map_peer 0x5563b0369000 already has epoch 15789
2021-03-17 11:10:47.907116 7fc5e2318700 20 osd.2 15789 share_map_peer 0x5563b0369000 already has epoch 15789
2021-03-17 11:10:47.910783 7fc5e2b19700 20 osd.2 15789 share_map_peer 0x556465ef6000 already has epoch 15789
2021-03-17 11:10:47.910812 7fc5e2318700 20 osd.2 15789 share_map_peer 0x556465ef6000 already has epoch 15789
2021-03-17 11:10:47.912436 7fc5e2318700 20 osd.2 15789 share_map_peer 0x5563b012b800 already has epoch 15789

#5 Updated by Casey Bodley almost 2 years ago

  • Status changed from New to Triaged

for octopus (see https://github.com/ceph/ceph/pull/28421) we changed this gc list to use cls_gc_queue which is bounded in size, so calls to RGWGC::send_chain() may fail with ENOSPC. we added error handling for this to call delete_objs_inline() and delete the tail objects immediately

this will change the behavior for these very large objects, but i don't think delete_objs_inline() is a good solution there; inline deletion will take a long time, and the client will likely time out and retry the DELETE several times, and the bucket index and head object will still show that the object exists in the meantime
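The fallback described above can be sketched roughly as follows. This is not the RGW code: `send_chain` and `delete_objs_inline` are passed in as hypothetical Python callables standing in for RGWGC::send_chain() and delete_objs_inline(), and the convention of returning 0 or a negative errno is an assumption made for the illustration.

```python
import errno
import os


def enqueue_or_delete_inline(chain, send_chain, delete_objs_inline):
    """Try deferred GC first; fall back to inline deletion if the queue is full."""
    ret = send_chain(chain)  # hypothetical: 0 on success, negative errno on failure
    if ret == 0:
        return "deferred"
    if ret == -errno.ENOSPC:
        # The bounded gc queue has no room: delete the tail objects right away.
        # For very large objects this is the slow path discussed above.
        delete_objs_inline(chain)
        return "inline"
    # Any other error is surfaced to the caller unchanged.
    raise OSError(-ret, os.strerror(-ret))
```

Note that this only covers ENOSPC from the queue; it does not help when the OSD rejects the request itself because the message is too big.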

it sounds like this OSD_WRITETOOBIG error will happen regardless of the backing (omap, cls_gc_queue, or cls_fifo) just because the osd message itself is too big. i'm not sure exactly how a gc chain is represented here, but it sounds like we'll either need a compressed representation, or the ability to send long chains in multiple osd ops
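As a rough illustration of the "multiple osd ops" idea, the sketch below splits a long chain into batches whose encoded size stays under a byte limit, so that each batch could go out as its own op. The function name, the `encoded_size` callback, and the use of byte strings as chain entries are all assumptions for the example; nothing here reflects how a gc chain is actually encoded.

```python
def split_chain(entries, max_bytes, encoded_size=len):
    """Yield lists of entries whose total encoded size is at most max_bytes."""
    batch, size = [], 0
    for entry in entries:
        n = encoded_size(entry)
        if n > max_bytes:
            raise ValueError("a single entry exceeds the per-op limit")
        # Start a new batch when adding this entry would cross the limit.
        if batch and size + n > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(entry)
        size += n
    if batch:
        yield batch


# Example: five 40-byte entries with a 100-byte limit give batches of 2, 2 and 1.
for ops in split_chain([b"x" * 40] * 5, max_bytes=100):
    print(len(ops))
```

A real implementation would also have to decide what happens if some batches are enqueued and a later one fails, which this sketch does not address.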

#6 Updated by Matt Benjamin almost 2 years ago

I think this is correct. I thought briefly about compression, but intuitively, wouldn't that still be at risk of hitting some, presumably larger, size limit? That makes me think that the multi-submit idea is a safer bet...

Matt

#7 Updated by Daniel Gryniewicz over 1 year ago

  • Related to Bug #52711: Deleting a bucket with large MPU (1.4tb or more) object does not cleanup rgw.data pool added

#8 Updated by Casey Bodley over 1 year ago

  • Assignee set to Pritha Srivastava

#9 Updated by Jeegn Chen 9 months ago

Casey Bodley wrote:

for octopus (see https://github.com/ceph/ceph/pull/28421) we changed this gc list to use cls_gc_queue which is bounded in size, so calls to RGWGC::send_chain() may fail with ENOSPC. we added error handling for this to call delete_objs_inline() and delete the tail objects immediately

this will change the behavior for these very large objects, but i don't think delete_objs_inline() is a good solution there; inline deletion will take a long time, and the client will likely time out and retry the DELETE several times, and the bucket index and head object will still show that the object exists in the meantime

it sounds like this OSD_WRITETOOBIG error will happen regardless of the backing (omap, cls_gc_queue, or cls_fifo) just because the osd message itself is too big. i'm not sure exactly how a gc chain is represented here, but it sounds like we'll either need a compressed representation, or the ability to send long chains in multiple osd ops

https://github.com/ceph/ceph/pull/28421 does not seem to address the issue well. The queue implemented in https://github.com/ceph/ceph/pull/28421 is in fact a RADOS object, which is restricted by osd_max_object_size (128MB by default). If the S3 object is as large as 50TB (5GB per part, 10000 parts), the chain will be several GB in size (in my experiment, with rgw_max_chunk_size = 1048576 and rgw_obj_stripe_size = 2097152, an 800GB S3 object resulted in a chain of about 200MB).
But the manifest is usually very small.
Why do we put a long list in cls_rgw_obj_chain instead of a small RGWObjManifest?
Is it because of some backward compatibility concern?
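The estimate above can be reproduced with a back-of-the-envelope calculation. The sketch assumes one chain entry per stripe, binary units for the object sizes, and a decimal 200MB for the measured chain; the per-entry size is inferred from the 800GB experiment rather than taken from the encoding.

```python
GIB = 1 << 30
TIB = 1 << 40

stripe_size = 2097152            # rgw_obj_stripe_size used in the experiment
measured_object = 800 * GIB      # size of the S3 object in the experiment
measured_chain = 200 * 10**6     # observed chain size, "about 200MB"

# One chain entry per stripe (assumption).
entries_800g = measured_object // stripe_size
per_entry = measured_chain / entries_800g

# Scale the same per-entry cost up to a 50TB object.
entries_50t = (50 * TIB) // stripe_size
chain_50t = entries_50t * per_entry

print(f"{entries_800g} entries, about {per_entry:.0f} bytes each")
print(f"50TB object: about {chain_50t / 10**9:.1f} GB of chain")
```

Under these assumptions the 50TB chain comes out at roughly 12.8 GB, far above both osd_max_write_size and the 128MB osd_max_object_size mentioned above.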

#10 Updated by Pritha Srivastava 9 months ago

Jeegn Chen wrote:

https://github.com/ceph/ceph/pull/28421 does not seem to address the issue well. The queue implemented in https://github.com/ceph/ceph/pull/28421 is in fact a RADOS object, which is restricted by osd_max_object_size (128MB by default). If the S3 object is as large as 50TB (5GB per part, 10000 parts), the chain will be several GB in size (in my experiment, with rgw_max_chunk_size = 1048576 and rgw_obj_stripe_size = 2097152, an 800GB S3 object resulted in a chain of about 200MB).
But the manifest is usually very small.
Why do we put a long list in cls_rgw_obj_chain instead of a small RGWObjManifest?
Is it because of some backward compatibility concern?

Hi Jeegn,

https://github.com/ceph/ceph/pull/28421 is not meant to address this issue.

#11 Updated by Casey Bodley 9 months ago

Jeegn Chen wrote:

But the manifest is usually very small.
Why do we put a long list in cls_rgw_obj_chain instead of a small RGWObjManifest?
Is it because of some backward compatibility concern?

backward compat is a challenge, but i do think it's worth exploring the use of RGWObjManifest here. it's the 'compressed representation' that can generate the whole gc chain (this is what RGWRados::update_gc_chain() does, see https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L4915-L4927)

i guess that would be a new field in cls_rgw_gc_obj_info. so we'd pass that into cls_rgw_gc_queue_enqueue(), the gc queue would store it, and GC would read it back with cls_rgw_gc_queue_list_entries()

if the new field in cls_rgw_gc_obj_info is encoded as a bufferlist, cls_rgw_gc wouldn't be sensitive to any encoding changes to RGWObjManifest. however, all OSDs and RGWs would need to be upgraded to support this new field before any RGWs could safely write manifests via cls_rgw_gc_queue_enqueue()
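To illustrate why a manifest works as a 'compressed representation', here is a toy sketch. It is not the real RGWObjManifest layout: the `CompactManifest` class, its three fields, and the `prefix.N` naming scheme are all hypothetical. The point is only that a few fixed-size fields can regenerate an arbitrarily long list of tail objects on demand.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class CompactManifest:
    """Hypothetical stand-in for a manifest: constant size regardless of object size."""
    prefix: str
    obj_size: int
    stripe_size: int

    def tail_objects(self) -> Iterator[str]:
        # Number of stripes, rounded up so a partial last stripe is included.
        count = -(-self.obj_size // self.stripe_size)
        for i in range(count):
            yield f"{self.prefix}.{i}"


# A 10-byte object with 4-byte stripes expands to three tail objects.
m = CompactManifest(prefix="tail", obj_size=10, stripe_size=4)
print(list(m.tail_objects()))
```

With something along these lines, GC would store only the small description and expand it lazily while deleting, instead of receiving the full chain in a single OSD message.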

#12 Updated by Casey Bodley 9 months ago

  • Status changed from Triaged to Fix Under Review
  • Backport set to octopus pacific quincy
  • Pull request ID set to 46020

#13 Updated by Casey Bodley 7 months ago

  • Status changed from Fix Under Review to Pending Backport

#14 Updated by Backport Bot 7 months ago

  • Copied to Backport #56405: octopus: rgw gc object leak when gc omap set entry failed with a large omap value added

#15 Updated by Backport Bot 7 months ago

  • Copied to Backport #56406: quincy: rgw gc object leak when gc omap set entry failed with a large omap value added

#16 Updated by Backport Bot 7 months ago

  • Copied to Backport #56407: pacific: rgw gc object leak when gc omap set entry failed with a large omap value added

#17 Updated by Backport Bot 6 months ago

  • Tags changed from gc to gc backport_processed

#18 Updated by Casey Bodley 5 months ago

  • Related to Bug #53585: RGW Garbage collector leads to slow ops and osd down when removing large object added
