Bug #49823: rgw gc object leak when gc omap set entry failed with a large omap value - rgw - Ceph

Bug #49823

closed

rgw gc object leak when gc omap set entry failed with a large omap value

Added by dovefi Z about 3 years ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee: Pritha Srivastava
Target version: Ceph - v18.0.0
% Done: 100%
Source:
Tags: gc backport_processed
Backport: octopus pacific quincy
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: Ceph - v10.0.0, Ceph - v10.1.1, Ceph - v10.2.0, Ceph - v10.2.1, Ceph - v10.2.10, Ceph - v10.2.11, Ceph - v10.2.12, Ceph - v10.2.2, Ceph - v10.2.3, Ceph - v10.2.4, Ceph - v10.2.5, Ceph - v10.2.6, Ceph - v10.2.7, Ceph - v10.2.8, Ceph - v10.2.9, Ceph - v11.1.0, Ceph - v11.2.0, Ceph - v11.2.1, Ceph - v11.2.2, Ceph - v12.0.0, Ceph - v12.1.0, Ceph - v12.2.0, Ceph - v12.2.1, Ceph - v12.2.10, Ceph - v12.2.11, Ceph - v12.2.12, Ceph - v12.2.13, Ceph - v12.2.14, Ceph - v12.2.2, Ceph - v12.2.3, Ceph - v12.2.4, Ceph - v12.2.5, Ceph - v12.2.6, Ceph - v12.2.7, Ceph - v12.2.8, Ceph - v12.2.9, Ceph - v13.0.0, Ceph - v13.2.0, Ceph - v13.2.1, Ceph - v13.2.10, Ceph - v13.2.11, Ceph - v13.2.2, Ceph - v13.2.3, Ceph - v13.2.4, Ceph - v13.2.5, Ceph - v13.2.6, Ceph - v13.2.7, Ceph - v13.2.8, Ceph - v13.2.9, Ceph - v14.0.0, Ceph - v14.2.0, Ceph - v14.2.1, Ceph - v14.2.10, Ceph - v14.2.11, Ceph - v14.2.12, Ceph - v14.2.13, Ceph - v14.2.14, Ceph - v14.2.15, Ceph - v14.2.16, Ceph - v14.2.17, Ceph - v14.2.18
ceph-qa-suite:
Pull request ID: 46020
Crash signature (v1):
Crash signature (v2):

Description

HOW TO REPRODUCE

1. Upload a large file of about 1.3 TB; the object name is about 100 characters long.

s3cmd put database/173727306/20210115/mysql_newsnapshot_g0_200037_20210115023303.tar.gz s3://large-bucket

2. Delete the object.

s3cmd del s3://large-bucket/database/173727306/20210115/mysql_newsnapshot_g0_200037_20210115023303.tar.gz

3. The gc list is empty:

radosgw-admin gc list --include-all
[]

4. rados df still shows the data in the pool:

$ rados df
POOL_NAME                  USED    OBJECTS CLONES COPIES     MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS    RD      WR_OPS    WR
default.rgw.buckets.data   1.3TiB   340787      0 1022361                  0       0        0  80887124 58.3GiB 207278635  183TiB

So the object's data has leaked. We eventually found the reason; below is the OSD log:

do_op msg data len 128255528 > osd_max_write_size 94371840 on osd_op(client.2308860.0:7839785 60.7b 60:deff95d7:::gc.27:head [call rgw.gc_set_entry] snapc 0=[] ondisk+write+known_if_redirected e15159) v8

The omap value is too large for the osd_max_write_size limit. The problem can be worked around by increasing that parameter, but what if the file keeps getting bigger? Maybe storing the object manifest in the gc omap value would be better.
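To put numbers on the log line above, here is a small arithmetic sketch. It relies on osd_max_write_size being configured in megabytes (90 by default), which matches the 94371840-byte limit in the log. The per-entry figure is only a rough estimate: it assumes that all 340787 objects reported by rados df are tail objects of this one upload, which may not be exactly true.

```python
import math

MIB = 1 << 20

# Values copied from the OSD log line above.
msg_data_len = 128_255_528   # bytes carried by the rgw.gc_set_entry call
limit_bytes = 94_371_840     # osd_max_write_size as the OSD compared it

# osd_max_write_size is set in megabytes; recover the configured value.
osd_max_write_size_mb = limit_bytes // MIB
assert osd_max_write_size_mb * MIB == limit_bytes

# Smallest setting (in MB) that would have let this one message through.
min_setting_mb = math.ceil(msg_data_len / MIB)

# Rough per-object cost of a gc chain entry. ASSUMPTION: every object
# counted by `rados df` belongs to this single upload.
tail_objects = 340_787
bytes_per_entry = msg_data_len / tail_objects

print(f"current limit: {osd_max_write_size_mb} MB")
print(f"minimum setting for this delete: {min_setting_mb} MB")
print(f"approx. bytes per chain entry: {bytes_per_entry:.0f}")
```

Since the message grows linearly with the number of tail objects, any fixed value of the parameter only moves the object size at which the delete starts to fail.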

Related issues 5 (1 open — 4 closed)

Updated by Matt Benjamin about 3 years ago

Priority changed from Normal to High

Updated by Matt Benjamin about 3 years ago

I think we want to avoid writing to omap, but some change needed.

Matt

Updated by dovefi Z about 3 years ago

I tested deleting a 5 TB file whose name is about 1024 characters long. The operation crashed the OSD, and the OSD cannot come back up.

Updated by dovefi Z about 3 years ago

dovefi Z wrote:

I tested deleting a 5 TB file whose name is about 1024 characters long. The operation crashed the OSD, and the OSD cannot come back up.

The OSD log:

     -5> 2021-03-17 11:10:47.796270 7fc5e2b19700  1 -- 10.191.24.41:6847/2387749 <== osd.64 10.191.24.46:0/4136504 6 ==== osd_ping(ping e15789 stamp 2021-03-17 11:10:47.796207) v4 ==== 2004+0+0 (2049075826 0 0) 0x5563b05dc400 con 0x5563b047c000
    -4> 2021-03-17 11:10:47.796278 7fc5e2b19700  1 -- 10.191.24.41:6847/2387749 --> 10.191.24.46:0/4136504 -- osd_ping(ping_reply e15789 stamp 2021-03-17 11:10:47.796207) v4 -- 0x556464db8e00 con 0
    -3> 2021-03-17 11:10:47.796288 7fc5e2b19700 20 osd.2 15789 share_map_peer 0x5563b0578000 already has epoch 15789
    -2> 2021-03-17 11:10:47.805190 7fc5bf2a6700 20 osd.2 op_wq(4) _process empty q, waiting
    -1> 2021-03-17 11:10:47.830107 7fc5c02a8700 20 osd.2 op_wq(2) _process empty q, waiting
     0> 2021-03-17 11:10:47.893074 7fc5e1b17700 -1 *** Caught signal (Aborted) **
 in thread 7fc5e1b17700 thread_name:msgr-worker-2

 ceph version 12.2.12.1 (731179e60fe566d6183973cba26786a88b30f9e2) luminous (stable)
 1: (()+0xa59c94) [0x556370bafc94]
 2: (()+0x110e0) [0x7fc5e5ec30e0]
 3: (gsignal()+0xcf) [0x7fc5e4e8afff]
 4: (abort()+0x16a) [0x7fc5e4e8c42a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fc5e57a30ad]
 6: (()+0x8f066) [0x7fc5e57a1066]
 7: (()+0x8f0b1) [0x7fc5e57a10b1]
 8: (()+0xb9e9e) [0x7fc5e57cbe9e]
 9: (()+0x74a4) [0x7fc5e5eb94a4]
 10: (clone()+0x3f) [0x7fc5e4f40d0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /home/ceph/log/ceph-osd.2.log
--- end dump of recent events ---
2021-03-17 11:10:47.907064 7fc5e2318700 20 osd.2 15789 share_map_peer 0x5563b0369000 already has epoch 15789
2021-03-17 11:10:47.907116 7fc5e2318700 20 osd.2 15789 share_map_peer 0x5563b0369000 already has epoch 15789
2021-03-17 11:10:47.910783 7fc5e2b19700 20 osd.2 15789 share_map_peer 0x556465ef6000 already has epoch 15789
2021-03-17 11:10:47.910812 7fc5e2318700 20 osd.2 15789 share_map_peer 0x556465ef6000 already has epoch 15789
2021-03-17 11:10:47.912436 7fc5e2318700 20 osd.2 15789 share_map_peer 0x5563b012b800 already has epoch 15789

Updated by Casey Bodley about 3 years ago

Status changed from New to Triaged

for octopus (see https://github.com/ceph/ceph/pull/28421) we changed this gc list to use cls_gc_queue which is bounded in size, so calls to RGWGC::send_chain() may fail with ENOSPC. we added error handling for this to call delete_objs_inline() and delete the tail objects immediately

this will change the behavior for these very large objects, but i don't think delete_objs_inline() is a good solution there; inline deletion will take a long time, and the client will likely time out and retry the DELETE several times, and the bucket index and head object will still show that the object exists in the meantime

it sounds like this OSD_WRITETOOBIG error will happen regardless of the backing (omap, cls_gc_queue, or cls_fifo) just because the osd message itself is too big. i'm not sure exactly how a gc chain is represented here, but it sounds like we'll either need a compressed representation, or the ability to send long chains in multiple osd ops
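A minimal sketch of the "multiple osd ops" idea, only to illustrate the batching itself. The function name split_chain and the byte-string entries are hypothetical and do not correspond to any RGW or cls API. It also leaves open how GC would treat a chain that is spread over several entries.

```python
from typing import Iterable, Iterator, List


def split_chain(entries: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group encoded chain entries into batches of at most max_bytes each.

    Hypothetical helper: each yielded batch stands for the payload of
    one OSD op, so that no single op exceeds the write size limit.
    """
    batch: List[bytes] = []
    size = 0
    for entry in entries:
        if len(entry) > max_bytes:
            raise ValueError("a single entry exceeds the per-op budget")
        # Close the current batch before it would overflow the budget.
        if batch and size + len(entry) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(entry)
        size += len(entry)
    if batch:
        yield batch


# Ten 100-byte entries with a 350-byte budget: three entries fit per op.
demo_entries = [bytes(100) for _ in range(10)]
batches = list(split_chain(demo_entries, max_bytes=350))
print([len(b) for b in batches])
```

The number of ops still grows with the object size, but each op stays below the limit regardless of how long the chain is.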

Updated by Matt Benjamin about 3 years ago

I think this is correct. I thought briefly about compression, but intuitively, wouldn't that still risk running into some, presumably larger, size limit? That makes me think that the multi-submit idea is a safer bet...

it sounds like this OSD_WRITETOOBIG error will happen regardless of the backing (omap, cls_gc_queue, or cls_fifo) just because
the osd message itself is too big. i'm not sure exactly how a gc chain is represented here, but it sounds like we'll either
need a compressed representation, or the ability to send long chains in multiple osd ops

Matt

Updated by Daniel Gryniewicz over 2 years ago

Related to Bug #52711: Deleting a bucket with large MPU (1.4tb or more) object does not cleanup rgw.data pool added

Updated by Casey Bodley over 2 years ago

Assignee set to Pritha Srivastava

Updated by Jeegn Chen about 2 years ago

Casey Bodley wrote:

for octopus (see https://github.com/ceph/ceph/pull/28421) we changed this gc list to use cls_gc_queue which is bounded in size, so calls to RGWGC::send_chain() may fail with ENOSPC. we added error handling for this to call delete_objs_inline() and delete the tail objects immediately

this will change the behavior for these very large objects, but i don't think delete_objs_inline() is a good solution there; inline deletion will take a long time, and the client will likely time out and retry the DELETE several times, and the bucket index and head object will still show that the object exists in the meantime

it sounds like this OSD_WRITETOOBIG error will happen regardless of the backing (omap, cls_gc_queue, or cls_fifo) just because the osd message itself is too big. i'm not sure exactly how a gc chain is represented here, but it sounds like we'll either need a compressed representation, or the ability to send long chains in multiple osd ops

https://github.com/ceph/ceph/pull/28421 does not seem to address the issue well. The queue implemented in that PR is in fact a RADOS object, which is restricted by osd_max_object_size (128MB by default). If the S3 object is as large as 50TB (5GB per part, 10000 parts), the chain will be several GB in size (in my experiment, with rgw_max_chunk_size = 1048576 and rgw_obj_stripe_size = 2097152, an 800GB S3 object resulted in a chain of about 200MB).
But the manifest is usually very small.
Why do we put a long list in cls_rgw_obj_chain instead of a small RGWObjManifest?
Is it because of some backward compatibility concern?
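The figures above can be checked with a rough calculation. This is only a sketch under assumptions: it treats GB/TB as binary units, assumes one tail object per stripe, and assumes the chain size scales linearly with the number of tail objects.

```python
MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40

stripe_size = 2_097_152      # rgw_obj_stripe_size used in the experiment
object_size = 800 * GIB      # size of the S3 object in the experiment
chain_size = 200 * MIB       # observed chain size ("about 200MB")

# ASSUMPTION: one tail object per stripe.
tail_objects = object_size // stripe_size
bytes_per_entry = chain_size // tail_objects

# Extrapolate linearly to a 50TB object with the same stripe size.
big_tail_objects = (50 * TIB) // stripe_size
big_chain_gib = big_tail_objects * bytes_per_entry / GIB

# Compare with the 128MB default of osd_max_object_size.
ratio_to_limit = big_tail_objects * bytes_per_entry / (128 * MIB)

print(tail_objects, bytes_per_entry, big_chain_gib, ratio_to_limit)
```

Under these assumptions each chain entry costs about 512 bytes, and a 50TB object gives a chain of roughly 12.5 GB, about 100 times the default osd_max_object_size.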

#10

Updated by Pritha Srivastava about 2 years ago

Jeegn Chen wrote:

Casey Bodley wrote:

for octopus (see https://github.com/ceph/ceph/pull/28421) we changed this gc list to use cls_gc_queue which is bounded in size, so calls to RGWGC::send_chain() may fail with ENOSPC. we added error handling for this to call delete_objs_inline() and delete the tail objects immediately

this will change the behavior for these very large objects, but i don't think delete_objs_inline() is a good solution there; inline deletion will take a long time, and the client will likely time out and retry the DELETE several times, and the bucket index and head object will still show that the object exists in the meantime

it sounds like this OSD_WRITETOOBIG error will happen regardless of the backing (omap, cls_gc_queue, or cls_fifo) just because the osd message itself is too big. i'm not sure exactly how a gc chain is represented here, but it sounds like we'll either need a compressed representation, or the ability to send long chains in multiple osd ops

https://github.com/ceph/ceph/pull/28421 seems not able to address the issue well. The queue implemented in https://github.com/ceph/ceph/pull/28421 is in fact a Rados object, which will be restricted by osd_max_object_size (128MB by default). If the S3 object is as large as 50TB (5GB per part, 10000 parts), the chain will be several GB large (According to my experiment, when rgw_max_chunk_size = 1048576 and rgw_obj_stripe_size = 2097152, a 800GB s3 object will result in an about 200MB chain).
But Manifest is usually very small.
Why do we put a long list in cls_rgw_obj_chain instead of a small RGWObjManifest?
Is it because of some backward compatibility concern?

Hi Jeegn,

https://github.com/ceph/ceph/pull/28421 is not meant to address this issue.

#11

Updated by Casey Bodley almost 2 years ago

Jeegn Chen wrote:

But Manifest is usually very small.
Why do we put a long list in cls_rgw_obj_chain instead of a small RGWObjManifest?
Is it because of some backward compatibility concern?

backward compat is a challenge, but i do think it's worth exploring the use of RGWObjManifest here. it's the 'compressed representation' that can generate the whole gc chain (this is what RGWRados::update_gc_chain() does, see https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L4915-L4927)

i guess that would be a new field in cls_rgw_gc_obj_info. so we'd pass that into cls_rgw_gc_queue_enqueue(), the gc queue would store it, and GC would read it back with cls_rgw_gc_queue_list_entries()

if the new field in cls_rgw_gc_obj_info is encoded as a bufferlist, cls_rgw_gc wouldn't be sensitive to any encoding changes to RGWObjManifest. however, all OSDs and RGWs would need to be upgraded to support this new field before any RGWs could safely write manifests via cls_rgw_gc_queue_enqueue()
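To illustrate why a manifest works as a "compressed representation", here is a toy sketch. ToyManifest, its fields and the object naming scheme are all hypothetical and much simpler than the real RGWObjManifest. The point is only that a few fixed-size fields can regenerate an arbitrarily long chain, and that the gc queue could store them as an opaque blob.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator


@dataclass(frozen=True)
class ToyManifest:
    """Hypothetical stand-in for a manifest; not the RGWObjManifest layout."""
    prefix: str        # common name prefix of the tail objects
    obj_size: int      # total object size in bytes
    stripe_size: int   # bytes stored per tail object

    def tail_oids(self) -> Iterator[str]:
        """Lazily regenerate the chain of tail object names."""
        count = -(-self.obj_size // self.stripe_size)  # ceiling division
        for i in range(count):
            yield f"{self.prefix}.{i}"

    def encode(self) -> bytes:
        """Opaque blob that a gc queue could store without decoding it."""
        return json.dumps(asdict(self)).encode()


MIB = 1 << 20
small = ToyManifest("shadow", 10 * MIB, 4 * MIB)
huge = ToyManifest("shadow", 5 * (1 << 40), 4 * MIB)

print(list(small.tail_oids()))
print(len(small.encode()), len(huge.encode()))
```

The encoded size depends only on how many digits the sizes have, not on the number of tail objects, so the gc entry stays small however large the deleted object is.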

#12