Bug #45412 (closed): radosgw gc issue - failed to list objs: (22) Invalid argument

Added by James McEwan almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Target version: -
% Done: 0%
Source:
Tags: gc
Backport: octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a Ceph cluster running Octopus version 15.2.1 (upgraded from Octopus 15.2.0). We're using RGW with an erasure-code-backed pool.

I started to get a suspicion that deleted objects were not getting cleaned up properly, and I wanted to verify this by checking the garbage collector.

That's when I discovered that when I run "radosgw-admin gc list --include-all", I get the following error:
"ERROR: failed to list objs: (22) Invalid argument"

I'm running the radosgw-admin command from one of our monitors, as the radosgw itself does not have admin access to the cluster.

I've attached the output of the radosgw-admin command with debug_rgw=20, and also the logs for one of the OSDs in the cluster with debug_osd and debug_objclass set to 20.
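
For reference, a rough sketch of the kind of commands used to capture that output (flags and daemon IDs are illustrative; osd.15 is simply the OSD whose log is attached):

# debug output redirected as needed for your logging setup
radosgw-admin gc list --include-all --debug-rgw=20 2> rgw-gc-list-debug.log
# raise debug on the OSD whose log is attached
ceph config set osd.15 debug_osd 20
ceph config set osd.15 debug_objclass 20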


Files

ceph-rgw.rgw0.log (8.42 KB) - James McEwan, 05/06/2020 06:22 PM
ceph-osd.15.log (523 KB) - James McEwan, 05/06/2020 06:31 PM
ceph-rgw.rgw0.log (886 KB) - James McEwan, 05/08/2020 06:33 AM

Related issues: 1 (0 open, 1 closed)

Copied to rgw - Backport #45844: octopus: radosgw gc issue - failed to list objs: (22) Invalid argument (Resolved, Nathan Cutler)
#1

Updated by Matt Benjamin almost 4 years ago

  • Assignee set to Pritha Srivastava
#2

Updated by Casey Bodley almost 4 years ago

  • Status changed from New to Triaged
  • Tags set to gc
  • Backport set to octopus
#3

Updated by Pritha Srivastava almost 4 years ago

I checked the logs and can see that while listing gc entries, there is a failure in reading the START of the data, which should be a magic number (0xBEEF) but is zero, as can be seen from the logs. There is nothing else that indicates what could have gone wrong. Are there any other logs from prior to the upgrade available? Also, the rgw log attached is only the output obtained by running radosgw-admin gc list; there must be regular rgw daemon logs, like the osd logs - can those be shared?

2020-05-06T20:06:39.923+0200 7f78f91ab700 10 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:263: INFO: queue_list_entries(): front is: 0/4096, tail is 0/732076
2020-05-06T20:06:39.923+0200 7f78f91ab700 10 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:271: INFO: queue_list_entries(): start_offset is 4096
2020-05-06T20:06:39.923+0200 7f78f91ab700 10 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:280: INFO: queue_list_entries(): size_to_read is 1024
2020-05-06T20:06:39.923+0200 7f78f91ab700 10 osd.15 pg_epoch: 5421 pg[5.9( v 5421'489853 (5421'486400,5421'489853] local-lis/les=5414/5415 n=59 ec=67/67 lis/c=5414/5414 les/c/f=5415/5415/0 sis=5414) [15,21,26] r=0 lpr=5414 crt=5421'489851 lcod 5421'489852 mlcod 5421'489852 active+clean] do_osd_op 5:9a6dd8c0:gc::gc.0:head [sync_read 4096~1024]
2020-05-06T20:06:39.923+0200 7f78f91ab700 10 osd.15 pg_epoch: 5421 pg[5.9( v 5421'489853 (5421'486400,5421'489853] local-lis/les=5414/5415 n=59 ec=67/67 lis/c=5414/5414 les/c/f=5415/5415/0 sis=5414) [15,21,26] r=0 lpr=5414 crt=5421'489851 lcod 5421'489852 mlcod 5421'489852 active+clean] do_osd_op sync_read 4096~1024
2020-05-06T20:06:39.923+0200 7f78f91ab700 20 osd.15 pg_epoch: 5421 pg[5.9( v 5421'489853 (5421'486400,5421'489853] local-lis/les=5414/5415 n=59 ec=67/67 lis/c=5414/5414 les/c/f=5415/5415/0 sis=5414) [15,21,26] r=0 lpr=5414 crt=5421'489851 lcod 5421'489852 mlcod 5421'489852 active+clean] do_read
2020-05-06T20:06:39.923+0200 7f78f91ab700 10 osd.15 pg_epoch: 5421 pg[5.9( v 5421'489853 (5421'486400,5421'489853] local-lis/les=5414/5415 n=59 ec=67/67 lis/c=5414/5414 les/c/f=5415/5415/0 sis=5414) [15,21,26] r=0 lpr=5414 crt=5421'489851 lcod 5421'489852 mlcod 5421'489852 active+clean] read got 1024 / 1024 bytes from obj 5:9a6dd8c0:gc::gc.0:head
2020-05-06T20:06:39.923+0200 7f78f91ab700 20 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:295: INFO: queue_list_entries(): Entry start offset accounting for leftover data is 4096
2020-05-06T20:06:39.923+0200 7f78f91ab700 20 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:299: INFO: queue_list_entries(): size of chunk 1024
2020-05-06T20:06:39.923+0200 7f78f91ab700 10 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:306: INFO: queue_list_entries(): index: 0, size_to_process: 1024
2020-05-06T20:06:39.923+0200 7f78f91ab700 5 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:312: INFO: queue_list_entries(): offset: 0/4096
2020-05-06T20:06:39.923+0200 7f78f91ab700 5 <cls> /build/ceph-15.2.1/src/cls/queue/cls_queue_src.cc:326: ERROR: queue_list_entries: invalid entry start 0
2020-05-06T20:06:39.923+0200 7f78f91ab700 5 <cls> /build/ceph-15.2.1/src/cls/rgw_gc/cls_rgw_gc.cc:135: ERROR: queue_list_entries(): returned error -22
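
One way to double-check this outside of the cls code (just a suggestion, and it assumes the gc objects live in the default.rgw.log pool under the "gc" namespace, as the "default.rgw.log:gc" zone setting quoted later in this thread suggests) is to dump the bytes at the queue head offset of 4096 and confirm they are zero rather than the expected entry marker:

# fetch the raw gc.0 queue object and inspect the bytes at offset 4096
rados -p default.rgw.log --namespace gc get gc.0 /tmp/gc.0.bin
dd if=/tmp/gc.0.bin bs=1 skip=4096 count=16 2>/dev/null | hexdump -C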

#4

Updated by Pritha Srivastava almost 4 years ago

@James, did you notice any error in 15.2.0? Did you run gc list on that version?

#5

Updated by James McEwan almost 4 years ago

No, I never ran the command in 15.2.0, so I can't be sure if the problem existed prior to the upgrade or not.

#6

Updated by James McEwan almost 4 years ago

Here are the rgw logs that you requested from the rgw server, with debug_rgw set to 20. The reason I didn't include them in the original post was that I couldn't find any entries related to garbage collection, so I assumed you meant the logs obtained by running the command.

Note that I've tried to censor any sensitive information in the logs.

The command was run at 2020-05-08T08:18:55.562+0000

Thanks,
James.

#7

Updated by Pritha Srivastava almost 4 years ago

Hi James,

Yes, this rgw log does not have anything related to GC. Isn't there a log file from right after the upgrade? There are some GC-related initializations that I want to check, as well as log entries for objects being sent to GC and for GC trying to process them - something like "RGWGC::send_chain" and "RGWGC::process entered with GC index_shard" - and the corresponding osd logs.
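
In case it helps to locate them, something along these lines against the files already attached to this ticket (names from the Files section above) would show whether any such entries survived log rotation:

grep -E 'RGWGC::send_chain|RGWGC::process' ceph-rgw.rgw0.log
grep 'queue_enqueue' ceph-osd.15.log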

Thanks,
Pritha

#8

Updated by James McEwan almost 4 years ago

It seems the rgw logs have been rotated out since we performed the upgrade, so the entries you're interested in regarding the initialisation are gone, I'm afraid.

The only entries I can find containing the string "RGWGC" are the following:
2020-05-06T18:56:22.134+0200 7f2736ffd700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:5a5c812a-3d31-4d79-87e6-1a17206228ac.325552.3__multipart_urn:oid:1638.2~St4_RiJhGa9NJfS1CPOvSQc9nw86NAH.20
2020-05-06T18:56:22.134+0200 7f2736ffd700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:5a5c812a-3d31-4d79-87e6-1a17206228ac.325552.3__shadow_urn:oid:1638.2~St4_RiJhGa9NJfS1CPOvSQc9nw86NAH.20_1
2020-05-06T18:56:22.134+0200 7f2736ffd700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:5a5c812a-3d31-4d79-87e6-1a17206228ac.325552.3__multipart_urn:oid:1638.2~St4_RiJhGa9NJfS1CPOvSQc9nw86NAH.21
2020-05-06T18:56:22.162+0200 7f2736ffd700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:5a5c812a-3d31-4d79-87e6-1a17206228ac.325552.3__shadow_urn:oid:1638.2~St4_RiJhGa9NJfS1CPOvSQc9nw86NAH.21_1
2020-05-06T18:56:22.162+0200 7f2736ffd700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:5a5c812a-3d31-4d79-87e6-1a17206228ac.325552.3__multipart_urn:oid:1638.2~St4_RiJhGa9NJfS1CPOvSQc9nw86NAH.22
2020-05-06T18:56:22.162+0200 7f2736ffd700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:5a5c812a-3d31-4d79-87e6-1a17206228ac.325552.3__shadow_urn:oid:1638.2~St4_RiJhGa9NJfS1CPOvSQc9nw86NAH.22_1
...

I believe this coincided with when I executed the "radosgw-admin gc process" command manually, but I'm not 100% sure.

If I run "radosgw-admin gc process" now with debug_rgw=20, I believe I'm seeing the same -22 error code:
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process entered with GC index_shard=4, max_secs=3600, expired_only=1
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process cls_rgw_gc_list returned with returned:0, entries.size=0, truncated=0, next_marker=''
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process cls_rgw_gc_list returned ENOENT for non expired entries, so setting cache entry to TRUE
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process cls_rgw_gc_queue_list_entries returned with return value:-22, entries.size=0, truncated=0, next_marker=''
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process entered with GC index_shard=5, max_secs=3600, expired_only=1
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process cls_rgw_gc_list returned with returned:0, entries.size=0, truncated=0, next_marker=''
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process cls_rgw_gc_list returned ENOENT for non expired entries, so setting cache entry to TRUE
2020-05-08T10:03:12.835+0000 7f7fcb600080 20 garbage collection: RGWGC::process cls_rgw_gc_queue_list_entries returned with return value:-22, entries.size=0, truncated=0, next_marker=''
...

Is there any more information I can provide you? Would restarting the RGW also re-initialise the GC and provide you with the logs you are interested in, or is this only done after upgrade?

Are you able to determine yet if this problem is originating from the RGW side or the OSD side?

#9

Updated by Pritha Srivastava almost 4 years ago

Do you see anything with "queue_enqueue():" in the osd logs?

The problem we see with listing gc entries is happening because of some error that occurred while writing the entry to disk, and I can see the error in the first entry that was written. I am trying to determine why that happened.

#10

Updated by James McEwan almost 4 years ago

No, I can't see anything with "queue_enqueue" in the osd logs, but we have not been running with the debug levels set to 20. I can set them to 20 again, but the question is: what will trigger "queue_enqueue"?

When you say an error "writing the entry to the disk", is the error with the rados object that was written to disk, or with some entry specific to the garbage collector?

#11

Updated by Pritha Srivastava almost 4 years ago

When an object is deleted, RGWGC::send_chain() is invoked, which in turn will invoke queue_enqueue(). So I just wanted to check whether an error happened while writing the gc object to the disk.

When I say error writing to the disk, I mean writing entries specific to gc to the disk.
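
As a hedged sketch only (not something run on this cluster; the bucket/object names and the s3cmd client below are placeholders), one way to exercise that path and capture it would be to raise the debug levels, delete a large multipart object through S3, and then search the osd logs:

ceph config set osd debug_osd 20
ceph config set osd debug_objclass 20
# deleting an object drives RGWGC::send_chain(), which should enqueue gc entries
s3cmd rm s3://some-bucket/some-large-multipart-object
grep 'queue_enqueue' /var/log/ceph/ceph-osd.*.log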

#12

Updated by James McEwan almost 4 years ago

OK, I'm not really sure there's any more info I can provide you at this stage to help progress this issue. Deleting another object won't provide you with the info you need, I assume?

Would reprovisioning the OSDs resolve the issue? Or is there a less invasive fix that you think would get the garbage collector up and running again?

#13

Updated by James McEwan almost 4 years ago

Hi Pritha,

Any thoughts on my previous comment? I wonder if simply creating a new pool for GC and updating the rgw zone would be sufficient to get the garbage collector running again?
"gc_pool": "default.rgw.log:gc",

#14

Updated by Pritha Srivastava almost 4 years ago

Hi James,

You can try this, for one gc object:
1. rados rm gc.0
2. invoke any radosgw-admin command - this will initialize the gc queue, and you will see a new gc.0 object.

If the above steps go fine, then you can perform the same steps for all gc objects.
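
For example, a sketch of how that could be scripted for all shards once gc.0 checks out (it assumes the default of 32 gc shards, rgw_gc_max_objs, and that the gc objects sit in the default.rgw.log pool under the "gc" namespace - adjust to your setup):

for i in $(seq 0 31); do
    rados -p default.rgw.log --namespace gc rm "gc.$i"
done
# any radosgw-admin command re-initializes the gc queues afterwards, e.g.:
radosgw-admin gc list --include-all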

Although you have mentioned that this is a new cluster installed at 15.2.0 and upgraded to 15.2.1, I just wanted to make sure that the cluster wasn't upgraded from Nautilus to 15.2.0?

Thanks,
Pritha

#15

Updated by James McEwan almost 4 years ago

Hi,

Your workaround was sufficient for us to get GC working again. I've now verified that I'm able to list and process items in the GC.

This was a brand new installation at 15.2.0, which we upgraded to 15.2.1. As I previously mentioned, I'm not sure whether the problem existed prior to the upgrade or not. I have a lab cluster which was also newly installed at 15.2.0 and upgraded to 15.2.1, and I did not encounter the issue there.

In our production cluster, we ran multiple COSBench workloads to perform some benchmarking directly after setting up the cluster. I'm not sure if it's possible that the objects that were deleted through COSBench somehow corrupted the gc object.

Regards,
James.

#16

Updated by Pritha Srivastava almost 4 years ago

Thanks for your reply, James.

If you see a similar issue again, please upload logs as soon as you see it.

Thanks,
Pritha

#17

Updated by Casey Bodley almost 4 years ago

  • Priority changed from Normal to High
#19

Updated by Casey Bodley almost 4 years ago

  • Status changed from Triaged to Pending Backport
  • Pull request ID set to 35264
#20

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #45844: octopus: radosgw gc issue - failed to list objs: (22) Invalid argument added
#21

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

