Bug #61716

open

Production random data not accessible. Object not found on healthy cluster.

Added by Jonas Nemeikšis 11 months ago. Updated about 2 months ago.

Status:
Need More Info
Priority:
Normal
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This seems like a duplicate of everything described in:

https://tracker.ceph.com/issues/47866

Cluster version:

Pacific 16.2.9

I can see the objects here:

radosgw-admin bucket radoslist --bucket=<bucket>

but I cannot stat them here:

radosgw-admin object stat --object=<object>
ERROR: failed to stat object, returned error: (2) No such file or directory

Primary OSD logs:

v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 367+0+0 (crc 0 0 0) 0x55d7fcee5680 con 0x55d7fd76e800

Actions #1

Updated by Casey Bodley 11 months ago

  • Status changed from New to Need More Info
  • Assignee set to J. Eric Ivancich

https://tracker.ceph.com/issues/47866 tracked a bug that caused data loss. Is it possible that the objects in question were written during the Octopus release, before that fix was in place?

Actions #2

Updated by Jonas Nemeikšis 11 months ago

No, the objects were written on Pacific. I've determined that the unfound objects are parts of multipart objects. One more thing: many of the unfound objects were written while the cluster was slightly unhealthy due to a flapping network device. Could this bugfix also be related? https://tracker.ceph.com/issues/57519

Actions #3

Updated by Jonas Nemeikšis 10 months ago

Is there any more information I could provide?

Actions #4

Updated by Jonas Nemeikšis 10 months ago

More findings.

I listed the bucket's rados objects with:

radosgw-admin bucket radoslist --bucket=bucket-name

and then checked each one with:

rados stat -p pool object_id

A few of the _shadow parts of the file that fails to download are missing.
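For reference, a minimal sketch of how that check can be scripted (assuming the data pool is default.rgw.buckets.data; substitute your own bucket and pool names):

# list every rados object RGW believes belongs to the bucket,
# stat each one in the data pool, and print the ones that are missing
radosgw-admin bucket radoslist --bucket=bucket-name | while IFS= read -r obj; do
    rados -p default.rgw.buckets.data stat "$obj" >/dev/null 2>&1 || echo "MISSING: $obj"
done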

Actions #5

Updated by Jonas Nemeikšis 10 months ago

rados stat output for the missing parts:
`No such file or directory`

Actions #6

Updated by J. Eric Ivancich 10 months ago

Do you happen to have logs from when the objects were written? Without those or a reproducer it will be very difficult to track down the cause, and whether it's at the RGW layer, the rados layer, or the interface between the two.

Actions #7

Updated by Jonas Nemeikšis 10 months ago

No, I don't have any logs and I cannot reproduce it yet, but I am still trying to. One more fact I found is that this leak only happens with files larger than 5GB. The JetBackup S3 client handles multipart in a strange way: it makes each part as large as it can, so for a file of ~6GB one part will always be 5368709120 bytes, and so on. Another fact is that all of the missing parts are ._shadow_. objects.
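A quick sanity check of that part size (plain shell arithmetic, nothing cluster-specific):

# 5 GiB = 5 * 1024^3 bytes
echo $((5 * 1024 * 1024 * 1024))   # prints 5368709120

so a ~6GB file uploaded this way becomes one full 5 GiB part plus a single small remainder part.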

Actions #8

Updated by Jonas Nemeikšis 9 months ago

Update: it seems I can now reproduce this issue. Our RGW nodes are reached through HAProxy nodes, and we have `timeout server 50s` in the HAProxy config. I can see uploads timing out and only succeeding after 3-4 attempts.
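For context, a minimal sketch of the relevant part of haproxy.cfg (the backend and server names here are hypothetical; only the `timeout server 50s` line reflects our setup):

backend rgw_backend
    timeout server 50s          # HAProxy aborts the RGW response after 50s of inactivity
    server rgw1 <rgw-host>:<port> check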

1. first timeout:

2023-07-26T14:52:17.085+0000 7fcea8606700  0 req 4116432839796972413 50.035957336s ERROR: RESTFUL_IO(s)->complete_header() returned err=Broken pipe
2023-07-26T14:52:17.085+0000 7fcea8606700  1 ====== req done req=0x7fce8acc8620 op status=-2 http_status=404 latency=50.035957336s ======
2023-07-26T14:52:17.085+0000 7fcea8606700  1 beast: 0x7fce8acc8620: 2a02:4780:a:c0de::a5 - jet [26/Jul/2023:14:51:27.049 +0000] "PUT /jetbackup2/jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19?partNumber=2&uploadId=2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h HTTP/1.1" 404 8388653 - - - latency=50.035957336s

2. second timeout:

2023-07-26T14:53:07.573+0000 7fce99de9700  0 req 1297700206770153440 50.013954163s ERROR: RESTFUL_IO(s)->complete_header() returned err=Broken pipe
2023-07-26T14:53:07.573+0000 7fcea7604700  1 ====== req done req=0x7fcefc2d2620 op status=-2 http_status=404 latency=50.013954163s ======
2023-07-26T14:53:07.573+0000 7fcea7604700  1 beast: 0x7fcefc2d2620: 2a02:4780:a:c0de::a5 - jet [26/Jul/2023:14:52:17.559 +0000] "PUT /jetbackup2/jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19?partNumber=2&uploadId=2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h HTTP/1.1" 404 8388653 - - - latency=50.013954163s

3. third timeout:

2023-07-26T14:54:11.450+0000 7fce98de7700  0 req 2381992534535352282 63.494941711s ERROR: RESTFUL_IO(s)->complete_header() returned err=Broken pipe
2023-07-26T14:54:11.451+0000 7fcea5e01700  1 ====== req done req=0x7fcefc2d2620 op status=-2 http_status=404 latency=63.495941162s ======
2023-07-26T14:54:11.451+0000 7fcea5e01700  1 beast: 0x7fcefc2d2620: 2a02:4780:a:c0de::a3 - jet [26/Jul/2023:14:53:07.955 +0000] "PUT /jetbackup2/jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19?partNumber=2&uploadId=2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h HTTP/1.1" 404 2453667885 - - - latency=63.495941162s

4. fourth retry, successful:

2023-07-26T14:55:16.246+0000 7fcebc62e700  1 beast: 0x7fce8ad49620: 2a02:4780:a:c0de::a4 - jet [26/Jul/2023:14:53:58.998 +0000] "PUT /jetbackup2/jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19?partNumber=2&uploadId=2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h HTTP/1.1" 200 3221237760 - - - latency=77.247932434s

Checking the missing object; it is not found:

[mon1 ~]# radosgw-admin bucket radoslist --bucket=jetbackup2 |grep "6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99" 
6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99

[mon1 ~]# radosgw-admin object stat --object 6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99
ERROR: failed to stat object, returned error: (2) No such file or directory

More RGW logs about the missing object; it's interesting why it is being deleted:


2023-07-26T14:53:17.478+0000 7f5af8d34700  1 -- [v2:[2a02:4780:a:c0de::a7]:6824/3209,v1:[2a02:4780:a:c0de::a7]:6825/3209] --> [2a02:4780:a:c0de::a3]:0/2830877866 -- osd_op_reply(85769 6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99 [writefull 0~4190208] v4050'3402 uv3402 ondisk = 0) v8 -- 0x55afdf9e6000 con 0x55afcb900c00

2023-07-26T14:54:11.449+0000 7f5af8d34700  1 -- [v2:[2a02:4780:a:c0de::a7]:6824/3209,v1:[2a02:4780:a:c0de::a7]:6825/3209] --> [2a02:4780:a:c0de::a3]:0/2830877866 -- osd_op_reply(87085 6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99 [writefull 0~4190208] v4050'3405 uv3405 ondisk = 0) v8 -- 0x55afdf2e3680 con 0x55afcb900c00
2023-07-26T14:54:11.449+0000 7f5af8d34700  1 -- [v2:[2a02:4780:a:c0de::a7]:6824/3209,v1:[2a02:4780:a:c0de::a7]:6825/3209] --> [2a02:4780:a:c0de::a3]:0/2830877866 -- osd_op_reply(87088 6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99 [delete] v4050'3406 uv3406 ondisk = 0) v8 -- 0x55afd9cafd40 con 0x55afcb900c00

Actions #9

Updated by Casey Bodley 9 months ago

Jonas Nemeikšis wrote:

[mon1 ~]# radosgw-admin object stat --object 6869e9c3-1117-44db-b1cd-cf5621e11e71.830302.1__shadow_jetbackup_2_2_64bf9ebb6a02201b387db982/files/64c1321c000c047400000c19.2~7gbb10H_sh7D0bC6LpnJTwDK6smiB0h.2_99
ERROR: failed to stat object, returned error: (2) No such file or directory

radosgw-admin object stat operates on S3/Swift object names, not rados object names. You'd need to use the rados tool to inspect the rados objects: rados -p <data pool name> stat <rados object name>
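For anyone following along, the two forms look like this (all names below are placeholders):

# stat by S3/Swift object name -- what radosgw-admin object stat expects
radosgw-admin object stat --bucket=<bucket> --object=<s3 object name>

# stat by rados object name in the data pool -- what bucket radoslist prints
rados -p <data pool name> stat <rados object name>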

Actions #10

Updated by Jonas Nemeikšis 9 months ago

Sorry, I pasted the wrong command here. I do use the rados tool to inspect the missing rados objects.

Actions #11

Updated by Jonas Nemeikšis 9 months ago

I've found that this definitely does not depend on the client. I can also reproduce it with the aws-cli client.

Conditions and steps to reproduce:

1. File size ~7GB
2. Multipart chunk size configured to 5GB
3. Total number of parts: 2
4. HAProxy `timeout server 2m`
5. During the upload I see a few timeouts before the client finishes, so it seems the client is re-uploading chunks

When the server timeout is raised to 30m or so, I cannot find any missing objects. I think the problem is the combination of the large part size and the HAProxy server timeout. It is very interesting how and why the client's upload succeeds while some rados objects end up missing.
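For completeness, a minimal sketch of how the reproducer can be driven with aws-cli (the endpoint, bucket, and file names are placeholders; the chunk-size setting mirrors condition 2 above):

# force 5GB multipart chunks so a ~7GB file is uploaded as 2 parts
aws configure set default.s3.multipart_chunksize 5GB

# create a ~7GiB test file
dd if=/dev/urandom of=bigfile bs=1M count=7168

# upload through the HAProxy endpoint that has the short server timeout
aws --endpoint-url http://<haproxy-endpoint> s3 cp bigfile s3://<bucket>/bigfile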

Actions #12

Updated by J. Eric Ivancich about 2 months ago

This sounds like it could be a duplicate of this bug that was resolved:

https://tracker.ceph.com/issues/63642

Jonas, would you be able to re-test your reproducer on a version that has this fix?
