Bug #47866

Object not found on healthy cluster

Added by Julien Demais 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Backport: octopus
Regression: No
Severity: 1 - critical
ceph-qa-suite: rgw

Description

While using the aws CLI or fetching a file with curl, I get an error from the radosgw:

aws --endpoint=http://172.17.5.11:7480 s3 cp s3://mpd/0002/0002-dashevc-main-2s-v1.mp4 .
download failed: s3://mpd/0002/0002-dashevc-main-2s-v1.mp4 to ./0002-dashevc-main-2s-v1.mp4 An error occurred (NoSuchKey) when calling the GetObject operation: Unknown

Overall, the cluster looks healthy:

root@node1-1:~# ceph -s
  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon: 5 daemons, quorum node1-1,node2-1,node3-1,node4-1,node5-1 (age 3d)
    mgr: node2-1(active, since 3d), standbys: node3-1, node1-1, node4-1, node5-1
    osd: 6 osds: 6 up (since 3d), 6 in (since 13d)
         flags noout
    rgw: 6 daemons active (radosgw.node1-1, radosgw.node2-1, radosgw.node3-1, radosgw.node4-1, radosgw.node5-1, radosgw.node6-1)

  task status:

  data:
    pools:   9 pools, 481 pgs
    objects: 338.74k objects, 1.2 TiB
    usage:   3.7 TiB used, 17 TiB / 21 TiB avail
    pgs:     481 active+clean

It also seems like the file is available in the bucket index:

root@node1-1:~# radosgw-admin bi list --bucket=mpd | grep 0002-dashevc-main-2s-v1.mp4
        "idx": "0002/0002-dashevc-main-2s-v1.mp4",
            "name": "0002/0002-dashevc-main-2s-v1.mp4",

root@node1-1:~# radosgw-admin object stat --bucket=mpd --object=0002/0002-dashevc-main-2s-v1.mp4
{
    "name": "0002/0002-dashevc-main-2s-v1.mp4",
    "size": 31958035,
    "policy": {
        "acl": {
            "acl_user_map": [
                {
                    "user": "perfo",
                    "acl": 15
                }
            ],
            "acl_group_map": [
                {
                    "group": 1,
                    "acl": 1
                }
            ],
            "grant_map": [
                {
                    "id": "",
                    "grant": {
                        "type": {
                            "type": 2
                        },
                        "id": "",
                        "email": "",
                        "permission": {
                            "flags": 1
                        },
                        "name": "",
                        "group": 1,
                        "url_spec": "" 
                    }
                },
                {
                    "id": "perfo",
                    "grant": {
                        "type": {
                            "type": 0
                        },
                        "id": "perfo",
                        "email": "",
                        "permission": {
                            "flags": 15
                        },
                        "name": "perfo user",
                        "group": 0,
                        "url_spec": "" 
                    }
                }
            ]
        },
        "owner": {
            "id": "perfo",
            "display_name": "perfo user" 
        }
    },
    "etag": "3bec83e90cbc05823f71d3ddd7e40f33",
    "tag": "64992646-777f-4214-9a7e-6de270ba3d54.350436.82",
    "manifest": {
        "objs": [],
        "obj_size": 31958035,
        "explicit_objs": "false",
        "head_size": 4194304,
        "max_head_size": 4194304,
        "prefix": ".kPQbrpASUP-I4PXgozEZsXh5gQgo44u_",
        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 0,
                    "start_ofs": 4194304,
                    "part_size": 0,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            }
        ],
        "tail_instance": "",
        "tail_placement": {
            "bucket": {
                "name": "mpd",
                "marker": "64992646-777f-4214-9a7e-6de270ba3d54.350469.3",
                "bucket_id": "64992646-777f-4214-9a7e-6de270ba3d54.350469.3",
                "tenant": "",
                "explicit_placement": {
                    "data_pool": "",
                    "data_extra_pool": "",
                    "index_pool": "" 
                }
            },
            "placement_rule": "default-placement" 
        },
        "begin_iter": {
            "part_ofs": 0,
            "stripe_ofs": 0,
            "ofs": 0,
            "stripe_size": 4194304,
            "cur_part_id": 0,
            "cur_stripe": 0,
            "cur_override_prefix": "",
            "location": {
                "placement_rule": "default-placement",
                "obj": {
                    "bucket": {
                        "name": "mpd",
                        "marker": "64992646-777f-4214-9a7e-6de270ba3d54.350469.3",
                        "bucket_id": "64992646-777f-4214-9a7e-6de270ba3d54.350469.3",
                        "tenant": "",
                        "explicit_placement": {
                            "data_pool": "",
                            "data_extra_pool": "",
                            "index_pool": "" 
                        }
                    },
                    "key": {
                        "name": "0002/0002-dashevc-main-2s-v1.mp4",
                        "instance": "",
                        "ns": "" 
                    }
                },
                "raw_obj": {
                    "pool": "",
                    "oid": "",
                    "loc": "" 
                },
                "is_raw": false
            }
        },
        "end_iter": {
            "part_ofs": 4194304,
            "stripe_ofs": 29360128,
            "ofs": 31958035,
            "stripe_size": 2597907,
            "cur_part_id": 0,
            "cur_stripe": 7,
            "cur_override_prefix": "",
            "location": {
                "placement_rule": "default-placement",
                "obj": {
                    "bucket": {
                        "name": "mpd",
                        "marker": "64992646-777f-4214-9a7e-6de270ba3d54.350469.3",
                        "bucket_id": "64992646-777f-4214-9a7e-6de270ba3d54.350469.3",
                        "tenant": "",
                        "explicit_placement": {
                            "data_pool": "",
                            "data_extra_pool": "",
                            "index_pool": "" 
                        }
                    },
                    "key": {
                        "name": ".kPQbrpASUP-I4PXgozEZsXh5gQgo44u_7",
                        "instance": "",
                        "ns": "shadow" 
                    }
                },
                "raw_obj": {
                    "pool": "",
                    "oid": "",
                    "loc": "" 
                },
                "is_raw": false
            }
        }
    },
    "attrs": {
        "user.rgw.content_type": "video/mp4",
        "user.rgw.pg_ver": "\u0018",
        "user.rgw.source_zone": "",
        "user.rgw.tail_tag": "64992646-777f-4214-9a7e-6de270ba3d54.350436.82",
        "user.rgw.x-amz-acl": "public-read",
        "user.rgw.x-amz-content-sha256": "2fb5ab1a1fdb43c8f07ce17e2c14237f3827f7ff952cb7818eee7994f7f98072",
        "user.rgw.x-amz-date": "20200929T151731Z" 
    }
}
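For what it's worth, the manifest above is internally consistent: with obj_size = 31958035, a 4194304-byte head, and 4 MiB stripes, the tail should occupy 7 shadow stripes with the last one 2597907 bytes, which matches end_iter (cur_stripe = 7, stripe_size = 2597907). A minimal sketch of that arithmetic:

```python
# Stripe layout implied by the manifest above (values copied from this ticket).
OBJ_SIZE = 31958035    # "obj_size"
HEAD_SIZE = 4194304    # "head_size" / "max_head_size"
STRIPE_SIZE = 4194304  # "stripe_max_size"

tail = OBJ_SIZE - HEAD_SIZE             # bytes stored in shadow objects
full, last = divmod(tail, STRIPE_SIZE)  # full stripes + trailing partial stripe
stripes = full + (1 if last else 0)

print(stripes, last)  # 7 2597907 -- matching end_iter's cur_stripe / stripe_size
```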

Here is the osd mapping of this object's chunks:

root@node1-1:~# for i in $(seq 0 7); do ceph osd map default.rgw.buckets.data 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_${i}; done
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_0' -> pg 11.a6213653 (11.53) -> up ([5,2,1], p5) acting ([5,2,1], p5)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_1' -> pg 11.7615522a (11.2a) -> up ([4,3,1], p4) acting ([4,3,1], p4)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_2' -> pg 11.687dc40 (11.40) -> up ([0,5,1], p0) acting ([0,5,1], p0)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_3' -> pg 11.ceadb28d (11.8d) -> up ([3,2,5], p3) acting ([3,2,5], p3)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_4' -> pg 11.81caf2f4 (11.f4) -> up ([5,1,0], p5) acting ([5,1,0], p5)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_5' -> pg 11.b5b8aebb (11.bb) -> up ([5,2,1], p5) acting ([5,2,1], p5)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_6' -> pg 11.1a9d8fbc (11.bc) -> up ([3,0,2], p3) acting ([3,0,2], p3)
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_7' -> pg 11.aa597da7 (11.a7) -> up ([5,0,1], p5) acting ([5,0,1], p5)

root@node1-1:~# ceph osd map default.rgw.buckets.data 64992646-777f-4214-9a7e-6de270ba3d54.350469.3_0002-dashevc-main-2s-v1.mp4
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3_0002-dashevc-main-2s-v1.mp4' -> pg 11.774f69c0 (11.c0) -> up ([2,4,5], p2) acting ([2,4,5], p2)

root@node1-1:~# ceph osd map default.rgw.buckets.data 64992646-777f-4214-9a7e-6de270ba3d54.350469.3_0002/0002-dashevc-main-2s-v1.mp4
osdmap e2269 pool 'default.rgw.buckets.data' (11) object '64992646-777f-4214-9a7e-6de270ba3d54.350469.3_0002/0002-dashevc-main-2s-v1.mp4' -> pg 11.4015493 (11.93) -> up ([0,1,3], p0) acting ([0,1,3], p0)
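For readers following along, the RADOS names used in the commands above appear to be assembled from the bucket marker plus the S3 key for the head, and from the marker plus `__shadow_` plus the manifest prefix and a stripe index for the tail (the RGW log excerpt shows reads of suffixes 1 through 7). A sketch of that assembly, using the values from this ticket:

```python
# Assembling the RADOS object names seen in the `ceph osd map` commands above.
marker = "64992646-777f-4214-9a7e-6de270ba3d54.350469.3"  # bucket marker
prefix = ".kPQbrpASUP-I4PXgozEZsXh5gQgo44u_"              # manifest "prefix"
key = "0002/0002-dashevc-main-2s-v1.mp4"                  # S3 object key

head_name = f"{marker}_{key}"
shadow_names = [f"{marker}__shadow_{prefix}{i}" for i in range(1, 8)]

print(head_name)         # head object, the one `rados get` still retrieves
print(shadow_names[-1])  # last tail stripe
```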

However, the radosgw logs suggest that some objects are seen as missing on the primary OSDs:

2020-10-12T10:47:01.585+0000 7f9bd7f57700 20 rados->get_obj_iterate_cb oid=64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_3 obj-ofs=12582912 read_ofs=0 len=4194304
2020-10-12T10:47:01.585+0000 7f9bd7f57700  1 -- 172.17.5.11:0/1875369112 --> [v2:172.17.5.31:6800/216163,v1:172.17.5.31:6801/216163] -- osd_op(unknown.0.0:449201 11.8d 11:b14db573:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_3:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e2263) v8 -- 0x7f9c68024720 con 0x5603cb2e4290
2020-10-12T10:47:01.585+0000 7f9bd7f57700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-10-12T10:47:01.585+0000 7f9bd7f57700 20 RGWObjManifest::operator++(): result: ofs=16777216 stripe_ofs=16777216 part_ofs=4194304 rule->part_size=0
2020-10-12T10:47:01.585+0000 7f9cc614b700  1 -- 172.17.5.11:0/1875369112 <== osd.3 v2:172.17.5.31:6800/216163 69444 ==== osd_op_reply(449194 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_6 [read 0~4194304] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cbc0e6070 con 0x5603cb2e4290
2020-10-12T10:47:01.586+0000 7f9cc614b700  1 -- 172.17.5.11:0/1875369112 <== osd.3 v2:172.17.5.31:6800/216163 69445 ==== osd_op_reply(449201 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_3 [read 0~4194304] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cbc0e6070 con 0x5603cb2e4290
2020-10-12T10:47:01.587+0000 7f9cc694c700  1 -- 172.17.5.11:0/1875369112 <== osd.4 v2:172.17.5.51:6800/219435 59157 ==== osd_op_reply(449199 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_1 [read 0~4194304] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cb80d4040 con 0x5603cb2d9710
2020-10-12T10:47:01.587+0000 7f9cc694c700  1 -- 172.17.5.11:0/1875369112 <== osd.5 v2:172.17.5.21:6800/236481 103456 ==== osd_op_reply(449195 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_4 [read 0~4194304] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cb80d4040 con 0x7f9ca4010b10
2020-10-12T10:47:01.587+0000 7f9cc694c700  1 -- 172.17.5.11:0/1875369112 <== osd.5 v2:172.17.5.21:6800/236481 103457 ==== osd_op_reply(449196 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_5 [read 0~4194304] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cb80d4040 con 0x7f9ca4010b10
2020-10-12T10:47:01.587+0000 7f9cc694c700  1 -- 172.17.5.11:0/1875369112 <== osd.5 v2:172.17.5.21:6800/236481 103458 ==== osd_op_reply(449198 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_7 [read 0~2597907] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cb80d4040 con 0x7f9ca4010b10
2020-10-12T10:47:01.588+0000 7f9bd2f4d700  2 req 48 0.010999928s s3:get_obj completing
2020-10-12T10:47:01.588+0000 7f9bd2f4d700  2 req 48 0.010999928s s3:get_obj op status=-2
2020-10-12T10:47:01.588+0000 7f9bd2f4d700  2 req 48 0.010999928s s3:get_obj http status=404
2020-10-12T10:47:01.588+0000 7f9bd2f4d700  1 ====== req done req=0x7f9cb46f57d0 op status=-2 http_status=404 latency=0.010999928s ======
2020-10-12T10:47:01.588+0000 7f9bd2f4d700  1 beast: 0x7f9cb46f57d0: 172.17.2.128 - - [2020-10-12T10:47:01.588700+0000] "GET /mpd/0002/0002-dashevc-main-2s-v1.mp4 HTTP/1.1" 404 214 - "aws-cli/1.18.156 Python/2.7.17 Linux/4.15.0-118-generic botocore/1.18.15" bytes=16777216-25165823
2020-10-12T10:47:01.588+0000 7f9bc8f39700  2 req 49 0.009999935s s3:get_obj completing
2020-10-12T10:47:01.588+0000 7f9bc8f39700  2 req 49 0.009999935s s3:get_obj op status=-2
2020-10-12T10:47:01.588+0000 7f9bc8f39700  2 req 49 0.009999935s s3:get_obj http status=404
2020-10-12T10:47:01.588+0000 7f9bc8f39700  1 ====== req done req=0x7f9cb46747d0 op status=-2 http_status=404 latency=0.009999935s ======
2020-10-12T10:47:01.588+0000 7f9bc8f39700  1 beast: 0x7f9cb46747d0: 172.17.2.128 - - [2020-10-12T10:47:01.588700+0000] "GET /mpd/0002/0002-dashevc-main-2s-v1.mp4 HTTP/1.1" 404 214 - "aws-cli/1.18.156 Python/2.7.17 Linux/4.15.0-118-generic botocore/1.18.15" bytes=25165824-
2020-10-12T10:47:01.597+0000 7f9cc694c700  1 -- 172.17.5.11:0/1875369112 <== osd.0 v2:172.17.5.41:6800/217855 3380 ==== osd_op_reply(449197 64992646-777f-4214-9a7e-6de270ba3d54.350469.3_0002/0002-dashevc-main-2s-v1.mp4 [cmpxattr (47) op 1 mode 1,read 0~4194304 out=4194304b] v0'0 uv25 ondisk = 0) v8 ==== 264+0+4194304 (crc 0 0 0) 0x7f9cb80d4040 con 0x7f9cb801d430
2020-10-12T10:47:01.597+0000 7f9cc694c700  1 -- 172.17.5.11:0/1875369112 <== osd.0 v2:172.17.5.41:6800/217855 3381 ==== osd_op_reply(449200 64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_2 [read 0~4194304] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 232+0+0 (crc 0 0 0) 0x7f9cb80d4040 con 0x7f9cb801d430
2020-10-12T10:47:01.598+0000 7f9bcd742700  2 req 47 0.020999862s s3:get_obj completing
2020-10-12T10:47:01.598+0000 7f9bcd742700  2 req 47 0.020999862s s3:get_obj op status=-2
2020-10-12T10:47:01.598+0000 7f9bcd742700  2 req 47 0.020999862s s3:get_obj http status=404
2020-10-12T10:47:01.598+0000 7f9bcd742700  1 ====== req done req=0x7f9cb47767d0 op status=-2 http_status=404 latency=0.020999862s ======
2020-10-12T10:47:01.598+0000 7f9bcd742700  1 beast: 0x7f9cb47767d0: 172.17.2.128 - - [2020-10-12T10:47:01.598700+0000] "GET /mpd/0002/0002-dashevc-main-2s-v1.mp4 HTTP/1.1" 404 214 - "aws-cli/1.18.156 Python/2.7.17 Linux/4.15.0-118-generic botocore/1.18.15" bytes=0-8388607
2020-10-12T10:47:01.598+0000 7f9bcef45700  2 req 50 0.019999869s s3:get_obj completing
2020-10-12T10:47:01.598+0000 7f9bcef45700  2 req 50 0.019999869s s3:get_obj op status=-2
2020-10-12T10:47:01.598+0000 7f9bcef45700  2 req 50 0.019999869s s3:get_obj http status=404
2020-10-12T10:47:01.598+0000 7f9bcef45700  1 ====== req done req=0x7f9cb45f37d0 op status=-2 http_status=404 latency=0.019999869s ======
2020-10-12T10:47:01.598+0000 7f9bcef45700  1 beast: 0x7f9cb45f37d0: 172.17.2.128 - - [2020-10-12T10:47:01.598700+0000] "GET /mpd/0002/0002-dashevc-main-2s-v1.mp4 HTTP/1.1" 404 214 - "aws-cli/1.18.156 Python/2.7.17 Linux/4.15.0-118-generic botocore/1.18.15" bytes=8388608-16777215

Here are logs of osd.5:

2020-10-13T06:50:53.767+0000 7f6bbe584700 20 osd.5 pg_epoch: 2269 pg[11.f4( v 2240'13384 (2132'12090,2240'13384] local-lis/les=2243/2244 n=1273 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,1,0] r=0 lpr=2243 crt=2240'13384 lcod 0'0 mlcod 0'0 active+clean] do_op: op osd_op(client.3854431.0:580190 11.f4 11:2f4f5381:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_4:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e2269) v8
2020-10-13T06:50:53.767+0000 7f6bbe584700 20 osd.5 pg_epoch: 2269 pg[11.f4( v 2240'13384 (2132'12090,2240'13384] local-lis/les=2243/2244 n=1273 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,1,0] r=0 lpr=2243 crt=2240'13384 lcod 0'0 mlcod 0'0 active+clean] op_has_sufficient_caps session=0x557de4a3a780 pool=11 (default.rgw.buckets.data ) pool_app_metadata={rgw={}} need_read_cap=1 need_write_cap=0 classes=[] -> yes
2020-10-13T06:50:53.767+0000 7f6bbe584700 10 osd.5 pg_epoch: 2269 pg[11.f4( v 2240'13384 (2132'12090,2240'13384] local-lis/les=2243/2244 n=1273 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,1,0] r=0 lpr=2243 crt=2240'13384 lcod 0'0 mlcod 0'0 active+clean] do_op osd_op(client.3854431.0:580190 11.f4 11:2f4f5381:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_4:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e2269) v8 may_read -> read-ordered flags ondisk+read+known_if_redirected
2020-10-13T06:50:53.767+0000 7f6bbe584700 10 osd.5 pg_epoch: 2269 pg[11.f4( v 2240'13384 (2132'12090,2240'13384] local-lis/les=2243/2244 n=1273 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,1,0] r=0 lpr=2243 crt=2240'13384 lcod 0'0 mlcod 0'0 active+clean] get_object_context: obc NOT found in cache: 11:2f4f5381:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_4:head
2020-10-13T06:50:53.768+0000 7f6bbe584700 10 osd.5 pg_epoch: 2269 pg[11.f4( v 2240'13384 (2132'12090,2240'13384] local-lis/les=2243/2244 n=1273 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,1,0] r=0 lpr=2243 crt=2240'13384 lcod 0'0 mlcod 0'0 active+clean] get_object_context: no obc for soid 11:2f4f5381:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_4:head and !can_create
2020-10-13T06:50:53.768+0000 7f6bbe584700 20 osd.5 pg_epoch: 2269 pg[11.f4( v 2240'13384 (2132'12090,2240'13384] local-lis/les=2243/2244 n=1273 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,1,0] r=0 lpr=2243 crt=2240'13384 lcod 0'0 mlcod 0'0 active+clean] do_op: find_object_context got error -2
2020-10-13T06:50:53.768+0000 7f6bbe584700 10 osd.5 2269 dequeue_op 0x557e5442ef20 finish
2020-10-13T06:50:53.768+0000 7f6bbe584700 20 osd.5 op_wq(4) _process empty q, waiting
2020-10-13T06:50:53.768+0000 7f6bda058700 15 osd.5 2269 enqueue_op 0x557e535794a0 prio 63 cost 0 latency 0.000014 epoch 2269 osd_op(client.3854431.0:580191 11.bb 11.b5b8aebb (undecoded) ondisk+read+known_if_redirected e2269) v8
2020-10-13T06:50:53.768+0000 7f6bda058700 20 osd.5 op_wq(3) _enqueue OpSchedulerItem(11.bb PGOpItem(op=osd_op(client.3854431.0:580191 11.bb 11.b5b8aebb (undecoded) ondisk+read+known_if_redirected e2269) v8) prio 63 cost 0 e2269)
2020-10-13T06:50:53.768+0000 7f6bbad7d700 20 osd.5 op_wq(3) _process 11.bb to_process <> waiting <> waiting_peering {}
2020-10-13T06:50:53.768+0000 7f6bbad7d700 20 osd.5 op_wq(3) _process OpSchedulerItem(11.bb PGOpItem(op=osd_op(client.3854431.0:580191 11.bb 11.b5b8aebb (undecoded) ondisk+read+known_if_redirected e2269) v8) prio 63 cost 0 e2269) queued
2020-10-13T06:50:53.768+0000 7f6bbad7d700 20 osd.5 op_wq(3) _process 11.bb to_process <OpSchedulerItem(11.bb PGOpItem(op=osd_op(client.3854431.0:580191 11.bb 11.b5b8aebb (undecoded) ondisk+read+known_if_redirected e2269) v8) prio 63 cost 0 e2269)> waiting <> waiting_peering {}
2020-10-13T06:50:53.768+0000 7f6bbad7d700 20 osd.5 op_wq(3) _process OpSchedulerItem(11.bb PGOpItem(op=osd_op(client.3854431.0:580191 11.bb 11.b5b8aebb (undecoded) ondisk+read+known_if_redirected e2269) v8) prio 63 cost 0 e2269) pg 0x557dde47c800
2020-10-13T06:50:53.768+0000 7f6bbed85700 20 osd.5 op_wq(3) _process empty q, waiting
2020-10-13T06:50:53.768+0000 7f6bbad7d700 10 osd.5 2269 dequeue_op 0x557e535794a0 prio 63 cost 0 latency 0.000194 osd_op(client.3854431.0:580191 11.bb 11.b5b8aebb (undecoded) ondisk+read+known_if_redirected e2269) v8 pg pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean]
2020-10-13T06:50:53.768+0000 7f6bbad7d700 10 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] _handle_message: 0x557e535794a0
2020-10-13T06:50:53.768+0000 7f6bbad7d700 20 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] do_op: op osd_op(client.3854431.0:580191 11.bb 11:dd751dad:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_5:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e2269) v8
2020-10-13T06:50:53.768+0000 7f6bbad7d700 20 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] op_has_sufficient_caps session=0x557de4a3a780 pool=11 (default.rgw.buckets.data ) pool_app_metadata={rgw={}} need_read_cap=1 need_write_cap=0 classes=[] -> yes
2020-10-13T06:50:53.768+0000 7f6bbad7d700 10 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] do_op osd_op(client.3854431.0:580191 11.bb 11:dd751dad:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_5:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e2269) v8 may_read -> read-ordered flags ondisk+read+known_if_redirected
2020-10-13T06:50:53.768+0000 7f6bbad7d700 10 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] get_object_context: obc NOT found in cache: 11:dd751dad:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_5:head
2020-10-13T06:50:53.769+0000 7f6bbad7d700 10 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] get_object_context: no obc for soid 11:dd751dad:::64992646-777f-4214-9a7e-6de270ba3d54.350469.3__shadow_.kPQbrpASUP-I4PXgozEZsXh5gQgo44u_5:head and !can_create
2020-10-13T06:50:53.769+0000 7f6bbad7d700 20 osd.5 pg_epoch: 2269 pg[11.bb( v 2153'13237 (2021'11944,2153'13237] local-lis/les=2243/2244 n=1304 ec=223/171 lis/c=2243/2243 les/c/f=2244/2244/0 sis=2243) [5,2,1] r=0 lpr=2243 crt=2153'13237 lcod 0'0 mlcod 0'0 active+clean] do_op: find_object_context got error -2

What surprises me though is that the file is indeed present on the cluster:

root@node1-1:~# rados -p default.rgw.buckets.data get 64992646-777f-4214-9a7e-6de270ba3d54.350469.3_0002/0002-dashevc-main-2s-v1.mp4 out

The out file is retrieved intact, so the head object is still present on the cluster.
The issue seems to be with the "shadow" (tail) objects.

radosgw-admin-stat.json.gz (95.2 KB) Janek Bevendorff, 11/12/2020 05:04 PM

ceph.conf (14 KB) Janek Bevendorff, 11/12/2020 05:04 PM

radosgw-admin-stat.json.gz (18.6 KB) Janek Bevendorff, 11/13/2020 08:17 AM


Related issues

Precedes rgw - Bug #48330: rgw: fix GC defer process and clean up from immediate fix New
Copied to rgw - Backport #48331: octopus: Object not found on healthy cluster Resolved

History

#1 Updated by Janek Bevendorff 2 months ago

Can you check that the RADOS object is larger than 0 bytes? This is a creeping data loss that has been reported on Octopus in several places now and we have experienced it as well. All of a sudden, objects disappear, but their metadata remain.

For us, this issue has manifested itself as follows: Of the 78 buckets we have (each holding tens to hundreds of thousands of objects), two are suddenly missing objects. S3 and `radosgw-admin stat` can still find the objects' metadata, but the payload object is gone. Via the S3 interface, this shows as a 404 error (NoSuchKey).

All affected objects are multipart objects, but that may be a coincidence. If I run `rados stat` on the head object, it shows a size of 0 bytes. All subsequent shadow objects are gone completely. I pulled a dump of all objects in my RGW buckets pool (61GB of object names) and searched it (with ack) for the shadow suffixes shown in `radosgw-admin stat`. I cannot find any of them.
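The cross-check described above (manifest-expected stripe names vs. a pool listing) can be sketched as follows; the marker, prefix, and listing here are hypothetical stand-ins, where the real listing would come from `rados -p <data pool> ls`:

```python
# Hypothetical cross-check: which manifest-expected stripes are absent from a pool listing?
marker = "64992646-777f-4214-9a7e-6de270ba3d54.350469.3"
prefix = ".kPQbrpASUP-I4PXgozEZsXh5gQgo44u_"

# Names the manifest implies should exist (suffixes 1..7, as in the original report).
expected = {f"{marker}__shadow_{prefix}{i}" for i in range(1, 8)}

# Stand-in for the `rados ls` dump; here only the head object survived,
# mirroring the broken objects in this ticket.
listed = {f"{marker}_0002/0002-dashevc-main-2s-v1.mp4"}

missing = sorted(expected - listed)
print(len(missing))  # all shadow stripes gone -> NoSuchKey despite a clean bucket index
```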

So far, we have lost 16 objects this way. This number hasn't increased in the last two days, but I can only expect this data loss to continue silently in the background. This is a VERY serious issue and I don't know how to debug it.

#2 Updated by Casey Bodley 2 months ago

  • Project changed from Ceph to rgw
  • Priority changed from Normal to Urgent

#3 Updated by Casey Bodley 2 months ago

Do you have any other storage classes configured? Is there a lifecycle policy on the bucket?

#4 Updated by Casey Bodley 2 months ago

  • Status changed from New to Triaged

#5 Updated by Mark Kogan 2 months ago

Could you please also attach your ceph.conf
and the `radosgw-admin object stat` output of a similarly sized, intact object from the same bucket?

#6 Updated by Janek Bevendorff 2 months ago

I am not running Ceph in Kubernetes.

I attached a config dump and a stat dump of an OK object. All sensitive information has been XXX'ed out.

#7 Updated by Janek Bevendorff 2 months ago

I have not set any lifecycle policy on the bucket and I double checked that there is indeed none. s3cmd getlifecycle s3://bucket gives NoSuchLifecycleConfiguration.

#8 Updated by J. Eric Ivancich 2 months ago

I did some experiments, and I don't think the head object having a size of 0 is necessarily an indicator of a problem. That happens whenever a multipart upload is invoked.

If multipart upload is not invoked, the head object will contain either all the data (when the object is small) or the first stripe of it (when the object is large enough to require shadow objects).
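Eric's distinction can be stated compactly. This is a sketch of the expectation described in his comment, not actual RGW code: for multipart uploads the head carries no payload, so `rados stat` reporting 0 bytes on the head is expected there and is not by itself a sign of damage.

```python
def head_expected_bytes(obj_size: int, max_head_size: int, multipart: bool) -> int:
    """Sketch (not RGW code) of how much payload the head object should hold."""
    if multipart:
        return 0  # multipart heads carry metadata only
    # Small objects fit entirely in the head; larger ones keep only the first stripe.
    return min(obj_size, max_head_size)

print(head_expected_bytes(31958035, 4194304, multipart=False))  # 4194304
print(head_expected_bytes(31958035, 4194304, multipart=True))   # 0
```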

#9 Updated by Janek Bevendorff 2 months ago

Indeed. I just noticed the other object isn't a multipart object. I attached the radosgw-admin stat dump of a working multipart object. When I rados stat the head object, it also reports 0 bytes. But I can find all the shadow objects in my rados ls dump, which is not the case for the broken objects.

#10 Updated by J. Eric Ivancich 2 months ago

I'm copying in answers Janek gave to questions I posted on ceph-users.

1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous?

I upgraded from the latest Luminous release.

2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe.

Nope.

3. Is anyone making copies of the affected objects (to same or to a different bucket) prior to seeing the issue? And if they are making copies, does the destination bucket have lifecycle rules? And if they are making copies, are those copies ever being removed?

We are not making copies, but we have bucket ACLs in place, which allow different users to access the objects. I doubt this is the problem though, otherwise we probably would have lost terabytes upon terabytes and not 16 objects so far.

4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted).

I haven't had any new data loss in the last few days (at least I think so: I read 1 byte from each object but didn't compare checksums, so I cannot say whether all objects are complete, but at least they are all there).

#11 Updated by J. Eric Ivancich 2 months ago

@Julien Demais and ANYONE ELSE experiencing this issue,

I posted these questions on ceph-users but wanted to see if you would post answers here in the tracker.

1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous?

2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe.

3. Is anyone making copies of the affected objects (to same or to a different bucket) prior to seeing the issue? And if they are making copies, does the destination bucket have lifecycle rules? And if they are making copies, are those copies ever being removed?

4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted).

Thanks,

Eric

#12 Updated by Julien Demais 2 months ago

Please find my answers below:

1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous?

I am seeing this issue on a cluster created directly on Octopus (1.15.3). No upgrade in my case.

2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe.

I did not specify any while creating the bucket. I had only one bucket, created with the aws CLI (using the default placement).

3. Is anyone making copies of the affected objects (to same or to a different bucket) prior to seeing the issue? And if they are making copies, does the destination bucket have lifecycle rules? And if they are making copies, are those copies ever being removed?

No one is using my Ceph cluster so far; I am in the process of evaluating Ceph, so I am the only user. I am evaluating Ceph's read/write performance as a streaming-platform storage solution, so my use case was the following:
- Create a bucket
- Add a bunch of files there, again using the aws s3 CLI, with a public-read ACL
- Create 1500 copies of those files in the same bucket to generate a fictional large content set
- Run a jmeter script simulating lots of parallel HTTPS GET requests for the files, measuring when the clients would suffer from an empty buffer
- During the test, randomly reboot nodes in order to test Ceph's resiliency

My setup is a 6-node cluster. Each node has a 4 TB SSD drive (so 1 OSD per node) and a radosgw running. Five of those nodes are also MONs and MGRs. I know this is not a recommended deployment, but this is all I have.

I had lots of test rounds working successfully for days/weeks, and all of a sudden the HTTP requests started to return 404 errors on some files.

4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted).

Sadly, I had to reset my cluster since I created this bug ticket. I will try to recreate the issue, though, if it can somehow help find the root cause.

#13 Updated by Mark Kogan 2 months ago

Julien Demais wrote:

No one is using my Ceph cluster so far; I am in the process of evaluating Ceph, so I am the only one using it. I am evaluating Ceph's read/write performance as a streaming-platform storage solution, so my use case was the following:
- Create a bucket
- Add a bunch of files there, still using the aws s3 cli, with a public-read ACL
- Create 1500 copies of those files in the same bucket to generate a fictional big content set
- I then have a JMeter script that simulates lots of parallel HTTPS GET requests for the files and measures when the clients would suffer from an empty buffer.
- During the test, I randomly reboot nodes in order to test Ceph's resiliency.

@Julien and @Janek
We are trying to generate a similar workload to reproduce this.

A clarification question, please, to narrow down the affected objects.

Does the problem occur only on shadow objects (non-multipart), like the originally reported object
s3://mpd/0002/0002-dashevc-main-2s-v1.mp4

"ns": "shadow"

or only on multipart objects
"ns": "multipart"

or on both object types?

#14 Updated by Mark Kogan 2 months ago

Janek Bevendorff wrote:

Indeed. I just noticed the other object isn't a multipart object. I attached the radosgw-admin stat dump of a working multipart object. When I rados stat the head object, it also reports 0 bytes. But I can find all the shadow objects in my rados ls dump, which is not the case for the broken objects.

@Janek, thanks
I noticed that lz4 compression was enabled in the attached JSON:

     "compression": {                
         "compression_type": "lz4",  

A few questions, please:
- Does the issue reproduce on objects that were not compressed (or were compressed with something other than lz4)?
- Is it possible that there are multiple RGWs and not all of them were restarted after compression was enabled?

#15 Updated by Janek Bevendorff 2 months ago

So far, the issue has only affected multipart objects. I had already suspected that this might be part of the issue, but I cannot be sure. The data loss is steady but slow, so it's hard to say anything for certain. After four days of no data loss, we lost another 6 objects over the weekend, totalling 19 lost S3 objects now. The pool has a total of 162 million RADOS objects, so the percentage is quite low, but growing.

We will now try to re-upload all objects as single-part objects as a direct countermeasure, which with a bit of luck will stop this from happening.

The LZ4 compression is another thing I already suspected (especially considering the previous and now fixed bug where LZ4 caused OSDs to crash due to unexpected debug output). I have disabled it for now, but cannot make any statements about what would happen to uncompressed objects at this time. We have had LZ4 compression enabled from the start, so all objects are compressed. We do have multiple RadosGWs running (7 to be precise), but they are all in sync, since compression has been on since way before the upgrade from Nautilus. I don't believe we ever had an inconsistent setup here.

What I do remember is error messages about compression during OSD upgrades (i.e., compression on RADOS level) from Nautilus. The errors silently disappeared, though, and the OSDs started normally and everything was fine for a while. I just rechecked the logs and I can see some messages like this:

    2020-11-13T08:45:40.227+0100 7f4cd3a63700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2333323974  in db/084544.sst offset 18446744073709551615 size 18446744073709551615
    2020-11-13T08:45:40.319+0100 7f4cd3a63700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 177483464  in db/084545.sst offset 18446744073709551615 size 18446744073709551615
    2020-11-13T08:45:40.367+0100 7f4cd3a63700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 177483472  in db/084546.sst offset 18446744073709551615 size 18446744073709551615
    2020-11-13T08:45:40.415+0100 7f4cd3a63700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 184561352  in db/084547.sst offset 18446744073709551615 size 18446744073709551615
    2020-11-13T08:45:40.467+0100 7f4cd3a63700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 181186248  in db/084548.sst offset 18446744073709551615 size 18446744073709551615
    2020-11-13T08:45:40.507+0100 7f4cd3a63700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 177483464  in db/084549.sst offset 18446744073709551615 size 18446744073709551615
    2020-11-13T08:45:40.579+0100 7f4cd3a63700  4 rocksdb: [db/compaction_job.cc:1395] [default] [JOB 1363] Compacted 1@2 + 6@3 files to L3 => 274404171 bytes
    2020-11-13T08:45:40.583+0100 7f4cd3a63700  4 rocksdb: (Original Log Time 2020/11/13-08:45:40.585011) [db/compaction_job.cc:768] [default] compacted to: files[0 0 27 228 232 0 0] max score 1.08, MB/sec: 7.4 rd, 7.3 wr, level 3, files in(1, 6) out(7) MB in(65.1, 200.3) out(261.7), read-write-amplify(8.1) write-amplify(4.0) OK, records in: 7401648, records dropped: 98 output_compression: NoCompression

    2020-11-13T08:45:40.583+0100 7f4cd3a63700  4 rocksdb: (Original Log Time 2020/11/13-08:45:40.585069) EVENT_LOG_v1 {"time_micros": 1605253540585037, "job": 1363, "event": "compaction_finished", "compaction_time_micros": 37423934, "compaction_time_cpu_micros": 33899376, "output_level": 3, "num_output_files": 7, "total_output_size": 274404171, "num_input_records": 7394851, "num_output_records": 7394753, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 27, 228, 232, 0, 0]}
    2020-11-13T08:45:40.583+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540586300, "job": 1363, "event": "table_file_deletion", "file_number": 84516}
    2020-11-13T08:45:40.587+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540588263, "job": 1363, "event": "table_file_deletion", "file_number": 83119}
    2020-11-13T08:45:40.587+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540589516, "job": 1363, "event": "table_file_deletion", "file_number": 83118}
    2020-11-13T08:45:40.587+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540590572, "job": 1363, "event": "table_file_deletion", "file_number": 83117}
    2020-11-13T08:45:40.587+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540591869, "job": 1363, "event": "table_file_deletion", "file_number": 83116}
    2020-11-13T08:45:40.591+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540592967, "job": 1363, "event": "table_file_deletion", "file_number": 83115}
    2020-11-13T08:45:40.591+0100 7f4cd3a63700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1605253540594942, "job": 1363, "event": "table_file_deletion", "file_number": 83114}
    2020-11-13T08:45:40.591+0100 7f4cd3262700  4 rocksdb: [db/compaction_job.cc:1645] [default] [JOB 1364] Compacting 1@2 + 22@3 files to L3, score 1.08
    2020-11-13T08:45:40.591+0100 7f4cd3262700  4 rocksdb: [db/compaction_job.cc:1649] [default] Compaction start summary: Base version 1363 Base level 2, inputs: [84517(50MB)], [83641(64MB) 83642(64MB) 83638(65MB) 83640(58MB) 83123(64MB) 83124(64MB) 83125(64MB) 83126(64MB) 83127(64MB) 83128(64MB) 83129(64MB) 83130(64MB) 83131(64MB) 83132(64MB) 83133(64MB) 83134(64MB) 83135(64MB) 83136(64MB) 83137(64MB) 83138(64MB) 83139(64MB) 83140(64MB)]

This is from only a few days ago, so definitely after the upgrade.

#16 Updated by Dan van der Ster 2 months ago

Has #42670 been excluded?

#17 Updated by Janek Bevendorff 2 months ago

Interesting issue, but it doesn't really look like it's the same thing for two reasons:

1) We never saw this kind of data loss in Nautilus (the linked issue is a Nautilus issue and apparently has a fix already).
2) Objects disappear entirely without user intervention. Once uploaded, our objects are basically immutable and we are only trying to read them, not append anything.

#18 Updated by J. Eric Ivancich 2 months ago

Janek Bevendorff wrote:

What I do remember is error messages about compression during OSD upgrades (i.e., compression on RADOS level) from Nautilus. The errors silently disappeared, though, and the OSDs started normally and everything was fine for a while. I just rechecked the logs and I can see some messages like this:

2020-11-13T08:45:40.227+0100 7f4cd3a63700 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2333323974 in db/084544.sst offset 18446744073709551615 size 18446744073709551615

I tried to trace this to see if it could be related.

There are references to this error message in: https://www.mail-archive.com/ceph-users@ceph.io/msg04003.html

And Igor Fedotov says this is the tracker that root-caused and resolved the issue: https://tracker.ceph.com/issues/45613

So you may want to look there and make sure your Octopus is up to date.

#19 Updated by J. Eric Ivancich 2 months ago

Dan van der Ster wrote:

Has #42670 been excluded?

The first Octopus release containing that fix is 15.2.5, which was released on or about September 16, about two months ago.

Can everyone list the latest version of Octopus in which they've experienced this issue?

Thanks!

#20 Updated by Janek Bevendorff 2 months ago

I waited for the first few minor releases and upgraded from Nautilus directly to 15.2.5, so yes.

#21 Updated by J. Eric Ivancich 2 months ago

In Comment #11 I asked: Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'?

Let me follow on with what someone could look for in their logs if they were willing to change this configuration to help us track down the cause of this issue.

RGW can initiate a delete by sending an op to the OSD, and those ops are "refcount.put" and "rgw.obj_remove".

So you'd want to look for log lines that contain either of those two strings PLUS the bucket_id for the bucket. (You can get the bucket id for a bucket with `radosgw-admin metadata get bucket:<bucket-name>`.)
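For instance, a quick way to pull that id out of the metadata JSON is a sketch like the following (plain grep so no extra tooling is needed; "mpd" is this ticket's example bucket, substitute your own):

```shell
# Extract the bucket_id field from the bucket metadata.
radosgw-admin metadata get bucket:mpd |
    grep -o '"bucket_id": *"[^"]*"'
```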

Assuming the bucket_id is 04e65185-554b-4867-8f44-786bd8987ab8.24130.1 and that --debug_ms=1, then I think an egrep along the lines of:

egrep '04e65185-554b-4867-8f44-786bd8987ab8.24130.1.*(refcount.put|rgw.obj_remove)' ...

should work. (It might be even more efficient if all the "."s except the one next to "*" were changed to "\."; then it would look for a literal "." rather than match any 1 character.)
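A sketch of that fully escaped form, using the example bucket_id from this comment (the log path is an assumption; point it at wherever your RGW logs live):

```shell
# Literal-dot version of the search; matches either a direct tail-object
# removal or a refcount decrement for this bucket.
egrep '04e65185-554b-4867-8f44-786bd8987ab8\.24130\.1.*(refcount\.put|rgw\.obj_remove)' \
    /var/log/ceph/ceph-client.rgw.*.log
```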

If you find such lines for a bucket you're not expecting to delete from, I'd like to see a chunk of the logs from at least 1 minute before to at least 1 minute after. And if you find many examples, perhaps one example of rgw.obj_remove and one example of refcount.put to start off with.

I realize there may be some private information in the logs, and if so, please look into ways you might "cleanse" the logs. I'm happy to discuss that topic further.

#22 Updated by Janek Bevendorff 2 months ago

I have enabled the setting via

ceph config set client.rgw debug_ms 1

I will check the logs when the next object vanishes. This doesn't happen too often, though. I only hope the logs won't overflow until then.

#23 Updated by Janek Bevendorff 2 months ago

Short question here: isn't 1 the default log level anyway? I couldn't find any such messages in previous logs at least.

#24 Updated by Dan van der Ster 2 months ago

Janek Bevendorff wrote:

Short question here: isn't 1 the default log level anyway? I couldn't find any such messages in previous logs at least.

The default debug_ms is 0/1, which means it prints at level 0 but logs in memory at level 1 (in case of an assertion, the last 10000 memory log entries are dumped to the log file).

If you have debug_ms = 1 (i.e. 1/1) then you'll definitely notice, because the rgw log file will now be spammed with a line for every single message passed with the cluster.
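In ceph.conf terms, the two variants Dan describes would look roughly like this (the section name is illustrative; RGW instances are often configured under a more specific [client.rgw.<name>] section):

```ini
[client.rgw]
; default behaviour: nothing at this level in the log file,
; level-1 entries kept only in the in-memory ring buffer
debug_ms = 0/1
; what is being asked for here: every cluster message written to the log file
; debug_ms = 1/1
```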

#25 Updated by Dan van der Ster 2 months ago

Janek Bevendorff wrote:

I have enabled the setting via

[...]

I will check the logs when the next object vanishes. This doesn't happen too often, though. I only hope the logs won't overflow until then.

If that doesn't appear to be working, you probably need to do `ceph config set rgw debug_ms 1`

#26 Updated by Janek Bevendorff 2 months ago

The config documentation says the default for rgw is 1/5; that's why I was wondering. https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/

If that doesn't appear to be working, you probably need to do `ceph config set rgw debug_ms 1`

rgw isn't a valid target; only client.rgw works. I haven't seen any messages so far, though.

#27 Updated by Dan van der Ster 2 months ago

Janek Bevendorff wrote:

The config says the default for rgw is 1/5, that's why I was wondering. https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/

[...]

rgw isn't a valid target. Only client-rgw works. I haven't seen any messages so far, though.

hmm. indeed ceph config is inconsistent sometimes.

You can try the old school way to change options at runtime:

`ls /var/run/ceph/*.asok` -- get the asok of the running rgw, then

ceph --admin-daemon /path/to/asok config set debug_ms 1

#28 Updated by Janek Bevendorff 2 months ago

I believe client.rgw is doing the job. The default here seems to be 0/0, and when I turn it back off, I have nothing in my RadosGW logs. With it set to 1, I have a constant stream of osd_op_* messages.

#29 Updated by Denis Krienbühl 2 months ago

We're hitting this problem in our 15.2.5 cluster as well. So far we know of one S3 object that went missing, in a typical Docker registry bucket (no lifecycle, no versioning, multi-part uploads).

We tried setting `debug_ms 1`, but the amount of log output it creates is too much for us to handle. We are regularly running `gc list --include-all` to hopefully catch it deleting the next object when that happens.

Apart from that, we are trying to go through all the multipart objects expected in our buckets, comparing them to the objects in RADOS. This will take a long time to complete, though.

This is a very serious issue for us, if there's anything we can do to help speed this along, please let us know. This includes financial contributions.

#30 Updated by Denis Krienbühl 2 months ago

An update on our efforts: We enabled debug_ms 1 and we are trying to find an object that vanishes while these logs are active.

We are also checking our __multipart_ RADOS objects. We found that for S3 objects whose HEAD works but whose GET produces NoSuchKey, all related __multipart_ objects are gone. They show up in radosgw-admin bucket radoslist, but are not found by rados stat.

So far we only have one bucket where this occurs, but it's probably too early to call that a data point.

#31 Updated by Denis Krienbühl 2 months ago

Another update from our side: We have two sites, a larger and a smaller one. The smaller site was successfully scanned yesterday, without any missing objects.

The second site, however, turned up a second bucket where this occurred. There are two noteworthy correlations, again to be interpreted with caution:

1. Both buckets belong to the same customer, but they are driven by different software: one is a Docker registry, the other a basic bucket serving downloads of large-ish files. Neither has versioning or lifecycle rules active. From what we can tell, there's nothing special about them.

2. The upload dates of the files that went missing cluster suspiciously around a time when the customer had network issues with intermittent packet loss:

Last mod:  Mon, 31 Aug 2020 21:16:33 GMT (registry bucket)
Last mod:  Wed, 21 Oct 2020 14:52:36 GMT (regular bucket)
Last mod:  Wed, 21 Oct 2020 22:52:16 GMT (registry bucket)
Last mod:  Wed, 21 Oct 2020 22:53:25 GMT (registry bucket)
Last mod:  Wed, 21 Oct 2020 22:53:34 GMT (registry bucket)
Last mod:  Thu, 22 Oct 2020 09:12:31 GMT (regular bucket)
Last mod:  Thu, 22 Oct 2020 21:41:18 GMT (registry bucket)
Last mod:  Thu, 22 Oct 2020 21:41:38 GMT (registry bucket)
Last mod:  Thu, 29 Oct 2020 22:41:54 GMT (registry bucket)
Last mod:  Fri, 30 Oct 2020 18:07:38 GMT (registry bucket)
Last mod:  Fri, 30 Oct 2020 20:11:46 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 18:27:02 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 18:27:03 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 18:40:41 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 18:41:02 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 18:42:49 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 18:43:22 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 20:55:20 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 20:55:24 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 20:55:48 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 20:58:17 GMT (registry bucket)
Last mod:  Mon, 02 Nov 2020 23:06:45 GMT (registry bucket)
Last mod:  Tue, 03 Nov 2020 00:08:08 GMT (registry bucket)
Last mod:  Tue, 03 Nov 2020 00:09:03 GMT (registry bucket)
Last mod:  Tue, 03 Nov 2020 00:09:40 GMT (registry bucket)
Last mod:  Sun, 08 Nov 2020 14:11:02 GMT (regular bucket)

The dates on November 2nd and 3rd coincide with those network issues.


I don't think that's enough to be sure that packet loss is at the heart of this issue, but it certainly stands out from our data that the single customer we currently know to be affected tends to access our object storage from across the globe, where packet loss might play a role.

This isn't a lot to go on, but maybe it sparks an idea about what the cause could be. Note that we are still scanning our bigger site and may find other examples that contradict this.

We also want to start checking all the actions taken by the garbage collector, as seen in the debug log via rgw.obj_remove, hoping to catch an instance of this as it happens so we can provide a detailed log.

#32 Updated by Janek Bevendorff 2 months ago

I don't think the issue is packet loss per se. The affected buckets were all filled by a Hadoop job running on a nearby compute cluster. There may have been interruptions of some sort, but surely they weren't due to packet loss (there may have been timeouts, reschedules, preemptions).
What is weird is that most of the objects are single-part, but a (very) small percentage are in fact multi-part, and I can't figure out why. It must have to do with the way Hadoop's s3a driver works. We never actively chose to upload those as multi-part objects.

#33 Updated by Denis Krienbühl 2 months ago

Janek Bevendorff wrote:

I don't think the issue is packet loss per se. The affected buckets were all filled by a Hadoop job running on a nearby compute cluster. There may have been interruptions of some sort, but surely they weren't due to packet loss (there may have been timeouts, reschedules, preemptions).
What is weird is that most of the objects are single-part, but a (very) small percentage are in fact multi-part, and I can't figure out why. It must have to do with the way Hadoop's s3a driver works. We never actively chose to upload those as multi-part objects.

Oh, so the S3 objects that are missing on your side are mainly single-part objects? We've been focusing on multi-part, as the initial files we found missing were multi-part. If you see single-part objects as well, we might want to start searching those too.

#35 Updated by Janek Bevendorff 2 months ago

Sorry if that was unclear. The majority of all objects we have in total are single-part objects. The affected objects, however, are all multi-part (19 in total so far).

#36 Updated by Janek Bevendorff 2 months ago

I wanted to re-upload our multipart objects as single parts and noticed something very strange: not all multipart objects identify themselves as such via their ETags. In fact, when I upload a new multipart object via s3cmd, the ETag is ALWAYS the MD5 sum of the entire object. Many existing multipart ETags, however, look like this: 788a4bc54dd3eece8065d9f06045623d-18, which is, I presume, a hash computed over the per-part hashes plus a part count (so I cannot use it to check the integrity of the whole file). I don't know under which circumstances an object receives such an ETag, but I can only assume it does so when a multipart upload is interrupted and later resumed.

That said, ALL our affected objects have an ETag like this. Despite this updated knowledge, what I said earlier still remains true: the buckets contain a mixture of multipart and single-part objects (yes, also large ones, like 1.5 GB, for some reason; I verified this via radosgw-admin). So far only multipart objects with partial hashes are affected, but now I also have to assume that the actual number of multipart objects is far larger, since I cannot reliably identify them via S3.

I cannot be too confident here, since my sample size is only 19, but from these observations it appears likely that the issue affects only (or primarily) multipart objects with partial hashes.
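Under that assumption, one way to enumerate suspect objects is to list each key's ETag and keep those with the dashed suffix. A minimal sketch, assuming the aws cli is available and using this ticket's example endpoint and bucket (adjust both):

```shell
# List S3 keys whose ETag carries the "-<parts>" multipart suffix.
# Endpoint and bucket are this ticket's example values.
aws --endpoint=http://172.17.5.11:7480 s3api list-objects --bucket mpd \
    --query 'Contents[].[Key,ETag]' --output text |
    awk '$2 ~ /-[0-9]+"?$/ { print $1 }'
```

Given the observations above about tools disagreeing on how they surface the ETag, treat this as a heuristic rather than a reliable multipart detector.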

#37 Updated by Denis Krienbühl 2 months ago

Janek Bevendorff wrote:

I wanted to reupload our multipart objects as single parts and noticed something very strange: Not all multipart objects identify themselves as such via their ETags. In fact, when I upload a new multipart object via s3cmd, the ETag will ALWAYS be the MD5 sum of the entire object. Many existing multipart ETags, however, look like this: 788a4bc54dd3eece8065d9f06045623d-18 [...]

Are you sure that s3cmd actually creates multiple parts in this case? That is, do you see s3cmd doing multiple uploads as follows (here, 5 MiB chunks are used)?

$ s3 s3cmd -c personal.cfg put 10m s3://denis/test --multipart-chunk-size-mb 5
upload: '10m' -> 's3://denis/test'  [part 1 of 2, 5MB] [1 of 1]
 5242880 of 5242880   100% in    1s     4.07 MB/s  done
upload: '10m' -> 's3://denis/test'  [part 2 of 2, 5MB] [1 of 1]
 5242880 of 5242880   100% in    1s     3.13 MB/s  done

I ask because it would surprise me quite a bit if there are multipart uploads whose ETag does not contain a dash. From the limited insight I gained while debugging this, it seems that this always holds true: all the multipart uploads I've generated have a dash in their ETag, and so do all the objects we lost.

I also found mention of this in different places; I would say it's a common expectation that this property holds across multipart uploads.

#38 Updated by Janek Bevendorff 2 months ago

Yes, it is uploading multiple parts and the resulting RADOS objects live in the multipart namespace, not the shadow namespace (verified via radosgw-admin object stat).

I did in fact assume that ALL multipart objects contain dashes, but now I know for sure that's not really the case. It still surprises me that some or most of our objects are singlepart to begin with, but their proportion is probably a lot smaller than anticipated.

#39 Updated by Denis Krienbühl 2 months ago

Janek Bevendorff wrote:

Yes, it is uploading multiple parts and the resulting RADOS objects live in the multipart namespace, not the shadow namespace (verified via radosgw-admin object stat).

Interesting. What do you see in the `radosgw-admin bucket radoslist --bucket=<bucket>` output? Assuming you can reproduce this in an empty bucket, or one that isn't too large.

We've been looking at the output of this command and there we see something like this:

3410862b-8e1e-48be-bf3c-e5e3bef3d757.17968943950.15__multipart_test.2~zfDbcllX30Yms5RQ0-L01G4Xg8UlzM-.1
3410862b-8e1e-48be-bf3c-e5e3bef3d757.17968943950.15__multipart_test.2~zfDbcllX30Yms5RQ0-L01G4Xg8UlzM-.2
3410862b-8e1e-48be-bf3c-e5e3bef3d757.17968943950.15__shadow_test.2~zfDbcllX30Yms5RQ0-L01G4Xg8UlzM-.1_1
3410862b-8e1e-48be-bf3c-e5e3bef3d757.17968943950.15__shadow_test.2~zfDbcllX30Yms5RQ0-L01G4Xg8UlzM-.2_1
3410862b-8e1e-48be-bf3c-e5e3bef3d757.17968943950.15_test

For the missing S3 objects, the __multipart files cannot be found using rados -p default.rgw.buckets.data stat <object_id>. If you can show me what your radoslist output looks like for the multiparts with the unexpected ETag, I might see something that invalidates my assumptions.


As an aside, we've been trying to catch the GC removing such a multipart object by looking for (refcount.put|rgw.obj_remove), but the only __multipart objects we can see being deleted are ones with a .meta suffix, which happens during upload, so I assume those are just throwaways. The actual __multipart objects seem to remain long after the S3 object has been deleted; my test multipart objects have not been GC'd, it seems.

That being said, if the theory holds that this is a GC issue, could we somehow catch the GC throwing away a __multipart object when it should not? I thought I might be able to do this by looking at rgw.obj_remove logs, but if I don't see those objects there, that seems unfeasible.

We are gathering lots of logs at the moment, but we need a way to catch this error as it happens. If there are any other ideas on how to do this, let me know :)

#40 Updated by Denis Krienbühl 2 months ago

Update: It seems like refcount.put is the call that logs the __multipart file about to be deleted.

#41 Updated by Janek Bevendorff 2 months ago

I created a bucket and uploaded two 1 GB multipart objects. Here's what the radoslist looks like:

6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2_WIDE-20110903143858-01167.warc.gz                                                
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.1 
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.10
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.11
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.12
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.13
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.14
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.15
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.16
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.17
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.18
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__multipart_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.19

...

6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.10_1
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.10_2
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.10_3
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.11_1
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.11_2
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.11_3
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.12_1
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.12_2
6112a2c2-2570-4c6a-8e57-709633df12eb.343478859.2__shadow_WIDE-20110903143858-01167.warc.gz.2~t5YVPyPcWGhoE0RlkoHxz-lnTk4VBiP.12_3

...

Same as yours. s3cmd info is this:

s3://reupload-test/WIDE-20110924065228-02544.warc.gz (object):
   File size: 1000039590
   Last mod:  Fri, 20 Nov 2020 11:24:06 GMT
   MIME type: application/gzip
   Storage:   STANDARD
   MD5 sum:   4a2244642bf656ae5c29d7859a81d4a4
   SSE:       none
   Policy:    none
   CORS:      none
   ACL:       webis: FULL_CONTROL

(The second file is the same, just with a different hash.)

#42 Updated by Denis Krienbühl 2 months ago

Thank you, I'm glad to see the __multipart files in there, as that's what I'm counting on.

And just in case someone finds this useful or interesting, this is the script with which we're scanning for missing files. It is run on a mon which has access to radosgw-admin.

https://gist.github.com/href/89c717c7bbaf9f87fd64e4346a6cc889

#43 Updated by Janek Bevendorff 2 months ago

Update: forget what I said. When I retrieve the ETag via boto3, I get 57ba90377073200f91b5dbee7aaece6-64 instead. I am pretty sure s3cmd's "MD5 sum" field used to be the ETag. I am confused.

#44 Updated by J. Eric Ivancich 2 months ago

If the multipart components are themselves larger than the striping size, they too can have shadow tail objects, as shown in Comment #39.

And when you upload via s3cmd with --disable-multipart, if the object is large enough, it will have shadow tail objects as well. Is that what people above are referring to as "single part"?

I believe we're aware of a case of missing shadow/tail objects when multipart is not involved.

Eric

#45 Updated by J. Eric Ivancich 2 months ago

We realize not everyone is able to do this, but for those who can, here is one approach we're using to try to nail this bug down.

If you have a bucket where things get uploaded and maybe read but rarely deleted, and if you're able to run all RGWs with --debug_ms=1, then we're looking for log entry lines that contain the bucket id and all ops associated with a) removing objects directly, b) decrementing refcount, which possibly results in deletion of a tail object, c) adding an object to GC, d) deferring an object from removal in GC if it's being read.

egrep '<bucket_id>.*(refcount\.put|rgw\.obj_remove|gc_set_entry|gc_defer_entry)' ...
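As a self-contained illustration of what that filter matches (the bucket id, log lines, and file path below are fabricated for the example; on a real cluster you would point this at the actual RGW log files and the real bucket id):

```shell
# Hypothetical bucket marker; on a real cluster take it from
# `radosgw-admin bucket stats --bucket=<name>` ("id" field).
BUCKET_ID="aaaabbbb-1111-2222-3333-ccccdddd.4444.1"

# Two fabricated debug_ms-style lines the filter should catch, plus one
# unrelated read on the same bucket that it should ignore.
cat > /tmp/rgw-sample.log <<EOF
osd_op(... ${BUCKET_ID}__shadow_big.bin.1_1 [call refcount.put in=44b] ...)
osd_op(... ${BUCKET_ID}__shadow_big.bin.1_1 [call rgw.gc_set_entry] ...)
osd_op(... ${BUCKET_ID}_small.txt [read 0~4096] ...)
EOF

egrep "${BUCKET_ID}.*(refcount\.put|rgw\.obj_remove|gc_set_entry|gc_defer_entry)" \
    /tmp/rgw-sample.log   # matches the first two lines only
```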

The challenge is various aspects of the story may come through different RGWs. So for example, something may be added for gc via one RGW and the refcount decrements through another.

Thanks to everyone who's lending a hand.

Eric

#46 Updated by J. Eric Ivancich 2 months ago

Just to clarify on why this is helpful: "If you have a bucket where things get uploaded and maybe read but rarely deleted"....

We're trying to get away from user-initiated deletes.

On the other hand, for those that are experiencing losses that they can readily identify, then having logs with --debug_ms=1 surrounding the loss would help us independently.

#47 Updated by Denis Krienbühl 2 months ago

Eric Ivancich wrote:

If you have a bucket where things get uploaded and maybe read but rarely deleted, and if you're able to run all RGWs with --debug_ms=1, then we're looking for log entry lines that contain the bucket id and all ops associated with a) removing objects directly, b) decrementing refcount, which possibly results in deletion of a tail object, c) adding an object to GC, d) deferring an object from removal in GC if it's being read.

egrep '<bucket_id>.*(refcount\.put|rgw\.obj_remove|gc_set_entry|gc_defer_entry)' ...

We are currently running all the RGWs of the affected zone with `debug_ms=1`, producing north of 4k lines per second. Now we are trying to find an object that was lost after enabling the debug logs - with no luck yet.

Since we haven't discovered any such objects, I'm not sure how useful these logs would be for you at the moment, but if you simply want the debug logs of a run-of-the-mill bucket I could create one for you, use it in any pattern you'd like and send you the associated logs afterwards.

#48 Updated by J. Eric Ivancich 2 months ago

Denis Krienbühl wrote:

Eric Ivancich wrote:

If you have a bucket where things get uploaded and maybe read but rarely deleted, and if you're able to run all RGWs with --debug_ms=1, then we're looking for log entry lines that contain the bucket id and all ops associated with a) removing objects directly, b) decrementing refcount, which possibly results in deletion of a tail object, c) adding an object to GC, d) deferring an object from removal in GC if it's being read.

egrep '<bucket_id>.*(refcount\.put|rgw\.obj_remove|gc_set_entry|gc_defer_entry)' ...

We are currently running all the RGWs of the affected zone with `debug_ms=1`, producing north of 4k lines per second. Now we are trying to find an object that was lost after enabling the debug logs - with no luck yet.

Since we haven't discovered any such objects, I'm not sure how useful these logs would be for you at the moment, but if you simply want the debug logs of a run-of-the-mill bucket I could create one for you, use it in any pattern you'd like and send you the associated logs afterwards.

Yeah, doubtful it's helpful to examine logs for a run-of-the-mill bucket. I think the two conditions we're looking for are:

a. if you have a bucket that doesn't get deletes then using the egrep command above will help us find the interesting events quickly.

b. if you know of a case of an object that's damaged and we have logs surrounding that event.

We'll try to see if we can find more angles to hit this from.

Thank you for the help!

#49 Updated by Denis Krienbühl 2 months ago

Eric Ivancich wrote:

a. if you have a bucket that doesn't get deletes then using the egrep command above will help us find the interesting events quickly.

b. if you know of a case of an object that's damaged and we have logs surrounding that event.

Just so I don't get this wrong, you want both of those conditions to be true, right?

That is probably not too likely to happen, but that being said the second bucket where this did occur is one such bucket, and we are keeping those buckets under close observation, so maybe we get lucky and we can capture this in action.

Let me know if we can help you with anything else. This has our full attention.

#50 Updated by J. Eric Ivancich 2 months ago

Sorry to update again, but I want to cover both newer and older gc implementations. The egrep discussed above is now:

egrep '<bucket_id>.*(refcount\.put|rgw\.obj_remove|gc_set_entry|gc_defer_entry|rgw_gc_queue_enqueue|rgw_gc_queue_update_entry)' ...

Thanks,

Eric

#51 Updated by J. Eric Ivancich 2 months ago

Denis Krienbühl wrote:

Eric Ivancich wrote:

a. if you have a bucket that doesn't get deletes then using the egrep command above will help us find the interesting events quickly.

b. if you know of a case of an object that's damaged and we have logs surrounding that event.

Just so I don't get this wrong, you want both of those conditions to be true, right?

That is probably not too likely to happen, but that being said the second bucket where this did occur is one such bucket, and we are keeping those buckets under close observation, so maybe we get lucky and we can capture this in action.

Let me know if we can help you with anything else. This has our full attention.

The a) case is dependent on a bucket in which removes don't happen (or are incredibly rare), otherwise there will be too many false positives. But if you have such a bucket, it might find these cases early.

The b) case is the more general one that doesn't need a no-delete bucket. But if you find the case of a damaged object then we might have days worth of logs to see the events that led to it.

So a) and b) are independent but a) is more highly constrained and not everyone will have such a bucket handy.

Eric

#52 Updated by Denis Krienbühl 2 months ago

I see, I'll try to run a search for case "a" next week. We have plenty of access logs going back a few months detailing all HTTP requests against all of our buckets. I might be able to find a bucket that receives few or no DELETE requests.

#53 Updated by J. Eric Ivancich 2 months ago

Based on a review of code paths, I have a theory as to why this happens. There's an immediate configuration change that, if this theory is correct, would mitigate the data loss.

RGW has a garbage collection (GC) deferral process. It is meant to handle cases of:

1. Read of large object begins
2. Delete issued for large object
3. Read of large object continues

When reads continue beyond 1/2 of the time set in the configuration option rgw_gc_obj_min_wait, a GC defer operation is sent to prevent that object from being GC'd while reads continue.

The GC code was substantially updated for Octopus to use a more efficient back end. The theory is that it introduced a bug such that when GC is deferred, it can add a GC entry for an object that was never deleted. And that deletion will then occur at a later time as the GC queue is processed.

So the quick work-around is to increase rgw_gc_obj_min_wait from a default of 2 hours to a higher value? 5 hours? 10 hours? The choice is yours.

If you set it to 5 hours, then a read would have to exceed 2.5 hours (half the configured time) to trigger deferment and the theorized bug.
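The trigger condition is simple arithmetic on the configured wait (a sketch; 18000 s is the 5-hour option mentioned above):

```shell
# rgw_gc_obj_min_wait is configured in seconds; the defer path fires once a
# read runs longer than half of it.
rgw_gc_obj_min_wait=18000                       # 5 hours
defer_threshold=$(( rgw_gc_obj_min_wait / 2 ))  # 9000 s, i.e. 2.5 hours
echo "reads longer than ${defer_threshold}s can trigger the GC defer path"
```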

Although this is currently only a theory and no reproducer has yet been created, it seems to fit the various descriptions of the bug:

1. Tail objects are deleted but head objects are not. That's the behavior of GC.
2. No user-initiated removal of the object occurs; we have cases where a bucket in which deletes are exceedingly rare experience the data loss.
3. The bug occurs rarely; again it's triggered by particularly slow reads. With default config options, the read would have to exceed 1 hour.
4. Someone above thought there was a correlation with network issues; network issues could have slowed down reads, thus triggering the bug.
5. The bug seems to have started when users upgraded to Octopus, which corresponds to this code change.

We're currently working on a reproducer to verify this theorized bug along with bug-fixes.

If there is any additional evidence, corroborating or invalidating, please post.

Thanks,

Eric

#54 Updated by Denis Krienbühl 2 months ago

Your theory sounds plausible to me. Though if I understand correctly, we might be able to verify this by looking at our request logs, right? I would expect to find a GET request that lasted longer than 1 hour for the objects in question.

If you think that would corroborate your theory let me know, I should be able to get that information.

In any case, I'm still discussing this with my team, but I assume we'll try your suggested workaround.

#55 Updated by J. Eric Ivancich 2 months ago

Denis Krienbühl wrote:

Your theory sounds plausible to me. Though if I understand correctly we might be able to verify this by looking at our requests log, right? I would expect to find a GET request that lasted longer than 1 hour for the objects in question.

If you think that would corroborate your theory let me know, I should be able to get that information.

Yes, that could confirm.

Thank you,

Eric

#56 Updated by J. Eric Ivancich 2 months ago

IMPORTANT

rgw_gc_obj_min_wait is set in seconds.
So 2 hours, the default, is 7200 seconds.
5 hours is 18000 seconds.
And 10 hours is 36000 seconds.
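For reference, one way to apply the larger value (the `ceph config set` form assumes the centralized config database; adjust for however your deployment manages configuration):

```shell
# 5 hours = 18000 seconds; targets all RGW daemons.
ceph config set client.rgw rgw_gc_obj_min_wait 18000

# ceph.conf alternative (per daemon, requires restarting that radosgw):
#   [client.rgw.node1-1]
#   rgw_gc_obj_min_wait = 18000
```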

#57 Updated by J. Eric Ivancich 2 months ago

  • Status changed from Triaged to Fix Under Review
  • Assignee set to J. Eric Ivancich
  • Target version set to v16.0.0
  • Backport set to octopus
  • Severity changed from 2 - major to 1 - critical
  • Pull request ID set to 38228

#58 Updated by J. Eric Ivancich 2 months ago

Please note that there is now a PR attached to this Tracker. I still have not been able to reproduce, as using my command-line tools I've not yet been able to slow down a read object operation to create the conditions that would seem to produce this data loss.

This PR is not a full fix. If the theory is correct, it should prevent the data loss, which is the most pressing issue. But it might create orphans in similar situations: tail objects that ultimately should be removed but are not. More specifically, if an object that the user has requested be removed is on the GC queue and is then read slowly, the GC entry might be lost, so its tail objects are never deleted.

We do have the `rgw-orphan-list` tool that you'll be able to use to identify them.

And discussions are underway at providing a more complete solution.

I continue to try to reproduce and will report what I find.

Eric

#59 Updated by Denis Krienbühl 2 months ago

Your theory aligns perfectly with our request logs. Here are the longest measured GET requests for all of the S3 objects that went missing:

1:00:53.887000 
1:01:12.880000 
1:04:42.373000 
1:05:27.778000 
1:11:46.220000 
1:24:54.102000 
2:08:10.981000 
2:10:46.777000 
2:14:52.989000 
2:23:11.656000 
2:29:08.968000 
2:29:35.797000 
2:36:49.871000 
2:37:54.386000 
2:52:56.228000 
2:59:15.081000 
3:25:31.692000 
3:25:32.937000 
3:35:50.415000 
4:17:14.685000 
4:21:26.926000 
4:27:04.474000 
4:47:05.590000 
5:19:33.749000 
6:51:13.892000 
7:31:06.271000 

As you can see, the shortest request is just a bit longer than an hour, and the longest took a whopping 7½ hours.

As a result we will probably increase rgw_gc_obj_min_wait to an even higher value, as we recently upgraded our storage and have plenty to spare.

#60 Updated by J. Eric Ivancich 2 months ago

Denis Krienbühl wrote:

Your theory aligns perfectly with our request logs. Here are the longest measured GET requests for all of the S3 objects that went missing:

[...]

As you can see, the shortest request is just a bit longer than an hour, and the longest took a whopping 7½ hours.

As a result we will probably increase rgw_gc_obj_min_wait to an even higher value, as we recently upgraded our storage and have plenty to spare.

Wow! That's very helpful. Thank you.

Do you have the means to rebuild your RGWs with the PR attached to this tracker? If so, that would be an option. Depending on the number of deletes your system sees, the larger rgw_gc_obj_min_wait value might cause the GC log to hit the end of its allotted space, in which case GC won't be used and objects will be deleted straight away at a performance cost (noticeable but not significant, I'm told).

Eric

#61 Updated by Denis Krienbühl 2 months ago

Eric Ivancich wrote:

Do you have the means to rebuild your RGWs with the PR attached to this tracker? If so that would be an option. Depending on the number of deletes your system sees the larger rgw_gc_obj_min_wait value might cause the GC log to hit the end of its allotted space in which case, GC won't be used and objects will be deleted straight away at a performance cost (noticeable but not significant, I'm told).

If the GC won't be used, would there be any other problems other than just performance? As in, is our cluster still safe when that happens? Because we can live with a performance hit over the weekend and look into compiling our own build next week.

In any case, thank you for your continued help. We are grateful to have a likely root cause and workaround at hand!

#62 Updated by Gaudenz Steinlin 2 months ago

I made some tests in our lab and can confirm the theory. I was able to reproduce this bug. I used the following settings to speed things up:

debug rgw = 20
debug ms = 1
rgw gc obj min wait = 300
rgw gc processor period = 150

I uploaded a multipart object and another object that was large enough to have tail "shadow" objects without being multipart, and slowly downloaded them using curl, restricting the download rate to 10 KB/s with "trickle" (https://github.com/mariusae/trickle). After a while the shadow and multipart objects showed up in the GC list and were purged by the next GC run after they "expired". With the settings above it took around 10 minutes for them to get deleted.
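The recipe above can be sketched as follows (endpoint, bucket, and object names are hypothetical; curl's --limit-rate is used here in place of trickle):

```shell
# With rgw_gc_obj_min_wait=300, any read slower than ~150 s triggers the defer.

# Upload a non-multipart object large enough to have shadow tail objects:
s3cmd --disable-multipart put big-object.bin s3://repro-bucket/

# Download it at ~10 KB/s so the read outlasts the defer threshold:
curl --limit-rate 10k -o /dev/null \
    "http://rgw.example.com:7480/repro-bucket/big-object.bin"

# Watch for the tail objects wrongly showing up in the GC queue:
radosgw-admin gc list --include-all
```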

I also verified that objects without tail objects are not affected by this bug.

If you are still interested in logs, I can probably provide them or look through them for any further hints on what exactly happens. But I would have to clean them a bit before I can attach them to the bug report. Please just tell me if you still need them.

P.S.: I'm a co-worker of Denis.

#63 Updated by J. Eric Ivancich 2 months ago

Gaudenz Steinlin wrote:

I made some tests in our lab and can confirm the theory. I was able to reproduce this bug. I used the following settings to speed things up:
[...]

I uploaded a multipart object and another object that was large enough to have tail "shadow" objects without being multipart, and slowly downloaded them using curl, restricting the download rate to 10 KB/s with "trickle" (https://github.com/mariusae/trickle). After a while the shadow and multipart objects showed up in the GC list and were purged by the next GC run after they "expired". With the settings above it took around 10 minutes for them to get deleted.

I also verified that objects without tail objects are not affected by this bug.

If you are still interested in logs, I can probably provide them or look through them for any further hints on what exactly happens. But I would have to clean them a bit before I can attach them to the bug report. Please just tell me if you still need them.

P.S.: I'm a co-worker of Denis.

Thank you, Gaudenz. That's further confirmation we have the root cause.

I don't think additional logs would be helpful at this point. But I do appreciate the insights you and Denis have brought to the table and your incredible willingness to lend a hand.

Eric

#64 Updated by J. Eric Ivancich 2 months ago

Denis Krienbühl wrote:

Eric Ivancich wrote:

Do you have the means to rebuild your RGWs with the PR attached to this tracker? If so that would be an option. Depending on the number of deletes your system sees the larger rgw_gc_obj_min_wait value might cause the GC log to hit the end of its allotted space in which case, GC won't be used and objects will be deleted straight away at a performance cost (noticeable but not significant, I'm told).

If the GC won't be used, would there be any other problems rather than just performance? As in, is our cluster still safe when that happens? Because we can live with a performance hit for the weekend and look into compiling our own build next week.

I am not aware of any reason your cluster would be unsafe. Immediate deletion of tail objects would also make the corner case addressed by GC deferral not possible (1. start read, 2. remove object, 3. continue read). But generally speaking, most applications and users do not expect continued access to an object that's been deleted.

In any case, thank you for your continued help. We are grateful to have a likely root cause and workaround at hand!

You're very welcome and I appreciate all the help!

Eric

#65 Updated by Julien Demais 2 months ago

This actually makes a lot of sense.
Here are what we have set in our config:
  • rgw_gc_obj_min_wait: 300
  • rgw_gc_processor_max_time: 600
  • rgw_gc_processor_period: 600

I found those settings there: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021864.html
This was because I was testing content load performance and noticed that when deleting a bucket (without bypassing the GC with radosgw-admin), the objects were still there for a very long time and I ended up filling up my cluster.

In our case, some of the nodes have network issues on boot-up, so it is very likely that some reads took more than 150 seconds...
Nice work narrowing this down!

This leads to my next question: is there a way to get information on how to fine-tune the RGW garbage collector? Those settings are pretty "blurry" to me.

#66 Updated by J. Eric Ivancich 2 months ago

  • Status changed from Fix Under Review to Pending Backport

#67 Updated by Pritha Srivastava about 2 months ago

If affected users have increased the value of rgw_gc_obj_min_wait, then before restoring the original expiration time once the fixed build is available, it is recommended to clear the GC queue manually by running `radosgw-admin gc process --include-all`. This may be needed because the older entries can have a later expiration time than newer ones (once the original value is restored), which can delay the removal of expired entries and lead to a full queue. Running the manual GC process flushes out the old entries. Please note that this removes GC entries before their expiration time.
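Put as a sequence (a sketch; the `ceph config set` form is an assumption, adjust to however your deployment manages configuration):

```shell
# 1. First upgrade the RGWs to a build containing the fix.
# 2. Flush the GC queue; note this removes entries before their expiration time:
radosgw-admin gc process --include-all
# 3. Only then restore the default wait (2 hours = 7200 seconds):
ceph config set client.rgw rgw_gc_obj_min_wait 7200
```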

#68 Updated by Denis Krienbühl about 2 months ago

Pritha Srivastava wrote:

Please note that this will remove the gc entries before their expiration time.

What does this mean for us? We did increase rgw_gc_obj_min_wait and we will have to get back to the default once we update: Would that trigger the bug and therefore cause more data-loss, if we are not careful? That is, do we have to look out for requests that take longer than 1 hour, making sure that we don't have any of those before reverting?

#69 Updated by Pritha Srivastava about 2 months ago

Denis Krienbühl wrote:

Pritha Srivastava wrote:

Please note that this will remove the gc entries before their expiration time.

What does this mean for us? We did increase rgw_gc_obj_min_wait and we will have to get back to the default once we update: Would that trigger the bug and therefore cause more data-loss, if we are not careful? That is, do we have to look out for requests that take longer than 1 hour, making sure that we don't have any of those before reverting?

This will not cause any more data loss. What it means is that if objects in GC are to expire and hence get deleted x seconds from now, and a radosgw-admin gc process --include-all is run, then all GC entries will be removed from the system now (which might be before their expiration time). With the fix in place, requests that take longer than one hour won't cause entries to be added to GC at all.

#70 Updated by Denis Krienbühl about 2 months ago

Pritha Srivastava wrote:

Denis Krienbühl wrote:

Pritha Srivastava wrote:

Please note that this will remove the gc entries before their expiration time.

What does this mean for us? We did increase rgw_gc_obj_min_wait and we will have to get back to the default once we update: Would that trigger the bug and therefore cause more data-loss, if we are not careful? That is, do we have to look out for requests that take longer than 1 hour, making sure that we don't have any of those before reverting?

This will not cause any more data loss. What it means is that if objects in GC are to expire and hence get deleted x seconds from now, and a radosgw-admin gc process --include-all is run, then all GC entries will be removed from the system now (which might be before their expiration time). With the fix in place, requests that take longer than one hour won't cause entries to be added to GC at all.

Ah so the best way of doing the update would be to roll out the new version, then change the configuration, right?

#71 Updated by Pritha Srivastava about 2 months ago

Denis Krienbühl wrote:

Pritha Srivastava wrote:

Denis Krienbühl wrote:

Pritha Srivastava wrote:

Please note that this will remove the gc entries before their expiration time.

What does this mean for us? We did increase rgw_gc_obj_min_wait and we will have to get back to the default once we update: Would that trigger the bug and therefore cause more data-loss, if we are not careful? That is, do we have to look out for requests that take longer than 1 hour, making sure that we don't have any of those before reverting?

This will not cause any more data loss. What it means is that if objects in GC are to expire and hence get deleted x seconds from now, and a radosgw-admin gc process --include-all is run, then all GC entries will be removed from the system now (which might be before their expiration time). With the fix in place, requests that take longer than one hour won't cause entries to be added to GC at all.

Ah so the best way of doing the update would be to roll out the new version, then change the configuration, right?

Yes.

#72 Updated by J. Eric Ivancich about 2 months ago

  • Precedes Bug #48330: rgw: fix GC defer process and clean up from immediate fix added

#73 Updated by J. Eric Ivancich about 2 months ago

I exchanged email with Pritha, who is the true expert on modern GC in RGW, to clarify my own questions. I'm pasting that Q&A here for everyone's benefit.

==== Paste Below ====

Hi Pritha,

Just so I'm clear, assume the user:

1. makes rgw_gc_obj_min_wait big (e.g., 10 hours)
2. runs for a while
3. applies the fix
4. manually runs `radosgw-admin gc process --include-all ...`
5. returns rgw_gc_obj_min_wait to the original value (2 hours)
6. objects are deleted as part of the normal running

Now...

If we skip step 4 in the above:
a. the objects deleted in step 6 in the first 10 hours after step 5 might
take 10 hours to be gc'd rather than the expected 2 hours.

Yes

b. if the GC queue is full, objects would be deleted in-line rather than
GC'd, which might come with a performance hit.
i. and that may have already happened during step 2 anyway

Yes

c. but otherwise, there would be no problems.

Yes, there would be no problems otherwise.

Is that understanding correct?

By doing the manual GC in step 4 we simply allow the objects deleted in
step 6 to only remain on the GC queue for 2 hours. And it would be
necessary to use the "--include-all" flag when running manual GC, right?

Yes. It is basically a means to start afresh after restoring the original
expiration time because the elements of the queue should have increasing
expiration times. Hence I have mentioned this step as a recommendation.

Eric

Thanks,
Pritha

#74 Updated by J. Eric Ivancich about 2 months ago

  • Copied to Backport #48331: octopus: Object not found on healthy cluster added

#75 Updated by Denis Krienbühl about 2 months ago

Thank you for sharing that insight with us Pritha and Eric, it clarifies a lot.

#76 Updated by Janek Bevendorff about 2 months ago

Is there a roadmap for when the fix will be available? Luckily, we haven't lost any more objects so far (not sure if that is only due to the config workaround), but I'd like to have a safe solution.

#77 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#78 Updated by J. Eric Ivancich about 2 months ago

Janek Bevendorff wrote:

Is there a roadmap when the fix will be available? Luckily, we haven't lost any more objects so far (not sure if that is only due to the config workaround), but I'd like to have a safe solution.

According to Yuri Weinstein, it looks like we're on schedule to release Octopus 15.2.7 early next week after it's undergone some final testing.

Eric

Also available in: Atom PDF