Project

General

Profile

Actions

Bug #10885

closed

rgw timeout mechanism might delete original object if timeout happens

Added by Zhi Zhang about 9 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When rados_osd_op_timeout is set and happened even after retrying, this request's is_complete flag is not true. So when rgw thread deconstructs object processor, it seems to try to delete those objects in objs list.

For updating/overwriting an object, if above situation happens, the old version of this object might be deleted. And although rgw will return 408 to the user, if user doesn't retry later, user may not get this object's original version.

This is not easy to simulate, but from below example, we may get some clues.

1. When updating an object, timeout happens at rgw.bucket_prepare_op:

2015-02-14 05:29:43.976099 7fe9f08c0700 20 get_obj_state: setting s->obj_tag to default.385873.38
2015-02-14 05:29:43.976117 7fe9f08c0700 10 setting object write_tag=default.385873.39
2015-02-14 05:29:43.976208 7fe9f08c0700 10 librados: call oid=.dir.default.340307.3 nspace=
2015-02-14 05:29:43.976213 7fe9f08c0700 10 client.385873.objecter _op_submit op 0x7feae82e93d0
2015-02-14 05:29:43.976224 7fe9f08c0700 10 client.385873.objecter _calc_target pgid 5.7e46dce1 acting [32,133,65]
2015-02-14 05:29:43.976228 7fe9f08c0700 20 client.385873.objecter _get_session s=0x7feb000434f0 osd=32 3
2015-02-14 05:29:43.976229 7fe9f08c0700 20 client.385873.objecter note: not requesting ack
2015-02-14 05:29:43.976230 7fe9f08c0700 10 client.385873.objecter _op_submit oid .dir.default.340307.3 @5 @5 [call rgw.bucket_prepare_op] tid 0 osd.32
2015-02-14 05:29:43.976234 7fe9f08c0700 20 client.385873.objecter get_session s=0x7feb000434f0 osd=32 3
2015-02-14 05:29:43.976236 7fe9f08c0700 15 client.385873.objecter _session_op_assign 32 3214
2015-02-14 05:29:43.976237 7fe9f08c0700 15 client.385873.objecter _send_op 3214 to osd.32
2015-02-14 05:29:43.976238 7fe9f08c0700 1 – 10.193.244.202:0/1018651 --> 10.193.105.3:6817/93722 – osd_op(client.385873.0:3214 .dir.default.340307.3 [call rgw.bucket_prepare_op] 5.7e46dce1 ondisk+write+known_if_redirected e33731) v4 – ?+0 0x7feae830c510 con 0x7feb00039600
2015-02-14 05:29:43.976249 7fe9f08c0700 20 client.385873.objecter put_session s=0x7feb000434f0 osd=32 4
2015-02-14 05:29:43.976250 7fe9f08c0700 5 client.385873.objecter 2 unacked, 3 uncommitted
... (5 sec timeout)
2015-02-14 05:29:48.976442 7fe9f08c0700 10 librados: Objecter returned from call r=-110

2. Even retrying still doesn't work at this moment:

015-02-14 05:29:48.976491 7fe9f08c0700 20 get_obj_state: rctx=0x7fe9f08bf680 obj=zhibucket5:testf1 state=0x7feae82f1918 s->prefetch_data=0
2015-02-14 05:29:48.976540 7fe9f08c0700 10 setting object write_tag=default.385873.39_3kcYhc5ES-hB7Z901n2utL9-4u_gkU1
2015-02-14 05:29:48.976567 7fe9f08c0700 10 librados: call oid=.dir.default.340307.3 nspace=
2015-02-14 05:29:48.976576 7fe9f08c0700 10 client.385873.objecter _op_submit op 0x7feae82e94a0
2015-02-14 05:29:48.976601 7fe9f08c0700 10 client.385873.objecter _calc_target pgid 5.7e46dce1 acting [32,133,65]
2015-02-14 05:29:48.976607 7fe9f08c0700 20 client.385873.objecter _get_session s=0x7feb000434f0 osd=32 2
2015-02-14 05:29:48.976610 7fe9f08c0700 20 client.385873.objecter note: not requesting ack
2015-02-14 05:29:48.976612 7fe9f08c0700 10 client.385873.objecter _op_submit oid .dir.default.340307.3 @5 @5 [call rgw.bucket_prepare_op] tid 0 osd.32
2015-02-14 05:29:48.976622 7fe9f08c0700 20 client.385873.objecter get_session s=0x7feb000434f0 osd=32 2
2015-02-14 05:29:48.976626 7fe9f08c0700 15 client.385873.objecter _session_op_assign 32 3215
2015-02-14 05:29:48.976628 7fe9f08c0700 15 client.385873.objecter _send_op 3215 to osd.32
2015-02-14 05:29:48.976631 7fe9f08c0700 1 – 10.193.244.202:0/1018651 --> 10.193.105.3:6817/93722 – osd_op(client.385873.0:3215 .dir.default.340307.3 [call rgw.bucket_prepare_op] 5.7e46dce1 ondisk+write+known_if_redirected e33731) v4 – ?+0 0x7feae82e9b60 con 0x7febc0139880
2015-02-14 05:29:48.976656 7fe9f08c0700 20 client.385873.objecter put_session s=0x7feb000434f0 osd=32 3
2015-02-14 05:29:48.976659 7fe9f08c0700 5 client.385873.objecter 2 unacked, 4 uncommitted
... (5 sec timeout)
2015-02-14 05:29:53.976848 7fe9f08c0700 10 librados: Objecter returned from call r=-110

3. So rgw thread can't complete it with return code -110 and tries to deconstruct ~RGWPutObjProcessor.

From my below log, it fails to delete this object because I kill the OSD daemon and osdmap hasn't been updated yet. But if this OSD daemon was something wrong at #1 and #2, it comes back to normal at #3 now, will this object be deleted?

2015-02-14 05:29:53.976869 7fe9f08c0700 0 WARNING: failed to remove obj (zhibucket5:testf1), leaked
2015-02-14 05:29:53.976931 7fe9f08c0700 2 req 39:10.040499:s3:PUT /zhibucket5/testf1:put_obj:http status=408
2015-02-14 05:29:53.976945 7fe9f08c0700 1 ====== req done req=0x7feaf80401e0 http_status=408 ======
2015-02-14 05:29:53.976964 7fe9f08c0700 20 process_request() returned -110

Of course, if user will retry after getting 408 before other operations, this may not be a big issue.

Actions #1

Updated by Yehuda Sadeh about 9 years ago

You shouldn't set objecter timeouts when using rgw. In any case the object that is removed is the tail object, not the head object. However, with timeout set you might end up with object that has head but not tail, or that doesn't match the corresponding bucket index entry.

Actions #2

Updated by Zhi Zhang about 9 years ago

Hi Yehuda,

Thanks for your reply.

Even if we don't set objecter timeouts when using rgw, we still can't just simply add a timeout for the conditional wait in librados IoCtxImpl operate, because the same issue as you mentioned above would also happen. Is my understanding correct?

Actions #3

Updated by Loïc Dachary almost 9 years ago

  • Project changed from Ceph to rgw
  • Category deleted (22)
  • Regression set to No
Actions #4

Updated by Sage Weil almost 9 years ago

  • Status changed from New to Need More Info

can you clarify?

Actions #5

Updated by Casey Bodley about 4 years ago

  • Status changed from Need More Info to Closed
Actions

Also available in: Atom PDF