Bug #65251

open

rgw: crash in lc while transitioning to cloud

Added by Soumya Koduri about 1 month ago. Updated about 1 month ago.

Status:
Pending Backport
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
lifecycle cloud-transition backport_processed
Backport:
squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This bug tracks one of the issues reported with the LC process: a crash in the cloud-transition code path (see https://tracker.ceph.com/issues/64571#note-9):

#6 RGWHTTPClient::set_send_length (len=4036, this=0x0)
at /root/ceph/src/rgw/rgw_http_client.h:128
#7 RGWLCCloudStreamPut::send_ready (
this=this@entry=0x55555837b260, dpp=0x5555583e0460,
rest_obj=...)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:660
#8 0x0000555556befef5 in cloud_tier_transfer_object (
dpp=<optimized out>, readf=0x555558437b80,
writef=0x55555837b260)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:728
#9 0x0000555556bf04e7 in cloud_tier_plain_transfer (
tier_ctx=...)
at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#10 0x0000555556bf5a9b in rgw_cloud_tier_transfer_object
(tier_ctx=...,
cloud_targets=std::set with 1 element = {...})
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#11 0x00005555569eed13 in rgw::sal::RadosObject::transition_to_cloud (this=0x55555fa84000,
bucket=0x55555d5db100, tier=0x55555d7ca340, o=...,
cloud_targets=std::set with 1 element = {...},
cct=0x5555583d6000, update_object=true,
dpp=<optimized out>, y=...)
at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1766
#12 0x0000555556b1f61c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1396
#13 0x0000555556b1fed1 in LCOpAction_Transition::process
(this=0x55555fa0ffc0, oc=...)
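
Frame #6 above shows set_send_length being called with this=0x0, i.e. through a null client pointer. A minimal sketch of the kind of guard that would turn that into an error return is shown here. HttpClientSketch and send_ready_checked are hypothetical names used only for illustration; this is not the Ceph code and not necessarily how the actual fix works.

```cpp
#include <cerrno>
#include <cstddef>

// Hypothetical stand-in for an HTTP client object; not the real RGWHTTPClient.
struct HttpClientSketch {
  std::size_t send_len = 0;
  void set_send_length(std::size_t len) { send_len = len; }
};

// Hypothetical helper: refuse to proceed when no client was created
// (for example because endpoint selection failed earlier), instead of
// dereferencing a null pointer.
int send_ready_checked(HttpClientSketch* client, std::size_t len) {
  if (client == nullptr) {
    return -EINVAL;  // report the failure to the caller
  }
  client->set_send_length(len);
  return 0;
}
```

With a check like this, a failed connection setup would surface as an error code in the transition path rather than as a crash.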

2024-04-01T23:06:01.345+0530 7f7d0562a6c0 20 endpoint url=http://localhost:8001 last endpoint status update time=1.71199e+09 diff=0.00030184
2024-04-01T23:06:01.345+0530 7f7d0562a6c0 5 ERROR: no valid endpoint
2024-04-01T23:06:01.345+0530 7f7d04e296c0 20 endpoint url=http://localhost:8001 last endpoint status update time=1.71199e+09 diff=0.000209596
2024-04-01T23:06:01.345+0530 7f7d04e296c0 5 ERROR: no valid endpoint

Even though a valid cloud endpoint URL is configured, put_obj_send_init->get_url()->get_url(endpoint) is returning nullptr. This seems to have been caused by the code below, which was recently added in https://github.com/ceph/ceph/commit/e200499bb3c5703862b92a4d7fb534d98601f1bf

static constexpr uint32_t CONN_STATUS_EXPIRE_SECS = 2;
if (diff >= CONN_STATUS_EXPIRE_SECS) {
  endpoints_status[endpoint].store(ceph::real_clock::zero());
  ldout(cct, 10) << "endpoint " << endpoint << " unconnectable status expired. mark it connectable" << dendl;
  break;
}
num++;
};
if (num == endpoints.size()) {
  ldout(cct, 5) << "ERROR: no valid endpoint" << dendl;
  return -EINVAL;
}
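
The effect of the expiry check with a single endpoint can be reproduced with a simplified model. The sketch below is an assumption-laden stand-in, not the Ceph implementation: EndpointPickerSketch, mark_unconnectable and the use of plain seconds for time are all hypothetical, and only the CONN_STATUS_EXPIRE_SECS comparison and the "all endpoints skipped means -EINVAL" rule are taken from the excerpt above.

```cpp
#include <cerrno>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model of the endpoint selection quoted above.
// Time is a plain double in seconds; an endpoint absent from `failed_at`
// is considered connectable.
struct EndpointPickerSketch {
  static constexpr double CONN_STATUS_EXPIRE_SECS = 2;

  std::vector<std::string> endpoints;
  std::map<std::string, double> failed_at;  // when each endpoint was marked unconnectable
  std::size_t counter = 0;

  void mark_unconnectable(const std::string& ep, double now) {
    failed_at[ep] = now;
  }

  // Returns 0 and sets `out` on success, or -EINVAL when every endpoint
  // is still inside its unconnectable window.
  int get_url(double now, std::string& out) {
    std::size_t num = 0;
    while (num < endpoints.size()) {
      const std::string& ep = endpoints[counter++ % endpoints.size()];
      auto it = failed_at.find(ep);
      if (it == failed_at.end()) {
        out = ep;  // never marked unconnectable
        return 0;
      }
      const double diff = now - it->second;
      if (diff >= CONN_STATUS_EXPIRE_SECS) {
        failed_at.erase(it);  // status expired, mark it connectable again
        out = ep;
        return 0;
      }
      ++num;  // still unconnectable, try the next endpoint
    }
    return -EINVAL;  // no valid endpoint
  }
};
```

In this model, a picker with the single endpoint http://localhost:8001 that was marked unconnectable a fraction of a millisecond earlier returns -EINVAL, and the same call succeeds once two seconds have passed. A caller that ignores that error and goes on to use the connection would hit the null dereference seen in the backtrace.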

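This seems to be a regression caused by https://github.com/ceph/ceph/commit/e200499bb3c5703862b92a4d7fb534d98601f1bf. In the log above, diff is about 0.0003 seconds, well below CONN_STATUS_EXPIRE_SECS, so the only configured endpoint is skipped and the "no valid endpoint" error path is taken.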


Related issues 1 (0 open, 1 closed)

Copied to rgw - Backport #65351: squid: rgw: crash in lc while transitioning to cloud (Resolved, Soumya Koduri)
Actions #1

Updated by Soumya Koduri about 1 month ago

  • Status changed from New to In Progress
Actions #2

Updated by Soumya Koduri about 1 month ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 56657
Actions #3

Updated by Casey Bodley about 1 month ago

  • Status changed from Fix Under Review to Pending Backport
  • Tags set to lifecycle cloud-transition
  • Backport set to squid
Actions #4

Updated by Backport Bot about 1 month ago

  • Copied to Backport #65351: squid: rgw: crash in lc while transitioning to cloud added
Actions #5

Updated by Backport Bot about 1 month ago

  • Tags changed from lifecycle cloud-transition to lifecycle cloud-transition backport_processed