Bug #48983 (closed)

radosgw not working - upgraded from mimic to octopus

Added by YOUZHONG YANG about 3 years ago. Updated almost 3 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Regression:
No
Severity:
3 - minor

Description

I upgraded our ceph cluster (6 bare metal nodes, 3 rgw VMs) from v13.2.4 to v15.2.8. The mon, mgr, mds and osd daemons were all upgraded successfully; everything looked good.

After the radosgw daemons were upgraded, they refused to work; the log messages are at the end of this report.

Here are the things I have tried:

1. I moved aside the pools for the rgw service and started from scratch (creating realm, zonegroup, zone, users), but when I tried to run 'radosgw-admin user create ...' it appeared to be stuck and never returned; other commands like 'radosgw-admin period update --commit' also got stuck.

2. I rolled back radosgw to the old version v13.2.4, and everything worked great again.

What am I missing here? Is there anything extra that needs to be done for rgw after upgrading from mimic to octopus? Is this a bug of some sort?

2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 898
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework: civetweb
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework conf key: port, val: 80
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework conf key: num_threads, val: 1024
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework conf key: request_timeout_ms, val: 50000
2021-01-24T09:24:10.192-0500 7f638f79f9c0  1 radosgw_Main not setting numa affinity
2021-01-24T09:29:10.195-0500 7f638cbcd700 -1 Initialization timeout, failed to initialize
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 1541
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework: civetweb
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework conf key: port, val: 80
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework conf key: num_threads, val: 1024
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework conf key: request_timeout_ms, val: 50000
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  1 radosgw_Main not setting numa affinity
2021-01-24T09:29:25.883-0500 7f4c213ba9c0  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-01-24T09:29:25.883-0500 7f4c213ba9c0  0 ERROR: failed to distribute cache for coredumps.rgw.log:meta.history
2021-01-24T09:32:27.754-0500 7fcdac2bf9c0  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2021-01-24T09:32:27.754-0500 7fcdac2bf9c0  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 978
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework: civetweb
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework conf key: port, val: 80
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework conf key: num_threads, val: 1024
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework conf key: request_timeout_ms, val: 50000
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  1 radosgw_Main not setting numa affinity
2021-01-24T09:32:44.719-0500 7fcdac2bf9c0  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-01-24T09:32:44.719-0500 7fcdac2bf9c0  0 ERROR: failed to distribute cache for coredumps.rgw.log:meta.history

Related issues 1 (0 open, 1 closed)

Is duplicate of rgw - Bug #50520: slow radosgw-admin startup when large value of rgw_gc_max_objs configured (Resolved, assigned to Mark Kogan)

Actions #1

Updated by YOUZHONG YANG about 3 years ago

OK, I tried v15.2.1 and it didn't work; I tried v14.2.16 and yes, it works! So some change in octopus has definitely caused this regression.

Please help, I can provide whatever information needed.

Actions #2

Updated by YOUZHONG YANG about 3 years ago

Something is terribly wrong. I ran the same 'radosgw-admin bucket list' command under both versions against the same backend ceph storage system (upgraded from v13.2.4 to v15.2.8):

v15.2.8 - time radosgw-admin bucket list

real    10m50.067s
user    0m2.127s
sys     0m1.653s

v13.2.4 - time radosgw-admin bucket list

real    0m3.843s
user    0m0.362s
sys     0m0.195s

Actions #3

Updated by YOUZHONG YANG about 3 years ago

A simple 'radosgw-admin user list' under v15.2.8 took 11 minutes 7 seconds:

# time radosgw-admin user list
[
    "zone.user",
    "bse" 
]

real    11m7.266s
user    0m2.081s
sys     0m1.690s

but v13.2.4 only took 2 seconds:

root@ceph-prod-rgw1:~# radosgw-admin -v
ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)

root@ceph-prod-rgw1:~# time radosgw-admin user list
[
    "zone.user",
    "bse" 
]

real    0m2.044s
user    0m0.245s
sys     0m0.245s

Actions #4

Updated by YOUZHONG YANG about 3 years ago

OK, I figured out where the regression is by running radosgw-admin in a debugger.

We have rgw_gc_max_objs = 36000 in ceph.conf.

Look at the difference in RGWGC::initialize() between v14.2.16 and v15.2.8:

src/rgw/rgw_gc.cc - v14.2.16

void RGWGC::initialize(CephContext *_cct, RGWRados *_store) {
  cct = _cct;
  store = _store;

  max_objs = min(static_cast<int>(cct->_conf->rgw_gc_max_objs), rgw_shards_max());

  obj_names = new string[max_objs];

  for (int i = 0; i < max_objs; i++) {
    obj_names[i] = gc_oid_prefix;
    char buf[32];
    snprintf(buf, 32, ".%d", i);
    obj_names[i].append(buf);
  }
}

src/rgw/rgw_gc.cc - v15.2.8

void RGWGC::initialize(CephContext *_cct, RGWRados *_store) {
  cct = _cct;
  store = _store;

  max_objs = min(static_cast<int>(cct->_conf->rgw_gc_max_objs), rgw_shards_max());

  obj_names = new string[max_objs];

  for (int i = 0; i < max_objs; i++) {
    obj_names[i] = gc_oid_prefix;
    char buf[32];
    snprintf(buf, 32, ".%d", i);
    obj_names[i].append(buf);

    auto it = transitioned_objects_cache.begin() + i;
    transitioned_objects_cache.insert(it, false);

    //version = 0 -> not ready for transition
    //version = 1 -> marked ready for transition
    librados::ObjectWriteOperation op;
    op.create(false);
    const uint64_t queue_size = cct->_conf->rgw_gc_max_queue_size, num_deferred_entries = cct->_conf->rgw_gc_max_deferred;
    gc_log_init2(op, queue_size, num_deferred_entries);
    store->gc_operate(obj_names[i], &op);
  }
}

The extra stuff in the loop seems to be very inefficient when rgw_gc_max_objs is big.

Once we set rgw_gc_max_objs to a small value, everything goes back to normal.

Actions #5

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to rgw
Actions #6

Updated by Casey Bodley almost 3 years ago

  • Is duplicate of Bug #50520: slow radosgw-admin startup when large value of rgw_gc_max_objs configured added
Actions #7

Updated by Casey Bodley almost 3 years ago

  • Status changed from New to Duplicate