Bug #48983 (Closed): radosgw not working - upgraded from mimic to octopus
Description
I upgraded our ceph cluster (6 bare metal nodes, 3 rgw VMs) from v13.2.4 to v15.2.8. The mon, mgr, mds and osd daemons were all upgraded successfully, everything looked good.
After radosgw was upgraded, it refused to work; the log messages are at the end of this message.
Here are the things I have tried:
1. I moved the pools for the rgw service aside and started from scratch (creating the realm, zonegroup, zone, and users), but when I tried to run 'radosgw-admin user create ...', it appeared to be stuck and never returned; other commands like 'radosgw-admin period update --commit' also got stuck.
2. I rolled back radosgw to the old version v13.2.4, then everything works great again.
What am I missing here? Is there anything extra that needs to be done for rgw after upgrading from mimic to octopus? Is this a bug of some sort?
<pre>
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 898
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework: civetweb
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework conf key: port, val: 80
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework conf key: num_threads, val: 1024
2021-01-24T09:24:10.192-0500 7f638f79f9c0  0 framework conf key: request_timeout_ms, val: 50000
2021-01-24T09:24:10.192-0500 7f638f79f9c0  1 radosgw_Main not setting numa affinity
2021-01-24T09:29:10.195-0500 7f638cbcd700 -1 Initialization timeout, failed to initialize
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 1541
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework: civetweb
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework conf key: port, val: 80
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework conf key: num_threads, val: 1024
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  0 framework conf key: request_timeout_ms, val: 50000
2021-01-24T09:29:10.367-0500 7f4c213ba9c0  1 radosgw_Main not setting numa affinity
2021-01-24T09:29:25.883-0500 7f4c213ba9c0  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-01-24T09:29:25.883-0500 7f4c213ba9c0  0 ERROR: failed to distribute cache for coredumps.rgw.log:meta.history
2021-01-24T09:32:27.754-0500 7fcdac2bf9c0  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2021-01-24T09:32:27.754-0500 7fcdac2bf9c0  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 978
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework: civetweb
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework conf key: port, val: 80
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework conf key: num_threads, val: 1024
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  0 framework conf key: request_timeout_ms, val: 50000
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0  1 radosgw_Main not setting numa affinity
2021-01-24T09:32:44.719-0500 7fcdac2bf9c0  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-01-24T09:32:44.719-0500 7fcdac2bf9c0  0 ERROR: failed to distribute cache for coredumps.rgw.log:meta.history
</pre>
Updated by YOUZHONG YANG over 3 years ago
OK, I tried v15.2.1: it didn't work. I tried v14.2.16: it works! So some change in octopus has definitely caused this regression.
Please help, I can provide whatever information needed.
Updated by YOUZHONG YANG over 3 years ago
Something is terribly wrong. I ran the same 'radosgw-admin bucket list' command against the same backend ceph storage system (upgraded to v15.2.8 from v13.2.4):
v15.2.8 - time radosgw-admin bucket list
<pre>
real	10m50.067s
user	0m2.127s
sys	0m1.653s
</pre>
v13.2.4 - time radosgw-admin bucket list
<pre>
real	0m3.843s
user	0m0.362s
sys	0m0.195s
</pre>
Updated by YOUZHONG YANG over 3 years ago
A simple 'radosgw-admin user list' under v15.2.8 took 11 minutes 7 seconds:
<pre>
# time radosgw-admin user list
[
    "zone.user",
    "bse"
]

real	11m7.266s
user	0m2.081s
sys	0m1.690s
</pre>
but v13.2.4 only took 2 seconds:
<pre>
root@ceph-prod-rgw1:~# radosgw-admin -v
ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
root@ceph-prod-rgw1:~# time radosgw-admin user list
[
    "zone.user",
    "bse"
]

real	0m2.044s
user	0m0.245s
sys	0m0.245s
</pre>
Updated by YOUZHONG YANG over 3 years ago
OK, I figured out where the regression is by running radosgw-admin in a debugger.
We have rgw_gc_max_objs = 36000 in ceph.conf.
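For context, a minimal ceph.conf fragment with that setting might look like the following (the section placement is an assumption; rgw options can also live in a [client.rgw.*] section):

<pre>
[global]
rgw_gc_max_objs = 36000
</pre>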
Look at the difference between v14.2.16 and v15.2.8 of RGWGC::initialize():
src/rgw/rgw_gc.cc - v14.2.16
<pre>
void RGWGC::initialize(CephContext *_cct, RGWRados *_store) {
  cct = _cct;
  store = _store;

  max_objs = min(static_cast<int>(cct->_conf->rgw_gc_max_objs), rgw_shards_max());

  obj_names = new string[max_objs];

  for (int i = 0; i < max_objs; i++) {
    obj_names[i] = gc_oid_prefix;
    char buf[32];
    snprintf(buf, 32, ".%d", i);
    obj_names[i].append(buf);
  }
}
</pre>
src/rgw/rgw_gc.cc - v15.2.8
<pre>
void RGWGC::initialize(CephContext *_cct, RGWRados *_store) {
  cct = _cct;
  store = _store;

  max_objs = min(static_cast<int>(cct->_conf->rgw_gc_max_objs), rgw_shards_max());

  obj_names = new string[max_objs];

  for (int i = 0; i < max_objs; i++) {
    obj_names[i] = gc_oid_prefix;
    char buf[32];
    snprintf(buf, 32, ".%d", i);
    obj_names[i].append(buf);

    auto it = transitioned_objects_cache.begin() + i;
    transitioned_objects_cache.insert(it, false);

    //version = 0 -> not ready for transition
    //version = 1 -> marked ready for transition

    librados::ObjectWriteOperation op;
    op.create(false);
    const uint64_t queue_size = cct->_conf->rgw_gc_max_queue_size,
                   num_deferred_entries = cct->_conf->rgw_gc_max_deferred;
    gc_log_init2(op, queue_size, num_deferred_entries);
    store->gc_operate(obj_names[i], &op);
  }
}
</pre>
The extra work inside the loop, in particular the synchronous store->gc_operate() call issued for every GC shard, becomes very inefficient when rgw_gc_max_objs is large.
Once we set rgw_gc_max_objs to a small value, everything goes back to normal.
Updated by Casey Bodley about 3 years ago
- Is duplicate of Bug #50520: slow radosgw-admin startup when large value of rgw_gc_max_objs configured added