Bug #50249

rgw doesn't respect rgw_frontends stored in cluster configuration

Added by Arnaud Lefebvre 8 months ago. Updated 27 days ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello everyone,

During an upgrade of our Nautilus cluster from 14.2.16 to 14.2.19, we hit an issue where our RadosGW would not listen on the port set in the cluster configuration. Instead, it fell back to the default frontend and port, which is `beast port=7480`.

ceph configuration:

~ # ceph config dump | grep rgw_frontends
  client      basic    rgw_frontends             beast endpoint=0.0.0.0:8080

When RadosGW starts, I see the following logs:

❯ radosgw -d --cluster ceph --name client.n1
2021-04-08 19:52:03.925 7fece9c1a200  0 ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable), process radosgw, pid 1043029
2021-04-08 19:52:03.971 7fece9c1a200 10  cannot find current period zonegroup using local zonegroup
2021-04-08 19:52:03.973 7fece9c1a200 10 Cannot find current period zone using local zone
2021-04-08 19:52:04.008 7fece9c1a200  2 all 8 watchers are set, enabling cache
2021-04-08 19:52:04.015 7fecd81fb640  2 RGWDataChangesLog::ChangesRenewThread: start
2021-04-08 19:52:04.016 7fecd73ff640  2 garbage collection: garbage collection: start
2021-04-08 19:52:04.016 7fecd65ff640  2 object expiration: start
2021-04-08 19:52:04.020 7fecbe3ff640  5 lifecycle: schedule life cycle next start time: Thu Apr  8 22:00:00 2021
2021-04-08 19:52:04.020 7fecbcbff640 10 ERROR: can't get key: ret=-2
2021-04-08 19:52:04.020 7fecbcbff640  5 ERROR: sync_all_users() returned ret=-2
2021-04-08 19:52:04.023 7fece9c1a200  0 starting handler: beast
2021-04-08 19:52:04.023 7fece9c1a200  4 frontend listening on 0.0.0.0:7480
2021-04-08 19:52:04.023 7fece9c1a200  4 frontend listening on [::]:7480
2021-04-08 19:52:04.023 7fece9c1a200  4 frontend spawning 512 threads

This doesn't happen with the 14.2.16 release; downgrading the radosgw process to that version works fine:

2021-04-08 18:07:46.584 7f9e922df140  0 framework: beast
2021-04-08 18:07:46.584 7f9e922df140  0 framework conf key: endpoint, val: 0.0.0.0:8080
2021-04-08 18:07:46.585 7f9e922df140  0 deferred set uid:gid to 985:116 (ceph:ceph)
2021-04-08 18:07:46.585 7f9e922df140  0 ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable), process radosgw, pid 433139
2021-04-08 18:07:46.629 7f9e922df140 10  cannot find current period zonegroup using local zonegroup
2021-04-08 18:07:46.631 7f9e922df140 10 Cannot find current period zone using local zone
2021-04-08 18:07:46.972 7f9e922df140  2 all 8 watchers are set, enabling cache
2021-04-08 18:07:46.981 7f9e803fb640  2 RGWDataChangesLog::ChangesRenewThread: start
2021-04-08 18:07:46.983 7f9e7efff640  2 garbage collection: garbage collection: start
2021-04-08 18:07:46.983 7f9e7e7fe640  2 object expiration: start
2021-04-08 18:07:46.985 7f9e66bff640 10 ERROR: can't get key: ret=-2
2021-04-08 18:07:46.986 7f9e66bff640  5 ERROR: sync_all_users() returned ret=-2
2021-04-08 18:07:46.986 7f9e683fd640  5 lifecycle: schedule life cycle next start time: Fri Apr  9 00:00:00 2021
2021-04-08 18:07:46.986 7f9e922df140  0 starting handler: beast
2021-04-08 18:07:46.986 7f9e922df140  4 frontend listening on 0.0.0.0:8080
2021-04-08 18:07:46.988 7f9e922df140  0 set uid:gid to 985:116 (ceph:ceph)
2021-04-08 18:07:46.988 7f9e922df140  4 frontend spawning 512 threads
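
The mismatch between the configured endpoint and the 14.2.19 log can also be checked mechanically. The following is a minimal Python sketch, assuming the `ceph config dump` line format and the `frontend listening on` log lines quoted in this report; the helper names `configured_endpoint` and `listening_endpoints` are hypothetical and not part of any Ceph tool.

```python
import re


def configured_endpoint(config_dump: str):
    """Return the endpoint= value of rgw_frontends from `ceph config dump` text, or None."""
    for line in config_dump.splitlines():
        if "rgw_frontends" in line:
            match = re.search(r"endpoint=(\S+)", line)
            if match:
                return match.group(1)
    return None


def listening_endpoints(log: str):
    """Collect every address that follows 'frontend listening on' in a radosgw log."""
    return re.findall(r"frontend listening on (\S+)", log)


# Sample inputs copied from this report.
config_dump = "  client      basic    rgw_frontends             beast endpoint=0.0.0.0:8080"
log_14_2_19 = """\
2021-04-08 19:52:04.023 7fece9c1a200  0 starting handler: beast
2021-04-08 19:52:04.023 7fece9c1a200  4 frontend listening on 0.0.0.0:7480
2021-04-08 19:52:04.023 7fece9c1a200  4 frontend listening on [::]:7480
"""

wanted = configured_endpoint(config_dump)
actual = listening_endpoints(log_14_2_19)
if wanted not in actual:
    print(f"mismatch: configured {wanted}, listening on {actual}")
```

With the 14.2.16 log as input, `listening_endpoints` would return the configured `0.0.0.0:8080` and nothing would be printed.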

I also compiled the nautilus branch from GitHub, and the issue is still there.

So I ran git bisect on the issue. Here's the log:

$ git bisect log
git bisect start
# good: [762032d6f509d5e7ee7dc008d80fe9c87086603c] 14.2.16
git bisect good 762032d6f509d5e7ee7dc008d80fe9c87086603c
# bad: [bb796b9b5bab9463106022eef406373182465d11] 14.2.19
git bisect bad bb796b9b5bab9463106022eef406373182465d11
# bad: [dca1d19eaa1d4fa75c3c75054a886d0d9a990e16] test/librbd/fsx: respect rbd_default_map_options in krbd_open()
git bisect bad dca1d19eaa1d4fa75c3c75054a886d0d9a990e16
# good: [74f48adff35db6f86e9231614da019ef946277a3] Merge pull request #38614 from neha-ojha/wip-48614-nautilus
git bisect good 74f48adff35db6f86e9231614da019ef946277a3
# good: [95879c2433f37c64de9baf8fbd8a77d0d1f3035b] Merge pull request #38475 from ifed01/wip-ifed-fix-avl-nau
git bisect good 95879c2433f37c64de9baf8fbd8a77d0d1f3035b
# good: [95879c2433f37c64de9baf8fbd8a77d0d1f3035b] Merge pull request #38475 from ifed01/wip-ifed-fix-avl-nau
git bisect good 95879c2433f37c64de9baf8fbd8a77d0d1f3035b
# bad: [15fbac3d8abcb59b976a48509c5d53f5019fb58f] Merge pull request #38760 from tchaikov/nautilus-38263
git bisect bad 15fbac3d8abcb59b976a48509c5d53f5019fb58f
# bad: [d5b7eb8b3882adbcda130904c070e67938929f33] Merge pull request #38822 from smithfarm/wip-48803-nautilus
git bisect bad d5b7eb8b3882adbcda130904c070e67938929f33
# bad: [f671c8b2bcd8d9583b406f790c0bce7178f352b5] Merge pull request #38558 from badone/wip-nautilus-fix-logfile-create-perms
git bisect bad f671c8b2bcd8d9583b406f790c0bce7178f352b5
# good: [d10f380204f10b83a4efddfe792b54f1115d791a] tools/ceph_conf: do not "exit(1)" in usage()
git bisect good d10f380204f10b83a4efddfe792b54f1115d791a
# good: [869ae1df551f5e9b56f4c696bfa6a04313d7e175] tools/ceph_conf: send help to cout in case of '--help'
git bisect good 869ae1df551f5e9b56f4c696bfa6a04313d7e175
# bad: [5f7aaf074e3d0372f8d0fb5a1b721f440a9c028f] global/global_init: do first transport connection after setuid()
git bisect bad 5f7aaf074e3d0372f8d0fb5a1b721f440a9c028f
# first bad commit: [5f7aaf074e3d0372f8d0fb5a1b721f440a9c028f] global/global_init: do first transport connection after setuid()

This commit comes from this pull request: https://github.com/ceph/ceph/pull/28012

From what I understand of the issue:
- radosgw starts (rgw_main.cc) and calls global_pre_init() in global_init.cc
- before the patch, global_pre_init() fetched some configuration from the mon, I guess through the `mc_bootstrap.get_monmap_and_config()` call
- now, `global_pre_init()` no longer populates g_conf()->rgw_frontends, so radosgw thinks that no frontends are configured and uses the default one defined in src/common/options.cc
- only then is the global_init() function called, which now makes the `mc_bootstrap.get_monmap_and_config()` call
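
The ordering problem described in the list above can be illustrated with a toy model. This is a sketch in Python, not Ceph code: `ToyConfig`, `MON_CONFIG` and `start` are hypothetical names, and the only assumption carried over from the report is that radosgw reads `rgw_frontends` after `global_pre_init()` and before `global_init()`.

```python
DEFAULT_FRONTENDS = "beast port=7480"  # default value named in this report

# Stand-in for what the mon would hand back (value from `ceph config dump`).
MON_CONFIG = {"rgw_frontends": "beast endpoint=0.0.0.0:8080"}


class ToyConfig:
    """Hypothetical stand-in for the process-wide configuration."""

    def __init__(self):
        self.values = {"rgw_frontends": DEFAULT_FRONTENDS}

    def apply_mon_config(self, mon_values):
        self.values.update(mon_values)


def start(fetch_in_pre_init: bool) -> str:
    """Return the frontends string radosgw would act on in this toy model."""
    conf = ToyConfig()

    # Step 1, "global_pre_init()": before the patch, mon config was fetched here.
    if fetch_in_pre_init:
        conf.apply_mon_config(MON_CONFIG)

    # Step 2: radosgw decides which frontends to start.
    frontends = conf.values["rgw_frontends"]

    # Step 3, "global_init()": after the patch, mon config arrives only now,
    # too late to influence the value read in step 2.
    if not fetch_in_pre_init:
        conf.apply_mon_config(MON_CONFIG)

    return frontends


print(start(fetch_in_pre_init=True))   # prints "beast endpoint=0.0.0.0:8080"
print(start(fetch_in_pre_init=False))  # prints "beast port=7480"
```

The second call reproduces the symptom: the configured value does reach the process, but only after the default has already been used.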

I was wondering if I could patch this myself. Unfortunately, here's my issue:
- The pull request purposely moved the `mc_bootstrap.get_monmap_and_config()` call to after the privileges are dropped
- radosgw seems to be the only program that needs to not drop privileges if one of beast or civetweb is configured; it does so by passing a flag to global_init() that indicates not to drop privileges
- calling global_init() right after global_pre_init() could work. Unfortunately, privileges would be already dropped and beast / civetweb might not successfully bind to a port less than 1024 (there may be other issues, not sure here)
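
The constraint in the last bullet is the usual "bind first, drop privileges afterwards" ordering: ports below 1024 normally require elevated privileges to bind. The following Python sketch shows that general pattern only; it is not how radosgw, beast or civetweb implement it, and `bind_then_drop` is a hypothetical helper.

```python
import os
import socket


def bind_then_drop(port: int, uid: int, gid: int, host: str = "0.0.0.0") -> socket.socket:
    """Bind a listening TCP socket, then give up root if the process has it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    # Bind while still privileged; reversing the order would make a
    # port below 1024 fail for an unprivileged user.
    sock.bind((host, port))
    sock.listen()

    # Drop privileges only after the bind succeeded. Supplementary groups
    # and the gid go first, because setuid() removes the right to change them.
    if os.geteuid() == 0:
        os.setgroups([])
        os.setgid(gid)
        os.setuid(uid)

    return sock
```

A process started as root could call `bind_then_drop(80, 985, 116)`, using the `ceph:ceph` ids that appear in the 14.2.16 log, and keep serving on port 80 as the unprivileged user. Anything that must happen after the drop, such as a config fetch, then has to happen before the frontend is chosen, which is the conflict described above.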

So I'm opening this issue to get your input on how best to fix this. If the fix is easy enough, I would love to provide a patch, given the amount of time I've put into debugging this.

Thanks!

History

#1 Updated by Konstantin Shalygin 8 months ago

  • Priority changed from Normal to High
  • Source set to Community (user)
  • Affected Versions v14.2.17, v14.2.18 added

#2 Updated by Konstantin Shalygin 7 months ago

  • Affected Versions v14.2.20, v14.2.21 added

#3 Updated by Casey Bodley about 2 months ago

  • Assignee set to Casey Bodley

#4 Updated by Casey Bodley about 1 month ago

  • Status changed from New to Triaged

#5 Updated by Konstantin Shalygin about 1 month ago

  • Affected Versions v14.2.22 added
  • Affected Versions deleted (v14.2.17, v14.2.18, v14.2.19, v14.2.20, v14.2.21)

#6 Updated by Casey Bodley about 1 month ago

i raised an email about this issue on the list, titled "global init, mon config and setuid" - feel free to chime in there

#7 Updated by Casey Bodley 29 days ago

  • Status changed from Triaged to Need More Info

mon config for rgw_frontends seems to work fine in octopus and later. there was some related work in https://github.com/ceph/ceph/pull/34613 that might be relevant there. have you been able to reproduce this since?

#8 Updated by Casey Bodley 29 days ago

  • Status changed from Need More Info to Resolved

ok, i think i finally understand the timeline here, and it was all captured well in the original description

https://github.com/ceph/ceph/pull/28012 introduced a regression late in the nautilus release

- calling global_init() right after global_pre_init() could work. Unfortunately, privileges would be already dropped and beast / civetweb might not successfully bind to a port less than 1024 (there may be other issues, not sure here)

during octopus, https://github.com/ceph/ceph/pull/33287 moved the global_init() call as suggested, then https://github.com/ceph/ceph/pull/34613 fixed up the ability to bind privileged ports

so i believe this issue is resolved on all supported release branches

#9 Updated by Arnaud Lefebvre 27 days ago

Casey Bodley wrote:

https://github.com/ceph/ceph/pull/28012 introduced a regression late in the nautilus release

Indeed, I opened this bug hoping that it could be fixed before the latest release of the Nautilus branch.

Thanks for your time and confirming that it's fixed in other major versions!
