Bug #57096

osd not restarting after upgrading to quincy due to podman args --cgroups=split

Added by Ween Jiann Lee over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm trying to upgrade from the latest Octopus release to 17.2.3.

Looking at /var/lib/ceph/xxxx/osd.6/unit.run, there is an extra argument `--cgroups=split` that wasn't there in Octopus.
With this argument, the OSD daemon does not start; after removing it, the OSD daemon starts normally.
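
For reference, the problematic line in unit.run looks roughly like the following (an illustrative, heavily trimmed sketch; only --cgroups=split is the flag in question, and the other flags, the FSID, and the image are placeholders):

# /var/lib/ceph/<fsid>/osd.6/unit.run (excerpt, not verbatim)
/usr/bin/podman run --rm --ipc=host --net=host \
    --cgroups=split \
    --name ceph-<fsid>-osd.6 \
    <remaining flags and image elided>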

Below are the podman info output and the journalctl startup logs.

podman info

host:
  arch: amd64
  buildahVersion: 1.26.2
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.2-2.module+el8.6.0+997+05c9d812.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.2, commit: 98e028a5804809ccb49bc099c0d53adc43ef8cc4'
  cpuUtilization:
    idlePercent: 97.58
    systemPercent: 0.55
    userPercent: 1.87
  cpus: 48
  distribution:
    distribution: '"rocky"'
    version: "8.6" 
  eventLogger: file
  hostname: sys1.nodes.preferred.ai
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-372.16.1.el8_6.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 236812603392
  memTotal: 269990506496
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.3-2.module+el8.6.0+997+05c9d812.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.3
      spec: 1.0.2-dev
      go: go1.17.12
      libseccomp: 2.5.2
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-2.module+el8.6.0+997+05c9d812.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 4291006464
  swapTotal: 4294963200
  uptime: 16h 15m 10.45s (Approximately 0.67 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 6
    paused: 0
    running: 3
    stopped: 3
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 233919479808
  graphRootUsed: 93134196736
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false" 
    Supports d_type: "true" 
    Using metacopy: "true" 
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 4
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.1.1
  Built: 1659426794
  BuiltTime: Tue Aug  2 15:53:14 2022
  GitCommit: "" 
  GoVersion: go1.17.12
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.1

osd.6 start logs

Aug 11 11:14:20 x.com systemd[1]: Starting Ceph osd.6 for xxxx...
-- Subject: Unit ceph-xxxx@osd.6.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-xxxx@osd.6.service has begun starting up.
Aug 11 11:14:21 x.com systemd[1]: Started libcontainer container 436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.
-- Subject: Unit libpod-436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.scope has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit libpod-436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.scope has finished starting up.
--
-- The start-up result is done.
Aug 11 11:14:22 x.com bash[3742940]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-6
Aug 11 11:14:22 x.com bash[3742940]: Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-6 --no-mon-config --dev /dev/mapper/ceph--e140738d--f7fc--428f--9ecd--192c0db697dc-osd--block--77a96052--c916--492a--aa3f--bf6>
Aug 11 11:14:22 x.com bash[3742940]: Running command: /usr/bin/chown -h ceph:ceph /dev/mapper/ceph--e140738d--f7fc--428f--9ecd--192c0db697dc-osd--block--77a96052--c916--492a--aa3f--bf6cccb0a802
Aug 11 11:14:22 x.com bash[3742940]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Aug 11 11:14:22 x.com bash[3742940]: Running command: /usr/bin/ln -s /dev/mapper/ceph--e140738d--f7fc--428f--9ecd--192c0db697dc-osd--block--77a96052--c916--492a--aa3f--bf6cccb0a802 /var/lib/ceph/osd/ceph-6/block
Aug 11 11:14:22 x.com bash[3742940]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-6
Aug 11 11:14:22 x.com bash[3742940]: --> ceph-volume raw activate successful for osd ID: 6
Aug 11 11:14:22 x.com systemd[1]: libpod-436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.scope: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit libpod-436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.scope has successfully entered the 'dead' state.
Aug 11 11:14:22 x.com systemd[1]: libpod-436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.scope: Consumed 1.445s CPU time
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit libpod-436ff1dc9e04f07a58006178a1c1978d704e951d467e61299b9cdb755e83358d.scope completed and consumed the indicated resources.
Aug 11 11:14:22 x.com systemd[1]: var-lib-containers-storage-overlay-afe8eefb9c7e960f583589858644d28823a097316b5987e607f36cb401adaca4-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit var-lib-containers-storage-overlay-afe8eefb9c7e960f583589858644d28823a097316b5987e607f36cb401adaca4-merged.mount has successfully entered the 'dead' state.
Aug 11 11:14:22 x.com bash[3743336]: b205a5ef766e965ca9e72a189edc8974fa4d82d8d43819c7a6ae8b10a55a0fac
Aug 11 11:14:23 x.com podman[3743417]:
Aug 11 11:14:23 x.com bash[3743417]: Error: could not find cgroup mount in "/proc/self/cgroup" 
Aug 11 11:14:23 x.com systemd[1]: ceph-xxxx@osd.6.service: Control process exited, code=exited status=126


Related issues

Copied to Orchestrator - Backport #57526: quincy: osd not restarting after upgrading to quincy due to podman args --cgroups=split Resolved

History

#1 Updated by Adam King over 1 year ago

And this is only on OSDs? AFAIK we're setting --cgroups=split for every container we deploy. It seems odd that it would only break for OSDs.

#2 Updated by Ween Jiann Lee over 1 year ago

Adam King wrote:

And this is only on OSDs? AFAIK we're setting --cgroups=split for every container we deploy. It seems odd that it would only break for OSDs.

Unfortunately, only the OSDs and rgw are placed on servers with podman. The OSDs are upgraded before the rgw, so I can't tell whether the problem exists for the rest of the daemon types.

I can try to upgrade the rgw manually if you would like to know.

#3 Updated by Ween Jiann Lee over 1 year ago

Adam King wrote:

And this is only on OSDs? AFAIK we're setting --cgroups=split for every container we deploy. It seems odd that it would only break for OSDs.

Unfortunately, only the OSDs and rgw are placed on servers with podman. The OSDs are upgraded before the rgw, so I can't tell whether the problem exists for rgw as well or for the rest of the daemon types.

I can try to upgrade the rgw manually if you would like to find out.

#4 Updated by Adam King over 1 year ago

Was discussing this with somebody today and they had a potential workaround idea. We have a way of specifying miscellaneous container args in newer versions of cephadm: https://docs.ceph.com/en/quincy/cephadm/services/#extra-container-arguments. We might be able to use this to pass `--cgroups=enabled` (which is the default behavior) to override the `--cgroups=split`. It would basically involve pausing the upgrade, modifying the service spec for the osds and rgws on this host to add

extra_container_args:
  - '--cgroups=enabled'

to the end, re-applying the spec, telling cephadm to redeploy the service (`ceph orch redeploy <service-name>`) for each service that has daemons on this host, and then starting the upgrade again. You could try that and see if it works. If not, I have https://github.com/ceph/ceph/pull/47640 open and could try to push to get it into the next quincy release. Personally I don't really like the patch, so I'd like to first make sure this workaround doesn't help before moving forward with it.
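
Roughly, the full sequence for the workaround would be something like this (just a sketch; osd_spec.yml is a placeholder for whichever file holds the edited spec):

ceph orch upgrade pause
ceph orch apply -i osd_spec.yml       # re-apply the spec that now contains extra_container_args
ceph orch redeploy <service-name>     # repeat for each affected service on this host
ceph orch upgrade resume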

#5 Updated by Ween Jiann Lee over 1 year ago

Adam King wrote:

It would basically involve pausing the upgrade, modifying the service spec for the osds and rgws on this host to add

Thanks for getting a PR ready. The OSDs have been deployed to the default unmanaged service spec (i.e. "osd" on `ceph orch ls`), which cannot be modified.

I can add another service spec with the args above but is there any way to migrate the OSDs to another service spec without destroying the data?

#6 Updated by Adam King over 1 year ago

Ween Jiann Lee wrote:

Adam King wrote:

It would basically involve pausing the upgrade, modifying the service spec for the osds and rgws on this host to add

Thanks for getting a PR ready. The OSDs have been deployed to the default unmanaged service spec (i.e. "osd" on `ceph orch ls`), which cannot be modified.

I can add another service spec with the args above but is there any way to migrate the OSDs to another service spec without destroying the data?

If the OSD is attached to a service, you should be able to edit the existing service spec to include the args and re-apply it, and cephadm will redeploy the OSDs with the extra args. I tested that a set of OSDs deployed using

[ceph: root@vm-00 /]# cat osd.yml 
service_type: osd
service_id: foo
placement:
  hosts:
  - vm-00
  - vm-01
spec:
  data_devices:
    paths:
    - /dev/vdb
    - /dev/vdc

were deployed without --cgroups=enabled, but after re-applying this spec with a modification (see https://docs.ceph.com/en/latest/cephadm/services/#updating-service-specifications if you're not sure what I mean) they were redeployed with --cgroups=enabled:

[ceph: root@vm-00 /]# cat osd.yml 
service_type: osd
service_id: foo
service_name: osd.foo
placement:
  hosts:
  - vm-00
  - vm-01
extra_container_args:
  - '--cgroups=enabled'
spec:
  data_devices:
    paths:
    - /dev/vdb
    - /dev/vdc
  filter_logic: AND
  objectstore: bluestore
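
The re-apply step itself is just the standard apply command (assuming the modified spec is saved as osd.yml):

[ceph: root@vm-00 /]# ceph orch apply -i osd.yml
Scheduled osd.foo update...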

Note that cephadm knew the OSDs affected by that were part of osd.foo:

[ceph: root@vm-00 /]# ceph orch ps --service-name osd.foo
NAME   HOST   PORTS  STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                 IMAGE ID      CONTAINER ID  
osd.0  vm-01         running (10m)     9m ago  13m    48.8M    11.1G  17.0.0-14573-gf5c21acf  f05a7f2e4b06  45ccb83b661a  
osd.1  vm-00         running (11m)     5m ago  13m    54.8M    7588M  17.0.0-14573-gf5c21acf  f05a7f2e4b06  3ae50021a8c1  
osd.2  vm-01         running (10m)     9m ago  13m    45.4M    11.1G  17.0.0-14573-gf5c21acf  f05a7f2e4b06  761d2fd4e5bf  
osd.3  vm-00         running (11m)     5m ago  13m    50.3M    7588M  17.0.0-14573-gf5c21acf  f05a7f2e4b06  d26a1fbcce32  

I don't think there's any way currently to do so with OSDs not tied to any service (that show up in `ceph orch ls --service-name osd`). I think there SHOULD be and I will try to add something in the future. In the meantime, I guess maybe I'll try to go forward with that other PR to address the issue.

#7 Updated by Adam King over 1 year ago

  • Status changed from New to Pending Backport
  • Backport set to quincy, pacific
  • Pull request ID set to 47640

#8 Updated by Adam King over 1 year ago

  • Backport changed from quincy, pacific to quincy

#9 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57526: quincy: osd not restarting after upgrading to quincy due to podman args --cgroups=split added

#10 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#11 Updated by Adam King over 1 year ago

  • Status changed from Pending Backport to Resolved

#12 Updated by Adam King over 1 year ago

@Ween I found another thing that may be helpful. For OSDs that are not attached to a spec, you can attach them to a spec (and therefore use extra_container_args on them) by editing the unit.meta file for the daemon.

For example, here osd.0 was not attached to any spec:

[ceph: root@vm-00 /]# ceph orch daemon add osd vm-00:/dev/vdc
Created osd(s) 0 on host 'vm-00'
[ceph: root@vm-00 /]# ceph orch ps
NAME              HOST   PORTS   STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION          IMAGE ID      CONTAINER ID  
crash.vm-00       vm-00          running (4m)     2m ago   4m    6971k        -  16.2.8-80.el8cp  4826c8c29ba2  0ef00e352938  
crash.vm-01       vm-01          running (3m)     2m ago   3m    7168k        -  16.2.8-80.el8cp  4826c8c29ba2  3417453e1d7d  
crash.vm-02       vm-02          running (3m)     2m ago   3m    7180k        -  16.2.8-80.el8cp  4826c8c29ba2  e5f65e7122ac  
mgr.vm-00.ezzehn  vm-00  *:9283  running (5m)     2m ago   5m     421M        -  16.2.8-80.el8cp  4826c8c29ba2  f4cfac7ba198  
mgr.vm-01.lgesel  vm-01  *:8443  running (3m)     2m ago   3m     389M        -  16.2.8-80.el8cp  4826c8c29ba2  0347fea26d7c  
mon.vm-00         vm-00          running (5m)     2m ago   5m    48.6M    2048M  16.2.8-80.el8cp  4826c8c29ba2  2bae16704025  
mon.vm-01         vm-01          running (3m)     2m ago   3m    39.6M    2048M  16.2.8-80.el8cp  4826c8c29ba2  4303ce39f55c  
mon.vm-02         vm-02          running (3m)     2m ago   3m    37.3M    2048M  16.2.8-80.el8cp  4826c8c29ba2  a060d6e3546a  
osd.0             vm-00          running (2m)     2m ago   2m    13.1M    22.2G  16.2.8-80.el8cp  4826c8c29ba2  bdbdd6db176f
[ceph: root@vm-00 /]# ceph orch ls
NAME   PORTS  RUNNING  REFRESHED  AGE  PLACEMENT    
crash             3/3  58s ago    3m   *            
mgr               2/2  58s ago    3m   count:2      
mon               3/5  58s ago    3m   count:5      
osd                 1  10s ago    -    <unmanaged> 

and you can see its unit.meta reported as such (a service_name of just "osd"):

[root@vm-00 ~]# cat /var/lib/ceph/3098b6e0-3904-11ed-8133-525400affcb5/osd.0/unit.meta 
{
    "service_name": "osd",
    "ports": [],
    "ip": null,
    "deployed_by": [
        "quay.io/adk3798/ceph@sha256:4b100389b72cf985b4819fc145c7581d79783d1671f2b43bba5fb560ff83fac3" 
    ],
    "rank": null,
    "rank_generation": null,
    "extra_container_args": null,
    "memory_request": null,
    "memory_limit": null
}

but after I added a new osd spec and then changed the service_name in that osd's unit.meta file, it now falls under that spec:

[ceph: root@vm-00 /]# cat osd.yml 
service_type: osd
service_id: foo
service_name: osd.foo
placement:
  hosts:
  - vm-00
spec:
  data_devices:
    paths:
    - /dev/vdc
  filter_logic: AND
  objectstore: bluestore
[ceph: root@vm-00 /]# ceph orch apply -i osd.yml 
Scheduled osd.foo update...
[ceph: root@vm-00 /]# ceph orch ls
NAME     PORTS  RUNNING  REFRESHED  AGE  PLACEMENT    
crash               3/3  2m ago     4m   *            
mgr                 2/2  2m ago     4m   count:2      
mon                 3/5  2m ago     4m   count:5      
osd                   1  113s ago   -    <unmanaged>  
osd.foo               0  -          62s  vm-00        

[root@vm-00 ~]# vi /var/lib/ceph/3098b6e0-3904-11ed-8133-525400affcb5/osd.0/unit.meta 
[root@vm-00 ~]# cat /var/lib/ceph/3098b6e0-3904-11ed-8133-525400affcb5/osd.0/unit.meta 
{
    "service_name": "osd.foo",
    "ports": [],
    "ip": null,
    "deployed_by": [
        "quay.io/adk3798/ceph@sha256:4b100389b72cf985b4819fc145c7581d79783d1671f2b43bba5fb560ff83fac3" 
    ],
    "rank": null,
    "rank_generation": null,
    "extra_container_args": null,
    "memory_request": null,
    "memory_limit": null
}

Run a "ceph orch ps --refresh" here to get cephadm to update its daemon metadata. Then . . .

[ceph: root@vm-00 /]# ceph orch ls
NAME     PORTS  RUNNING  REFRESHED  AGE  PLACEMENT  
crash               3/3  56s ago    10m  *          
mgr                 2/2  56s ago    10m  count:2    
mon                 3/5  56s ago    10m  count:5         
osd.foo               1  33s ago    6m   vm-00  

and now I can add extra_container_args for osd.0 using the spec for the osd.foo service. This should allow the workaround to be applied to OSDs not currently attached to any spec.
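
Putting it together, the final osd.foo spec with the workaround applied might look like this (a sketch based on the specs above; the host and device paths are placeholders, and --cgroups=enabled is the override suggested in comment #4):

service_type: osd
service_id: foo
service_name: osd.foo
placement:
  hosts:
  - vm-00
extra_container_args:
  - '--cgroups=enabled'
spec:
  data_devices:
    paths:
    - /dev/vdc
  filter_logic: AND
  objectstore: bluestore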

#13 Updated by Ween Jiann Lee over 1 year ago

The unit.meta file is not yet present in Octopus. I'll try to figure something out or wait for a release containing the PR.

Thanks for the help, Adam!

#14 Updated by Ween Jiann Lee over 1 year ago

I manually created the unit.meta file, and it seems to work. Thanks again.
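
For anyone else on Octopus-deployed OSDs: the manually created file just needs to mirror the structure shown in comment #12 (a sketch; service_name must match the spec you created, and the remaining fields are placeholders):

{
    "service_name": "osd.foo",
    "ports": [],
    "ip": null,
    "deployed_by": [],
    "rank": null,
    "rank_generation": null,
    "extra_container_args": null,
    "memory_request": null,
    "memory_limit": null
}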
