Bug #46036

cephadm: killmode=none: systemd units failed, but containers still running

Added by Sebastian Wagner 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
cephadm (binary)
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

# ceph orch ps
NAME                    HOST         STATUS         REFRESHED  AGE  VERSION     IMAGE NAME   IMAGE ID      CONTAINER ID  
osd.0                   hostXXXXX-4  error          6m ago     92m  15.2.3.252  ceph/ceph    33194941836f  c5dd2b0cc77d  
osd.1                   hostXXXXX-4  error          6m ago     90m  15.2.3.252  ceph/ceph    33194941836f  b65dc56c76a2  

It turns out the systemd unit failed:

● ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.2.service - Ceph osd.2 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc
   Loaded: loaded (/etc/systemd/system/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-06-16 12:05:49 UTC; 1h 32min ago
  Process: 3861 ExecStopPost=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.2/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 3693 ExecStart=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.2/unit.run (code=exited, status=125)
  Process: 3676 ExecStartPre=/usr/bin/podman rm ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.2 (code=exited, status=2)
 Main PID: 3693 (code=exited, status=125)
    Tasks: 34
   CGroup: /system.slice/system-ceph\x2d92d2d4c0\x2daf05\x2d11ea\x2d9578\x2d0cc47aaa2edc.slice/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.2.service
           ├─28935 /bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.2/unit.run
           ├─29335 /usr/bin/podman run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.2 -e CONTAINER_IMAGE=ceph/ceph -e NODE_NAME=hostXXXXX-4 -v /var/run/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc>
           └─29396 /usr/bin/conmon --api-version 1 -s -c 2f88b58cb64519fc90842f6a473703da44c5612d2686b5beae86b0ff2a7d50bb -u 2f88b58cb64519fc90842f6a473703da44c5612d2686b5beae86b0ff2a7d50bb -r /usr/sbin/runc -b /var/lib/containers/storage/btrfs-containers/2f88b58cb64519fc90842f6a473703da44c5612d2686b5beae86b0ff2a7d5>

Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Uptime(secs): 5400.0 total, 0.0 interval
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Flush(GB): cumulative 0.000, interval 0.000
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(GB): cumulative 0.000, interval 0.000
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(Total Files): cumulative 0, interval 0
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(L0 Files): cumulative 0, interval 0
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(Keys): cumulative 0, interval 0
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: ** File Read Latency Histogram By Level [default] **

The log shows something like:

-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has begun starting up.                                                                        
Jun 16 12:03:06 hostXXXXX-4 podman[31032]: Error: cannot remove container b65dc56c76a247e9178fa81005a93cf44e502a06fb46bcc79b9bf484128b2907 as it is running ->
Jun 16 12:03:06 hostXXXXX-4 systemd[1]: Started Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc.                                                          
-- Subject: Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished start-up                                                                
-- Defined-By: systemd                                                                                                                                        
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel                                                                                      
--                                                                                                                                                            
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished starting up.                                                                     
--                                                                                                                                                            
-- The start-up result is done.                                                                                                                               
Jun 16 12:03:06 hostXXXXX-4 bash[31047]: WARNING: The same type, major and minor should not be used for multiple devices.                                     
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1                                                
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-78635646-a4a6-474e->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/ln -snf /dev/ceph-block-78635646-a4a6-474e-8832-c3a3e668cf9d/osd-block-f645cf27-857b-48ae->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block                                          
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--block--78635646--a4a6--474e--8832--c3a3e668cf9d-osd-->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1                                                
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/ln -snf /dev/ceph-block-dbs-83a1ab17-f232-4f60-887f-111b89f3f655/osd-block-db-3c9563a7-9ab>
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-block-dbs-83a1ab17-f232-4f60-887f-111b89f3f655/osd-block-db-3>
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--block--dbs--83a1ab17--f232--4f60--887f--111b89f3f655->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block.db                                       
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--block--dbs--83a1ab17--f232--4f60--887f--111b89f3f655->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: --> ceph-volume lvm activate successful for osd ID: 1                                                                
Jun 16 12:03:08 hostXXXXX-4 bash[31047]: WARNING: The same type, major and minor should not be used for multiple devices.                                     
Jun 16 12:03:08 hostXXXXX-4 bash[31047]: Error: error creating container storage: the container name "ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1" is alr>
Jun 16 12:03:08 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Main process exited, code=exited, status=125/n/a             
Jun 16 12:03:08 hostXXXXX-4 bash[31227]: WARNING: The same type, major and minor should not be used for multiple devices.                                     
Jun 16 12:03:09 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Unit entered failed state.                                   
Jun 16 12:03:09 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Failed with result 'exit-code'.                              
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Service RestartSec=10s expired, scheduling restart.          
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: Stopped Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc.                                                          
-- Subject: Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished shutting down                                                           
-- Defined-By: systemd                                                                                                                                        
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel                                                                                      
--                                                                                                                                                            
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished shutting down.                                                                   
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Start request repeated too quickly.                          
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: Failed to start Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc.                                                  
-- Subject: Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has failed                                                                           
-- Defined-By: systemd                                                                                                                                        
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel                                                                                      
--                                                                                                                                                            
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has failed.                                                                                   
--                                                                                                                                                            
-- The result is failed.                                                                                                                                      
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Unit entered failed state.                                   
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Failed with result 'exit-code'. 
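
The situation in the log above (unit failed, container of the same name still running) can be checked for with a small script. This is only a sketch: `orphaned_container` is a hypothetical helper name, and the `SYSTEMCTL`/`PODMAN` variables are an assumption added so the commands can be swapped out. The unit and container naming follows the pattern visible in the output above.

```shell
#!/bin/bash
# Hypothetical helper, not part of cephadm: succeed when systemd reports the
# unit as failed while a container with the matching name is still running.
SYSTEMCTL=${SYSTEMCTL:-systemctl}
PODMAN=${PODMAN:-podman}

orphaned_container() {
  local fsid=$1 daemon=$2
  local unit="ceph-${fsid}@${daemon}.service"
  local name="ceph-${fsid}-${daemon}"
  # `systemctl is-failed` exits 0 only when the unit is in the failed state.
  "$SYSTEMCTL" is-failed --quiet "$unit" || return 1
  # `podman ps` lists running containers; look for an exact name match.
  "$PODMAN" ps --format '{{.Names}}' | grep -Fxq "$name"
}
```

For example, `orphaned_container 92d2d4c0-af05-11ea-9578-0cc47aaa2edc osd.1 && echo "container still running"`. Note that the check only fires while the unit is in the failed state, not while it is in auto-restart.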

Adding a set -e to /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run changes the output:

hostXXXXX-4:~ # systemctl status ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1                                                                              
● ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service - Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc                                               
   Loaded: loaded (/etc/systemd/system/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@.service; enabled; vendor preset: disabled)                                  
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-06-16 14:07:27 UTC; 4s ago                                                            
  Process: 10391 ExecStopPost=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.poststop (code=exited, status=0/SUCCESS)                
  Process: 10216 ExecStart=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run (code=exited, status=125)                              
  Process: 10201 ExecStartPre=/usr/bin/podman rm ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1 (code=exited, status=2)                                      
 Main PID: 10216 (code=exited, status=125)                                                                                                                    
    Tasks: 29                                                                                                                                                 
   CGroup: /system.slice/system-ceph\x2d92d2d4c0\x2daf05\x2d11ea\x2d9578\x2d0cc47aaa2edc.slice/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service        
           ├─25971 /bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run                                                                
           ├─26395 /usr/bin/podman run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1 -e CON>
           └─26452 /usr/bin/conmon --api-version 1 -s -c b65dc56c76a247e9178fa81005a93cf44e502a06fb46bcc79b9bf484128b2907 -u b65dc56c76a247e9178fa81005a93cf4>

Jun 16 14:07:27 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Failed with result 'exit-code'.    

Now, let's stop the container:

hostXXXXX-4:~ # podman stop ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1                                                                                   
b65dc56c76a247e9178fa81005a93cf44e502a06fb46bcc79b9bf484128b2907                                                    

Then add this line to /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run:

! /usr/bin/podman rm --storage ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1

Now the service is up again.
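
The `!` prefix on the `podman rm --storage` line matters once unit.run also contains `set -e`. The sketch below illustrates the bash behaviour, with `false` and `true` as stand-ins (an assumption for illustration) for a cleanup command that fails or succeeds. It does not call podman and is not the actual unit.run.

```shell
#!/bin/bash
# Illustration only: `false` and `true` stand in for a cleanup command such as
# `podman rm --storage` that fails or succeeds.

# Plain failing command under `set -e`: the script aborts before "started".
out_plain=$(bash -c 'set -e; false; echo started' || true)

# Negated with `!`: bash does not apply `set -e` to a command whose status is
# inverted, so the script continues whether the cleanup fails or succeeds.
out_neg_fail=$(bash -c 'set -e; ! false; echo started')
out_neg_ok=$(bash -c 'set -e; ! true; echo started')

echo "plain=[$out_plain] negated-fail=[$out_neg_fail] negated-ok=[$out_neg_ok]"
# prints: plain=[] negated-fail=[started] negated-ok=[started]
```

So a best-effort cleanup line can sit in a `set -e` script without aborting it, while a failure of the later commands still stops the script.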


Related issues

Related to Orchestrator - Bug #44990: cephadm: exec: "/usr/bin/ceph-mon": stat /usr/bin/ceph-mon: no such file or directory (New)
Related to Orchestrator - Bug #46654: Unsupported podman container configuration via systemd (Resolved)

History

#1 Updated by Sebastian Wagner 4 months ago

https://github.com/ceph/ceph/pull/35524 is part of the solution. The other part is adding a set -e.

#2 Updated by Sebastian Wagner 4 months ago

  • Related to Bug #44990: cephadm: exec: "/usr/bin/ceph-mon": stat /usr/bin/ceph-mon: no such file or directory added

#3 Updated by Sebastian Wagner 4 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35651

#4 Updated by Sebastian Wagner 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#5 Updated by Sebastian Wagner 3 months ago

  • Status changed from Pending Backport to Resolved
  • Target version set to v15.2.5

#6 Updated by Sebastian Wagner 3 months ago

  • Related to Bug #46654: Unsupported podman container configuration via systemd added
