Bug #13238
duplicate start of ceph-osd daemon
% Done: 0%
Description
OS:
CentOS 7 X86_64
ceph version:
[root@hust17 /var/log/ceph]# ceph -v
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
The data devices have been mounted but the daemons have not yet been started; we then use the init script to start the OSDs:
/etc/init.d/ceph start osd
On rare occasions, we found these error messages in the osd log:
2015-09-25 15:38:41.505114 7ff72788b760  0 ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e), process ceph-osd, pid 30794
2015-09-25 15:38:42.317053 7fe06b600760 -1 asok(0x47478c0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ce#
2015-09-25 15:38:42.317117 7fe06b600760  0 filestore(/var/lib/ceph/osd/ceph-1) lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd still running? (11) Resource#
2015-09-25 15:38:42.317132 7fe06b600760 -1 filestore(/var/lib/ceph/osd/ceph-1) FileStore::mount: lock_fsid failed
2015-09-25 15:38:42.317142 7fe06b600760 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-1: (16) Device or resource busy
2015-09-25 15:38:43.562710 7ff72788b760 -1 asok(0x351f8c0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ce#
2015-09-25 15:38:43.562833 7ff72788b760  0 filestore(/var/lib/ceph/osd/ceph-1) lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd still running? (11) Resource#
2015-09-25 15:38:43.562869 7ff72788b760 -1 filestore(/var/lib/ceph/osd/ceph-1) FileStore::mount: lock_fsid failed
2015-09-25 15:38:43.562897 7ff72788b760 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-1: (16) Device or resource busy
Also, some of the daemons' pid files got removed unexpectedly, which means those OSDs can no longer be stopped or restarted.
Possible clues:
1. After the osd daemon is forked, the pid file is created.
2. /etc/init.d/ceph uses this pid file to prevent a second start of the daemon, as well as to stop the daemon.
3. If the data device has been mounted at the correct location, the script first tries to start the daemon; this is the daemon's first chance to start.
4. Then, in its final processing step, the script always tries to activate all data devices, even if the daemons were already started in the previous step, so the daemon gets a second chance to start.
5. If the pid file is created quickly enough, before the second start attempt, everything is fine; but if its creation is delayed slightly for some reason, the race condition arises.
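The steps above can be sketched as a toy simulation (Python with hypothetical names, not the actual init script or daemon code): the only duplicate-start guard is "does the pid file exist?", and the daemon writes that file some time after forking, which opens the race window.

```python
import os
import tempfile
import threading
import time

# Toy simulation of the race described above. The guard and delays are
# illustrative stand-ins, not the real /etc/init.d/ceph logic.

pidfile = os.path.join(tempfile.mkdtemp(), "osd.0.pid")
starts = []  # records every start attempt that got past the guard

def start_osd(pidfile_write_delay):
    if os.path.exists(pidfile):          # the init script's only guard
        return                           # "already running", skip
    starts.append(pidfile_write_delay)   # daemon forked here
    time.sleep(pidfile_write_delay)      # delayed pidfile_write (race window)
    with open(pidfile, "w") as f:
        f.write(str(os.getpid()))

# First start: the pid file write is slow.
first = threading.Thread(target=start_osd, args=(0.3,))
first.start()
time.sleep(0.05)
# The script's final "activate all devices" pass starts the daemon again;
# the guard passes because the pid file does not exist yet.
start_osd(0.0)
first.join()

print(len(starts))  # -> 2: the daemon was started twice
```

With the delay removed (pid file written immediately), the second call sees the pid file and returns without starting anything, which is why the bug only shows up intermittently.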
This problem can be reproduced easily by adding a short sleep, a second or so, before pidfile_write in global_init_postfork_start.
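The "(11) Resource temporarily unavailable" lines in the log above come from the duplicate daemon failing to take the advisory lock on the data directory's fsid file. A minimal illustration (Python, using flock(2) for the demonstration; Ceph's lock_fsid is C++):

```python
import errno
import fcntl
import os
import tempfile

# The first ceph-osd holds an advisory lock on the fsid file; a
# non-blocking lock attempt by a second instance fails with EAGAIN,
# which is errno 11: "Resource temporarily unavailable".

fsid = os.path.join(tempfile.mkdtemp(), "fsid")
open(fsid, "w").close()

fd1 = os.open(fsid, os.O_RDWR)                    # first ceph-osd
fcntl.flock(fd1, fcntl.LOCK_EX | fcntl.LOCK_NB)   # lock taken

second_lock_errno = None
fd2 = os.open(fsid, os.O_RDWR)                    # duplicate ceph-osd
try:
    fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError as e:
    second_lock_errno = e.errno                   # EAGAIN: lock is held

print(second_lock_errno == errno.EAGAIN)  # -> True
```

This is also why the failure is harmless to the first daemon: the lock simply rejects the duplicate, but the duplicate's cleanup path is what ends up removing the shared pid file.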
Related issues
History
#1 Updated by Loïc Dachary over 8 years ago
- Status changed from New to Won't Fix
- Release set to giant
I believe this misbehavior has been fixed in Hammer and would be resolved by upgrading the OSDs. Marking this problem as Won't fix because the recommended way to resolve a bug in Giant is to upgrade.
#2 Updated by runsisi hust over 8 years ago
I don't think this bug has been fixed in Hammer.
To reproduce on CentOS 6, add a short sleep, a second or so, before pidfile_write in global_init_postfork_start.
To reproduce on CentOS 7, add the same sleep before pidfile_write in global_init_prefork.
Adding the sleep inside pidfile_write itself should have the same effect.
[root@ceph0 /var/lib/ceph/mon/ceph-ceph0]# ceph --version
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
[root@ceph0 /var/lib/ceph/mon/ceph-ceph0]# ls /var/run/ceph/
ceph-mon.ceph0.asok  ceph-osd.3.asok  ceph-osd.6.asok  mon.ceph0.pid  osd.3.pid  osd.6.pid
[root@ceph0 /var/lib/ceph/mon/ceph-ceph0]# ps aux | grep ceph-osd
root     18285  0.0  0.0 115212  1488 ?      Ss   09:13   0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     18288  0.4  0.6 764784 24908 ?      Sl   09:13   0:01 /usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     19588  0.0  0.0 115212  1484 ?      Ss   09:14   0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     19591  0.5  0.5 767912 20948 ?      Sl   09:14   0:01 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     19730  0.0  0.0 115212  1484 ?      Ss   09:14   0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     19732  0.4  0.8 766856 32536 ?      Sl   09:14   0:01 /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     20752  0.0  0.0 112640   980 pts/1  R+   09:20   0:00 grep --color=auto ceph-osd
2015-10-09 09:11:33.918608 7f7ccd3c4700 20 -- 192.168.134.10:6804/16907 >> 192.168.134.10:6800/16172 pipe(0x5aba000 sd=100 :54147 s=3 pgs=9 cs=1 l=0 c=0x563d280).writer sleeping
2015-10-09 09:13:35.532936 7f234cc5f900  0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-osd, pid 18288
2015-10-09 09:13:36.562408 7f234cc5f900  0 filestore(/var/lib/ceph/osd/ceph-8) backend xfs (magic 0x58465342)
2015-10-09 09:13:36.570533 7f234cc5f900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-8) detect_features: FIEMAP ioctl is supported and appears to work
2015-10-09 09:13:36.570548 7f234cc5f900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-8) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-09 09:13:36.571888 7f234cc5f900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-8) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-09 09:13:36.572054 7f234cc5f900  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-8) detect_feature: extsize is supported and kernel 3.10.0-229.7.2.el7.x86_64 >= 3.5
2015-10-09 09:13:36.592346 7f234cc5f900  0 filestore(/var/lib/ceph/osd/ceph-8) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-10-09 09:13:36.599757 7f234cc5f900  1 journal _open /var/lib/ceph/osd/ceph-8/journal fd 19: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-10-09 09:13:36.658018 7f234cc5f900  1 journal _open /var/lib/ceph/osd/ceph-8/journal fd 19: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-10-09 09:13:36.658755 7f234cc5f900  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-10-09 09:13:36.667059 7f234cc5f900  0 osd.8 78 crush map has features 1107558400, adjusting msgr requires for clients
2015-10-09 09:13:36.667089 7f234cc5f900  0 osd.8 78 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2015-10-09 09:13:36.667115 7f234cc5f900  0 osd.8 78 crush map has features 1107558400, adjusting msgr requires for osds
2015-10-09 09:13:36.667152 7f234cc5f900  0 osd.8 78 load_pgs
2015-10-09 09:13:36.682865 7f234cc5f900  0 osd.8 78 load_pgs opened 19 pgs
2015-10-09 09:13:36.683773 7f234cc5f900 -1 osd.8 78 log_to_monitors {default=true}
2015-10-09 09:13:36.704752 7f688beaa900  0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-osd, pid 18727
2015-10-09 09:13:36.735878 7f233a869700  0 osd.8 78 ignoring osdmap until we have initialized
2015-10-09 09:13:36.735944 7f233a869700  0 osd.8 78 ignoring osdmap until we have initialized
2015-10-09 09:13:36.738201 7f234cc5f900  0 osd.8 78 done with init, starting boot process
2015-10-09 09:13:37.691963 7f2324922700  0 -- 192.168.134.10:6804/18288 >> 192.168.134.12:6804/16500 pipe(0x4cc0000 sd=83 :6804 s=0 pgs=0 cs=0 l=0 c=0x48a3c80).accept connect_seq 0 vs existing 0 state connecting
2015-10-09 09:13:37.725762 7f2324821700  0 -- 192.168.134.10:6804/18288 >> 192.168.134.12:6802/15672 pipe(0x4cc9000 sd=84 :6804 s=0 pgs=0 cs=0 l=0 c=0x48a3f40).accept connect_seq 0 vs existing 0 state wait
2015-10-09 09:13:37.725950 7f688beaa900 -1 asok(0x3a70000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-osd.8.asok': (17) File exists
2015-10-09 09:13:37.748822 7f688beaa900  0 filestore(/var/lib/ceph/osd/ceph-8) lock_fsid failed to lock /var/lib/ceph/osd/ceph-8/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
2015-10-09 09:13:37.748853 7f688beaa900 -1 osd.8 0 OSD::pre_init: object store '/var/lib/ceph/osd/ceph-8' is currently in use. (Is ceph-osd already running?)
2015-10-09 09:13:37.748858 7f688beaa900 -1 ** ERROR: osd pre_init failed: (16) Device or resource busy
2015-10-09 09:13:40.806878 7f23220fa700  0 -- 192.168.134.10:6804/18288 >> 192.168.134.10:6800/17935 pipe(0x4cc4800 sd=97 :6804 s=0 pgs=0 cs=0 l=0 c=0x48a3de0).accept connect_seq 0 vs existing 0 state connecting
2015-10-09 09:13:40.808133 7f2321ff9700  0 -- 192.168.134.10:6804/18288 >> 192.168.134.10:6802/18075 pipe(0x4e61000 sd=99 :6804 s=0 pgs=0 cs=0 l=0 c=0x48a56a0).accept connect_seq 0 vs existing 0 state connecting
2015-10-09 09:14:39.889322 7f23220fa700  0 -- 192.168.134.10:6804/18288 >> 192.168.134.10:6800/17935 pipe(0x4cc4800 sd=97 :6804 s=2 pgs=8 cs=1 l=0 c=0x48a5280).fault with nothing to send, going to standby
2015-10-09 09:14:39.899316 7f2325029700  0 -- 192.168.133.10:0/18288 >> 192.168.133.10:6801/17935 pipe(0x4bf5800 sd=59 :0 s=1 pgs=0 cs=0 l=1 c=0x48a4fc0).fault
2015-10-09 09:14:39.899508 7f2324d26700  0 -- 192.168.133.10:0/18288 >> 192.168.134.10:6801/17935 pipe(0x4d65800 sd=58 :0 s=1 pgs=0 cs=0 l=1 c=0x48a5120).fault
2015-10-09 09:14:45.520507 7f8643720900  0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-osd, pid 19950
2015-10-09 09:14:46.555291 7f8643720900  0 filestore(/var/lib/ceph/osd/ceph-8) lock_fsid failed to lock /var/lib/ceph/osd/ceph-8/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
2015-10-09 09:14:46.555329 7f8643720900 -1 osd.8 0 OSD::pre_init: object store '/var/lib/ceph/osd/ceph-8' is currently in use. (Is ceph-osd already running?)
2015-10-09 09:14:46.555335 7f8643720900 -1 ** ERROR: osd pre_init failed: (16) Device or resource busy
2015-10-09 09:14:46.637084 7f5f963a7900  0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-osd, pid 20381
2015-10-09 09:14:47.657560 7f5f963a7900  0 filestore(/var/lib/ceph/osd/ceph-8) lock_fsid failed to lock /var/lib/ceph/osd/ceph-8/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
2015-10-09 09:14:47.657592 7f5f963a7900 -1 osd.8 0 OSD::pre_init: object store '/var/lib/ceph/osd/ceph-8' is currently in use. (Is ceph-osd already running?)
2015-10-09 09:14:47.657597 7f5f963a7900 -1 ** ERROR: osd pre_init failed: (16) Device or resource busy
2015-10-09 09:14:49.533754 7f232512a700  0 -- 192.168.134.10:6804/18288 >> :/0 pipe(0x4e5c800 sd=80 :6804 s=0 pgs=0 cs=0 l=0 c=0x48a5540).accept failed to getpeername (107) Transport endpoint is not connected
2015-10-09 09:14:50.034601 7f23222fc700  0 -- 192.168.134.10:6804/18288 >> 192.168.134.10:6803/19732 pipe(0x4e58000 sd=97 :6804 s=0 pgs=0 cs=0 l=0 c=0x48a53e0).accept connect_seq 0 vs existing 0 state wait
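One general way to close this kind of window (a sketch only, not Ceph's actual fix) is to claim the pid file atomically with O_CREAT|O_EXCL before any slow initialization, so a duplicate start fails immediately instead of racing a delayed pidfile_write:

```python
import os
import tempfile

# Sketch of an atomic pid-file claim (illustrative, not the real Ceph
# change): O_CREAT|O_EXCL makes creation atomic, so whichever start
# attempt creates the file first wins and every later attempt is
# rejected up front, before any slow daemon initialization runs.

pidfile = os.path.join(tempfile.mkdtemp(), "osd.0.pid")

def claim_pidfile(path):
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False                 # another instance already claimed it
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True

print(claim_pidfile(pidfile))  # -> True: first start wins
print(claim_pidfile(pidfile))  # -> False: duplicate start rejected early
```

A loser that is rejected this way also has no pid file of its own to clean up, so it cannot delete the running daemon's pid file on exit, which is the second symptom reported in the description.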
#3 Updated by Ken Dreyer over 8 years ago
- Status changed from Won't Fix to New
- Release deleted (giant)
- Release set to hammer
Re-opening based on runsisi hust's comment above.
#4 Updated by Warren Wang over 8 years ago
Second confirmation that this still occurs with Hammer on RHEL 7: the pid file goes missing, resulting in orphaned processes under systemd.
#5 Updated by Loïc Dachary over 8 years ago
- Status changed from New to 12
- Priority changed from Normal to High
#6 Updated by Loïc Dachary over 8 years ago
- Duplicated by Bug #13773: pid files on some OSD's are not created correctly (empty) added
#7 Updated by Kefu Chai about 8 years ago
- Assignee set to Kefu Chai
#8 Updated by Kefu Chai about 8 years ago
note to myself: need to see if https://github.com/ceph/ceph/pull/7075 addresses this issue.
#10 Updated by Kefu Chai about 8 years ago
- Duplicates Bug #13422: pid file that was deleted after ceph restart leads to osd process lose control added