Project

General

Profile

Bug #13422

pid file that was deleted after ceph restart leads to osd process lose control

Added by shun song over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
10/09/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
hammer,infernalis
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

when restarting ceph, the same osd will start twice. the first time is regular start;the other is activated by ceph-disk activated all. if the second startup starts fast enough before the pid file created. Unluckly, only one process can get fsid_lock and live through. then the lose process will delete the pid file when exiting. As a result, this osd may stop unregularlly when using 'service ceph stop osd.{id}'.

resovled method:
locking pid file using file lock


Related issues

Duplicated by Ceph - Bug #13238: duplicate start of ceph-osd daemon Duplicate 09/25/2015
Duplicated by Ceph - Bug #14575: daemons leave pid files behind in /home/ubuntu/cephtest Duplicate 01/30/2016
Copied to Ceph - Backport #14582: infernalis: pid file that was deleted after ceph restart leads to osd process lose control Rejected
Copied to Ceph - Backport #14583: hammer: pid file that was deleted after ceph restart leads to osd process lose control Resolved

Associated revisions

Revision f2c0ef40 (diff)
Added by shun song over 3 years ago

global/pidfile: do not start two daemons with a single pid-file

add functions named pidfile_open and pidfile_verify to avoid starting two daemons by a single pid-file

Fixes: #13422
Signed-off-by: shun song <>

Revision 12649ef4 (diff)
Added by root over 3 years ago

test: add unitest test_pidfile.sh

Fixes: #13422
Signed-off-by: shun song <>

Revision 9828d49d (diff)
Added by Loic Dachary over 3 years ago

global: do not start two daemons with a single pid-file (part 2)

Fixes the following bugs:

  • the fd is open(O_WRONLY) and cannot be read from, safe_read
    always fails and never removes the pid file.
  • pidfile_open(g_conf) is close(STDOUT_FILENO) and there is a risk that
    pidfile_open gets STDOUT_FILENO only to have it closed and redirected
    to /dev/null.
  • Before writing the file, ftruncate it so that overriding a file
    containing the pid 1234 with the pid 89 does not end up being
    a file with 8934.
  • Before reading the file, lseek back to offset 0 otherwise it
    will read nothing.
  • tests_pidfile was missing an argument when failing
    TEST_without_pidfile and killed all process with ceph in their name,
    leading to chaos and no useful error message.
  • lstat(fd) cannot possibly return a result different from the one
    obtained right after the file was open, stat(path) must be used
    instead.

In addition to fixing the bugs above, refactor the pidfile.cc
implementation to:

  • be systematic about error reporting (using cerr for when removing
    the pidfile because derr is not available at this point and derr
    when creating the pidfile).
  • replace pidfile_open / pidfile_write with just pidfile_write since
    there never is a case when they are not used together.

More test cases are added to test_pidfile to verify the bugs above are
fixed.

http://tracker.ceph.com/issues/13422 Fixes: #13422

Signed-off-by: Loic Dachary <>

Revision cf433bac (diff)
Added by shun song over 3 years ago

global/pidfile: do not start two daemons with a single pid-file

add functions named pidfile_open and pidfile_verify to avoid starting two daemons by a single pid-file

Fixes: #13422
Signed-off-by: shun song <>
(cherry picked from commit f2c0ef40fd674fecc6e3e97cd6155b976e6759b4)

Revision c63baebb (diff)
Added by Loic Dachary over 3 years ago

global: do not start two daemons with a single pid-file (part 2)

Fixes the following bugs:

  • the fd is open(O_WRONLY) and cannot be read from, safe_read
    always fails and never removes the pid file.
  • pidfile_open(g_conf) is close(STDOUT_FILENO) and there is a risk that
    pidfile_open gets STDOUT_FILENO only to have it closed and redirected
    to /dev/null.
  • Before writing the file, ftruncate it so that overriding a file
    containing the pid 1234 with the pid 89 does not end up being
    a file with 8934.
  • Before reading the file, lseek back to offset 0 otherwise it
    will read nothing.
  • tests_pidfile was missing an argument when failing
    TEST_without_pidfile and killed all process with ceph in their name,
    leading to chaos and no useful error message.
  • lstat(fd) cannot possibly return a result different from the one
    obtained right after the file was open, stat(path) must be used
    instead.

In addition to fixing the bugs above, refactor the pidfile.cc
implementation to:

  • be systematic about error reporting (using cerr for when removing
    the pidfile because derr is not available at this point and derr
    when creating the pidfile).
  • replace pidfile_open / pidfile_write with just pidfile_write since
    there never is a case when they are not used together.

More test cases are added to test_pidfile to verify the bugs above are
fixed.

http://tracker.ceph.com/issues/13422 Fixes: #13422

Signed-off-by: Loic Dachary <>
(cherry picked from commit 9828d49d6f3ccfc78d496153d263ea39b1722d4b)

Conflicts:
src/global/global_init.cc
- the `flag` argument of `global_init_prefork()` is not used, so
it was removed in master. but the cleanup commit was not
cherry-picked to hammer, thus the conflict. we can just keep it
around in hammer to minimize the code churn, although it may
stand in the way of future backports.)
- s/nullptr/NULL/ as hammer does not support c++11.

History

#1 Updated by huang jun over 3 years ago

#2 Updated by shun song over 3 years ago

how about not to remove pid-file,when process exits, becuase the pid-file is important for juding whether osd/mon is running. at the same time, to avoid compelling to write the pid-file when multiple times start up one osd or monitor, lock is recommanded.

#4 Updated by Nathan Cutler over 3 years ago

  • Status changed from New to Need Review

#5 Updated by Kefu Chai over 3 years ago

  • Duplicated by Bug #13238: duplicate start of ceph-osd daemon added

#7 Updated by Loic Dachary over 3 years ago

  • Backport set to hammer,infernalis

#8 Updated by Shinobu Kinjo over 3 years ago

Will patch be (able to be) backported to 0.80.8?

#9 Updated by Kefu Chai over 3 years ago

Will patch be (able to be) backported to 0.80.8?

@Shinobu 0.80.8 is a firefly release, and since firefly is retired, see http://docs.ceph.com/docs/master/releases/ , we are not going to backport to this release.

and yes, it is able to be backported to 0.80.8, but this will not happen in the upstream.

#10 Updated by Kefu Chai over 3 years ago

  • Status changed from Need Review to Pending Backport

#11 Updated by Kefu Chai over 3 years ago

  • Assignee set to Kefu Chai

#12 Updated by Loic Dachary over 3 years ago

  • Copied to Backport #14582: infernalis: pid file that was deleted after ceph restart leads to osd process lose control added

#13 Updated by Loic Dachary over 3 years ago

  • Copied to Backport #14583: hammer: pid file that was deleted after ceph restart leads to osd process lose control added

#14 Updated by Loic Dachary over 3 years ago

  • Status changed from Pending Backport to In Progress
  • Assignee changed from Kefu Chai to Loic Dachary

#15 Updated by Loic Dachary over 3 years ago

  • Duplicated by Bug #14575: daemons leave pid files behind in /home/ubuntu/cephtest added

#17 Updated by Kefu Chai over 3 years ago

  • Status changed from In Progress to Need Review

#18 Updated by Sage Weil over 3 years ago

  • Status changed from Need Review to Resolved

woot!

#19 Updated by Loic Dachary over 3 years ago

  • Status changed from Resolved to Pending Backport

#21 Updated by Loic Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF