Bug #13748


ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade

Added by Logan V over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: hammer
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was posted about on ceph-users, but I now have a debug log and didn't want to send a 6 MB file to the list.

I am seeing constant mon crashes since upgrading from 0.94.3 to 0.94.5 this morning. I am still upgrading the OSDs so it is yet to be seen whether these crashes stop occurring after the OSDs are all running 0.94.5.

Debug log from one of the mons is attached.


Files

2015-11-10-moncrash-4.bz2 (323 KB) - debug from mon crash - Logan V, 11/10/2015 05:11 PM

Related issues 3 (0 open, 3 closed)

Related to Ceph - Bug #13783: monitors crashing constantly with 0.94.5 (Duplicate, Joao Eduardo Luis, 11/12/2015)
Related to Ceph - Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) (Resolved, Kefu Chai, 12/05/2015)
Copied to Ceph - Backport #14765: hammer: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade (Resolved, Kefu Chai)

#1

Updated by Logan V over 8 years ago

Apparently the file failed to attach:

Failed to load resource: the server responded with a status of 413 (Request Entity Too Large)
6.2M Nov 10 11:03 2015-11-10-moncrash-4

Now it is 323k after bzip2... seems to upload fine.

#2

Updated by Logan V over 8 years ago

Just got done upgrading all of the OSDs to 0.94.5. The mons were crashing every 2-3 minutes the whole time. After upgrading the last of the OSDs, the mons now seem stable. It has been roughly 15 minutes since the last crash. I will continue to watch and update again if they start crashing.

#3

Updated by Nathan Cutler over 8 years ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph
#4

Updated by Joao Eduardo Luis over 8 years ago

  • Assignee set to Joao Eduardo Luis
#5

Updated by Joao Eduardo Luis over 8 years ago

Can you upload the mon's store.db somewhere or send it via email to ?

This is failing while applying an update from the store, so I'm guessing it's hitting an empty version; having the store should make it easily reproducible locally.

#6

Updated by Tom Verdaat over 8 years ago

I've found similar behavior and opened bug #13783. Hoping this is the same bug but not sure.

#7

Updated by Nathan Cutler over 8 years ago

  • Related to Bug #13783: monitors crashing constantly with 0.94.5 added
#8

Updated by Logan V over 8 years ago

I can send it to you, but the mons are no longer running 0.94.5. They are now on Infernalis. The mons kept crashing even after my last update, just less frequently, and we were only on 0.94.5 because the release notes said it was necessary in order to get to Infernalis from 0.94.3. So we were only running 0.94.5 for as long as it took to upgrade all of the OSDs.

Would the store.db still be useful to you now that the mons are all upgraded off of 0.94.5? If so I will send it over.

#9

Updated by Sage Weil over 8 years ago

  • Priority changed from Normal to High
#10

Updated by Joao Eduardo Luis over 8 years ago

  • Category set to Monitor
  • Status changed from New to Need More Info
  • Priority changed from High to Urgent
  • Source set to Community (user)

Logan and Tom, I've been trying to reproduce this without success.

The store Tom provided appears to work just fine once I start a ceph-mon with 0.94.3 or 0.94.5 on it, without crashing.

I've tried several combinations of daemon versions in a brand new cluster, under several loads, hoping the crash would surface; again, no luck.

Do you have any additional info on your setup that could help me move forward with this? For instance, do you have any custom/non-default settings in your ceph.conf -- especially related to clog or syslog? How did you perform the upgrade? Mons first, then OSDs? Mixed mon versions at a time, perhaps? Any details you can provide would be most appreciated.

#11

Updated by Logan V over 8 years ago

Hi Joao,

the ceph.conf being used:

[global]
fsid = <uuid>
mon_initial_members = mon1, mon2, mon3
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
mds_cache_size = 500000
mds_standby_replay = true
mon_pg_warn_max_per_osd = 1000

#performance related mount options
osd mount options xfs = rw, noatime, inode64, logbufs=8, nodiratime, nobarrier

#turn down concurrent backfills and restores by default so they dont overload io
osd max backfills = 5
osd recovery max active = 5
osd recovery delay start = 15

#log to syslog only
log file = /dev/null
log to syslog = true
err to syslog = true

[mon]
mon cluster log to syslog = true
mon cluster log file = /dev/null

Upgrade process was:

Mons first, then MDS, then ~200 OSDs. Mixed mon versions for a short period while doing a rolling upgrade of the 3 mons.

#12

Updated by Tom Verdaat over 8 years ago

My experience was with a fresh installation, not an upgrade. I'd like to repeat that the problem did not occur on Infernalis with the same settings!

All potentially relevant config settings below:

[global]
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  mon osd full ratio = .90
  mon osd nearfull ratio = .80
  mon pg warn min per osd = 4
  mon pg warn max per osd = 0
  mon pg warn max object skew = 0
  log max new = 1000
  log max recent = 1000000
  log to stderr = true
  err to stderr = true
  log to syslog = false
  err to syslog = false
  log flush on exit = true
  clog to monitors = true
  clog to syslog = false
  mon cluster log to syslog = false
  debug client = 0/5
  debug default = 0/5
  debug lockdep = 0/5
  debug context = 0/5
  debug crush = 0/5
  debug buffer = 0/0
  debug timer = 0/5
  debug filer = 0/5
  debug objecter = 0/0
  debug rados = 0/5
  debug rbd = 0/5
  debug journaler = 0/5
  debug objectcacher = 0/5
  debug optracker = 0/5
  debug objclass = 0/5
  debug filestore = 0/5
  debug ms = 0/5
  debug tp = 0/5
  debug finisher = 0/5
  debug heartbeatmap = 0/5
  debug perfcounter = 0/5
  debug rgw = 1/5
  debug javaclient = 1/5
  debug asok = 0/5
  debug throttle = 0/5

[mon]
  debug mon = 0/5
  debug paxos = 0/5
  debug auth = 0/5
#13

Updated by Edward Huyer about 8 years ago

I think I'm seeing this problem as well, and would prefer not to upgrade to Infernalis. Currently I'm partially working around it by changing the restart parameters in /etc/init/ceph-mon.conf.

This is on Ubuntu 14.04 running Hammer 0.94.5.
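
For anyone wanting the same stopgap: the tweak amounts to loosening upstart's respawn settings in /etc/init/ceph-mon.conf so the mon gets restarted after each crash instead of upstart giving up on it. A minimal sketch, assuming the packaged job already carries a respawn stanza; the limit values here are made up, not the stock ones:

  # /etc/init/ceph-mon.conf (excerpt), hypothetical respawn tuning
  respawn
  # allow up to 20 restarts within 180 seconds before upstart stops trying
  respawn limit 20 180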

When one crashes, it spits the following segfault info into the error log. At around the same time (within a minute or so) the other monitors will crash as well.

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-mon() [0x9adefa]
 2: (()+0x10340) [0x7f7df8d92340]
 3: (std::_Rb_tree<std::string, std::pair<std::string const, std::string>, std::_Select1st<std::pair<std::string const, std::string> >, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > >::find(std::string const&) const+0x25) [0x6518e5]
 4: (get_str_map_key(std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::string const&, std::string const*)+0x1e) [0x8a002e]
 5: (LogMonitor::update_from_paxos(bool*)+0x87a) [0x6b0a5a]
 6: (PaxosService::refresh(bool*)+0x19a) [0x60432a]
 7: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b03db]
 8: (Monitor::init_paxos()+0x85) [0x5b0745]
 9: (Monitor::sync_finish(unsigned long)+0x26a) [0x5c826a]
 10: (Monitor::handle_sync_chunk(MMonSync*)+0xc93) [0x5c9513]
 11: (Monitor::handle_sync(MMonSync*)+0x1b3) [0x5c9b13]
 12: (Monitor::dispatch(MonSession*, Message*, bool)+0x781) [0x5cf841]
 13: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x5cfe36]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x5edb43]
 15: (DispatchQueue::entry()+0x649) [0x929679]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c99cd]
 17: (()+0x8182) [0x7f7df8d8a182]
 18: (clone()+0x6d) [0x7f7df72f547d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
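
As a reading aid for frames 3-5: the helper in frame 4 is just a map lookup with an optional fallback key, so a crash inside std::_Rb_tree::find means the map (or fallback key) it was handed sits in invalid memory, e.g. state rebuilt from a bad value in the mon store during the sync shown in frames 9-11. Below is a minimal standalone sketch of that lookup shape; it is an illustration with assumed names and data, not the actual Ceph code or a confirmed root cause.

  // Illustration only (assumed names/values, not the Ceph source):
  // the lookup shape of get_str_map_key from frames 3-4 above.
  #include <iostream>
  #include <map>
  #include <string>

  static std::string get_str_map_key(const std::map<std::string, std::string>& str_map,
                                     const std::string& key,
                                     const std::string* fallback_key) {
    auto it = str_map.find(key);                 // frame 3 lands inside this find()
    if (it != str_map.end())
      return it->second;
    if (fallback_key) {                          // optional fallback channel name
      it = str_map.find(*fallback_key);
      if (it != str_map.end())
        return it->second;
    }
    return std::string();
  }

  int main() {
    std::map<std::string, std::string> channels{{"cluster", "syslog"}};
    const std::string fallback = "default";
    std::cout << get_str_map_key(channels, "audit", &fallback) << "\n";  // empty result, no crash
    // If 'channels' (or 'fallback') referred to memory that was never properly
    // initialized -- e.g. state decoded from a bad or empty store value -- the
    // find() calls above would walk garbage tree nodes and segfault exactly as
    // in the trace, with the fault surfacing far from the real problem.
    return 0;
  }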

Digging back through the logs, it seems like once all three monitors are up and stable, they will mostly remain that way for a while, but if you see one monitor segfault, you'll get a burst of crashes from all of them.

My ceph.conf is fairly unremarkable.

[global]
fsid = [redacted]
mon_initial_members = hydra0
mon_host = [redacted]
public network = [redacted]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log file = none
log to syslog = true
err to syslog = true
osd pool default pg num = 512
osd pool default pgp num = 512

[mon]
mon cluster log to syslog = true
mon cluster log file = none

#14

Updated by Kefu Chai about 8 years ago

  • Related to Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added
#15

Updated by Kefu Chai about 8 years ago

The mon crash is also observed in e1b92081c9e4b21eb30cc873c239083a08fce12f:

that is, we see the mon segfault nearly every time we create a snapshot. Tom mentioned above, about a month ago, that we were seeing this issue, and it still persists. One problem we're looking at is that I don't see how we can upgrade, even if we do get a fix for this OSD map cache issue, without the mon segfault issue resolved.

The stack trace on the mon segfault looks the same as the one referenced at http://tracker.ceph.com/issues/13748#note-13.

#16

Updated by Kefu Chai about 8 years ago

  • Status changed from Need More Info to In Progress

This issue was addressed by https://github.com/ceph/ceph/pull/5148; it should have been backported to hammer...

#17

Updated by Kefu Chai about 8 years ago

  • Status changed from In Progress to Fix Under Review
#18

Updated by Loïc Dachary about 8 years ago

  • Status changed from Fix Under Review to Pending Backport
#19

Updated by Loïc Dachary about 8 years ago

  • Backport set to hammer
#20

Updated by Loïc Dachary about 8 years ago

  • Copied to Backport #14765: hammer: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade added
#21

Updated by Kefu Chai about 8 years ago

  • Status changed from Pending Backport to Resolved
  • Assignee changed from Joao Eduardo Luis to Kefu Chai
