Bug #13748


ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade

Added by Logan V over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: hammer
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was posted about on ceph-users, but I now have a debug log and didn't want to send a 6 MB file to the list.

I am seeing constant mon crashes since upgrading from 0.94.3 to 0.94.5 this morning. I am still upgrading the OSDs so it is yet to be seen whether these crashes stop occurring after the OSDs are all running 0.94.5.

Debug log from one of the mons is attached.


Files

2015-11-10-moncrash-4.bz2 (323 KB) - debug from mon crash - Logan V, 11/10/2015 05:11 PM

Related issues 3 (0 open, 3 closed)

Related to Ceph - Bug #13783: monitors crashing constantly with 0.94.5 (Duplicate, Joao Eduardo Luis, 11/12/2015)
Related to Ceph - Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) (Resolved, Kefu Chai, 12/05/2015)
Copied to Ceph - Backport #14765: hammer: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade (Resolved, Kefu Chai)

#1

Updated by Logan V over 8 years ago

Apparently the file failed to attach:

Failed to load resource: the server responded with a status of 413 (Request Entity Too Large)
6.2M Nov 10 11:03 2015-11-10-moncrash-4

Now it is 323k after bzip2... seems to upload fine.

#2

Updated by Logan V over 8 years ago

Just got done upgrading all of the OSDs to 0.94.5. The mons were crashing every 2-3 minutes the whole time. After upgrading the last of the OSDs, the mons now seem stable. It has been roughly 15 minutes since the last crash. I will continue to watch and update again if they start crashing.

#3

Updated by Nathan Cutler over 8 years ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph
#4

Updated by Joao Eduardo Luis over 8 years ago

  • Assignee set to Joao Eduardo Luis
#5

Updated by Joao Eduardo Luis over 8 years ago

Can you upload the mon's store.db somewhere or send it via email to ?

This is failing while applying an update from the store, so I'm guessing it's hitting an empty version; having the store should make it easily reproducible locally.

#6

Updated by Tom Verdaat over 8 years ago

I've found similar behavior and opened bug #13783. Hoping this is the same bug but not sure.

#7

Updated by Nathan Cutler over 8 years ago

  • Related to Bug #13783: monitors crashing constantly with 0.94.5 added
#8

Updated by Logan V over 8 years ago

I can send it to you, but the mons are no longer running 0.94.5. They are now on Infernalis. The mons kept crashing even after my last update, just less frequently, and we were only on 0.94.5 because the release notes said it was necessary in order to get to Infernalis from 0.94.3. So we were only running 0.94.5 for as long as it took to upgrade all of the OSDs.

Would the store.db still be useful to you now that the mons are all upgraded off of 0.94.5? If so I will send it over.

#9

Updated by Sage Weil over 8 years ago

  • Priority changed from Normal to High
#10

Updated by Joao Eduardo Luis over 8 years ago

  • Category set to Monitor
  • Status changed from New to Need More Info
  • Priority changed from High to Urgent
  • Source set to Community (user)

Logan and Tom, I've been trying to reproduce this without success.

The store Tom provided appears to work just fine once I start a ceph-mon with 0.94.3 or 0.94.5 on it, without crashing.

I've tried several combinations of daemon versions in a brand new cluster, under several loads, hoping the crash would surface; again, no luck.

Do you have any additional info on your setup that could help me move forward with this? For instance, do you have any custom/non-default settings in your ceph.conf -- especially related to clog or syslog? How did you perform the upgrade? Mons first, then OSDs? Mixed mon versions at a time, perhaps? Any details you can provide would be most appreciated.

#11

Updated by Logan V over 8 years ago

Hi Joao,

the ceph.conf being used:

[global]
fsid = <uuid>
mon_initial_members = mon1, mon2, mon3
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
mds_cache_size = 500000
mds_standby_replay = true
mon_pg_warn_max_per_osd = 1000

#performance related mount options
osd mount options xfs = rw, noatime, inode64, logbufs=8, nodiratime, nobarrier

#turn down concurrent backfills and restores by default so they dont overload io
osd max backfills = 5
osd recovery max active = 5
osd recovery delay start = 15

#log to syslog only
log file = /dev/null
log to syslog = true
err to syslog = true

[mon]
mon cluster log to syslog = true
mon cluster log file = /dev/null

Upgrade process was:

Mons first, then MDS, then ~200 OSDs. Mixed mon versions for a short period while doing a rolling upgrade of the 3 mons.

#12

Updated by Tom Verdaat over 8 years ago

My experience was with a fresh installation, not an upgrade. I'd like to repeat that the problem did not occur on Infernalis with the same settings!

All potentially relevant config settings below:

[global]
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  mon osd full ratio = .90
  mon osd nearfull ratio = .80
  mon pg warn min per osd = 4
  mon pg warn max per osd = 0
  mon pg warn max object skew = 0
  log max new = 1000
  log max recent = 1000000
  log to stderr = true
  err to stderr = true
  log to syslog = false
  err to syslog = false
  log flush on exit = true
  clog to monitors = true
  clog to syslog = false
  mon cluster log to syslog = false
  debug client = 0/5
  debug default = 0/5
  debug lockdep = 0/5
  debug context = 0/5
  debug crush = 0/5
  debug buffer = 0/0
  debug timer = 0/5
  debug filer = 0/5
  debug objecter = 0/0
  debug rados = 0/5
  debug rbd = 0/5
  debug journaler = 0/5
  debug objectcacher = 0/5
  debug optracker = 0/5
  debug objclass = 0/5
  debug filestore = 0/5
  debug ms = 0/5
  debug tp = 0/5
  debug finisher = 0/5
  debug heartbeatmap = 0/5
  debug perfcounter = 0/5
  debug rgw = 1/5
  debug javaclient = 1/5
  debug asok = 0/5
  debug throttle = 0/5

[mon]
  debug mon = 0/5
  debug paxos = 0/5
  debug auth = 0/5
#13

Updated by Edward Huyer about 8 years ago

I think I'm seeing this problem as well, and would prefer not to upgrade to Infernalis. Currently I'm partially working around it by changing the restart parameters in /etc/init/ceph-mon.conf.

This is on Ubuntu 14.04 running Hammer 0.94.5.
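
For anyone wanting the same stopgap: the tweak amounts to loosening upstart's respawn settings in /etc/init/ceph-mon.conf so the mon gets restarted after each crash instead of upstart giving up on it. A minimal sketch, assuming the packaged job already carries a respawn stanza; the limit values here are made up, not the stock ones:

  # /etc/init/ceph-mon.conf (excerpt), hypothetical respawn tuning
  respawn
  # allow up to 20 restarts within 180 seconds before upstart stops trying
  respawn limit 20 180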

When one crashes, it spits the following segfault info into the error log. At around the same time (within a minute or so) the other monitors will crash as well.

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-mon() [0x9adefa]
 2: (()+0x10340) [0x7f7df8d92340]
 3: (std::_Rb_tree<std::string, std::pair<std::string const, std::string>, std::_Select1st<std::pair<std::string const, std::string> >, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > >::find(std::string const&) const+0x25) [0x6518e5]
 4: (get_str_map_key(std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::string const&, std::string const*)+0x1e) [0x8a002e]
 5: (LogMonitor::update_from_paxos(bool*)+0x87a) [0x6b0a5a]
 6: (PaxosService::refresh(bool*)+0x19a) [0x60432a]
 7: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b03db]
 8: (Monitor::init_paxos()+0x85) [0x5b0745]
 9: (Monitor::sync_finish(unsigned long)+0x26a) [0x5c826a]
 10: (Monitor::handle_sync_chunk(MMonSync*)+0xc93) [0x5c9513]
 11: (Monitor::handle_sync(MMonSync*)+0x1b3) [0x5c9b13]
 12: (Monitor::dispatch(MonSession*, Message*, bool)+0x781) [0x5cf841]
 13: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x5cfe36]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x5edb43]
 15: (DispatchQueue::entry()+0x649) [0x929679]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c99cd]
 17: (()+0x8182) [0x7f7df8d8a182]
 18: (clone()+0x6d) [0x7f7df72f547d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
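
As a reading aid for frames 3-5: the helper in frame 4 is just a map lookup with an optional fallback key, so a crash inside std::_Rb_tree::find means the map (or fallback key) it was handed sits in invalid memory, e.g. state rebuilt from a bad value in the mon store during the sync shown in frames 9-11. Below is a minimal standalone sketch of that lookup shape; it is an illustration with assumed names and data, not the actual Ceph code or a confirmed root cause.

  // Illustration only (assumed names/values, not the Ceph source):
  // the lookup shape of get_str_map_key from frames 3-4 above.
  #include <iostream>
  #include <map>
  #include <string>

  static std::string get_str_map_key(const std::map<std::string, std::string>& str_map,
                                     const std::string& key,
                                     const std::string* fallback_key) {
    auto it = str_map.find(key);                 // frame 3 lands inside this find()
    if (it != str_map.end())
      return it->second;
    if (fallback_key) {                          // optional fallback channel name
      it = str_map.find(*fallback_key);
      if (it != str_map.end())
        return it->second;
    }
    return std::string();
  }

  int main() {
    std::map<std::string, std::string> channels{{"cluster", "syslog"}};
    const std::string fallback = "default";
    std::cout << get_str_map_key(channels, "audit", &fallback) << "\n";  // empty result, no crash
    // If 'channels' (or 'fallback') referred to memory that was never properly
    // initialized -- e.g. state decoded from a bad or empty store value -- the
    // find() calls above would walk garbage tree nodes and segfault exactly as
    // in the trace, with the fault surfacing far from the real problem.
    return 0;
  }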

Digging back through the logs, it seems like once all three monitors are up and stable, they will mostly remain that way for a while, but if you see one monitor segfault, you'll get a burst of crashes from all of them.

My ceph.conf is fairly unremarkable.

[global]
fsid = [redacted]
mon_initial_members = hydra0
mon_host = [redacted]
public network = [redacted]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log file = none
log to syslog = true
err to syslog = true
osd pool default pg num = 512
osd pool default pgp num = 512

[mon]
mon cluster log to syslog = true
mon cluster log file = none

#14

Updated by Kefu Chai about 8 years ago

  • Related to Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added
#15

Updated by Kefu Chai about 8 years ago

The mon crash is also observed in e1b92081c9e4b21eb30cc873c239083a08fce12f:

that is, we see the mon segfault nearly every time we create a snapshot. Tom mentioned above, about a month ago, that we were seeing this issue, and it still persists. One problem we're looking at is that I don't see how we can upgrade, even if we do get a fix for this OSD map cache issue, without the mon segfault issue resolved.

The stack trace on the mon segfault looks the same as the one referenced at http://tracker.ceph.com/issues/13748#note-13.

#16

Updated by Kefu Chai about 8 years ago

  • Status changed from Need More Info to In Progress

This issue was addressed by https://github.com/ceph/ceph/pull/5148; it should have been backported to hammer...

#17

Updated by Kefu Chai about 8 years ago

  • Status changed from In Progress to Fix Under Review
#18

Updated by Loïc Dachary about 8 years ago

  • Status changed from Fix Under Review to Pending Backport
#19

Updated by Loïc Dachary about 8 years ago

  • Backport set to hammer
#20

Updated by Loïc Dachary about 8 years ago

  • Copied to Backport #14765: hammer: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade added
#21

Updated by Kefu Chai about 8 years ago

  • Status changed from Pending Backport to Resolved
  • Assignee changed from Joao Eduardo Luis to Kefu Chai
