Bug #13748
ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade
Status: Closed
Description
This was originally posted to ceph-users, but I now have a debug log and didn't want to send a 6 MB file to the list.
I am seeing constant mon crashes since upgrading from 0.94.3 to 0.94.5 this morning. I am still upgrading the OSDs, so it remains to be seen whether these crashes stop once the OSDs are all running 0.94.5.
Debug log from one of the mons is attached.
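For context, a debug log like the attached one is typically produced by raising the mon log levels in ceph.conf; a minimal sketch with common troubleshooting levels (the exact values here are illustrative):

    [mon]
    debug mon = 20
    debug ms = 1
    debug paxos = 10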
Updated by Logan V over 8 years ago
Apparently the file failed to attach:
Failed to load resource: the server responded with a status of 413 (Request Entity Too Large)
6.2M Nov 10 11:03 2015-11-10-moncrash-4
Now it is 323k after bzip2... seems to upload fine.
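For anyone else hitting the 413 limit: compressing the log before uploading is enough, e.g.:

    bzip2 -9 2015-11-10-moncrash-4   # yields 2015-11-10-moncrash-4.bz2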
Updated by Logan V over 8 years ago
Just got done upgrading all of the osds to 0.94.5. The mons were crashing every 2-3 minutes the whole time. After upgrading the last of the OSDs the mons now seem stable. It has been roughly 15 minutes since the last crash. I will continue to watch and update again if they start crashing.
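For anyone watching for the same pattern, mon stability during an upgrade like this can be tracked with the standard status commands:

    ceph -s              # overall health, including mon quorum
    ceph quorum_status   # which mons are currently in quorum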
Updated by Nathan Cutler over 8 years ago
- Tracker changed from Tasks to Bug
- Project changed from Stable releases to Ceph
Updated by Joao Eduardo Luis over 8 years ago
Can you upload the mon's store.db somewhere, or send it via email to joao@suse.de?
This is failing while applying an update from the store, so I'm guessing it's an empty or otherwise malformed version; having the store should make it easily reproducible locally.
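If it helps anyone in the same position, the mon store can be packaged for sending with something like this (a sketch; the path is the default mon data dir, and <id> is the mon's id):

    tar czf mon-store.tar.gz -C /var/lib/ceph/mon/ceph-<id> store.db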
Updated by Tom Verdaat over 8 years ago
I've found similar behavior and opened bug #13783. Hoping this is the same bug but not sure.
Updated by Nathan Cutler over 8 years ago
- Related to Bug #13783: monitors crashing constantly with 0.94.5 added
Updated by Logan V over 8 years ago
I can send it to you, but the mons are no longer running 0.94.5; they are now on Infernalis. The mons kept crashing even after my last update, just less frequently. We were only on 0.94.5 because the release notes said it was necessary for getting from 0.94.3 to Infernalis, so we ran 0.94.5 only as long as it took to upgrade all of the OSDs.
Would the store.db still be useful to you now that the mons have been upgraded off of 0.94.5? If so, I will send it over.
Updated by Joao Eduardo Luis over 8 years ago
- Category set to Monitor
- Status changed from New to Need More Info
- Priority changed from High to Urgent
- Source set to Community (user)
Logan and Tom, I've been trying to reproduce this without success.
The store Tom provided appears to work just fine once I start a ceph-mon with 0.94.3 or 0.94.5 on it, without crashing.
I've tried several combinations of daemon versions in a brand new cluster, under several loads, hoping the crash would surface; again, no luck.
Do you have any additional info about your setup that could help me make progress on this? For instance, do you have any custom/non-default settings in your ceph.conf -- especially related to clog or syslog? How did you perform the upgrade? Mons first, then OSDs? Mixed mon versions for a period, perhaps? Any details you can provide would be most appreciated.
Updated by Logan V over 8 years ago
Hi Joao,
the ceph.conf being used:
[global]
fsid = <uuid>
mon_initial_members = mon1, mon2, mon3
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
mds_cache_size = 500000
mds_standby_replay = true
mon_pg_warn_max_per_osd = 1000

#performance related mount options
osd mount options xfs = rw, noatime, inode64, logbufs=8, nodiratime, nobarrier

#turn down concurrent backfills and restores by default so they dont overload io
osd max backfills = 5
osd recovery max active = 5
osd recovery delay start = 15

#log to syslog only
log file = /dev/null
log to syslog = true
err to syslog = true

[mon]
mon cluster log to syslog = true
mon cluster log file = /dev/null
Upgrade process was:
mons first, then mds, then ~200 OSDs. Mixed mon versions for a short period while doing a rolling upgrade on the 3 mons.
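For reference, with the stock Hammer-era Upstart jobs on Ubuntu, each node's step in such a rolling upgrade looks roughly like this (a sketch; the ids are illustrative):

    sudo apt-get update && sudo apt-get install -y ceph   # upgrade the packages
    sudo restart ceph-mon id=mon1                         # restart one mon
    sudo restart ceph-osd id=12                           # restart one osd at a time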
Updated by Tom Verdaat over 8 years ago
My experience was with a fresh installation, not an upgrade. I'd like to reiterate that the problem did not occur on Infernalis with the same settings!
All potentially relevant config settings below:
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
mon osd full ratio = .90
mon osd nearfull ratio = .80
mon pg warn min per osd = 4
mon pg warn max per osd = 0
mon pg warn max object skew = 0
log max new = 1000
log max recent = 1000000
log to stderr = true
err to stderr = true
log to syslog = false
err to syslog = false
log flush on exit = true
clog to monitors = true
clog to syslog = false
mon cluster log to syslog = false
debug client = 0/5
debug default = 0/5
debug lockdep = 0/5
debug context = 0/5
debug crush = 0/5
debug buffer = 0/0
debug timer = 0/5
debug filer = 0/5
debug objecter = 0/0
debug rados = 0/5
debug rbd = 0/5
debug journaler = 0/5
debug objectcacher = 0/5
debug optracker = 0/5
debug objclass = 0/5
debug filestore = 0/5
debug ms = 0/5
debug tp = 0/5
debug finisher = 0/5
debug heartbeatmap = 0/5
debug perfcounter = 0/5
debug rgw = 1/5
debug javaclient = 1/5
debug asok = 0/5
debug throttle = 0/5

[mon]
debug mon = 0/5
debug paxos = 0/5
debug auth = 0/5
Updated by Edward Huyer about 8 years ago
I think I'm seeing this problem as well, and would prefer not to upgrade to Infernalis. Currently I'm partially working around it by changing the restart parameters in /etc/init/ceph-mon.conf.
This is on Ubuntu 14.04 running Hammer 0.94.5.
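The restart-parameter change presumably refers to Upstart's respawn stanza; a sketch of the kind of edit meant (the unlimited limit is one option, not necessarily what Edward used):

    # /etc/init/ceph-mon.conf
    respawn
    respawn limit unlimited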
When one crashes, it spits the following segfault info into the error log. At around the same time (within a minute or so) the other monitors will crash as well.
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-mon() [0x9adefa]
 2: (()+0x10340) [0x7f7df8d92340]
 3: (std::_Rb_tree<std::string, std::pair<std::string const, std::string>, std::_Select1st<std::pair<std::string const, std::string> >, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > >::find(std::string const&) const+0x25) [0x6518e5]
 4: (get_str_map_key(std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::string const&, std::string const*)+0x1e) [0x8a002e]
 5: (LogMonitor::update_from_paxos(bool*)+0x87a) [0x6b0a5a]
 6: (PaxosService::refresh(bool*)+0x19a) [0x60432a]
 7: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b03db]
 8: (Monitor::init_paxos()+0x85) [0x5b0745]
 9: (Monitor::sync_finish(unsigned long)+0x26a) [0x5c826a]
 10: (Monitor::handle_sync_chunk(MMonSync*)+0xc93) [0x5c9513]
 11: (Monitor::handle_sync(MMonSync*)+0x1b3) [0x5c9b13]
 12: (Monitor::dispatch(MonSession*, Message*, bool)+0x781) [0x5cf841]
 13: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x5cfe36]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x5edb43]
 15: (DispatchQueue::entry()+0x649) [0x929679]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c99cd]
 17: (()+0x8182) [0x7f7df8d8a182]
 18: (clone()+0x6d) [0x7f7df72f547d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Digging back through the logs, it seems like once all three monitors are up and stable, they will mostly remain that way for a while, but if you see one monitor segfault, you'll get a burst of crashes from all of them.
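As orientation for frame 4 of the trace: get_str_map_key is a plain map lookup, so a segfault inside std::map::find points at the map reference itself being invalid (e.g. built from a garbled store update) rather than at the lookup logic. A paraphrased sketch of such a helper, not Ceph's exact source:

    #include <map>
    #include <string>

    // Paraphrase of a get_str_map_key-style helper (cf. frame 4 of the trace);
    // Ceph's actual implementation in src/common/str_map.cc may differ.
    std::string get_str_map_key(const std::map<std::string, std::string>& str_map,
                                const std::string& key,
                                const std::string* fallback_key)
    {
        // frame 3 of the trace is this find(); it cannot fault on a valid map
        std::map<std::string, std::string>::const_iterator p = str_map.find(key);
        if (p != str_map.end())
            return p->second;
        if (fallback_key != NULL) {
            p = str_map.find(*fallback_key);  // optional fallback lookup
            if (p != str_map.end())
                return p->second;
        }
        return std::string();
    }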
My ceph.conf is fairly unremarkable.
[global]
fsid = [redacted]
mon_initial_members = hydra0
mon_host = [redacted]
public network = [redacted]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log file = none
log to syslog = true
err to syslog = true
osd pool default pg num = 512
osd pool default pgp num = 512

[mon]
mon cluster log to syslog = true
mon cluster log file = none
Updated by Kefu Chai about 8 years ago
- Related to Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added
Updated by Kefu Chai about 8 years ago
The mon crash is also observed in e1b92081c9e4b21eb30cc873c239083a08fce12f.
That is, we see the mon segfault nearly every time we create a snapshot. Tom mentioned above, about a month ago, that we were seeing this issue, and it still persists. One problem we're looking at is that, even if we do get a fix for the osd map cache issue, I don't see how we can upgrade without the mon segfault issue being resolved.
The stack trace of the mon segfault looks the same as the one referenced at http://tracker.ceph.com/issues/13748#note-13.
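The report doesn't say which kind of snapshot triggers it; assuming a pool or an RBD snapshot, the standard commands would be:

    ceph osd pool mksnap <pool> <snapname>
    rbd snap create <pool>/<image>@<snapname>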
Updated by Kefu Chai about 8 years ago
- Status changed from Need More Info to In Progress
This issue was addressed by https://github.com/ceph/ceph/pull/5148. It should have been backported to hammer...
Updated by Kefu Chai about 8 years ago
- Status changed from In Progress to Fix Under Review
Updated by Loïc Dachary about 8 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Loïc Dachary about 8 years ago
- Copied to Backport #14765: hammer: ceph-mons crashing constantly after 0.94.3->0.94.5 upgrade added
Updated by Kefu Chai about 8 years ago
- Status changed from Pending Backport to Resolved
- Assignee changed from Joao Eduardo Luis to Kefu Chai