Bug #17837
ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3
Status: Closed
Description
I have a cluster of three nodes:
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.45993 root default
-2 1.81998     host ceph1
 0 0.90999         osd.0       up  1.00000          1.00000
 1 0.90999         osd.1       up  1.00000          1.00000
-3 1.81998     host ceph2
 2 0.90999         osd.2       up  1.00000          1.00000
 3 0.90999         osd.3       up  1.00000          1.00000
-4 1.81998     host ceph3
 4 0.90999         osd.4       up  1.00000          1.00000
 5 0.90999         osd.5       up  1.00000          1.00000
I upgraded the ceph3 node first, and now the monitor daemon on it won't start; it crashes:
cephus@ceph3:~$ sudo /usr/bin/ceph-mon --cluster=ceph -i ceph3 -f --setuser ceph --setgroup ceph --debug_mon 10
starting mon.ceph3 rank 2 at 192.168.49.103:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph3 fsid 3c58a184-bf27-4273-8000-405513006a7b
mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-11-09 14:58:58.442973 7fb0cf4564c0 -1 *** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Patrick Donnelly over 7 years ago
- Status changed from New to Duplicate
- Parent task set to #16592
- Source changed from other to Community (user)
This looks like a duplicate of 16592 but in a new code path: interestingly in a slave monitor.
Updated by Patrick Donnelly over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (Monitor)
Updated by Patrick Donnelly over 7 years ago
- Is duplicate of Bug #16592: Jewel: monitor asserts on "mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)" added
Updated by John Spray over 7 years ago
- Status changed from Duplicate to Need More Info
Alexander: so hopefully you stopped the upgrade at that point and you still have a working cluster of two hammer mons?
Please could you do a "ceph mds dump --format=json-pretty" and a "ceph mds getmap > mdsmap.bin" and provide the outputs on the #16592 ticket? Hopefully that will make it easy for us to reproduce the issue.
(This may indeed be a duplicate of #16592, but since this one is picked up in sanity immediately and that one was only happening later, it might be something distinct)
Updated by alexander walker over 7 years ago
Yes, I stopped the upgrade, and my cluster is now working with two mon servers.
Perhaps this is helpful: I have a test cluster with the same Ubuntu and Ceph versions, and the upgrade there ran without any problems. The difference is that the productive cluster uses an M.2 SSD for the journal; the two journal partitions are named /dev/nvme0n1p4 and /dev/nvme0n1p5 on each server.
I also had a permissions problem like the one described in http://tracker.ceph.com/issues/15874
Updated by John Spray over 7 years ago
Note to self, dumps are on http://tracker.ceph.com/issues/16592
Updated by John Spray over 7 years ago
- Status changed from Need More Info to In Progress
Updated by John Spray over 7 years ago
- Status changed from In Progress to Need More Info
Hmm, so when I try loading up the mdsmap.bin from http://tracker.ceph.com/issues/16592#change-81117 it is decoding fine and not asserting in sanity().
I guess whatever the crashing mon is loading from its local store on startup is something different from that (maybe an earlier version of the map had something different/confusing in it).
If you install the "ceph-test" package, then you can extract the local mdsmap from the failing mon like this:
ceph-monstore-tool /var/lib/ceph/mon/ceph-ceph3 get mdsmap > mdsmap.bin.local
Updated by alexander walker over 7 years ago
- File mdsmap.bin.local mdsmap.bin.local added
Here is the dump of the local mdsmap.
Updated by John Spray over 7 years ago
- Status changed from Need More Info to In Progress
Thanks, can now reproduce here.
/home/jspray/git/ceph/src/mds/FSMap.cc: In function '(null)' thread 7f868a7bb680 time 2016-11-17 10:27:36.550149
/home/jspray/git/ceph/src/mds/FSMap.cc: 629: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version v10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x7f867e2c73cb]
 2: (FSMap::sanity() const+0x619) [0x7f867e481ac5]
 3: (FSMap::dump(ceph::Formatter*) const+0x29) [0x7f867e47d357]
 4: (DencoderBase<FSMap>::dump(ceph::Formatter*)+0x27) [0x13fda49]
 5: (main()+0xba94) [0x11f6ffb]
 6: (__libc_start_main()+0xf0) [0x7f8679328700]
 7: (_start()+0x29) [0x11eabe9]

{
    "epoch": 401,
    "compat": {
        "compat": {},
        "ro_compat": {},
        "incompat": {
            "feature_1": "base v0.20",
            "feature_2": "client writeable ranges",
            "feature_3": "default file layouts on dirs",
            "feature_4": "dir inode in separate object",
            "feature_5": "mds uses versioned encoding",
            "feature_6": "dirfrag is stored in omap",
            "feature_8": "no anchor table"
        }
    },
    "feature_flags": {
        "enable_multiple": false,
        "ever_enabled_multiple": false
    },
    "standbys": [
        {
            "gid": 5854102,
            "name": "ceph2.aditosoftware.local",
            "rank": -1,
            "incarnation": 0,
            "state": "up:standby",
            "state_seq": 1,
            "addr": "192.168.49.102:6800\/1261",
            "standby_for_rank": -1,
            "standby_for_fscid": -1,
            "standby_for_name": "",
            "standby_replay": false,
            "export_targets": [],
            "features": 0,
            "epoch": 401
        },
        {
            "gid": 5994101,
            "name": "ceph3.aditosoftware.local",
            "rank": -1,
            "incarnation": 0,
            "state": "down:dne",
            "state_seq": 22,
            "addr": "192.168.49.103:6800\/29296",
            "laggy_since": "2016-11-08 14:38:41.582432",
            "standby_for_rank": -1,
            "standby_for_fscid": -1,
            "standby_for_name": "",
            "standby_replay": false,
            "export_targets": [],
            "features": 0,
            "epoch": 401
        }
    ],
    "filesystems": [
        {
            "mdsmap": {
                "epoch": 401,
                "flags": 0,
                "ever_allowed_features": 0,
                "explicitly_allowed_features": 0,
                "created": "2016-03-11 14:24:45.516358",
                "modified": "2016-11-08 14:38:41.582500",
                "tableserver": 0,
                "root": 0,
                "session_timeout": 60,
                "session_autoclose": 300,
                "max_file_size": 1099511627776,
                "last_failure": 395,
                "last_failure_osd_epoch": 1328,
                "compat": {
                    "compat": {},
                    "ro_compat": {},
                    "incompat": {
                        "feature_1": "base v0.20",
                        "feature_2": "client writeable ranges",
                        "feature_3": "default file layouts on dirs",
                        "feature_4": "dir inode in separate object",
                        "feature_5": "mds uses versioned encoding",
                        "feature_6": "dirfrag is stored in omap",
                        "feature_8": "no anchor table"
                    }
                },
                "max_mds": 1,
                "in": [0],
                "up": {
                    "mds_0": 5854219
                },
                "failed": [],
                "damaged": [],
                "stopped": [],
                "info": {
                    "gid_5854219": {
                        "gid": 5854219,
                        "name": "ceph1.aditosoftware.local",
                        "rank": 0,
                        "incarnation": 41,
                        "state": "up:active",
                        "state_seq": 111157,
                        "addr": "192.168.49.101:6800\/1287",
                        "standby_for_rank": -1,
                        "standby_for_fscid": -1,
                        "standby_for_name": "",
                        "standby_replay": false,
                        "export_targets": [],
                        "features": 0
                    }
                },
                "data_pools": [1],
                "metadata_pool": 2,
                "enabled": true,
                "fs_name": "cephfs_fs"
            },
            "id": 0
        }
    ]
}
The down:dne standby is the problem; I will look into how that might have got there and make sure we handle the case properly.
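For anyone hitting this before a fixed build is available, the invariant that sanity() asserts can be checked against a "ceph mds dump --format=json-pretty" output without a debug build. The sketch below is illustrative only (it is not the Ceph source); the field names ("standbys", "state", "name") are taken from the JSON dump above, and the embedded sample is a trimmed excerpt of that dump.

```python
import json

def bad_standbys(fsmap):
    """Return names of standby entries that would trip the
    FSMap::sanity() assert, i.e. whose state is not up:standby."""
    return [s["name"] for s in fsmap.get("standbys", [])
            if s["state"] != "up:standby"]

# Trimmed excerpt of the dump above, keeping only the fields this check reads.
dump = json.loads("""
{
  "standbys": [
    {"name": "ceph2.aditosoftware.local", "state": "up:standby"},
    {"name": "ceph3.aditosoftware.local", "state": "down:dne"}
  ]
}
""")

print(bad_standbys(dump))  # -> ['ceph3.aditosoftware.local']
```

Run against the full dump, this flags exactly the down:dne entry for ceph3 that the monitor asserts on.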
Updated by John Spray over 7 years ago
- Status changed from In Progress to Fix Under Review
Updated by Greg Farnum over 7 years ago
- Status changed from Fix Under Review to 17
Updated by alexander walker over 7 years ago
I could test the changes; do I have to compile the project myself?
Updated by John Spray over 7 years ago
- Status changed from 17 to Pending Backport
Updated by John Spray over 7 years ago
Alexander: I've pushed a backport of this to jewel to a branch called wip-17837-jewel. It will build in an hour or two and then be accessible via the gitbuilder server:
http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/
Updated by Loïc Dachary over 7 years ago
- Copied to Backport #18100: jewel: ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3 added
Updated by Patrick Donnelly almost 7 years ago
- Status changed from Pending Backport to Resolved