Bug #17837

closed

ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3

Added by alexander walker over 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've a cluster of three nodes:

ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.45993 root default
-2 1.81998     host ceph1
 0 0.90999         osd.0       up  1.00000          1.00000
 1 0.90999         osd.1       up  1.00000          1.00000
-3 1.81998     host ceph2
 2 0.90999         osd.2       up  1.00000          1.00000
 3 0.90999         osd.3       up  1.00000          1.00000
-4 1.81998     host ceph3
 4 0.90999         osd.4       up  1.00000          1.00000
 5 0.90999         osd.5       up  1.00000          1.00000

I upgraded the ceph3 node first, and now I can't start the monitor daemon. It crashes:


cephus@ceph3:~$ sudo /usr/bin/ceph-mon --cluster=ceph -i ceph3 -f --setuser ceph --setgroup ceph --debug_mon 10
starting mon.ceph3 rank 2 at 192.168.49.103:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph3 fsid 3c58a184-bf27-4273-8000-405513006a7b
mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-11-09 14:58:58.440166 7fb0cf4564c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2016-11-09 14:58:58.440166 7fb0cf4564c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
2016-11-09 14:58:58.442973 7fb0cf4564c0 -1 *** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2016-11-09 14:58:58.442973 7fb0cf4564c0 -1 *** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
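
The log above prints the same failure several times (once on stderr, once as a log line, and once in the recent-events dump). A minimal sketch for collapsing a pasted log like this down to its distinct assertion failures; the function name `unique_asserts` and the regular expression are illustrative assumptions, not part of any Ceph tooling:

```python
import re

# Matches lines of the form "<file>: <line>: FAILED assert(<expr>)" as seen in the log above.
ASSERT_RE = re.compile(r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$")


def unique_asserts(log_text):
    """Return distinct (file, line, expression) tuples, in order of first appearance."""
    seen = []
    for raw in log_text.splitlines():
        m = ASSERT_RE.match(raw.strip())
        if m:
            key = (m["file"], int(m["line"]), m["expr"])
            if key not in seen:
                seen.append(key)
    return seen


sample_log = """\
mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
"""

print(unique_asserts(sample_log))
# [('mds/FSMap.cc', 628, 'i.second.state == MDSMap::STATE_STANDBY')]
```

Run against the full paste, this should show that every trace stems from the single assert in `FSMap::sanity()`.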


Files

mdsmap.bin.local (1.09 KB) mdsmap.bin.local alexander walker, 11/17/2016 06:26 AM

Related issues 2 (1 open, 1 closed)

Is duplicate of CephFS - Bug #16592: Jewel: monitor asserts on "mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)" (Need More Info, 11/09/2016)

Copied to CephFS - Backport #18100: jewel: ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3 (Resolved, John Spray)
Actions #1

Updated by Patrick Donnelly over 7 years ago

  • Status changed from New to Duplicate
  • Parent task set to #16592
  • Source changed from other to Community (user)

This looks like a duplicate of 16592 but in a new code path: interestingly in a slave monitor.

Actions #2

Updated by Patrick Donnelly over 7 years ago

  • Parent task deleted (#16592)
Actions #3

Updated by Patrick Donnelly over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (Monitor)
Actions #4

Updated by Patrick Donnelly over 7 years ago

  • Is duplicate of Bug #16592: Jewel: monitor asserts on "mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)" added
Actions #5

Updated by John Spray over 7 years ago

  • Status changed from Duplicate to Need More Info

Alexander: so hopefully you stopped the upgrade at that point and you still have a working cluster of two hammer mons?

Please could you do a "ceph mds dump --format=json-pretty" and a "ceph mds getmap > mdsmap.bin" and provide the outputs on the #16592 ticket? Hopefully that will make it easy for us to reproduce the issue.

(This may indeed be a duplicate of #16592, but since this one is picked up in sanity immediately and that one was only happening later, it might be something distinct)

Actions #6

Updated by alexander walker over 7 years ago

Yes, I've stopped the upgrade, and my cluster is now working with two mon servers.

Perhaps this is helpful: I have a test cluster with the same Ubuntu and Ceph versions, and the upgrade ran there without any problems. The difference is that the production cluster uses an M.2 SSD for the journals; the two journal partitions are /dev/nvme0n1p4 and /dev/nvme0n1p5 on each server.
I had a problem with the permissions, like the one described here: http://tracker.ceph.com/issues/15874

Actions #7

Updated by John Spray over 7 years ago

  • Assignee set to John Spray
Actions #8

Updated by John Spray over 7 years ago

Note to self, dumps are on http://tracker.ceph.com/issues/16592

Actions #9

Updated by John Spray over 7 years ago

  • Status changed from Need More Info to In Progress
Actions #10

Updated by John Spray over 7 years ago

  • Status changed from In Progress to Need More Info

Hmm, so when I try loading up the mdsmap.bin from http://tracker.ceph.com/issues/16592#change-81117 it is decoding fine and not asserting in sanity().

I guess whatever the crashing mon is loading from its local store on startup is something different from that (maybe an earlier version of the map had something different/confusing in it).

If you install the "ceph-test" package, then you can extract the local mdsmap from the failing mon like this:
ceph-monstore-tool /var/lib/ceph/mon/ceph-ceph3 get mdsmap > mdsmap.bin.local
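
A rough sketch of scripting that extraction and then trying to decode the result. The `ceph-monstore-tool` arguments are taken from the command above; the `ceph-dencoder type FSMap import ... decode dump_json` step is an assumption about how one might inspect the extracted file, and the helper names are hypothetical:

```python
import shutil
import subprocess


def monstore_get_cmd(mon_data, map_name="mdsmap"):
    """Command line that writes the mon's locally stored map to stdout."""
    return ["ceph-monstore-tool", mon_data, "get", map_name]


def dencoder_dump_cmd(map_path, type_name="FSMap"):
    """Command line that decodes a saved map and dumps it as JSON (assumed usage)."""
    return ["ceph-dencoder", "type", type_name, "import", map_path, "decode", "dump_json"]


def extract_and_dump(mon_data, out_path):
    """Extract the local map from a stopped mon's store, then attempt to decode it."""
    for tool in ("ceph-monstore-tool", "ceph-dencoder"):
        if shutil.which(tool) is None:
            raise RuntimeError(tool + " not found; is the ceph-test package installed?")
    with open(out_path, "wb") as fh:
        subprocess.run(monstore_get_cmd(mon_data), stdout=fh, check=True)
    # check=False: a failing decode is the interesting case, so keep its stderr for inspection.
    return subprocess.run(dencoder_dump_cmd(out_path), capture_output=True, text=True, check=False)
```

Calling `extract_and_dump("/var/lib/ceph/mon/ceph-ceph3", "mdsmap.bin.local")` on the affected node would be the equivalent of the manual step, with the mon daemon stopped.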

Actions #11

Updated by alexander walker over 7 years ago

Here is the dump of the local mdsmap (attached as mdsmap.bin.local).

Actions #12

Updated by John Spray over 7 years ago

  • Status changed from Need More Info to In Progress

Thanks, can now reproduce here.

/home/jspray/git/ceph/src/mds/FSMap.cc: In function '(null)' thread 7f868a7bb680 time 2016-11-17 10:27:36.550149
/home/jspray/git/ceph/src/mds/FSMap.cc: 629: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version v10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x7f867e2c73cb]
 2: (FSMap::sanity() const+0x619) [0x7f867e481ac5]
 3: (FSMap::dump(ceph::Formatter*) const+0x29) [0x7f867e47d357]
 4: (DencoderBase<FSMap>::dump(ceph::Formatter*)+0x27) [0x13fda49]
 5: (main()+0xba94) [0x11f6ffb]
 6: (__libc_start_main()+0xf0) [0x7f8679328700]
 7: (_start()+0x29) [0x11eabe9]

{
    "epoch": 401,
    "compat": {
        "compat": {},
        "ro_compat": {},
        "incompat": {
            "feature_1": "base v0.20",
            "feature_2": "client writeable ranges",
            "feature_3": "default file layouts on dirs",
            "feature_4": "dir inode in separate object",
            "feature_5": "mds uses versioned encoding",
            "feature_6": "dirfrag is stored in omap",
            "feature_8": "no anchor table" 
        }
    },
    "feature_flags": {
        "enable_multiple": false,
        "ever_enabled_multiple": false
    },
    "standbys": [
        {
            "gid": 5854102,
            "name": "ceph2.aditosoftware.local",
            "rank": -1,
            "incarnation": 0,
            "state": "up:standby",
            "state_seq": 1,
            "addr": "192.168.49.102:6800\/1261",
            "standby_for_rank": -1,
            "standby_for_fscid": -1,
            "standby_for_name": "",
            "standby_replay": false,
            "export_targets": [],
            "features": 0,
            "epoch": 401
        },
        {
            "gid": 5994101,
            "name": "ceph3.aditosoftware.local",
            "rank": -1,
            "incarnation": 0,
            "state": "down:dne",
            "state_seq": 22,
            "addr": "192.168.49.103:6800\/29296",
            "laggy_since": "2016-11-08 14:38:41.582432",
            "standby_for_rank": -1,
            "standby_for_fscid": -1,
            "standby_for_name": "",
            "standby_replay": false,
            "export_targets": [],
            "features": 0,
            "epoch": 401
        }
    ],
    "filesystems": [
        {
            "mdsmap": {
                "epoch": 401,
                "flags": 0,
                "ever_allowed_features": 0,
                "explicitly_allowed_features": 0,
                "created": "2016-03-11 14:24:45.516358",
                "modified": "2016-11-08 14:38:41.582500",
                "tableserver": 0,
                "root": 0,
                "session_timeout": 60,
                "session_autoclose": 300,
                "max_file_size": 1099511627776,
                "last_failure": 395,
                "last_failure_osd_epoch": 1328,
                "compat": {
                    "compat": {},
                    "ro_compat": {},
                    "incompat": {
                        "feature_1": "base v0.20",
                        "feature_2": "client writeable ranges",
                        "feature_3": "default file layouts on dirs",
                        "feature_4": "dir inode in separate object",
                        "feature_5": "mds uses versioned encoding",
                        "feature_6": "dirfrag is stored in omap",
                        "feature_8": "no anchor table" 
                    }
                },
                "max_mds": 1,
                "in": [
                    0
                ],
                "up": {
                    "mds_0": 5854219
                },
                "failed": [],
                "damaged": [],
                "stopped": [],
                "info": {
                    "gid_5854219": {
                        "gid": 5854219,
                        "name": "ceph1.aditosoftware.local",
                        "rank": 0,
                        "incarnation": 41,
                        "state": "up:active",
                        "state_seq": 111157,
                        "addr": "192.168.49.101:6800\/1287",
                        "standby_for_rank": -1,
                        "standby_for_fscid": -1,
                        "standby_for_name": "",
                        "standby_replay": false,
                        "export_targets": [],
                        "features": 0
                    }
                },
                "data_pools": [
                    1
                ],
                "metadata_pool": 2,
                "enabled": true,
                "fs_name": "cephfs_fs" 
            },
            "id": 0
        }
    ]
}

The down:dne standby is the problem. I will look into how it might have got there and make sure we handle the case properly.
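
For anyone wanting to check their own map before upgrading, here is a minimal sketch that scans a JSON dump of the shape shown above for entries in "standbys" whose state is not "up:standby", which is the condition the failing assert checks. The function name `unexpected_standbys` is hypothetical, the sample is trimmed from the dump above, and this is not the fix that went into Ceph:

```python
import json


def unexpected_standbys(fsmap):
    """Return (name, gid, state) for each standby entry whose state is not up:standby."""
    return [
        (d.get("name"), d.get("gid"), d.get("state"))
        for d in fsmap.get("standbys", [])
        if d.get("state") != "up:standby"
    ]


# Trimmed copy of the "standbys" section from the dump above.
sample = json.loads("""
{
    "epoch": 401,
    "standbys": [
        {"gid": 5854102, "name": "ceph2.aditosoftware.local", "rank": -1, "state": "up:standby"},
        {"gid": 5994101, "name": "ceph3.aditosoftware.local", "rank": -1, "state": "down:dne"}
    ]
}
""")

print(unexpected_standbys(sample))
# [('ceph3.aditosoftware.local', 5994101, 'down:dne')]
```

An empty list would mean every standby entry is in the state the sanity check expects.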

Actions #13

Updated by John Spray over 7 years ago

  • Status changed from In Progress to Fix Under Review
Actions #14

Updated by Greg Farnum over 7 years ago

  • Status changed from Fix Under Review to 17
Actions #15

Updated by alexander walker over 7 years ago

I could test the changes. Do I have to compile the project myself?

Actions #16

Updated by John Spray over 7 years ago

  • Status changed from 17 to Pending Backport
Actions #17

Updated by John Spray over 7 years ago

  • Backport set to jewel
Actions #18

Updated by John Spray over 7 years ago

Alexander: I've pushed a backport of this to jewel to a branch called wip-17837-jewel. It will build in an hour or two and then be accessible via the gitbuilder server:
http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/

Actions #20

Updated by Loïc Dachary over 7 years ago

  • Copied to Backport #18100: jewel: ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3 added
Actions #21

Updated by Patrick Donnelly almost 7 years ago

  • Status changed from Pending Backport to Resolved