Bug #17837

ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3

Added by alexander walker 11 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 11/09/2016
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release: hammer, jewel
Component(FS):
Needs Doc: No

Description

I have a cluster of three nodes:

ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.45993 root default
-2 1.81998     host ceph1
 0 0.90999         osd.0       up  1.00000          1.00000
 1 0.90999         osd.1       up  1.00000          1.00000
-3 1.81998     host ceph2
 2 0.90999         osd.2       up  1.00000          1.00000
 3 0.90999         osd.3       up  1.00000          1.00000
-4 1.81998     host ceph3
 4 0.90999         osd.4       up  1.00000          1.00000
 5 0.90999         osd.5       up  1.00000          1.00000

I upgraded the ceph3 node first, and now I can't start the monitor daemon. It crashes on startup:


cephus@ceph3:~$ sudo /usr/bin/ceph-mon --cluster=ceph -i ceph3 -f --setuser ceph --setgroup ceph --debug_mon 10
starting mon.ceph3 rank 2 at 192.168.49.103:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph3 fsid 3c58a184-bf27-4273-8000-405513006a7b
mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-11-09 14:58:58.440166 7fb0cf4564c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2016-11-09 14:58:58.440166 7fb0cf4564c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7fb0cf4564c0 time 2016-11-09 14:58:58.437225
mds/FSMap.cc: 628: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606d480b1eb]
 2: (FSMap::sanity() const+0x932) [0x5606d4730112]
 3: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 4: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 5: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 6: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 7: (Monitor::preinit()+0x925) [0x5606d447bec5]
 8: (main()+0x236d) [0x5606d4409e9d]
 9: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 10: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
2016-11-09 14:58:58.442973 7fb0cf4564c0 -1 *** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2016-11-09 14:58:58.442973 7fb0cf4564c0 -1 *** Caught signal (Aborted) **
 in thread 7fb0cf4564c0 thread_name:ceph-mon

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x4f6222) [0x5606d46f1222]
 2: (()+0x10330) [0x7fb0ce764330]
 3: (gsignal()+0x37) [0x7fb0cc9eac37]
 4: (abort()+0x148) [0x7fb0cc9ee028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x5606d480b3c5]
 6: (FSMap::sanity() const+0x932) [0x5606d4730112]
 7: (MDSMonitor::update_from_paxos(bool*)+0x450) [0x5606d455b160]
 8: (PaxosService::refresh(bool*)+0x19a) [0x5606d44ceb4a]
 9: (Monitor::refresh_from_paxos(bool*)+0x143) [0x5606d446b433]
 10: (Monitor::init_paxos()+0x85) [0x5606d446b845]
 11: (Monitor::preinit()+0x925) [0x5606d447bec5]
 12: (main()+0x236d) [0x5606d4409e9d]
 13: (__libc_start_main()+0xf5) [0x7fb0cc9d5f45]
 14: (()+0x26106a) [0x5606d445c06a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

mdsmap.bin.local (1.09 KB) alexander walker, 11/17/2016 06:26 AM


Related issues

Duplicates fs - Bug #16592: Jewel: monitor asserts on "mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)" Need More Info 11/09/2016
Copied to fs - Backport #18100: jewel: ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3 Resolved

History

#1 Updated by Patrick Donnelly 11 months ago

  • Status changed from New to Duplicate
  • Parent task set to #16592
  • Source changed from other to Community (user)

This looks like a duplicate of #16592, but in a new code path; interestingly, it occurs in a slave monitor.

#2 Updated by Patrick Donnelly 11 months ago

  • Parent task deleted (#16592)

#3 Updated by Patrick Donnelly 11 months ago

  • Project changed from Ceph to fs
  • Category deleted (Monitor)

#4 Updated by Patrick Donnelly 11 months ago

  • Duplicates Bug #16592: Jewel: monitor asserts on "mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)" added

#5 Updated by John Spray 11 months ago

  • Status changed from Duplicate to Need More Info

Alexander: so hopefully you stopped the upgrade at that point and you still have a working cluster of two hammer mons?

Please could you do a "ceph mds dump --format=json-pretty" and a "ceph mds getmap > mdsmap.bin" and provide the outputs on the #16592 ticket? Hopefully that will make it easy for us to reproduce the issue.

(This may indeed be a duplicate of #16592, but since this one is picked up in sanity immediately and that one was only happening later, it might be something distinct)

#6 Updated by alexander walker 11 months ago

Yes, I stopped the upgrade, and my cluster is now running with two mon servers.

This may be helpful: I have a test cluster with the same Ubuntu and Ceph versions, and the upgrade ran there without any problems. The difference is that the production cluster uses an M.2 SSD for the journals; the two journal partitions are /dev/nvme0n1p4 and /dev/nvme0n1p5 on each server.
I also had a problem with permissions, like the one described here: http://tracker.ceph.com/issues/15874

#7 Updated by John Spray 11 months ago

  • Assignee set to John Spray

#8 Updated by John Spray 11 months ago

Note to self, dumps are on http://tracker.ceph.com/issues/16592

#9 Updated by John Spray 11 months ago

  • Status changed from Need More Info to In Progress

#10 Updated by John Spray 11 months ago

  • Status changed from In Progress to Need More Info

Hmm, so when I try loading up the mdsmap.bin from http://tracker.ceph.com/issues/16592#change-81117 it is decoding fine and not asserting in sanity().

I guess whatever the crashing mon is loading from its local store on startup is something different from that (maybe an earlier version of the map had something different/confusing in it).

If you install the "ceph-test" package, then you can extract the local mdsmap from the failing mon like this:
ceph-monstore-tool /var/lib/ceph/mon/ceph-ceph3 get mdsmap > mdsmap.bin.local
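
The extraction step above can also be scripted. The following is a minimal Python sketch, assuming ceph-monstore-tool is on the PATH and writes the map to stdout, as the shell redirection above implies. The helper names monstore_get_cmd and extract_map are hypothetical and are not part of any Ceph package.

```python
import subprocess
from pathlib import Path


def monstore_get_cmd(store_path, map_type="mdsmap"):
    """Build the argv for the extraction command quoted in the comment above."""
    return ["ceph-monstore-tool", str(store_path), "get", map_type]


def extract_map(store_path, out_file, map_type="mdsmap"):
    """Run the tool and write its stdout (the binary map) to out_file.

    Assumption: the tool emits the map on stdout, as the '>' redirection
    in the original command suggests.
    """
    cmd = monstore_get_cmd(store_path, map_type)
    with open(out_file, "wb") as fh:
        # check=True raises CalledProcessError if the tool exits non-zero
        subprocess.run(cmd, stdout=fh, check=True)
    return Path(out_file)


if __name__ == "__main__":
    # Paths taken from the comment above; run this on the node with the failing mon.
    extract_map("/var/lib/ceph/mon/ceph-ceph3", "mdsmap.bin.local")
```

Writing the output in binary mode matters here, since the map is an encoded blob rather than text.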

#11 Updated by alexander walker 11 months ago

Here is the dump of the local mdsmap (attached as mdsmap.bin.local).

#12 Updated by John Spray 11 months ago

  • Status changed from Need More Info to In Progress

Thanks, can now reproduce here.

/home/jspray/git/ceph/src/mds/FSMap.cc: In function '(null)' thread 7f868a7bb680 time 2016-11-17 10:27:36.550149
/home/jspray/git/ceph/src/mds/FSMap.cc: 629: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
 ceph version v10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x7f867e2c73cb]
 2: (FSMap::sanity() const+0x619) [0x7f867e481ac5]
 3: (FSMap::dump(ceph::Formatter*) const+0x29) [0x7f867e47d357]
 4: (DencoderBase<FSMap>::dump(ceph::Formatter*)+0x27) [0x13fda49]
 5: (main()+0xba94) [0x11f6ffb]
 6: (__libc_start_main()+0xf0) [0x7f8679328700]
 7: (_start()+0x29) [0x11eabe9]

{
    "epoch": 401,
    "compat": {
        "compat": {},
        "ro_compat": {},
        "incompat": {
            "feature_1": "base v0.20",
            "feature_2": "client writeable ranges",
            "feature_3": "default file layouts on dirs",
            "feature_4": "dir inode in separate object",
            "feature_5": "mds uses versioned encoding",
            "feature_6": "dirfrag is stored in omap",
            "feature_8": "no anchor table" 
        }
    },
    "feature_flags": {
        "enable_multiple": false,
        "ever_enabled_multiple": false
    },
    "standbys": [
        {
            "gid": 5854102,
            "name": "ceph2.aditosoftware.local",
            "rank": -1,
            "incarnation": 0,
            "state": "up:standby",
            "state_seq": 1,
            "addr": "192.168.49.102:6800\/1261",
            "standby_for_rank": -1,
            "standby_for_fscid": -1,
            "standby_for_name": "",
            "standby_replay": false,
            "export_targets": [],
            "features": 0,
            "epoch": 401
        },
        {
            "gid": 5994101,
            "name": "ceph3.aditosoftware.local",
            "rank": -1,
            "incarnation": 0,
            "state": "down:dne",
            "state_seq": 22,
            "addr": "192.168.49.103:6800\/29296",
            "laggy_since": "2016-11-08 14:38:41.582432",
            "standby_for_rank": -1,
            "standby_for_fscid": -1,
            "standby_for_name": "",
            "standby_replay": false,
            "export_targets": [],
            "features": 0,
            "epoch": 401
        }
    ],
    "filesystems": [
        {
            "mdsmap": {
                "epoch": 401,
                "flags": 0,
                "ever_allowed_features": 0,
                "explicitly_allowed_features": 0,
                "created": "2016-03-11 14:24:45.516358",
                "modified": "2016-11-08 14:38:41.582500",
                "tableserver": 0,
                "root": 0,
                "session_timeout": 60,
                "session_autoclose": 300,
                "max_file_size": 1099511627776,
                "last_failure": 395,
                "last_failure_osd_epoch": 1328,
                "compat": {
                    "compat": {},
                    "ro_compat": {},
                    "incompat": {
                        "feature_1": "base v0.20",
                        "feature_2": "client writeable ranges",
                        "feature_3": "default file layouts on dirs",
                        "feature_4": "dir inode in separate object",
                        "feature_5": "mds uses versioned encoding",
                        "feature_6": "dirfrag is stored in omap",
                        "feature_8": "no anchor table" 
                    }
                },
                "max_mds": 1,
                "in": [
                    0
                ],
                "up": {
                    "mds_0": 5854219
                },
                "failed": [],
                "damaged": [],
                "stopped": [],
                "info": {
                    "gid_5854219": {
                        "gid": 5854219,
                        "name": "ceph1.aditosoftware.local",
                        "rank": 0,
                        "incarnation": 41,
                        "state": "up:active",
                        "state_seq": 111157,
                        "addr": "192.168.49.101:6800\/1287",
                        "standby_for_rank": -1,
                        "standby_for_fscid": -1,
                        "standby_for_name": "",
                        "standby_replay": false,
                        "export_targets": [],
                        "features": 0
                    }
                },
                "data_pools": [
                    1
                ],
                "metadata_pool": 2,
                "enabled": true,
                "fs_name": "cephfs_fs" 
            },
            "id": 0
        }
    ]
}

The down:dne standby is the problem. I will look into how it might have got there and make sure we handle the case properly.
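
To spot this condition in a JSON dump without triggering the assert, a check along the following lines may help. It is only a sketch, not the fix: it assumes the dump has the layout shown above (a top-level "standbys" list whose entries carry a "state" string) and that "up:standby" is the JSON rendering of MDSMap::STATE_STANDBY, as the healthy entry suggests. The function name bad_standbys is hypothetical.

```python
import json


def bad_standbys(fsmap):
    """Return the standby entries whose state is anything other than 'up:standby'."""
    return [s for s in fsmap.get("standbys", []) if s.get("state") != "up:standby"]


# Trimmed-down copy of the dump above, keeping only the fields the check needs.
FSMAP_JSON = """
{
  "epoch": 401,
  "standbys": [
    {"gid": 5854102, "name": "ceph2.aditosoftware.local", "state": "up:standby"},
    {"gid": 5994101, "name": "ceph3.aditosoftware.local", "state": "down:dne"}
  ]
}
"""

if __name__ == "__main__":
    for s in bad_standbys(json.loads(FSMAP_JSON)):
        print(f"gid {s['gid']} ({s['name']}): state {s['state']}")
```

Against the trimmed dump this prints one line, for gid 5994101 (ceph3.aditosoftware.local) in state down:dne, which is the entry that FSMap::sanity() rejects.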

#13 Updated by John Spray 11 months ago

  • Status changed from In Progress to Need Review

#14 Updated by Greg Farnum 11 months ago

  • Status changed from Need Review to Need Test

#15 Updated by alexander walker 11 months ago

I could test the changes. Do I have to compile the project myself?

#16 Updated by John Spray 11 months ago

  • Status changed from Need Test to Pending Backport

#17 Updated by John Spray 11 months ago

  • Backport set to jewel

#18 Updated by John Spray 11 months ago

Alexander: I've pushed a backport of this to jewel to a branch called wip-17837-jewel. It will build in an hour or two and then be accessible via the gitbuilder server:
http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/

#20 Updated by Loic Dachary 11 months ago

  • Copied to Backport #18100: jewel: ceph-mon crashed after upgrade from hammer 0.94.7 to jewel 10.2.3 added

#21 Updated by Patrick Donnelly 3 months ago

  • Status changed from Pending Backport to Resolved
