Bug #5205


mon: FAILED assert(ret == 0) on config's set_val_or_die() from pick_addresses()

Added by Joao Eduardo Luis almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is the crash log (from saaby on #ceph):

root@ceph1-cph1c16-mon1:/var/lib/ceph/mon# ceph-mon -i ceph1-cph1c16-mon1 --debug-osd 20 -d
2013-05-29 13:39:23.760439 7f58c4b07780  0 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d), process ceph-mon, pid 22748
2013-05-29 13:39:23.933491 7f58c4b07780  0 mon.ceph1-cph1c16-mon1 does not exist in monmap, will attempt to join an existing cluster
common/config.cc: In function 'void md_config_t::set_val_or_die(const char*, const char*)' thread 7f58c4b07780 time 2013-05-29 13:39:23.933914
common/config.cc: 621: FAILED assert(ret == 0)
 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d)
 1: ceph-mon() [0x668046]
 2: ceph-mon() [0x69e889]
 3: (pick_addresses(CephContext*)+0x8d) [0x69e9ed]
 4: (main()+0x1a6b) [0x4a146b]
 5: (__libc_start_main()+0xfd) [0x7f58c2d86ead]
 6: ceph-mon() [0x4a3609]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
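For context, the failing assertion comes from a die-on-error wrapper: set_val_or_die() calls set_val() and asserts the return code is zero, so any value the config layer rejects aborts the process. Below is a hedged, self-contained sketch of that pattern — the struct and its validation rule are illustrative mocks, not Ceph's actual md_config_t:

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative mock of the md_config_t pattern in common/config.cc:
// set_val() returns 0 on success or a negative errno-style code on an
// invalid value, and set_val_or_die() asserts the call succeeded.
struct md_config_t {
    std::map<std::string, std::string> values;

    int set_val(const char* key, const char* val) {
        // Stand-in validation: reject values containing control bytes
        // (the real code validates per-option, e.g. address parsing).
        for (const char* p = val; *p; ++p) {
            if (static_cast<unsigned char>(*p) < 0x20)
                return -22;  // -EINVAL
        }
        values[key] = val;
        return 0;
    }

    void set_val_or_die(const char* key, const char* val) {
        int ret = set_val(key, val);
        // Mirrors common/config.cc:621: FAILED assert(ret == 0)
        assert(ret == 0);
    }
};
```

With this shape, any caller that hands set_val_or_die() a value the validator rejects — as pick_addresses() apparently did here — takes down the daemon rather than returning an error.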

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #5195: "ceph-deploy mon create" fails when adding additional monitors (Resolved, 05/29/2013)

#1

Updated by Adam Compton almost 11 years ago

I've also encountered this problem, running 0.61.2 on CentOS 6.4 (uname 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux).

I believe I've narrowed it down to the following circumstances:

1. Have an existing cluster to which you're trying to join a new mon.
2. Specify the "public network" configuration option in the [global] section of ceph.conf
3. Follow the instructions in http://ceph.com/docs/next/rados/operations/add-or-rm-mons/ to set up a new mon (install the software, get the key and monmap, run --mkfs, edit ceph.conf, run "ceph mon add")

Under these conditions, when I get to step 8 (actually running ceph-mon), it immediately aborts with the same crash log saaby provided above. This is repeatable in my cluster, although I haven't tried destroying and recreating the other mons to start from scratch. I dug around with gdb, and it looks like the problem is in fill_in_one_address (IP addresses obscured):

#8 0x0000000000666c9e in fill_in_one_address (cct=0x1250000, ifa=<value optimized out>, networks="10.x.x.x/24",
conf_var=0x6cd350 "public_addr") at common/pick_address.cc:78
78 cct->_conf->set_val_or_die(conf_var, buf);
(gdb) p conf_var
$3 = 0x6cd350 "public_addr"
(gdb) p buf
$4 = "10.x.x.x\000\377\177\000\000X\316\377\377\005\000\000\000\000\210\066\001\000\000\000\000\017\000\000\000sK\000\000\351a\256Q\377"

I'm pretty sure buf isn't supposed to have all that gunk at the end. I did not dig further to figure out where it's coming from. As a workaround, removing the "public network" option from [global] lets the new mon join the cluster; once it has joined, you can put "public network" back, and ceph-mon won't break on subsequent startups.
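One common way a stack buffer like buf ends up displaying stale bytes past the address string is being handed to getnameinfo() without zeroing or without checking the return value. The sketch below is a hypothetical reconstruction of the pattern in fill_in_one_address() — the function name numeric_addr and the defensive memset are my additions, not Ceph's actual code:

```cpp
#include <arpa/inet.h>
#include <cstring>
#include <netdb.h>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>

// Hypothetical sketch: convert a sockaddr to its numeric string form,
// the way fill_in_one_address() fills buf before calling
// set_val_or_die(conf_var, buf). A fixed stack buffer starts out with
// whatever bytes were on the stack; if getnameinfo() failed and the
// result were used unchecked, those stale bytes would be passed along.
std::string numeric_addr(const sockaddr* sa, socklen_t len) {
    char buf[NI_MAXHOST];
    memset(buf, 0, sizeof(buf));  // defensive: clear stale stack bytes
    int r = getnameinfo(sa, len, buf, sizeof(buf),
                        nullptr, 0, NI_NUMERICHOST);
    if (r != 0)        // the check that protects against garbage output
        return std::string();
    return std::string(buf);
}
```

Note that in the gdb dump above the string is NUL-terminated before the gunk, so set_val() should only ever see "10.x.x.x"; the garbage shown past the terminator is expected for a stack buffer, and the real rejection may instead come from validating the value itself.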

#2

Updated by Joao Eduardo Luis almost 11 years ago

Thanks Adam, this provides great insight into what's going on.

#3

Updated by Sage Weil almost 11 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High
#4

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent
#5

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Fix Under Review
#6

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
