Bug #4747

Upgrade monitors from argonaut->bobtail->next fails w/"Existing store has not been converted to 0.52 format"

Added by Ken Franklin almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: bobtail
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am testing the upgrade path from Argonaut to Bobtail to Next (cuttlefish), using the Argonaut and Bobtail distros on gitbuilder. The initial cluster was created via mkcephfs. On the final upgrade, from bobtail 0.56.4 to Next, I get the following output from sudo service ceph -a restart:

=== mon.a ===
=== mon.a ===
Stopping Ceph mon.a on burnupi57...kill 53032...done
=== mon.a ===
Starting Ceph mon.a on burnupi57...
starting mon.a rank 2 at 10.214.136.16:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid dc9f30aa-9c2b-4311-9206-94a6a2de25ac
=== mon.b ===
=== mon.b ===
Stopping Ceph mon.b on burnupi63...kill 23393...done
=== mon.b ===
Starting Ceph mon.b on burnupi63...
starting mon.b rank 0 at 10.214.136.4:6789/0 mon_data /var/lib/ceph/mon/ceph-b fsid dc9f30aa-9c2b-4311-9206-94a6a2de25ac
=== mon.c ===
=== mon.c ===
Stopping Ceph mon.c on burnupi59...kill 18264...done
=== mon.c ===
Starting Ceph mon.c on burnupi59...
Invalid argument: /var/lib/ceph/mon/ceph-c/store.db: does not exist (create_if_missing is false)
mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7f2032224780 time 2013-04-18 11:02:12.444351
mon/Monitor.cc: 4147: FAILED assert(0 == "Existing store has not been converted to 0.52 format")
ceph version 0.60-525-g7e4f80b (7e4f80b12e86d0da9cedc1569c63d78cd27bb8ed)
1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a41ee]
2: (main()+0x7a1) [0x48a671]
3: (__libc_start_main()+0xed) [0x7f203054976d]
4: /usr/bin/ceph-mon() [0x48dc2d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-04-18 11:02:12.444761 7f2032224780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7f2032224780 time 2013-04-18 11:02:12.444351
mon/Monitor.cc: 4147: FAILED assert(0 == "Existing store has not been converted to 0.52 format")

ceph version 0.60-525-g7e4f80b (7e4f80b12e86d0da9cedc1569c63d78cd27bb8ed)
1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a41ee]
2: (main()+0x7a1) [0x48a671]
3: (__libc_start_main()+0xed) [0x7f203054976d]
4: /usr/bin/ceph-mon() [0x48dc2d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2013-04-18 11:02:12.444761 7f2032224780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7f2032224780 time 2013-04-18 11:02:12.444351
mon/Monitor.cc: 4147: FAILED assert(0 == "Existing store has not been converted to 0.52 format")
ceph version 0.60-525-g7e4f80b (7e4f80b12e86d0da9cedc1569c63d78cd27bb8ed)
1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a41ee]
2: (main()+0x7a1) [0x48a671]
3: (__libc_start_main()+0xed) [0x7f203054976d]
4: /usr/bin/ceph-mon() [0x48dc2d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7f2032224780
ceph version 0.60-525-g7e4f80b (7e4f80b12e86d0da9cedc1569c63d78cd27bb8ed)
1: /usr/bin/ceph-mon() [0x590eba]
2: (()+0xfcb0) [0x7f2031e06cb0]
3: (gsignal()+0x35) [0x7f203055e425]
4: (abort()+0x17b) [0x7f2030561b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2030eb069d]
6: (()+0xb5846) [0x7f2030eae846]
7: (()+0xb5873) [0x7f2030eae873]
8: (()+0xb596e) [0x7f2030eae96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x649a9f]
10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a41ee]
11: (main()+0x7a1) [0x48a671]
12: (__libc_start_main()+0xed) [0x7f203054976d]
13: /usr/bin/ceph-mon() [0x48dc2d]
2013-04-18 11:02:12.445418 7f2032224780 -1 *** Caught signal (Aborted) **
in thread 7f2032224780
ceph version 0.60-525-g7e4f80b (7e4f80b12e86d0da9cedc1569c63d78cd27bb8ed)
1: /usr/bin/ceph-mon() [0x590eba]
2: (()+0xfcb0) [0x7f2031e06cb0]
3: (gsignal()+0x35) [0x7f203055e425]
4: (abort()+0x17b) [0x7f2030561b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2030eb069d]
6: (()+0xb5846) [0x7f2030eae846]
7: (()+0xb5873) [0x7f2030eae873]
8: (()+0xb596e) [0x7f2030eae96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x649a9f]
10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a41ee]
11: (main()+0x7a1) [0x48a671]
12: (__libc_start_main()+0xed) [0x7f203054976d]
13: /usr/bin/ceph-mon() [0x48dc2d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2013-04-18 11:02:12.445418 7f2032224780 -1 *** Caught signal (Aborted) **
in thread 7f2032224780
ceph version 0.60-525-g7e4f80b (7e4f80b12e86d0da9cedc1569c63d78cd27bb8ed)
1: /usr/bin/ceph-mon() [0x590eba]
2: (()+0xfcb0) [0x7f2031e06cb0]
3: (gsignal()+0x35) [0x7f203055e425]
4: (abort()+0x17b) [0x7f2030561b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2030eb069d]
6: (()+0xb5846) [0x7f2030eae846]
7: (()+0xb5873) [0x7f2030eae873]
8: (()+0xb596e) [0x7f2030eae96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x649a9f]
10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a41ee]
11: (main()+0x7a1) [0x48a671]
12: (__libc_start_main()+0xed) [0x7f203054976d]
13: /usr/bin/ceph-mon() [0x48dc2d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

bash: line 1: 19264 Aborted (core dumped) /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /tmp/ceph.conf.4af6179d7498b7b9da2f9c74dea5c1d1
failed: 'ssh burnupi59 ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /tmp/ceph.conf.4af6179d7498b7b9da2f9c74dea5c1d1 '
=== mds.a ===
=== mds.a ===
Stopping Ceph mds.a on burnupi57...kill 53621...done
=== mds.a ===
Starting Ceph mds.a on burnupi57...
starting mds.a at :/0
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on burnupi57...kill 53849...done
=== osd.0 ===
Starting Ceph osd.0 on burnupi57...
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
=== osd.1 ===
=== osd.1 ===
Stopping Ceph osd.1 on burnupi59...kill 18533...done
=== osd.1 ===
Starting Ceph osd.1 on burnupi59...
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
=== osd.2 ===
=== osd.2 ===
Stopping Ceph osd.2 on burnupi63...kill 23636...done
=== osd.2 ===
Starting Ceph osd.2 on burnupi63...
starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
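
For triage, restart output like the above can be scanned mechanically to see which daemons hit the assertion. The following is a minimal sketch in Python, assuming the output has been saved as text. The function name find_failed_daemons is hypothetical and not part of any Ceph tooling; the sketch relies only on the "Starting Ceph <daemon> on <host>..." lines and the assertion message that appear in the log.

```python
import re

# Assertion text taken from the log in this report.
ASSERT_MSG = "Existing store has not been converted to 0.52 format"
# Matches the init-script line that announces each daemon start.
START_RE = re.compile(r"Starting Ceph (\S+) on (\S+?)\.\.\.")


def find_failed_daemons(log_text):
    """Return (daemon, host) pairs whose start was followed by the assertion."""
    failed = []
    current = None
    for line in log_text.splitlines():
        match = START_RE.search(line)
        if match:
            # Remember which daemon the following lines belong to.
            current = (match.group(1), match.group(2))
        elif ASSERT_MSG in line and current is not None and current not in failed:
            failed.append(current)
    return failed


sample = """\
Starting Ceph mon.b on burnupi63...
starting mon.b rank 0 at 10.214.136.4:6789/0 mon_data /var/lib/ceph/mon/ceph-b
Starting Ceph mon.c on burnupi59...
mon/Monitor.cc: 4147: FAILED assert(0 == "Existing store has not been converted to 0.52 format")
Starting Ceph mds.a on burnupi57...
"""

print(find_failed_daemons(sample))
```

The assertion is printed several times per crash, so the sketch records each daemon only once.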


ceph.conf:

[global]

auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[osd]
osd journal size = 1000

filestore xattr use omap = true
#osd mkfs type = {fs-type}
#osd mkfs options {fs-type} = {mkfs options} # default for xfs is "-f"
#osd mount options {fs-type} = {mount options} # default mount option is "rw,noatime"
# For example, for ext4, the mount option might look like this:
#osd mkfs options ext4 = user_xattr,rw,noatime

[mon.a]

host = burnupi57
mon addr = 10.214.136.16:6789
[mon.b]
host = burnupi63
mon addr = 10.214.136.4:6789
[mon.c]
host = burnupi59
mon addr = 10.214.136.12:6789

[osd.0]
host = burnupi57
#devs = {path-to-device}

[osd.1]
host = burnupi59
#devs = {path-to-device}
[osd.2]
host = burnupi63
#devs = {path-to-device}

[mds.a]
host = burnupi57

[client.radosgw.gateway]
host = burnupi57
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/ceph/radosgw.log
rgw enable usage log = true
rgw usage log tick interval = 30
rgw usage log flush threshold = 1024
rgw usage max shards = 32
rgw usage max user shards = 1
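
To relate a failing monitor to its host and address, the [mon.X] sections of a ceph.conf like the one above can be read programmatically. This is a rough sketch using Python's standard configparser on a shortened copy of the file; the function name monitor_map is hypothetical. It assumes option names are written with spaces, as in this file. Ceph itself also accepts underscores in option names, which this sketch does not handle.

```python
import configparser


def monitor_map(conf_text):
    """Map each [mon.X] section to its (host, mon addr) pair."""
    # Only '=' is treated as a delimiter so that 'ip:port' values stay intact.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(conf_text)
    return {
        section: (parser[section].get("host"), parser[section].get("mon addr"))
        for section in parser.sections()
        if section.startswith("mon.")
    }


# Shortened copy of the configuration from this report.
sample = """\
[global]
auth cluster required = cephx

[mon.a]
host = burnupi57
mon addr = 10.214.136.16:6789
[mon.c]
host = burnupi59
mon addr = 10.214.136.12:6789
"""

for name, (host, addr) in monitor_map(sample).items():
    print(name, host, addr)
```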


Related issues

Duplicated by Ceph - Bug #4858: mon: doesn't necessarily call reset() during an election cycle Resolved 04/29/2013

Associated revisions

Revision fe68afe9 (diff)
Added by Greg Farnum almost 11 years ago

mon: communicate the quorum_features properly when declaring victory.

Fixes #4747.

Signed-off-by: Greg Farnum <>
Reviewed-by: Sage Weil <>

Revision 849ed598 (diff)
Added by Greg Farnum almost 11 years ago

mon: communicate the quorum_features properly when declaring victory.

Fixes #4747.

Signed-off-by: Greg Farnum <>
Reviewed-by: Sage Weil <>
(cherry picked from commit fe68afe9d10bc5d49a05a8bafa644d57783447cf)

History

#1 Updated by Greg Farnum almost 11 years ago

  • Assignee set to Joao Eduardo Luis

I believe this is about the pre-Bobtail change which started adding global ordering values to the monitor data store; it seems you're upgrading fast enough that the monitor still has values which didn't get assigned an ordering. If you leave it running long enough to cycle through those it should go fine (ie, this is not representative of what customers should see).

Joao, can you check that my suppositions are correct, and figure out how best to test the upgrade while dealing with this issue?

#2 Updated by Ian Colle almost 11 years ago

  • Priority changed from High to Urgent

#3 Updated by Joao Eduardo Luis almost 11 years ago

Greg Farnum wrote:

I believe this is about the pre-Bobtail change which started adding global ordering values to the monitor data store; it seems you're upgrading fast enough that the monitor still has values which didn't get assigned an ordering. If you leave it running long enough to cycle through those it should go fine (ie, this is not representative of what customers should see).

Joao, can you check that my suppositions are correct, and figure out how best to test the upgrade while dealing with this issue?

This is basically what I was going to suggest as a possible cause. If that is not the case, then further inquiry should be made (in which case access to the machines suffering from this would be best).

#4 Updated by Ken Franklin almost 11 years ago

I was able to recreate this twice. The first time included running functional tests in between each installation ie. Argonaut - run tests, upgrade to bobtail - run tests (left it overnight) then attempted to upgrade to cuttlefish.

With the second recreation I upgraded without running tests in between.

I will attempt to recreate it again using the first method today. How long is "long enough" to cycle through those values?

#5 Updated by Sage Weil almost 11 years ago

hmm, could the problem be that it wants gv values for everything in the mon store, not just the recent commits?

#6 Updated by Greg Farnum almost 11 years ago

It pretty much has to, unless it were given separate logic to figure out which commits "matter", which would be not great.

Ken, I don't think you need to reproduce this again; we know what's happening. We should define what the upgrade requirements are in case people are stepping forward quickly like that, but that's not necessarily something that needs to be set immediately on the Cuttlefish release, either.

#7 Updated by Joao Eduardo Luis almost 11 years ago

  • Project changed from devops to Ceph
  • Category set to Monitor
  • Assignee deleted (Joao Eduardo Luis)

#8 Updated by Joao Eduardo Luis almost 11 years ago

I'm not currently working on this, so I'm unassigning it from me (but still watching) in case someone else wants to pick it up.

#9 Updated by Greg Farnum almost 11 years ago

Shoot; it looks like this is actually just checking the on-disk features CompatSet; it's not iterating through the actual data at all! (which in fact might be a separate bug, but I'd have to look more closely to be sure.) This may require more investigation after all.

The only way I can see that not having happened is if you didn't actually fully upgrade or shut down the argonaut monitors; can you verify that you did do that, Ken, and that each of them crashed with this assert on the upgrade to cuttlefish?
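
The check described in this comment can be modelled roughly as follows. This is a simplified Python sketch of the behaviour as described in this thread (only the on-disk feature set is inspected, not the stored data), not the actual Monitor.cc code. The function signature, the GV_FEATURE label and the StoreNotConverted exception are all hypothetical names introduced for illustration.

```python
# Hypothetical label for the on-disk "global version" feature flag.
GV_FEATURE = "gv"


class StoreNotConverted(Exception):
    """Stands in for the failed assert seen in the logs."""


def needs_conversion(new_store_exists, old_store_exists, on_disk_features):
    """Decide whether an old-format monitor store should be converted.

    Only the feature set recorded on disk is consulted; the stored
    values themselves are never iterated.
    """
    if new_store_exists or not old_store_exists:
        # Nothing to convert: already converted, or a fresh monitor.
        return False
    if GV_FEATURE not in on_disk_features:
        raise StoreNotConverted(
            "Existing store has not been converted to 0.52 format"
        )
    return True


# Old store whose feature set was updated while running bobtail.
print(needs_conversion(False, True, {GV_FEATURE}))

# Old store whose feature set was never updated: the case in this report.
try:
    needs_conversion(False, True, set())
except StoreNotConverted as exc:
    print("abort:", exc)
```

Under this model a store can hold global versions and still be rejected if its recorded feature set was never updated.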

#10 Updated by Ken Franklin almost 11 years ago

It's a manual process so I could have missed something along the way. If I used upgrade instead of dist-upgrade for bobtail on any of the nodes, it would have left old stuff behind. In any case I stepped through the procedure again and ran tests to verify functionality along the way. The upgrade to next 0.60-553 did not have any issues starting the monitors this time. I think we can file this under not-reproducible.

#11 Updated by Greg Farnum almost 11 years ago

  • Status changed from New to Can't reproduce

Awesome. I made #4758 for the fast-convert story I mentioned.

#12 Updated by Tamilarasi muthamizhan almost 11 years ago

  • Subject changed from Upgrade from argonaut->bobtail->next fails w/"Existing store has not been converted to 0.52 format" to Upgrade monitors from argonaut->bobtail->next fails w/"Existing store has not been converted to 0.52 format"
  • Status changed from Can't reproduce to New
  • Target version set to v0.61 - Cuttlefish

I am not sure why this was marked "can't reproduce", but I am hitting it on my local cluster [burnupi39, burnupi45].

steps:
  • set up an argonaut cluster on burnupi39, burnupi45 with cephx off
  • ran some workloads like blogbench and fsstress
  • upgraded only the monitors to bobtail while the osds and mds were still running argonaut
  • after some time, upgraded the monitors to cuttlefish

Now the monitors will not come up and the cluster is not operational.

ubuntu@burnupi45:/var/lib/ceph/osd/ceph-2/current$ sudo service ceph restart mon.b
=== mon.b === 
=== mon.b === 
Stopping Ceph mon.b on burnupi45...kill 36034...done
=== mon.b === 
Starting Ceph mon.b on burnupi45...
Invalid argument: /var/lib/ceph/mon/ceph-b/store.db: does not exist (create_if_missing is false)
mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7fbbee83c780 time 2013-04-26 13:51:41.207474
mon/Monitor.cc: 4223: FAILED assert(0 == "Existing store has not been converted to 0.52 format")
 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 2: (main()+0x799) [0x48b0f9]
 3: (__libc_start_main()+0xed) [0x7fbbec8f576d]
 4: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-04-26 13:51:41.208010 7fbbee83c780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7fbbee83c780 time 2013-04-26 13:51:41.207474
mon/Monitor.cc: 4223: FAILED assert(0 == "Existing store has not been converted to 0.52 format")

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 2: (main()+0x799) [0x48b0f9]
 3: (__libc_start_main()+0xed) [0x7fbbec8f576d]
 4: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2013-04-26 13:51:41.208010 7fbbee83c780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7fbbee83c780 time 2013-04-26 13:51:41.207474
mon/Monitor.cc: 4223: FAILED assert(0 == "Existing store has not been converted to 0.52 format")

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 2: (main()+0x799) [0x48b0f9]
 3: (__libc_start_main()+0xed) [0x7fbbec8f576d]
 4: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7fbbee83c780
 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: /usr/bin/ceph-mon() [0x59471a]
 2: (()+0xfcb0) [0x7fbbee41fcb0]
 3: (gsignal()+0x35) [0x7fbbec90a425]
 4: (abort()+0x17b) [0x7fbbec90db8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fbbed25c69d]
 6: (()+0xb5846) [0x7fbbed25a846]
 7: (()+0xb5873) [0x7fbbed25a873]
 8: (()+0xb596e) [0x7fbbed25a96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64d7df]
 10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 11: (main()+0x799) [0x48b0f9]
 12: (__libc_start_main()+0xed) [0x7fbbec8f576d]
 13: /usr/bin/ceph-mon() [0x48e47d]
2013-04-26 13:51:41.209173 7fbbee83c780 -1 *** Caught signal (Aborted) **
 in thread 7fbbee83c780

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: /usr/bin/ceph-mon() [0x59471a]
 2: (()+0xfcb0) [0x7fbbee41fcb0]
 3: (gsignal()+0x35) [0x7fbbec90a425]
 4: (abort()+0x17b) [0x7fbbec90db8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fbbed25c69d]
 6: (()+0xb5846) [0x7fbbed25a846]
 7: (()+0xb5873) [0x7fbbed25a873]
 8: (()+0xb596e) [0x7fbbed25a96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64d7df]
 10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 11: (main()+0x799) [0x48b0f9]
 12: (__libc_start_main()+0xed) [0x7fbbec8f576d]
 13: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2013-04-26 13:51:41.209173 7fbbee83c780 -1 *** Caught signal (Aborted) **
 in thread 7fbbee83c780

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: /usr/bin/ceph-mon() [0x59471a]
 2: (()+0xfcb0) [0x7fbbee41fcb0]
 3: (gsignal()+0x35) [0x7fbbec90a425]
 4: (abort()+0x17b) [0x7fbbec90db8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fbbed25c69d]
 6: (()+0xb5846) [0x7fbbed25a846]
 7: (()+0xb5873) [0x7fbbed25a873]
 8: (()+0xb596e) [0x7fbbed25a96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64d7df]
 10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 11: (main()+0x799) [0x48b0f9]
 12: (__libc_start_main()+0xed) [0x7fbbec8f576d]
 13: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

bash: line 1: 38394 Aborted                 (core dumped) /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c /etc/ceph/ceph.conf
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on burnupi45...
=========================

ubuntu@burnupi45:/var/lib/ceph/osd/ceph-2/current$ sudo service ceph restart mon.c
=== mon.c === 
=== mon.c === 
Stopping Ceph mon.c on burnupi45...done
=== mon.c === 
Starting Ceph mon.c on burnupi45...
Invalid argument: /var/lib/ceph/mon/ceph-c/store.db: does not exist (create_if_missing is false)
mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7f2911768780 time 2013-04-26 14:00:27.087130
mon/Monitor.cc: 4223: FAILED assert(0 == "Existing store has not been converted to 0.52 format")
 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 2: (main()+0x799) [0x48b0f9]
 3: (__libc_start_main()+0xed) [0x7f290f82176d]
 4: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-04-26 14:00:27.087627 7f2911768780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7f2911768780 time 2013-04-26 14:00:27.087130
mon/Monitor.cc: 4223: FAILED assert(0 == "Existing store has not been converted to 0.52 format")

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 2: (main()+0x799) [0x48b0f9]
 3: (__libc_start_main()+0xed) [0x7f290f82176d]
 4: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2013-04-26 14:00:27.087627 7f2911768780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::needs_conversion()' thread 7f2911768780 time 2013-04-26 14:00:27.087130
mon/Monitor.cc: 4223: FAILED assert(0 == "Existing store has not been converted to 0.52 format")

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 2: (main()+0x799) [0x48b0f9]
 3: (__libc_start_main()+0xed) [0x7f290f82176d]
 4: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f2911768780
 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: /usr/bin/ceph-mon() [0x59471a]
 2: (()+0xfcb0) [0x7f291134bcb0]
 3: (gsignal()+0x35) [0x7f290f836425]
 4: (abort()+0x17b) [0x7f290f839b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f291018869d]
 6: (()+0xb5846) [0x7f2910186846]
 7: (()+0xb5873) [0x7f2910186873]
 8: (()+0xb596e) [0x7f291018696e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64d7df]
 10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 11: (main()+0x799) [0x48b0f9]
 12: (__libc_start_main()+0xed) [0x7f290f82176d]
 13: /usr/bin/ceph-mon() [0x48e47d]
2013-04-26 14:00:27.088784 7f2911768780 -1 *** Caught signal (Aborted) **
 in thread 7f2911768780

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: /usr/bin/ceph-mon() [0x59471a]
 2: (()+0xfcb0) [0x7f291134bcb0]
 3: (gsignal()+0x35) [0x7f290f836425]
 4: (abort()+0x17b) [0x7f290f839b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f291018869d]
 6: (()+0xb5846) [0x7f2910186846]
 7: (()+0xb5873) [0x7f2910186873]
 8: (()+0xb596e) [0x7f291018696e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64d7df]
 10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 11: (main()+0x799) [0x48b0f9]
 12: (__libc_start_main()+0xed) [0x7f290f82176d]
 13: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2013-04-26 14:00:27.088784 7f2911768780 -1 *** Caught signal (Aborted) **
 in thread 7f2911768780

 ceph version 0.60-684-gebbdef2 (ebbdef29fa1d4e7f466ab3aa7197e851320fd6b4)
 1: /usr/bin/ceph-mon() [0x59471a]
 2: (()+0xfcb0) [0x7f291134bcb0]
 3: (gsignal()+0x35) [0x7f290f836425]
 4: (abort()+0x17b) [0x7f290f839b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f291018869d]
 6: (()+0xb5846) [0x7f2910186846]
 7: (()+0xb5873) [0x7f2910186873]
 8: (()+0xb596e) [0x7f291018696e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64d7df]
 10: (Monitor::StoreConverter::needs_conversion()+0x77e) [0x4a6ebe]
 11: (main()+0x799) [0x48b0f9]
 12: (__libc_start_main()+0xed) [0x7f290f82176d]
 13: /usr/bin/ceph-mon() [0x48e47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

bash: line 1: 43921 Aborted                 (core dumped) /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on burnupi45...

leaving the test machines in the current state.

#13 Updated by Tamilarasi muthamizhan almost 11 years ago

Upgraded the osds and mds as well, but the monitors are still stuck. Only one of the monitors seems to be up.

ubuntu@burnupi39:/etc/ceph$ sudo service ceph -a status
=== mon.a ===
mon.a: running {"version":"0.60-684-gebbdef2"}
=== mon.b ===
mon.b: not running.
=== mon.c ===
mon.c: not running.
=== mds.a ===
mds.a: running {"version":"0.60-686-ge0c39c1"}
=== osd.1 ===
osd.1: running {"version":"0.60-684-gebbdef2"}
=== osd.2 ===
osd.2: running {"version":"0.60-684-gebbdef2"}
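
Status output of this shape can be reduced to a list of daemons that are down. A small Python sketch, assuming the "<daemon>: running" / "<daemon>: not running." lines shown in this comment; the function name daemon_status is hypothetical.

```python
import re

# Matches e.g. 'mon.b: not running.' or 'mon.a: running {...}'.
STATUS_RE = re.compile(r"(\w+\.\w+): (not running|running)")


def daemon_status(text):
    """Return {daemon: True if running, False otherwise}."""
    return {name: state == "running" for name, state in STATUS_RE.findall(text)}


sample = """\
=== mon.a ===
mon.a: running {"version":"0.60-684-gebbdef2"}
=== mon.b ===
mon.b: not running.
=== mon.c ===
mon.c: not running.
=== mds.a ===
mds.a: running {"version":"0.60-686-ge0c39c1"}
"""

down = [name for name, up in daemon_status(sample).items() if not up]
print(down)
```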

#14 Updated by Greg Farnum almost 11 years ago

  • Status changed from New to In Progress
  • Assignee set to Tamilarasi muthamizhan

Looked at this briefly and am having Tamil check it again. From the logs it appears the monitors never formed a quorum while running the Bobtail code, so she's going to check and make sure they manage to do that.

#15 Updated by Tamilarasi muthamizhan almost 11 years ago

  • Assignee changed from Tamilarasi muthamizhan to Greg Farnum

Greg, I checked that, and now I am hitting this on only one monitor [mon.c on burnupi45].

Leaving the test machines burnupi39 and burnupi45 for you to take a look.

#16 Updated by Greg Farnum almost 11 years ago

Hrm, the store for mon.c has the global versions, but for some reason the feature_set on disk hasn't been updated. Going to review that code.

#17 Updated by Greg Farnum almost 11 years ago

Okay, this is actually #4858 — not calling reset() meant we weren't clearing out the paxos_recovered member, so the GV stuff wasn't getting activated.

#18 Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to Resolved

Resolving this because the actual bug is broader.

#19 Updated by Tamilarasi muthamizhan almost 11 years ago

  • Status changed from Resolved to In Progress

Reopening the bug.
Hit this again on burnupi39 and burnupi45. Greg is already looking into it.

#20 Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to Fix Under Review

I've managed to reproduce this locally just using vstart. It appears that we haven't actually been setting the MMonElection::quorum_features member, which is relied on when the peons are deciding what to write down. This means only the leaders have been changing their on-disk feature sets. We'll need a point release to address it, I'm afraid, if we want to do so cleanly. (We can also just tell people to delete and re-sync, but that's not really the message I want to send.) We probably haven't noticed because the GV stuff is turned on by default for newly-created bobtail clusters.

wip-4747 needs a quick review and merge; Tamil, you can test wip-4747-backport instead of a bobtail release if you like.
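
The failure mode described in this comment can be illustrated with a toy model: the leader updates its own on-disk feature set, but peons persist only what the victory message carries, so an empty quorum_features field leaves them unchanged. This Python sketch is an illustration of that description only. Apart from the quorum_features name, every class, function and value here is hypothetical and does not mirror the real MMonElection or Monitor code.

```python
from dataclasses import dataclass, field

GV = "gv"  # hypothetical label for the global-version feature


@dataclass
class VictoryMessage:
    quorum: list
    quorum_features: set = field(default_factory=set)


@dataclass
class Mon:
    name: str
    on_disk_features: set = field(default_factory=set)

    def handle_victory(self, msg):
        # A peon writes down only what the victory message tells it.
        self.on_disk_features |= msg.quorum_features


def declare_victory(leader, peons, quorum_features, send_features):
    # The leader always records the features of the quorum it formed.
    leader.on_disk_features |= quorum_features
    msg = VictoryMessage(
        quorum=[leader.name] + [p.name for p in peons],
        # send_features=False models the bug: the field is left unset.
        quorum_features=set(quorum_features) if send_features else set(),
    )
    for peon in peons:
        peon.handle_victory(msg)


def run(send_features):
    leader, peons = Mon("leader"), [Mon("peon1"), Mon("peon2")]
    declare_victory(leader, peons, {GV}, send_features)
    return {m.name: GV in m.on_disk_features for m in [leader] + peons}


print(run(send_features=False))  # only the leader records the feature
print(run(send_features=True))   # every monitor records the feature
```

In the first run only the leader ends up with the feature on disk, which matches the observation that only leaders had been changing their on-disk feature sets.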

#21 Updated by Ian Colle almost 11 years ago

  • Backport set to bobtail

#22 Updated by Greg Farnum almost 11 years ago

  • Status changed from Fix Under Review to Resolved

I tested it with vstart upgrades and all looks good. Pushed the fix to "next" and backported to "bobtail".
