Project

General

Profile

Actions

Bug #5640

closed

mon: FAILED assert(!db->create_and_open(std::cerr)) on 0.61.4

Added by Joao Eduardo Luis almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Normal
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've been unable to reproduce this, but it kept happening on a user's cluster while trying to upgrade from bobtail to cuttlefish 0.61.4.

2013-07-11 12:35:17.446679 7fefea7ff780  0 ceph version 0.61.4-42-gb684fcf (b684fcfd7a47001286548e28d99e3f920e9bab5b), process ceph-mon, pid 14273
2013-07-11 12:41:23.970419 7f84dc80b780  0 ceph version 0.61.4-42-gb684fcf (b684fcfd7a47001286548e28d99e3f920e9bab5b), process ceph-mon, pid 14865
2013-07-11 12:41:23.982740 7f84dc80b780 10 needs_conversion
2013-07-11 12:41:23.982764 7f84dc80b780  1 store(/etc/ceph/mon.a) mount
2013-07-11 12:41:23.982797 7f84dc80b780 15 store(/etc/ceph/mon.a) exists_bl magic
2013-07-11 12:41:23.982809 7f84dc80b780 20 _check_gv_store
2013-07-11 12:41:23.982811 7f84dc80b780 15 store(/etc/ceph/mon.a) exists_bl feature_set
2013-07-11 12:41:23.982827 7f84dc80b780 20 store(/etc/ceph/mon.a) reading at off 0 of 120
2013-07-11 12:41:23.982869 7f84dc80b780 15 store(/etc/ceph/mon.a) get_bl feature_set = 120 bytes
2013-07-11 12:41:23.982892 7f84dc80b780  1 found old GV monitor store format -- should convert!
2013-07-11 12:41:23.983958 7f84dc80b780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::convert()' thread 7f84dc80b780 time 2013-07-11 12:41:23.983152
mon/Monitor.cc: 4329: FAILED assert(!db->create_and_open(std::cerr))

 ceph version 0.61.4-42-gb684fcf (b684fcfd7a47001286548e28d99e3f920e9bab5b)
 1: (Monitor::StoreConverter::convert()+0x3d6) [0x4b2846]
 2: (main()+0x16b4) [0x4750e4]
 3: (__libc_start_main()+0xfd) [0x325a61ecdd]
 4: /usr/bin/ceph-mon() [0x472769]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Running strace (attached) shows that store.db/LOCK was already being held when 'Monitor::StoreConvert::convert()' attempted to open the database:

[pid 15630] futex(0x2ed00ac, FUTEX_WAIT_PRIVATE, 13, NULL <unfinished ...>
[pid 15626] munmap(0x7f73b8cd3000, 4096) = 0
[pid 15626] write(2, "IO error: lock /etc/ceph/mon.a/s"..., 69IO error: lock /etc/ceph/mon.a/store.db/LOCK: already held by process) = 69
[pid 15626] write(2, "\n", 1
)           = 1
[pid 15626] futex(0x325a990600, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 15626] futex(0x325c015a90, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 15626] write(2, "mon/Monitor.cc: In function 'int"..., 190mon/Monitor.cc: In function 'int Monitor::StoreConverter::convert()' thread 7f73b8cbf780 time 2013-07-11 13:00:13.261293
mon/Monitor.cc: 4329: FAILED assert(!db->create_and_open(std::cerr))
) = 190

We're now going to resort to analysing the code.


Files

strace-LOCK.txt (60.2 KB) strace-LOCK.txt Joao Eduardo Luis, 07/16/2013 07:17 AM
Actions #1

Updated by Sage Weil almost 11 years ago

it sounds like any lines like

mon/Monitor.cc: 4329: FAILED assert(!db->create_and_open(std::cerr))

are the culprits. we want exactly one open() or create_and_open() in any execution path, since leveldb doesn't behave well with start/stop/start cycles.

Actions #2

Updated by Joao Eduardo Luis almost 11 years ago

Yeah, from the strace it sure looks a lot like a race between an open(store.db/LOCK), and a fcntl(F_SETLK, F_UNLCK) followed by a close() on a previous open(store.db/LOCK), either caused by us with all the opening/closing or something persisting on the leveldb side.

Actions #3

Updated by Joao Eduardo Luis almost 11 years ago

sigh we already have opened the db at that point. I'll push the fix as soon as I finish reviewing the surrounding code.

Actions #4

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from In Progress to Fix Under Review

pushed wip-5640

Actions #5

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF