Bug #5640
closedmon: FAILED assert(!db->create_and_open(std::cerr)) on 0.61.4
0%
Description
I've been unable to reproduce this, but it kept happening on a user's cluster while trying to upgrade from bobtail to cuttlefish 0.61.4.
2013-07-11 12:35:17.446679 7fefea7ff780 0 ceph version 0.61.4-42-gb684fcf (b684fcfd7a47001286548e28d99e3f920e9bab5b), process ceph-mon, pid 14273 2013-07-11 12:41:23.970419 7f84dc80b780 0 ceph version 0.61.4-42-gb684fcf (b684fcfd7a47001286548e28d99e3f920e9bab5b), process ceph-mon, pid 14865 2013-07-11 12:41:23.982740 7f84dc80b780 10 needs_conversion 2013-07-11 12:41:23.982764 7f84dc80b780 1 store(/etc/ceph/mon.a) mount 2013-07-11 12:41:23.982797 7f84dc80b780 15 store(/etc/ceph/mon.a) exists_bl magic 2013-07-11 12:41:23.982809 7f84dc80b780 20 _check_gv_store 2013-07-11 12:41:23.982811 7f84dc80b780 15 store(/etc/ceph/mon.a) exists_bl feature_set 2013-07-11 12:41:23.982827 7f84dc80b780 20 store(/etc/ceph/mon.a) reading at off 0 of 120 2013-07-11 12:41:23.982869 7f84dc80b780 15 store(/etc/ceph/mon.a) get_bl feature_set = 120 bytes 2013-07-11 12:41:23.982892 7f84dc80b780 1 found old GV monitor store format -- should convert! 2013-07-11 12:41:23.983958 7f84dc80b780 -1 mon/Monitor.cc: In function 'int Monitor::StoreConverter::convert()' thread 7f84dc80b780 time 2013-07-11 12:41:23.983152 mon/Monitor.cc: 4329: FAILED assert(!db->create_and_open(std::cerr)) ceph version 0.61.4-42-gb684fcf (b684fcfd7a47001286548e28d99e3f920e9bab5b) 1: (Monitor::StoreConverter::convert()+0x3d6) [0x4b2846] 2: (main()+0x16b4) [0x4750e4] 3: (__libc_start_main()+0xfd) [0x325a61ecdd] 4: /usr/bin/ceph-mon() [0x472769] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Running strace (attached) shows that store.db/LOCK was already being held when 'Monitor::StoreConvert::convert()' attempted to open the database:
[pid 15630] futex(0x2ed00ac, FUTEX_WAIT_PRIVATE, 13, NULL <unfinished ...> [pid 15626] munmap(0x7f73b8cd3000, 4096) = 0 [pid 15626] write(2, "IO error: lock /etc/ceph/mon.a/s"..., 69IO error: lock /etc/ceph/mon.a/store.db/LOCK: already held by process) = 69 [pid 15626] write(2, "\n", 1 ) = 1 [pid 15626] futex(0x325a990600, FUTEX_WAKE_PRIVATE, 2147483647) = 0 [pid 15626] futex(0x325c015a90, FUTEX_WAKE_PRIVATE, 2147483647) = 0 [pid 15626] write(2, "mon/Monitor.cc: In function 'int"..., 190mon/Monitor.cc: In function 'int Monitor::StoreConverter::convert()' thread 7f73b8cbf780 time 2013-07-11 13:00:13.261293 mon/Monitor.cc: 4329: FAILED assert(!db->create_and_open(std::cerr)) ) = 190
We're now going to resort to analysing the code.
Files
Updated by Sage Weil almost 11 years ago
it sounds like any lines like
mon/Monitor.cc: 4329: FAILED assert(!db->create_and_open(std::cerr))
are the culprits. we want exactly one open() or create_and_open() in any execution path, since leveldb doesn't behave well with start/stop/start cycles.
Updated by Joao Eduardo Luis almost 11 years ago
Yeah, from the strace it sure looks a lot like a race between an open(store.db/LOCK), and a fcntl(F_SETLK, F_UNLCK) followed by a close() on a previous open(store.db/LOCK), either caused by us with all the opening/closing or something persisting on the leveldb side.
Updated by Joao Eduardo Luis almost 11 years ago
sigh we already have opened the db at that point. I'll push the fix as soon as I finish reviewing the surrounding code.
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from In Progress to Fix Under Review
pushed wip-5640
Updated by Sage Weil almost 11 years ago
- Status changed from Fix Under Review to Resolved