Bug #4811 (closed): incorrect shutdown: mon/MonMap.h: 160: FAILED assert(mon_addr.count(n))

Added by Denis kaganovich about 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: Development
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

The monitor is unable to start after an unclean shutdown. This first happened (twice) on a busy node after swapoff -a; on an older version the monitor simply could not initialize for a long time. Now, after a power loss, the latest "next" build hits this assertion after several attempts to run.

Resolved by recreating the disconnected monitor: removed store.db, then
ceph mon getmap -o /tmp/monmap && ceph-mon -i 4 --mkfs --monmap /tmp/monmap
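
For anyone hitting the same assert, here is a minimal sketch of that recovery sequence (the mon id and the two commands are from this report; the data directory, backup step, and init-script calls are assumptions based on default Ceph paths):

# stop the broken monitor and set its store aside instead of deleting it outright
/etc/init.d/ceph stop mon.4
mv /var/lib/ceph/mon/ceph-4/store.db /var/lib/ceph/mon/ceph-4/store.db.bad

# fetch the current monmap from the surviving quorum and rebuild the mon store
# (the mon keyring in the data dir is left in place)
ceph mon getmap -o /tmp/monmap
ceph-mon -i 4 --mkfs --monmap /tmp/monmap

# start it again and let it resync from its peers
/etc/init.d/ceph start mon.4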

IMHO this looks like an inconsistent paxos state that was never repaired.

Level 10 log attached.
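
For reference, a log at that level can be captured by running the monitor in the foreground with raised debug levels; a sketch (the mon id is from this report; the extra subsystems beyond debug mon = 10 are assumptions):

# run the mon in the foreground, logging to stderr
ceph-mon -i 4 -d --debug-mon 10 --debug-paxos 10 --debug-ms 1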


Files

ceph-mon.4.log (114 KB), Denis kaganovich, 04/25/2013 12:25 AM
ceph-mon.1.log (424 KB), Denis kaganovich, 04/25/2013 04:44 PM
ceph-mon.3.log (356 KB), Denis kaganovich, 04/25/2013 07:35 PM
#1 Updated by Denis kaganovich about 11 years ago

PS About the two previous failures: I suspect a RAM limit. The situation was: a busy, long-running node, swapoff -a (but a lot of RAM shown as "cached"; the monitor is not usually RAM-aggressive; I use tcmalloc and libatomic, and even disabled large pages for tcmalloc on this single busy node). Twice the (single) monitor disappeared without any messages anywhere; the ceph-mon process was simply gone at some point. After that the monitor was unable to start again in a similar manner. But the information above, about the power failure, I can now state with certainty.

#2 Updated by Denis kaganovich about 11 years ago

Sorry for the flood (it is morning here):

About the last failure (on a different node than the first two): between power-on and generating this log, the node slowed down and quickly froze (it still answered ping) and had to be rebooted via the power button. IMHO something was eating RAM/swap. Possibly (but I am not sure) this bug was preceded by memory exhaustion, probably from the monitor (there was no time to look it up). (Swap is now off on all 3 nodes.)

#3 Updated by Ian Colle about 11 years ago

  • Priority changed from Normal to Urgent
#4 Updated by Greg Farnum about 11 years ago

  • Category set to Monitor
#5 Updated by Sage Weil about 11 years ago

Just to clarify: this failed startup is happening only after you did the manual repair (remove store.db, replace monmap)?

We want to get to the bottom of the memory usage piece, either way.

what is the output of 'ldd /usr/bin/ceph-mon'?

#6 Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
#7 Updated by Denis kaganovich about 11 years ago

Sage Weil wrote:

> Just to clarify: this failed startup is happening only after you did the manual repair (remove store.db, replace monmap)?

No. The order was:
1) memory bomb (not 100% sure about this)
2) failed startup
3) manual repair: re-created the monitor
4) the recreated monitor started fine

> We want to get to the bottom of the memory usage piece, either way.

?

> what is the output of 'ldd /usr/bin/ceph-mon'?

From the last failed node:

linux-vdso.so.1 (0x00007fffff7fe000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa208dc2000)
libcrypto++.so.0 => /usr/lib64/libcrypto++.so.0 (0x00007fa208644000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fa20843f000)
librt.so.1 => /lib64/librt.so.1 (0x00007fa208236000)
libtcmalloc.so.4 => /usr/lib64/libtcmalloc.so.4 (0x00007fa207fba000)
libboost_system.so.1.52.0 => /usr/lib64/libboost_system.so.1.52.0 (0x00007fa207db6000)
libleveldb.so.1 => /usr/lib64/libleveldb.so.1 (0x00007fa207b48000)
libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6 (0x00007fa207843000)
libm.so.6 => /lib64/libm.so.6 (0x00007fa207535000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libgcc_s.so.1 (0x00007fa20731f000)
libc.so.6 => /lib64/libc.so.6 (0x00007fa206f03000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa208fdf000)
libunwind.so.8 => /usr/lib64/libunwind.so.8 (0x00007fa206ce5000)
libsnappy.so.1 => /usr/lib64/libsnappy.so.1 (0x00007fa206add000)
#8 Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to Resolved
#9 Updated by Denis kaganovich about 11 years ago

Sage Weil wrote:

> 3ce35a6743e050bf0de5abd5ad32f522c5664f3d

Hmm. It now starts fine, but silently collapses soon after, with nothing more than this in the log (one-line example):

2013-04-26 02:15:29.980021 7f079744e700 0 -- 10.227.227.101:6789/0 >> 10.227.227.104:0/2711594300 pipe(0x6beac80 sd=27 :6789 s=0 pgs=0 cs=0 l=0).accept peer addr is really 10.227.227.104:0/2711594300 (socket is 10.227.227.104:53382/0)

#10 Updated by Denis kaganovich about 11 years ago

And even "killall ceph-mon -w" waits for a long time (or forever)...

#11 Updated by Sage Weil about 11 years ago

Can you attach the new startup log?

#12 Updated by Denis kaganovich about 11 years ago

This mon is now not in quorum (but still running).

#13 Updated by Denis kaganovich about 11 years ago

PS The first time it came up HEALTH_OK, without a single restart.

#14 Updated by Denis kaganovich about 11 years ago

"killall ceph-mon -w" need twice.

#15 Updated by Sage Weil about 11 years ago

Hmm, according to the log it is in quorum (and the leader) and healthy...

#16 Updated by Denis kaganovich about 11 years ago

Now everything is wrong, but at least there is this log.

#17 Updated by Denis kaganovich about 11 years ago

IMHO the problems are prior to this commit. For now I am happy to have f4804849b7644f2c1dfd92404682f510a88e9a23 working, and I am going to sleep.

#18 Updated by Denis kaganovich about 11 years ago

The post-problems are IMHO solved... by upgrading gcc to 4.8.0 and adding -fno-aggressive-loop-optimizations (but this is black magic and may be a false lead). If anybody wants to read the 3 logs from before ceph-mon silently got stuck: http://mahatma.bspu.unibel.by/download/transit/stuck-mon.tar.bz2

It looks like a dangerous loop appeared after commit f4804849b7644f2c1dfd92404682f510a88e9a23. But since optimization problems are usually not considered a bug, this is just FYI, and I am done here.
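
For context, a sketch of how such a flag change is typically applied on a Gentoo system like the one described here (the flag and gcc version are from this report; the make.conf contents, gcc-config profile name, and package atom are assumptions):

# /etc/portage/make.conf: -fno-aggressive-loop-optimizations only exists in gcc >= 4.8
CFLAGS="-O2 -fno-aggressive-loop-optimizations"
CXXFLAGS="${CFLAGS}"

# select the new compiler, then rebuild ceph with the new flags
gcc-config x86_64-pc-linux-gnu-4.8.0 && env-update && source /etc/profile
emerge --oneshot sys-cluster/ceph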

#19 Updated by Greg Farnum about 11 years ago

That commit doesn't touch the monitor code, and I don't believe those osd types are used in the monitor either. What makes you think that commit has the problem?

#20 Updated by Denis kaganovich about 11 years ago

It is not a single commit, and not the commit for THIS bug. I reliably got a stuck mon (IMHO 100%: on the last 3 of 3 attempts, not the second) that needed "kill" twice. In the archive at the URL you can see nearly identical (but not exactly equal) log sizes from serial start/kill runs. After reverting to f4804849b7644f2c1dfd92404682f510a88e9a23 (simply the previous commit that worked for me), in the same environment and with the same CFLAGS, the mon gets past this place and keeps working. After the GCC changes described above (and with libatomic disabled for now, though IMHO that is unrelated), both nodes that showed the symptoms have been working for about a day without problems. A sketch of this revert test follows below.

So that commit is good; something after it went wrong, but possibly only at higher optimization levels on GCC 4.6.3 and 4.7.2 (although I build 99% of my Gentoo packages with these flags, even sanitized ones).

I cannot be sure (since I never pinned down the precise point of failure), but the first hang came within tens of minutes at most on a healthy cluster, so there is a high probability that these are the right bug bounds. I could try reverting to the old GCC and CFLAGS, but I prefer to keep the cluster healthy (at least for the statistics).

PS I know I could diff the trees and look for new suspicious loops, but for now I am satisfied with the idea that this flag is the safe choice.
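
For completeness, a sketch of that revert test, assuming a git checkout of the ceph source and the autotools build in use at the time (the path, job count, and restart command are assumptions; the commit hash is from this report):

cd /usr/src/ceph
git checkout f4804849b7644f2c1dfd92404682f510a88e9a23   # last commit known to work here
./autogen.sh && ./configure && make -j4
make install
/etc/init.d/ceph restart mon.4   # then watch whether the mon gets stuck again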

#21 Updated by Denis kaganovich almost 11 years ago

About the dead loop: opened #4845.
