Bug #4811 (closed): incorrect shutdown: mon/MonMap.h: 160: FAILED assert(mon_addr.count(n))

Added by Denis kaganovich about 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: Development
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

The monitor is unable to start after an unclean shutdown. This first happened (twice) on a busy node after swapoff -a; on an older version the monitor simply could not initialize for a long time. Now, after a power loss, the latest "next" build hits this assertion after several attempts to run.

Resolved by recreating the disconnected monitor: removed store.db, then
ceph mon getmap -o /tmp/monmap && ceph-mon -i 4 --mkfs --monmap /tmp/monmap
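
For anyone hitting the same assert, here is a minimal sketch of that recovery sequence (the mon id and the two commands are from this report; the data directory, backup step, and init-script calls are assumptions based on default Ceph paths):

# stop the broken monitor and set its store aside instead of deleting it outright
/etc/init.d/ceph stop mon.4
mv /var/lib/ceph/mon/ceph-4/store.db /var/lib/ceph/mon/ceph-4/store.db.bad

# fetch the current monmap from the surviving quorum and rebuild the mon store
# (the mon keyring in the data dir is left in place)
ceph mon getmap -o /tmp/monmap
ceph-mon -i 4 --mkfs --monmap /tmp/monmap

# start it again and let it resync from its peers
/etc/init.d/ceph start mon.4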

IMHO this looks like an inconsistent paxos state that was never repaired.

Level 10 log attached.
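
For reference, a log at that level can be captured by running the monitor in the foreground with raised debug levels; a sketch (the mon id is from this report; the extra subsystems beyond debug mon = 10 are assumptions):

# run the mon in the foreground, logging to stderr
ceph-mon -i 4 -d --debug-mon 10 --debug-paxos 10 --debug-ms 1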


Files

ceph-mon.4.log (114 KB), Denis kaganovich, 04/25/2013 12:25 AM
ceph-mon.1.log (424 KB), Denis kaganovich, 04/25/2013 04:44 PM
ceph-mon.3.log (356 KB), Denis kaganovich, 04/25/2013 07:35 PM
#1 Updated by Denis kaganovich about 11 years ago

PS About the two previous failures: I suspect a RAM limit. The situation was: a busy, long-running node, swapoff -a (but a lot of RAM shown as "cached"; the monitor is not usually RAM-aggressive; I use tcmalloc and libatomic, and even disabled large pages for tcmalloc on this single busy node). Twice the (single) monitor disappeared without any messages anywhere; the ceph-mon process was simply gone at some point. After that the monitor was unable to start again in a similar manner. But the information above, about the power failure, I can now state with certainty.

#2 Updated by Denis kaganovich about 11 years ago

Sorry for the flood (it is morning here):

About the last failure (on a different node than the first two): between power-on and generating this log, the node slowed down and quickly froze (it still answered ping) and had to be rebooted via the power button. IMHO something was eating RAM/swap. Possibly (but I am not sure) this bug was preceded by memory exhaustion, probably from the monitor (there was no time to look it up). (Swap is now off on all 3 nodes.)

#3 Updated by Ian Colle about 11 years ago

  • Priority changed from Normal to Urgent
#4 Updated by Greg Farnum about 11 years ago

  • Category set to Monitor
#5 Updated by Sage Weil about 11 years ago

Just to clarify: this failed startup is happening only after you did the manual repair (remove store.db, replace monmap)?

We want to get to the bottom of the memory usage piece, either way.

what is the output of 'ldd /usr/bin/ceph-mon'?

#6 Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
#7 Updated by Denis kaganovich about 11 years ago

Sage Weil wrote:

> Just to clarify: this failed startup is happening only after you did the manual repair (remove store.db, replace monmap)?

No. The order was:
1) memory bomb (not 100% sure about this)
2) failed startup
3) manual repair: re-created the monitor
4) the recreated monitor started fine

> We want to get to the bottom of the memory usage piece, either way.

?

> what is the output of 'ldd /usr/bin/ceph-mon'?

From the last failed node:

linux-vdso.so.1 (0x00007fffff7fe000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa208dc2000)
libcrypto++.so.0 => /usr/lib64/libcrypto++.so.0 (0x00007fa208644000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fa20843f000)
librt.so.1 => /lib64/librt.so.1 (0x00007fa208236000)
libtcmalloc.so.4 => /usr/lib64/libtcmalloc.so.4 (0x00007fa207fba000)
libboost_system.so.1.52.0 => /usr/lib64/libboost_system.so.1.52.0 (0x00007fa207db6000)
libleveldb.so.1 => /usr/lib64/libleveldb.so.1 (0x00007fa207b48000)
libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6 (0x00007fa207843000)
libm.so.6 => /lib64/libm.so.6 (0x00007fa207535000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libgcc_s.so.1 (0x00007fa20731f000)
libc.so.6 => /lib64/libc.so.6 (0x00007fa206f03000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa208fdf000)
libunwind.so.8 => /usr/lib64/libunwind.so.8 (0x00007fa206ce5000)
libsnappy.so.1 => /usr/lib64/libsnappy.so.1 (0x00007fa206add000)
#8 Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to Resolved
#9 Updated by Denis kaganovich about 11 years ago

Sage Weil wrote:

> 3ce35a6743e050bf0de5abd5ad32f522c5664f3d

Hmm. It now starts fine, but silently collapses soon after, with nothing more than this in the log (one-line example):

2013-04-26 02:15:29.980021 7f079744e700 0 -- 10.227.227.101:6789/0 >> 10.227.227.104:0/2711594300 pipe(0x6beac80 sd=27 :6789 s=0 pgs=0 cs=0 l=0).accept peer addr is really 10.227.227.104:0/2711594300 (socket is 10.227.227.104:53382/0)

#10 Updated by Denis kaganovich about 11 years ago

And even "killall ceph-mon -w" waits for a long time (or forever)...

#11 Updated by Sage Weil about 11 years ago

Can you attach the new startup log?

#12 Updated by Denis kaganovich about 11 years ago

This mon is now not in quorum (but still running).

#13 Updated by Denis kaganovich about 11 years ago

PS The first time it came up HEALTH_OK, without a single restart.

#14 Updated by Denis kaganovich about 11 years ago

"killall ceph-mon -w" need twice.

#15 Updated by Sage Weil about 11 years ago

Hmm, according to the log it is in quorum (and the leader) and healthy...

#16 Updated by Denis kaganovich about 11 years ago

Now everything is wrong, but at least there is this log.

#17 Updated by Denis kaganovich about 11 years ago

IMHO the problems are prior to this commit. For now I am happy to have f4804849b7644f2c1dfd92404682f510a88e9a23 working, and I am going to sleep.

#18 Updated by Denis kaganovich about 11 years ago

The post-problems are IMHO solved... by upgrading gcc to 4.8.0 and adding -fno-aggressive-loop-optimizations (but this is black magic and may be a false lead). If anybody wants to read the 3 logs from before ceph-mon silently got stuck: http://mahatma.bspu.unibel.by/download/transit/stuck-mon.tar.bz2

It looks like a dangerous loop appeared after commit f4804849b7644f2c1dfd92404682f510a88e9a23. But since optimization problems are usually not considered a bug, this is just FYI, and I am done here.
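
For context, a sketch of how such a flag change is typically applied on a Gentoo system like the one described here (the flag and gcc version are from this report; the make.conf contents, gcc-config profile name, and package atom are assumptions):

# /etc/portage/make.conf: -fno-aggressive-loop-optimizations only exists in gcc >= 4.8
CFLAGS="-O2 -fno-aggressive-loop-optimizations"
CXXFLAGS="${CFLAGS}"

# select the new compiler, then rebuild ceph with the new flags
gcc-config x86_64-pc-linux-gnu-4.8.0 && env-update && source /etc/profile
emerge --oneshot sys-cluster/ceph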

#19 Updated by Greg Farnum about 11 years ago

That commit doesn't touch the monitor code, and I don't believe those osd types are used in the monitor either. What makes you think that commit has the problem?

#20 Updated by Denis kaganovich about 11 years ago

It is not a single commit, and not the commit for THIS bug. I reliably got a stuck mon (IMHO 100%: on the last 3 of 3 attempts, not the second) that needed "kill" twice. In the archive at the URL you can see nearly identical (but not exactly equal) log sizes from serial start/kill runs. After reverting to f4804849b7644f2c1dfd92404682f510a88e9a23 (simply the previous commit that worked for me), in the same environment and with the same CFLAGS, the mon gets past this place and keeps working. After the GCC changes described above (and with libatomic disabled for now, though IMHO that is unrelated), both nodes that showed the symptoms have been working for about a day without problems. A sketch of this revert test follows below.

So that commit is good; something after it went wrong, but possibly only at higher optimization levels on GCC 4.6.3 and 4.7.2 (although I build 99% of my Gentoo packages with these flags, even sanitized ones).

I cannot be sure (since I never pinned down the precise point of failure), but the first hang came within tens of minutes at most on a healthy cluster, so there is a high probability that these are the right bug bounds. I could try reverting to the old GCC and CFLAGS, but I prefer to keep the cluster healthy (at least for the statistics).

PS I know I could diff the trees and look for new suspicious loops, but for now I am satisfied with the idea that this flag is the safe choice.
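
For completeness, a sketch of that revert test, assuming a git checkout of the ceph source and the autotools build in use at the time (the path, job count, and restart command are assumptions; the commit hash is from this report):

cd /usr/src/ceph
git checkout f4804849b7644f2c1dfd92404682f510a88e9a23   # last commit known to work here
./autogen.sh && ./configure && make -j4
make install
/etc/init.d/ceph restart mon.4   # then watch whether the mon gets stuck again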

#21 Updated by Denis kaganovich almost 11 years ago

About the dead loop: opened #4845.
