Project

General

Profile

Actions

Bug #9816

closed

mon exits unexpectedly and gracefully

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 0.86-431-gabeaee4 (abeaee4cce470a4d0c289db21e7ef4d59bdf0f68)
...
2014-10-19 00:02:01.213777 7fa90e415700 10 mon.a@0(leader).pg v6 send_pg_creates to 64 pgs
2014-10-19 00:02:01.213781 7fa90e415700 10 mon.a@0(leader).osd e6 update_logger
2014-10-19 00:02:01.213777 7fa90a8ad700 10 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer: state = open policy.server=1
2014-10-19 00:02:01.213788 7fa90e415700  0 log_channel(cluster) log [INF] : osdmap e6: 0 osds: 0 up, 0 in
2014-10-19 00:02:01.213789 7fa90a8ad700 20 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer sleeping
2014-10-19 00:02:01.213809 7fa90e415700  1 -- 127.0.0.1:7104/0 --> 127.0.0.1:7104/0 -- log(1 entries) v1 -- ?+0 0x5b38fc0 con 0x5b50580
2014-10-19 00:02:01.213816 7fa90e415700 20 -- 127.0.0.1:7104/0 submit_message log(1 entries) v1 local
2014-10-19 00:02:01.213834 7fa90cb22700 20 mon.a@0(leader) e1 have connection
2014-10-19 00:02:01.213839 7fa90cb22700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon.0 127.0.0.1:7104/0 is openallow * for mon.0 127.0.0.1:7104/0
2014-10-19 00:02:01.213849 7fa90cb22700 20 mon.a@0(leader) e1  caps allow *
2014-10-19 00:02:01.213852 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15) dispatch log(1 entries) v1 from mon.0 127.0.0.1:7104/0
2014-10-19 00:02:01.213861 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15)  waiting for paxos -> readable (v0)
2014-10-19 00:02:01.213870 7fa90cb22700 20 -- 127.0.0.1:7104/0 done calling dispatch on 0x5b39d40
2014-10-19 00:02:01.213885 7fa90cb22700  1 -- 127.0.0.1:7104/0 <== mon.0 127.0.0.1:7104/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x5b38fc0 con 0x5b50580
2014-10-19 00:02:01.213899 7fa90cb22700 20 mon.a@0(leader) e1 have connection
2014-10-19 00:02:01.213903 7fa90cb22700 20 run: 34: return 1

This happened after four hours of running make -j8 check in a loop. What's bad is that it happened only once in dozens of runs. The good part is that we have the full logs of all operations and the mon.
...
TEST_crush_rule_create_erasure: 98: ./ceph osd crush rule rm ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
TEST_crush_rule_create_erasure: 99: ./ceph osd crush rule ls
TEST_crush_rule_create_erasure: 99: grep ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
TEST_crush_rule_create_erasure: 103: ./ceph osd erasure-code-profile rm default
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:02:01.217487 7f4f72f5d700  0 monclient: hunting for new mon
2014-10-19 00:02:01.217729 7f4f704c7700  0 -- 127.0.0.1:0/1026209 >> 127.0.0.1:7104/0 pipe(0x7f4f68003a50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4f68007cf0).fault
TEST_crush_rule_create_erasure: 104: ./ceph osd erasure-code-profile ls
TEST_crush_rule_create_erasure: 104: grep default
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:02:01.458110 7f6e50395700  0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4c00c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e4c0102f0).fault
...
2014-10-19 00:06:58.507723 7f6e50395700  0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4001d300 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e40013880).fault
2014-10-19 00:07:01.458306 7f6e53dcc700  0 monclient(hunting): authenticate timed out after 300
2014-10-19 00:07:01.458369 7f6e53dcc700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
TEST_crush_rule_create_erasure: 105: ./ceph osd crush rule create-erasure ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:07:01.765353 7f8f2422d700  0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f2000c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f200102f0).fault
...
2014-10-19 00:11:58.814682 7f8f2412c700  0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f140008c0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f14014640).fault
2014-10-19 00:12:01.765513 7f8f26bc2700  0 monclient(hunting): authenticate timed out after 300
2014-10-19 00:12:01.765573 7f8f26bc2700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
TEST_crush_rule_create_erasure: 105: return 1

Files

osd-crush.sh.log.gz (77.6 KB) osd-crush.sh.log.gz mon tests and mon logs Loïc Dachary, 10/19/2014 08:12 AM
Actions #1

Updated by Joao Eduardo Luis over 9 years ago

just a hunch: feels like you're capturing only stdout from the monitor, and the monitor may have hit the 'mon data avail crit' percentage and gracefully quit -- error message wouldn't show if you were only capturing stdout and as far as I'm aware that's the only way the monitor will even quit gracefully.

Actions #2

Updated by Loïc Dachary about 9 years ago

  • Status changed from New to Can't reproduce
Actions

Also available in: Atom PDF