Actions
Bug #9816
closedmon exits unexpectedly and gracefully
Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ceph version 0.86-431-gabeaee4 (abeaee4cce470a4d0c289db21e7ef4d59bdf0f68) ... 2014-10-19 00:02:01.213777 7fa90e415700 10 mon.a@0(leader).pg v6 send_pg_creates to 64 pgs 2014-10-19 00:02:01.213781 7fa90e415700 10 mon.a@0(leader).osd e6 update_logger 2014-10-19 00:02:01.213777 7fa90a8ad700 10 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer: state = open policy.server=1 2014-10-19 00:02:01.213788 7fa90e415700 0 log_channel(cluster) log [INF] : osdmap e6: 0 osds: 0 up, 0 in 2014-10-19 00:02:01.213789 7fa90a8ad700 20 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer sleeping 2014-10-19 00:02:01.213809 7fa90e415700 1 -- 127.0.0.1:7104/0 --> 127.0.0.1:7104/0 -- log(1 entries) v1 -- ?+0 0x5b38fc0 con 0x5b50580 2014-10-19 00:02:01.213816 7fa90e415700 20 -- 127.0.0.1:7104/0 submit_message log(1 entries) v1 local 2014-10-19 00:02:01.213834 7fa90cb22700 20 mon.a@0(leader) e1 have connection 2014-10-19 00:02:01.213839 7fa90cb22700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon.0 127.0.0.1:7104/0 is openallow * for mon.0 127.0.0.1:7104/0 2014-10-19 00:02:01.213849 7fa90cb22700 20 mon.a@0(leader) e1 caps allow * 2014-10-19 00:02:01.213852 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15) dispatch log(1 entries) v1 from mon.0 127.0.0.1:7104/0 2014-10-19 00:02:01.213861 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15) waiting for paxos -> readable (v0) 2014-10-19 00:02:01.213870 7fa90cb22700 20 -- 127.0.0.1:7104/0 done calling dispatch on 0x5b39d40 2014-10-19 00:02:01.213885 7fa90cb22700 1 -- 127.0.0.1:7104/0 <== mon.0 127.0.0.1:7104/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x5b38fc0 con 0x5b50580 2014-10-19 00:02:01.213899 7fa90cb22700 20 mon.a@0(leader) e1 have connection 2014-10-19 00:02:01.213903 7fa90cb22700 20 run: 34: return 1
This happened after four hours of running make -j8 check in a loop. What's bad is that it happened only once in dozens of runs. The good part is that we have the full logs of all operations and the mon.
... TEST_crush_rule_create_erasure: 98: ./ceph osd crush rule rm ruleset3 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** TEST_crush_rule_create_erasure: 99: ./ceph osd crush rule ls TEST_crush_rule_create_erasure: 99: grep ruleset3 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** TEST_crush_rule_create_erasure: 103: ./ceph osd erasure-code-profile rm default *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** 2014-10-19 00:02:01.217487 7f4f72f5d700 0 monclient: hunting for new mon 2014-10-19 00:02:01.217729 7f4f704c7700 0 -- 127.0.0.1:0/1026209 >> 127.0.0.1:7104/0 pipe(0x7f4f68003a50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4f68007cf0).fault TEST_crush_rule_create_erasure: 104: ./ceph osd erasure-code-profile ls TEST_crush_rule_create_erasure: 104: grep default *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** 2014-10-19 00:02:01.458110 7f6e50395700 0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4c00c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e4c0102f0).fault ... 2014-10-19 00:06:58.507723 7f6e50395700 0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4001d300 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e40013880).fault 2014-10-19 00:07:01.458306 7f6e53dcc700 0 monclient(hunting): authenticate timed out after 300 2014-10-19 00:07:01.458369 7f6e53dcc700 0 librados: client.admin authentication error (110) Connection timed out Error connecting to cluster: TimedOut TEST_crush_rule_create_erasure: 105: ./ceph osd crush rule create-erasure ruleset3 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** 2014-10-19 00:07:01.765353 7f8f2422d700 0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f2000c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f200102f0).fault ... 2014-10-19 00:11:58.814682 7f8f2412c700 0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f140008c0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f14014640).fault 2014-10-19 00:12:01.765513 7f8f26bc2700 0 monclient(hunting): authenticate timed out after 300 2014-10-19 00:12:01.765573 7f8f26bc2700 0 librados: client.admin authentication error (110) Connection timed out Error connecting to cluster: TimedOut TEST_crush_rule_create_erasure: 105: return 1
Files
Actions