Bug #9816 (closed)
mon exits unexpectedly and gracefully
Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -
Description
ceph version 0.86-431-gabeaee4 (abeaee4cce470a4d0c289db21e7ef4d59bdf0f68)
...
2014-10-19 00:02:01.213777 7fa90e415700 10 mon.a@0(leader).pg v6 send_pg_creates to 64 pgs
2014-10-19 00:02:01.213781 7fa90e415700 10 mon.a@0(leader).osd e6 update_logger
2014-10-19 00:02:01.213777 7fa90a8ad700 10 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer: state = open policy.server=1
2014-10-19 00:02:01.213788 7fa90e415700 0 log_channel(cluster) log [INF] : osdmap e6: 0 osds: 0 up, 0 in
2014-10-19 00:02:01.213789 7fa90a8ad700 20 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer sleeping
2014-10-19 00:02:01.213809 7fa90e415700 1 -- 127.0.0.1:7104/0 --> 127.0.0.1:7104/0 -- log(1 entries) v1 -- ?+0 0x5b38fc0 con 0x5b50580
2014-10-19 00:02:01.213816 7fa90e415700 20 -- 127.0.0.1:7104/0 submit_message log(1 entries) v1 local
2014-10-19 00:02:01.213834 7fa90cb22700 20 mon.a@0(leader) e1 have connection
2014-10-19 00:02:01.213839 7fa90cb22700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon.0 127.0.0.1:7104/0 is open allow * for mon.0 127.0.0.1:7104/0
2014-10-19 00:02:01.213849 7fa90cb22700 20 mon.a@0(leader) e1 caps allow *
2014-10-19 00:02:01.213852 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15) dispatch log(1 entries) v1 from mon.0 127.0.0.1:7104/0
2014-10-19 00:02:01.213861 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15) waiting for paxos -> readable (v0)
2014-10-19 00:02:01.213870 7fa90cb22700 20 -- 127.0.0.1:7104/0 done calling dispatch on 0x5b39d40
2014-10-19 00:02:01.213885 7fa90cb22700 1 -- 127.0.0.1:7104/0 <== mon.0 127.0.0.1:7104/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x5b38fc0 con 0x5b50580
2014-10-19 00:02:01.213899 7fa90cb22700 20 mon.a@0(leader) e1 have connection
2014-10-19 00:02:01.213903 7fa90cb22700 20 run: 34: return 1
This happened after four hours of running make -j8 check in a loop. The bad news is that it happened only once in dozens of runs; the good news is that we have the full logs of all operations and of the mon.
...
TEST_crush_rule_create_erasure: 98: ./ceph osd crush rule rm ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
TEST_crush_rule_create_erasure: 99: ./ceph osd crush rule ls
TEST_crush_rule_create_erasure: 99: grep ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
TEST_crush_rule_create_erasure: 103: ./ceph osd erasure-code-profile rm default
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:02:01.217487 7f4f72f5d700 0 monclient: hunting for new mon
2014-10-19 00:02:01.217729 7f4f704c7700 0 -- 127.0.0.1:0/1026209 >> 127.0.0.1:7104/0 pipe(0x7f4f68003a50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4f68007cf0).fault
TEST_crush_rule_create_erasure: 104: ./ceph osd erasure-code-profile ls
TEST_crush_rule_create_erasure: 104: grep default
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:02:01.458110 7f6e50395700 0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4c00c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e4c0102f0).fault
...
2014-10-19 00:06:58.507723 7f6e50395700 0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4001d300 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e40013880).fault
2014-10-19 00:07:01.458306 7f6e53dcc700 0 monclient(hunting): authenticate timed out after 300
2014-10-19 00:07:01.458369 7f6e53dcc700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
TEST_crush_rule_create_erasure: 105: ./ceph osd crush rule create-erasure ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:07:01.765353 7f8f2422d700 0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f2000c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f200102f0).fault
...
2014-10-19 00:11:58.814682 7f8f2412c700 0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f140008c0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f14014640).fault
2014-10-19 00:12:01.765513 7f8f26bc2700 0 monclient(hunting): authenticate timed out after 300
2014-10-19 00:12:01.765573 7f8f26bc2700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
TEST_crush_rule_create_erasure: 105: return 1
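Since the failure appeared only once in dozens of runs, reproducing it means rerunning the suite in a loop that stops at the first failure and keeps that run's log. A minimal sketch of such a loop; the real command here would be make -j8 check, the helper name and log paths are made up for illustration:

```shell
#!/bin/sh
# Sketch: rerun a test command until it fails, keeping one log file per run,
# so the log of the failing run survives. 'run_until_failure' is a
# hypothetical helper; in the report the command would be 'make -j8 check'.
run_until_failure() {
    cmd=$1
    i=0
    while sh -c "$cmd" > "/tmp/check-$i.log" 2>&1; do
        i=$((i + 1))
    done
    echo "failed on run $i; log kept in /tmp/check-$i.log"
}

# demo stand-in for the test suite: passes twice, fails on the third run
rm -f /tmp/runs
demo='n=$(cat /tmp/runs 2>/dev/null || echo 0); n=$((n+1)); echo $n > /tmp/runs; [ $n -lt 3 ]'
run_until_failure "$demo"
```

Numbering runs from zero, the demo prints "failed on run 2" and leaves /tmp/check-2.log behind; with the real suite each iteration takes minutes, so a run like this can easily need the four hours mentioned above.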
Updated by Joao Eduardo Luis over 9 years ago
Just a hunch: it feels like you're capturing only stdout from the monitor, and the monitor may have hit the 'mon data avail crit' percentage and quit gracefully. The error message wouldn't show if you were capturing only stdout, and as far as I'm aware that's the only way the monitor will even quit gracefully.
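If that hunch is right, redirecting the monitor's stderr into the same log as its stdout would make the shutdown message visible. A minimal sketch of the redirection, assuming the shutdown notice goes to stderr; the daemon function, its messages, and the log path are hypothetical stand-ins:

```shell
#!/bin/sh
# Hypothetical stand-in for a daemon whose shutdown notice goes to stderr:
# with a plain '>' redirection that notice would never reach the log file.
MON_LOG=/tmp/mon.a.log    # hypothetical log path

run_daemon() {
    echo "mon.a starting"                                  # normal output, stdout
    echo "reached 'mon data avail crit', shutting down" >&2  # notice on stderr
}

# '2>&1' folds stderr into the same file, so the shutdown notice is captured
run_daemon > "$MON_LOG" 2>&1

grep -c "shutting down" "$MON_LOG"   # prints 1: the stderr line was kept
```

With only `> "$MON_LOG"` the grep would find nothing, which would match the symptom here: a monitor that exits gracefully while its test log shows no reason why.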
Updated by Loïc Dachary about 9 years ago
- Status changed from New to Can't reproduce