Bug #9816: mon exits unexpectedly and gracefully - Ceph - Ceph

Actions

Copy link

Bug #9816

closed

mon exits unexpectedly and gracefully

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status:

Can't reproduce

Priority:

High

Assignee:

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

ceph version 0.86-431-gabeaee4 (abeaee4cce470a4d0c289db21e7ef4d59bdf0f68)
...
2014-10-19 00:02:01.213777 7fa90e415700 10 mon.a@0(leader).pg v6 send_pg_creates to 64 pgs
2014-10-19 00:02:01.213781 7fa90e415700 10 mon.a@0(leader).osd e6 update_logger
2014-10-19 00:02:01.213777 7fa90a8ad700 10 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer: state = open policy.server=1
2014-10-19 00:02:01.213788 7fa90e415700  0 log_channel(cluster) log [INF] : osdmap e6: 0 osds: 0 up, 0 in
2014-10-19 00:02:01.213789 7fa90a8ad700 20 -- 127.0.0.1:7104/0 >> 127.0.0.1:0/1026209 pipe(0x5db4800 sd=22 :7104 s=2 pgs=1 cs=1 l=1 c=0x5b52aa0).writer sleeping
2014-10-19 00:02:01.213809 7fa90e415700  1 -- 127.0.0.1:7104/0 --> 127.0.0.1:7104/0 -- log(1 entries) v1 -- ?+0 0x5b38fc0 con 0x5b50580
2014-10-19 00:02:01.213816 7fa90e415700 20 -- 127.0.0.1:7104/0 submit_message log(1 entries) v1 local
2014-10-19 00:02:01.213834 7fa90cb22700 20 mon.a@0(leader) e1 have connection
2014-10-19 00:02:01.213839 7fa90cb22700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon.0 127.0.0.1:7104/0 is openallow * for mon.0 127.0.0.1:7104/0
2014-10-19 00:02:01.213849 7fa90cb22700 20 mon.a@0(leader) e1  caps allow *
2014-10-19 00:02:01.213852 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15) dispatch log(1 entries) v1 from mon.0 127.0.0.1:7104/0
2014-10-19 00:02:01.213861 7fa90cb22700 10 mon.a@0(leader).paxosservice(logm 1..15)  waiting for paxos -> readable (v0)
2014-10-19 00:02:01.213870 7fa90cb22700 20 -- 127.0.0.1:7104/0 done calling dispatch on 0x5b39d40
2014-10-19 00:02:01.213885 7fa90cb22700  1 -- 127.0.0.1:7104/0 <== mon.0 127.0.0.1:7104/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x5b38fc0 con 0x5b50580
2014-10-19 00:02:01.213899 7fa90cb22700 20 mon.a@0(leader) e1 have connection
2014-10-19 00:02:01.213903 7fa90cb22700 20 run: 34: return 1

This happened after four hours of running make -j8 check in a loop. What's bad is that it happened only once in dozens of runs. The good part is that we have the full logs of all operations and the mon.

...
TEST_crush_rule_create_erasure: 98: ./ceph osd crush rule rm ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
TEST_crush_rule_create_erasure: 99: ./ceph osd crush rule ls
TEST_crush_rule_create_erasure: 99: grep ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
TEST_crush_rule_create_erasure: 103: ./ceph osd erasure-code-profile rm default
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:02:01.217487 7f4f72f5d700  0 monclient: hunting for new mon
2014-10-19 00:02:01.217729 7f4f704c7700  0 -- 127.0.0.1:0/1026209 >> 127.0.0.1:7104/0 pipe(0x7f4f68003a50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4f68007cf0).fault
TEST_crush_rule_create_erasure: 104: ./ceph osd erasure-code-profile ls
TEST_crush_rule_create_erasure: 104: grep default
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:02:01.458110 7f6e50395700  0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4c00c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e4c0102f0).fault
...
2014-10-19 00:06:58.507723 7f6e50395700  0 -- :/1026944 >> 127.0.0.1:7104/0 pipe(0x7f6e4001d300 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6e40013880).fault
2014-10-19 00:07:01.458306 7f6e53dcc700  0 monclient(hunting): authenticate timed out after 300
2014-10-19 00:07:01.458369 7f6e53dcc700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
TEST_crush_rule_create_erasure: 105: ./ceph osd crush rule create-erasure ruleset3
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-19 00:07:01.765353 7f8f2422d700  0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f2000c050 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f200102f0).fault
...
2014-10-19 00:11:58.814682 7f8f2412c700  0 -- :/1024595 >> 127.0.0.1:7104/0 pipe(0x7f8f140008c0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8f14014640).fault
2014-10-19 00:12:01.765513 7f8f26bc2700  0 monclient(hunting): authenticate timed out after 300
2014-10-19 00:12:01.765573 7f8f26bc2700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
TEST_crush_rule_create_erasure: 105: return 1

Files

osd-crush.sh.log.gz (77.6 KB) osd-crush.sh.log.gz

mon tests and mon logs

Loïc Dachary, 10/19/2014 08:12 AM

Actions

Copy link

Updated by Joao Eduardo Luis over 9 years ago

just a hunch: feels like you're capturing only stdout from the monitor, and the monitor may have hit the 'mon data avail crit' percentage and gracefully quit -- error message wouldn't show if you were only capturing stdout and as far as I'm aware that's the only way the monitor will even quit gracefully.

Actions

Copy link

Updated by Loïc Dachary about 9 years ago

Status changed from New to Can't reproduce

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph

Custom queries

Bug #9816

mon exits unexpectedly and gracefully

Updated by Joao Eduardo Luis over 9 years ago

Updated by Loïc Dachary about 9 years ago