Bug #550

closed

mon: PGMonitor::update_from_paxos()

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
% Done:
0%


Description

One of my monitors crashed with this backtrace:

2010-11-05 19:43:22.959829 7f4419954710 log [INF] : mon.node14 calling new monitor election
2010-11-05 19:43:22.992754 7f4419153710 cephx keyserverdata: get_caps: name=mon.
2010-11-05 19:43:22.992805 7f4419153710 cephx keyserverdata: get_secret: num of caps=0
2010-11-05 19:43:22.992817 7f4419153710 cephx: build_service_ticket service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
2010-11-05 19:43:24.184936 7f441a155710 mon.node14@2(peon).osd e1565 e1565: 12 osds: 12 up, 12 in -- 1 blacklisted MDSes
2010-11-05 19:43:24.717521 7f441a155710 cephx server client.admin: handle_request get_principal_session_key
2010-11-05 19:43:24.717552 7f441a155710 cephx: verify_authorizer decrypted service auth secret_id=9
2010-11-05 19:43:24.717582 7f441a155710 cephx: verify_authorizer global_id=4607
2010-11-05 19:43:24.717615 7f441a155710 cephx: verify_authorizer ok nonce e101050725a6705d reply_bl.length()=36
2010-11-05 19:43:24.717625 7f441a155710 cephx server client.admin:  ticket_req.keys = 2
2010-11-05 19:43:24.717645 7f441a155710 cephx server client.admin:  adding key for service mds
2010-11-05 19:43:24.717654 7f441a155710 cephx keyserverdata: get_service_secret service mds id 106 AQB1OdRM8KyPBRAA7l7pr5znpUhraEr0i799ww== expires 2010-11-05 20:05:53.485573
2010-11-05 19:43:24.717694 7f441a155710 cephx keyserverdata: get_caps: name=client.admin
2010-11-05 19:43:24.717704 7f441a155710 cephx keyserverdata: get_secret: num of caps=3
2010-11-05 19:43:24.717716 7f441a155710 cephx: build_service_ticket_reply encoding 1 tickets with secret AQC5NNRMeL/3DBAAV82nyWsm1PM8DQvygwnuzg==
2010-11-05 19:43:24.717738 7f441a155710 cephx: build_service_ticket service mds secret_id 106 ticket_info.ticket.name=client.admin
2010-11-05 19:43:24.717757 7f441a155710 cephx: service_ticket_blob is 0000 : 01 6a 00 00 00 00 00 00 00 70 00 00 00 4d 75 ea : .j.......p...Mu.
0010 : dd d6 ab fc d0 1f 47 94 69 56 63 7f d1 1a ff cb : ......G.iVc.....
0020 : 57 b1 52 9b 63 19 5e 51 e8 75 11 fb bf 11 02 fa : W.R.c.^Q.u......
0030 : 30 6c 56 00 42 e0 15 14 4c 09 8f f3 1b ac ef d0 : 0lV.B...L.......
0040 : 14 a1 73 e3 3a 2d 09 f4 92 30 75 d4 58 3d 7d 01 : ..s.:-...0u.X=}.
0050 : f3 e1 c0 5c 0b 45 a3 fe 7b ee 9d 73 29 9d dd 1c : ...\.E..{..s)...
0060 : 89 32 df 1e 8f 07 74 45 d1 79 24 bf a2 12 70 18 : .2....tE.y$...p.
0070 : b1 9b e5 88 23 ea 6f a6 ce 6a 19 56 a6          : ....#.o..j.V.

*** Caught signal (ABRT) ***
 ceph version 0.23~rc (commit:e304a2451a80f11117cb01031734e72077c88ce0)
 1: (sigabrt_handler(int)+0x7d) [0x559d5d]
 2: (()+0x33af0) [0x7f441b5a6af0]
 3: (gsignal()+0x35) [0x7f441b5a6a75]
 4: (abort()+0x180) [0x7f441b5aa5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f441be5c8e5]
 6: (()+0xcad16) [0x7f441be5ad16]
 7: (()+0xcad43) [0x7f441be5ad43]
 8: (()+0xcae3e) [0x7f441be5ae3e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x11e) [0x44bc1e]
 10: (PGMonitor::update_from_paxos()+0x2e5) [0x4c4a15]
 11: (PaxosService::_active()+0x36) [0x487be6]
 12: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x1b7) [0x485387]
 13: (Paxos::handle_lease(MMonPaxos*)+0x3e9) [0x4807e9]
 14: (Paxos::dispatch(PaxosServiceMessage*)+0x1fb) [0x483dfb]
 15: (Monitor::_ms_dispatch(Message*)+0xbb4) [0x4724b4]
 16: (Monitor::ms_dispatch(Message*)+0x79) [0x47d759]
 17: (SimpleMessenger::dispatch_entry()+0x749) [0x454559]
 18: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x44ab5c]
 19: (Thread::_entry_func(void*)+0xa) [0x45fbda]
 20: (()+0x69ca) [0x7f441c43a9ca]
 21: (clone()+0x6d) [0x7f441b6596fd]

I've run cdebugpack and uploaded the data to logger.ceph.widodh.nl:/srv/ceph/issues/cmon_crash_update_paxos

Actions #1

Updated by Wido den Hollander over 13 years ago

I thought this wasn't related to the MDS issue I'm seeing, but it seems it might be:

[168282.081151] libceph: mds0 (unknown sockaddr family 0) connect error
[168794.081195] libceph: mds0 (unknown sockaddr family 0) connect error
[169306.081157] libceph: mds0 (unknown sockaddr family 0) connect error
[169598.622505] ceph: mds0 reconnect start
[169598.955622] ceph: mds0 caps went stale, renewing
[169598.958900] libceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6818 socket closed
[169599.416186] libceph: mon2 [2001:16f8:10:2::c3c3:2e5c]:6789 socket closed
[169599.417301] libceph: mon2 [2001:16f8:10:2::c3c3:2e5c]:6789 session lost, hunting for new mon
[169599.417424] libceph: mon2 [2001:16f8:10:2::c3c3:2e5c]:6789 connection failed
[169600.010682] libceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6818 connection reset
[169600.012507] libceph: reset on mds0
[169600.012510] ceph: mds0 closed our session
[169600.012512] ceph: mds0 reconnect start
[169600.214978] libceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6818 protocol version mismatch, my 32 != server's 32
[169600.216961] libceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6818 protocol version mismatch
[169604.792105] libceph: mon1 [2001:16f8:10:2::c3c3:3f9b]:6789 session established

As you can see, mds0 was starting to come back to life and then mon2 crashed.

The client is running the master branch, commit 2f56f56ad991edd51ffd0baf1182245ee1277a04 (2.6.36-rc8-20625-g2f56f56)

Actions #2

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce

Haven't been able to reproduce this. Commit 62716aa7 gives us useful error messages, so if/when it comes up again we'll know more.
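
For readers of the backtrace in the description: the abort arrives through `__gnu_cxx::__verbose_terminate_handler` directly after `ceph::buffer::list::iterator::copy`, which usually indicates an exception thrown while reading a buffer and never caught. The sketch below only illustrates that failure mode and one way to turn it into a readable error. `Cursor`, `EndOfBuffer`, and `try_decode_u32` are hypothetical names invented for this example; this is not Ceph code and makes no claim about what commit 62716aa7 actually changes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type, standing in for a "read past end" error.
struct EndOfBuffer : std::runtime_error {
    EndOfBuffer() : std::runtime_error("end of buffer") {}
};

// Hypothetical bounds-checked reader; NOT Ceph's buffer::list::iterator.
class Cursor {
public:
    explicit Cursor(const std::vector<char>& data) : data_(data), pos_(0) {}

    // Copy len bytes into dest, throwing if fewer than len bytes remain.
    void copy(std::size_t len, char* dest) {
        if (len > data_.size() - pos_) {
            throw EndOfBuffer();
        }
        if (len > 0) {
            std::memcpy(dest, data_.data() + pos_, len);
        }
        pos_ += len;
    }

private:
    const std::vector<char>& data_;
    std::size_t pos_;
};

// Decode a 4-byte value from blob. On a short blob, return a descriptive
// error string instead of letting the exception escape; an escaped
// exception would end in std::terminate() and abort(), as in the trace.
// Returns an empty string on success.
std::string try_decode_u32(const std::vector<char>& blob, std::uint32_t& out) {
    Cursor cursor(blob);
    try {
        cursor.copy(sizeof(out), reinterpret_cast<char*>(&out));
    } catch (const EndOfBuffer& e) {
        return std::string("decode failed: ") + e.what() +
               " (blob size " + std::to_string(blob.size()) + ")";
    }
    return "";
}
```

Calling `try_decode_u32` on a 2-byte blob returns an error string that names the blob size, whereas calling `Cursor::copy` directly without a `try` block would let the exception propagate up the stack.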
