Project

General

Profile

Actions

Bug #4644

closed

mds crashing after upgrade from 0.58 to 0.60

Added by norbert schmidt about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

after upgrade from 0.58 to 0.60, one mds is crashed and still crashing directly after start

--- begin dump of recent events ---
  -120> 2013-04-02 11:44:23.896083 7fe140415780  5 asok(0x2af0000) register_command perfcounters_dump hook 0x2aa8010
  -119> 2013-04-02 11:44:23.896121 7fe140415780  5 asok(0x2af0000) register_command 1 hook 0x2aa8010
  -118> 2013-04-02 11:44:23.896124 7fe140415780  5 asok(0x2af0000) register_command perf dump hook 0x2aa8010
  -117> 2013-04-02 11:44:23.896130 7fe140415780  5 asok(0x2af0000) register_command perfcounters_schema hook 0x2aa8010
  -116> 2013-04-02 11:44:23.896133 7fe140415780  5 asok(0x2af0000) register_command 2 hook 0x2aa8010
  -115> 2013-04-02 11:44:23.896135 7fe140415780  5 asok(0x2af0000) register_command perf schema hook 0x2aa8010
  -114> 2013-04-02 11:44:23.896137 7fe140415780  5 asok(0x2af0000) register_command config show hook 0x2aa8010
  -113> 2013-04-02 11:44:23.896140 7fe140415780  5 asok(0x2af0000) register_command config set hook 0x2aa8010
  -112> 2013-04-02 11:44:23.896144 7fe140415780  5 asok(0x2af0000) register_command log flush hook 0x2aa8010
  -111> 2013-04-02 11:44:23.896147 7fe140415780  5 asok(0x2af0000) register_command log dump hook 0x2aa8010
  -110> 2013-04-02 11:44:23.896181 7fe140415780  5 asok(0x2af0000) register_command log reopen hook 0x2aa8010
  -109> 2013-04-02 11:44:23.904109 7fe140415780  0 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c), process ceph-mds, pid 2228
  -108> 2013-04-02 11:44:23.904528 7fe140415780  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6809/2228 need_addr=1
  -107> 2013-04-02 11:44:23.910098 7fe140415780  1 finished global_init_daemonize
  -106> 2013-04-02 11:44:23.910307 7fe140415780  5 asok(0x2af0000) init /var/run/ceph/ceph-mds.a.asok
  -105> 2013-04-02 11:44:23.910331 7fe140415780  5 asok(0x2af0000) bind_and_listen /var/run/ceph/ceph-mds.a.asok
  -104> 2013-04-02 11:44:23.910405 7fe140415780  5 asok(0x2af0000) register_command 0 hook 0x2aa0038
  -103> 2013-04-02 11:44:23.910415 7fe140415780  5 asok(0x2af0000) register_command version hook 0x2aa0038
  -102> 2013-04-02 11:44:23.910420 7fe140415780  5 asok(0x2af0000) register_command git_version hook 0x2aa0038
  -101> 2013-04-02 11:44:23.910425 7fe140415780  5 asok(0x2af0000) register_command help hook 0x2aa8050
  -100> 2013-04-02 11:44:23.910478 7fe140415780 10 monclient(hunting): build_initial_monmap
   -99> 2013-04-02 11:44:23.910522 7fe13d70b700  5 asok(0x2af0000) entry start
   -98> 2013-04-02 11:44:23.939661 7fe140415780  1 -- 0.0.0.0:6809/2228 messenger.start
   -97> 2013-04-02 11:44:23.940508 7fe140415780  5 adding auth protocol: none
   -96> 2013-04-02 11:44:23.940528 7fe140415780  5 adding auth protocol: none
   -95> 2013-04-02 11:44:23.941582 7fe140415780  1 accepter.accepter.start
   -94> 2013-04-02 11:44:23.941610 7fe140415780 10 monclient(hunting): init
   -93> 2013-04-02 11:44:23.941616 7fe140415780  5 adding auth protocol: none
   -92> 2013-04-02 11:44:23.941617 7fe140415780 10 monclient(hunting): auth_supported 1 method none
   -91> 2013-04-02 11:44:23.941679 7fe140415780 10 monclient(hunting): _reopen_session
   -90> 2013-04-02 11:44:23.943741 7fe140415780 10 monclient(hunting): _pick_new_mon picked mon.c con 0x2b68580 addr 192.168.99.110:6789/0
   -89> 2013-04-02 11:44:23.943799 7fe140415780 10 monclient(hunting): _send_mon_message to mon.c at 192.168.99.110:6789/0
   -88> 2013-04-02 11:44:23.944853 7fe140411700  1 -- 192.168.16.129:6809/2228 learned my addr 192.168.16.129:6809/2228
   -87> 2013-04-02 11:44:23.945572 7fe13bf08700  5 mds.-1.0 ms_handle_connect on 192.168.99.110:6789/0
   -86> 2013-04-02 11:44:23.943809 7fe140415780  1 -- 0.0.0.0:6809/2228 --> 192.168.99.110:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x2b10400 con 0x2b68580
   -85> 2013-04-02 11:44:23.965313 7fe140415780 10 monclient(hunting): renew_subs
   -84> 2013-04-02 11:44:26.941893 7fe13af06700 10 monclient(hunting): tick
   -83> 2013-04-02 11:44:26.941917 7fe13af06700  1 monclient(hunting): continuing hunt
   -82> 2013-04-02 11:44:26.941920 7fe13af06700 10 monclient(hunting): _reopen_session
   -81> 2013-04-02 11:44:26.941923 7fe13af06700  1 -- 192.168.16.129:6809/2228 mark_down 0x2b68580 -- 0x2b88280
   -80> 2013-04-02 11:44:26.942055 7fe13af06700 10 monclient(hunting): _pick_new_mon picked mon.a con 0x2b68840 addr 192.168.16.129:6789/0
   -79> 2013-04-02 11:44:26.941986 7fe139f04700  2 -- 192.168.16.129:6809/2228 >> 192.168.99.110:6789/0 pipe(0x2b88280 sd=16 :50721 s=4 pgs=44 cs=1 l=1).reader couldn't read tag, Success
   -78> 2013-04-02 11:44:26.942078 7fe13af06700 10 monclient(hunting): _send_mon_message to mon.a at 192.168.16.129:6789/0
   -77> 2013-04-02 11:44:26.942082 7fe13af06700  1 -- 192.168.16.129:6809/2228 --> 192.168.16.129:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x2b10400 con 0x2b68840
   -76> 2013-04-02 11:44:26.942079 7fe139f04700  2 -- 192.168.16.129:6809/2228 >> 192.168.99.110:6789/0 pipe(0x2b88280 sd=16 :50721 s=4 pgs=44 cs=1 l=1).fault 0: Success
   -75> 2013-04-02 11:44:26.942089 7fe13af06700 10 monclient(hunting): renew_subs
   -74> 2013-04-02 11:44:26.942784 7fe13bf08700  5 mds.-1.0 ms_handle_connect on 192.168.16.129:6789/0
   -73> 2013-04-02 11:44:26.943603 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 1 ==== mon_map v1 ==== 473+0+0 (3243395836 0 0) 0x2bc83c0 con 0x2b68840
   -72> 2013-04-02 11:44:26.943640 7fe13bf08700 10 monclient(hunting): handle_monmap mon_map v1
   -71> 2013-04-02 11:44:26.943658 7fe13bf08700 10 monclient(hunting):  got monmap 6, mon.a is now rank 2
   -70> 2013-04-02 11:44:26.943660 7fe13bf08700 10 monclient(hunting): dump:
epoch 6
fsid 4eacc6f6-6561-4359-a035-1eb9a91eb45d
last_changed 2013-04-02 09:12:35.994897
created 2012-10-24 16:32:47.986799
0: 192.168.99.110:6789/0 mon.c
1: 192.168.23.110:6789/0 mon.b
2: 192.168.16.129:6789/0 mon.a

   -69> 2013-04-02 11:44:26.943715 7fe13bf08700  1 monclient(hunting): found mon.a
   -68> 2013-04-02 11:44:26.943725 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 2 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (2265305362 0 0) 0x2b10600 con 0x2b68840
   -67> 2013-04-02 11:44:26.943747 7fe13bf08700 10 monclient: my global_id is 10244
   -66> 2013-04-02 11:44:26.943750 7fe13bf08700 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -65> 2013-04-02 11:44:26.943757 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.16.129:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x2bb8380 con 0x2b68840
   -64> 2013-04-02 11:44:26.950949 7fe13bf08700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-04-02 11:43:56.950945)
   -63> 2013-04-02 11:44:26.951044 7fe140415780  5 monclient: authenticate success, global_id 10244
   -62> 2013-04-02 11:44:26.951125 7fe140415780  5 asok(0x2af0000) register_command objecter_requests hook 0x2aa8100
   -61> 2013-04-02 11:44:26.951207 7fe140415780 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -60> 2013-04-02 11:44:26.951220 7fe140415780  1 -- 192.168.16.129:6809/2228 --> 192.168.16.129:6789/0 -- mdsbeacon(10244/a up:boot seq 1 v0) v2 -- ?+0 0x2bd8000 con 0x2b68840
   -59> 2013-04-02 11:44:26.951245 7fe140415780 10 monclient: renew_subs
   -58> 2013-04-02 11:44:26.951250 7fe140415780 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -57> 2013-04-02 11:44:26.951254 7fe140415780  1 -- 192.168.16.129:6809/2228 --> 192.168.16.129:6789/0 -- mon_subscribe({monmap=7+,osdmap=0}) v2 -- ?+0 0x2bb81c0 con 0x2b68840
   -56> 2013-04-02 11:44:26.951267 7fe140415780 10 monclient: renew_subs
   -55> 2013-04-02 11:44:26.951277 7fe140415780 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -54> 2013-04-02 11:44:26.951291 7fe140415780  1 -- 192.168.16.129:6809/2228 --> 192.168.16.129:6789/0 -- mon_subscribe({mdsmap=0+,monmap=7+,osdmap=0}) v2 -- ?+0 0x2bb88c0 con 0x2b68840
   -53> 2013-04-02 11:44:26.951561 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 3 ==== mon_map v1 ==== 473+0+0 (3243395836 0 0) 0x2bc85a0 con 0x2b68840
   -52> 2013-04-02 11:44:26.951580 7fe13bf08700 10 monclient: handle_monmap mon_map v1
   -51> 2013-04-02 11:44:26.951590 7fe13bf08700 10 monclient:  got monmap 6, mon.a is now rank 2
   -50> 2013-04-02 11:44:26.951592 7fe13bf08700 10 monclient: dump:
epoch 6
fsid 4eacc6f6-6561-4359-a035-1eb9a91eb45d
last_changed 2013-04-02 09:12:35.994897
created 2012-10-24 16:32:47.986799
0: 192.168.99.110:6789/0 mon.c
1: 192.168.23.110:6789/0 mon.b
2: 192.168.16.129:6789/0 mon.a

   -49> 2013-04-02 11:44:26.951623 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 4 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x2b904e0 con 0x2b68840
   -48> 2013-04-02 11:44:26.951638 7fe13bf08700 10 monclient: handle_subscribe_ack sent 2013-04-02 11:44:23.965317 renew after 2013-04-02 11:46:53.965317
   -47> 2013-04-02 11:44:26.963038 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 5 ==== osd_map(3678..3678 src has 3176..3678) v3 ==== 11918+0+0 (2321284020 0 0) 0x2b11200 con 0x2b68840
   -46> 2013-04-02 11:44:26.968030 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x2b90820 con 0x2b68840
   -45> 2013-04-02 11:44:26.968046 7fe13bf08700 10 monclient: handle_subscribe_ack sent 0.000000, ignoring
   -44> 2013-04-02 11:44:26.969045 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 7 ==== mdsmap(e 762) v1 ==== 583+0+0 (3620617815 0 0) 0x2b11000 con 0x2b68840
   -43> 2013-04-02 11:44:26.969074 7fe13bf08700  5 mds.-1.0 handle_mds_map epoch 762 from mon.2
   -42> 2013-04-02 11:44:26.969130 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 8 ==== osd_map(3678..3678 src has 3176..3678) v3 ==== 11918+0+0 (2321284020 0 0) 0x2b11800 con 0x2b68840
   -41> 2013-04-02 11:44:26.969158 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 9 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x2b90680 con 0x2b68840
   -40> 2013-04-02 11:44:26.969168 7fe13bf08700 10 monclient: handle_subscribe_ack sent 0.000000, ignoring
   -39> 2013-04-02 11:44:27.163253 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 10 ==== mdsmap(e 763) v1 ==== 786+0+0 (3610614263 0 0) 0x2b11600 con 0x2b68840
   -38> 2013-04-02 11:44:27.163277 7fe13bf08700  5 mds.-1.-1 handle_mds_map epoch 763 from mon.2
   -37> 2013-04-02 11:44:27.163301 7fe13bf08700  1 mds.-1.0 handle_mds_map standby
   -36> 2013-04-02 11:44:27.422404 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 11 ==== mdsmap(e 764) v1 ==== 583+0+0 (2230448747 0 0) 0x2b11400 con 0x2b68840
   -35> 2013-04-02 11:44:27.422429 7fe13bf08700  5 mds.-1.0 handle_mds_map epoch 764 from mon.2
   -34> 2013-04-02 11:44:27.422452 7fe13bf08700  1 -- 192.168.16.129:6809/2228 mark_down 192.168.23.110:6800/18751 -- pipe dne
   -33> 2013-04-02 11:44:27.422461 7fe13bf08700  1 mds.0.40 handle_mds_map i am now mds.0.40
   -32> 2013-04-02 11:44:27.422463 7fe13bf08700  1 mds.0.40 handle_mds_map state change up:standby --> up:replay
   -31> 2013-04-02 11:44:27.422465 7fe13bf08700  1 mds.0.40 replay_start
   -30> 2013-04-02 11:44:27.422871 7fe13bf08700  1 mds.0.40  recovery set is
   -29> 2013-04-02 11:44:27.422885 7fe13bf08700  1 mds.0.40  need osdmap epoch 3679, have 3678
   -28> 2013-04-02 11:44:27.422892 7fe13bf08700  1 mds.0.40  waiting for osdmap 3679 (which blacklists prior instance)
   -27> 2013-04-02 11:44:27.422901 7fe13bf08700 10 monclient: renew_subs
   -26> 2013-04-02 11:44:27.422907 7fe13bf08700 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -25> 2013-04-02 11:44:27.422914 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.16.129:6789/0 -- mon_subscribe({mdsmap=765+,monmap=7+,osdmap=3679}) v2 -- ?+0 0x2bb8380 con 0x2b68840
   -24> 2013-04-02 11:44:27.422936 7fe13bf08700  1 mds.0.cache handle_mds_failure mds.0 : recovery peers are
   -23> 2013-04-02 11:44:27.443165 7fe13bf08700  5 mds.0.migrator handle_mds_failure_or_stop mds.0
   -22> 2013-04-02 11:44:27.464363 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 12 ==== osd_map(3679..3679 src has 3178..3679) v3 ==== 308+0+0 (1123034295 0 0) 0x2b11a00 con 0x2b68840
   -21> 2013-04-02 11:44:27.464417 7fe13bf08700  2 mds.0.40 boot_start 1: opening inotable
   -20> 2013-04-02 11:44:27.464503 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.23.153:6832/14317 -- osd_op(mds.0.40:1 mds0_inotable [read 0~0] 1.b852b893 e3679) v4 -- ?+0 0x2ae8240 con 0x2b68c60
   -19> 2013-04-02 11:44:27.464531 7fe13bf08700  2 mds.0.40 boot_start 1: opening sessionmap
   -18> 2013-04-02 11:44:27.464584 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.23.153:6841/14764 -- osd_op(mds.0.40:2 mds0_sessionmap [read 0~0] 1.3270c60b e3679) v4 -- ?+0 0x2ae86c0 con 0x2b694a0
   -17> 2013-04-02 11:44:27.464594 7fe13bf08700  2 mds.0.40 boot_start 1: opening anchor table
   -16> 2013-04-02 11:44:27.464639 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.23.110:6810/20043 -- osd_op(mds.0.40:3 mds_anchortable [read 0~0] 1.a977f6a7 e3679) v4 -- ?+0 0x2ae8480 con 0x2b69340
   -15> 2013-04-02 11:44:27.464658 7fe13bf08700  2 mds.0.40 boot_start 1: opening snap table
   -14> 2013-04-02 11:44:27.464700 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.23.153:6844/14861 -- osd_op(mds.0.40:4 mds_snaptable [read 0~0] 1.d90270ad e3679) v4 -- ?+0 0x2ae8d80 con 0x2b691e0
   -13> 2013-04-02 11:44:27.464718 7fe13bf08700  2 mds.0.40 boot_start 1: opening mds log
   -12> 2013-04-02 11:44:27.464720 7fe13bf08700  5 mds.0.log open discovering log bounds
   -11> 2013-04-02 11:44:27.464726 7fe13bf08700  1 mds.0.journaler(ro) recover start
   -10> 2013-04-02 11:44:27.464727 7fe13bf08700  1 mds.0.journaler(ro) read_head
    -9> 2013-04-02 11:44:27.464776 7fe13bf08700  1 -- 192.168.16.129:6809/2228 --> 192.168.23.153:6847/14980 -- osd_op(mds.0.40:5 200.00000000 [read 0~0] 1.844f3494 e3679) v4 -- ?+0 0x2ae8b40 con 0x2b69080
    -8> 2013-04-02 11:44:27.464796 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== mon.2 192.168.16.129:6789/0 13 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x2b909c0 con 0x2b68840
    -7> 2013-04-02 11:44:27.464804 7fe13bf08700 10 monclient: handle_subscribe_ack sent 2013-04-02 11:44:27.422902 renew after 2013-04-02 11:46:57.422902
    -6> 2013-04-02 11:44:27.465869 7fe13bf08700  5 mds.0.40 ms_handle_connect on 192.168.23.153:6832/14317
    -5> 2013-04-02 11:44:27.465885 7fe13bf08700  5 mds.0.40 ms_handle_connect on 192.168.23.153:6847/14980
    -4> 2013-04-02 11:44:27.465970 7fe13bf08700  5 mds.0.40 ms_handle_connect on 192.168.23.110:6810/20043
    -3> 2013-04-02 11:44:27.465987 7fe13bf08700  5 mds.0.40 ms_handle_connect on 192.168.23.153:6841/14764
    -2> 2013-04-02 11:44:27.466032 7fe13bf08700  5 mds.0.40 ms_handle_connect on 192.168.23.153:6844/14861
    -1> 2013-04-02 11:44:27.474970 7fe13bf08700  1 -- 192.168.16.129:6809/2228 <== osd.7 192.168.23.153:6841/14764 1 ==== osd_op_reply(2 mds0_sessionmap [read 0~542] ondisk = 0) v4 ==== 114+0+542 (1234829482 0 3336007267) 0x2bc8780 con 0x2b694a0
     0> 2013-04-02 11:44:27.479071 7fe13bf08700 -1 *** Caught signal (Aborted) **
 in thread 7fe13bf08700

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: /usr/local/bin/ceph-mds() [0x852349]
 2: (()+0xf500) [0x7fe13fff5500]
 3: (gsignal()+0x35) [0x7fe13e7418a5]
 4: (abort()+0x175) [0x7fe13e743085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fe13effaa5d]
 6: (()+0xbcbe6) [0x7fe13eff8be6]
 7: (()+0xbcc13) [0x7fe13eff8c13]
 8: (()+0xbcd0e) [0x7fe13eff8d0e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x7b7fee]
 10: (session_info_t::decode(ceph::buffer::list::iterator&)+0x415) [0x7dfa65]
 11: (SessionMap::decode(ceph::buffer::list::iterator&)+0x82c) [0x6e478c]
 12: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x81) [0x6e59c1]
 13: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e) [0x71290e]
 14: (MDS::handle_core_message(Message*)+0x957) [0x4d1c67]
 15: (MDS::_dispatch(Message*)+0x2f) [0x4d1d1f]
 16: (MDS::ms_dispatch(Message*)+0x19b) [0x4d38eb]
 17: (DispatchQueue::entry()+0x309) [0x825da9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7a9e0d]
 19: (()+0x7851) [0x7fe13ffed851]
 20: (clone()+0x6d) [0x7fe13e7f711d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/mds.a.log
--- end dump of recent events ---

Files

alexxy.mds.log (62.3 KB) alexxy.mds.log Joao Eduardo Luis, 04/04/2013 03:34 AM
sessionmap (543 Bytes) sessionmap norbert schmidt, 04/04/2013 11:15 PM
0001-mds-fix-session_info_t-decoding.patch (895 Bytes) 0001-mds-fix-session_info_t-decoding.patch Zheng Yan, 04/08/2013 08:20 PM
Actions #1

Updated by Joao Eduardo Luis about 11 years ago

  • Description updated (diff)
Actions #2

Updated by Joao Eduardo Luis about 11 years ago

alexxy @ IRC also hit this issue. Attaching log.

Actions #3

Updated by Greg Farnum about 11 years ago

  • Status changed from New to Need More Info
  • Assignee set to Greg Farnum

It failed to decode the SessionMap properly here, but I can't tell why and the code hasn't changed at all between those versions. Can one of you please:
1) restart your MDS with "debug mds = 20" in the config file,
2) Get the "mds0_sessionmap" object out of your metadata pool and attach it? (I believe the syntax is "rados --pool=metadata get mds0_sessionmap -o sessionmap.tmp", but I could be misremembering)

Actions #4

Updated by Greg Farnum about 11 years ago

I got a SessionMap from alexxy and it somehow has a bad version number attached to it. More importantly when I hexdump it, it looks nothing like a SessionMap begin should (the bits are not all turned on). It's also quite large — 14k.

norbert, I'd love to get your sessionmap and debug logs for comparison as well. :)

Actions #5

Updated by Dan Mick about 11 years ago

  • Subject changed from mds craching after upgrade from 0.58 to 0.60 to mds crashing after upgrade from 0.58 to 0.60
Actions #6

Updated by Zheng Yan about 11 years ago

I guess this bug was introduced by commit 0bcf2ac081b8386fe00387b654aa5676a7902c80

diff --git a/src/mds/mdstypes.cc b/src/mds/mdstypes.cc
index ad4a71a..6c34c63 100644
--- a/src/mds/mdstypes.cc
+++ b/src/mds/mdstypes.cc
@@ -543,7 +543,7 @@ void session_info_t::decode(bufferlist::iterator& p)
   ::decode(inst, p);
   if (struct_v == 2) {
     set<tid_t> s;
-    ::decode(completed_requests, p);
+    ::decode(s, p);
     while (!s.empty()) {
       completed_requests[*s.begin()] = inodeno_t();
       s.erase(s.begin());
Actions #7

Updated by norbert schmidt about 11 years ago

logfile with debug mds = 20

2013-04-05 08:04:25.934459 7f717b3e5780  0 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c), process ceph-mds, pid 15599
2013-04-05 08:04:25.936397 7f717b3e5780 10 mds.-1.0 168 MDSCacheObject
2013-04-05 08:04:25.936412 7f717b3e5780 10 mds.-1.0 2144        CInode
2013-04-05 08:04:25.936413 7f717b3e5780 10 mds.-1.0 16   elist<>::item   *7=112
2013-04-05 08:04:25.936414 7f717b3e5780 10 mds.-1.0 384  inode_t
2013-04-05 08:04:25.936416 7f717b3e5780 10 mds.-1.0 56    nest_info_t
2013-04-05 08:04:25.936417 7f717b3e5780 10 mds.-1.0 32    frag_info_t
2013-04-05 08:04:25.936448 7f717b3e5780 10 mds.-1.0 40   SimpleLock   *5=200
2013-04-05 08:04:25.936449 7f717b3e5780 10 mds.-1.0 48   ScatterLock  *3=144
2013-04-05 08:04:25.936451 7f717b3e5780 10 mds.-1.0 472 CDentry
2013-04-05 08:04:25.936452 7f717b3e5780 10 mds.-1.0 16   elist<>::item
2013-04-05 08:04:25.936453 7f717b3e5780 10 mds.-1.0 40   SimpleLock
2013-04-05 08:04:25.936453 7f717b3e5780 10 mds.-1.0 1016        CDir
2013-04-05 08:04:25.936459 7f717b3e5780 10 mds.-1.0 16   elist<>::item   *2=32
2013-04-05 08:04:25.936460 7f717b3e5780 10 mds.-1.0 192  fnode_t
2013-04-05 08:04:25.936461 7f717b3e5780 10 mds.-1.0 56    nest_info_t *2
2013-04-05 08:04:25.936462 7f717b3e5780 10 mds.-1.0 32    frag_info_t *2
2013-04-05 08:04:25.936463 7f717b3e5780 10 mds.-1.0 168 Capability
2013-04-05 08:04:25.936465 7f717b3e5780 10 mds.-1.0 32   xlist<>::item   *2=64
2013-04-05 08:04:25.937151 7f717b3e1700 10 mds.-1.0 MDS::ms_get_authorizer type=mon
2013-04-05 08:04:25.937353 7f7176ed8700  5 mds.-1.0 ms_handle_connect on 192.168.16.129:6789/0
2013-04-05 08:04:25.938584 7f717b3e5780 10 mds.-1.0 beacon_send up:boot seq 1 (currently up:boot)
2013-04-05 08:04:25.938755 7f717b3e5780 10 mds.-1.0 create_logger
2013-04-05 08:04:25.939946 7f7176ed8700  5 mds.-1.0 handle_mds_map epoch 13062 from mon.2
2013-04-05 08:04:25.940056 7f7176ed8700 10 mds.-1.0      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-04-05 08:04:25.940061 7f7176ed8700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-04-05 08:04:25.940065 7f7176ed8700 10 mds.-1.-1 map says i am 192.168.16.129:6809/15599 mds.-1.-1 state down:dne
2013-04-05 08:04:25.940072 7f7176ed8700 10 mds.-1.-1 not in map yet
2013-04-05 08:04:26.038493 7f7176ed8700  5 mds.-1.-1 handle_mds_map epoch 13063 from mon.2
2013-04-05 08:04:26.038515 7f7176ed8700 10 mds.-1.-1      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-04-05 08:04:26.038522 7f7176ed8700 10 mds.-1.-1  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-04-05 08:04:26.038527 7f7176ed8700 10 mds.-1.0 map says i am 192.168.16.129:6809/15599 mds.-1.0 state up:standby
2013-04-05 08:04:26.038533 7f7176ed8700  1 mds.-1.0 handle_mds_map standby
2013-04-05 08:04:29.938876 7f7174dd3700 10 mds.-1.0 beacon_send up:standby seq 2 (currently up:standby)
2013-04-05 08:04:29.940963 7f7176ed8700 10 mds.-1.0 handle_mds_beacon up:standby seq 2 rtt 0.002065
2013-04-05 08:04:30.948423 7f7174dd3700 20 mds.-1.bal get_load no root, no load
2013-04-05 08:04:30.948576 7f7174dd3700 15 mds.-1.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.47>
2013-04-05 08:04:33.939196 7f7174dd3700 10 mds.-1.0 beacon_send up:standby seq 3 (currently up:standby)
2013-04-05 08:04:33.941165 7f7176ed8700 10 mds.-1.0 handle_mds_beacon up:standby seq 3 rtt 0.001955
2013-04-05 08:04:35.939135 7f7174dd3700 20 mds.-1.bal get_load no root, no load
2013-04-05 08:04:35.939189 7f7174dd3700 15 mds.-1.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.44>
2013-04-05 08:04:37.320922 7f7176ed8700  5 mds.-1.0 handle_mds_map epoch 13064 from mon.2
2013-04-05 08:04:37.320943 7f7176ed8700 10 mds.-1.0      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-04-05 08:04:37.320949 7f7176ed8700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-04-05 08:04:37.320953 7f7176ed8700 10 mds.0.44 map says i am 192.168.16.129:6809/15599 mds.0.44 state up:replay
2013-04-05 08:04:37.320958 7f7176ed8700 10 mds.0.44  peer mds gid 10319 removed from map
2013-04-05 08:04:37.321007 7f7176ed8700  1 mds.0.44 handle_mds_map i am now mds.0.44
2013-04-05 08:04:37.321008 7f7176ed8700  1 mds.0.44 handle_mds_map state change up:standby --> up:replay
2013-04-05 08:04:37.321010 7f7176ed8700  1 mds.0.44 replay_start
2013-04-05 08:04:37.321016 7f7176ed8700  7 mds.0.cache set_recovery_set
2013-04-05 08:04:37.321017 7f7176ed8700  1 mds.0.44  recovery set is
2013-04-05 08:04:37.321019 7f7176ed8700  1 mds.0.44  need osdmap epoch 3690, have 3689
2013-04-05 08:04:37.321021 7f7176ed8700  1 mds.0.44  waiting for osdmap 3690 (which blacklists prior instance)
2013-04-05 08:04:37.321050 7f7176ed8700  7 mds.0.cache handle_mds_failure mds.0
2013-04-05 08:04:37.321054 7f7176ed8700  1 mds.0.cache handle_mds_failure mds.0 : recovery peers are
2013-04-05 08:04:37.321057 7f7176ed8700 10 mds.0.cache  wants_resolve
2013-04-05 08:04:37.321058 7f7176ed8700 10 mds.0.cache  got_resolve
2013-04-05 08:04:37.321059 7f7176ed8700 10 mds.0.cache  rejoin_sent
2013-04-05 08:04:37.321060 7f7176ed8700 10 mds.0.cache  rejoin_gather
2013-04-05 08:04:37.321061 7f7176ed8700 10 mds.0.cache  rejoin_ack_gather
2013-04-05 08:04:37.321064 7f7176ed8700  5 mds.0.migrator handle_mds_failure_or_stop mds.0
2013-04-05 08:04:37.321072 7f7176ed8700 10 mds.0.cache show_subtrees - no subtrees
2013-04-05 08:04:37.463368 7f7176ed8700  2 mds.0.44 boot_start 1: opening inotable
2013-04-05 08:04:37.463381 7f7176ed8700 10 mds.0.inotable: load
2013-04-05 08:04:37.463505 7f7176ed8700  2 mds.0.44 boot_start 1: opening sessionmap
2013-04-05 08:04:37.463572 7f7176ed8700 10 mds.0.sessionmap load
2013-04-05 08:04:37.463627 7f7176ed8700  2 mds.0.44 boot_start 1: opening anchor table
2013-04-05 08:04:37.463630 7f7176ed8700 10 mds.0.anchortable: load
2013-04-05 08:04:37.463699 7f7176ed8700  2 mds.0.44 boot_start 1: opening snap table
2013-04-05 08:04:37.463701 7f7176ed8700 10 mds.0.snaptable: load
2013-04-05 08:04:37.463754 7f7176ed8700  2 mds.0.44 boot_start 1: opening mds log
2013-04-05 08:04:37.463756 7f7176ed8700  5 mds.0.log open discovering log bounds
2013-04-05 08:04:37.464244 7f71743d0700 10 mds.0.44 MDS::ms_get_authorizer type=osd
2013-04-05 08:04:37.464253 7f71744d1700 10 mds.0.44 MDS::ms_get_authorizer type=osd
2013-04-05 08:04:37.464334 7f71745d2700 10 mds.0.44 MDS::ms_get_authorizer type=osd
2013-04-05 08:04:37.464470 7f71741ce700 10 mds.0.44 MDS::ms_get_authorizer type=osd
2013-04-05 08:04:37.464524 7f71742cf700 10 mds.0.44 MDS::ms_get_authorizer type=osd
2013-04-05 08:04:37.464536 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6841/14764
2013-04-05 08:04:37.464590 7f7176ed8700  5 mds.0.44 ms_handle_connect on 192.168.23.110:6810/20043
2013-04-05 08:04:37.464602 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6832/14317
2013-04-05 08:04:37.464851 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6847/14980
2013-04-05 08:04:37.464874 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6844/14861
2013-04-05 08:04:37.465699 7f7176ed8700 10 mds.0.anchortable: load_2 got 29 bytes
2013-04-05 08:04:37.465718 7f7176ed8700 10 mds.0.anchortable: load_2 loaded v0
2013-04-05 08:04:37.466027 7f7176ed8700 10 mds.0.inotable: load_2 got 397 bytes
2013-04-05 08:04:37.466029 7f7176ed8700 10 mds.0.inotable: load_2 loaded v33020
2013-04-05 08:04:37.466481 7f7173bc8700 10 mds.0.44 MDS::ms_get_authorizer type=osd
2013-04-05 08:04:37.470513 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6800/13951
2013-04-05 08:04:37.470593 7f7176ed8700 10 mds.0.sessionmap dump
2013-04-05 08:04:37.472687 7f7176ed8700 -1 *** Caught signal (Aborted) **
 in thread 7f7176ed8700

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: /usr/local/bin/ceph-mds() [0x852349]
 2: (()+0xf500) [0x7f717afc5500]
 3: (gsignal()+0x35) [0x7f71797118a5]
 4: (abort()+0x175) [0x7f7179713085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f7179fcaa5d]
 6: (()+0xbcbe6) [0x7f7179fc8be6]
 7: (()+0xbcc13) [0x7f7179fc8c13]
 8: (()+0xbcd0e) [0x7f7179fc8d0e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x7b7fee]
 10: (session_info_t::decode(ceph::buffer::list::iterator&)+0x415) [0x7dfa65]
 11: (SessionMap::decode(ceph::buffer::list::iterator&)+0x82c) [0x6e478c]
 12: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x81) [0x6e59c1]
 13: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e) [0x71290e]
 14: (MDS::handle_core_message(Message*)+0x957) [0x4d1c67]
 15: (MDS::_dispatch(Message*)+0x2f) [0x4d1d1f]
 16: (MDS::ms_dispatch(Message*)+0x19b) [0x4d38eb]
 17: (DispatchQueue::entry()+0x309) [0x825da9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7a9e0d]
 19: (()+0x7851) [0x7f717afbd851]
 20: (clone()+0x6d) [0x7f71797c711d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
  -189> 2013-04-05 08:04:25.932687 7f717b3e5780  5 asok(0x24c8000) register_command perfcounters_dump hook 0x2480010
  -188> 2013-04-05 08:04:25.932720 7f717b3e5780  5 asok(0x24c8000) register_command 1 hook 0x2480010
  -187> 2013-04-05 08:04:25.932723 7f717b3e5780  5 asok(0x24c8000) register_command perf dump hook 0x2480010
  -186> 2013-04-05 08:04:25.932729 7f717b3e5780  5 asok(0x24c8000) register_command perfcounters_schema hook 0x2480010
  -185> 2013-04-05 08:04:25.932733 7f717b3e5780  5 asok(0x24c8000) register_command 2 hook 0x2480010
  -184> 2013-04-05 08:04:25.932737 7f717b3e5780  5 asok(0x24c8000) register_command perf schema hook 0x2480010
  -183> 2013-04-05 08:04:25.932740 7f717b3e5780  5 asok(0x24c8000) register_command config show hook 0x2480010
  -182> 2013-04-05 08:04:25.932743 7f717b3e5780  5 asok(0x24c8000) register_command config set hook 0x2480010
  -181> 2013-04-05 08:04:25.932759 7f717b3e5780  5 asok(0x24c8000) register_command log flush hook 0x2480010
  -180> 2013-04-05 08:04:25.932761 7f717b3e5780  5 asok(0x24c8000) register_command log dump hook 0x2480010
  -179> 2013-04-05 08:04:25.932763 7f717b3e5780  5 asok(0x24c8000) register_command log reopen hook 0x2480010
  -178> 2013-04-05 08:04:25.934459 7f717b3e5780  0 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c), process ceph-mds, pid 15599
  -177> 2013-04-05 08:04:25.934840 7f717b3e5780  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6809/15599 need_addr=1
  -176> 2013-04-05 08:04:25.935498 7f717b3e5780  1 finished global_init_daemonize
  -175> 2013-04-05 08:04:25.935764 7f717b3e5780  5 asok(0x24c8000) init /var/run/ceph/ceph-mds.a.asok
  -174> 2013-04-05 08:04:25.935788 7f717b3e5780  5 asok(0x24c8000) bind_and_listen /var/run/ceph/ceph-mds.a.asok
  -173> 2013-04-05 08:04:25.935858 7f717b3e5780  5 asok(0x24c8000) register_command 0 hook 0x2478038
  -172> 2013-04-05 08:04:25.935927 7f717b3e5780  5 asok(0x24c8000) register_command version hook 0x2478038
  -171> 2013-04-05 08:04:25.935932 7f717b3e5780  5 asok(0x24c8000) register_command git_version hook 0x2478038
  -170> 2013-04-05 08:04:25.935936 7f717b3e5780  5 asok(0x24c8000) register_command help hook 0x2480050
  -169> 2013-04-05 08:04:25.935979 7f717b3e5780 10 monclient(hunting): build_initial_monmap
  -168> 2013-04-05 08:04:25.936051 7f71786db700  5 asok(0x24c8000) entry start
  -167> 2013-04-05 08:04:25.936111 7f717b3e5780  1 -- 0.0.0.0:6809/15599 messenger.start
  -166> 2013-04-05 08:04:25.936239 7f717b3e5780  5 adding auth protocol: none
  -165> 2013-04-05 08:04:25.936250 7f717b3e5780  5 adding auth protocol: none
  -164> 2013-04-05 08:04:25.936397 7f717b3e5780 10 mds.-1.0 168 MDSCacheObject
  -163> 2013-04-05 08:04:25.936412 7f717b3e5780 10 mds.-1.0 2144        CInode
  -162> 2013-04-05 08:04:25.936413 7f717b3e5780 10 mds.-1.0 16   elist<>::item   *7=112
  -161> 2013-04-05 08:04:25.936414 7f717b3e5780 10 mds.-1.0 384  inode_t
  -160> 2013-04-05 08:04:25.936416 7f717b3e5780 10 mds.-1.0 56    nest_info_t
  -159> 2013-04-05 08:04:25.936417 7f717b3e5780 10 mds.-1.0 32    frag_info_t
  -158> 2013-04-05 08:04:25.936448 7f717b3e5780 10 mds.-1.0 40   SimpleLock   *5=200
  -157> 2013-04-05 08:04:25.936449 7f717b3e5780 10 mds.-1.0 48   ScatterLock  *3=144
  -156> 2013-04-05 08:04:25.936451 7f717b3e5780 10 mds.-1.0 472 CDentry
  -155> 2013-04-05 08:04:25.936452 7f717b3e5780 10 mds.-1.0 16   elist<>::item
  -154> 2013-04-05 08:04:25.936453 7f717b3e5780 10 mds.-1.0 40   SimpleLock
  -153> 2013-04-05 08:04:25.936453 7f717b3e5780 10 mds.-1.0 1016        CDir
  -152> 2013-04-05 08:04:25.936459 7f717b3e5780 10 mds.-1.0 16   elist<>::item   *2=32
  -151> 2013-04-05 08:04:25.936460 7f717b3e5780 10 mds.-1.0 192  fnode_t
  -150> 2013-04-05 08:04:25.936461 7f717b3e5780 10 mds.-1.0 56    nest_info_t *2
  -149> 2013-04-05 08:04:25.936462 7f717b3e5780 10 mds.-1.0 32    frag_info_t *2
  -148> 2013-04-05 08:04:25.936463 7f717b3e5780 10 mds.-1.0 168 Capability
  -147> 2013-04-05 08:04:25.936465 7f717b3e5780 10 mds.-1.0 32   xlist<>::item   *2=64
  -146> 2013-04-05 08:04:25.936543 7f717b3e5780  1 accepter.accepter.start
  -145> 2013-04-05 08:04:25.936576 7f717b3e5780 10 monclient(hunting): init
  -144> 2013-04-05 08:04:25.936582 7f717b3e5780  5 adding auth protocol: none
  -143> 2013-04-05 08:04:25.936586 7f717b3e5780 10 monclient(hunting): auth_supported 1 method none
  -142> 2013-04-05 08:04:25.936666 7f717b3e5780 10 monclient(hunting): _reopen_session
  -141> 2013-04-05 08:04:25.936728 7f717b3e5780 10 monclient(hunting): _pick_new_mon picked mon.a con 0x2540580 addr 192.168.16.129:6789/0
  -140> 2013-04-05 08:04:25.936777 7f717b3e5780 10 monclient(hunting): _send_mon_message to mon.a at 192.168.16.129:6789/0
  -139> 2013-04-05 08:04:25.936784 7f717b3e5780  1 -- 0.0.0.0:6809/15599 --> 192.168.16.129:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x24e8400 con 0x2540580
  -138> 2013-04-05 08:04:25.936798 7f717b3e5780 10 monclient(hunting): renew_subs
  -137> 2013-04-05 08:04:25.937087 7f717b3e1700  1 -- 192.168.16.129:6809/15599 learned my addr 192.168.16.129:6809/15599
  -136> 2013-04-05 08:04:25.937151 7f717b3e1700 10 mds.-1.0 MDS::ms_get_authorizer type=mon
  -135> 2013-04-05 08:04:25.937353 7f7176ed8700  5 mds.-1.0 ms_handle_connect on 192.168.16.129:6789/0
  -134> 2013-04-05 08:04:25.938097 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 1 ==== mon_map v1 ==== 473+0+0 (3243395836 0 0) 0x25a03c0 con 0x2540580
  -133> 2013-04-05 08:04:25.938131 7f7176ed8700 10 monclient(hunting): handle_monmap mon_map v1
  -132> 2013-04-05 08:04:25.938156 7f7176ed8700 10 monclient(hunting):  got monmap 6, mon.a is now rank 2
  -131> 2013-04-05 08:04:25.938161 7f7176ed8700 10 monclient(hunting): dump:
epoch 6
fsid 4eacc6f6-6561-4359-a035-1eb9a91eb45d
last_changed 2013-04-02 09:12:35.994897
created 2012-10-24 16:32:47.986799
0: 192.168.99.110:6789/0 mon.c
1: 192.168.23.110:6789/0 mon.b
2: 192.168.16.129:6789/0 mon.a

  -130> 2013-04-05 08:04:25.938206 7f7176ed8700  1 monclient(hunting): found mon.a
  -129> 2013-04-05 08:04:25.938214 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 2 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (3352920965 0 0) 0x24e8600 con 0x2540580
  -128> 2013-04-05 08:04:25.938241 7f7176ed8700 10 monclient: my global_id is 10337
  -127> 2013-04-05 08:04:25.938243 7f7176ed8700 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
  -126> 2013-04-05 08:04:25.938248 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x2590380 con 0x2540580
  -125> 2013-04-05 08:04:25.938306 7f7176ed8700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-04-05 08:03:55.938304)
  -124> 2013-04-05 08:04:25.938364 7f717b3e5780  5 monclient: authenticate success, global_id 10337
  -123> 2013-04-05 08:04:25.938520 7f717b3e5780  5 asok(0x24c8000) register_command objecter_requests hook 0x24800f0
  -122> 2013-04-05 08:04:25.938584 7f717b3e5780 10 mds.-1.0 beacon_send up:boot seq 1 (currently up:boot)
  -121> 2013-04-05 08:04:25.938625 7f717b3e5780 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
  -120> 2013-04-05 08:04:25.938638 7f717b3e5780  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mdsbeacon(10337/a up:boot seq 1 v0) v2 -- ?+0 0x25b0000 con 0x2540580
  -119> 2013-04-05 08:04:25.938701 7f717b3e5780 10 monclient: renew_subs
  -118> 2013-04-05 08:04:25.938705 7f717b3e5780 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
  -117> 2013-04-05 08:04:25.938708 7f717b3e5780  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mon_subscribe({monmap=7+,osdmap=0}) v2 -- ?+0 0x25901c0 con 0x2540580
  -116> 2013-04-05 08:04:25.938715 7f717b3e5780 10 monclient: renew_subs
  -115> 2013-04-05 08:04:25.938717 7f717b3e5780 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
  -114> 2013-04-05 08:04:25.938719 7f717b3e5780  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mon_subscribe({mdsmap=0+,monmap=7+,osdmap=0}) v2 -- ?+0 0x25908c0 con 0x2540580
  -113> 2013-04-05 08:04:25.938755 7f717b3e5780 10 mds.-1.0 create_logger
  -112> 2013-04-05 08:04:25.939038 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 3 ==== mon_map v1 ==== 473+0+0 (3243395836 0 0) 0x25a05a0 con 0x2540580
  -111> 2013-04-05 08:04:25.939055 7f7176ed8700 10 monclient: handle_monmap mon_map v1
  -110> 2013-04-05 08:04:25.939072 7f7176ed8700 10 monclient:  got monmap 6, mon.a is now rank 2
  -109> 2013-04-05 08:04:25.939079 7f7176ed8700 10 monclient: dump:
epoch 6
fsid 4eacc6f6-6561-4359-a035-1eb9a91eb45d
last_changed 2013-04-02 09:12:35.994897
created 2012-10-24 16:32:47.986799
0: 192.168.99.110:6789/0 mon.c
1: 192.168.23.110:6789/0 mon.b
2: 192.168.16.129:6789/0 mon.a

  -108> 2013-04-05 08:04:25.939113 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 4 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x25684e0 con 0x2540580
  -107> 2013-04-05 08:04:25.939127 7f7176ed8700 10 monclient: handle_subscribe_ack sent 2013-04-05 08:04:25.936799 renew after 2013-04-05 08:06:55.936799
  -106> 2013-04-05 08:04:25.939584 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 5 ==== osd_map(3689..3689 src has 3188..3689) v3 ==== 11774+0+0 (2846523975 0 0) 0x24e9200 con 0x2540580
  -105> 2013-04-05 08:04:25.939820 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x2568820 con 0x2540580
  -104> 2013-04-05 08:04:25.939833 7f7176ed8700 10 monclient: handle_subscribe_ack sent 0.000000, ignoring
  -103> 2013-04-05 08:04:25.939837 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 7 ==== mdsmap(e 13062) v1 ==== 583+0+0 (1073864237 0 0) 0x24e9000 con 0x2540580
  -102> 2013-04-05 08:04:25.939946 7f7176ed8700  5 mds.-1.0 handle_mds_map epoch 13062 from mon.2
  -101> 2013-04-05 08:04:25.940056 7f7176ed8700 10 mds.-1.0      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
  -100> 2013-04-05 08:04:25.940061 7f7176ed8700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
   -99> 2013-04-05 08:04:25.940065 7f7176ed8700 10 mds.-1.-1 map says i am 192.168.16.129:6809/15599 mds.-1.-1 state down:dne
   -98> 2013-04-05 08:04:25.940072 7f7176ed8700 10 mds.-1.-1 not in map yet
   -97> 2013-04-05 08:04:25.940077 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 8 ==== osd_map(3689..3689 src has 3188..3689) v3 ==== 11774+0+0 (2846523975 0 0) 0x24e9800 con 0x2540580
   -96> 2013-04-05 08:04:25.940089 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 9 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x2568680 con 0x2540580
   -95> 2013-04-05 08:04:25.940094 7f7176ed8700 10 monclient: handle_subscribe_ack sent 0.000000, ignoring
   -94> 2013-04-05 08:04:26.038468 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 10 ==== mdsmap(e 13063) v1 ==== 786+0+0 (1560868916 0 0) 0x24e9600 con 0x2540580
   -93> 2013-04-05 08:04:26.038493 7f7176ed8700  5 mds.-1.-1 handle_mds_map epoch 13063 from mon.2
   -92> 2013-04-05 08:04:26.038515 7f7176ed8700 10 mds.-1.-1      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
   -91> 2013-04-05 08:04:26.038522 7f7176ed8700 10 mds.-1.-1  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
   -90> 2013-04-05 08:04:26.038527 7f7176ed8700 10 mds.-1.0 map says i am 192.168.16.129:6809/15599 mds.-1.0 state up:standby
   -89> 2013-04-05 08:04:26.038533 7f7176ed8700  1 mds.-1.0 handle_mds_map standby
   -88> 2013-04-05 08:04:28.936793 7f7175ed6700 10 monclient: tick
   -87> 2013-04-05 08:04:28.936816 7f7175ed6700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-04-05 08:03:58.936815)
   -86> 2013-04-05 08:04:28.936827 7f7175ed6700 10 monclient: renew subs? (now: 2013-04-05 08:04:28.936827; renew after: 2013-04-05 08:06:55.936799) -- no
   -85> 2013-04-05 08:04:29.938876 7f7174dd3700 10 mds.-1.0 beacon_send up:standby seq 2 (currently up:standby)
   -84> 2013-04-05 08:04:29.938912 7f7174dd3700 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -83> 2013-04-05 08:04:29.938932 7f7174dd3700  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mdsbeacon(10337/a up:standby seq 2 v13063) v2 -- ?+0 0x25b02c0 con 0x2540580
   -82> 2013-04-05 08:04:29.940877 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 11 ==== mdsbeacon(10337/a up:standby seq 2 v13063) v2 ==== 103+0+0 (2969039057 0 0) 0x25b0000 con 0x2540580
   -81> 2013-04-05 08:04:29.940963 7f7176ed8700 10 mds.-1.0 handle_mds_beacon up:standby seq 2 rtt 0.002065
   -80> 2013-04-05 08:04:30.948423 7f7174dd3700 20 mds.-1.bal get_load no root, no load
   -79> 2013-04-05 08:04:30.948576 7f7174dd3700 15 mds.-1.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.47>
   -78> 2013-04-05 08:04:33.939196 7f7174dd3700 10 mds.-1.0 beacon_send up:standby seq 3 (currently up:standby)
   -77> 2013-04-05 08:04:33.939221 7f7174dd3700 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -76> 2013-04-05 08:04:33.939232 7f7174dd3700  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mdsbeacon(10337/a up:standby seq 3 v13063) v2 -- ?+0 0x25b0580 con 0x2540580
   -75> 2013-04-05 08:04:33.941136 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 12 ==== mdsbeacon(10337/a up:standby seq 3 v13063) v2 ==== 103+0+0 (3386805686 0 0) 0x25b0b00 con 0x2540580
   -74> 2013-04-05 08:04:33.941165 7f7176ed8700 10 mds.-1.0 handle_mds_beacon up:standby seq 3 rtt 0.001955
   -73> 2013-04-05 08:04:35.939135 7f7174dd3700 20 mds.-1.bal get_load no root, no load
   -72> 2013-04-05 08:04:35.939189 7f7174dd3700 15 mds.-1.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.44>
   -71> 2013-04-05 08:04:37.320902 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 13 ==== mdsmap(e 13064) v1 ==== 583+0+0 (2482818736 0 0) 0x24e9400 con 0x2540580
   -70> 2013-04-05 08:04:37.320922 7f7176ed8700  5 mds.-1.0 handle_mds_map epoch 13064 from mon.2
   -69> 2013-04-05 08:04:37.320943 7f7176ed8700 10 mds.-1.0      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
   -68> 2013-04-05 08:04:37.320949 7f7176ed8700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
   -67> 2013-04-05 08:04:37.320953 7f7176ed8700 10 mds.0.44 map says i am 192.168.16.129:6809/15599 mds.0.44 state up:replay
   -66> 2013-04-05 08:04:37.320958 7f7176ed8700 10 mds.0.44  peer mds gid 10319 removed from map
   -65> 2013-04-05 08:04:37.320960 7f7176ed8700  1 -- 192.168.16.129:6809/15599 mark_down 192.168.23.110:6808/24924 -- pipe dne
   -64> 2013-04-05 08:04:37.321007 7f7176ed8700  1 mds.0.44 handle_mds_map i am now mds.0.44
   -63> 2013-04-05 08:04:37.321008 7f7176ed8700  1 mds.0.44 handle_mds_map state change up:standby --> up:replay
   -62> 2013-04-05 08:04:37.321010 7f7176ed8700  1 mds.0.44 replay_start
   -61> 2013-04-05 08:04:37.321016 7f7176ed8700  7 mds.0.cache set_recovery_set
   -60> 2013-04-05 08:04:37.321017 7f7176ed8700  1 mds.0.44  recovery set is
   -59> 2013-04-05 08:04:37.321019 7f7176ed8700  1 mds.0.44  need osdmap epoch 3690, have 3689
   -58> 2013-04-05 08:04:37.321021 7f7176ed8700  1 mds.0.44  waiting for osdmap 3690 (which blacklists prior instance)
   -57> 2013-04-05 08:04:37.321029 7f7176ed8700 10 monclient: renew_subs
   -56> 2013-04-05 08:04:37.321033 7f7176ed8700 10 monclient: _send_mon_message to mon.a at 192.168.16.129:6789/0
   -55> 2013-04-05 08:04:37.321036 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 192.168.16.129:6789/0 -- mon_subscribe({mdsmap=13065+,monmap=7+,osdmap=3690}) v2 -- ?+0 0x2590380 con 0x2540580
   -54> 2013-04-05 08:04:37.321050 7f7176ed8700  7 mds.0.cache handle_mds_failure mds.0
   -53> 2013-04-05 08:04:37.321054 7f7176ed8700  1 mds.0.cache handle_mds_failure mds.0 : recovery peers are
   -52> 2013-04-05 08:04:37.321057 7f7176ed8700 10 mds.0.cache  wants_resolve
   -51> 2013-04-05 08:04:37.321058 7f7176ed8700 10 mds.0.cache  got_resolve
   -50> 2013-04-05 08:04:37.321059 7f7176ed8700 10 mds.0.cache  rejoin_sent
   -49> 2013-04-05 08:04:37.321060 7f7176ed8700 10 mds.0.cache  rejoin_gather
   -48> 2013-04-05 08:04:37.321061 7f7176ed8700 10 mds.0.cache  rejoin_ack_gather
   -47> 2013-04-05 08:04:37.321064 7f7176ed8700  5 mds.0.migrator handle_mds_failure_or_stop mds.0
   -46> 2013-04-05 08:04:37.321072 7f7176ed8700 10 mds.0.cache show_subtrees - no subtrees
   -45> 2013-04-05 08:04:37.463246 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 14 ==== osd_map(3690..3690 src has 3189..3690) v3 ==== 308+0+0 (275205728 0 0) 0x24e9c00 con 0x2540580
   -44> 2013-04-05 08:04:37.463368 7f7176ed8700  2 mds.0.44 boot_start 1: opening inotable
   -43> 2013-04-05 08:04:37.463381 7f7176ed8700 10 mds.0.inotable: load
   -42> 2013-04-05 08:04:37.463480 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 46.252.23.153:6832/14317 -- osd_op(mds.0.44:1 mds0_inotable [read 0~0] 1.b852b893 e3690) v4 -- ?+0 0x24c0480 con 0x2540c60
   -41> 2013-04-05 08:04:37.463505 7f7176ed8700  2 mds.0.44 boot_start 1: opening sessionmap
   -40> 2013-04-05 08:04:37.463572 7f7176ed8700 10 mds.0.sessionmap load
   -39> 2013-04-05 08:04:37.463616 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 46.252.23.153:6841/14764 -- osd_op(mds.0.44:2 mds0_sessionmap [read 0~0] 1.3270c60b e3690) v4 -- ?+0 0x24c0900 con 0x25414a0
   -38> 2013-04-05 08:04:37.463627 7f7176ed8700  2 mds.0.44 boot_start 1: opening anchor table
   -37> 2013-04-05 08:04:37.463630 7f7176ed8700 10 mds.0.anchortable: load
   -36> 2013-04-05 08:04:37.463670 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 192.168.23.110:6810/20043 -- osd_op(mds.0.44:3 mds_anchortable [read 0~0] 1.a977f6a7 e3690) v4 -- ?+0 0x24c06c0 con 0x2541340
   -35> 2013-04-05 08:04:37.463699 7f7176ed8700  2 mds.0.44 boot_start 1: opening snap table
   -34> 2013-04-05 08:04:37.463701 7f7176ed8700 10 mds.0.snaptable: load
   -33> 2013-04-05 08:04:37.463742 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 46.252.23.153:6844/14861 -- osd_op(mds.0.44:4 mds_snaptable [read 0~0] 1.d90270ad e3690) v4 -- ?+0 0x24c0fc0 con 0x25411e0
   -32> 2013-04-05 08:04:37.463754 7f7176ed8700  2 mds.0.44 boot_start 1: opening mds log
   -31> 2013-04-05 08:04:37.463756 7f7176ed8700  5 mds.0.log open discovering log bounds
   -30> 2013-04-05 08:04:37.463762 7f7176ed8700  1 mds.0.journaler(ro) recover start
   -29> 2013-04-05 08:04:37.463763 7f7176ed8700  1 mds.0.journaler(ro) read_head
   -28> 2013-04-05 08:04:37.463814 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 46.252.23.153:6847/14980 -- osd_op(mds.0.44:5 200.00000000 [read 0~0] 1.844f3494 e3690) v4 -- ?+0 0x24c0d80 con 0x2541080
   -27> 2013-04-05 08:04:37.463837 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== mon.2 192.168.16.129:6789/0 15 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3455830995 0 0) 0x25689c0 con 0x2540580
   -26> 2013-04-05 08:04:37.463845 7f7176ed8700 10 monclient: handle_subscribe_ack sent 2013-04-05 08:04:37.321030 renew after 2013-04-05 08:07:07.321030
   -25> 2013-04-05 08:04:37.464244 7f71743d0700 10 mds.0.44 MDS::ms_get_authorizer type=osd
   -24> 2013-04-05 08:04:37.464253 7f71744d1700 10 mds.0.44 MDS::ms_get_authorizer type=osd
   -23> 2013-04-05 08:04:37.464334 7f71745d2700 10 mds.0.44 MDS::ms_get_authorizer type=osd
   -22> 2013-04-05 08:04:37.464470 7f71741ce700 10 mds.0.44 MDS::ms_get_authorizer type=osd
   -21> 2013-04-05 08:04:37.464524 7f71742cf700 10 mds.0.44 MDS::ms_get_authorizer type=osd
   -20> 2013-04-05 08:04:37.464536 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6841/14764
   -19> 2013-04-05 08:04:37.464590 7f7176ed8700  5 mds.0.44 ms_handle_connect on 192.168.23.110:6810/20043
   -18> 2013-04-05 08:04:37.464602 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6832/14317
   -17> 2013-04-05 08:04:37.464851 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6847/14980
   -16> 2013-04-05 08:04:37.464874 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6844/14861
   -15> 2013-04-05 08:04:37.465559 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== osd.19 192.168.23.110:6810/20043 1 ==== osd_op_reply(3 mds_anchortable [read 0~29] ondisk = 0) v4 ==== 114+0+29 (1788275687 0 218791405) 0x25a0780 con 0x2541340
   -14> 2013-04-05 08:04:37.465699 7f7176ed8700 10 mds.0.anchortable: load_2 got 29 bytes
   -13> 2013-04-05 08:04:37.465718 7f7176ed8700 10 mds.0.anchortable: load_2 loaded v0
   -12> 2013-04-05 08:04:37.465738 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== osd.9 46.252.23.153:6847/14980 1 ==== osd_op_reply(5 200.00000000 [read 0~84] ondisk = 0) v4 ==== 111+0+84 (2060639142 0 172874593) 0x25a0b40 con 0x2541080
   -11> 2013-04-05 08:04:37.465759 7f7176ed8700  1 mds.0.journaler(ro) _finish_read_head loghead(trim 281018368, expire 281018839, write 408581072).  probing for end of log (from 408581072)...
   -10> 2013-04-05 08:04:37.465763 7f7176ed8700  1 mds.0.journaler(ro) probing for end of the log
    -9> 2013-04-05 08:04:37.465816 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 46.252.23.153:6832/14317 -- osd_op(mds.0.44:6 200.00000061 [stat] 1.4fc4cbea rwordered e3690) v4 -- ?+0 0x24c0b40 con 0x2540c60
    -8> 2013-04-05 08:04:37.465941 7f7176ed8700  1 -- 192.168.16.129:6809/15599 --> 46.252.23.153:6800/13951 -- osd_op(mds.0.44:7 200.00000062 [stat] 1.31af1135 rwordered e3690) v4 -- ?+0 0x24c1680 con 0x2541b80
    -7> 2013-04-05 08:04:37.465959 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== osd.14 46.252.23.153:6832/14317 1 ==== osd_op_reply(1 mds0_inotable [read 0~397] ondisk = 0) v4 ==== 112+0+397 (3374773485 0 151292013) 0x25a0d20 con 0x2540c60
    -6> 2013-04-05 08:04:37.466027 7f7176ed8700 10 mds.0.inotable: load_2 got 397 bytes
    -5> 2013-04-05 08:04:37.466029 7f7176ed8700 10 mds.0.inotable: load_2 loaded v33020
    -4> 2013-04-05 08:04:37.466481 7f7173bc8700 10 mds.0.44 MDS::ms_get_authorizer type=osd
    -3> 2013-04-05 08:04:37.470513 7f7176ed8700  5 mds.0.44 ms_handle_connect on 46.252.23.153:6800/13951
    -2> 2013-04-05 08:04:37.470531 7f7176ed8700  1 -- 192.168.16.129:6809/15599 <== osd.7 46.252.23.153:6841/14764 1 ==== osd_op_reply(2 mds0_sessionmap [read 0~542] ondisk = 0) v4 ==== 114+0+542 (2655036902 0 3336007267) 0x25a0960 con 0x25414a0
    -1> 2013-04-05 08:04:37.470593 7f7176ed8700 10 mds.0.sessionmap dump
     0> 2013-04-05 08:04:37.472687 7f7176ed8700 -1 *** Caught signal (Aborted) **
 in thread 7f7176ed8700

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: /usr/local/bin/ceph-mds() [0x852349]
 2: (()+0xf500) [0x7f717afc5500]
 3: (gsignal()+0x35) [0x7f71797118a5]
 4: (abort()+0x175) [0x7f7179713085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f7179fcaa5d]
 6: (()+0xbcbe6) [0x7f7179fc8be6]
 7: (()+0xbcc13) [0x7f7179fc8c13]
 8: (()+0xbcd0e) [0x7f7179fc8d0e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x7b7fee]
 10: (session_info_t::decode(ceph::buffer::list::iterator&)+0x415) [0x7dfa65]
 11: (SessionMap::decode(ceph::buffer::list::iterator&)+0x82c) [0x6e478c]
 12: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x81) [0x6e59c1]
 13: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e) [0x71290e]
 14: (MDS::handle_core_message(Message*)+0x957) [0x4d1c67]
 15: (MDS::_dispatch(Message*)+0x2f) [0x4d1d1f]
 16: (MDS::ms_dispatch(Message*)+0x19b) [0x4d38eb]
 17: (DispatchQueue::entry()+0x309) [0x825da9]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7a9e0d]
 19: (()+0x7851) [0x7f717afbd851]
 20: (clone()+0x6d) [0x7f71797c711d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
  20/20 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/mds.a.log
--- end dump of recent events ---

Actions #8

Updated by norbert schmidt about 11 years ago

sessionmap, command is rados --pool=metadata get mds0_sessionmap /tmp/sessionmap (without -o) :)

Actions #9

Updated by Zheng Yan about 11 years ago

there is a typo in session_info_t::decode

Actions #10

Updated by Joao Eduardo Luis about 11 years ago

alexxy on IRC is reporting that the patch doesn't work. He would have provided his report himself, but it appears that he still has some registration issues.

Actions #11

Updated by Greg Farnum about 11 years ago

alexxy's sessionmap doesn't look anything like a sessionmap should; this won't fix his issue. Norbert's is at least superficially realistic looking but this doesn't seem to help his either, although Yan is correct about the typo and I've pulled it in to my local tree. More testing will be required.

Actions #12

Updated by Zheng Yan about 11 years ago

The 26th byte of Norbert's sessionmap is 1. If I'm not wrong, it's struct_v for session_info_t. But the oldest version current code supports is 2.

Actions #13

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to In Progress
  • Priority changed from Normal to Urgent

That shouldn't be a problem for v0.58; it included version 2 session_info_t. You sure that's the version you upgraded from, Norbert?

Going over this again it does look shockingly as though we failed to include forward compatibility of the SessionMap between bobtail and current next though...crap. That's probably something we need to fix or anybody running CephFS and not doing a full shutdown of all their clients is going to have a very bad day.

Actions #14

Updated by Joao Eduardo Luis about 11 years ago

update directly from IRC, as alexxy is still having registration issues:

<alexxy> joao: upgrade was from version 0.56.x

Actions #15

Updated by Greg Farnum about 11 years ago

Ah, this looks to be less bad than I thought — the (struct_v == 2) check should be (struct_v <= 2) is all, from the same patch that had the "completed_requests" typo. Let me test those upgrade paths and make sure so I can reproduce and verify.

Actions #16

Updated by Alexey Shvetsov about 11 years ago

if you have some patch that we can test, i'd be glad =)

Actions #17

Updated by Greg Farnum about 11 years ago

I'm having trouble reproducing this bug, but I'm probably not going through the right steps. A patch that I think should handle it is below (thanks Yan!)

commit 01f99c9d6cc0c3494bcb81e914b28bcc4ca2f7a1
Author: Yan, Zheng <zheng.z.yan@intel.com>
Date:   Fri Apr 5 13:58:36 2013 +0800

    mds: fix session_info_t decoding

    commit 0bcf2ac081 changes session_info_t's format, but there is
    a typo in the code that decodes old format. We also need to
    decode struct_v == 1 in the same way as struct_v == 2.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    Signed-off-by: Greg Farnum <greg@inktank.com>

diff --git a/src/mds/mdstypes.cc b/src/mds/mdstypes.cc
index ad4a71a..b1ce640 100644
--- a/src/mds/mdstypes.cc
+++ b/src/mds/mdstypes.cc
@@ -541,9 +541,9 @@ void session_info_t::decode(bufferlist::iterator& p)
 {
   DECODE_START_LEGACY_COMPAT_LEN(3, 2, 2, p);
   ::decode(inst, p);
-  if (struct_v == 2) {
+  if (struct_v <= 2) {
     set<tid_t> s;
-    ::decode(completed_requests, p);
+    ::decode(s, p);
     while (!s.empty()) {
       completed_requests[*s.begin()] = inodeno_t();
       s.erase(s.begin());

However, the mds0_sessionmap you gave me doesn't look right alexxy — it might be that something went wrong along the chain and the actual one is fine, but if not this isn't going to fix it.

back to reproducing

Actions #18

Updated by Greg Farnum about 11 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Greg Farnum to Sam Lang

Hurray, I did manage to reproduce so I guess I just missed before, and indeed it works with that patch and fails without it. Pushed the patch to wip-sessionmap-4644; you can cherry-pick that onto whatever branch you like if you're building locally; otherwise it's on our "next" (going to be Cuttlefish) branch.

I'll give it to Sam for validation since we both added a line, but if you could give us a thumbs-up that'd be good too, Yan.

Actions #19

Updated by Zheng Yan about 11 years ago

The code looks good.

Actions #20

Updated by norbert schmidt about 11 years ago

let me know if i can test patches for you ! :)

Actions #21

Updated by Alexey Shvetsov about 11 years ago

Last patch seems work. At least mds dont crash anymore. Also df reports non bogus values.
I'll add this patch to gentoo package

Actions #22

Updated by norbert schmidt about 11 years ago

so patch tested, mds is running fine now. thx !

Actions #23

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to Resolved

Merged into next as of commit:d777b8e66b2e950266e52589c129b00f77b8afc0 (Thanks Sam!).

Actions

Also available in: Atom PDF