Bug #2379
closedMon crash after start
0%
Description
--- begin dump of recent events --- -3> 2012-05-02 17:58:13.741789 7fb5dcd97780 1 store(/srv/ceph/mon) mount -2> 2012-05-02 17:58:13.741972 7fb5dcd97780 0 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1), process ceph-mon, pid 16077 -1> 2012-05-02 17:58:13.743663 7fb5dcd97780 1 mon.n8c1@0(probing) e7 init fsid b9d2c51b-83c1-4175-b168-d3a57fa4b9e5 0> 2012-05-02 17:58:13.744765 7fb5dcd97780 -1 *** Caught signal (Aborted) ** in thread 7fb5dcd97780 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1) 1: /usr/bin/ceph-mon() [0x615cf8] 2: (()+0xf030) [0x7fb5dc97c030] 3: (gsignal()+0x35) [0x7fb5db101475] 4: (abort()+0x180) [0x7fb5db1046f0] 5: (__assert_fail()+0xf1) [0x7fb5db0fa621] 6: /usr/bin/ceph-mon() [0x499743] 7: (Monitor::init()+0x3c2) [0x482832] 8: (main()+0x1b2a) [0x47293a] 9: (__libc_start_main()+0xfd) [0x7fb5db0edead] 10: /usr/bin/ceph-mon() [0x474e59] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- end dump of recent events ---
Ceph version on n8c1: 0.46-1~bpo70+1 (the same problem appeared with version 0.44.1-1~bpo70+1)
Ceph version on the rest nodes: 0.44.1-1~bpo70+1
Kernel version: 3.2.12-1
Ceph config:
[global]
auth supported = cephx
keyring = /srv/ceph/keyring.admin
[mon]
mon data = /srv/ceph/mon
[mon.n3c1]
host = n3c1
mon addr = 1.1.1.1:6789
[mon.n8c1]
host = n8c1
mon addr = 2.2.2.2:6789
[mon.n4c1]
host = n4c1
mon addr = 3.3.3.3:6789
[mds]
debug mds = 1
keyring = /srv/ceph/ceph-stage2/keyring.$name
[mds.n3c1]
host = n3c1
mds standby replay = true
mds standby for name = n4c1
[mds.n4c1]
host = n4c1
[osd]
osd data = /srv/ceph/$name
osd journal = /srv/ceph/$name.journal
osd journal size = 1000
filestore btrfs snap = 0
keyring = /srv/ceph/ceph-stage2/keyring.$name
debug osd = 1
[osd.1]
host = n3c1
[osd.0]
host = n4c1
Updated by Maciej Galkiewicz almost 12 years ago
I can provide it for you. Please describe how to do this.
Updated by Sage Weil almost 12 years ago
- Category set to Monitor
- Status changed from New to Need More Info
What would actually be better is if you could fire up gdb on ceph-mon and the core file and paste a backtrace here. Or find us in #ceph, that'll be faster if we end up needing more info.
Thanks!
Updated by Sage Weil almost 12 years ago
- Source changed from Development to Community (user)
the problem was that latest was at a higher version than last_committed but slurping wasn't set.
checking whether logs indicate that it was definitely started successfully with 0.42 or later (when slurping flag was added).. that would indicate something is still broken.
Updated by Greg Farnum almost 12 years ago
Sage, did you check out the logging for slurping yet?
Maciej, do you still need instructions on core files?
Updated by Maciej Galkiewicz almost 12 years ago
The problem occurred again.
-28> 2012-05-13 13:31:47.534066 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -27> 2012-05-13 13:31:47.797830 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -26> 2012-05-13 13:31:49.600410 7f9c7855c700 1 mon.n8c1@0(electing) e7 discarding message auth(proto 0 29 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum -25> 2012-05-13 13:31:52.818827 7f9c7855c700 0 log [INF] : mon.n8c1@0 won leader election with quorum 0,2 -24> 2012-05-13 13:32:00.402827 7f9c77351700 0 -- 1.1.1.1:6789/0 >> 10.0.9.34:0/1682607833 pipe(0x2785a00 sd =24 pgs=0 cs=0 l=0).accept peer addr is really 10.0.9.34:0/1682607833 (socket is 10.0.9.34:34742/0) -23> 2012-05-13 13:32:02.367065 7f9c76341700 0 -- 1.1.1.1:6789/0 >> 10.0.5.30:0/2888157195 pipe(0x2857c80 sd =26 pgs=0 cs=0 l=0).accept peer addr is really 10.0.5.30:0/2888157195 (socket is 10.0.5.30:45356/0) -22> 2012-05-13 13:32:06.442501 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -21> 2012-05-13 13:32:07.182669 7f9c77351700 0 -- 1.1.1.1:6789/0 >> 10.0.9.34:0/1682607833 pipe(0x2857500 sd =24 pgs=0 cs=0 l=0).accept peer addr is really 10.0.9.34:0/1682607833 (socket is 10.0.9.34:34743/0) -20> 2012-05-13 13:32:07.382599 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -19> 2012-05-13 13:32:12.403379 7f9c7855c700 0 log [INF] : mon.n8c1@0 won leader election with quorum 0,2 -18> 2012-05-13 13:32:37.342850 7f9c78d5d700 1 mon.n8c1@0(slurping).osd e811 e811: 2 osds: 2 up, 2 in -17> 2012-05-13 13:33:02.481908 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -16> 2012-05-13 13:33:05.212981 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -15> 2012-05-13 13:33:10.248659 7f9c7855c700 0 log [INF] : mon.n8c1@0 won leader election with quorum 0,1 -14> 2012-05-13 13:33:13.514838 7f9c78d5d700 0 -- 1.1.1.1:6789/0 send_message dropped message mdsmap(e 804) v1 because of no pipe on con 0x27618c0 -13> 2012-05-13 13:33:13.514865 7f9c78d5d700 0 -- 1.1.1.1:6789/0 send_message dropped message mdsmap(e 804) v1 because of no pipe on con 0x27618c0 -12> 2012-05-13 13:33:13.514874 7f9c78d5d700 0 -- 1.1.1.1:6789/0 send_message dropped message mdsmap(e 804) v1 because of no pipe on con 0x27618c0 -11> 2012-05-13 13:33:13.740516 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -10> 2012-05-13 13:33:18.778031 7f9c7855c700 0 log [INF] : mon.n8c1@0 won leader election with quorum 0,1 -9> 2012-05-13 13:33:21.312197 7f9c76846700 0 -- 1.1.1.1:6789/0 >> 10.0.8.19:0/1114788498 pipe(0x2784780 sd =24 pgs=0 cs=0 l=0).accept peer addr is really 10.0.8.19:0/1114788498 (socket is 10.0.8.19:52518/0) -8> 2012-05-13 13:33:21.617743 7f9c77351700 0 -- 1.1.1.1:6789/0 >> 10.0.9.36:0/3521849648 pipe(0x278f780 sd =25 pgs=0 cs=0 l=0).accept peer addr is really 10.0.9.36:0/3521849648 (socket is 10.0.9.36:50829/0) -7> 2012-05-13 13:33:26.897076 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -6> 2012-05-13 13:33:26.978336 7f9c77a58700 0 -- 1.1.1.1:6789/0 >> 10.0.9.35:0/336267109 pipe(0x278fa00 sd= 26 pgs=0 cs=0 l=0).accept peer addr is really 10.0.9.35:0/336267109 (socket is 10.0.9.35:55828/0) -5> 2012-05-13 13:33:28.562429 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -4> 2012-05-13 13:33:33.591594 7f9c7855c700 0 log [INF] : mon.n8c1@0 won leader election with quorum 0,1 -3> 2012-05-13 13:33:37.216898 7f9c77a58700 0 -- 1.1.1.1:6789/0 >> 10.0.9.38:0/3476192466 pipe(0x2ae5000 sd =26 pgs=0 cs=0 l=0).accept peer addr is really 10.0.9.38:0/3476192466 (socket is 10.0.9.38:56509/0) -2> 2012-05-13 13:33:37.807372 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election -1> 2012-05-13 13:33:38.773870 7f9c78d5d700 0 log [INF] : mon.n8c1 calling new monitor election 0> 2012-05-13 13:33:39.603867 7f9c7855c700 -1 mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos()' thread 7f9c7855c700 time 2012-05-13 13:33:39.603193 mon/PGMonitor.cc: 146: FAILED assert(paxosv >= pg_map.version) ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1) 1: (PGMonitor::update_from_paxos()+0xc73) [0x4f0423] 2: (PGMonitor::tick()+0x56) [0x4e8716] 3: (Monitor::tick()+0x5d) [0x4875fd] 4: (SafeTimer::timer_thread()+0x34a) [0x5c7c1a] 5: (SafeTimerThread::entry()+0xd) [0x5c87ed] 6: (()+0x6b50) [0x7f9c7d821b50] 7: (clone()+0x6d) [0x7f9c7c05590d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- end dump of recent events --- 2012-05-13 13:33:39.608304 7f9c7855c700 -1 *** Caught signal (Aborted) ** in thread 7f9c7855c700 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1) 1: ceph-mon() [0x615cf8] 2: (()+0xf030) [0x7f9c7d82a030] 3: (gsignal()+0x35) [0x7f9c7bfaf475] 4: (abort()+0x180) [0x7f9c7bfb26f0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9c7c80052d] 6: (()+0x62636) [0x7f9c7c7fe636] 7: (()+0x62663) [0x7f9c7c7fe663] 8: (()+0x6288e) [0x7f9c7c7fe88e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x261) [0x5cfc01] 10: (PGMonitor::update_from_paxos()+0xc73) [0x4f0423] 11: (PGMonitor::tick()+0x56) [0x4e8716] 12: (Monitor::tick()+0x5d) [0x4875fd] 13: (SafeTimer::timer_thread()+0x34a) [0x5c7c1a] 14: (SafeTimerThread::entry()+0xd) [0x5c87ed] 15: (()+0x6b50) [0x7f9c7d821b50] 16: (clone()+0x6d) [0x7f9c7c05590d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- 0> 2012-05-13 13:33:39.608304 7f9c7855c700 -1 *** Caught signal (Aborted) ** in thread 7f9c7855c700 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1) 1: ceph-mon() [0x615cf8] 2: (()+0xf030) [0x7f9c7d82a030] 3: (gsignal()+0x35) [0x7f9c7bfaf475] 4: (abort()+0x180) [0x7f9c7bfb26f0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9c7c80052d] 6: (()+0x62636) [0x7f9c7c7fe636] 7: (()+0x62663) [0x7f9c7c7fe663] 8: (()+0x6288e) [0x7f9c7c7fe88e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x261) [0x5cfc01] 10: (PGMonitor::update_from_paxos()+0xc73) [0x4f0423] 11: (PGMonitor::tick()+0x56) [0x4e8716] 12: (Monitor::tick()+0x5d) [0x4875fd] 13: (SafeTimer::timer_thread()+0x34a) [0x5c7c1a] 14: (SafeTimerThread::entry()+0xd) [0x5c87ed] 15: (()+0x6b50) [0x7f9c7d821b50] 16: (clone()+0x6d) [0x7f9c7c05590d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- end dump of recent events ---
After the crash slurping was not set. I had to do this once again.
Updated by Sage Weil almost 12 years ago
If this happens again, can you grab a tarball of the mon data directory before fixing/restarting?
Also, if you could enable in-memory logging, we'll get a nice dump if we assert again:
debug mon = 1/20
debug ms = 0/1
(that will keep recent log messages in memory and only write them out on a crash).
Thanks!
Updated by Sage Weil almost 12 years ago
triggered this on congress.
looks like latest is a 150 epochs out ahead of last_committed and the actual states:
root@peon5752:/srv/ceph/mon.peon5752/pgmap# ls 1868863 1868910 1868957 1869004 1869051 1869098 1869145 1869192 1868864 1868911 1868958 1869005 1869052 1869099 1869146 1869193 1868865 1868912 1868959 1869006 1869053 1869100 1869147 1869194 1868866 1868913 1868960 1869007 1869054 1869101 1869148 1869195 1868867 1868914 1868961 1869008 1869055 1869102 1869149 1869196 1868868 1868915 1868962 1869009 1869056 1869103 1869150 1869197 1868869 1868916 1868963 1869010 1869057 1869104 1869151 1869198 1868870 1868917 1868964 1869011 1869058 1869105 1869152 1869199 1868871 1868918 1868965 1869012 1869059 1869106 1869153 1869200 1868872 1868919 1868966 1869013 1869060 1869107 1869154 1869201 1868873 1868920 1868967 1869014 1869061 1869108 1869155 1869202 1868874 1868921 1868968 1869015 1869062 1869109 1869156 1869203 1868875 1868922 1868969 1869016 1869063 1869110 1869157 1869204 1868876 1868923 1868970 1869017 1869064 1869111 1869158 1869205 1868877 1868924 1868971 1869018 1869065 1869112 1869159 1869206 1868878 1868925 1868972 1869019 1869066 1869113 1869160 1869207 1868879 1868926 1868973 1869020 1869067 1869114 1869161 1869208 1868880 1868927 1868974 1869021 1869068 1869115 1869162 1869209 1868881 1868928 1868975 1869022 1869069 1869116 1869163 1869210 1868882 1868929 1868976 1869023 1869070 1869117 1869164 1869211 1868883 1868930 1868977 1869024 1869071 1869118 1869165 1869212 1868884 1868931 1868978 1869025 1869072 1869119 1869166 1869213 1868885 1868932 1868979 1869026 1869073 1869120 1869167 1869214 1868886 1868933 1868980 1869027 1869074 1869121 1869168 1869215 1868887 1868934 1868981 1869028 1869075 1869122 1869169 1869216 1868888 1868935 1868982 1869029 1869076 1869123 1869170 1869217 1868889 1868936 1868983 1869030 1869077 1869124 1869171 1869218 1868890 1868937 1868984 1869031 1869078 1869125 1869172 1869219 1868891 1868938 1868985 1869032 1869079 1869126 1869173 1869220 1868892 1868939 1868986 1869033 1869080 1869127 1869174 1869221 1868893 1868940 1868987 1869034 1869081 1869128 1869175 1869222 1868894 1868941 1868988 1869035 1869082 1869129 1869176 1869223 1868895 1868942 1868989 1869036 1869083 1869130 1869177 1869224 1868896 1868943 1868990 1869037 1869084 1869131 1869178 1869225 1868897 1868944 1868991 1869038 1869085 1869132 1869179 1869226 1868898 1868945 1868992 1869039 1869086 1869133 1869180 1869227 1868899 1868946 1868993 1869040 1869087 1869134 1869181 1869228 1868900 1868947 1868994 1869041 1869088 1869135 1869182 1869229 1868901 1868948 1868995 1869042 1869089 1869136 1869183 1869230 1868902 1868949 1868996 1869043 1869090 1869137 1869184 22709 1868903 1868950 1868997 1869044 1869091 1869138 1869185 first_committed 1868904 1868951 1868998 1869045 1869092 1869139 1869186 last_committed 1868905 1868952 1868999 1869046 1869093 1869140 1869187 last_pn 1868906 1868953 1869000 1869047 1869094 1869141 1869188 latest 1868907 1868954 1869001 1869048 1869095 1869142 1869189 slurping 1868908 1868955 1869002 1869049 1869096 1869143 1869190 1868909 1868956 1869003 1869050 1869097 1869144 1869191 root@peon5752:/srv/ceph/mon.peon5752/pgmap# cat first_committed 1868863 root@peon5752:/srv/ceph/mon.peon5752/pgmap# cat last_committed 1869230 root@peon5752:/srv/ceph/mon.peon5752/pgmap# hexdump -C latest |head 00000000 34 86 1c 00 00 00 00 00 7d f6 47 01 03 34 86 1c |4.......}.G..4..| 00000010 00 00 00 00 00 38 00 01 00 01 05 00 00 00 00 00 |.....8..........| 00000020 00 00 fa ff 00 00 ff ff ff ff 09 08 29 01 00 00 |............)...| 00000030 3d 01 00 00 00 00 00 00 73 71 00 00 46 1a 00 00 |=.......sq..F...| 00000040 00 00 00 00 78 5e 00 00 06 00 00 00 00 00 00 00 |....x^..........| 00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00000060 00 00 00 00 55 59 00 00 f7 69 00 00 01 00 00 00 |....UY...i......| 00000070 00 00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 |................| 00000080 00 31 01 00 00 00 00 00 00 b5 70 00 00 36 b7 c6 |.1........p..6..| 00000090 4f 88 cb cd 33 02 02 62 00 00 00 03 03 58 00 00 |O...3..b.....X..| root@peon5752:/srv/ceph/mon.peon5752/pgmap# printf "%d\n" 0x1c8634 1869364 (gdb) p paxosv $1 = 1869230 (gdb) p pg_map.version $2 = 1869364
Updated by Sage Weil almost 12 years ago
- Status changed from Need More Info to Fix Under Review
see wip-2379
Updated by Sage Weil almost 12 years ago
- Status changed from Fix Under Review to Resolved