Bug #5460
closed
v0.61.3 -> v0.65 upgrade: new OSDs mark old as down
Description
I tried upgrading my v0.61.3 cluster to v0.65 today. All of the new (v0.65) OSDs are marking all of the old (v0.61.3) ones as down, but not each other. I currently have 24 of 144 OSDs running v0.65. I had to set mon osd min reporters to 26; without it, 144-26=118 OSDs constantly flap between being marked down by the new ones and marking themselves up.
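The reasoning behind the workaround can be sketched as below. This is a deliberately simplified model, not the monitor's actual code: it assumes an OSD is only marked down once the number of distinct reporting OSDs reaches the configured threshold. The function name `can_mark_down` is hypothetical.

```python
def can_mark_down(distinct_reporters, min_reporters):
    """Simplified model (an assumption): enough distinct reporters are needed to mark an OSD down."""
    return distinct_reporters >= min_reporters


# Numbers from this report: 24 upgraded OSDs report failures, threshold raised to 26.
upgraded_osds = 24
print(can_mark_down(upgraded_osds, min_reporters=26))
```

Under this model, the 24 upgraded OSDs alone can never reach a threshold of 26, so their reports stop taking the old OSDs down.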
The logs are full of:
2013-06-26 07:22:58.117660 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.140 ever on either front or back, first ping sent 2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117668 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.141 ever on either front or back, first ping sent 2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117674 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.142 ever on either front or back, first ping sent 2013-06-26 07:16:48.411919 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117679 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.143 ever on either front or back, first ping sent 2013-06-26 07:16:57.720749 (cutoff 2013-06-26 07:22:38.117061)
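For anyone triaging a similar log, a small script can summarize which peers each OSD never heard from. This is only a sketch: the regular expression is derived from the heartbeat_check lines quoted above, and `silent_peers` is a hypothetical helper name, not part of Ceph.

```python
import re

# Pattern derived from the log lines in this report; other Ceph versions may word them differently.
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from (?P<peer>osd\.\d+)"
)


def silent_peers(log_text):
    """Map each reporting OSD to the sorted list of peers it got no reply from."""
    peers = {}
    for match in HEARTBEAT_RE.finditer(log_text):
        peers.setdefault(match.group("reporter"), set()).add(match.group("peer"))
    # Sort numerically by OSD id so osd.9 comes before osd.10.
    return {
        reporter: sorted(found, key=lambda name: int(name.split(".")[1]))
        for reporter, found in peers.items()
    }


sample = (
    "2013-06-26 07:22:58.117660 7fefa16a6700 -1 osd.1 189205 heartbeat_check: "
    "no reply from osd.140 ever on either front or back, first ping sent "
    "2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)\n"
    "2013-06-26 07:22:58.117668 7fefa16a6700 -1 osd.1 189205 heartbeat_check: "
    "no reply from osd.141 ever on either front or back, first ping sent "
    "2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)\n"
)
print(silent_peers(sample))
```

Because it uses `finditer` over the whole text, it works whether the entries are one per line or run together.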
and also full of
2013-06-26 07:22:56.410590 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:56.411721 7fef71943700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0x8e43c80 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.114967 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.116399 7fef71943700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0x8e43c80 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.117316 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
which, as expected, corresponds to:
[pid 8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid 8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid 8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid 8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
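The failing system call can be reproduced outside Ceph. The sketch below asks for a stream socket with an unspecified address family, which is what the trace shows the OSD doing; the blank peer address (:/0) in the messenger log suggests the target address was never filled in, though that is an inference from the logs rather than a confirmed diagnosis. `try_stream_socket` is a hypothetical helper name.

```python
import errno
import socket


def try_stream_socket(family):
    """Return None if a SOCK_STREAM socket can be created for `family`, else the errno."""
    try:
        sock = socket.socket(family, socket.SOCK_STREAM)
    except OSError as exc:
        return exc.errno
    sock.close()
    return None


if __name__ == "__main__":
    for name in ("AF_INET", "AF_UNSPEC"):
        err = try_stream_socket(getattr(socket, name))
        # Print the symbolic errno name (e.g. EAFNOSUPPORT) when creation fails.
        print(name, "ok" if err is None else errno.errorcode.get(err, err))
```

On Linux the AF_UNSPEC case is expected to fail with the same EAFNOSUPPORT error seen in the trace, while AF_INET succeeds.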
I guess I could just bite the bullet and upgrade all OSDs to get rid of the problem, but in the meantime I've stopped upgrading and left the cluster partially upgraded (24/144 OSDs) to help track this down further.
I got a debug-osd 30/debug-ms 10 log from telling such an OSD, as well as one covering a clean boot/wait for settle/shutdown sequence. Both are on cephdrop now as heartbeat-v0.65-debugosd30ms10.log.bz2 (863K) and heartbeat-v0.65-debugosd30ms10-boot.log.bz2 (17M).