Bug #5460

closed

v0.61.3 -> v0.65 upgrade: new OSDs mark old as down

Added by Faidon Liambotis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I tried upgrading my v0.61.3 cluster to v0.65 today. All of the new (v0.65) OSDs are marking all of the old (v0.61.3) OSDs as down, but not each other. I currently have 24 of 144 OSDs running v0.65. I had to set mon osd min reporters to 26; without it, the 144-24=120 old OSDs constantly flap between being marked down by the new ones and marking themselves up again.
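The reasoning behind the value 26 can be sketched as follows. This is a simplified model and not Ceph's actual monitor code: it assumes an OSD is only marked down once at least a minimum number of distinct OSDs have reported it, and the function name, the OSD ids, and the baseline threshold of 1 are hypothetical.

```python
def can_mark_down(reporters, min_reporters):
    """Simplified model: a target OSD is marked down only when at least
    `min_reporters` distinct OSDs have reported it as failed."""
    return len(set(reporters)) >= min_reporters


# Hypothetical ids for the 24 OSDs already running v0.65.
upgraded_ids = list(range(24))

# Assumed low baseline threshold: the 24 new OSDs are enough to mark a peer down.
marked_with_low_threshold = can_mark_down(upgraded_ids, 1)

# With the threshold at 26, the 24 new OSDs can never reach it on their own.
marked_with_26 = can_mark_down(upgraded_ids, 26)

print(marked_with_low_threshold, marked_with_26)
```

Any threshold above 24 would have the same effect in this model; 26 simply leaves a small margin.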

The logs are full of:

2013-06-26 07:22:58.117660 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.140 ever on either front or back, first ping sent 2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117668 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.141 ever on either front or back, first ping sent 2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117674 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.142 ever on either front or back, first ping sent 2013-06-26 07:16:48.411919 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117679 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.143 ever on either front or back, first ping sent 2013-06-26 07:16:57.720749 (cutoff 2013-06-26 07:22:38.117061)
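To see which peers a given OSD is complaining about, lines like these can be tallied with a short script. This is a minimal sketch: the regular expression is inferred only from the sample lines above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines; not an official log format definition.
HEARTBEAT_RE = re.compile(
    r"osd\.(?P<reporter>\d+) \d+ heartbeat_check: no reply from osd\.(?P<peer>\d+)"
)


def count_unreachable_peers(lines):
    """Return a Counter mapping peer OSD id -> number of 'no reply' lines."""
    peers = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            peers[int(match.group("peer"))] += 1
    return peers


sample = [
    "2013-06-26 07:22:58.117660 7fefa16a6700 -1 osd.1 189205 heartbeat_check: "
    "no reply from osd.140 ever on either front or back, first ping sent "
    "2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)",
    "2013-06-26 07:22:58.117668 7fefa16a6700 -1 osd.1 189205 heartbeat_check: "
    "no reply from osd.141 ever on either front or back, first ping sent "
    "2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)",
]

print(sorted(count_unreachable_peers(sample)))
```

Comparing the resulting peer ids against the list of not-yet-upgraded OSDs would show whether only old OSDs are affected.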

and also full of:

2013-06-26 07:22:56.410590 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:56.411721 7fef71943700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0x8e43c80 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.114967 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.116399 7fef71943700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0x8e43c80 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.117316 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
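In these lines the peer address after ">>" is ":/0", that is, it has no host or port. A small sketch for flagging such lines is shown below. It assumes the layout "-- local >> peer pipe(" seen in the samples above and that the part after the last "/" is not part of the host; the function name is hypothetical.

```python
import re

# Layout inferred from the sample lines: "-- <local addr> >> <peer addr> pipe(".
PIPE_RE = re.compile(r"-- (?P<local>\S+) >> (?P<peer>\S+) pipe\(")


def peer_is_blank(line):
    """Return True if the peer address has no host/port, False if it has one,
    and None if the line does not look like a messenger pipe line."""
    match = PIPE_RE.search(line)
    if match is None:
        return None
    # Drop the trailing "/<number>" suffix, then any bare ":" separator.
    host_port = match.group("peer").rsplit("/", 1)[0]
    return host_port.strip(":") == ""


sample = (
    "2013-06-26 07:22:56.410590 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 "
    "pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket "
    "Address family not supported by protocol"
)

print(peer_is_blank(sample))
```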

which, as expected, corresponds to these failing socket() calls (strace output):

[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
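The same failure can be reproduced outside Ceph by asking the kernel for a stream socket with an unspecified address family. This is a minimal sketch, assuming Linux (where PF_UNSPEC and AF_UNSPEC have the same value); the helper name is hypothetical.

```python
import errno
import socket


def socket_errno(family):
    """Try to create a TCP (SOCK_STREAM) socket for `family`.

    Returns 0 on success, otherwise the errno reported by the OS.
    """
    try:
        sock = socket.socket(family, socket.SOCK_STREAM)
    except OSError as exc:
        return exc.errno
    sock.close()
    return 0


result = socket_errno(socket.AF_UNSPEC)
# On Linux this is expected to be EAFNOSUPPORT, matching the calls above.
print(errno.errorcode.get(result, result))
```

This suggests the OSD is trying to connect to an address whose family was never filled in, which would be consistent with the blank ":/0" peer in the messenger log.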

I guess I could just bite the bullet and upgrade all OSDs to get rid of the problem, but for the time being I've stopped upgrading and left the cluster partially upgraded (24/144 OSDs) to help track this down further.

I captured a debug-osd 30/debug-ms 10 log from one such OSD by raising the debug levels at runtime (via tell), as well as one covering a clean boot, wait-for-settle, and shutdown sequence. Both are on cephdrop now as heartbeat-v0.65-debugosd30ms10.log.bz2 (863K) and heartbeat-v0.65-debugosd30ms10-boot.log.bz2 (17M).
