Bug #5460

v0.61.3 -> v0.65 upgrade: new OSDs mark old as down

Added by Faidon Liambotis over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I tried upgrading my v0.61.3 cluster to v0.65 today. All of the new (v0.65) OSDs are marking all of the old (v0.61.3) ones as down but not the new ones. I currently have 24 OSDs running v0.65 (out of 144). I had to set mon osd min reporters to 26, as without it 144-26=118 OSDs constantly flap between being marked as down by the new ones and marking themselves up.

The logs are full of:

2013-06-26 07:22:58.117660 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.140 ever on either front or back, first ping sent 2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117668 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.141 ever on either front or back, first ping sent 2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117674 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.142 ever on either front or back, first ping sent 2013-06-26 07:16:48.411919 (cutoff 2013-06-26 07:22:38.117061)
2013-06-26 07:22:58.117679 7fefa16a6700 -1 osd.1 189205 heartbeat_check: no reply from osd.143 ever on either front or back, first ping sent 2013-06-26 07:16:57.720749 (cutoff 2013-06-26 07:22:38.117061)
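To see at a glance which peers a given OSD considers silent, the heartbeat_check lines can be tallied with a short script. This is only a sketch: the regular expression is inferred from the log lines quoted above, and the function name `silent_peers` is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the quoted log lines, e.g.
# "... osd.1 189205 heartbeat_check: no reply from osd.140 ever on either front or back, ..."
HEARTBEAT_RE = re.compile(
    r"osd\.(?P<reporter>\d+) \d+ heartbeat_check: no reply from osd\.(?P<peer>\d+)"
)


def silent_peers(lines):
    """Map each reporting OSD id to the sorted peer ids it never heard back from."""
    peers = defaultdict(set)
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            peers[int(match.group("reporter"))].add(int(match.group("peer")))
    return {reporter: sorted(ids) for reporter, ids in peers.items()}


# Two of the lines quoted in this report.
sample = [
    "2013-06-26 07:22:58.117660 7fefa16a6700 -1 osd.1 189205 heartbeat_check: "
    "no reply from osd.140 ever on either front or back, first ping sent "
    "2013-06-26 07:11:52.256656 (cutoff 2013-06-26 07:22:38.117061)",
    "2013-06-26 07:22:58.117674 7fefa16a6700 -1 osd.1 189205 heartbeat_check: "
    "no reply from osd.142 ever on either front or back, first ping sent "
    "2013-06-26 07:16:48.411919 (cutoff 2013-06-26 07:22:38.117061)",
]

print(silent_peers(sample))
```

Feeding it a whole OSD log (one line per list element) should give one entry per reporting OSD.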

and also full of

2013-06-26 07:22:56.410590 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:56.411721 7fef71943700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0x8e43c80 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.114967 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.116399 7fef71943700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0x8e43c80 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol
2013-06-26 07:22:58.117316 7fef80cb2700 -1 -- 10.64.0.173:0/20396 >> :/0 pipe(0xa498500 sd=-1 :0 s=1 pgs=0 cs=0 l=1).connect couldn't created socket Address family not supported by protocol

which, as expected, corresponds to:

[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
[pid  8047] socket(PF_UNSPEC, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)
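The failing syscall can be reproduced outside Ceph. The sketch below asks for a stream socket with no address family, which is what a connection attempt to a blank address appears to boil down to here. On Linux the call is expected to fail with EAFNOSUPPORT, matching the strace output above; other platforms may behave differently. The helper name `try_unspec_socket` is illustrative only.

```python
import errno
import socket


def try_unspec_socket():
    """Try to create a stream socket with AF_UNSPEC; return the errno, or None on success."""
    try:
        sock = socket.socket(socket.AF_UNSPEC, socket.SOCK_STREAM)
    except OSError as exc:
        return exc.errno
    sock.close()
    return None


err = try_unspec_socket()
if err is None:
    print("socket created (unexpected on Linux)")
else:
    # Print the numeric errno with its symbolic name and message.
    print(err, errno.errorcode.get(err), "-", socket_error := None or __import__("os").strerror(err))
```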

I guess I could just bite the bullet and upgrade all OSDs to get rid of that problem, but for the meantime I've stopped upgrading and left it partially upgraded (24/144 OSDs) to help track this down further.

I captured a debug-osd 30/debug-ms 10 log by injecting those settings into one such OSD with tell, as well as one covering a clean boot/wait for settle/shutdown sequence. Both are on cephdrop now as heartbeat-v0.65-debugosd30ms10.log.bz2 (863K) and heartbeat-v0.65-debugosd30ms10-boot.log.bz2 (17M).

Associated revisions

Revision fe663317 (diff)
Added by David Zafman over 10 years ago

Handle non-existent front interface in maps from older MONs

Fix OSDService::get_con_osd_hb() to not try to get_connection() without front interface
Fix OSD::handle_osd_map() to check for missing front interface

Fixes: #5460

Signed-off-by: David Zafman <>
Reviewed-by: Sage Weil <>

Revision e8b42a69 (diff)
Added by Sage Weil over 10 years ago

osd/OSDMap: handle case where some new osds have hb_front and others don't

Do not assume that because at least one OSD has an hb_front addr that they
all do, or else we will end up assigning garbage here and later thinking
it is an addr (or, more precisely, != entity_addr_t()).

Fixes: #5460
Signed-off-by: Sage Weil <>
Reviewed-by: David Zafman <>

History

#1 Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to Immediate

#2 Updated by David Zafman over 10 years ago

  • Status changed from New to Fix Under Review

Proposed fix in wip-5460.

In older releases the osdmap did not specify a front-side interface. A couple of places were missing a check for an unspecified front-side interface.
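The kind of guard described here can be modelled roughly as follows. This is a simplified Python illustration, not the actual C++ patch: `OsdAddrs`, `BLANK_ADDR`, `heartbeat_targets` and the addresses used are hypothetical stand-ins, with the empty string playing the role of a default-constructed address.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an unset (default-constructed) address.
BLANK_ADDR = ""


@dataclass
class OsdAddrs:
    """Heartbeat addresses for one OSD as recorded in a (hypothetical) map entry."""
    hb_back: str
    hb_front: str = BLANK_ADDR  # entries from older releases carry no front address


def heartbeat_targets(addrs: OsdAddrs) -> List[str]:
    """Return the addresses to ping, skipping the front interface when it is unset."""
    targets = [addrs.hb_back]
    if addrs.hb_front != BLANK_ADDR:
        # Only open a front connection when the map actually provides an address.
        targets.append(addrs.hb_front)
    return targets


old_osd = OsdAddrs(hb_back="192.0.2.10:6801")
new_osd = OsdAddrs(hb_back="192.0.2.11:6801", hb_front="198.51.100.11:6802")
print(heartbeat_targets(old_osd))
print(heartbeat_targets(new_osd))
```

Without the check, the old-style entry would yield a connection attempt to a blank address.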

#3 Updated by Sage Weil over 10 years ago

Note the workaround here is to upgrade and restart Mons first.

#4 Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Resolved

#5 Updated by Faidon Liambotis over 10 years ago

I did have mons upgraded and restarted first, sorry for not mentioning this earlier. I'll try the workaround nevertheless and get back to you.

#6 Updated by Sage Weil over 10 years ago

Faidon Liambotis wrote:

I did have mons upgraded and restarted first, sorry for not mentioning this earlier. I'll try the workaround nevertheless and get back to you.

Sorry, you're right. The problem is there even if the mons are upgraded first!

#7 Updated by Faidon Liambotis over 10 years ago

I've just tried

2013-06-29 13:55:32.101348 7f1ea7aff780  0 ceph version 0.65-188-g946a838 (946a838cffa0927d1237489e8c2c143e87d66892), process ceph-osd, pid 22833

(which includes the fe663317 fix)

And I'm getting the exact same errors:

2013-06-29 13:55:42.454428 7f1e82f57700 -1 -- 10.64.0.173:0/22833 >> :/0 pipe(0x8513a00 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x8515420).connect couldn't created socket Address family not supported by protocol
2013-06-29 13:55:42.455970 7f1e82f57700 -1 -- 10.64.0.173:0/22833 >> :/0 pipe(0x8513a00 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x8515420).connect couldn't created socket Address family not supported by protocol
2013-06-29 13:55:42.657581 7f1e82f57700 -1 -- 10.64.0.173:0/22833 >> :/0 pipe(0x8513a00 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x8515420).connect couldn't created socket Address family not supported by protocol

2013-06-29 13:56:06.192099 7f1ea1353700 -1 osd.0 189287 heartbeat_check: no reply from osd.29 ever on either front or back, first ping sent 2013-06-29 13:55:45.219553 (cutoff 2013-06-29 13:55:46.192094)
2013-06-29 13:56:06.192114 7f1ea1353700 -1 osd.0 189287 heartbeat_check: no reply from osd.31 ever on either front or back, first ping sent 2013-06-29 13:55:45.219553 (cutoff 2013-06-29 13:55:46.192094)
2013-06-29 13:56:06.192125 7f1ea1353700 -1 osd.0 189287 heartbeat_check: no reply from osd.51 ever on either front or back, first ping sent 2013-06-29 13:55:45.219553 (cutoff 2013-06-29 13:55:46.192094)
2013-06-29 13:56:06.192139 7f1ea1353700 -1 osd.0 189287 heartbeat_check: no reply from osd.64 ever on either front or back, first ping sent 2013-06-29 13:55:45.219553 (cutoff 2013-06-29 13:55:46.192094)
2013-06-29 13:56:06.192150 7f1ea1353700 -1 osd.0 189287 heartbeat_check: no reply from osd.103 ever on either front or back, first ping sent 2013-06-29 13:55:45.219553 (cutoff 2013-06-29 13:55:46.192094)

It doesn't look like this is fixed.

#8 Updated by David Zafman over 10 years ago

I updated a cluster from 0.61 to the same 0.65 SHA1 version.

$ ceph --version
ceph version 0.65-188-g946a838 (946a838cffa0927d1237489e8c2c143e87d66892)

Please verify that each daemon is running what you think it is. I had trouble using initctl to restart my daemons, so I had to restart them manually.

ubuntu@mira119:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok version
{"version":"0.61"}
ubuntu@mira119:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok version
{"version":"0.61"}
ubuntu@mira119:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version
{"version":"0.65-188-g946a838"}
ubuntu@mira119:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok version
{"version":"0.65-188-g946a838"}

ubuntu@mira120:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok version
{"version":"0.61"}
ubuntu@mira120:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok version
{"version":"0.61"}
ubuntu@mira120:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok version
{"version":"0.61"}
ubuntu@mira120:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.1.asok version
{"version":"0.65-188-g946a838"}
ubuntu@mira120:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok version
{"version":"0.65-188-g946a838"}
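With many daemons, it may be easier to collect the admin-socket replies and group them by version. The sketch below only parses the JSON replies shown above; how the replies are gathered (for example by running the same command per socket) is left out, and the function name `group_by_version` is an illustrative choice.

```python
import json


def group_by_version(replies):
    """Group daemon names by the version string in their admin-socket JSON reply."""
    groups = {}
    for daemon, raw in replies.items():
        version = json.loads(raw)["version"]
        groups.setdefault(version, []).append(daemon)
    return {version: sorted(daemons) for version, daemons in groups.items()}


# Replies copied from the mira119 output above.
replies = {
    "osd.2": '{"version":"0.61"}',
    "osd.1": '{"version":"0.61"}',
    "osd.0": '{"version":"0.65-188-g946a838"}',
    "mon.0": '{"version":"0.65-188-g946a838"}',
}

for version, daemons in group_by_version(replies).items():
    print(version, daemons)
```

Any daemon that lands in an unexpected group is one that was not restarted onto the new binary.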

#9 Updated by Faidon Liambotis over 10 years ago

I rechecked and can confirm that all the old ones run 0.61.3. I've since stopped the new 0.65+ ones, but note that I pasted the first line from the log of the OSD booting up, which mentions the 0.65-188-g946a838 version plus its pid and a timestamp, verifying that the errors shown are from that version.

#10 Updated by Sage Weil over 10 years ago

  • Status changed from Resolved to Need More Info

Can you set 'debug ms = 20' and 'debug osd = 20', reproduce the problem, and attach the logs from the marked-down osd and from one of the osds that marks it down?

Thanks!

#11 Updated by Sage Weil over 10 years ago

  • Assignee set to Sage Weil
  • Priority changed from Immediate to Urgent

#12 Updated by Faidon Liambotis over 10 years ago

I started osd.0 with 0.65-188-g946a838 and --debug-ms 20 --debug-osd 20. I got the same errors, although this time it was complaining about just two (random) OSDs, osd.29 & osd.31; no idea why. At that point, I injected debug-ms 20/debug-osd 20 into osd.29 and picked up logs from there too.

These are 5460-debugosd20ms20-ceph-osd.0.log.bz2 (5.7G uncompressed!) and 5460-debugosd20ms20-ceph-osd.29.log.bz2, respectively.

#13 Updated by Sage Weil over 10 years ago

pushed a patch to paravoid-test branch; can you give it a try? the best theory i have is that at one point an osd started up with the new code and then went back to the old. or, it appeared that way to the monitors for some odd reason. either way, the debug output is less ambiguous so i will know more even if it fails again.

thanks!

#14 Updated by Sage Weil over 10 years ago

  • Status changed from Need More Info to Resolved
