Support #36115

After Mimic upgrade, OSDs are stuck at booting.

Added by morphin over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

After upgrading from Luminous to Mimic, every OSD I try to start gets
stuck at "booting". (I edited the hostnames, so ignore any that do not
match.)

OSD log: https://paste.ubuntu.com/p/hFhc2dkSqb/
OSD logs with debug=20:
OSD8: https://www.dropbox.com/s/5e01f5odtsq3iqi/ceph-osd.8.log?dl=0
OSD156: https://www.dropbox.com/s/ox7or2uizyiwdo7/ceph-osd.156.log?dl=0

MON log: https://paste.ubuntu.com/p/F85mYwvP4C/
MGR log: https://paste.ubuntu.com/p/jYQ5kJstnH/
ceph.conf: https://paste.ubuntu.com/p/qDwjzdsmGK/
Telnet OSD to MON: https://paste.ubuntu.com/p/fbn9hTWv8q/

I upgraded the system in this order:

1- Stop MDS -> OSDs -> MGR -> MON -> servers.
2- Upgrade the OS image from 4.14.30-1-lts to 4.14.70-1-lts (Ceph, kernel, etc.).
3- Reboot the servers and restore backups.
4- Start the mons; check was OK.
5- Start the mgrs; check was OK.
6- Check versions: https://paste.ubuntu.com/p/bxqF9wgDMn/
7- Start the OSDs. All the OSDs are stuck at "booting":
https://paste.ubuntu.com/p/NY6SP2MBmd/
8- I did not start the MDS.
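
The version check in step 6 can also be scripted. This is a minimal sketch, assuming the JSON printed by `ceph versions` maps each daemon type to a dictionary of version string to daemon count. `mixed_versions` is a hypothetical helper name, and the sample version strings are placeholders, not output from this cluster.

```python
import json


def mixed_versions(versions_json: str) -> dict:
    """Return the daemon types that report more than one Ceph version.

    Assumes the layout of `ceph versions` output: a JSON object keyed by
    daemon type ("mon", "mgr", "osd", ...), each value mapping a version
    string to the number of daemons running it.
    """
    report = json.loads(versions_json)
    return {
        daemon_type: counts
        for daemon_type, counts in report.items()
        # "overall" aggregates every daemon type, so it is skipped here.
        if daemon_type != "overall" and len(counts) > 1
    }


# Illustrative input only; the version strings are placeholders.
SAMPLE = json.dumps({
    "mon": {"ceph version 13.x mimic": 3},
    "mgr": {"ceph version 13.x mimic": 2},
    "osd": {"ceph version 12.x luminous": 4, "ceph version 13.x mimic": 150},
    "overall": {"ceph version 12.x luminous": 4, "ceph version 13.x mimic": 155},
})

if __name__ == "__main__":
    print(mixed_versions(SAMPLE))
```

An empty result would suggest every running daemon of each type reports the same version. Daemons that are not running do not appear in the report at all.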

The above procedure was tested on my test servers. I upgraded 3 test
servers in this order, and when I started the OSDs they came up
quickly without problems; cluster health was OK. In my PROD cluster,
however, the OSDs start but stay stuck in the "booting" state.
The only differences in PROD are the network and the number of OSDs.

I need a way to debug the OSDs, because they give no clue about what
I should do.
As you can see, my mons and mgr are working properly, but the OSDs
are not. I think this is because they somehow cannot talk to the mons.
I tried marking all the OSDs "down" and restarting them all, but
nothing changed. I checked network communication between the OSDs and
the mons, and it seems fine. I'm using 10G LACP with jumbo frames for
the cluster network and 10G LACP for the public network, and it was
working very well before the upgrade.
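
One way to see what each OSD thinks its own state is, sketched under assumptions: `ceph daemon osd.<id> status` is run on the OSD's host and returns JSON with `whoami`, `state`, and `newest_map` fields. The helper names and the sample values below are hypothetical and are not taken from the logs linked above.

```python
import json
import subprocess


def read_osd_status(osd_id: int) -> str:
    """Ask a local OSD for its status over the admin socket.

    Must run on the host where osd.<osd_id> is running.
    """
    result = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize(status_json: str) -> dict:
    """Pick out the fields that show whether an OSD finished booting."""
    status = json.loads(status_json)
    return {
        "osd": status.get("whoami"),
        "state": status.get("state", "unknown"),
        "newest_map": status.get("newest_map"),
    }


def not_active(summaries: list) -> list:
    """Return the ids of OSDs whose state is anything other than "active"."""
    return sorted(s["osd"] for s in summaries if s["state"] != "active")


# Illustrative input only; ids and epochs are placeholders.
SAMPLES = [
    '{"whoami": 0, "state": "booting", "oldest_map": 1, "newest_map": 10}',
    '{"whoami": 1, "state": "active", "oldest_map": 1, "newest_map": 12}',
]

if __name__ == "__main__":
    print(not_active([summarize(s) for s in SAMPLES]))
```

Comparing each OSD's `newest_map` with the current epoch reported by `ceph osd dump` may show whether a booting OSD is still catching up on OSD maps or is waiting on something else.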

I have checked everything I know. My last option is to downgrade, and
I don't know whether that will solve my problem.
My time is limited: I have a large amount of data in the data pool,
and it needs to be ready on Monday.

Please help me if you can.

History

#1 Updated by morphin over 5 years ago

iperf test between 2 nodes: https://paste.ubuntu.com/p/7rRYSSqtyh/

I don't think this is related to the network, firewall, etc., because it was working before the upgrade, and my cluster network does not even have a gateway. It is dedicated to the Ceph cluster network, with a jumbo-frame MTU.
I even tried rebooting my switch to clear ARP entries, etc.
I'm out of time. I will try to downgrade from backup.
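
For reference, a jumbo-frame path is often checked with a don't-fragment ping sized to fill one frame, as a complement to an iperf run. This is a minimal sketch, assuming IPv4, Linux iputils `ping`, and an MTU of 9000. The ticket does not state the MTU value, and the host name below is a placeholder.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(host: str, mtu: int = 9000, count: int = 3) -> list:
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    return [
        "ping", "-M", "do",
        "-c", str(count),
        "-s", str(max_icmp_payload(mtu)),
        host,
    ]


if __name__ == "__main__":
    # "osd-host-b" is a placeholder for a peer on the cluster network.
    print(" ".join(ping_command("osd-host-b")))
```

If such a ping fails between two cluster-network addresses while a 1472-byte payload succeeds, some hop on the path is probably not passing jumbo frames.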

#2 Updated by morphin over 5 years ago

My main kernel is Linux 4.14.70-1-lts. I also tried 4.18.8-arch1-1-ARCH; nothing changed.
I'm sure this problem is related to Ceph and Mimic.
After the downgrade I will share the results.

#3 Updated by Patrick Donnelly over 5 years ago

  • Tracker changed from Bug to Support
  • Project changed from Ceph to RADOS
  • Target version deleted (v13.2.2)
