Support #36115
After Mimic upgrade, OSDs are stuck at booting.
Description
After upgrading from Luminous to Mimic, when I try to start an OSD it
gets stuck at "booting". (I edited the hostnames, so ignore it if they
are not identical.)
OSD log: https://paste.ubuntu.com/p/hFhc2dkSqb/
OSD logs with debug=20:
OSD8: https://www.dropbox.com/s/5e01f5odtsq3iqi/ceph-osd.8.log?dl=0
OSD156: https://www.dropbox.com/s/ox7or2uizyiwdo7/ceph-osd.156.log?dl=0
MON log: https://paste.ubuntu.com/p/F85mYwvP4C/
MGR log: https://paste.ubuntu.com/p/jYQ5kJstnH/
ceph.conf: https://paste.ubuntu.com/p/qDwjzdsmGK/
Telnet OSD to MON: https://paste.ubuntu.com/p/fbn9hTWv8q/
I upgraded the system in this order:
1- Stop MDS -> OSDs -> MGR -> MON -> servers.
2- Upgrade OS image 4.14.30-1-lts --> 4.14.70-1-lts ("Ceph, kernel etc.").
3- Reboot server and restore backups.
4- Start mons, check was ok.
5- Start mgrs, check was ok.
6- Check versions; https://paste.ubuntu.com/p/bxqF9wgDMn/
7- Start OSDs. All the OSDs are stuck at "booting":
https://paste.ubuntu.com/p/NY6SP2MBmd/
8- I did not start MDS.
The above procedure was tested on my test servers: I upgraded 3 test
servers in this order, and when I started the OSDs they came up
quickly without problems, and cluster health was OK. However, after the
upgrade on my PROD cluster, the OSDs do start but stay stuck in the
booting state. The only differences in PROD are the network and the
number of OSDs.
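The version check from step 6 can also be done programmatically. The following is a minimal sketch, assuming the JSON printed by `ceph versions` (one object per daemon type, mapping version string to daemon count) has been captured as text. The sample data and version strings are made up for illustration and are not taken from this cluster.

```python
import json


def mixed_versions(versions_json):
    """Return the daemon types that report more than one Ceph version."""
    report = json.loads(versions_json)
    return {
        daemon: sorted(counts)
        for daemon, counts in report.items()
        # "overall" aggregates every daemon type, so skip it here.
        if daemon != "overall" and len(counts) > 1
    }


# Hypothetical sample shaped like `ceph versions` output.
sample = json.dumps({
    "mon": {"ceph version 13.2.2 mimic (stable)": 3},
    "mgr": {"ceph version 13.2.2 mimic (stable)": 2},
    "osd": {
        "ceph version 12.2.8 luminous (stable)": 4,
        "ceph version 13.2.2 mimic (stable)": 160,
    },
})

print(mixed_versions(sample))
```

An empty result would mean every daemon type reports a single version; any entry in the result points at daemons that were missed during the upgrade.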
I need a way to debug the OSDs, because they give no clue about what
I should do!
As you can see, my mons and mgr are working properly, but the OSDs are
not. I think this is because they somehow can't talk to the MONs.
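One possible starting point for debugging is to compare what a booting OSD reports with what the monitors report. The sketch below assumes two JSON documents have already been collected: the output of `ceph daemon osd.<id> status` (run on the OSD host through the admin socket) and the output of `ceph osd dump --format json`. The field names used (`state`, `whoami`, `newest_map`, `epoch`, `flags`) are the ones those commands are understood to emit; the values are hypothetical, and the checks are only illustrative hints, not a diagnosis of this cluster.

```python
import json


def diagnose_boot(osd_status, osd_dump):
    """Return human-readable hints about why an OSD may still be booting."""
    findings = []

    # Cluster-wide flags arrive as one comma-separated string.
    flags = [f for f in osd_dump.get("flags", "").split(",") if f]
    if "noup" in flags:
        findings.append("noup flag is set, so OSDs cannot be marked up")

    if osd_status.get("state") == "booting":
        # Compare the newest osdmap the OSD holds with the monitors' epoch.
        behind = osd_dump["epoch"] - osd_status["newest_map"]
        if behind > 0:
            findings.append(
                "osd.%d is %d osdmap epochs behind the monitors"
                % (osd_status["whoami"], behind)
            )
        else:
            findings.append(
                "osd.%d has the current osdmap but has not been marked up"
                % osd_status["whoami"]
            )
    return findings


# Hypothetical values, for illustration only.
status = json.loads(
    '{"whoami": 8, "state": "booting", "oldest_map": 1000,'
    ' "newest_map": 1200, "num_pgs": 0}'
)
dump = json.loads('{"epoch": 1250, "flags": "noout,sortbitwise"}')

for line in diagnose_boot(status, dump):
    print(line)
```

Running the status command several times and watching whether `newest_map` advances would show whether the OSD is still catching up on maps or is waiting on the monitors.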
I tried marking all the OSDs "down" and restarting all of them, but
nothing changed. I checked network communication between the OSDs and
the mons and it seems fine. I'm using 10G LACP with jumbo frames for
the cluster network and 10G LACP for the public network, and it was
working very well before the upgrade.
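Because the cluster network uses jumbo frames, one extra check is whether full-size packets pass end to end without fragmentation. This is a minimal sketch, assuming an MTU of 9000 (the ticket only says "jumbo frame"), IPv4 without options, and the Linux iputils `ping`, where `-M do` forbids fragmentation and `-s` sets the ICMP payload size. The host name is a placeholder.

```python
def icmp_payload_for_mtu(mtu, ip_header=20, icmp_header=8):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - ip_header - icmp_header


def ping_df_command(host, mtu=9000, count=3):
    """Build a ping command that sends full-MTU packets with fragmentation forbidden."""
    return [
        "ping", "-M", "do",
        "-c", str(count),
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


# "osd-host-2" is a hypothetical peer on the cluster network.
print(" ".join(ping_df_command("osd-host-2")))
```

If this ping fails while a ping with a small payload succeeds, some hop on the path is not passing jumbo frames, which a throughput test would not necessarily reveal.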
I have checked everything I know. My last option is to downgrade, and
I don't know whether that will solve my problem.
My time is limited. I have a large amount of data in the data pool, and
it needs to be ready on Monday.
Please help me if you can.
History
#1 Updated by morphin over 5 years ago
iperf test between 2 nodes: https://paste.ubuntu.com/p/7rRYSSqtyh/
I don't think this is related to the network, firewall, etc., because it was working before the upgrade, and my cluster network does not even have a gateway. It is dedicated to the Ceph cluster network, with a jumbo-frame MTU.
I even tried rebooting my switch to clear ARP entries, etc.
I'm out of time. I will try to downgrade from backup.
#2 Updated by morphin over 5 years ago
My main kernel is Linux 4.14.70-1-lts. I also tried 4.18.8-arch1-1-ARCH. Nothing changed.
I'm sure this problem is related to Ceph and Mimic.
After the downgrade I will share the results.
#3 Updated by Patrick Donnelly over 5 years ago
- Tracker changed from Bug to Support
- Project changed from Ceph to RADOS
- Target version deleted (v13.2.2)