Support #36115

open

After Mimic upgrade, OSDs are stuck at "booting".

Added by morphin over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

After upgrading from Luminous to Mimic, every OSD I try to start gets
stuck at "booting". (I edited the hostnames in the logs, so ignore any
that do not match.)

OSD log: https://paste.ubuntu.com/p/hFhc2dkSqb/
OSD logs with debug=20:
OSD8: https://www.dropbox.com/s/5e01f5odtsq3iqi/ceph-osd.8.log?dl=0
OSD156: https://www.dropbox.com/s/ox7or2uizyiwdo7/ceph-osd.156.log?dl=0

MON log: https://paste.ubuntu.com/p/F85mYwvP4C/
MGR log: https://paste.ubuntu.com/p/jYQ5kJstnH/
ceph.conf: https://paste.ubuntu.com/p/qDwjzdsmGK/
Telnet OSD to MON: https://paste.ubuntu.com/p/fbn9hTWv8q/

I upgraded the system in this order:

1- Stop MDS -> OSDs -> MGR -> MON -> servers.
2- Upgrade the OS image from 4.14.30-1-lts to 4.14.70-1-lts (Ceph, kernel, etc.).
3- Reboot the server and restore backups.
4- Start mons; check was OK.
5- Start mgrs; check was OK.
6- Check versions: https://paste.ubuntu.com/p/bxqF9wgDMn/
7- Start OSDs. All the OSDs are stuck at "booting":
https://paste.ubuntu.com/p/NY6SP2MBmd/
8- I did not start MDS.
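
For step 7, one way to see what a stuck OSD itself reports is to read the JSON printed by `ceph daemon osd.<id> status` on the OSD host and compare its `newest_map` with the current OSD map epoch (for example the `epoch` line at the top of `ceph osd dump`). The following is only a minimal sketch of that comparison. The function name, the sample JSON, and the epoch numbers are hypothetical and are not taken from my cluster.

```python
import json


def summarize_osd_status(status_json, cluster_epoch):
    """Summarize the JSON text printed by `ceph daemon osd.<id> status`.

    cluster_epoch is the current OSD map epoch as seen by the monitors.
    """
    status = json.loads(status_json)
    newest = status["newest_map"]
    return {
        "osd": status["whoami"],
        "state": status["state"],
        "newest_map": newest,
        # How many OSD map epochs this OSD still has to catch up on.
        "epochs_behind": max(cluster_epoch - newest, 0),
    }


# Hypothetical sample values, for illustration only.
sample = (
    '{"whoami": 8, "state": "booting", '
    '"oldest_map": 1200, "newest_map": 1350, "num_pgs": 0}'
)
summary = summarize_osd_status(sample, cluster_epoch=1400)
print(summary)
```

If `epochs_behind` keeps shrinking between runs, the OSD is probably still catching up on maps. If it stays at the same value, the OSD is likely not receiving anything from the mons.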

I tested the above procedure on my test servers first. I upgraded 3
test servers in this order, and when I started the OSDs they came up
quickly without problems. The test cluster health was OK. However, in my
PROD cluster the OSD processes do start after the upgrade, but they stay
stuck in the "booting" state. The only differences in PROD are the
network and the number of OSDs.

I need a way to debug the OSDs, because they do not give any clue about
what I should do.
As you can see, my mons and mgr are working properly, but the OSDs are not.
I think this is because they somehow can't talk to the mons.
I tried marking all the OSDs "down" and restarting all of them, but
nothing changed. I checked network communication between the OSDs and
the mons and it seems fine. I'm using 10G LACP with jumbo frames for the
cluster network and 10G LACP for the public network, and it was working
very well before the upgrade.
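
Since the cluster network uses jumbo frames, one check that may be worth repeating is a ping with fragmentation disabled and a payload sized to fill the MTU exactly. This is a rough sketch that only builds the command. It assumes an MTU of 9000 (I did not state the actual value above), IPv4, and the Linux iputils `ping`. The host name `osd-host-1` is a placeholder.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(host, mtu):
    """Build a ping command that fails if the path cannot carry `mtu` bytes."""
    return [
        "ping",
        "-M", "do",  # set Don't Fragment, so oversized packets are rejected
        "-c", "3",
        "-s", str(max_ping_payload(mtu)),
        host,
    ]


# Assumed jumbo-frame MTU and placeholder host name.
cmd = ping_command("osd-host-1", 9000)
print(" ".join(cmd))
```

If this ping fails between two OSD hosts while a default-size ping works, some link or switch port on the path is not passing jumbo frames.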

I have checked everything I know of. My last option is to downgrade, and
I don't know whether that would solve the problem.
My time is limited. I have a large amount of data in the data pool, and
it needs to be available on Monday.

Please help me if you can.
