Support #48630 (closed)

non-LVM OSDs do not start after upgrade from 15.2.4 -> 15.2.7

Added by ronnie laptop over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: cephadm
Target version:
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

During the upgrade from 15.2.4 to 15.2.7 (docker_hub image), some of our OSDs do not start after their systemd unit.run file is replaced by the upgrade.
The new unit.run script roughly does the following (sketched below):

  • start a docker container for the OSD, assuming an LVM block device
  • if step 1 fails, fall back to starting a docker container for the other device type
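
For illustration only, that fallback pattern amounts to something like the sketch below; this is not the literal file cephadm generates, and IMAGE, OSD_ID and FSID are placeholders:

    #!/bin/sh
    # Illustrative sketch of the fallback logic described above, not the
    # literal unit.run that cephadm writes; IMAGE, OSD_ID, FSID are placeholders.
    if ! docker run --rm --privileged --net=host \
            -v /var/lib/ceph:/var/lib/ceph -v /dev:/dev \
            "$IMAGE" ceph-volume lvm activate --no-systemd "$OSD_ID" "$FSID"
    then
        # fall back to the non-LVM (two-partition, ceph-disk style) activation
        docker run --rm --privileged --net=host \
            -v /var/lib/ceph:/var/lib/ceph -v /dev:/dev \
            "$IMAGE" ceph-volume simple activate "$OSD_ID" "$FSID"
    fi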

Some of our OSDs are not LVM-based, but use the older two-partition layout (#1 XFS, #2 BlueStore).

We can work around this bug by commenting out the first docker run statement in the unit.run file of each affected OSD, but this is holding up the upgrade considerably and requires a lot of manual edits.
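
For reference, the per-OSD workaround boils down to roughly the following; the fsid and OSD id are placeholders and the paths follow the usual cephadm layout:

    # placeholders: replace with the real cluster fsid and OSD id
    fsid="<cluster-fsid>"
    osd_id="<osd-id>"
    unit_run="/var/lib/ceph/${fsid}/osd.${osd_id}/unit.run"

    # keep a backup, then comment out the first "docker run" line by hand
    cp "$unit_run" "${unit_run}.bak"
    "${EDITOR:-vi}" "$unit_run"

    # restart the OSD container through its cephadm-managed systemd unit
    systemctl restart "ceph-${fsid}@osd.${osd_id}.service"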

Could you confirm that there is testing in place for the older block device types, and fix this issue in future releases?

Actions #1

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
Actions #2

Updated by Sebastian Wagner about 3 years ago

I think you probably want to migrate to ceph-volume for now.
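
For a single OSD, that redeployment would roughly look like the following sketch (illustrative only; all values are placeholders, and in a cephadm setup the ceph-volume calls would go through cephadm ceph-volume or cephadm shell):

    # Illustrative only: rebuild one non-LVM OSD as an LVM-backed ceph-volume
    # OSD, reusing its id; all values below are placeholders. This assumes the
    # data has already been drained off, or that recovery from replicas is acceptable.
    fsid="<cluster-fsid>"
    osd_id="<osd-id>"
    dev="/dev/sdX"

    systemctl stop "ceph-${fsid}@osd.${osd_id}.service"   # cephadm unit name
    ceph osd destroy "$osd_id" --yes-i-really-mean-it
    ceph-volume lvm zap "$dev" --destroy
    ceph-volume lvm create --osd-id "$osd_id" --data "$dev"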

Actions #3

Updated by Sebastian Wagner about 3 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Resolved
Actions #4

Updated by ronnie laptop almost 3 years ago

Sebastian Wagner wrote:

I think you probably want to migrate to ceph-volume for now.

Hi Sebastian,

Thanks for the response, but this raises some questions:
- if we should not use the old filesystem layout anymore, should that be called out in the release notes? I guess other people have the same issue. Should the OSDs then not start at all, perhaps with an additional parameter or message to make end users aware?
- is there a good procedure for migrating the OSDs? With ~400 OSDs (12/14 TB each) and roughly half of them on the old volume layout, what is a good approach for migrating? I can think of two ways (rough sketch of the first one below):
-- mark each OSD down, leave it for some weeks to drain/rebalance, then zap it, reuse it, and rebalance all the data back
-- the rough way: zap the disk (as if it had failed), reuse it, and pray that CRUSH works correctly and rebalances all the data back
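
For the first approach, per OSD I would expect something roughly like the following (using ceph osd out to trigger the draining; the id is a placeholder):

    # rough sketch of the first approach for a single OSD
    osd_id="<osd-id>"                 # placeholder
    ceph osd out "$osd_id"            # start draining the OSD
    # wait until the data has moved off and the OSD can safely be removed
    while ! ceph osd safe-to-destroy "$osd_id"; do sleep 600; done
    # then stop, destroy, zap and re-create it as an LVM OSD, as sketched in #2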

Any advice is more than welcome, as these scenarios are not clearly documented!
