Bug #21023

BlueStore-OSDs marked as destroyed in OSD-map after v12.1.1 to v12.1.4 upgrade

Added by Martin Millnert about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSDMap
Target version:
Start date:
08/17/2017
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

(I intended to contact the mailing list, but DreamHost rejected my posting; hence a bug report instead.)

I have a small single-host, 5-OSD Luminous v12.1.1 cluster that I am in the process of upgrading in place from FileStore to BlueStore.

I just upgraded to v12.1.4, and now all my BlueStore OSDs are marked as 'destroyed' in the osdmap.

What I did for the upgrade was:

apt upgrade
systemctl restart ceph-mon@<my-host-name>.service
systemctl restart ceph-mgr@<my-host-name>.service
systemctl restart ceph-osd.target

And following this, the BlueStore OSDs got upset (osd.1-4 are BlueStore, osd.5 is FileStore):

# ceph osd tree
ID CLASS WEIGHT   TYPE NAME                STATUS    REWEIGHT PRI-AFF
-2       0        host <my-host-name>
-1       36.50000 root default
-3       36.50000     host <my-host>
 1   hdd 7.29999         osd.1             destroyed  1.00000 1.00000
 2   hdd 7.29999         osd.2             destroyed  1.00000 1.00000
 3   hdd 7.29999         osd.3             destroyed  1.00000 1.00000
 4   hdd 7.29999         osd.4             destroyed  1.00000 1.00000
 5   hdd 7.29999         osd.5             up         1.00000 1.00000

The in-place upgrade I'm doing uses the "ceph osd destroy" procedure to
replace OSDs in place, and it has been working for all intents and
purposes as far as I've been able to see. Thus all the OSDs marked destroyed above have had
'osd destroy' issued against them... but following that they've come back up, too...

Now I'd like to know:
1) how I can fix this, and
2) what I did wrong to cause it?

I saw no docs on the order in which to restart the daemons, and I fully acknowledge that I may have done it in the wrong order (mon -> mgr -> osd). Still, it seems a bit risky that one could end up in the above state.

History

#1 Updated by Martin Millnert about 2 years ago

I dumped my past 30 osdmaps and went through them, and I note that the OSDs I have converted do indeed keep their 'destroyed' flag even after I re-create the new ones. In v12.1.1 I was still able to launch them, however.

My upgrade procedure is essentially:

systemctl stop ceph-osd@${id}
ceph osd destroy osd.${id} --yes-i-really-mean-it
umount /var/lib/ceph/osd/${cluster}-${id}
# ... then several steps to create the BlueStore devices ...
# <trimmed for simplicity>
ceph-osd --setuser ceph -i ${id} --mkkey --mkfs
ceph auth add osd.${id} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/${cluster}-${id}/keyring
systemctl start ceph-osd@${id}

Comparing my notes with https://github.com/MartinEmrich/kb/blob/master/ceph/Manual-Bluestore.md , I see I'm purposefully not performing Martin Emrich's step "ceph --setuser ceph osd crush add 0 7.3000 host=`hostname`", since I have -- I thought -- not removed my OSDs.

It seems clear, however, that I'm missing a step to re-enable an OSD after having used the 'osd destroy' command -- and that there was a bug(?) in v12.1.1 whereby the destroyed flag was ignored when booting up.

#2 Updated by Martin Millnert about 2 years ago

I have now figured out that there is a new 'ceph osd new' command that it appears I should run in my in-place upgrade process (its syntax is described inconsistently between 'man ceph' and the "ceph osd --help" output -- the latter seems more correct).

When I try to use it however:

root@davinci:~# ceph -i ceph-new-secrets.json osd new d706a467-bb1b-40ce-a07d-fd8fdc0ea427 osd.1 
Error EINVAL: entity osd.1 exists but caps do not match

... it doesn't work. The manual says nothing about how to describe caps in the secrets.json file, nor about the 'osd new' command in general. (Updates sure to follow.)

#3 Updated by Martin Millnert about 2 years ago

In the OSDMonitor.cc source file I find yet another description of how the command should work, and since this is as close to the 'source' as possible, I trust it over the man page and the 'ceph' command output so far.

On lines https://github.com/ceph/ceph/blob/v12.1.4/src/mon/OSDMonitor.cc#L6740-L6741 it is stated that supplying the secrets file is optional.
When I do supply the secrets file, it appears that I'm blocked by a fail on https://github.com/ceph/ceph/blob/v12.1.4/src/mon/AuthMonitor.cc#L733-L736 .
My osd.1 auth entry has caps that I've added, but the man page does not describe how to supply caps in the secrets.json file.
The code (OSDMonitor.cc) doesn't seem to expect any.

I replaced the osd.1 auth entry with one without any caps and still get the "caps do not match" error from the "ceph osd new" command. If I remove the osd.1 auth entry completely, I get "secrets do not match" instead.

It seems to me that all I really need to do is wipe the CEPH_OSD_DESTROYED bit from the relevant osdmap entries?
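To illustrate what "wiping" that bit amounts to at the logical level (the flag values below are hypothetical stand-ins -- the real CEPH_OSD_DESTROYED constant lives in the Ceph source tree, and the osdmap is an encoded binary structure rather than a plain integer per OSD, so this is only a sketch of the operation, not how one would actually edit an osdmap):

```shell
# Hypothetical flag values, for illustration only; the real constants
# are defined in the Ceph source.
CEPH_OSD_EXISTS=$((1 << 0))
CEPH_OSD_DESTROYED=$((1 << 5))

# An OSD in state "destroyed,exists", as shown in the osd dump above:
state=$((CEPH_OSD_EXISTS | CEPH_OSD_DESTROYED))

# Clearing the destroyed bit is just masking it off:
state=$((state & ~CEPH_OSD_DESTROYED))

echo "${state}"   # only the "exists" bit remains set
```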

#4 Updated by Martin Millnert about 2 years ago

Update:

'ceph osd new' does return 0 if I simply skip giving it an auth JSON file. But it doesn't clear the 'destroyed' bit from the osdmap.

I've written a simple patch to osdmaptool to clear the bit ( https://github.com/ceph/ceph/compare/master...Millnert:osdmaptool_undestroy?expand=1 ).

Now, with some IRC help from gregs42, I'm going to attempt to replace the osdmaps using ceph-monstore-tool (and ceph-objectstore-tool if necessary) -- or actually, preferably, just append the n+1 epoch I got from osdmaptool's output onto the mon datastore, relaunch it, and have it propagate to the OSDs (hopefully the OSDs fetch the latest osdmap before refusing to boot due to the destroyed bit being set).

Update pending.

#5 Updated by Josh Durgin about 2 years ago

  • Assignee set to Neha Ojha

#6 Updated by Neha Ojha about 2 years ago

  • Status changed from New to In Progress

#7 Updated by Martin Millnert about 2 years ago

Here's an update from the bug submitter.

1. The osdmaptool approach failed, since there is no longer any command to inject osdmaps into the mon (it was removed several years ago).
2. I've instead written a patch for tools/ceph_monstore_tool.cc which works for mon stores, but the OSD stores remain to be fixed.
3. I've again traced the issues I have with the 'osd new' command not doing what it should. My related findings are below:

A. The man page and the ceph command output disagree about option ordering: the man page says 'id' first and 'uuid' later, while "ceph osd help" says uuid first and id or osdname later.
B. When issuing the command as "ceph osd new -i ceph-new-secrets.json <uuid-of-osd.1> 1", with an 'osd.1' ceph auth entry present, I first received "caps do not match". After staring at/tracing the source code for a while, I found the reason: the caps an OSD needs to have for 'osd new' are hardcoded, defined at https://github.com/ceph/ceph/blob/v12.1.4/src/mon/AuthMonitor.cc#L878-L882
C. After having updated my caps to be identical to those hardcoded in the source, I am still returned here: https://github.com/ceph/ceph/blob/v12.1.4/src/mon/OSDMonitor.cc#L6916
But this is incorrect! The operation IS idempotent, as per https://github.com/ceph/ceph/blob/v12.1.4/src/mon/OSDMonitor.cc#L6773-L6779 , after having run 'osd destroy osd.1 --yes-i-really-mean-it'. At least on v12.1.4, it seems.
The OSD remains in the osdmap as per my osd dump:

root@hostname:~/sdfsdf# ceph osd dump
epoch 5965
fsid a2ff4e1f-54c0-476a-8891-45c305b1a2e9
created 2016-08-08 01:23:30.220664
modified 2017-08-17 09:32:42.061199
flags noout,noscrub,nodeep-scrub,sortbitwise,recovery_deletes
crush_version 3
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 4 'ec' erasure size 5 min_size 3 crush_rule 1 object_hash rjenkins pg_num 160 pgp_num 160 last_change 96 flags hashpspool stripe_width 4128
pool 5 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 160 pgp_num 160 last_change 121 flags hashpspool stripe_width 0
    removed_snaps [1~3]
pool 8 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 160 pgp_num 160 last_change 215 flags hashpspool stripe_width 0
pool 9 'cephfs_data' erasure size 5 min_size 4 crush_rule 1 object_hash rjenkins pg_num 160 pgp_num 160 last_change 216 flags hashpspool stripe_width 12288
max_osd 6
osd.1 down in  weight 1 up_from 5659 up_thru 5952 down_at 5960 last_clean_interval [4927,5643) <$IPADDR>:6805/6161 <$IPADDR>:6806/6161 <$IPADDR>:6807/6161 <$IPADDR>:6808/6161 destroyed,exists d706a467-bb1b-40ce-a07d-fd8fdc0ea427
osd.2 down in  weight 1 up_from 5648 up_thru 5958 down_at 5960 last_clean_interval [5005,5643) <$IPADDR>:6801/6108 <$IPADDR>:6802/6108 <$IPADDR>:6803/6108 <$IPADDR>:6804/6108 destroyed,exists e3031736-acb4-4a5e-a67f-f9b171167cbd
osd.3 down in  weight 1 up_from 5653 up_thru 5954 down_at 5960 last_clean_interval [5468,5643) <$IPADDR>:6809/6211 <$IPADDR>:6810/6211 <$IPADDR>:6811/6211 <$IPADDR>:6812/6211 destroyed,exists 4288d76d-2e14-4557-8d54-216319eb3581
osd.4 down in  weight 1 up_from 5947 up_thru 5950 down_at 5964 last_clean_interval [208,5643) <$IPADDR>:6813/20771 <$IPADDR>:6814/20771 <$IPADDR>:6815/20771 <$IPADDR>:6816/20771 destroyed,exists 1b6b013e-29ec-462b-bd4f-75030d0bfffc
osd.5 up   in  weight 1 up_from 5962 up_thru 5964 down_at 5960 last_clean_interval [5645,5959) <$IPADDR>:6817/8004 <$IPADDR>:6818/8004 <$IPADDR>:6819/8004 <$IPADDR>:6820/8004 exists,up c0d25d14-4f99-4554-b811-e8373d8b032b
root@hostname:~/sdfsdf# ceph osd destroy osd.1 --yes-i-really-mean-it
destroyed osd.1
root@hostname:~/sdfsdf# ceph osd dump
epoch 5965
fsid a2ff4e1f-54c0-476a-8891-45c305b1a2e9
created 2016-08-08 01:23:30.220664
modified 2017-08-17 09:32:42.061199
flags noout,noscrub,nodeep-scrub,sortbitwise,recovery_deletes
crush_version 3
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 4 'ec' erasure size 5 min_size 3 crush_rule 1 object_hash rjenkins pg_num 160 pgp_num 160 last_change 96 flags hashpspool stripe_width 4128
pool 5 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 160 pgp_num 160 last_change 121 flags hashpspool stripe_width 0
    removed_snaps [1~3]
pool 8 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 160 pgp_num 160 last_change 215 flags hashpspool stripe_width 0
pool 9 'cephfs_data' erasure size 5 min_size 4 crush_rule 1 object_hash rjenkins pg_num 160 pgp_num 160 last_change 216 flags hashpspool stripe_width 12288
max_osd 6
osd.1 down in  weight 1 up_from 5659 up_thru 5952 down_at 5960 last_clean_interval [4927,5643) <$IPADDR>:6805/6161 <$IPADDR>:6806/6161 <$IPADDR>:6807/6161 <$IPADDR>:6808/6161 destroyed,exists d706a467-bb1b-40ce-a07d-fd8fdc0ea427
osd.2 down in  weight 1 up_from 5648 up_thru 5958 down_at 5960 last_clean_interval [5005,5643) <$IPADDR>:6801/6108 <$IPADDR>:6802/6108 <$IPADDR>:6803/6108 <$IPADDR>:6804/6108 destroyed,exists e3031736-acb4-4a5e-a67f-f9b171167cbd
osd.3 down in  weight 1 up_from 5653 up_thru 5954 down_at 5960 last_clean_interval [5468,5643) <$IPADDR>:6809/6211 <$IPADDR>:6810/6211 <$IPADDR>:6811/6211 <$IPADDR>:6812/6211 destroyed,exists 4288d76d-2e14-4557-8d54-216319eb3581
osd.4 down in  weight 1 up_from 5947 up_thru 5950 down_at 5964 last_clean_interval [208,5643) <$IPADDR>:6813/20771 <$IPADDR>:6814/20771 <$IPADDR>:6815/20771 <$IPADDR>:6816/20771 destroyed,exists 1b6b013e-29ec-462b-bd4f-75030d0bfffc
osd.5 up   in  weight 1 up_from 5962 up_thru 5964 down_at 5960 last_clean_interval [5645,5959) <$IPADDR>:6817/8004 <$IPADDR>:6818/8004 <$IPADDR>:6819/8004 <$IPADDR>:6820/8004 exists,up c0d25d14-4f99-4554-b811-e8373d8b032b

D. Following the logic of OSDMonitor::prepare_command_osd_new -- https://github.com/ceph/ceph/blob/v12.1.4/src/mon/OSDMonitor.cc#L6714 -- it seems the following conditions have to be met to arrive at the section that issues the "recreate osd" code:
* is_recreate_destroyed = true (by the way, L6811 could use the bool rather than perform the check again)
* may_be_idempotent = false, which can only happen iff OSDMonitor::validate_osd_create returns >= 0 && !EEXIST. And THAT only happens if either the UUID isn't supplied (which it has to be) or the UUID doesn't match the OLD uuid. Which it has to.

4. Trying the command "ceph osd new -i ceph-new-secrets.json $(uuidgen -r) 1" ran through the code, and I received a new OSD "osd.1" with state "exists,new". Thus entering this into the in-place-replace-OSD procedure seems to work as expected! I now just need to supply the new OSD's UUID (/var/lib/ceph/osd/$cluster-$id/fsid) -- a simple command misunderstanding in the end. The documentation (man page/command help) didn't clarify which UUID is supplied to 'osd new', but the patch above makes that a fair bit clearer.
5. I still need to complete the patch for tools/ceph_objectstore_tool.cc to fix my cluster (clear the destroyed flags on the previous OSDs, which now refuse to boot). Be right back on that.
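Putting the pieces together, the corrected in-place replacement procedure would look roughly as follows. This is a sketch assembled from the commands quoted in this thread, not a verified recipe: the device-preparation steps remain trimmed as in comment #1, and ceph-new-secrets.json is assumed to carry the hardcoded caps noted in finding B.

```shell
id=1            # OSD id being replaced (example value)
cluster=ceph    # cluster name (example value)

systemctl stop ceph-osd@${id}
ceph osd destroy osd.${id} --yes-i-really-mean-it
umount /var/lib/ceph/osd/${cluster}-${id}

# ... create the new BlueStore devices (trimmed, as in comment #1) ...
ceph-osd --setuser ceph -i ${id} --mkkey --mkfs

# The step that was missing: register the NEW OSD's fsid (written by mkfs)
# with 'osd new', which also clears the destroyed state for this id.
ceph osd new -i ceph-new-secrets.json \
    "$(cat /var/lib/ceph/osd/${cluster}-${id}/fsid)" "${id}"

systemctl start ceph-osd@${id}
```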

#8 Updated by Neha Ojha about 2 years ago

https://github.com/ceph/ceph/pull/17326 should address 3A.

As for the OSD remaining in the osdmap after destroy, that seems to be the intended behavior of 'osd destroy'. Quoting its man page documentation: "This command will not remove the OSD from crush, nor will it remove the OSD from the OSD map. Instead, once the command successfully completes, the OSD will show marked as destroyed."

#9 Updated by Martin Millnert about 2 years ago

Right, it was never a problem that the OSD remained in the osdmap. The issue is that 12.1.x (x < 4) didn't care about the 'destroyed' flag when booting OSDs, while 12.1.4 does, which left the majority of my OSDs in an unbootable state after I had done in-place upgrades on them while skipping the 'osd new' step.

I have now fixed this using a combination of a patched ceph-monstore-tool and a patched osdmaptool (it came in handy at last!), together with ceph-objectstore-tool, to extract/change/inject fixed osdmaps with the destroyed bit cleared.

My patches for local edits:

- osdmaptool: https://github.com/ceph/ceph/compare/master...Millnert:osdmaptool_undestroy
- ceph_monstore_tool: https://github.com/ceph/ceph/compare/master...Millnert:ceph_monstore_tool-clear_destroyed_flag

It's not perfect code, of course, but it allowed me to unfsck the cluster.

#10 Updated by Neha Ojha about 2 years ago

  • Status changed from In Progress to Resolved

Good to know that the problem has been solved.

Marking this issue resolved for now. Feel free to reopen it if the problem persists.
