Bug #8614

OSD keyring shifted

Added by jimmy lu almost 10 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Env:
1 host = mon, mds, admin for ceph-deploy
4 hosts = osd server

I was troubleshooting a down OSD on one of my OSD servers and had to reboot the host to fix the down OSD/drive. When the host came back from the reboot, the OSDs' keyrings had shifted by an increment of one. The keys no longer match the output of <ceph auth export>, so all the OSDs on this host are down, unable to authenticate with the mon server.

[jlu@gfsnode1 ~]$ sudo service ceph start
Password: === osd.34 ===
2014-06-16 14:00:12.630262 7f36171a9700 0 librados: osd.34 authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.34 --keyring=/var/lib/ceph/osd/ceph-34/keyring osd crush create-or-move -- 34 2.73 host=gfsnode1 root=default'
[jlu@gfsnode1 ~]$
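(For anyone hitting the same error: a quick way to check whether a local keyring matches what the monitor has registered, run from a node with an admin keyring. This is a minimal sketch, not output from this cluster; osd.34 is the failing id from the transcript above.)

# What the monitor thinks osd.34's key is
ceph auth get-key osd.34; echo

# What the OSD will actually present from its local keyring
sudo grep 'key = ' /var/lib/ceph/osd/ceph-34/keyring

# If the two base64 strings differ, the OSD fails to authenticate with
# "(1) Operation not permitted", exactly as in the service start above.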


[jlu@gfsnode1 ceph-34]$ cd ../ceph-33
[jlu@gfsnode1 ceph-33]$ pwd
/var/lib/ceph/osd/ceph-33
[jlu@gfsnode1 ceph-33]$ cat whoami
34
[jlu@gfsnode1 ceph-33]$ cd ../ceph-34
[jlu@gfsnode1 ceph-34]$ pwd
/var/lib/ceph/osd/ceph-34
[jlu@gfsnode1 ceph-34]$ cat whoami
35
[jlu@gfsnode1 ceph-34]$
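To see how far the shift goes, the id in each OSD directory's whoami file can be compared against the id implied by the directory name. A minimal sketch, assuming the standard /var/lib/ceph/osd/ceph-<id> layout:

for d in /var/lib/ceph/osd/ceph-*; do
    dir_id=${d##*-}                # id implied by the directory name
    who=$(sudo cat "$d/whoami")    # id the data inside actually belongs to
    [ "$dir_id" != "$who" ] && echo "$d: directory says $dir_id, whoami says $who"
done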

I also compared the keys with <ceph auth export> from the mon node; the [osd_id] headers and keys do not match up.

From mon node:
osd.33
key: AQDdPXVTUP0OAxAAptykWQeOWrSwg+DIMwRCwA==
caps: [mon] allow profile osd
caps: [osd] allow *

From osd node:
[jlu@gfsnode1 ceph-33]$ pwd
/var/lib/ceph/osd/ceph-33
[jlu@gfsnode1 ceph-33]$ cat keyring
[jlu@gfsnode1 ceph-33]$ sudo !!
sudo cat keyring
[osd.34]
key = AQAwPnVT6G7fBRAA86D4FuxN0U8uKXk0brPbCQ==
[jlu@gfsnode1 ceph-33]$

----

From mon node:
osd.34
key: AQAwPnVT6G7fBRAA86D4FuxN0U8uKXk0brPbCQ==
caps: [mon] allow profile osd
caps: [osd] allow *

From osd node:
[jlu@gfsnode1 ceph-34]$ pwd
/var/lib/ceph/osd/ceph-34
[jlu@gfsnode1 ceph-34]$ cat keyring
[jlu@gfsnode1 ceph-34]$ sudo !!
sudo cat keyring
[osd.35]
key = AQBbPnVTmG4BLxAA6UV6XHbZepXUEXB6VJQzEA==
[jlu@gfsnode1 ceph-34]$

All 11 OSDs on this host shifted by an increment of one. Very odd. (Note that the key stored under ceph-33 is exactly the key the mon holds for osd.34, so the keys themselves look intact; each data directory simply contains the next id's data.)

Thanks,
Jimmy

History

#1 Updated by jimmy lu almost 10 years ago

My cluster is down right now; please help.

2014-06-23 11:42:25.117964 7fc0d4cf3700 1 mon.gfsnode5@0(leader).paxos(paxos active c 4679143..4679735) is_readable now=2014-06-23 11:42:25.117965 lease_expire=0.000000 has v0 lc 4679735
2014-06-23 11:42:25.234989 7fc0d56f4700 0 log [INF] : pgmap v2470770: 2100 pgs: 1 peering, 2 active+clean+scrubbing+deep, 2081 active+clean, 1 remapped+peering, 12 active+remapped+backfill_toofull, 1 down+peering, 2 active+clean+scrubbing; 27581 GB data, 56285 GB used, 35871 GB / 92156 GB avail; 45306/17202069 objects degraded (0.263%)
2014-06-23 11:42:25.237365 7fc0d4cf3700 1 mon.gfsnode5@0(leader).paxos(paxos active c 4679143..4679737) is_readable now=2014-06-23 11:42:25.237369 lease_expire=0.000000 has v0 lc 4679737
2014-06-23 11:42:25.237581 7fc0d4cf3700 0 mon.gfsnode5@0(leader) e1 handle_command mon_command({"prefix": "status"} v 0) v1
2014-06-23 11:42:25.588230 7fc0d4cf3700 1 mon.gfsnode5@0(leader).paxos(paxos active c 4679143..4679737) is_readable now=2014-06-23 11:42:25.588233 lease_expire=0.000000 has v0 lc 4679737
2014-06-23 11:42:25.626749 7fc0d4cf3700 1 mon.gfsnode5@0(leader).paxos(paxos active c 4679143..4679737) is_readable now=2014-06-23 11:42:25.626751 lease_expire=0.000000 has v0 lc 4679737
2014-06-23 11:42:25.644452 7fc0d4cf3700 1 mon.gfsnode5@0(leader).paxos(paxos active c 4679143..4679737) is_readable now=2014-06-23 11:42:25.644454 lease_expire=0.000000 has v0 lc 4679737
2014-06-23 11:42:25.877449 7fc0d4cf3700 1 mon.gfsnode5@0(leader).paxos(paxos active c 4679143..4679737) is_readable now=2014-06-23 11:42:25.877451 lease_expire=0.000000 has v0 lc 4679737
2014-06-23 11:42:26.026013 7fc0d56f4700 0 mon.gfsnode5@0(leader).data_health(1) update_stats avail 1% total 4128448 used 3851096 avail 67640
2014-06-23 11:42:26.033697 7fc0d56f4700 1 mon.gfsnode5@0(leader).data_health(1) reached critical levels of available space on local monitor storage - shutdown!
2014-06-23 11:42:26.033706 7fc0d56f4700 0 ** Shutdown via Data Health Service **
2014-06-23 11:42:26.033726 7fc0d38e1700 -1 mon.gfsnode5@0(leader) e1 *** Got Signal Interrupt ***
2014-06-23 11:42:26.033736 7fc0d38e1700 1 mon.gfsnode5@0(leader) e1 shutdown
2014-06-23 11:42:26.034087 7fc0d38e1700 0 quorum service shutdown
2014-06-23 11:42:26.034091 7fc0d38e1700 0 mon.gfsnode5@0(shutdown).health(1) HealthMonitor::service_shutdown 1 services
2014-06-23 11:42:26.034096 7fc0d38e1700 0 quorum service shutdown
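
The last lines above are the key part: the data_health service saw the monitor's data disk at roughly 1% available (67640 KB free of 4128448 KB) and shut the monitor down. A quick way to confirm, assuming the default /var/lib/ceph/mon data path and the leveldb store.db used by monitors of this vintage:

# Free space on the filesystem holding the monitor data
df -h /var/lib/ceph/mon

# Size of the monitor store itself
sudo du -sh /var/lib/ceph/mon/ceph-*/store.db

The shutdown threshold is controlled by the mon data avail crit option (5% by default, unless overridden).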


#2 Updated by Joao Eduardo Luis over 9 years ago

  • Status changed from New to Can't reproduce

The monitor shut down because you ran out of disk space on the monitor's disk.

You should free up space or move the monitor to a disk with more space. Then you should fix your keyrings, which is a separate issue. If you are able to reproduce this, please let us know. Until then I'm marking this as Can't Reproduce.
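
(For the record, a minimal sketch of those two steps. The compact command and the auth export are standard Ceph commands; the exact recovery depends on whether the keys or the mounts are wrong, and the whoami checks above suggest each data disk is mounted one directory off.)

# Reclaim monitor disk space by compacting the mon's leveldb store
ceph tell mon.gfsnode5 compact    # or free/move data under /var/lib/ceph/mon

# If a local keyring really held the wrong key, the monitor's
# authoritative copy could be re-exported over it, e.g. for osd.34:
sudo ceph auth get osd.34 -o /var/lib/ceph/osd/ceph-34/keyring

# But since whoami disagrees with the directory names here, the likelier
# fix is remounting each data disk at /var/lib/ceph/osd/ceph-<whoami>.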
