Bug #4282

auth: could not find secret_id=0

Added by Ivan Kudryavtsev about 11 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

root@ceph-osd-2-1:/var/log/ceph# ceph -v
ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)

On the OSD:

2013-02-27 09:03:28.537954 7f2da3a98700 0 bad crc in data 1492286351 != exp 4159396423
2013-02-27 09:03:28.538672 7f2da3a98700 0 -- 10.251.0.5:6807/20236 >> 10.252.0.10:0/4242981374 pipe(0xd4fd780 sd=28 :6807 s=0 pgs=0 cs=0 l=0).accept peer addr is really 10.252.0.10:0/4242981374 (socket is 10.252.0.10:38749/0)
2013-02-27 09:03:28.538865 7f2da3a98700 0 auth: could not find secret_id=0
2013-02-27 09:03:28.538880 7f2da3a98700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=0
2013-02-27 09:03:28.538887 7f2da3a98700 0 -- 10.251.0.5:6807/20236 >> 10.252.0.10:0/4242981374 pipe(0xd4fd780 sd=28 :6807 s=0 pgs=0 cs=0 l=1).accept: got bad authorizer
2013-02-27 09:03:28.539194 7f2da3a98700 0 auth: could not find secret_id=0
2013-02-27 09:03:28.539207 7f2da3a98700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=0
2013-02-27 09:03:28.539216 7f2da3a98700 0 -- 10.251.0.5:6807/20236 >> 10.252.0.10:0/4242981374 pipe(0xd4fd780 sd=28 :6807 s=0 pgs=0 cs=0 l=1).accept: got bad authorizer
2013-02-27 09:03:28.837500 7f2da3a98700 0 -- 10.251.0.5:6807/20236 >> 10.252.0.10:0/4242981374 pipe(0x849ec80 sd=28 :6807 s=0 pgs=0 cs=0 l=0).accept peer addr is really 10.252.0.10:0/4242981374 (socket is 10.252.0.10:38750/0)

On RBD-enabled clients:

[134113.838114] libceph: osd16 10.251.0.5:6807 socket error on read
[136019.537543] libceph: osd5 10.251.0.5:6808 socket closed
[136019.538720] libceph: osd5 10.251.0.5:6808 socket error on read
[137115.672726] libceph: osd5 10.251.0.5:6808 socket closed
[137115.674056] libceph: osd5 10.251.0.5:6808 socket error on read
[137613.570674] libceph: osd5 10.251.0.5:6808 socket closed
[137613.571922] libceph: osd5 10.251.0.5:6808 socket error on read

Actions #1

Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
  • Priority changed from Normal to High

Can you please do

ceph osd tell \* injectargs '--debug-monc 20 --debug-auth 20 --debug-ms 1'

and then see if you can reproduce this? We've sporadically seen that missing-secret issue but haven't been able to gather enough evidence to diagnose it.

Also, note that the mismatched checksum implies you have an unreliable network... is there anything fishy going on there?
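
For reference, once those debug levels are active, the relevant lines can be pulled straight out of the OSD log, and the NIC error counters give a quick sanity check of the network. A minimal sketch, assuming the default log location and an example osd id and interface name:

grep -E 'could not find secret_id|bad authorizer|bad crc' /var/log/ceph/ceph-osd.16.log
ip -s link show eth0    # check the RX/TX error and drop counters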

Actions #2

Updated by Sage Weil about 11 years ago

Got some partial logs on this from another user (zendesk 254). The osd has secret ids 1001-1003, and the expirations are valid. It looks like the kernel client is just sending a malformed authorizer with a zeroed secret_id, and/or the mon is feeding it bad info that induces that behavior.

Actions #3

Updated by Sage Weil about 11 years ago

My best guess right now is that either this is a kclient startup bug (we send an osd request before we have a valid ticket), though the log doesn't quite support that theory...

or, the mon is sending bad tickets when the client goes to renew. 'debug auth = 30' and 'debug ms = 1' on the mon would help narrow that down.
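
As a minimal sketch of what that could look like (assuming the settings go in the [mon] section of ceph.conf and the monitor is restarted, or that the equivalent is injected at runtime in the same way as the OSD injectargs command in comment #1):

[mon]
    debug auth = 30
    debug ms = 1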

Actions #4

Updated by Josh Durgin about 11 years ago

I hit this on a test cluster a little while ago, but wasn't able to reproduce once I turned on logging and restarted the osd. It happened to an osd, so I suspect it's the monitor sending bad tickets.

Actions #5

Updated by Sage Weil about 11 years ago

  • Priority changed from High to Urgent

a few problems:

0- some other issue is making us repeatedly fault and reconnect.
1- the authorizer isn't rebuilt when the service keys rotate, only when we get an authentication failure.
2- the rebuilt authorizer is okay, but then when we fault, we think it's because the authorizer is (still) bad, and call invalidate_authorizer
...

Actions #6

Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to In Progress
Actions #7

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to Fix Under Review
Actions #8

Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #9

Updated by Steffen Weißgerber almost 9 years ago

Hi,

I got the same issue with a 6-node ceph cluster (giant 0.87.2) when booting servers from it via rbd.
The servers are Gentoo installations with a 3.18.11 kernel and an initramfs created with dracut.

The server boots with an IP address set via the kernel boot ip option. Within the initramfs the root
partition is successfully mapped with rbd map $rbd_name --pool $rbd_pool --id kvm.

The server then runs perfectly for 2 hours before logging:

May 29 09:46:56 al44 kernel: libceph: osd12 2.1.1.92:6813 socket error on read
May 29 09:46:56 al44 kernel: libceph: osd10 2.1.1.140:6800 socket error on read
May 29 09:46:56 al44 kernel: libceph: osd16 2.1.1.93:6812 socket error on read
May 29 09:47:40 al44 kernel: libceph: osd13 2.1.1.92:6809 socket error on read
May 29 09:47:40 al44 kernel: libceph: osd20 2.1.1.95:6813 socket error on read
May 29 09:47:40 al44 kernel: libceph: osd19 2.1.1.93:6800 socket error on read
May 29 09:47:40 al44 kernel: libceph: osd3 2.1.1.138:6804 socket error on read

The OSDs are logging things like:

2015-05-29 09:57:56.999740 7fba74a22700 10 cephx: verify_authorizer decrypted service osd secret_id=7549
2015-05-29 09:57:56.999750 7fba74a22700 0 auth: could not find secret_id=7549
2015-05-29 09:57:56.999752 7fba74a22700 10 auth: dump_rotating:
2015-05-29 09:57:56.999754 7fba74a22700 10 auth: id 7550 AQAl/WdVMBqhABAATuA0Z15XyaIXPTXLl9KXRw== expires 2015-05-29 09:46:13.010551
2015-05-29 09:57:56.999766 7fba74a22700 10 auth: id 7551 AQA0C2hV0FwjHhAATnhAslYWR7gOVU8otRaVJw== expires 2015-05-29 10:46:13.010551
2015-05-29 09:57:56.999774 7fba74a22700 10 auth: id 7552 AQBHGWhViL7IDhAA+ecCsxXj3vSiek+5qx7b9A== expires 2015-05-29 11:46:15.248024
2015-05-29 09:57:56.999781 7fba74a22700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7549
2015-05-29 09:57:56.999784 7fba74a22700 0 -- 2.1.1.139:6805/8062 >> 2.1.1.252:0/3530212140 pipe(0x2f027440 sd=352 :6805 s=0 pgs=0 cs=0 l=1 c=0x22ed9080).accept: got bad authorizer
2015-05-29 09:57:56.999998 7fba74a22700 10 cephx: verify_authorizer decrypted service osd secret_id=7549
2015-05-29 09:57:57.000008 7fba74a22700 0 auth: could not find secret_id=7549

It looks like the ceph client requests the connection with an old, invalid key.

This goes on for an hour, and then the server reconnects successfully:

2015-05-29 11:53:18.256425 7fbaa4c46700 10 cephx: verify_authorizer decrypted service osd secret_id=7553
2015-05-29 11:53:18.256514 7fbaa4c46700 10 cephx: verify_authorizer global_id=1685042
2015-05-29 11:53:18.256578 7fbaa4c46700 10 cephx: verify_authorizer ok nonce 643c986966334873 reply_bl.length()=36
2015-05-29 11:53:18.256640 7fbaa4c46700 10 In get_auth_session_handler for protocol 2
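
The rotating keys in the dump above expire an hour apart, and the client recovers after roughly one rotation, which fits a client presenting a ticket that is one rotation behind. As a rough cross-check (the admin socket path and osd id below are assumptions based on the defaults), the service ticket lifetime on an OSD can be read from its admin socket:

ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok config show | grep auth_service_ticket_ttl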

This also happens regardless of whether rbd-disk-backed qemu-kvm (version 2.2.0) instances are running on that server. When VMs are running,
they stay up and remain accessible while the server is in trouble.

The behavior is reproducible.

If you like, I can provide download links for the full logs. The log level is more verbose only for a minute.

Regards

Steffen
