Bug #462: cephx: verify_authorizer_reply exception in decode_decrypt - Ceph - Ceph

Actions

Copy link

Bug #462

closed

cephx: verify_authorizer_reply exception in decode_decrypt

Added by Wido den Hollander over 13 years ago. Updated about 13 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Since I started using cephx on my cluster I started seeing these messages in my logfiles.

Now for example, I see this on osd1:

2010-10-05 06:36:17.746660 7f7611eb2710 cephx: verify_authorizer_reply exception in decode_decrypt with AQAtq6pMUCzODxAA9E68RSxgjWDmIkPkvVaHtw==
2010-10-05 06:36:17.746704 7f7611eb2710 -- [2001:16f8:10:2::c3c3:a24f]:6800/16813 >> [2001:16f8:10:2::c3c3:3f6c]:6800/12170 pipe(0x7e7a000 sd=29 pgs=0 cs=0 l=0).failed verifying authorize reply
2010-10-05 06:36:19.208015 7f7611eb2710 cephx: verify_authorizer_reply exception in decode_decrypt with AQAtq6pMUCzODxAA9E68RSxgjWDmIkPkvVaHtw==
2010-10-05 06:36:19.208057 7f7611eb2710 -- [2001:16f8:10:2::c3c3:a24f]:6800/16813 >> [2001:16f8:10:2::c3c3:3f6c]:6800/12170 pipe(0x7e7a000 sd=31 pgs=0 cs=0 l=0).failed verifying authorize reply

2001:16f8:10:2::c3c3:3f6c is node07 (osd6) where these messages are not showing up.

I tried to restart osd1 multiple times, but that wouldn't make these messages disappear.

On IRC I heard Sage saying that it is something with a rotating key which isn't working like it's supposed to.

Since I see this message coming by for the last few months, I thought it might be usefull to check this out sometime. Seems those two nodes don't want to communicate.

Actions

Copy link

Updated by Sage Weil over 13 years ago

Category deleted (~~OSD~~)

Actions

Copy link

Updated by Wido den Hollander over 13 years ago

I've just done a fresh mkcephfs on my cluster and then I started to see:

2010-11-01 10:08:56.594139   osd e27: 12 osds: 8 up, 12 in
2010-11-01 10:08:56.985240    pg v52: 3168 pgs: 319 active+clean, 1448 peering, 754 active+clean+degraded, 647 degraded+peering; 5 KB data, 28968 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:08:57.494674    pg v53: 3168 pgs: 319 active+clean, 1448 peering, 754 active+clean+degraded, 647 degraded+peering; 5 KB data, 28968 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:08:57.982481   log 2010-11-01 10:08:54.368842 osd8 [2001:16f8:10:2::c3c3:1b6c]:6800/10128 8 : [WRN] map e26 wrongly marked me down
2010-11-01 10:08:57.982481   log 2010-11-01 10:08:55.214929 osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/596 7 : [WRN] map e26 wrongly marked me down
2010-11-01 10:08:57.982481   log 2010-11-01 10:08:56.714909 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 218 : [INF] osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/596 boot
2010-11-01 10:08:57.982481   log 2010-11-01 10:08:56.714983 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 219 : [INF] osd8 [2001:16f8:10:2::c3c3:1b6c]:6800/10128 boot
2010-11-01 10:08:59.143030    pg v54: 3168 pgs: 319 active+clean, 1553 peering, 1117 active+clean+degraded, 179 degraded+peering; 5 KB data, 28928 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:02.128949    pg v55: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28952 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:03.594608    pg v56: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:07.983439   osd e28: 12 osds: 10 up, 12 in
2010-11-01 10:09:08.483092   log 2010-11-01 10:09:06.370674 osd3 [2001:16f8:10:2::c3c3:2e3a]:6800/15066 7 : [WRN] map e27 wrongly marked me down
2010-11-01 10:09:08.483092   log 2010-11-01 10:09:07.103896 osd5 [2001:16f8:10:2::c3c3:3b6a]:6800/2454 7 : [WRN] map e27 wrongly marked me down
2010-11-01 10:09:08.783447    pg v57: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:09.972082   log 2010-11-01 10:09:08.104738 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 220 : [INF] osd3 [2001:16f8:10:2::c3c3:2e3a]:6800/15066 boot
2010-11-01 10:09:09.972082   log 2010-11-01 10:09:08.104840 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 221 : [INF] osd5 [2001:16f8:10:2::c3c3:3b6a]:6800/2454 boot
2010-11-01 10:09:13.105902   osd e29: 12 osds: 11 up, 12 in
2010-11-01 10:09:13.450344    pg v58: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:13.916917   log 2010-11-01 10:09:10.514752 osd1 [2001:16f8:10:2::c3c3:a24f]:6800/5797 7 : [WRN] map e28 wrongly marked me down
2010-11-01 10:09:13.916917   log 2010-11-01 10:09:12.078952 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 222 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 failed (by osd9 [2001:16f8:10:2::c3c3:2e56]:6800/14195)
2010-11-01 10:09:13.916917   log 2010-11-01 10:09:12.079069 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 223 : [INF] osd11 [2001:16f8:10:2::c3c3:ab76]:6800/4602 failed (by osd9 [2001:16f8:10:2::c3c3:2e56]:6800/14195)
2010-11-01 10:09:13.916917   log 2010-11-01 10:09:12.401233 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 224 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982)
2010-11-01 10:09:13.916917   log 2010-11-01 10:09:12.401349 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 225 : [INF] osd11 [2001:16f8:10:2::c3c3:ab76]:6800/4602 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982)
2010-11-01 10:09:15.627973   log 2010-11-01 10:09:13.227514 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 226 : [INF] osd1 [2001:16f8:10:2::c3c3:a24f]:6800/5797 boot
2010-11-01 10:09:15.627973   log 2010-11-01 10:09:14.038831 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 227 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 failed (by osd6 [2001:16f8:10:2::c3c3:3f6c]:6800/3016)
2010-11-01 10:09:15.627973   log 2010-11-01 10:09:14.039170 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 228 : [INF] osd11 [2001:16f8:10:2::c3c3:ab76]:6800/4602 failed (by osd6 [2001:16f8:10:2::c3c3:3f6c]:6800/3016)
2010-11-01 10:09:15.939627   osd e30: 12 osds: 10 up, 12 in
2010-11-01 10:09:17.139636    pg v59: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:17.783780   log 2010-11-01 10:09:14.833381 osd2 [2001:16f8:10:2::c3c3:4a8c]:6800/25817 7 : [WRN] map e29 wrongly marked me down
2010-11-01 10:09:17.783780   log 2010-11-01 10:09:16.061290 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 229 : [INF] osd2 [2001:16f8:10:2::c3c3:4a8c]:6800/25817 boot
2010-11-01 10:09:18.117324   osd e31: 12 osds: 10 up, 12 in
2010-11-01 10:09:18.880067    pg v60: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%)
2010-11-01 10:09:19.640246   osd e32: 12 osds: 9 up, 12 in
2010-11-01 10:09:20.317160   log 2010-11-01 10:09:18.129748 osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 8 : [WRN] map e31 wrongly marked me down
2010-11-01 10:09:20.317160   log 2010-11-01 10:09:18.239740 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 230 : [INF] osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/596 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982)
2010-11-01 10:09:20.317160   log 2010-11-01 10:09:18.239848 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 231 : [INF] osd8 [2001:16f8:10:2::c3c3:1b6c]:6800/10128 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982)
2010-11-01 10:09:20.645262    pg v61: 3168 pgs: 319 active+clean, 1794 peering, 478 active+clean+degraded, 577 degraded+peering; 5 KB data, 29640 KB used, 1891 GB / 1891 GB avail; 3/10 degraded (30.000%)
2010-11-01 10:09:21.806161   log 2010-11-01 10:09:19.762009 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 232 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 boot

In the logs of osd0 for example, I found:

2010-11-01 10:09:50.375617 7f3c0694d710 cephx: verify_authorizer could not decrypt ticket info
2010-11-01 10:09:50.375623 7f3c0694d710 -- [2001:16f8:10:2::c3c3:8f6b]:6800/596 >> [2001:16f8:10:2::c3c3:2bfe]:6800/18982 pipe(0x1556780 sd=16 pgs=0 cs=0 l=0).accept bad authorizer
2010-11-01 10:09:50.376871 7f3c0694d710 7f3c0694d710 bad magic in decode_decrypt, 17276497083762528454 != 18374858748799134293
2010-11-01 10:09:50.376888 7f3c0694d710 cephx: verify_authorizer could not decrypt ticket info
2010-11-01 10:09:50.376894 7f3c0694d710 -- [2001:16f8:10:2::c3c3:8f6b]:6800/596 >> [2001:16f8:10:2::c3c3:2e56]:6800/14195 pipe(0x1578c80 sd=16 pgs=0 cs=0 l=0).accept bad authorizer
2010-11-01 10:09:50.379744 7f3c0694d710 7f3c0694d710 bad magic in decode_decrypt, 17276497083762528454 != 18374858748799134293
2010-11-01 10:09:50.379754 7f3c0694d710 cephx: verify_authorizer could not decrypt ticket info
2010-11-01 10:09:50.379760 7f3c0694d710 -- [2001:16f8:10:2::c3c3:8f6b]:6800/596 >> [2001:16f8:10:2::c3c3:2f6c]:6800/2780 pipe(0x1581000 sd=16 pgs=0 cs=0 l=0).accept bad authorizer

Has something gone wrong with the mkcephfs? Restarting the OSD doesn't make any difference.

Actions

Copy link

Updated by Yehuda Sadeh over 13 years ago

Status changed from New to In Progress

Shouldn't happen any more with 7d7af85c3a8b90ef93a152131a4b8fd66ceb6996. We always keep 3 ("rotating") keys for each service: the old key, the current key, and the new key. When the old key completely expires we shift the next two keys (the current becomes the old and the next becomes the current) and we create a new 'next' key.
The problem was that due to some bug the next key timeout was always the current key timeout + the ttl. This might work in a perfect world, however as we (almost) never retire keys exactly when they expires (usually a few seconds after that), then what actually happens is that we ended up having keys that time out much earlier than we expected. So this fix makes sure that the new key's ttl is not earlier than current time + 2 * ttl.
I'll wait a bit before closing it, making sure that everything is fixed.

Actions

Copy link