Bug #462
closedcephx: verify_authorizer_reply exception in decode_decrypt
0%
Description
Since I started using cephx on my cluster I started seeing these messages in my logfiles.
Now for example, I see this on osd1:
2010-10-05 06:36:17.746660 7f7611eb2710 cephx: verify_authorizer_reply exception in decode_decrypt with AQAtq6pMUCzODxAA9E68RSxgjWDmIkPkvVaHtw== 2010-10-05 06:36:17.746704 7f7611eb2710 -- [2001:16f8:10:2::c3c3:a24f]:6800/16813 >> [2001:16f8:10:2::c3c3:3f6c]:6800/12170 pipe(0x7e7a000 sd=29 pgs=0 cs=0 l=0).failed verifying authorize reply 2010-10-05 06:36:19.208015 7f7611eb2710 cephx: verify_authorizer_reply exception in decode_decrypt with AQAtq6pMUCzODxAA9E68RSxgjWDmIkPkvVaHtw== 2010-10-05 06:36:19.208057 7f7611eb2710 -- [2001:16f8:10:2::c3c3:a24f]:6800/16813 >> [2001:16f8:10:2::c3c3:3f6c]:6800/12170 pipe(0x7e7a000 sd=31 pgs=0 cs=0 l=0).failed verifying authorize reply
2001:16f8:10:2::c3c3:3f6c is node07 (osd6) where these messages are not showing up.
I tried to restart osd1 multiple times, but that wouldn't make these messages disappear.
On IRC I heard Sage saying that it is something with a rotating key which isn't working like it's supposed to.
Since I see this message coming by for the last few months, I thought it might be usefull to check this out sometime. Seems those two nodes don't want to communicate.
Updated by Wido den Hollander over 13 years ago
I've just done a fresh mkcephfs on my cluster and then I started to see:
2010-11-01 10:08:56.594139 osd e27: 12 osds: 8 up, 12 in 2010-11-01 10:08:56.985240 pg v52: 3168 pgs: 319 active+clean, 1448 peering, 754 active+clean+degraded, 647 degraded+peering; 5 KB data, 28968 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:08:57.494674 pg v53: 3168 pgs: 319 active+clean, 1448 peering, 754 active+clean+degraded, 647 degraded+peering; 5 KB data, 28968 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:08:57.982481 log 2010-11-01 10:08:54.368842 osd8 [2001:16f8:10:2::c3c3:1b6c]:6800/10128 8 : [WRN] map e26 wrongly marked me down 2010-11-01 10:08:57.982481 log 2010-11-01 10:08:55.214929 osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/596 7 : [WRN] map e26 wrongly marked me down 2010-11-01 10:08:57.982481 log 2010-11-01 10:08:56.714909 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 218 : [INF] osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/596 boot 2010-11-01 10:08:57.982481 log 2010-11-01 10:08:56.714983 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 219 : [INF] osd8 [2001:16f8:10:2::c3c3:1b6c]:6800/10128 boot 2010-11-01 10:08:59.143030 pg v54: 3168 pgs: 319 active+clean, 1553 peering, 1117 active+clean+degraded, 179 degraded+peering; 5 KB data, 28928 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:02.128949 pg v55: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28952 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:03.594608 pg v56: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:07.983439 osd e28: 12 osds: 10 up, 12 in 2010-11-01 10:09:08.483092 log 2010-11-01 10:09:06.370674 osd3 [2001:16f8:10:2::c3c3:2e3a]:6800/15066 7 : [WRN] map e27 wrongly marked me down 2010-11-01 10:09:08.483092 log 2010-11-01 10:09:07.103896 osd5 [2001:16f8:10:2::c3c3:3b6a]:6800/2454 7 : [WRN] map e27 wrongly marked me down 2010-11-01 10:09:08.783447 pg v57: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:09.972082 log 2010-11-01 10:09:08.104738 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 220 : [INF] osd3 [2001:16f8:10:2::c3c3:2e3a]:6800/15066 boot 2010-11-01 10:09:09.972082 log 2010-11-01 10:09:08.104840 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 221 : [INF] osd5 [2001:16f8:10:2::c3c3:3b6a]:6800/2454 boot 2010-11-01 10:09:13.105902 osd e29: 12 osds: 11 up, 12 in 2010-11-01 10:09:13.450344 pg v58: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:13.916917 log 2010-11-01 10:09:10.514752 osd1 [2001:16f8:10:2::c3c3:a24f]:6800/5797 7 : [WRN] map e28 wrongly marked me down 2010-11-01 10:09:13.916917 log 2010-11-01 10:09:12.078952 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 222 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 failed (by osd9 [2001:16f8:10:2::c3c3:2e56]:6800/14195) 2010-11-01 10:09:13.916917 log 2010-11-01 10:09:12.079069 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 223 : [INF] osd11 [2001:16f8:10:2::c3c3:ab76]:6800/4602 failed (by osd9 [2001:16f8:10:2::c3c3:2e56]:6800/14195) 2010-11-01 10:09:13.916917 log 2010-11-01 10:09:12.401233 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 224 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982) 2010-11-01 10:09:13.916917 log 2010-11-01 10:09:12.401349 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 225 : [INF] osd11 [2001:16f8:10:2::c3c3:ab76]:6800/4602 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982) 2010-11-01 10:09:15.627973 log 2010-11-01 10:09:13.227514 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 226 : [INF] osd1 [2001:16f8:10:2::c3c3:a24f]:6800/5797 boot 2010-11-01 10:09:15.627973 log 2010-11-01 10:09:14.038831 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 227 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 failed (by osd6 [2001:16f8:10:2::c3c3:3f6c]:6800/3016) 2010-11-01 10:09:15.627973 log 2010-11-01 10:09:14.039170 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 228 : [INF] osd11 [2001:16f8:10:2::c3c3:ab76]:6800/4602 failed (by osd6 [2001:16f8:10:2::c3c3:3f6c]:6800/3016) 2010-11-01 10:09:15.939627 osd e30: 12 osds: 10 up, 12 in 2010-11-01 10:09:17.139636 pg v59: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:17.783780 log 2010-11-01 10:09:14.833381 osd2 [2001:16f8:10:2::c3c3:4a8c]:6800/25817 7 : [WRN] map e29 wrongly marked me down 2010-11-01 10:09:17.783780 log 2010-11-01 10:09:16.061290 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 229 : [INF] osd2 [2001:16f8:10:2::c3c3:4a8c]:6800/25817 boot 2010-11-01 10:09:18.117324 osd e31: 12 osds: 10 up, 12 in 2010-11-01 10:09:18.880067 pg v60: 3168 pgs: 319 active+clean, 1593 peering, 1097 active+clean+degraded, 159 degraded+peering; 5 KB data, 28976 KB used, 1891 GB / 1891 GB avail; 4/10 degraded (40.000%) 2010-11-01 10:09:19.640246 osd e32: 12 osds: 9 up, 12 in 2010-11-01 10:09:20.317160 log 2010-11-01 10:09:18.129748 osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 8 : [WRN] map e31 wrongly marked me down 2010-11-01 10:09:20.317160 log 2010-11-01 10:09:18.239740 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 230 : [INF] osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/596 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982) 2010-11-01 10:09:20.317160 log 2010-11-01 10:09:18.239848 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 231 : [INF] osd8 [2001:16f8:10:2::c3c3:1b6c]:6800/10128 failed (by osd10 [2001:16f8:10:2::c3c3:2bfe]:6800/18982) 2010-11-01 10:09:20.645262 pg v61: 3168 pgs: 319 active+clean, 1794 peering, 478 active+clean+degraded, 577 degraded+peering; 5 KB data, 29640 KB used, 1891 GB / 1891 GB avail; 3/10 degraded (30.000%) 2010-11-01 10:09:21.806161 log 2010-11-01 10:09:19.762009 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 232 : [INF] osd4 [2001:16f8:10:2::c3c3:fa1b]:6800/21440 boot
In the logs of osd0 for example, I found:
2010-11-01 10:09:50.375617 7f3c0694d710 cephx: verify_authorizer could not decrypt ticket info 2010-11-01 10:09:50.375623 7f3c0694d710 -- [2001:16f8:10:2::c3c3:8f6b]:6800/596 >> [2001:16f8:10:2::c3c3:2bfe]:6800/18982 pipe(0x1556780 sd=16 pgs=0 cs=0 l=0).accept bad authorizer 2010-11-01 10:09:50.376871 7f3c0694d710 7f3c0694d710 bad magic in decode_decrypt, 17276497083762528454 != 18374858748799134293 2010-11-01 10:09:50.376888 7f3c0694d710 cephx: verify_authorizer could not decrypt ticket info 2010-11-01 10:09:50.376894 7f3c0694d710 -- [2001:16f8:10:2::c3c3:8f6b]:6800/596 >> [2001:16f8:10:2::c3c3:2e56]:6800/14195 pipe(0x1578c80 sd=16 pgs=0 cs=0 l=0).accept bad authorizer 2010-11-01 10:09:50.379744 7f3c0694d710 7f3c0694d710 bad magic in decode_decrypt, 17276497083762528454 != 18374858748799134293 2010-11-01 10:09:50.379754 7f3c0694d710 cephx: verify_authorizer could not decrypt ticket info 2010-11-01 10:09:50.379760 7f3c0694d710 -- [2001:16f8:10:2::c3c3:8f6b]:6800/596 >> [2001:16f8:10:2::c3c3:2f6c]:6800/2780 pipe(0x1581000 sd=16 pgs=0 cs=0 l=0).accept bad authorizer
Has something gone wrong with the mkcephfs? Restarting the OSD doesn't make any difference.
Updated by Yehuda Sadeh over 13 years ago
- Status changed from New to In Progress
Shouldn't happen any more with 7d7af85c3a8b90ef93a152131a4b8fd66ceb6996. We always keep 3 ("rotating") keys for each service: the old key, the current key, and the new key. When the old key completely expires we shift the next two keys (the current becomes the old and the next becomes the current) and we create a new 'next' key.
The problem was that due to some bug the next key timeout was always the current key timeout + the ttl. This might work in a perfect world, however as we (almost) never retire keys exactly when they expires (usually a few seconds after that), then what actually happens is that we ended up having keys that time out much earlier than we expected. So this fix makes sure that the new key's ttl is not earlier than current time + 2 * ttl.
I'll wait a bit before closing it, making sure that everything is fixed.
Updated by Sage Weil about 13 years ago
- Status changed from In Progress to Resolved