Bug #9744
Status: Closed
cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
Description
Shortly after upgrading from 0.80.5 to 0.80.6, the cluster became slow and then almost completely stopped,
with several OSDs exhibiting slow requests over 1000 seconds old and several PGs stuck
in the "peering" and/or "inactive" state for a while.
A quick peek into their logs revealed the following:
2014-10-11 11:03:31.783967 7fcee3a48700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-10-11 11:03:31.783977 7fcee3a48700 0 -- 192.168.0.2:6815/6585 >> 192.168.0.201:6800/13398 pipe(0x7fcf26052280 sd=22 :46446 s=1 pgs=0 cs=0 l=1 c=0x7fcf118006e0).failed verifying authorize reply
2014-10-11 11:03:46.856869 7fcee3a48700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-10-11 11:03:46.856879 7fcee3a48700 0 -- 192.168.0.2:6815/6585 >> 192.168.0.201:6800/13398 pipe(0x7fcf26052280 sd=22 :46522 s=1 pgs=0 cs=0 l=1 c=0x7fcf118006e0).failed verifying authorize reply
Restarting the affected OSDs helps for some time, but then (some time later) the same error strikes back.
Apart from the upgrade and adding another node with one OSD on it, there were no changes to the configuration.
The strangest thing is that only some OSDs appear to be affected: even where there is more than
one OSD on a host, some seem to be stuck with "verify_reply couldn't decrypt" while others
seem to be active and log no errors.
Could it be a regression, or am I missing something? How do I troubleshoot this issue? Thanks.
Updated by Dmitry Smirnov over 9 years ago
Found the following in the logs of the new OSD:
2014-10-10 12:42:33.069104 7f12edb2c700 0 auth: could not find secret_id=6561
2014-10-10 12:42:33.069112 7f12edb2c700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=6561
2014-10-10 12:42:33.069118 7f12edb2c700 0 -- 192.168.0.201:6801/13398 >> 192.168.0.204:6821/10008156 pipe(0x7f130e512780 sd=149 :6801 s=0 pgs=0 cs=0 l=0 c=0x7f1312ad8840).accept: got bad authorizer
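For context on why a wrong clock produces "could not find secret_id": cephx uses rotating, time-limited service keys, each addressed by a numeric secret_id, and a service only retains a small window of recent ids. The sketch below is an illustrative model of that scheme, not Ceph's actual implementation; ROTATION_PERIOD, RETAINED_KEYS, and the id derivation are assumptions chosen to show the failure mode.

```python
import time

# Illustrative sketch (not Ceph source): rotating service keys are addressed
# by an id derived from time, and only a few recent ids are kept. A peer
# whose clock is skewed by more than the retained window asks for an id the
# service no longer (or does not yet) hold, so the lookup fails.
ROTATION_PERIOD = 3600   # hypothetical rotation interval, seconds
RETAINED_KEYS = 3        # hypothetical number of ids kept in memory

def secret_id(now: float) -> int:
    """Map a timestamp to the rotating key id valid at that time."""
    return int(now // ROTATION_PERIOD)

def lookup(keyring: dict, now: float):
    """Return the key for the current id, or None ('could not find secret_id')."""
    return keyring.get(secret_id(now))

# Service keeps only the last RETAINED_KEYS ids.
now = time.time()
keyring = {secret_id(now) - i: f"key-{i}" for i in range(RETAINED_KEYS)}

assert lookup(keyring, now) is not None                    # in-sync clock: OK
assert lookup(keyring, now - 5 * ROTATION_PERIOD) is None  # skewed clock: fails
```

Under this model, a clock skew larger than roughly one rotation period is enough to push a host outside the valid key window, which matches the symptom of one mistimed node failing authorization against otherwise healthy peers.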
Updated by Dmitry Smirnov over 9 years ago
I think I found the problem: the new node (with the new OSD) had an incorrect time.
Everything returned to normal after correcting the clock.
However, I had to restart some OSDs (not just the one running on the host with the incorrect clock) to get some PGs out of the "peering" state.
Ceph seems to be quite vulnerable to an incorrect clock on an OSD host.
I wish there were detection of OSD clock drift, as well as relevant error messages pointing in the right direction -- that could have saved me hours of troubleshooting and helped avoid downtime.
Also, perhaps Ceph should not be affected so much by incorrect time on one of the OSD hosts.
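Until such detection exists, the drift check described above can be done by hand: collect a wall-clock timestamp from each OSD host (e.g. over SSH) and compare against a reference. The helper below is a hypothetical sketch of that comparison, not part of Ceph; the host names, the 30-second threshold, and the sampling method are all assumptions.

```python
# Minimal drift-check sketch (hypothetical helper, not part of Ceph):
# given wall-clock timestamps collected from each host, flag hosts that
# drift beyond a threshold before they can trip cephx authentication.
from typing import Dict, List

def drifting_hosts(timestamps: Dict[str, float],
                   reference: float,
                   threshold: float = 30.0) -> List[str]:
    """Return hosts whose clock differs from `reference` by more than
    `threshold` seconds (30s is an arbitrary illustrative value)."""
    return sorted(h for h, t in timestamps.items()
                  if abs(t - reference) > threshold)

# Example: osd-host3's clock is ~2 hours ahead of the reference clock.
ref = 1_413_021_811.0
samples = {"osd-host1": ref + 1.2,
           "osd-host2": ref - 0.8,
           "osd-host3": ref + 7200.0}
print(drifting_hosts(samples, ref))  # ['osd-host3']
```

In practice the timestamps could come from something like `ssh <host> date -u +%s` per host; note that the monitors' built-in clock-skew health warning covers mon hosts, so OSD-only hosts need this kind of separate check.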
Updated by Sage Weil over 9 years ago
- Status changed from New to Won't Fix
This happens when clocks are very skewed.
Updated by Dmitry Smirnov over 9 years ago
Sage Weil wrote:
this happens when clocks are very skewed.
Are we OK with a vulnerability that allows the whole cluster to be brought down when the clock is wrong on a machine running nothing but an OSD daemon? Why do we tolerate such issues?
Updated by Loïc Dachary over 8 years ago
- Has duplicate Bug #13527: moniter segmentation fault added
Updated by Brad Hubbard about 8 years ago
The crash associated with the duplicate bug is fixed by commit e9e05333ac7c64758bf14d80f6179e001c0fdbfd