Bug #9744
Status: Closed
cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
Description
Shortly after upgrading from 0.80.5 to 0.80.6, the cluster became slow and then almost completely stopped,
with several OSDs exhibiting slow requests over 1000 seconds old and several PGs stuck
in the "peering" and/or "inactive" state for a while.
A quick peek into their logs revealed the following:
2014-10-11 11:03:31.783967 7fcee3a48700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-10-11 11:03:31.783977 7fcee3a48700 0 -- 192.168.0.2:6815/6585 >> 192.168.0.201:6800/13398 pipe(0x7fcf26052280 sd=22 :46446 s=1 pgs=0 cs=0 l=1 c=0x7fcf118006e0).failed verifying authorize reply
2014-10-11 11:03:46.856869 7fcee3a48700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-10-11 11:03:46.856879 7fcee3a48700 0 -- 192.168.0.2:6815/6585 >> 192.168.0.201:6800/13398 pipe(0x7fcf26052280 sd=22 :46522 s=1 pgs=0 cs=0 l=1 c=0x7fcf118006e0).failed verifying authorize reply
Restarting the affected OSDs helps for some time, but then (some time later) the same error strikes back.
Apart from the upgrade and adding another node with one OSD on it, there were no changes to the configuration.
The strangest thing is that only some OSDs appear to be affected: even where there is more than
one OSD on a host, some seem to be stuck with "verify_reply couldn't decrypt" while others
seem to be active and log no errors.
Could it be a regression, or am I missing something? How do I troubleshoot this issue? Thanks.
Updated by Dmitry Smirnov over 9 years ago
Found the following in the logs of the new OSD:
2014-10-10 12:42:33.069104 7f12edb2c700 0 auth: could not find secret_id=6561
2014-10-10 12:42:33.069112 7f12edb2c700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=6561
2014-10-10 12:42:33.069118 7f12edb2c700 0 -- 192.168.0.201:6801/13398 >> 192.168.0.204:6821/10008156 pipe(0x7f130e512780 sd=149 :6801 s=0 pgs=0 cs=0 l=0 c=0x7f1312ad8840).accept: got bad authorizer
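For context on why a wrong clock produces "could not find secret_id": cephx uses rotating, time-limited service keys, each addressed by a numeric secret_id, and a service only retains a small window of recent ids. The sketch below is an illustrative model of that scheme, not Ceph's actual implementation; ROTATION_PERIOD, RETAINED_KEYS, and the id derivation are assumptions chosen to show the failure mode.

```python
import time

# Illustrative sketch (not Ceph source): rotating service keys are addressed
# by an id derived from time, and only a few recent ids are kept. A peer
# whose clock is skewed by more than the retained window asks for an id the
# service no longer (or does not yet) hold, so the lookup fails.
ROTATION_PERIOD = 3600   # hypothetical rotation interval, seconds
RETAINED_KEYS = 3        # hypothetical number of ids kept in memory

def secret_id(now: float) -> int:
    """Map a timestamp to the rotating key id valid at that time."""
    return int(now // ROTATION_PERIOD)

def lookup(keyring: dict, now: float):
    """Return the key for the current id, or None ('could not find secret_id')."""
    return keyring.get(secret_id(now))

# Service keeps only the last RETAINED_KEYS ids.
now = time.time()
keyring = {secret_id(now) - i: f"key-{i}" for i in range(RETAINED_KEYS)}

assert lookup(keyring, now) is not None                    # in-sync clock: OK
assert lookup(keyring, now - 5 * ROTATION_PERIOD) is None  # skewed clock: fails
```

Under this model, a clock skew larger than roughly one rotation period is enough to push a host outside the valid key window, which matches the symptom of one mistimed node failing authorization against otherwise healthy peers.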
Updated by Dmitry Smirnov over 9 years ago
I think I found the problem: the new node (with the new OSD) had an incorrect time.
Everything returned to normal after correcting the clock.
However, I had to restart some OSDs (not just the one running on the host with the incorrect clock) to get some PGs out of the "peering" state.
Ceph seems to be quite vulnerable to an incorrect clock on an OSD host.
I wish there were detection of OSD clock drift, as well as relevant error messages pointing in the right direction -- that could have saved me hours of troubleshooting and helped avoid downtime.
Also, perhaps Ceph should not be affected so much by incorrect time on one of the OSD hosts.
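Until such detection exists, the drift check described above can be done by hand: collect a wall-clock timestamp from each OSD host (e.g. over SSH) and compare against a reference. The helper below is a hypothetical sketch of that comparison, not part of Ceph; the host names, the 30-second threshold, and the sampling method are all assumptions.

```python
# Minimal drift-check sketch (hypothetical helper, not part of Ceph):
# given wall-clock timestamps collected from each host, flag hosts that
# drift beyond a threshold before they can trip cephx authentication.
from typing import Dict, List

def drifting_hosts(timestamps: Dict[str, float],
                   reference: float,
                   threshold: float = 30.0) -> List[str]:
    """Return hosts whose clock differs from `reference` by more than
    `threshold` seconds (30s is an arbitrary illustrative value)."""
    return sorted(h for h, t in timestamps.items()
                  if abs(t - reference) > threshold)

# Example: osd-host3's clock is ~2 hours ahead of the reference clock.
ref = 1_413_021_811.0
samples = {"osd-host1": ref + 1.2,
           "osd-host2": ref - 0.8,
           "osd-host3": ref + 7200.0}
print(drifting_hosts(samples, ref))  # ['osd-host3']
```

In practice the timestamps could come from something like `ssh <host> date -u +%s` per host; note that the monitors' built-in clock-skew health warning covers mon hosts, so OSD-only hosts need this kind of separate check.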
Updated by Sage Weil over 9 years ago
- Status changed from New to Won't Fix
This happens when clocks are very skewed.
Updated by Dmitry Smirnov over 9 years ago
Sage Weil wrote:
this happens when clocks are very skewed.
Are we OK with a vulnerability that allows the whole cluster to be brought down when the clock is wrong on a machine running nothing but an OSD daemon? Why do we tolerate such issues?
Updated by Loïc Dachary over 8 years ago
- Has duplicate Bug #13527: moniter segmentation fault added
Updated by Brad Hubbard about 8 years ago
The crash associated with the duplicate bug is fixed by commit e9e05333ac7c64758bf14d80f6179e001c0fdbfd