Bug #8379

closed

cephx: clock skew causes verify_reply decode error

Added by jian li almost 10 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
cephx
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

linux: 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
4 osd and 1 mon
ceph version: firefly
The following is the ceph.conf:
[global]
fsid = 49a9d6de-0583-4941-b6ee-7007482ddbcf
mon_initial_members = estack4-HP-Z220-SFF-Workstation
mon_host = 192.168.12.5
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

The following is the detailed error output after running the CLI `rados bench -p rbd 10 write`:
Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_estack3-HP-Z220-SFF-Workstati_16356
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 02014-05-17 10:46:54.492303 7f163c3a9700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-05-17 10:46:54.492310 7f163c3a9700 0 -
192.168.12.4:0/1016356 >> 192.168.12.4:6804/15397 pipe(0xce5fd0 sd=4 :48528 s=1 pgs=0 cs=0 l=1 c=0xce6240).failed verifying authorize reply
2014-05-17 10:46:54.492351 7f163c3a9700 0 -- 192.168.12.4:0/1016356 >> 192.168.12.4:6804/15397 pipe(0xce5fd0 sd=4 :48528 s=1 pgs=0 cs=0 l=1 c=0xce6240).fault
2014-05-17 10:46:54.492700 7f163c3a9700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-05-17 10:46:54.492715 7f163c3a9700 0 -- 192.168.12.4:0/1016356 >> 192.168.12.4:6804/15397 pipe(0xce5fd0 sd=4 :48529 s=1 pgs=0 cs=0 l=1 c=0xce6240).failed verifying authorize reply
0 0 - 0
2014-05-17 10:46:54.518700 7f163c2a8700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-05-17 10:46:54.518701 7f163c3a9700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-05-17 10:46:54.518707 7f163c2a8700 0 -- 192.168.12.4:0/1016356 >> 192.168.12.3:6800/15005 pipe(0xce7200 sd=5 :51696 s=1 pgs=0 cs=0 l=1 c=0xce7470).failed verifying authorize reply
2014-05-17 10:46:54.518710 7f163c3a9700 0 -- 192.168.12.4:0/1016356 >> 192.168.12.4:6804/15397 pipe(0xce5fd0 sd=4 :48533 s=1 pgs=0 cs=0 l=1 c=0xce6240).failed verifying authorize reply
2014-05-17 10:46:54.518729 7f163c2a8700 0 -- 192.168.12.4:0/1016356 >> 192.168.12.3:6800/15005 pipe(0xce7200 sd=5 :51696 s=1 pgs=0 cs=0 l=1 c=0xce7470).fault

estack@estack2-HP-Z220-SFF-Workstation:/var/log/ceph$ ceph health
2014-05-17 11:48:39.381253 7f51376bf700 1 -- :/0 messenger.start
2014-05-17 11:48:39.381734 7f51376bf700 1 -- :/1016928 --> 192.168.12.5:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7f5130024a30 con 0x7f51300245e0
2014-05-17 11:48:39.382311 7f5134748700 1 -- 192.168.12.3:0/1016928 learned my addr 192.168.12.3:0/1016928
2014-05-17 11:48:39.383252 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 1 ==== mon_map v1 ==== 221+0+0 (149619065 0 0) 0x7f511c000ad0 con 0x7f51300245e0
2014-05-17 11:48:39.383351 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (4013688572 0 0) 0x7f511c000fc0 con 0x7f51300245e0
2014-05-17 11:48:39.383526 7f513574a700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7f5120001330 con 0x7f51300245e0
2014-05-17 11:48:39.384319 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (2213948709 0 0) 0x7f511c000fc0 con 0x7f51300245e0
2014-05-17 11:48:39.384482 7f513574a700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7f5120001e30 con 0x7f51300245e0
2014-05-17 11:48:39.385188 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (2549800467 0 0) 0x7f511c000fc0 con 0x7f51300245e0
2014-05-17 11:48:39.385310 7f513574a700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f5130024c40 con 0x7f51300245e0
2014-05-17 11:48:39.385468 7f51376bf700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7f5130025a80 con 0x7f51300245e0
2014-05-17 11:48:39.385483 7f51376bf700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7f51300210d0 con 0x7f51300245e0
2014-05-17 11:48:39.386057 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 5 ==== mon_map v1 ==== 221+0+0 (149619065 0 0) 0x7f511c0012b0 con 0x7f51300245e0
2014-05-17 11:48:39.386139 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (187160225 0 0) 0x7f511c001550 con 0x7f51300245e0
2014-05-17 11:48:39.386216 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 7 ==== osd_map(148..148 src has 1..148) v3 ==== 4080+0+0 (2292252469 0 0) 0x7f511c001230 con 0x7f51300245e0
2014-05-17 11:48:39.386359 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (187160225 0 0) 0x7f511c0027f0 con 0x7f51300245e0
2014-05-17 11:48:39.386380 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 9 ==== osd_map(148..148 src has 1..148) v3 ==== 4080+0+0 (2292252469 0 0) 0x7f511c0039a0 con 0x7f51300245e0
2014-05-17 11:48:39.386392 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (187160225 0 0) 0x7f511c003bb0 con 0x7f51300245e0
2014-05-17 11:48:39.387767 7f51376bf700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- ?+0 0x7f5130016570 con 0x7f51300245e0
2014-05-17 11:48:39.391012 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 11 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0 v0) v1 ==== 72+0+29582 (1092875540 0 1116028220) 0x7f511c001bc0 con 0x7f51300245e0
2014-05-17 11:48:39.445616 7f51376bf700 1 -- 192.168.12.3:0/1016928 --> 192.168.12.5:6789/0 -- mon_command({"prefix": "health"} v 0) v1 -- ?+0 0x7f5130024c40 con 0x7f51300245e0
2014-05-17 11:48:39.446479 7f513574a700 1 -- 192.168.12.3:0/1016928 <== mon.0 192.168.12.5:6789/0 12 ==== mon_command_ack([{"prefix": "health"}]=0 v0) v1 ==== 54+0+10 (2555570491 0 3610226052) 0x7f511c000fc0 con 0x7f51300245e0
HEALTH_OK
2014-05-17 11:48:39.449378 7f51376bf700 1 -- 192.168.12.3:0/1016928 mark_down 0x7f51300245e0 -- 0x7f5130024370
2014-05-17 11:48:39.449439 7f51376bf700 1 -- 192.168.12.3:0/1016928 mark_down_all
2014-05-17 11:48:39.449555 7f51376bf700 1 -- 192.168.12.3:0/1016928 shutdown complete.

It works again after I restart the OSDs, but it stops working again when I do not run any Ceph operations for a period of time. I need to restart every OSD again before it works.

Please take a look at this issue, thanks!


Files

ceph_log.zip (3.52 MB) ceph_log.zip jian li, 05/20/2014 05:58 AM
Actions #1

Updated by jian li almost 10 years ago

The previous test with the CLI 'ceph health' was wrong.
Actually the result is always 'HEALTH_OK'.

I collected the OSD log:
10 osd.1 186 do_waiters -- start
2014-05-17 14:33:17.209128 7f92aac86700 10 osd.1 186 do_waiters -- finish
2014-05-17 14:33:17.571869 7f9294814700 0 auth: could not find secret_id=46
2014-05-17 14:33:17.571881 7f9294814700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=46
2014-05-17 14:33:17.571886 7f9294814700 0 -- 192.168.12.3:6800/18101 >> 192.168.12.4:0/1017003 pipe(0x3824a00 sd=29 :6800 s=0 pgs=0 cs=0 l=1 c=0x36cb9c0).accept: got bad authorizer
2014-05-17 14:33:17.573336 7f92a1330700 1 osd.1 186 ms_handle_reset con 0x36cb9c0 session 0
2014-05-17 14:33:17.574171 7f9294814700 0 auth: could not find secret_id=46
2014-05-17 14:33:17.574183 7f9294814700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=46
2014-05-17 14:33:17.574189 7f9294814700 0 -- 192.168.12.3:6800/18101 >> 192.168.12.4:0/1017003 pipe(0x3826d00 sd=29 :6800 s=0 pgs=0 cs=0 l=1 c=0x36cbb20).accept: got bad authorizer
2014-05-17 14:33:17.575027 7f92a1330700 1 osd.1 186 ms_handle_reset con 0x36cbb20 session 0
2014-05-17 14:33:17.616982 7f929eb2b700 20 osd.1 186 _share_map_outgoing 0x3506160 already has epoch 186
2014-05-17 14:33:17.617039 7f929db29700 20 osd.1 186 _share_map_outgoing 0x3506160 already has epoch 186
2014-05-17 14:33:17.680886 7f929831e700 20 osd.1 186 update_osd_stat osd_stat(11176 MB used, 843 GB avail, 900 GB total, peers [0,2,3]/[] op hist [])

I also collected the mon log:
2014-05-17 22:49:36.664612 7fe1dad39700 11 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 tick
2014-05-17 22:49:36.664629 7fe1dad39700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).pg v3893 v3893: 192 pgs: 192 active+clean; 12661 MB data, 51751 MB used, 3369 GB / 3602 GB avail
2014-05-17 22:49:36.664647 7fe1dad39700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).mds e1 e1: 0/0/1 up
2014-05-17 22:49:36.664653 7fe1dad39700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).osd e186 e186: 4 osds: 4 up, 4 in
2014-05-17 22:49:36.664659 7fe1dad39700 1 mon.estack4-HP-Z220-SFF-Workstation@0(leader).paxos(paxos active c 7280..7795) is_readable now=2014-05-17 22:49:36.664659 lease_expire=0.000000 has v0 lc 7795
2014-05-17 22:49:36.664691 7fe1dad39700 1 mon.estack4-HP-Z220-SFF-Workstation@0(leader).paxos(paxos active c 7280..7795) is_readable now=2014-05-17 22:49:36.664693 lease_expire=0.000000 has v0 lc 7795
2014-05-17 22:49:36.664701 7fe1dad39700 1 mon.estack4-HP-Z220-SFF-Workstation@0(leader).paxos(paxos active c 7280..7795) is_readable now=2014-05-17 22:49:36.664702 lease_expire=0.000000 has v0 lc 7795
2014-05-17 22:49:36.664707 7fe1dad39700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).osd e186 min_last_epoch_clean 186
2014-05-17 22:49:36.664710 7fe1dad39700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).log v3592 log
2014-05-17 22:49:36.664713 7fe1dad39700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 auth
2014-05-17 22:49:36.664718 7fe1dad39700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 sync_trim_providers
2014-05-17 22:49:37.553068 7fe1d8b33700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_verify_authorizer 192.168.12.4:0/1017479 client protocol 0
2014-05-17 22:49:37.553495 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 have connection
2014-05-17 22:49:37.553500 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 do not have session, making new one
2014-05-17 22:49:37.553504 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_dispatch new session MonSession: client.? 192.168.12.4:0/1017479 is open for client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.553513 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 setting timeout on session
2014-05-17 22:49:37.553515 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 caps
2014-05-17 22:49:37.553519 7fe1da538700 1 mon.estack4-HP-Z220-SFF-Workstation@0(leader).paxos(paxos active c 7280..7795) is_readable now=2014-05-17 22:49:37.553519 lease_expire=0.000000 has v0 lc 7795
2014-05-17 22:49:37.553541 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 preprocess_query auth(proto 0 30 bytes epoch 0) v1 from client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.553549 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 prep_auth() blob_size=30
2014-05-17 22:49:37.553559 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 AuthMonitor::assign_global_id m=auth(proto 0 30 bytes epoch 0) v1 mon=0/1 last_allocated=4812 max_global_id=4896
2014-05-17 22:49:37.553566 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 next_global_id should be 4813
2014-05-17 22:49:37.554478 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 have connection
2014-05-17 22:49:37.554483 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_dispatch existing session MonSession: client.? 192.168.12.4:0/1017479 is open for client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.554492 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 caps
2014-05-17 22:49:37.554496 7fe1da538700 1 mon.estack4-HP-Z220-SFF-Workstation@0(leader).paxos(paxos active c 7280..7795) is_readable now=2014-05-17 22:49:37.554496 lease_expire=0.000000 has v0 lc 7795
2014-05-17 22:49:37.554535 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 preprocess_query auth(proto 2 32 bytes epoch 0) v1 from client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.554544 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 prep_auth() blob_size=32
2014-05-17 22:49:37.555424 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 have connection
2014-05-17 22:49:37.555429 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_dispatch existing session MonSession: client.? 192.168.12.4:0/1017479 is openallow * for client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.555439 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 caps allow *
2014-05-17 22:49:37.555443 7fe1da538700 1 mon.estack4-HP-Z220-SFF-Workstation@0(leader).paxos(paxos active c 7280..7795) is_readable now=2014-05-17 22:49:37.555444 lease_expire=0.000000 has v0 lc 7795
2014-05-17 22:49:37.555457 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 preprocess_query auth(proto 2 165 bytes epoch 0) v1 from client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.555465 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).auth v68 prep_auth() blob_size=165
2014-05-17 22:49:37.556357 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 have connection
2014-05-17 22:49:37.556362 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_dispatch existing session MonSession: client.? 192.168.12.4:0/1017479 is openallow * for client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.556372 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 caps allow *
2014-05-17 22:49:37.556376 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 handle_subscribe mon_subscribe({monmap=0+}) v2
2014-05-17 22:49:37.556384 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 check_sub monmap next 0 have 1
2014-05-17 22:49:37.556435 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 have connection
2014-05-17 22:49:37.556437 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_dispatch existing session MonSession: client.? 192.168.12.4:0/1017479 is openallow * for client.? 192.168.12.4:0/1017479
2014-05-17 22:49:37.556443 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 caps allow *
2014-05-17 22:49:37.556446 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 handle_subscribe mon_subscribe({monmap=2+,osdmap=0}) v2
2014-05-17 22:49:37.556450 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 check_sub monmap next 2 have 1
2014-05-17 22:49:37.556454 7fe1da538700 20 is_capable service=osd command= read on cap allow *
2014-05-17 22:49:37.556456 7fe1da538700 20 allow so far , doing grant allow *
2014-05-17 22:49:37.556457 7fe1da538700 20 allow all
2014-05-17 22:49:37.556459 7fe1da538700 10 mon.estack4-HP-Z220-SFF-Workstation@0(leader).osd e186 check_sub 0x3d48300 next 0 (onetime)
2014-05-17 22:49:37.556539 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 have connection
2014-05-17 22:49:37.556542 7fe1da538700 20 mon.estack4-HP-Z220-SFF-Workstation@0(leader) e1 ms_dispatch existing session MonSession: client.? 192.168.12.4:0/1017479 is openallow * for client.? 192.168.12.4:0/1017479

Actions #2

Updated by Sage Weil almost 10 years ago

  • Source changed from other to Community (user)

can you reproduce this with debug auth = 20 and debug monc = 20?

Actions #3

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by jian li almost 10 years ago

The settings of ceph.conf is
[global]
fsid = 49a9d6de-0583-4941-b6ee-7007482ddbcf
mon_initial_members = estack4-HP-Z220-SFF-Workstation
mon_host = 192.168.12.5
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
[mon]
debug mon = 20
debug paxos = 1/5
debug auth = 20
[osd]
debug osd = 5/20
debug filestore = 1/5
debug journal = 1
debug monc = 5/20

Please refer to the log file.
It works at 2014-05-21 02:00:00, but stops working at 2014-05-21 03:41:48 when I use the CLI `rados bench -p rbd 10 write`.

Thanks!

Actions #5

Updated by jian li almost 10 years ago

I fixed the issue by syncing all the MON and OSD nodes to the same system date.
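For anyone hitting this, the skew can be measured before re-syncing. A minimal sketch, assuming hypothetical epoch-second samples collected from each node (e.g. via `ssh <node> date +%s` per host; the values below are made up for illustration):

```shell
# Hypothetical epoch-second samples, one per MON/OSD node.
timestamps="1400300000
1400300001
1400300047"

# Compute the spread between the fastest and slowest clock.
min=$(echo "$timestamps" | sort -n | head -n 1)
max=$(echo "$timestamps" | sort -n | tail -n 1)
spread=$(( max - min ))
echo "max clock spread: ${spread}s"

# Per this ticket, skewed clocks leave a daemon without the rotating
# service key (secret_id) that a peer presents, so flag any real drift.
if [ "$spread" -gt 1 ]; then
    echo "sync the nodes (e.g. with NTP) before retrying"
fi
```

With the clocks back in sync, the `could not find secret_id=46` lookups in the OSD log above stop failing.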

Actions #6

Updated by jian li almost 10 years ago

@Sage Weil, I can't close this issue; please help to close it, thanks!
BTW, it would be better to add the above limitation to the Ceph installation guide.

Actions #7

Updated by Sage Weil almost 10 years ago

  • Subject changed from cephx: verify_reply couldn't decrypt with error: error decoding block for decryption to cephx: clock skew causes verify_reply decode error
  • Status changed from Need More Info to 12
Actions #8

Updated by Sage Weil almost 10 years ago

Thanks, now that we know what causes this I think we should improve the error message (at a minimum).

Actions #9

Updated by Greg Farnum almost 10 years ago

So this is just the OSDs (or, possibly, the monitors) being out of sync with their clocks and expiring or advancing their signing keys out-of-step enough that they can't communicate?
I don't remember exactly the cephx system (is it based on absolute or relative times?), but it sounds to me like this is the system working exactly as intended, and we can't do much about that.
I guess we could append the error message with ", other end has the wrong keys or time is out of sync"?

Actions #10

Updated by Sage Weil almost 10 years ago

Greg Farnum wrote:

So this is just the OSDs (or, possibly, the monitors) being out of sync with their clocks and expiring or advancing their signing keys out-of-step enough that they can't communicate?
I don't remember exactly the cephx system (is it based on absolute or relative times?), but it sounds to me like this is the system working exactly as intended, and we can't do much about that.
I guess we could append the error message with ", other end has the wrong keys or time is out of sync"?

Exactly. I'm just thinking we reproduce it, identify the specific error that comes up, and make it include something like "... (clock skew?)"

Actions #11

Updated by Josh Durgin over 9 years ago

  • Project changed from rbd to Ceph
  • Category set to Monitor
Actions #12

Updated by Greg Farnum over 9 years ago

  • Category changed from Monitor to cephx
Actions #13

Updated by karan singh almost 9 years ago

Looks like this error is not ONLY caused by date/time sync.

I have date/time synced and even the clock drift settings in place, but am still getting the error.

ished
2015-07-01 23:40:20.373885 7f2500929700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2015-07-01 23:40:20.373902 7f2500929700  0 -- 192.168.1.101:6789/0 >> 192.168.1.102:6789/0 pipe(0x41802c0 sd=21 :55061 s=1 pgs=0 cs=0 l=0 c=0x41198c0).failed verifying authorize reply
2015-07-01 23:40:20.375347 7f24f7b8b700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2015-07-01 23:40:20.375375 7f24f7b8b700  0 -- 192.168.1.101:6789/0 >> 192.168.1.103:6789/0 pipe(0x4180dc0 sd=20 :44382 s=1 pgs=0 cs=0 l=0 c=0x4119600).failed verifying authorize reply
2015-07-01 23:40:31.508802 7f24fbef8700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2015-07-01 23:40:31.509042 7f24fbef8700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2015-07-01 23:40:31.514204 7f24fbef8700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
[root@ceph-node1 ceph]# cat /etc/ceph/ceph.conf | grep -i clock
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
[root@ceph-node1 ceph]#
[root@ceph-node1 ceph]# for i in 1 2 3 ; do ssh ceph-node$i date ; done
Wed Jul  1 23:33:11 EEST 2015
Wed Jul  1 23:33:11 EEST 2015
Wed Jul  1 23:33:11 EEST 2015
[root@ceph-node1 ceph]# ceph -v
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
[root@ceph-node1 ceph]#
[root@ceph-node1 ceph]# ceph mon stat
e3: 3 mons at {ceph-node1=192.168.1.101:6789/0,ceph-node2=192.168.1.102:6789/0,ceph-node3=192.168.1.103:6789/0}, election epoch 852, quorum 1,2 ceph-node2,ceph-node3
[root@ceph-node1 ceph]#
Actions #14

Updated by Blade Doyle over 8 years ago

karan singh wrote:

Looks like this this error is not ONLY because of date / time sync.

I have date / time synced and even clock drift set, still getting error.

[...]

[...]

Agreed.

In my configuration the "cephx: verify_reply couldn't decrypt with error: error decoding block for decryption" was caused by iptables blocking the ceph ports. I resolved the issue by adding the following rules (before the final reject):

-A INPUT -m tcp -p tcp --dport 6789 -j ACCEPT
-A INPUT -p tcp -m multiport --dports 6800:6810 -j ACCEPT
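Since the comment above traces the same message to filtered ports rather than keys or clocks, a quick reachability check can rule that out first. A sketch using bash's `/dev/tcp` redirection (the mon address 192.168.1.102:6789 comes from the logs above; substitute your own hosts, and note the OSD port range in releases of that era defaulted to roughly 6800:7300, not just 6800:6810):

```shell
# Check whether a Ceph daemon port accepts TCP connections at all;
# per the comment above, a firewalled port can surface as the same
# "verify_reply couldn't decrypt" failure as bad keys or clock skew.
check_port() {
    local host=$1 port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} reachable"
    else
        echo "${host}:${port} blocked or closed"
    fi
}

check_port 192.168.1.102 6789   # mon
```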

Actions #15

Updated by Sage Weil over 7 years ago

  • Assignee deleted (Sage Weil)
Actions #16

Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Can't reproduce