Bug #40716
hammer client failed to auth against master OSD
Description
2019-07-10 13:36:45.745486 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56892 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connecting to 172.21.15.39:6804/1465818
2019-07-10 13:36:45.746015 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect read peer addr 172.21.15.39:6804/1465818 on socket 15
2019-07-10 13:36:45.746076 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect peer addr for me is 172.21.15.32:56924/0
2019-07-10 13:36:45.746105 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect sent my addr 172.21.15.32:0/1977413338
2019-07-10 13:36:45.746118 7f9940cf6700 10 cephx client: build_authorizer for service osd
2019-07-10 13:36:45.746160 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect.authorizer_len=174 protocol=2
2019-07-10 13:36:45.746173 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect sending gseq=37713 cseq=0 proto=24
2019-07-10 13:36:45.746190 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect wrote (self +) cseq, waiting for reply
2019-07-10 13:36:45.746409 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect got reply tag 16 connect_seq 0 global_seq 0 proto 24 flags 0 features 509868447236095
2019-07-10 13:36:45.746425 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).reply.authorizer_len=32
2019-07-10 13:36:45.746454 7f9940cf6700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2019-07-10 13:36:45.746457 7f9940cf6700 0 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).failed verifying authorize reply
The log above is from the client, which is running 0.94.* (hammer).
/a/sage-2019-07-10_01:52:27-rados-wip-sage-testing-2019-07-09-1801-distro-basic-smithi/4107283
description: rados/thrash-old-clients/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-install/hammer.yaml backoff/peering.yaml ceph.yaml clusters/{openstack.yaml three-plus-one.yaml} d-balancer/crush-compat.yaml distro$/{centos_latest.yaml} msgr-failures/few.yaml rados.yaml thrashers/careful.yaml thrashosds-health.yaml workloads/cache-snaps.yaml}
History
#1 Updated by Sage Weil over 4 years ago
/a/sage-2019-07-19_21:25:20-rados-master-distro-basic-smithi/4130750
#2 Updated by Sage Weil over 4 years ago
captured some detailed logs here:
/a/sage-2019-09-23_02:45:54-rados-wip-sage2-testing-2019-09-22-1659-distro-basic-smithi/4327952
#3 Updated by Sage Weil over 4 years ago
- Status changed from 12 to Fix Under Review
- Backport set to nautilus
- Pull request ID set to 30523
#4 Updated by Sage Weil over 4 years ago
backport may be non-trivial (or possibly unnecessary), since there was a huge post-nautilus cleanup/refactor/simplification.
#5 Updated by Sage Weil over 4 years ago
see https://github.com/ceph/ceph/pull/30524 for nautilus backport/fix
#6 Updated by Nathan Cutler over 4 years ago
- Status changed from Fix Under Review to Pending Backport
#7 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42013: nautilus: hammer client failed to auth against master OSD added
#8 Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Fix Under Review
#9 Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Pending Backport
#10 Updated by Sage Weil over 4 years ago
- Status changed from Pending Backport to Fix Under Review
#11 Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Pending Backport
#12 Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
#13 Updated by Ilya Dryomov over 4 years ago
Just a note that when exposed to this bug the kernel client crashes trying to dereference a NULL sg:
[ 443.766507] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
[ 443.774365] IP: [<ffffffff8133ef98>] scatterwalk_pagedone+0x58/0x90
Stack traceback for pid 80
0xffff88042d3b2010 80 2 1 5 R 0xffff88042d3b24f0 *kworker/5:1
 ffff88042d3e9890 0000000000000018 ffff88042d3e98d8 ffffffff8133f087
 0000000000000002 0000000181341baf ffff88042d3e9950 0000000000000010
 ffff88042d3e9988 ffff88042b4cdf80 ffff88042d3e9a48 ffff88042d3e9910
Call Trace:
 [<ffffffff8133f087>] ? scatterwalk_copychunks+0x77/0x140
 [<ffffffff81341e84>] ? blkcipher_walk_done+0x1f4/0x230
 [<ffffffff81349304>] ? crypto_cbc_decrypt+0x134/0x250
 [<ffffffff8134a630>] ? aes_encrypt+0xdc0/0xdc0
 [<ffffffff81349846>] ? crypto_aes_set_key+0x16/0x40
 [<ffffffffa03eab6e>] ? ceph_aes_decrypt2+0x20e/0x330 [libceph]
 [<ffffffffa03eab6e>] ? ceph_aes_decrypt2+0x20e/0x330 [libceph]
 [<ffffffffa03eb801>] ? ceph_decrypt2+0x61/0x100 [libceph]
 [<ffffffffa03ebe72>] ? ceph_x_decrypt+0x72/0x140 [libceph]
 [<ffffffffa03ec06a>] ? ceph_x_verify_authorizer_reply+0x5a/0x100 [libceph]
 [<ffffffffa03d4d4b>] ? ceph_tcp_recvmsg+0x4b/0x60 [libceph]
 [<ffffffffa03e9899>] ? ceph_auth_verify_authorizer_reply+0x49/0x70 [libceph]
 [<ffffffffa03de429>] ? verify_authorizer_reply+0x29/0x30 [libceph]
 [<ffffffffa03d78ae>] ? con_work+0x3ae/0x2e00 [libceph]
 [<ffffffff8108aa2e>] ? process_one_work+0x1ee/0x5d0
 [<ffffffff8108a9cb>] ? process_one_work+0x18b/0x5d0
 [<ffffffff8108bacb>] ? worker_thread+0x11b/0x3c0
 [<ffffffff8108b9b0>] ? manage_workers.isra.16+0x290/0x290
 [<ffffffff81092cfa>] ? kthread+0xea/0xf0
 [<ffffffff81092c10>] ? kthread_stop+0x160/0x160
 [<ffffffff816fb4d8>] ? ret_from_fork+0x58/0x90
 [<ffffffff81092c10>] ? kthread_stop+0x160/0x160

void scatterwalk_start(struct scatter_walk *walk, struct scatterlist *sg)
{
        walk->sg = sg;
        BUG_ON(!sg->length);    <-- sg is NULL
        walk->offset = sg->offset;
}
This is on an old kernel, dug from the archives. I haven't looked at anything recent yet.
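The faulting address 0xc in the oops is consistent with a read of sg->length through a NULL sg. The sketch below checks that arithmetic in user space. It assumes a 64-bit struct scatterlist layout without CONFIG_DEBUG_SG (page_link, then offset, then length), which may not match every kernel build. The Scatterlist class and the fault_address helper are hypothetical stand-ins for illustration, not kernel or libceph code.

```python
import ctypes


class Scatterlist(ctypes.Structure):
    # Assumed mirror of the leading fields of the kernel's struct scatterlist
    # on a 64-bit build without CONFIG_DEBUG_SG. Hypothetical user-space model.
    _fields_ = [
        ("page_link", ctypes.c_uint64),  # unsigned long on x86_64
        ("offset", ctypes.c_uint32),     # unsigned int
        ("length", ctypes.c_uint32),     # unsigned int
    ]


def fault_address(base, field):
    """Address touched when `field` is read through a pointer equal to `base`."""
    return base + getattr(Scatterlist, field).offset


# Address reported by "unable to handle kernel NULL pointer dereference at ..."
oops_address = 0x000000000000000C

# Reading sg->length with sg == NULL touches address 0 + offsetof(length).
length_fault = fault_address(0, "length")
print(hex(length_fault), length_fault == oops_address)  # prints: 0xc True
```

Under that layout assumption, the BUG_ON(!sg->length) check is the first access through sg, so it is the one that faults.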