Bug #40716

hammer client failed to auth against master OSD

Added by Sage Weil over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: 30523
Crash signature (v1): -
Crash signature (v2): -

Description

2019-07-10 13:36:45.745486 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56892 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connecting to 172.21.15.39:6804/1465818
2019-07-10 13:36:45.746015 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect read peer addr 172.21.15.39:6804/1465818 on socket 15
2019-07-10 13:36:45.746076 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect peer addr for me is 172.21.15.32:56924/0
2019-07-10 13:36:45.746105 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect sent my addr 172.21.15.32:0/1977413338
2019-07-10 13:36:45.746118 7f9940cf6700 10 cephx client: build_authorizer for service osd
2019-07-10 13:36:45.746160 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect.authorizer_len=174 protocol=2
2019-07-10 13:36:45.746173 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect sending gseq=37713 cseq=0 proto=24
2019-07-10 13:36:45.746190 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect wrote (self +) cseq, waiting for reply
2019-07-10 13:36:45.746409 7f9940cf6700 20 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).connect got reply tag 16 connect_seq 0 global_seq 0 proto 24 flags 0 features 509868447236095
2019-07-10 13:36:45.746425 7f9940cf6700 10 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).reply.authorizer_len=32
2019-07-10 13:36:45.746454 7f9940cf6700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2019-07-10 13:36:45.746457 7f9940cf6700  0 -- 172.21.15.32:0/1977413338 >> 172.21.15.39:6804/1465818 pipe(0x7f99383b9ca0 sd=15 :56924 s=1 pgs=0 cs=0 l=1 c=0x7f99383bdf40).failed verifying authorize reply

This occurs on the client side, which is running hammer (0.94.*).

/a/sage-2019-07-10_01:52:27-rados-wip-sage-testing-2019-07-09-1801-distro-basic-smithi/4107283
description: rados/thrash-old-clients/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-install/hammer.yaml backoff/peering.yaml ceph.yaml clusters/{openstack.yaml three-plus-one.yaml} d-balancer/crush-compat.yaml distro$/{centos_latest.yaml} msgr-failures/few.yaml rados.yaml thrashers/careful.yaml thrashosds-health.yaml workloads/cache-snaps.yaml}
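
For context on the failure mode: after the handshake above, the client decrypts the 32-byte authorizer reply with the cephx session key and validates the decoded result before trusting the connection; "verify_reply couldn't decrypt" means that validation failed. Below is a minimal C sketch of the general shape of such a check -- all names, the stub decrypt routine, and the magic value are illustrative stand-ins, not Ceph source:

/*
 * Illustrative sketch only -- not Ceph code. Shows how "error decoding
 * block for decryption"-style failures typically arise: a reply that is
 * empty or not block-aligned, a decrypt failure, or a decrypted payload
 * whose leading magic tag does not match the expected value.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define AES_BLOCK 16
#define ENC_MAGIC 0x55aa268819adULL   /* placeholder, not the real constant */

/* Stand-in for AES-CBC decryption with the cephx session key.
 * Here it just copies the input; returns 0 on success. */
static int decrypt_stub(const uint8_t *in, size_t in_len,
                        uint8_t *out, size_t *out_len)
{
    memcpy(out, in, in_len);
    *out_len = in_len;
    return 0;
}

/* Returns 0 if the reply verifies, -1 on a "couldn't decrypt" failure. */
static int verify_reply(const uint8_t *buf, size_t len)
{
    uint8_t plain[256];
    size_t plain_len;
    uint64_t magic;

    if (len == 0 || len > sizeof(plain) || len % AES_BLOCK != 0)
        return -1;                 /* not a valid ciphertext block */
    if (decrypt_stub(buf, len, plain, &plain_len) != 0)
        return -1;                 /* wrong key, bad padding, etc. */
    if (plain_len < sizeof(magic))
        return -1;                 /* too short to carry the magic tag */
    memcpy(&magic, plain, sizeof(magic));
    return magic == ENC_MAGIC ? 0 : -1;   /* magic mismatch: reject reply */
}

int main(void)
{
    uint8_t reply[32] = { 0 };     /* 32-byte reply, as in the log above */
    printf("verify: %d\n", verify_reply(reply, sizeof(reply)));
    return 0;
}

Each early return in the sketch maps to the same client-visible outcome: the reply is rejected and the connection attempt fails, as in the log above.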


Related issues

Copied to Messengers - Backport #42013: nautilus: hammer client failed to auth against master OSD (Resolved)

History

#1 Updated by Sage Weil over 4 years ago

/a/sage-2019-07-19_21:25:20-rados-master-distro-basic-smithi/4130750

#2 Updated by Sage Weil over 4 years ago

Captured some detailed logs here:

/a/sage-2019-09-23_02:45:54-rados-wip-sage2-testing-2019-09-22-1659-distro-basic-smithi/4327952

#3 Updated by Sage Weil over 4 years ago

  • Status changed from 12 to Fix Under Review
  • Backport set to nautilus
  • Pull request ID set to 30523

#4 Updated by Sage Weil over 4 years ago

Backport may be non-trivial (or possibly unnecessary), since there was a huge post-nautilus cleanup/refactor/simplification.

#5 Updated by Sage Weil over 4 years ago

See https://github.com/ceph/ceph/pull/30524 for the nautilus backport/fix.

#6 Updated by Nathan Cutler over 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #42013: nautilus: hammer client failed to auth against master OSD added

#8 Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Fix Under Review

#9 Updated by Sage Weil over 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#10 Updated by Sage Weil over 4 years ago

  • Status changed from Pending Backport to Fix Under Review

#11 Updated by Sage Weil over 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#12 Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#13 Updated by Ilya Dryomov over 4 years ago

Just a note that, when exposed to this bug, the kernel client crashes trying to dereference a NULL sg:

[  443.766507] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
[  443.774365] IP: [<ffffffff8133ef98>] scatterwalk_pagedone+0x58/0x90

Stack traceback for pid 80
0xffff88042d3b2010       80        2  1    5   R  0xffff88042d3b24f0 *kworker/5:1
 ffff88042d3e9890 0000000000000018 ffff88042d3e98d8 ffffffff8133f087
 0000000000000002 0000000181341baf ffff88042d3e9950 0000000000000010
 ffff88042d3e9988 ffff88042b4cdf80 ffff88042d3e9a48 ffff88042d3e9910
Call Trace:
 [<ffffffff8133f087>] ? scatterwalk_copychunks+0x77/0x140
 [<ffffffff81341e84>] ? blkcipher_walk_done+0x1f4/0x230
 [<ffffffff81349304>] ? crypto_cbc_decrypt+0x134/0x250
 [<ffffffff8134a630>] ? aes_encrypt+0xdc0/0xdc0
 [<ffffffff81349846>] ? crypto_aes_set_key+0x16/0x40
 [<ffffffffa03eab6e>] ? ceph_aes_decrypt2+0x20e/0x330 [libceph]
 [<ffffffffa03eab6e>] ? ceph_aes_decrypt2+0x20e/0x330 [libceph]
 [<ffffffffa03eb801>] ? ceph_decrypt2+0x61/0x100 [libceph]
 [<ffffffffa03ebe72>] ? ceph_x_decrypt+0x72/0x140 [libceph]
 [<ffffffffa03ec06a>] ? ceph_x_verify_authorizer_reply+0x5a/0x100 [libceph]
 [<ffffffffa03d4d4b>] ? ceph_tcp_recvmsg+0x4b/0x60 [libceph]
 [<ffffffffa03e9899>] ? ceph_auth_verify_authorizer_reply+0x49/0x70 [libceph]
 [<ffffffffa03de429>] ? verify_authorizer_reply+0x29/0x30 [libceph]
 [<ffffffffa03d78ae>] ? con_work+0x3ae/0x2e00 [libceph]
 [<ffffffff8108aa2e>] ? process_one_work+0x1ee/0x5d0
 [<ffffffff8108a9cb>] ? process_one_work+0x18b/0x5d0
 [<ffffffff8108bacb>] ? worker_thread+0x11b/0x3c0
 [<ffffffff8108b9b0>] ? manage_workers.isra.16+0x290/0x290
 [<ffffffff81092cfa>] ? kthread+0xea/0xf0
 [<ffffffff81092c10>] ? kthread_stop+0x160/0x160
 [<ffffffff816fb4d8>] ? ret_from_fork+0x58/0x90
 [<ffffffff81092c10>] ? kthread_stop+0x160/0x160

void scatterwalk_start(struct scatter_walk *walk, struct scatterlist *sg)
{
        walk->sg = sg;

        BUG_ON(!sg->length);  <-- sg is NULL

        walk->offset = sg->offset;
}

This is on an old kernel, dug from the archives. I haven't looked at anything recent yet.
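
To make the crash mechanism concrete, here is a minimal userspace mock of the dereference (simplified struct layouts, not kernel code): passing a NULL scatterlist faults on the sg->length read, exactly where the quoted function is annotated.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures. */
struct scatterlist { unsigned int offset, length; };
struct scatter_walk { struct scatterlist *sg; unsigned int offset; };

/* Mirrors the quoted function: sg is dereferenced before any NULL check,
 * so a NULL sg faults on the sg->length read, not in BUG_ON itself. */
static void scatterwalk_start(struct scatter_walk *walk, struct scatterlist *sg)
{
    walk->sg = sg;
    assert(sg->length);            /* stands in for BUG_ON(!sg->length) */
    walk->offset = sg->offset;
}

int main(void)
{
    struct scatter_walk walk;
    struct scatterlist sg = { .offset = 0, .length = 32 };

    scatterwalk_start(&walk, &sg); /* fine: valid, non-empty scatterlist */
    printf("ok: offset=%u\n", walk.offset);

    scatterwalk_start(&walk, NULL);/* faults: NULL pointer dereference,
                                      matching the trace above */
    return 0;
}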
