Bug #7093
osd: peering can send messages prior to auth
Description
we are still authenticating:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000009061a9 in Wait (mutex=..., this=0x7fff71a29a70) at ./common/Cond.h:55
#2 MonClient::authenticate (this=0x7fff71a29660, timeout=0) at mon/MonClient.cc:448
#3 0x0000000000632927 in OSD::init (this=0x24ce000) at osd/OSD.cc:1228
#4 0x00000000005de24d in main (argc=<optimized out>, argv=<optimized out>) at ceph_osd.cc:468
but get an osdmap:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000009411d2 in Wait (mutex=..., this=0x24ce538) at common/Cond.h:55
#2 ThreadPool::drain (this=0x24ce470, wq=0x24cefd0) at common/WorkQueue.cc:252
#3 0x00000000006344b4 in drain (this=0x24cefd0) at ./common/WorkQueue.h:153
#4 OSD::handle_osd_map (this=0x24ce000, m=0x2d68000) at osd/OSD.cc:5294
#5 0x0000000000636deb in OSD::_dispatch (this=0x24ce000, m=0x2d68000) at osd/OSD.cc:4647
#6 0x00000000006374ec in OSD::ms_dispatch (this=0x24ce000, m=0x2d68000) at osd/OSD.cc:4446
#7 0x00000000009fa199 in ms_deliver_dispatch (m=0x2d68000, this=0x248b000) at msg/Messenger.h:587
#8 DispatchQueue::entry (this=0x248b0e8) at msg/DispatchQueue.cc:123
#9 0x00000000009356cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104
#10 0x00007ff858e68e9a in start_thread (arg=0x7ff84add5700) at pthread_create.c:308
#11 0x00007ff8572213fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
which kicks a bunch of PG peering state machines. They see that the osdmap shows our rank as up (not us, though.. we are restarting!) and send messages, which crashes with:
#0 0x00007ff858e70b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1 0x00000000008888be in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2 handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3 <signal handler called>
#4 0x00000000005e85ab in OSD::ms_get_authorizer (this=0x24ce000, dest_type=4, authorizer=0x7ff83f2ba338, force_new=false) at osd/OSD.cc:4471
#5 0x000000000092e9fd in ms_deliver_get_authorizer (force_new=<optimized out>, peer_type=4, this=0x248bc00) at msg/Messenger.h:661
#6 SimpleMessenger::get_authorizer (this=0x248bc00, peer_type=4, force_new=false) at msg/SimpleMessenger.cc:356
#7 0x0000000000a15664 in Pipe::connect (this=0x252c500) at msg/Pipe.cc:883
#8 0x0000000000a1890d in Pipe::writer (this=0x252c500) at msg/Pipe.cc:1518
#9 0x0000000000a227fd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#10 0x00007ff858e68e9a in start_thread (arg=0x7ff83f2bb700) at pthread_create.c:308
#11 0x00007ff8572213fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()

(gdb) f 4
#4 0x00000000005e85ab in OSD::ms_get_authorizer (this=0x24ce000, dest_type=4, authorizer=0x7ff83f2ba338, force_new=false) at osd/OSD.cc:4471
4471    osd/OSD.cc: No such file or directory.
(gdb) p monc->auth
$1 = (AuthClientHandler *) 0x0
this was on emperor, but the bug still exists in master.
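For context, the segfault is consistent with monc->auth still being NULL when the messenger asks the OSD for an authorizer. A simplified sketch of what OSD::ms_get_authorizer does at that point (an approximation based on the backtrace, not the exact emperor code):

    // Approximate sketch of the crash path implied by the backtrace above;
    // the real OSD::ms_get_authorizer differs in detail.
    bool OSD::ms_get_authorizer(int dest_type, AuthAuthorizer **authorizer,
                                bool force_new)
    {
      if (dest_type == CEPH_ENTITY_TYPE_MON)
        return true;

      // monc->auth is only set once MonClient::authenticate() has completed.
      // If Pipe::connect() runs while OSD::init() is still blocked in
      // authenticate() (first backtrace above), this dereferences a NULL
      // pointer: the "$1 = (AuthClientHandler *) 0x0" seen in gdb.
      *authorizer = monc->auth->build_authorizer(dest_type);
      return *authorizer != NULL;
    }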
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-12-31_19:40:07-upgrade:small-master-testing-basic-plana/20548$ cat orig.config.yaml
archive_path: /var/lib/teuthworker/archive/teuthology-2013-12-31_19:40:07-upgrade:small-master-testing-basic-plana/20548
description: upgrade/small/rgw/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-workload/s3tests.yaml 3-upgrade-sequence/upgrade-all.yaml 4-restart/restart.yaml 5-emperor-workload/final.yaml distro/ubuntu_12.04.yaml}
email: null
job_id: '20548'
kernel:
  kdb: true
  sha1: f48db1e9ac6f1578ab7efef9f66c70279e2f0cb5
last_in_suite: false
machine_type: plana
name: teuthology-2013-12-31_19:40:07-upgrade:small-master-testing-basic-plana
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug ms: 1
        debug osd: 5
    log-whitelist:
    - slow request
    sha1: cae663af403af202df76ea4df84b43f919b4a541
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: cae663af403af202df76ea4df84b43f919b4a541
  s3tests:
    branch: master
  workunit:
    sha1: cae663af403af202df76ea4df84b43f919b4a541
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
tasks:
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- rgw:
  - client.0
- s3tests:
    client.0:
      force-branch: dumpling
      rgw_server: client.0
- install.upgrade:
    all:
      branch: emperor
- ceph.restart:
  - osd.0
  - osd.1
  - osd.2
  - osd.3
  - mon.a
  - mon.b
  - mon.c
  - mds.a
  - rgw.client.0
- s3tests:
    client.0:
      rgw_server: client.0
teuthology_branch: master
verbose: true
Associated revisions
osd: do not send peering messages during init
Do not send any peering messages while we are still working our way
through init().
Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
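A minimal sketch of the kind of guard this commit message describes, assuming an is_initializing() state helper on OSD; the actual change may place the check elsewhere and look different:

    // Hypothetical sketch of the guard described above (not the literal diff
    // from the commit). Assumes an is_initializing() state helper on OSD.
    void OSD::dispatch_context(PG::RecoveryCtx &ctx, PG *pg, OSDMapRef curmap)
    {
      if (is_initializing()) {
        // We are still working through init(): do not send the queued peering
        // notifies, queries and infos to peers now. Peering will be redriven
        // once the OSD has booted into the map.
        return;
      }
      // ... normal path: send the queued peering messages to peers ...
    }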
osd: ignore OSDMap messages while we are initializing
The mon may occasionally send OSDMap messages to random OSDs, but is not
very discriminating in that we may not have authenticated yet. Ignore any
messages if that is the case; we will request whatever we need during the
BOOTING state.
Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
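A sketch of the check this implies, again assuming an is_initializing() helper; where the actual commit puts the check (e.g. _dispatch() vs. handle_osd_map()) and how it is worded may differ:

    // Hypothetical sketch of the check described above; not the literal diff.
    void OSD::handle_osd_map(MOSDMap *m)
    {
      if (is_initializing()) {
        // Not authenticated/booted yet: drop the map. Whatever maps we
        // actually need will be requested during the BOOTING state.
        m->put();
        return;
      }
      // ... normal osdmap handling ...
    }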
mon: only send messages to current OSDs
When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.
Fixes: #7093
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
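A sketch of the monitor-side check this describes, using OSDMap::is_up()/get_addr(); the helper name and call site here are illustrative, not taken from the commit:

    // Illustrative helper for the check described above: only treat a session
    // as a valid target if its address matches the current map entry for that
    // OSD id.
    bool session_matches_current_osd(const OSDMap &osdmap, int osd,
                                     const entity_addr_t &session_addr)
    {
      // If osd.N restarted, a stale session's address no longer matches the
      // map, so the mon should skip it rather than poke the new, possibly
      // still-initializing, instance.
      return osdmap.is_up(osd) && osdmap.get_addr(osd) == session_addr;
    }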
mon: only send messages to current OSDs
When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.
Fixes: #7093
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 98ed9ac5fed6eddf68f163086df72faabd9edcde)
osd: ignore OSDMap messages while we are initializing
The mon may occasionally send OSDMap messages to random OSDs, but is not
very discriminating in that we may not have authenticated yet. Ignore any
messages if that is the case; we will request whatever we need during the
BOOTING state.
Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f68de9f352d53e431b1108774e4a23adb003fe3f)
mon: only send messages to current OSDs
When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.
Fixes: #7093
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 98ed9ac5fed6eddf68f163086df72faabd9edcde)
osd: ignore OSDMap messages while we are initializing
The mon may occasionally send OSDMap messages to random OSDs, but is not
very discriminating in that we may not have authenticated yet. Ignore any
messages if that is the case; we will request whatever we need during the
BOOTING state.
Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f68de9f352d53e431b1108774e4a23adb003fe3f)
osd: do not send peering messages during init
Do not send any peering messages while we are still working our way
through init().
Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 35da8f9d80e0c6c33fb6c6e00f0bf38f1eb87d0e)
Signed-off-by: Greg Farnum <greg@inktank.com>
osd: do not send peering messages during init
Do not send any peering messages while we are still working our way
through init().
Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 35da8f9d80e0c6c33fb6c6e00f0bf38f1eb87d0e)
History
#1 Updated by Sage Weil over 9 years ago
several instances in teuthology-2014-01-02_19:40:02-upgrade:parallel-master-testing-basic-plana
#2 Updated by Sage Weil over 9 years ago
- Status changed from 12 to Fix Under Review
#3 Updated by Samuel Just over 9 years ago
Patches look good to me.
#4 Updated by Sage Weil over 9 years ago
- Status changed from Fix Under Review to Resolved
#5 Updated by Greg Farnum over 9 years ago
- Status changed from Resolved to Pending Backport
- Assignee set to Ian Colle
Backported this to dumpling in 183deb899bc6b1b7b2a1ec639425e45786e56b01
Do we also want to backport it to emperor?
#6 Updated by Greg Farnum over 9 years ago
- Priority changed from Urgent to Normal
#7 Updated by Florian Haas over 9 years ago
I would like to add the following comment because I have learned that this is related to an issue we have seen in the wild, and the original description does not make it easy for a regular user to connect this bug to the observed behavior. Maybe Ian or Sage could review whether this is accurate:
This problem can snowball: many OSDs in the cluster get stuck in a peer/crash/restart/peer/crash cycle, where the restart is triggered by the respawn directive in the ceph-osd upstart job. If multiple OSDs on the same node are affected, this may also leave nodes stuck with a high load average and/or unresponsive to console and SSH sessions.
#8 Updated by Ian Colle over 9 years ago
- Status changed from Pending Backport to Resolved
Backported to Emperor and Dumpling