Bug #7093

osd: peering can send messages prior to auth

Added by Sage Weil over 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assignee: Ian Colle
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

We are still authenticating:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000009061a9 in Wait (mutex=..., this=0x7fff71a29a70) at ./common/Cond.h:55
#2  MonClient::authenticate (this=0x7fff71a29660, timeout=0) at mon/MonClient.cc:448
#3  0x0000000000632927 in OSD::init (this=0x24ce000) at osd/OSD.cc:1228
#4  0x00000000005de24d in main (argc=<optimized out>, argv=<optimized out>) at ceph_osd.cc:468

But we get an osdmap:
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000009411d2 in Wait (mutex=..., this=0x24ce538) at common/Cond.h:55
#2  ThreadPool::drain (this=0x24ce470, wq=0x24cefd0) at common/WorkQueue.cc:252
#3  0x00000000006344b4 in drain (this=0x24cefd0) at ./common/WorkQueue.h:153
#4  OSD::handle_osd_map (this=0x24ce000, m=0x2d68000) at osd/OSD.cc:5294
#5  0x0000000000636deb in OSD::_dispatch (this=0x24ce000, m=0x2d68000) at osd/OSD.cc:4647
#6  0x00000000006374ec in OSD::ms_dispatch (this=0x24ce000, m=0x2d68000) at osd/OSD.cc:4446
#7  0x00000000009fa199 in ms_deliver_dispatch (m=0x2d68000, this=0x248b000) at msg/Messenger.h:587
#8  DispatchQueue::entry (this=0x248b0e8) at msg/DispatchQueue.cc:123
#9  0x00000000009356cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104
#10 0x00007ff858e68e9a in start_thread (arg=0x7ff84add5700) at pthread_create.c:308
#11 0x00007ff8572213fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()

This kicks a bunch of PG peering state machines; they see that the osdmap shows our rank as up (not us, though: we are restarting!) and send messages, which crashes with:
#0  0x00007ff858e70b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x00000000008888be in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x00000000005e85ab in OSD::ms_get_authorizer (this=0x24ce000, dest_type=4, authorizer=0x7ff83f2ba338, force_new=false) at osd/OSD.cc:4471
#5  0x000000000092e9fd in ms_deliver_get_authorizer (force_new=<optimized out>, peer_type=4, this=0x248bc00) at msg/Messenger.h:661
#6  SimpleMessenger::get_authorizer (this=0x248bc00, peer_type=4, force_new=false) at msg/SimpleMessenger.cc:356
#7  0x0000000000a15664 in Pipe::connect (this=0x252c500) at msg/Pipe.cc:883
#8  0x0000000000a1890d in Pipe::writer (this=0x252c500) at msg/Pipe.cc:1518
#9  0x0000000000a227fd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#10 0x00007ff858e68e9a in start_thread (arg=0x7ff83f2bb700) at pthread_create.c:308
#11 0x00007ff8572213fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
(gdb) f 4
#4  0x00000000005e85ab in OSD::ms_get_authorizer (this=0x24ce000, dest_type=4, authorizer=0x7ff83f2ba338, force_new=false) at osd/OSD.cc:4471
4471    osd/OSD.cc: No such file or directory.
(gdb) p monc->auth
$1 = (AuthClientHandler *) 0x0

This was on emperor, but the bug still exists in master.
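
To make the race easier to follow, here is a minimal standalone sketch (plain C++ with hypothetical names, not Ceph code) of the pattern: a dispatcher thread asks for an authorizer while the init thread has not finished authenticating, so the auth handler pointer (the analogue of monc->auth above) is still null. The sketch includes the null guard whose absence would produce exactly this SIGSEGV; whether the actual fix took this form or instead deferred dispatch until after authentication, I have not verified against the patch.

// Standalone sketch of the race (hypothetical names, not Ceph code).
// Build with: g++ -std=c++11 -pthread sketch.cc
#include <iostream>
#include <mutex>
#include <thread>

struct AuthHandler { int build_authorizer() const { return 42; } };

std::mutex mtx;
AuthHandler* auth = nullptr;   // plays the role of monc->auth: null until
                               // authenticate() has completed

// Plays the role of OSD::ms_get_authorizer(), called from the messenger.
bool get_authorizer() {
    std::lock_guard<std::mutex> l(mtx);
    if (!auth)                 // the guard whose absence would mean a null
        return false;          // dereference, as in the backtrace above
    std::cout << "authorizer: " << auth->build_authorizer() << "\n";
    return true;
}

int main() {
    // A dispatcher thread handles an osdmap and tries to send peering
    // messages before authentication has finished:
    std::thread dispatcher([] {
        if (!get_authorizer())
            std::cout << "auth not ready; cannot build authorizer yet\n";
    });
    dispatcher.join();         // force the "too early" ordering for the demo

    {                          // later, authenticate() completes...
        std::lock_guard<std::mutex> l(mtx);
        auth = new AuthHandler;
    }
    get_authorizer();          // ...and the same call now succeeds

    delete auth;
    return 0;
}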

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-12-31_19:40:07-upgrade:small-master-testing-basic-plana/20548$ cat orig.config.yaml 
archive_path: /var/lib/teuthworker/archive/teuthology-2013-12-31_19:40:07-upgrade:small-master-testing-basic-plana/20548
description: upgrade/small/rgw/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-workload/s3tests.yaml 3-upgrade-sequence/upgrade-all.yaml 4-restart/restart.yaml
  5-emperor-workload/final.yaml distro/ubuntu_12.04.yaml}
email: null
job_id: '20548'
kernel:
  kdb: true
  sha1: f48db1e9ac6f1578ab7efef9f66c70279e2f0cb5
last_in_suite: false
machine_type: plana
name: teuthology-2013-12-31_19:40:07-upgrade:small-master-testing-basic-plana
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug ms: 1
        debug osd: 5
    log-whitelist:
    - slow request
    sha1: cae663af403af202df76ea4df84b43f919b4a541
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: cae663af403af202df76ea4df84b43f919b4a541
  s3tests:
    branch: master
  workunit:
    sha1: cae663af403af202df76ea4df84b43f919b4a541
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
tasks:
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- rgw:
  - client.0
- s3tests:
    client.0:
      force-branch: dumpling
      rgw_server: client.0
- install.upgrade:
    all:
      branch: emperor
- ceph.restart:
  - osd.0
  - osd.1
  - osd.2
  - osd.3
  - mon.a
  - mon.b
  - mon.c
  - mds.a
  - rgw.client.0
- s3tests:
    client.0:
      rgw_server: client.0
teuthology_branch: master
verbose: true
#1

Updated by Sage Weil over 10 years ago

Several instances in teuthology-2014-01-02_19:40:02-upgrade:parallel-master-testing-basic-plana

#2

Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Fix Under Review
#3

Updated by Samuel Just over 10 years ago

Patches look good to me.

#4

Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Resolved
#5

Updated by Greg Farnum about 10 years ago

  • Status changed from Resolved to Pending Backport
  • Assignee set to Ian Colle

Backported this to dumpling in 183deb899bc6b1b7b2a1ec639425e45786e56b01

Do we also want to backport it to emperor?

#6

Updated by Greg Farnum about 10 years ago

  • Priority changed from Urgent to Normal
#7

Updated by Florian Haas about 10 years ago

I would like to add the following comment because I have learned that this bug is related to an issue we have seen in the wild, and the original description does not allow a regular user to draw a connection between this bug and the observed behavior. Perhaps Ian or Sage could review this for accuracy:

This problem can snowball, causing many OSDs in the cluster to get stuck in a peer/crash/restart/peer/crash cycle (where the restart is triggered by the respawn directive in the ceph-osd upstart job). If multiple OSDs on the same node are affected, this may also leave nodes stuck with a high load average and/or unresponsive to console and SSH sessions.
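
For illustration, the respawn mechanism mentioned above is conceptually just a supervisor loop around the daemon. The following is a minimal standalone sketch (hypothetical code, not the actual ceph-osd upstart job) showing why a daemon that crashes immediately after every start can pin a node: the supervisor restarts it in a tight loop.

// Hypothetical supervisor sketch; not the actual upstart implementation.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return 1;          // fork failed; give up
        if (pid == 0) {
            // Child stands in for ceph-osd; in this bug it crashes during
            // peering shortly after starting.
            execlp("ceph-osd", "ceph-osd", "-i", "0", (char*)nullptr);
            _exit(1);          // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);
        fprintf(stderr, "daemon died (status %d), respawning\n", status);
        sleep(1);              // without even this pause, the loop spins hot
    }
}

upstart's "respawn limit COUNT INTERVAL" stanza exists to break exactly this kind of loop by giving up after too many rapid respawns.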

#8

Updated by Ian Colle about 10 years ago

  • Status changed from Pending Backport to Resolved

Backported to Emperor and Dumpling
