Bug #8315 (closed): osd: watch callback vs callback funky

Added by Yuri Weinstein almost 10 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Source: Q/A
Backport: firefly
Severity: 3 - minor

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-05-07_19:15:07-upgrade:dumpling-dumpling-testing-basic-plana/242015/

Coredump from ceph-osd.5.log.gz in /a/teuthology-2014-05-07_19:15:07-upgrade:/log

    -2> 2014-05-08 09:42:01.703766 7f9e458f8700  5 --OSD::tracker-- reqid: unknown.0.0:0, seq: 198, time: 2014-05-08 09:42:01.703766, event: reached_pg, request: MOSDPGPush(1.d 69 [PushOp(d0c18e1d/605.00000000/head//1, version: 5'1, data_included: [0~230], data_size: 230, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(d0c18e1d/605.00000000/head//1@5'1, copy_subset: [0~230], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:230, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)),PushOp(d90270ad/mds_snaptable/head//1, version: 5'2, data_included: [0~46], data_size: 46, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(d90270ad/mds_snaptable/head//1@5'2, copy_subset: [0~46], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:46, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v1
    -1> 2014-05-08 09:42:01.703787 7f9e458f8700 10 osd.5 pg_epoch: 69 pg[1.d( v 5'2 lc 0'0 (0'0,5'2] local-les=69 n=2 ec=1 les/c 54/54 68/68/53) [1,5] r=1 lpr=69 pi=53-67/1 luod=0'0 active m=2] do_pending_flush doing pending flush
     0> 2014-05-08 09:42:01.711999 7f9e450f7700 -1 *** Caught signal (Aborted) **
 in thread 7f9e450f7700

 ceph version 0.67.8-5-g0e685c6 (0e685c68aa6a500aa7fa433cd9b8246f70c5383e)
 1: ceph-osd() [0x7fd33a]
 2: (()+0xfcb0) [0x7f9e59205cb0]
 3: (gsignal()+0x35) [0x7f9e5728b425]
 4: (abort()+0x17b) [0x7f9e5728eb8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e57bde69d]
 6: (()+0xb5846) [0x7f9e57bdc846]
 7: (()+0xb5873) [0x7f9e57bdc873]
 8: (()+0xb596e) [0x7f9e57bdc96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8c6f5f]
 10: (Watch::get_delayed_cb()+0xc3) [0x6ddad3]
 11: (ReplicatedPG::handle_watch_timeout(std::tr1::shared_ptr<Watch>)+0x9e9) [0x5f0789]
 12: (ReplicatedPG::check_blacklisted_obc_watchers(ObjectContext*)+0x3ba) [0x5f0dfa]
 13: (ReplicatedPG::populate_obc_watchers(ObjectContext*)+0x60b) [0x5f16db]
 14: (ReplicatedPG::get_object_context(hobject_t const&, bool)+0x1c4) [0x5f2084]
 15: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, int, std::map<int, std::vector<PushOp, std::allocator<PushOp> >, std::less<int>, std::allocator<std::pair<int const, std::vector<PushOp, std::allocator<PushOp> > > > >*)+0x10e) [0x6022fe]
 16: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x657) [0x603847]
 17: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&)+0x736) [0x621ff6]
 18: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x1b8) [0x68a578]
 19: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x11) [0x6c9ce1]
 20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b7796]
 21: (ThreadPool::WorkThread::entry()+0x10) [0x8b95a0]
 22: (()+0x7e9a) [0x7f9e591fde9a]
 23: (clone()+0x6d) [0x7f9e573493fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-05-08T09:43:23.799 ERROR:teuthology.misc:Saw exception from osd.5
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-dumpling/teuthology/misc.py", line 831, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/teuthology-dumpling/teuthology/task/ceph.py", line 36, in stop
    run.wait([self.proc])
  File "/home/teuthworker/teuthology-dumpling/teuthology/orchestra/run.py", line 282, in wait
    proc.exitstatus.get()
  File "/usr/lib/python2.7/dist-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.132.31 with status 1: '/home/ubuntu/cephtest/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage sudo /home/ubuntu/cephtest/daemon-helper kill ceph-osd -f -i 5'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-05-07_19:15:07-upgrade:dumpling-dumpling-testing-basic-plana/242015
branch: dumpling
description: upgrade/dumpling/rbd/{0-cluster/start.yaml 1-dumpling-install/v0.67.4.yaml
  2-workload/rbd.yaml 3-upgrade-sequence/upgrade-mds-mon-osd.yaml 4-final/osdthrash.yaml}
email: null
job_id: '242015'
kernel: &id001
  kdb: true
  sha1: d35a5d13758c7a7aeb19dbd17e51b4c9a712ca2c
last_in_suite: false
machine_type: plana
name: teuthology-2014-05-07_19:15:07-upgrade:dumpling-dumpling-testing-basic-plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: dumpling
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    fs: xfs
    log-whitelist:
    - slow request
    - scrub
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 0e685c68aa6a500aa7fa433cd9b8246f70c5383e
  ceph-deploy:
    branch:
      dev: dumpling
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 0e685c68aa6a500aa7fa433cd9b8246f70c5383e
  s3tests:
    branch: dumpling
  workunit:
    sha1: 0e685c68aa6a500aa7fa433cd9b8246f70c5383e
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mon.c
  - osd.3
  - osd.4
  - osd.5
  - client.0
suite: upgrade:dumpling
targets:
  ubuntu@plana47.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDIDs+Sl9mfdFX0VZy41zfHkMJlD1OtErcQsSPLq3wrQ5HPPflg8IRtS45LHsYBAyBcAl4fmeG1q45jOrHj6yibJhT/HzxO6TOikYKUgr3uWy804NsTAP/A2SJ4wA8Y+1WyiAjBs0EAnISCnIl5Arj17NYPBroDeclKqZEJpL5UbPCmMibqEmjIPNVNulukox/Dh9GJZWlxqD0aKfTRnIOM+6RWHv2h9mCYVtU2ZbMx+QR0sxji5IJzg4ITlGaR9g+U3DnOU95pkxb3x6mE+H8udB58FONty/DHaAC/rXGfjkg056R3iq8FJUpF3HKKjILLRHjLb3ueK3tN1zrEueCt
  ubuntu@plana49.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC8zmyBYSlUYcY+7hO2WPa2uECTc3U4k/Sm7DY/YZAEtc0Z8i1FpvFa1MA1PV3g2DAzo+E6JIBFzbFSrgXeDsEMFl1Qy2jdrvnq66veVOEyDk4y7I4wseE/eTSnhTQE720cnWq1DuVyFgcyRvYGJo3HHoFTgxLwvPYdhfLPq8elBNJmuOllsqCgsmX5rwqZmsQcP29aus/1s3IVF1TEHYmJLTwRb8qW6W3r55C2ZbSwrvvNXb7zFyydCgAfPC+8dfqiMfp3T7Gc56HUHAWvXU9EnxDINHQmLhrdccFyq0BVd+i2m1ubuXULK2RMImFojhEnkLTFH2Qa5hSegvLLffab
tasks:
- internal.lock_machines:
  - 2
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    tag: v0.67.4
- ceph: null
- parallel:
  - workload
  - upgrade-sequence
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- workunit:
    clients:
      client.0:
      - rbd/test_lock_fence.sh
teuthology_branch: dumpling
upgrade-sequence:
  sequential:
  - install.upgrade:
      all:
        branch: dumpling
  - ceph.restart:
    - mds.a
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.a
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.b
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.c
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.0
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.1
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.2
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.3
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.4
  - sleep:
      duration: 30
  - ceph.restart:
    - osd.5
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.19218
workload:
  sequential:
  - workunit:
      clients:
        client.0:
        - rbd/import_export.sh
      env:
        RBD_CREATE_ARGS: --new-format
  - workunit:
      clients:
        client.0:
        - cls/test_cls_rbd.sh
description: upgrade/dumpling/rbd/{0-cluster/start.yaml 1-dumpling-install/v0.67.4.yaml
  2-workload/rbd.yaml 3-upgrade-sequence/upgrade-mds-mon-osd.yaml 4-final/osdthrash.yaml}
duration: 818.8523149490356
failure_reason: 'Command failed on 10.214.132.31 with status 1: ''/home/ubuntu/cephtest/adjust-ulimits
  ceph-coverage /home/ubuntu/cephtest/archive/coverage sudo /home/ubuntu/cephtest/daemon-helper
  kill ceph-osd -f -i 5'''
flavor: basic
mon.a-kernel-sha1: d35a5d13758c7a7aeb19dbd17e51b4c9a712ca2c
mon.b-kernel-sha1: d35a5d13758c7a7aeb19dbd17e51b4c9a712ca2c
owner: scheduled_teuthology@teuthology
success: false

Related issues: 2 (0 open, 2 closed)

Has duplicate: Ceph - Bug #9364: "Assertion: osd/Watch.cc: 290: FAILED assert(!cb)" in upgrade:dumpling-dumpling-distro-basic-vps suite (Duplicate, 09/05/2014)

Has duplicate: Ceph - Bug #9472: osd crash in -upgrade:dumpling-dumpling-distro-basic-vps suite (Duplicate, 09/15/2014)
#1

Updated by Yuri Weinstein almost 10 years ago

I could not reproduce it on re-run; logs are in /a/teuthology-2014-05-07_19:15:07-upgrade:dumpling-dumpling-testing-basic-plana/242015/

#2

Updated by Yuri Weinstein almost 10 years ago

  • Source changed from other to Q/A
#3

Updated by Samuel Just almost 10 years ago

  • Priority changed from Normal to Urgent
#4

Updated by Samuel Just almost 10 years ago

Tricky... check_blacklisted_obc_watchers does not clear watch->cb before calling handle_watch_timeout. Watch.cc suggests that watch->cb should be cleared (because it is currently in the process of being invoked) before handle_watch_timeout (which might replace cb with a DelayedWatchTimeout) is called. Seems like check_blacklisted_obc_watchers needs to "cancel", in some sense, the scheduled Handle[Delayed]WatchTimeout currently occupying cb.
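
To make the described sequence concrete, here is a minimal, self-contained C++ sketch of that callback lifecycle. It is not the actual Ceph code: the Watch, cb, get_delayed_cb, and handle_watch_timeout names mirror the report, but the types, the cancel_pending_cb() helper, and the callback bodies are hypothetical simplifications for illustration.

    // Illustrative sketch only -- simplified stand-ins for the Ceph types.
    #include <cassert>
    #include <cstdio>
    #include <functional>
    #include <memory>

    struct Watch {
        // At most one timeout callback may be scheduled at a time; the
        // non-empty function doubles as the "a callback is pending" flag.
        std::function<void()> cb;

        // Mirrors Watch::get_delayed_cb(): installing a new delayed
        // callback asserts that no callback is currently scheduled.
        std::function<void()> get_delayed_cb() {
            assert(!cb);  // the analogue of osd/Watch.cc:290 assert(!cb)
            cb = [] { std::puts("delayed watch timeout fired"); };
            return cb;
        }

        // Hypothetical helper for the remedy suggested above: drop the
        // Handle[Delayed]WatchTimeout currently occupying cb.
        void cancel_pending_cb() { cb = nullptr; }
    };

    // Stand-in for ReplicatedPG::handle_watch_timeout(): it may replace
    // the pending callback with a DelayedWatchTimeout via get_delayed_cb().
    void handle_watch_timeout(const std::shared_ptr<Watch>& w) {
        w->get_delayed_cb();
    }

    int main() {
        auto w = std::make_shared<Watch>();

        // A timeout callback is already scheduled (e.g. the one in the
        // middle of being invoked when the OSD notices the watcher's
        // client is blacklisted).
        w->cb = [] { std::puts("original watch timeout"); };

        // Buggy path, as in check_blacklisted_obc_watchers(): calling
        // handle_watch_timeout() while cb is still occupied trips the
        // assert, as in the backtrace above.
        // handle_watch_timeout(w);  // would abort: FAILED assert(!cb)

        // Suggested remedy: first "cancel" the scheduled callback.
        w->cancel_pending_cb();
        handle_watch_timeout(w);  // now installs the delayed callback safely
        w->cb();                  // the rescheduled timeout can then fire
        return 0;
    }

Compiled with any C++11 compiler, this prints "delayed watch timeout fired"; re-enabling the commented-out call instead aborts with an assertion failure analogous to the FAILED assert(!cb) at osd/Watch.cc:290 seen in the crash.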

#5

Updated by Samuel Just almost 10 years ago

Also, the bug manifested in dumpling, but appears to exist in the same form in master as well.

#6

Updated by Sage Weil almost 10 years ago

  • Subject changed from osd.5 crashed in upgrade:dumpling-dumpling-testing-basic-plana suite to osd: watch callback vs callback funky
  • Status changed from New to 12
#7

Updated by Samuel Just almost 10 years ago

  • Priority changed from Urgent to High
#8

Updated by Samuel Just over 9 years ago

  • Priority changed from High to Urgent
#9

Updated by Samuel Just over 9 years ago

  • Assignee set to Samuel Just
#10

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7
#11

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Fix Under Review
#12

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
#13

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved

firefly

#14

Updated by Loïc Dachary about 9 years ago

  • Backport set to firefly