Bug #9205

osd: notify ops reordered

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee: Samuel Just
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps/439539/

Coredump in /a/teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps/439539/*/ceph-osd.4.log.gz

remote/vpm126/log/ceph-osd.4.log.gz:2014-08-21 21:47:57.275862 7fa973a7a700 -1 *** Caught signal (Aborted) **
remote/vpm126/log/ceph-osd.4.log.gz: in thread 7fa973a7a700
remote/vpm126/log/ceph-osd.4.log.gz:
remote/vpm126/log/ceph-osd.4.log.gz: ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
remote/vpm126/log/ceph-osd.4.log.gz: 1: ceph-osd() [0x9a8a0a]
remote/vpm126/log/ceph-osd.4.log.gz: 2: (()+0xfcb0) [0x7fa9916f6cb0]
remote/vpm126/log/ceph-osd.4.log.gz: 3: (gsignal()+0x35) [0x7fa98ffe14f5]
remote/vpm126/log/ceph-osd.4.log.gz: 4: (abort()+0x17b) [0x7fa98ffe4c5b]
remote/vpm126/log/ceph-osd.4.log.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa99093469d]
remote/vpm126/log/ceph-osd.4.log.gz: 6: (()+0xb5846) [0x7fa990932846]
remote/vpm126/log/ceph-osd.4.log.gz: 7: (()+0xb5873) [0x7fa990932873]
remote/vpm126/log/ceph-osd.4.log.gz: 8: (()+0xb596e) [0x7fa99093296e]
remote/vpm126/log/ceph-osd.4.log.gz: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0xa8cf7f]
remote/vpm126/log/ceph-osd.4.log.gz: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1ab8) [0x81cf48]
remote/vpm126/log/ceph-osd.4.log.gz: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2a5c) [0x826bcc]
remote/vpm126/log/ceph-osd.4.log.gz: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63f) [0x7c21ef]
remote/vpm126/log/ceph-osd.4.log.gz: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a2) [0x64ac22]
remote/vpm126/log/ceph-osd.4.log.gz: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x64b6e1]
remote/vpm126/log/ceph-osd.4.log.gz: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa7de9c]
remote/vpm126/log/ceph-osd.4.log.gz: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa7f780]
remote/vpm126/log/ceph-osd.4.log.gz: 17: (()+0x7e9a) [0x7fa9916eee9a]

Traceback:

2014-08-21T14:54:53.999 ERROR:teuthology.misc:Saw exception from osd.4
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 1044, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/daemon.py", line 45, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 424, in wait
    proc.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 102, in wait
    exitstatus=status, node=self.hostname)
CommandFailedError: Command failed on vpm126 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 4'
2014-08-21T14:54:54.025 INFO:teuthology.misc:Shutting down mon daemons...

archive_path: /var/lib/teuthworker/archive/teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps/439539
branch: master
description: upgrade:dumpling-x:stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml
  6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}
email: ceph-qa@ceph.com
job_id: '439539'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: vps
name: teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: b0aa846b3f81225a779de00100e15334fb8156b3
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: b0aa846b3f81225a779de00100e15334fb8156b3
  rgw:
    default_idle_timeout: 1200
  s3tests:
    branch: master
  workunit:
    sha1: b0aa846b3f81225a779de00100e15334fb8156b3
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
suite: upgrade:dumpling-x:stress-split
suite_branch: master
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
targets:
  ubuntu@vpm126.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCfbQRWnvxmwbgai9ELh6/tNjbWMPp4Ckjg9F6/F9YG1yYBM7Hpyy6MFgV4aUFXsvqvwqpa6Mz6P49WLHQ+AdfuCFoQ2wNmvPUWhVuddmoVwDasoseiS1mH9IBlTinxOCDxSQlczezr0SltyRNJYIbabyCmupb+oAftgpxRMA0Lbmmhan1kZSEHKAK4E0OS3hgvDwtw9kylMXUz0Km72pAvq0jfY3vqUbI7FLDpZN/p7BK3eUwx1H1gFaMpzTtWNZ+keRr41nz042oxYEijhpG/zPq2+zfcd7qedHa2V1sIZRc5L12fuoywkPSc4XbzL5mCrXcxxtl4mHnOU2o8wWz/
  ubuntu@vpm127.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC35kg1Z1yLwYoic/dcB7cYTTrD0bCRd7Wl7zxMtg24xPsGHOqn1iuCBYi0LnXSlDM3OxP4ziPVEuMj5jvUXQdINA6VQJDmrTNg+JA0lgiVDVwk7MLGK4dSpzxLeMSAMwWuKmvjCLhlKElmGfyHY7wBt1X61StiUXT+YzlQMzFzJafftO367ZrLZgZT1pzD/DbT/tu2rpitKDcOc6dEPYp3A8O3AvjFe6ZIuYDy5ppRTO1SSlLK/qbybvPq0lwxmFqWksHJhhDAoyISpPIJJvdS8Y2DPuc0sxXjvcepN0vTsFGxtJcAAq+VY4LfkqOjuadZy6OOUL62vNKU77PSnOhj
  ubuntu@vpm128.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/KLph+KmwNuX9PIKpea4V31ca2CCp+ymms56ZivTLTz+gwxoqoDauCEEOxrttz/EPZiwgLa+EICBuUfQpJ5rAdYcYN4zQ6SebSIjMTa0I0vNrqlYuIK822rMYVJDK1zsB/b3qq9KcxhydcBGi4UZuCHZhtKL6ggHmUW5FCvY8pP2LvuRjcB3xFoFiV0EiBGVTddAT6rgiiWPXVzuJW8FJpZCq2YFiD0U4vctcNeQ5LizWl2Dj3ViGGSvajJ4D4K4IB1v3xUiA9/3UhggkFeFBm2RgkZjMwAboStSzginBcniLOt8MSpbMcBb+qA2d8vF30+NnXK5JIExqf5kbJm3X
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- radosbench:
    clients:
    - client.0
    time: 1800
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0: null
    default_idle_timeout: 300
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: master
tube: vps
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.4613
description: upgrade:dumpling-x:stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml
  6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}
duration: 13344.698642015457
failure_reason: 'Command failed on vpm126 with status 1: ''sudo adjust-ulimits ceph-coverage
  /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 4'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Actions #1

Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil over 9 years ago

 -8946> 2014-08-21 21:47:55.559494 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] execute_ctx 0x4ab8200
 -8945> 2014-08-21 21:47:55.559523 7fa973a7a700  5 -- op tracker -- , seq: 44897, time: 2014-08-21 21:47:55.559516, event: started, op: osd_op(client.6612.0:13027 notify.0 [watch add cookie 1 ver 0] 142.4322fa9f ondisk+read+write e3232)
 -8944> 2014-08-21 21:47:55.559530 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_op 4322fa9f/notify.0/head//142 [watch add cookie 1 ver 0] ov 3243'15 av 3243'16 snapc 0=[] snapset 0=[]:[]+head
 -8943> 2014-08-21 21:47:55.559550 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op 4322fa9f/notify.0/head//142 [watch add cookie 1 ver 0]
 -8942> 2014-08-21 21:47:55.559566 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op  watch add cookie 1 ver 0
 -8941> 2014-08-21 21:47:55.559582 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] watch: ctx->obc=0x9f0d000 cookie=1 oi.version=15 ctx->at_version=3243'16
 -8940> 2014-08-21 21:47:55.559597 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] watch: oi.user_version=12
 -8939> 2014-08-21 21:47:55.559611 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] watch: peer_addr=10.214.140.29:0/1007463
 -8938> 2014-08-21 21:47:55.559631 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  registered new watch watch(cookie 1 30s 10.214.140.29:0/1007463) by client.6612
 -8937> 2014-08-21 21:47:55.559657 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op_effects on session 0x3f43c80
 -8936> 2014-08-21 21:47:55.559675 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op_effects applying watch connect on session 0x3f43c80 watcher 1,client.6612
 -8935> 2014-08-21 21:47:55.559690 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op_effects new watcher 1,client.6612
 -8934> 2014-08-21 21:47:55.559706 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  -- Watch(1,client.6612) Watch()
 -8933> 2014-08-21 21:47:55.559726 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  -- Watch(1,client.6612) connecting
 -8932> 2014-08-21 21:47:55.559746 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  -- Watch(1,client.6612) unregister_cb
 -8931> 2014-08-21 21:47:55.559765 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] make_writeable 4322fa9f/notify.0/head//142 snapset=0x3a37818  snapc=0=[]
 -8930> 2014-08-21 21:47:55.559785 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] make_writeable 4322fa9f/notify.0/head//142 done, snapset=0=[]:[]+head
 -8929> 2014-08-21 21:47:55.559802 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] finish_ctx 4322fa9f/notify.0/head//142 0x4ab8200 op modify  
 -8928> 2014-08-21 21:47:55.559824 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  set mtime to 2014-08-21 19:58:42.038946
 -8927> 2014-08-21 21:47:55.559854 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  final snapset 0=[]:[]+head in 4322fa9f/notify.0/head//142
 -8926> 2014-08-21 21:47:55.559882 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  zeroing write result code 0
 -8925> 2014-08-21 21:47:55.559900 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded]  op order client.6612 tid 13027 last was 13030
 -8924> 2014-08-21 21:47:55.559916 7fa973a7a700 -1 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] bad op order, already applied 13030 > this 13027
  -604> 2014-08-21 21:47:56.870603 7fa973a7a700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7fa973a7a700 time 2014-08-21 21:47:55.559943
osd/ReplicatedPG.cc: 1924: FAILED assert(0 == "out of order op")

 ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
 1: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1ab8) [0x81cf48]
 2: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2a5c) [0x826bcc]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63f) [0x7c21ef]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a2) [0x64ac22]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x64b6e1]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa7de9c]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa7f780]
 8: (()+0x7e9a) [0x7fa9916eee9a]
 9: (clone()+0x6d) [0x7fa99009f73d]
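
The assert corresponds to a per-client ordering check: the PG remembers the highest tid it has already applied for each client and expects new writes from that client to arrive with strictly increasing tids (see the "op order client.6612 tid 13027 last was 13030" line above). A minimal sketch of that kind of check, for illustration only; this is not the actual ReplicatedPG code, and OpOrderChecker/last_applied_tid are made-up names:

#include <cassert>
#include <cstdint>
#include <map>

// Stand-in for the client identity (e.g. client.6612).
struct ClientId {
  uint64_t gid;
  bool operator<(const ClientId &o) const { return gid < o.gid; }
};

struct OpOrderChecker {
  std::map<ClientId, uint64_t> last_applied_tid;  // highest tid applied per client

  void check_and_record(ClientId client, uint64_t tid) {
    auto it = last_applied_tid.find(client);
    if (it != last_applied_tid.end() && it->second >= tid) {
      // Matches the log: "bad op order, already applied 13030 > this 13027"
      assert(0 == "out of order op");
    }
    last_applied_tid[client] = tid;
  }
};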

Actions #3

Updated by Sage Weil over 9 years ago

  • Subject changed from osd crashed in upgrade:dumpling-x:stress-split-master-distro-basic-vps to osd: notify ops reordered
  • Category set to OSD
Actions #4

Updated by Sage Weil over 9 years ago

first 2 ops arrive:

2014-08-21 21:46:52.167158 7fa969d3f700  1 -- 10.214.140.54:6805/15229 <== client.6612 10.214.140.29:0/1007463 518 ==== osd_op(client.6612.0:13027 notify.0 [watch add cookie 1 ver 0] 142.4322fa9f ondisk+read+write e3232) v4 ==== 151+0+0 (2463101563 0 0) 0x5c98500 con 0x21992c0
2014-08-21 21:46:52.167266 7fa969d3f700 20 osd.4 3232 should_share_map client.6612 10.214.140.29:0/1007463 3232
2014-08-21 21:46:52.186899 7fa969d3f700  1 -- 10.214.140.54:6805/15229 <== client.6612 10.214.140.29:0/1007463 519 ==== osd_op(client.6612.0:13029 notify.7 [watch add cookie 8 ver 0] 142.84ada7c9 ondisk+read+write e3232) v4 ==== 151+0+0 (2889371413 0 0) 0x39b8780 con 0x21992c0
2014-08-21 21:46:52.187032 7fa969d3f700 20 osd.4 3232 should_share_map client.6612 10.214.140.29:0/1007463 3232

we create the pg:

2014-08-21 21:46:52.964072 7fa980394700 15 osd.4 3232 project_pg_history 142.7 from 3231 to 3232, start ec=1196 les/c 3184/3191 3231/3231/3231
2014-08-21 21:46:52.964093 7fa980394700 15 osd.4 3232 project_pg_history end ec=1196 les/c 3184/3191 3231/3231/3231
2014-08-21 21:46:52.964111 7fa980394700 20 osd.4 3232 _create_lock_pg pgid 142.7
2014-08-21 21:46:52.964116 7fa980394700 10 osd.4 3232 _open_lock_pg 142.7
2014-08-21 21:46:52.964125 7fa980394700 10 osd.4 3232 _get_pool 142
2014-08-21 21:46:52.964172 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7(unlocked)] enter Initial
2014-08-21 21:46:52.964193 7fa980394700 20 osd.4 pg_epoch: 3231 pg[142.7(unlocked)] enter NotTrimming
2014-08-21 21:46:52.964209 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0) [] r=0 lpr=0 crt=0'0 inactive] init role 0 up [4,5] acting [4,5] history ec=1196 les/c 3184/3191 3231/3231/3231 1 past_intervals
2014-08-21 21:46:52.964267 7fa980394700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 4294967295'18446744073709551615, trimmed: 
2014-08-21 21:46:52.964282 7fa980394700  7 osd.4 3232 _create_lock_pg pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive]
2014-08-21 21:46:52.964293 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] handle_create
2014-08-21 21:46:52.964310 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000139 0 0.000000
2014-08-21 21:46:52.964330 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Reset
2014-08-21 21:46:52.964342 7fa980394700 20 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] set_last_peering_reset 3231
2014-08-21 21:46:52.964359 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] update_heartbeat_peers  -> 4,5
2014-08-21 21:46:52.964373 7fa980394700 20 osd.4 3232 need_heartbeat_peer_update
2014-08-21 21:46:52.964377 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] take_waiters
2014-08-21 21:46:52.964390 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000060 1 0.000048
2014-08-21 21:46:52.964407 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Started
2014-08-21 21:46:52.964420 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Start
2014-08-21 21:46:52.964432 7fa980394700  1 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
2014-08-21 21:46:52.964448 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] exit Start 0.000027 0 0.000000
2014-08-21 21:46:52.964465 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
2014-08-21 21:46:52.964479 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
2014-08-21 21:46:52.964492 7fa980394700  5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
2014-08-21 21:46:52.964505 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] _calc_past_interval_range: already have past intervals back to 3191
2014-08-21 21:46:52.964520 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering]  PriorSet: build_prior interval(3175-3230 up [5,1](5) acting [5,1](5) maybe_went_rw)
2014-08-21 21:46:52.964538 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering]  PriorSet: build_prior final: probe 1,4,5 down  blocked_by {}
2014-08-21 21:46:52.964551 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] up_thru 3180 < same_since 3231, must notify monitor
2014-08-21 21:46:52.964566 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  querying info from osd.1
2014-08-21 21:46:52.964582 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  querying info from osd.5
2014-08-21 21:46:52.964601 7fa980394700 15 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 3231:1
2014-08-21 21:46:52.964618 7fa980394700 10 log is not dirty
2014-08-21 21:46:52.964629 7fa980394700 20 osd.4 3232 share_map_peer 0x3260dc0 already has epoch 3232
2014-08-21 21:46:52.964633 7fa980394700  7 osd.4 3232 do_queries querying osd.1 on 1 PGs
2014-08-21 21:46:52.964638 7fa980394700  1 -- 10.214.140.54:6806/15229 --> 10.214.140.163:6801/30961 -- pg_query(142.7 epoch 3232) v3 -- ?+0 0x5302c00 con 0x3260dc0
2014-08-21 21:46:52.964672 7fa978283700  2 filestore(/var/lib/ceph/osd/ceph-4) waiting 51 > 50 ops || 61902 > 104857600
2014-08-21 21:46:52.964687 7fa977a82700  2 filestore(/var/lib/ceph/osd/ceph-4) waiting 51 > 50 ops || 62200 > 104857600
2014-08-21 21:46:52.963884 7fa984fce700 10 filestore(/var/lib/ceph/osd/ceph-4) create_collection /var/lib/ceph/osd/ceph-4/current/0.62_head = 0
2014-08-21 21:46:52.964838 7fa984fce700 10 filestore(/var/lib/ceph/osd/ceph-4) _set_replay_guard 207478.0.0
2014-08-21 21:46:52.965005 7fa980394700 20 osd.4 3232 share_map_peer 0x433d000 already has epoch 3232
2014-08-21 21:46:52.965014 7fa980394700  7 osd.4 3232 do_queries querying osd.5 on 1 PGs
2014-08-21 21:46:52.965018 7fa980394700  1 -- 10.214.140.54:6806/15229 --> 10.214.140.54:6810/17908 -- pg_query(142.7 epoch 3232) v3 -- ?+0 0x5302000 con 0x433d000
2014-08-21 21:46:52.966261 7fa980394700  5 filestore(/var/lib/ceph/osd/ceph-4) queue_transactions new osr(142.7 0x8d19340)/0x8d19340
2014-08-21 21:46:52.966276 7fa980394700  2 filestore(/var/lib/ceph/osd/ceph-4) waiting 51 > 50 ops || 62219 > 104857600

but it blocks submitting the transaction...

meanwhile, we get another message:

2014-08-21 21:46:53.141163 7fa969d3f700  1 -- 10.214.140.54:6805/15229 <== client.6612 10.214.140.29:0/1007463 520 ==== osd_op(client.6612.0:13030 notify.0 [watch remove cookie 1 ver 0] 142.4322fa9f ondisk+write e3233) v4 ==== 151+0+0 (3710587269 0 0) 0x39b8a00 con 0x21992c0
2014-08-21 21:46:53.141261 7fa969d3f700  1 -- 10.214.140.54:6805/15229 --> 10.214.140.163:6790/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0,osdmap=3233}) v2 -- ?+0 0x4f60600 con 0x75f92c0

and then we finish creating our pg and call wake_pg_waiters:

2014-08-21 21:46:53.775049 7fa980394700 10 osd.4 3232 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] is new

and then we enqueue in the wrong order:

2014-08-21 21:46:53.879269 7fa981b97700 15 osd.4 3234 enqueue_op 0x85eac00 prio 63 cost 0 latency 0.738232 osd_op(client.6612.0:13030 notify.0 [watch remove cookie 1 ver 0] 142.4322fa9f ondisk+write e3233) v4
...
2014-08-21 21:46:53.879312 7fa981b97700 15 osd.4 3234 enqueue_op 0x85ea400 prio 63 cost 0 latency 1.712259 osd_op(client.6612.0:13027 notify.0 [watch add cookie 1 ver 0] 142.4322fa9f ondisk+read+write e3232) v4
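
In other words, the op that arrived first (tid 13027) was parked waiting for the PG to be created, while the later op (tid 13030) went straight into the dispatch queue, and when the PG appeared the parked op was appended behind it. A minimal sketch of that reordering, using plain std::list queues as stand-ins for the real OSD structures (waiting_for_map/waiting_for_pg here are illustrative only):

#include <cstdint>
#include <iostream>
#include <list>

int main() {
  std::list<uint64_t> waiting_for_map;   // ops waiting to be dispatched, in order
  std::list<uint64_t> waiting_for_pg;    // ops parked until the PG exists

  waiting_for_pg.push_back(13027);       // arrives first, PG 142.7 not created yet
  waiting_for_map.push_back(13030);      // arrives later, goes straight to the queue

  // PG creation completes; the buggy path appends the parked ops at the back:
  waiting_for_map.splice(waiting_for_map.end(), waiting_for_pg);

  for (uint64_t tid : waiting_for_map)
    std::cout << "dequeue tid " << tid << "\n";   // prints 13030, then 13027
  return 0;
}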

Actions #5

Updated by Samuel Just over 9 years ago

OSD::session_notify_pg_create queues the pg waiters at the end of the waiting_for_map queue rather than at the beginning. Oops.
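
A sketch of the fix direction this describes, again with stand-in std::list queues rather than the real OSD data structures: splice the PG waiters to the front of the map-waiting queue so they keep their arrival order relative to ops that came in while the PG was being created.

#include <cstdint>
#include <list>

// waiting_for_map / waiting_for_pg are illustrative stand-ins, not the OSD's types.
void requeue_pg_waiters(std::list<uint64_t> &waiting_for_map,
                        std::list<uint64_t> &waiting_for_pg) {
  // Buggy behaviour seen in the log: append at the back, so parked ops end up
  // behind ops that arrived later:
  //   waiting_for_map.splice(waiting_for_map.end(), waiting_for_pg);
  // Intended behaviour: put the parked (older) ops back at the front.
  waiting_for_map.splice(waiting_for_map.begin(), waiting_for_pg);
}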

Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from New to 7
Actions #7

Updated by Sage Weil over 9 years ago

  • Assignee set to Samuel Just
Actions #8

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
Actions #9

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved

Does not need to be backported.
