Bug #9205
closed
osd: notify ops reordered
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Coredump in /a/teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps/439539/*/ceph-osd.4.log.gz
remote/vpm126/log/ceph-osd.4.log.gz:
2014-08-21 21:47:57.275862 7fa973a7a700 -1 *** Caught signal (Aborted) **
 in thread 7fa973a7a700
 ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
 1: ceph-osd() [0x9a8a0a]
 2: (()+0xfcb0) [0x7fa9916f6cb0]
 3: (gsignal()+0x35) [0x7fa98ffe14f5]
 4: (abort()+0x17b) [0x7fa98ffe4c5b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa99093469d]
 6: (()+0xb5846) [0x7fa990932846]
 7: (()+0xb5873) [0x7fa990932873]
 8: (()+0xb596e) [0x7fa99093296e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0xa8cf7f]
 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1ab8) [0x81cf48]
 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2a5c) [0x826bcc]
 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63f) [0x7c21ef]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a2) [0x64ac22]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x64b6e1]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa7de9c]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa7f780]
 17: (()+0x7e9a) [0x7fa9916eee9a]
Traceback:
2014-08-21T14:54:53.999 ERROR:teuthology.misc:Saw exception from osd.4
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 1044, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/daemon.py", line 45, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 424, in wait
    proc.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 102, in wait
    exitstatus=status, node=self.hostname)
CommandFailedError: Command failed on vpm126 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 4'
2014-08-21T14:54:54.025 INFO:teuthology.misc:Shutting down mon daemons...
archive_path: /var/lib/teuthworker/archive/teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps/439539 branch: master description: upgrade:dumpling-x:stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml 6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml} email: ceph-qa@ceph.com job_id: '439539' kernel: &id001 kdb: true sha1: distro last_in_suite: false machine_type: vps name: teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps nuke-on-error: true os_type: ubuntu os_version: '12.04' overrides: admin_socket: branch: master ceph: conf: global: osd heartbeat grace: 100 mon: debug mon: 20 debug ms: 1 debug paxos: 20 mon warn on legacy crush tunables: false osd: debug filestore: 20 debug journal: 20 debug ms: 1 debug osd: 20 log-whitelist: - slow request - wrongly marked me down - objects unfound and apparently lost - log bound mismatch sha1: b0aa846b3f81225a779de00100e15334fb8156b3 ceph-deploy: branch: dev: master conf: client: log file: /var/log/ceph/ceph-$name.$pid.log mon: debug mon: 1 debug ms: 20 debug paxos: 20 osd default pool size: 2 install: ceph: sha1: b0aa846b3f81225a779de00100e15334fb8156b3 rgw: default_idle_timeout: 1200 s3tests: branch: master workunit: sha1: b0aa846b3f81225a779de00100e15334fb8156b3 owner: scheduled_teuthology@teuthology priority: 1000 roles: - - mon.a - mon.b - mds.a - osd.0 - osd.1 - osd.2 - - osd.3 - osd.4 - osd.5 - mon.c - - client.0 suite: upgrade:dumpling-x:stress-split suite_branch: master suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master targets: ubuntu@vpm126.front.sepia.ceph.com: ssh-rsa 
AAAAB3NzaC1yc2EAAAADAQABAAABAQCfbQRWnvxmwbgai9ELh6/tNjbWMPp4Ckjg9F6/F9YG1yYBM7Hpyy6MFgV4aUFXsvqvwqpa6Mz6P49WLHQ+AdfuCFoQ2wNmvPUWhVuddmoVwDasoseiS1mH9IBlTinxOCDxSQlczezr0SltyRNJYIbabyCmupb+oAftgpxRMA0Lbmmhan1kZSEHKAK4E0OS3hgvDwtw9kylMXUz0Km72pAvq0jfY3vqUbI7FLDpZN/p7BK3eUwx1H1gFaMpzTtWNZ+keRr41nz042oxYEijhpG/zPq2+zfcd7qedHa2V1sIZRc5L12fuoywkPSc4XbzL5mCrXcxxtl4mHnOU2o8wWz/ ubuntu@vpm127.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC35kg1Z1yLwYoic/dcB7cYTTrD0bCRd7Wl7zxMtg24xPsGHOqn1iuCBYi0LnXSlDM3OxP4ziPVEuMj5jvUXQdINA6VQJDmrTNg+JA0lgiVDVwk7MLGK4dSpzxLeMSAMwWuKmvjCLhlKElmGfyHY7wBt1X61StiUXT+YzlQMzFzJafftO367ZrLZgZT1pzD/DbT/tu2rpitKDcOc6dEPYp3A8O3AvjFe6ZIuYDy5ppRTO1SSlLK/qbybvPq0lwxmFqWksHJhhDAoyISpPIJJvdS8Y2DPuc0sxXjvcepN0vTsFGxtJcAAq+VY4LfkqOjuadZy6OOUL62vNKU77PSnOhj ubuntu@vpm128.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/KLph+KmwNuX9PIKpea4V31ca2CCp+ymms56ZivTLTz+gwxoqoDauCEEOxrttz/EPZiwgLa+EICBuUfQpJ5rAdYcYN4zQ6SebSIjMTa0I0vNrqlYuIK822rMYVJDK1zsB/b3qq9KcxhydcBGi4UZuCHZhtKL6ggHmUW5FCvY8pP2LvuRjcB3xFoFiV0EiBGVTddAT6rgiiWPXVzuJW8FJpZCq2YFiD0U4vctcNeQ5LizWl2Dj3ViGGSvajJ4D4K4IB1v3xUiA9/3UhggkFeFBm2RgkZjMwAboStSzginBcniLOt8MSpbMcBb+qA2d8vF30+NnXK5JIExqf5kbJm3X tasks: - internal.lock_machines: - 3 - vps - internal.save_config: null - internal.check_lock: null - internal.connect: null - internal.serialize_remote_roles: null - internal.check_conflict: null - internal.check_ceph_data: null - internal.vm_setup: null - kernel: *id001 - internal.base: null - internal.archive: null - internal.coredump: null - internal.sudo: null - internal.syslog: null - internal.timer: null - chef: null - clock.check: null - install: branch: dumpling - ceph: fs: xfs - install.upgrade: osd.0: null - ceph.restart: daemons: - osd.0 - osd.1 - osd.2 - thrashosds: chance_pgnum_grow: 1 chance_pgpnum_fix: 1 thrash_primary_affinity: false timeout: 1200 - ceph.restart: daemons: - mon.a wait-for-healthy: false wait-for-osds-up: true - workunit: branch: dumpling clients: 
client.0: - cls/test_cls_rbd.sh - ceph.restart: daemons: - mon.b wait-for-healthy: false wait-for-osds-up: true - radosbench: clients: - client.0 time: 1800 - install.upgrade: mon.c: null - ceph.restart: daemons: - mon.c wait-for-healthy: false wait-for-osds-up: true - ceph.wait_for_mon_quorum: - a - b - c - workunit: branch: dumpling clients: client.0: - rados/test-upgrade-firefly.sh - workunit: branch: dumpling clients: client.0: - rbd/test_librbd_python.sh - rgw: client.0: null default_idle_timeout: 300 - swift: client.0: rgw_server: client.0 - rados: clients: - client.0 objects: 500 op_weights: delete: 50 read: 100 rollback: 50 snap_create: 50 snap_remove: 50 write: 100 ops: 4000 teuthology_branch: master tube: vps verbose: true worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.4613
description: upgrade:dumpling-x:stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rbd-cls.yaml 6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}
duration: 13344.698642015457
failure_reason: 'Command failed on vpm126 with status 1: ''sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 4'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Updated by Sage Weil over 9 years ago
-8946> 2014-08-21 21:47:55.559494 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] execute_ctx 0x4ab8200
-8945> 2014-08-21 21:47:55.559523 7fa973a7a700 5 -- op tracker -- , seq: 44897, time: 2014-08-21 21:47:55.559516, event: started, op: osd_op(client.6612.0:13027 notify.0 [watch add cookie 1 ver 0] 142.4322fa9f ondisk+read+write e3232)
-8944> 2014-08-21 21:47:55.559530 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_op 4322fa9f/notify.0/head//142 [watch add cookie 1 ver 0] ov 3243'15 av 3243'16 snapc 0=[] snapset 0=[]:[]+head
-8943> 2014-08-21 21:47:55.559550 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op 4322fa9f/notify.0/head//142 [watch add cookie 1 ver 0]
-8942> 2014-08-21 21:47:55.559566 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op watch add cookie 1 ver 0
-8941> 2014-08-21 21:47:55.559582 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] watch: ctx->obc=0x9f0d000 cookie=1 oi.version=15 ctx->at_version=3243'16
-8940> 2014-08-21 21:47:55.559597 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] watch: oi.user_version=12
-8939> 2014-08-21 21:47:55.559611 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] watch: peer_addr=10.214.140.29:0/1007463
-8938> 2014-08-21 21:47:55.559631 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] registered new watch watch(cookie 1 30s 10.214.140.29:0/1007463) by client.6612
-8937> 2014-08-21 21:47:55.559657 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op_effects on session 0x3f43c80
-8936> 2014-08-21 21:47:55.559675 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op_effects applying watch connect on session 0x3f43c80 watcher 1,client.6612
-8935> 2014-08-21 21:47:55.559690 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] do_osd_op_effects new watcher 1,client.6612
-8934> 2014-08-21 21:47:55.559706 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] -- Watch(1,client.6612) Watch()
-8933> 2014-08-21 21:47:55.559726 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] -- Watch(1,client.6612) connecting
-8932> 2014-08-21 21:47:55.559746 7fa973a7a700 15 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] -- Watch(1,client.6612) unregister_cb
-8931> 2014-08-21 21:47:55.559765 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] make_writeable 4322fa9f/notify.0/head//142 snapset=0x3a37818 snapc=0=[]
-8930> 2014-08-21 21:47:55.559785 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] make_writeable 4322fa9f/notify.0/head//142 done, snapset=0=[]:[]+head
-8929> 2014-08-21 21:47:55.559802 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] finish_ctx 4322fa9f/notify.0/head//142 0x4ab8200 op modify
-8928> 2014-08-21 21:47:55.559824 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] set mtime to 2014-08-21 19:58:42.038946
-8927> 2014-08-21 21:47:55.559854 7fa973a7a700 10 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] final snapset 0=[]:[]+head in 4322fa9f/notify.0/head//142
-8926> 2014-08-21 21:47:55.559882 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] zeroing write result code 0
-8925> 2014-08-21 21:47:55.559900 7fa973a7a700 20 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] op order client.6612 tid 13027 last was 13030
-8924> 2014-08-21 21:47:55.559916 7fa973a7a700 -1 osd.4 pg_epoch: 3243 pg[142.7( v 3243'15 (0'0,3243'15] local-les=3241 n=1 ec=1196 les/c 3241/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 luod=1338'14 rops=1 crt=1338'14 lcod 0'0 mlcod 0'0 active+recovery_wait+degraded] bad op order, already applied 13030 > this 13027
-604> 2014-08-21 21:47:56.870603 7fa973a7a700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7fa973a7a700 time 2014-08-21 21:47:55.559943
osd/ReplicatedPG.cc: 1924: FAILED assert(0 == "out of order op")
 ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
 1: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1ab8) [0x81cf48]
 2: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2a5c) [0x826bcc]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63f) [0x7c21ef]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a2) [0x64ac22]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x64b6e1]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa7de9c]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa7f780]
 8: (()+0x7e9a) [0x7fa9916eee9a]
 9: (clone()+0x6d) [0x7fa99009f73d]
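The assert comes from the per-client ordering check visible just above it: the PG remembers the last tid it applied for each client and requires every new op's tid to be strictly greater, so tid 13027 arriving after 13030 has already been applied trips it. A minimal sketch of that kind of check, with hypothetical names rather than the actual Ceph code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical per-PG ordering guard: remembers the last tid applied for
// each client and rejects any op whose tid is not strictly greater.
struct OpOrderGuard {
    std::map<uint64_t, uint64_t> last_tid;  // client id -> last applied tid

    // Returns true if the op is in order; false models the
    // "bad op order, already applied 13030 > this 13027" failure.
    bool check_and_record(uint64_t client, uint64_t tid) {
        auto it = last_tid.find(client);
        if (it != last_tid.end() && tid <= it->second)
            return false;  // a later tid from this client was already applied
        last_tid[client] = tid;
        return true;
    }
};
```

With the tids from this log, applying 13030 first makes a subsequent 13027 fail the check, which is exactly the reordering the rest of this ticket traces down.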
Updated by Sage Weil over 9 years ago
- Subject changed from osd crashed in upgrade:dumpling-x:stress-split-master-distro-basic-vps to osd: notify ops reordered
- Category set to OSD
Updated by Sage Weil over 9 years ago
first 2 ops arrive
2014-08-21 21:46:52.167158 7fa969d3f700 1 -- 10.214.140.54:6805/15229 <== client.6612 10.214.140.29:0/1007463 518 ==== osd_op(client.6612.0:13027 notify.0 [watch add cookie 1 ver 0] 142.4322fa9f ondisk+read+write e3232) v4 ==== 151+0+0 (2463101563 0 0) 0x5c98500 con 0x21992c0
2014-08-21 21:46:52.167266 7fa969d3f700 20 osd.4 3232 should_share_map client.6612 10.214.140.29:0/1007463 3232
2014-08-21 21:46:52.186899 7fa969d3f700 1 -- 10.214.140.54:6805/15229 <== client.6612 10.214.140.29:0/1007463 519 ==== osd_op(client.6612.0:13029 notify.7 [watch add cookie 8 ver 0] 142.84ada7c9 ondisk+read+write e3232) v4 ==== 151+0+0 (2889371413 0 0) 0x39b8780 con 0x21992c0
2014-08-21 21:46:52.187032 7fa969d3f700 20 osd.4 3232 should_share_map client.6612 10.214.140.29:0/1007463 3232
we create the pg
2014-08-21 21:46:52.964072 7fa980394700 15 osd.4 3232 project_pg_history 142.7 from 3231 to 3232, start ec=1196 les/c 3184/3191 3231/3231/3231
2014-08-21 21:46:52.964093 7fa980394700 15 osd.4 3232 project_pg_history end ec=1196 les/c 3184/3191 3231/3231/3231
2014-08-21 21:46:52.964111 7fa980394700 20 osd.4 3232 _create_lock_pg pgid 142.7
2014-08-21 21:46:52.964116 7fa980394700 10 osd.4 3232 _open_lock_pg 142.7
2014-08-21 21:46:52.964125 7fa980394700 10 osd.4 3232 _get_pool 142
2014-08-21 21:46:52.964172 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7(unlocked)] enter Initial
2014-08-21 21:46:52.964193 7fa980394700 20 osd.4 pg_epoch: 3231 pg[142.7(unlocked)] enter NotTrimming
2014-08-21 21:46:52.964209 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0) [] r=0 lpr=0 crt=0'0 inactive] init role 0 up [4,5] acting [4,5] history ec=1196 les/c 3184/3191 3231/3231/3231 1 past_intervals
2014-08-21 21:46:52.964267 7fa980394700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 4294967295'18446744073709551615, trimmed:
2014-08-21 21:46:52.964282 7fa980394700 7 osd.4 3232 _create_lock_pg pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive]
2014-08-21 21:46:52.964293 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] handle_create
2014-08-21 21:46:52.964310 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000139 0 0.000000
2014-08-21 21:46:52.964330 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Reset
2014-08-21 21:46:52.964342 7fa980394700 20 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=0 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] set_last_peering_reset 3231
2014-08-21 21:46:52.964359 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] update_heartbeat_peers -> 4,5
2014-08-21 21:46:52.964373 7fa980394700 20 osd.4 3232 need_heartbeat_peer_update
2014-08-21 21:46:52.964377 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] take_waiters
2014-08-21 21:46:52.964390 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000060 1 0.000048
2014-08-21 21:46:52.964407 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Started
2014-08-21 21:46:52.964420 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Start
2014-08-21 21:46:52.964432 7fa980394700 1 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
2014-08-21 21:46:52.964448 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] exit Start 0.000027 0 0.000000
2014-08-21 21:46:52.964465 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
2014-08-21 21:46:52.964479 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
2014-08-21 21:46:52.964492 7fa980394700 5 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
2014-08-21 21:46:52.964505 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] _calc_past_interval_range: already have past intervals back to 3191
2014-08-21 21:46:52.964520 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior interval(3175-3230 up [5,1](5) acting [5,1](5) maybe_went_rw)
2014-08-21 21:46:52.964538 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior final: probe 1,4,5 down blocked_by {}
2014-08-21 21:46:52.964551 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] up_thru 3180 < same_since 3231, must notify monitor
2014-08-21 21:46:52.964566 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>: querying info from osd.1
2014-08-21 21:46:52.964582 7fa980394700 10 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>: querying info from osd.5
2014-08-21 21:46:52.964601 7fa980394700 15 osd.4 pg_epoch: 3231 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 3231:1
2014-08-21 21:46:52.964618 7fa980394700 10 log is not dirty
2014-08-21 21:46:52.964629 7fa980394700 20 osd.4 3232 share_map_peer 0x3260dc0 already has epoch 3232
2014-08-21 21:46:52.964633 7fa980394700 7 osd.4 3232 do_queries querying osd.1 on 1 PGs
2014-08-21 21:46:52.964638 7fa980394700 1 -- 10.214.140.54:6806/15229 --> 10.214.140.163:6801/30961 -- pg_query(142.7 epoch 3232) v3 -- ?+0 0x5302c00 con 0x3260dc0
2014-08-21 21:46:52.964672 7fa978283700 2 filestore(/var/lib/ceph/osd/ceph-4) waiting 51 > 50 ops || 61902 > 104857600
2014-08-21 21:46:52.964687 7fa977a82700 2 filestore(/var/lib/ceph/osd/ceph-4) waiting 51 > 50 ops || 62200 > 104857600
2014-08-21 21:46:52.963884 7fa984fce700 10 filestore(/var/lib/ceph/osd/ceph-4) create_collection /var/lib/ceph/osd/ceph-4/current/0.62_head = 0
2014-08-21 21:46:52.964838 7fa984fce700 10 filestore(/var/lib/ceph/osd/ceph-4) _set_replay_guard 207478.0.0
2014-08-21 21:46:52.965005 7fa980394700 20 osd.4 3232 share_map_peer 0x433d000 already has epoch 3232
2014-08-21 21:46:52.965014 7fa980394700 7 osd.4 3232 do_queries querying osd.5 on 1 PGs
2014-08-21 21:46:52.965018 7fa980394700 1 -- 10.214.140.54:6806/15229 --> 10.214.140.54:6810/17908 -- pg_query(142.7 epoch 3232) v3 -- ?+0 0x5302000 con 0x433d000
2014-08-21 21:46:52.966261 7fa980394700 5 filestore(/var/lib/ceph/osd/ceph-4) queue_transactions new osr(142.7 0x8d19340)/0x8d19340
2014-08-21 21:46:52.966276 7fa980394700 2 filestore(/var/lib/ceph/osd/ceph-4) waiting 51 > 50 ops || 62219 > 104857600
but it blocks submitting the transaction...
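The "waiting 51 > 50 ops || ... > 104857600" lines are the filestore's submission throttle: a new transaction blocks while either the in-flight op count or the in-flight byte total is over its cap. A minimal sketch of that kind of dual-limit throttle, with hypothetical names rather than the actual FileStore code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical dual-limit throttle, modeling log lines like
// "waiting 51 > 50 ops || 61902 > 104857600": a submission must wait
// while either the op count or the byte total exceeds its limit.
struct SubmitThrottle {
    uint64_t max_ops, max_bytes;
    uint64_t cur_ops = 0, cur_bytes = 0;

    // Would the next submission have to wait?
    bool must_wait() const {
        return cur_ops > max_ops || cur_bytes > max_bytes;
    }

    void start(uint64_t bytes)  { ++cur_ops; cur_bytes += bytes; }
    void finish(uint64_t bytes) { --cur_ops; cur_bytes -= bytes; }
};
```

The point for this bug is only that such a throttle can stall pg creation for a while, leaving a window in which more client ops arrive.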
meanwhile, we get another message
2014-08-21 21:46:53.141163 7fa969d3f700 1 -- 10.214.140.54:6805/15229 <== client.6612 10.214.140.29:0/1007463 520 ==== osd_op(client.6612.0:13030 notify.0 [watch remove cookie 1 ver 0] 142.4322fa9f ondisk+write e3233) v4 ==== 151+0+0 (3710587269 0 0) 0x39b8a00 con 0x21992c0
2014-08-21 21:46:53.141261 7fa969d3f700 1 -- 10.214.140.54:6805/15229 --> 10.214.140.163:6790/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0,osdmap=3233}) v2 -- ?+0 0x4f60600 con 0x75f92c0
and then we finish creating our pg and call wake_pg_waiters
2014-08-21 21:46:53.775049 7fa980394700 10 osd.4 3232 pg[142.7( empty local-les=0 n=0 ec=1196 les/c 3184/3191 3231/3231/3231) [4,5] r=0 lpr=3231 pi=3175-3230/1 crt=0'0 mlcod 0'0 peering] is new
and then we enqueue in the wrong order:
2014-08-21 21:46:53.879269 7fa981b97700 15 osd.4 3234 enqueue_op 0x85eac00 prio 63 cost 0 latency 0.738232 osd_op(client.6612.0:13030 notify.0 [watch remove cookie 1 ver 0] 142.4322fa9f ondisk+write e3233) v4
...
2014-08-21 21:46:53.879312 7fa981b97700 15 osd.4 3234 enqueue_op 0x85ea400 prio 63 cost 0 latency 1.712259 osd_op(client.6612.0:13027 notify.0 [watch add cookie 1 ver 0] 142.4322fa9f ondisk+read+write e3232) v4
Updated by Samuel Just over 9 years ago
OSD::session_notify_pg_create queues the pg waiters at the end of the waiting_for_map queue rather than at the beginning. Oops.
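In other words, ops that were parked waiting for the pg must be requeued ahead of ops that arrived later, not behind them. A toy illustration of the two orderings, using the tids from this log and hypothetical structures rather than the actual OSD code:

```cpp
#include <cassert>
#include <deque>
#include <list>

// Toy model: ops waiting for a pg to be created must be requeued *ahead of*
// ops that arrived while the pg was being created, or per-client tid order
// breaks. Plain ints stand in for the queued ops' tids.
using OpQueue = std::deque<int>;

// Buggy behavior from this report: waiters appended at the back, so the
// later-arriving op is dispatched first.
void requeue_at_back(OpQueue& waiting_for_map, std::list<int>& pg_waiters) {
    for (int tid : pg_waiters)
        waiting_for_map.push_back(tid);
    pg_waiters.clear();
}

// Intended behavior: waiters spliced in at the front, preserving the order
// in which the client originally sent them.
void requeue_at_front(OpQueue& waiting_for_map, std::list<int>& pg_waiters) {
    waiting_for_map.insert(waiting_for_map.begin(),
                           pg_waiters.begin(), pg_waiters.end());
    pg_waiters.clear();
}
```

With tid 13030 already in waiting_for_map and {13027, 13029} parked as pg waiters, requeueing at the back yields 13030 first (the crash above), while requeueing at the front restores 13027, 13029, 13030.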
Updated by Samuel Just over 9 years ago
- Status changed from Pending Backport to Resolved
Does not need to be backported.