Bug #7976
closed4.8 missing primary copy of ..., unfound (dumpling)
0%
Description
'"2014-04-03 04:07:23.871077 osd.2 10.214.132.27:6822/30927 3 : [ERR] 4.8 missing primary copy of 57594e48/benchmark_data_plana48_9342_object117/head//4, unfound" in cluster log'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-04-02_22:35:24-upgrade:dumpling-x:stress-split-firefly---basic-plana/164751 description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/snaps-few-objects.yaml 6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml} email: null job_id: '164751' last_in_suite: false machine_type: plana name: teuthology-2014-04-02_22:35:24-upgrade:dumpling-x:stress-split-firefly---basic-plana nuke-on-error: true os_type: ubuntu os_version: '12.04' overrides: admin_socket: branch: firefly ceph: conf: mon: debug mon: 20 debug ms: 1 debug paxos: 20 mon warn on legacy crush tunables: false osd: debug filestore: 20 debug journal: 20 debug ms: 1 debug osd: 20 log-whitelist: - slow request - wrongly marked me down - objects unfound and apparently lost - log bound mismatch sha1: 5c9b8a271588e39fe6e77bd7a88bcf6b535e1d3e ceph-deploy: branch: dev: firefly conf: client: log file: /var/log/ceph/ceph-$name.$pid.log mon: debug mon: 1 debug ms: 20 debug paxos: 20 osd default pool size: 2 install: ceph: sha1: 5c9b8a271588e39fe6e77bd7a88bcf6b535e1d3e s3tests: branch: master workunit: sha1: 5c9b8a271588e39fe6e77bd7a88bcf6b535e1d3e owner: scheduled_teuthology@teuthology roles: - - mon.a - mon.b - mds.a - osd.0 - osd.1 - osd.2 - - osd.3 - osd.4 - osd.5 - mon.c - - client.0 targets: ubuntu@plana48.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBHhRwSkvAeRW6VrqAd9uejc+y4wydw26ndSiYjo1vv3WiQcGVKa9pNEuAYb/TRjmrTKHDO9lYl9fQwMEyY3BDNceKYJz43YyUHeNd8RNiAczRI4VTuMxJ6/GWv8n0t+24GkMjs88fnpDQ3xS4IdF5nJDzl9QnAzLFZbWGxt/h43rMQ4WOcAp57D9e7ZF/YtagB714HT2fEJfVnw6Q5EgPXnQ/xuxX9FolixiPNdP3iY998OXWS6PWAgbx1yw2t2CLIll7jXlDrBb/9fYIDa8qdV4JakzfQLGG/Ws3UnoP7Bx5V1edSLHHNmyXZ60H+82fVFe3woK+0jBcmL6BEObJ ubuntu@plana49.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCsupc3tilpTrHGdiLAeSUO0fLwE4P+zq+yuyvbs8WD8arny3m7vQvsW3Ef2Pi8rOAkW8LphHmgzwMng5Zx2q8suCJZUawM72Hl9nplLTIf1JrjvyWFUym6kyDc41cohnNkYR9r90/l8jcalTeQQNa3i66YmGG/ZQrlz7n5OjZ78tpZcBWVD1pIU1RVG/V6JinyUMIFjUNr3HToidgcK38jEwwIlyPCzqzsrP5lB/GPiVthKXZaI3DvGHkWydaqADKJrSjz/C8K641TBpxAc4i/tZa99b2kt6K6WbJQjvz0dsfn//b8u3qHkrgOaKmlAn/oFXf8aMk1G7qSpJtRBdXl ubuntu@plana51.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDgNXP3p/sw2sy34ARorzUh9QvDPit80IHDKQ71BGtytSAQL5ijlSpjJRjGT0HB9xHvR6v8115ikzmot1HgVeJSnC07UQKWp3CfVIUHZOtbMgw0exON14083tSlvn2djTA/bphuwag5u9y+0XkufOXBNrY4aBlQS9vNXnsW0PQwlgJ6YqK3W2e1qirpvfMamLugAFLdycCXXmjriXFuAxvHqbFrJYVEvNbsK8Bt+cRE5l0gcBin+5wJmz4iKagwYVAVqW7i1lZM1F0QdffYwuUrQ110/iz9AcnNvu6dSU+3g7agjBKvWCA+DVEn0RWbaRJ7M+FCl2PmLULnjvK44Qsp tasks: - internal.lock_machines: - 3 - plana - internal.save_config: null - internal.check_lock: null - internal.connect: null - internal.check_conflict: null - internal.check_ceph_data: null - internal.vm_setup: null - internal.base: null - internal.archive: null - internal.coredump: null - internal.sudo: null - internal.syslog: null - internal.timer: null - chef: null - clock.check: null - install: branch: dumpling - ceph: fs: xfs - install.upgrade: osd.0: null - ceph.restart: daemons: - osd.0 - osd.1 - osd.2 - thrashosds: chance_pgnum_grow: 1 chance_pgpnum_fix: 1 thrash_primary_affinity: false timeout: 1200 - ceph.restart: daemons: - mon.a wait-for-healthy: false wait-for-osds-up: true - rados: clients: - client.0 objects: 50 op_weights: delete: 50 read: 100 rollback: 50 snap_create: 50 snap_remove: 50 write: 100 ops: 4000 - ceph.restart: daemons: - mon.b wait-for-healthy: false wait-for-osds-up: true - radosbench: clients: - client.0 time: 1800 - install.upgrade: mon.c: null - ceph.restart: daemons: - mon.c wait-for-healthy: false wait-for-osds-up: true - ceph.wait_for_mon_quorum: - a - b - c - workunit: branch: dumpling clients: client.0: - rados/test-upgrade-firefly.sh - workunit: branch: dumpling clients: client.0: - rbd/test_librbd_python.sh - rgw: client.0: idle_timeout: 120 - swift: client.0: rgw_server: client.0 - rados: clients: - client.0 objects: 500 op_weights: delete: 50 read: 100 rollback: 50 snap_create: 50 snap_remove: 50 write: 100 ops: 4000 teuthology_branch: firefly verbose: true worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.16946
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/snaps-few-objects.yaml 6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml} duration: 6908.64345407486 failure_reason: '"2014-04-03 04:07:23.871077 osd.2 10.214.132.27:6822/30927 3 : [ERR] 4.8 missing primary copy of 57594e48/benchmark_data_plana48_9342_object117/head//4, unfound" in cluster log' flavor: basic owner: scheduled_teuthology@teuthology success: false
Updated by Samuel Just about 10 years ago
- Priority changed from Normal to Urgent
Updated by Sage Weil about 10 years ago
- Subject changed from "[ERR] 4.8 missing" in upgrade:dumpling-x:stress-split-firefly---basic-plana suite to 4.8 missing primary copy of ..., unfound
- Category set to OSD
- Status changed from New to 12
- Source changed from other to Q/A
Updated by Sage Weil about 10 years ago
This is a backfill bug in dumpling. The problem seems to be somewhere around here... the backfill for 57594e48/benchmark_data_plana48_9342_object117/head//4 is in progress (half-way through the object) when the primary (osd.3) does
2014-04-03 03:56:37.169617 7f2b8d470700 5 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] backfill_pos is 1d90988/benchmark_data_plana48_9342_object738/head//4 and pinfo.last_backfill is d9408f28/benchmark_data_plana48_9342_object744/head//4 2014-04-03 03:56:37.169631 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 57594e48/benchmark_data_plana48_9342_object117/head//4 is still in flight 2014-04-03 03:56:37.169657 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 22548168/benchmark_data_plana48_9342_object731/head//4 is still in flight 2014-04-03 03:56:37.169670 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] d6dc668/benchmark_data_plana48_9342_object676/head//4 is still in flight 2014-04-03 03:56:37.169681 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 79cf768/benchmark_data_plana48_9342_object1211/head//4 is still in flight 2014-04-03 03:56:37.169693 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 23350f68/benchmark_data_plana48_9342_object850/head//4 is still in flight 2014-04-03 03:56:37.169711 7f2b8d470700 1 -- 10.214.132.29:6801/27792 --> osd.2 10.214.132.27:6823/30927 -- pg_backfill(progress 4.8 e 156/156 lb 57594e48/benchmark_data_plana48_9342_object117/head//4) v2 -- ?+0 0x5c8a400
and the backfill on that object never completes. later, we think it is there and it is not.
Updated by Sage Weil about 10 years ago
- Subject changed from 4.8 missing primary copy of ..., unfound to 4.8 missing primary copy of ..., unfound (dumpling)
Updated by Samuel Just about 10 years ago
- Status changed from 12 to Fix Under Review
Updated by Samuel Just about 10 years ago
mm, doesn't apply cleanly, working on it
Updated by Samuel Just about 10 years ago
- Status changed from Fix Under Review to Pending Backport
- Assignee deleted (
Samuel Just) - Priority changed from Urgent to High
Actually, this is mildly tricky since the backfill logic changed a bit after dumpling to avoid some other related categories of bug. Unasigning, setting pending backport, and setting to high until the tests force us to fix it.
Updated by Samuel Just about 10 years ago
- Status changed from Pending Backport to 12
4139e75d63b0503dbb7fea8036044eda5e8b7cf1 and 7a06a71e0f2023f66d003dfb0168f4fe51eaa058 are part of the story. We probably would also have to add the patches moving from backfill_pos to last_backfill_started.