Bug #7976

closed

4.8 missing primary copy of ..., unfound (dumpling)

Added by Yuri Weinstein about 10 years ago. Updated almost 10 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-04-02_22:35:24-upgrade:dumpling-x:stress-split-firefly---basic-plana/164751/

 '"2014-04-03 04:07:23.871077 osd.2 10.214.132.27:6822/30927
    3 : [ERR] 4.8 missing primary copy of 57594e48/benchmark_data_plana48_9342_object117/head//4,
    unfound" in cluster log'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-04-02_22:35:24-upgrade:dumpling-x:stress-split-firefly---basic-plana/164751
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/snaps-few-objects.yaml
  6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}
email: null
job_id: '164751'
last_in_suite: false
machine_type: plana
name: teuthology-2014-04-02_22:35:24-upgrade:dumpling-x:stress-split-firefly---basic-plana
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 5c9b8a271588e39fe6e77bd7a88bcf6b535e1d3e
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 5c9b8a271588e39fe6e77bd7a88bcf6b535e1d3e
  s3tests:
    branch: master
  workunit:
    sha1: 5c9b8a271588e39fe6e77bd7a88bcf6b535e1d3e
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
targets:
  ubuntu@plana48.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBHhRwSkvAeRW6VrqAd9uejc+y4wydw26ndSiYjo1vv3WiQcGVKa9pNEuAYb/TRjmrTKHDO9lYl9fQwMEyY3BDNceKYJz43YyUHeNd8RNiAczRI4VTuMxJ6/GWv8n0t+24GkMjs88fnpDQ3xS4IdF5nJDzl9QnAzLFZbWGxt/h43rMQ4WOcAp57D9e7ZF/YtagB714HT2fEJfVnw6Q5EgPXnQ/xuxX9FolixiPNdP3iY998OXWS6PWAgbx1yw2t2CLIll7jXlDrBb/9fYIDa8qdV4JakzfQLGG/Ws3UnoP7Bx5V1edSLHHNmyXZ60H+82fVFe3woK+0jBcmL6BEObJ
  ubuntu@plana49.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCsupc3tilpTrHGdiLAeSUO0fLwE4P+zq+yuyvbs8WD8arny3m7vQvsW3Ef2Pi8rOAkW8LphHmgzwMng5Zx2q8suCJZUawM72Hl9nplLTIf1JrjvyWFUym6kyDc41cohnNkYR9r90/l8jcalTeQQNa3i66YmGG/ZQrlz7n5OjZ78tpZcBWVD1pIU1RVG/V6JinyUMIFjUNr3HToidgcK38jEwwIlyPCzqzsrP5lB/GPiVthKXZaI3DvGHkWydaqADKJrSjz/C8K641TBpxAc4i/tZa99b2kt6K6WbJQjvz0dsfn//b8u3qHkrgOaKmlAn/oFXf8aMk1G7qSpJtRBdXl
  ubuntu@plana51.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDgNXP3p/sw2sy34ARorzUh9QvDPit80IHDKQ71BGtytSAQL5ijlSpjJRjGT0HB9xHvR6v8115ikzmot1HgVeJSnC07UQKWp3CfVIUHZOtbMgw0exON14083tSlvn2djTA/bphuwag5u9y+0XkufOXBNrY4aBlQS9vNXnsW0PQwlgJ6YqK3W2e1qirpvfMamLugAFLdycCXXmjriXFuAxvHqbFrJYVEvNbsK8Bt+cRE5l0gcBin+5wJmz4iKagwYVAVqW7i1lZM1F0QdffYwuUrQ110/iz9AcnNvu6dSU+3g7agjBKvWCA+DVEn0RWbaRJ7M+FCl2PmLULnjvK44Qsp
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- radosbench:
    clients:
    - client.0
    time: 1800
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0:
      idle_timeout: 120
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.16946
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/snaps-few-objects.yaml
  6-next-mon/monb.yaml 7-workload/radosbench.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml
  rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}
duration: 6908.64345407486
failure_reason: '"2014-04-03 04:07:23.871077 osd.2 10.214.132.27:6822/30927 3 : [ERR]
  4.8 missing primary copy of 57594e48/benchmark_data_plana48_9342_object117/head//4,
  unfound" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #8162: osd: dumpling advances last_backfill prematurely (Resolved, Samuel Just, 04/19/2014)

Actions #1

Updated by Samuel Just about 10 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil about 10 years ago

  • Subject changed from "[ERR] 4.8 missing" in upgrade:dumpling-x:stress-split-firefly---basic-plana suite to 4.8 missing primary copy of ..., unfound
  • Category set to OSD
  • Status changed from New to 12
  • Source changed from other to Q/A
Actions #3

Updated by Sage Weil about 10 years ago

This is a backfill bug in dumpling. The problem seems to be somewhere around here: the backfill for 57594e48/benchmark_data_plana48_9342_object117/head//4 is in progress (halfway through the object) when the primary (osd.3) does

2014-04-03 03:56:37.169617 7f2b8d470700  5 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] backfill_pos is 1d90988/benchmark_data_plana48_9342_object738/head//4 and pinfo.last_backfill is d9408f28/benchmark_data_plana48_9342_object744/head//4
2014-04-03 03:56:37.169631 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 57594e48/benchmark_data_plana48_9342_object117/head//4 is still in flight
2014-04-03 03:56:37.169657 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 22548168/benchmark_data_plana48_9342_object731/head//4 is still in flight
2014-04-03 03:56:37.169670 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] d6dc668/benchmark_data_plana48_9342_object676/head//4 is still in flight
2014-04-03 03:56:37.169681 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 79cf768/benchmark_data_plana48_9342_object1211/head//4 is still in flight
2014-04-03 03:56:37.169693 7f2b8d470700 20 osd.3 pg_epoch: 156 pg[4.8( v 151'72 (12'2,151'72] local-les=139 n=72 ec=11 les/c 139/127 137/138/132) [2]/[3,2] r=0 lpr=138 pi=126-137/4 rops=5 bft=2 mlcod 151'72 active+degraded+remapped+backfilling] 23350f68/benchmark_data_plana48_9342_object850/head//4 is still in flight
2014-04-03 03:56:37.169711 7f2b8d470700  1 -- 10.214.132.29:6801/27792 --> osd.2 10.214.132.27:6823/30927 -- pg_backfill(progress 4.8 e 156/156 lb 57594e48/benchmark_data_plana48_9342_object117/head//4) v2 -- ?+0 0x5c8a400

and the backfill on that object never completes. Later, the primary thinks the object is there when it is not.
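The failure mode described above can be sketched roughly as follows. This is a hypothetical Python model, not Ceph source: object names, the `report_progress` helper, and the integer ordering are all illustrative stand-ins for the hash-ordered backfill scan.

```python
# A minimal sketch of the dumpling-era bug: the pg_backfill progress
# message ("lb" in the log above) reports the scan cursor position,
# ignoring pushes that are still in flight. If such a push never
# completes, the peer records the object as backfilled without ever
# receiving its data, and the object later turns up unfound.

def report_progress(backfill_pos, in_flight):
    # Buggy behaviour: progress is the raw scan cursor, which may
    # already be past an object whose push has not been acked yet.
    return backfill_pos

# Object 117's push is still in flight while the cursor has moved on
# (mirroring object117 vs. object738 in the log excerpt).
in_flight = {117, 676, 731}
backfill_pos = 738

peer_last_backfill = report_progress(backfill_pos, in_flight)

# Every in-flight object at or below the reported bound is a hole the
# peer wrongly believes it has.
holes = {o for o in in_flight if o <= peer_last_backfill}
assert 117 in holes
```

A safe version would bound the reported progress by the oldest in-flight push, which is essentially what the post-dumpling rework does.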

Actions #4

Updated by Sage Weil about 10 years ago

  • Subject changed from 4.8 missing primary copy of ..., unfound to 4.8 missing primary copy of ..., unfound (dumpling)
Actions #5

Updated by Samuel Just about 10 years ago

wip-7676-dumpling I think

Actions #6

Updated by Samuel Just about 10 years ago

  • Status changed from 12 to Fix Under Review
Actions #7

Updated by Samuel Just about 10 years ago

  • Assignee set to Samuel Just
Actions #8

Updated by Samuel Just about 10 years ago

Hmm, it doesn't apply cleanly; working on it.

Actions #9

Updated by Samuel Just about 10 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Assignee deleted (Samuel Just)
  • Priority changed from Urgent to High

Actually, this is mildly tricky, since the backfill logic changed a bit after dumpling to avoid some other related categories of bug. Unassigning, setting to Pending Backport, and setting priority to High until the tests force us to fix it.

Actions #10

Updated by Samuel Just about 10 years ago

  • Status changed from Pending Backport to 12

4139e75d63b0503dbb7fea8036044eda5e8b7cf1 and 7a06a71e0f2023f66d003dfb0168f4fe51eaa058 are part of the story. We probably would also have to add the patches moving from backfill_pos to last_backfill_started.
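To illustrate why the backport is awkward, here is a hedged sketch of the invariant the post-dumpling rework enforces. The class and method names are hypothetical, not Ceph source: the point is that once `backfill_pos` is replaced by a notion like `last_backfill_started`, the peer's last_backfill is only ever advanced over objects whose pushes have actually completed.

```python
# Hypothetical model of the reworked invariant: never report backfill
# progress past an unacknowledged push, so a lost push can never leave
# a hole that the peer believes is already backfilled.

from dataclasses import dataclass, field


@dataclass
class BackfillState:
    last_backfill_started: int = 0          # furthest object whose push began
    pending: set = field(default_factory=set)  # pushes not yet acked

    def start_push(self, obj: int) -> None:
        self.pending.add(obj)
        self.last_backfill_started = max(self.last_backfill_started, obj)

    def ack_push(self, obj: int) -> None:
        self.pending.discard(obj)

    def peer_last_backfill(self) -> int:
        # Progress is bounded by the oldest push still in flight.
        if self.pending:
            return min(self.pending) - 1
        return self.last_backfill_started


s = BackfillState()
for obj in (117, 676, 738):
    s.start_push(obj)

s.ack_push(676)
assert s.peer_last_backfill() == 116   # object 117 still pending

s.ack_push(117)
s.ack_push(738)
assert s.peer_last_backfill() == 738   # all pushes acked
```

Backporting this to dumpling is more than a cherry-pick because dumpling's code still reasons in terms of the single `backfill_pos` cursor, which is why the patches moving to last_backfill_started would have to come along too.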

Actions #11

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Duplicate
