Bug #7804

closed

backfill racing with a hitset object remove

Added by Yuri Weinstein about 10 years ago. Updated almost 10 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477/

2014-03-20T11:56:17.652 DEBUG:teuthology.orchestra.run:Running [10.214.132.7]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph status --format=json-pretty'
2014-03-20T11:56:17.882 INFO:teuthology.task.thrashosds.ceph_manager:no progress seen, keeping timeout for now
2014-03-20T11:56:17.882 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 84, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/thrashosds.py", line 172, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph_manager.py", line 153, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 331, in get
    raise self._exception
AssertionError: failed to become clean before timeout expired
archive_path: /var/lib/teuthworker/archive/teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477
description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/mapgap.yaml workloads/cache-agent-big.yaml}
email: null
job_id: '139477'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana
name: teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd map cache size: 1
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
  s3tests:
    branch: master
  workunit:
    sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
targets:
  ubuntu@plana61.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCko6xlgb/mYgguPm38M7JukH/8ZcBuIGb+8RF9CInF6PmabxpsWMRcJxBw3HRgAY6hGm9JSzg4h53HcrzZX4ZdV8AoqiDPWHtFl+1qoaOuq7U7SPj6aL960vYVVr3JKfRFQz6u1SQHrKuYgL8RvToiBjI8BLdjgrZ7pdMnWQoaetpU6s9CWxDRb9R28qgBxzI84PcDY3TdoJ8IeiYNNIUP/5co9WMiQzbWGX4fXOwiclJUzPw4n9xGELbSznUJMwP/yhSanipSgeQ+5cDA+h8RtmBqq0BKqMCp44rPYFZyYwOZqUtCbnqSyw0OHF1AcSExAq2vulNn0dFD5xraNVP/
  ubuntu@plana71.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUQdYSnDc4vySGKfiSAnEJWhIvb94utTzo+KnlO0UXG1PZjNblJvn7jzdYEKhHC7H+zDCROddZHmfD8bkxJQnqUYySqQWmf1u7HPs9DOMtoWTIK/ZnP4/P3i3IGHBL+CFliZb0nvuZ++hCNJ7RUWQyNalaUKpUbttow7hKDg3h4DTNnuAweqMJmDVux1kaHabuYoGPdGs93MFUdkd3hxCL7UlT4hLbOCG5NG0S7JIbeWJNSn6X3XAaCr70Q7AbyhZN/ODrn9TGA+ys7YMSX1AxcbMYLwH33bq6VtyTpiTCw3BsVRL8qz2TtrQCiFtg/xxV6Jif0ymNEYfh3Kh2bYV1
tasks:
- internal.lock_machines:
  - 2
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1800
- exec:
    client.0:
    - ceph osd pool create base 4
    - ceph osd pool create cache 4
    - ceph osd tier add base cache
    - ceph osd tier cache-mode cache writeback
    - ceph osd tier set-overlay base cache
    - ceph osd pool set cache hit_set_type bloom
    - ceph osd pool set cache hit_set_count 8
    - ceph osd pool set cache hit_set_period 60
    - ceph osd pool set cache target_max_objects 5000
- rados:
    clients:
    - client.0
    objects: 10000
    op_weights:
      copy_from: 50
      delete: 50
      read: 100
      write: 100
    ops: 4000
    pools:
    - base
    size: 1024
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.11466
description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/mapgap.yaml workloads/cache-agent-big.yaml}
duration: 3437.8705339431763
failure_reason: failed to become clean before timeout expired
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

Related issues (2): 0 open, 2 closed

Has duplicate: Ceph - Bug #7894: osd: missing hitset object in cluster log (Duplicate, 03/28/2014)

Is duplicate of: Ceph - Bug #7983: osd: erroneously present object (Resolved, Sage Weil, 04/04/2014)

Actions #1

Updated by Samuel Just about 10 years ago

  • Assignee set to David Zafman
  • Priority changed from Normal to Urgent
Actions #2

Updated by Samuel Just about 10 years ago

  • Assignee deleted (David Zafman)

This appears to have been caused by a backfill racing with a hitset object remove -- probably easiest to block hitset creation/trimming while backfilling that part of the pg?
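The race described above can be sketched roughly as follows. This is a hypothetical illustration only; every name in it is invented and none of it reflects actual Ceph internals. The point is that backfill pushes a hit_set object to the backfill target, then hit_set trimming deletes it on the primary after backfill has already moved past that hash slot, so the delete never reaches the target:

```python
# Hypothetical sketch of the backfill vs. hit_set-trim race.
# All names here are invented for illustration; none correspond
# to real Ceph code or on-disk object names.

primary = {"hit_set_archive_0": b"bloom-bits", "user_obj": b"data"}
target = {}  # the OSD being backfilled

# 1. Backfill scans the primary's objects in hash order.
scan = sorted(primary)

# 2. Backfill pushes the archived hit_set object to the target...
target["hit_set_archive_0"] = primary["hit_set_archive_0"]

# 3. ...then hit_set trimming deletes it on the primary. Backfill has
#    already passed that hash slot, so the delete is not replayed to
#    the backfill target.
del primary["hit_set_archive_0"]

# 4. Backfill finishes pushing whatever the scan still finds.
for name in scan:
    if name in primary:
        target.setdefault(name, primary[name])

# The target now holds an object the primary deleted -- the
# "erroneously present object" of #7983.
extra = set(target) - set(primary)
print(extra)  # {'hit_set_archive_0'}
```

Blocking hit_set creation/trimming while backfill is still working through that part of the PG, as suggested above, closes the window in step 3.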

Actions #3

Updated by Samuel Just about 10 years ago

  • Status changed from New to Duplicate
Actions #4

Updated by Samuel Just about 10 years ago

  • Subject changed from "failed to become clean before timeout expired" in teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477 to backfill racing with a hitset object remove
Actions #5

Updated by Samuel Just almost 10 years ago

  • Status changed from Duplicate to 12

I don't think this bug is fixed; picking this one as the root bug (there was a duplicates loop).

Actions #6

Updated by Greg Farnum almost 10 years ago

ubuntu@teuthology:/a/gregf-2014-06-02_14:44:16-rados-wip-sharded-threadpool-testing-basic-plana/287095

Actions #7

Updated by Sage Weil almost 10 years ago

  • Assignee set to Sage Weil
Actions #8

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Duplicate

This looks like a dup of #7983, where we already fixed backfill vs hit_set issues by deferring any hit_set_persist or trim activity until backfill progresses past the initial pgid.ps() hash slot.
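The deferral described here can be reduced to a simple guard: hit_set persist/trim is skipped while backfill has not yet moved past the hash slot holding the hit_set objects. A minimal sketch follows; the function and constant names are invented for illustration and do not mirror the actual Ceph code paths:

```python
# Hypothetical sketch of deferring hit_set persist/trim during backfill.
# Names are invented; the real check lives inside Ceph's OSD code.

HIT_SET_SLOT = 0x2A  # stand-in for pgid.ps(), the hash slot of the
                     # pg's hit_set objects

def hit_set_ops_allowed(backfilling: bool, backfill_pos: int) -> bool:
    """Allow hit_set persist/trim only when no backfill is in progress,
    or when backfill has already advanced past the hit_set hash slot."""
    return (not backfilling) or backfill_pos > HIT_SET_SLOT

assert hit_set_ops_allowed(False, 0x00)        # no backfill: allowed
assert not hit_set_ops_allowed(True, 0x10)     # before the slot: deferred
assert hit_set_ops_allowed(True, 0x30)         # past the slot: allowed
```

Deferred operations would simply be retried once backfill advances, so no persist or trim is lost, only delayed.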
