Bug #7804

closed

backfill racing with a hitset object remove

Added by Yuri Weinstein about 10 years ago. Updated almost 10 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477/

2014-03-20T11:56:17.652 DEBUG:teuthology.orchestra.run:Running [10.214.132.7]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph status --format=json-pretty'
2014-03-20T11:56:17.882 INFO:teuthology.task.thrashosds.ceph_manager:no progress seen, keeping timeout for now
2014-03-20T11:56:17.882 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 84, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/thrashosds.py", line 172, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph_manager.py", line 153, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 331, in get
    raise self._exception
AssertionError: failed to become clean before timeout expired
archive_path: /var/lib/teuthworker/archive/teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477
description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/mapgap.yaml workloads/cache-agent-big.yaml}
email: null
job_id: '139477'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana
name: teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd map cache size: 1
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
  s3tests:
    branch: master
  workunit:
    sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
targets:
  ubuntu@plana61.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCko6xlgb/mYgguPm38M7JukH/8ZcBuIGb+8RF9CInF6PmabxpsWMRcJxBw3HRgAY6hGm9JSzg4h53HcrzZX4ZdV8AoqiDPWHtFl+1qoaOuq7U7SPj6aL960vYVVr3JKfRFQz6u1SQHrKuYgL8RvToiBjI8BLdjgrZ7pdMnWQoaetpU6s9CWxDRb9R28qgBxzI84PcDY3TdoJ8IeiYNNIUP/5co9WMiQzbWGX4fXOwiclJUzPw4n9xGELbSznUJMwP/yhSanipSgeQ+5cDA+h8RtmBqq0BKqMCp44rPYFZyYwOZqUtCbnqSyw0OHF1AcSExAq2vulNn0dFD5xraNVP/
  ubuntu@plana71.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUQdYSnDc4vySGKfiSAnEJWhIvb94utTzo+KnlO0UXG1PZjNblJvn7jzdYEKhHC7H+zDCROddZHmfD8bkxJQnqUYySqQWmf1u7HPs9DOMtoWTIK/ZnP4/P3i3IGHBL+CFliZb0nvuZ++hCNJ7RUWQyNalaUKpUbttow7hKDg3h4DTNnuAweqMJmDVux1kaHabuYoGPdGs93MFUdkd3hxCL7UlT4hLbOCG5NG0S7JIbeWJNSn6X3XAaCr70Q7AbyhZN/ODrn9TGA+ys7YMSX1AxcbMYLwH33bq6VtyTpiTCw3BsVRL8qz2TtrQCiFtg/xxV6Jif0ymNEYfh3Kh2bYV1
tasks:
- internal.lock_machines:
  - 2
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1800
- exec:
    client.0:
    - ceph osd pool create base 4
    - ceph osd pool create cache 4
    - ceph osd tier add base cache
    - ceph osd tier cache-mode cache writeback
    - ceph osd tier set-overlay base cache
    - ceph osd pool set cache hit_set_type bloom
    - ceph osd pool set cache hit_set_count 8
    - ceph osd pool set cache hit_set_period 60
    - ceph osd pool set cache target_max_objects 5000
- rados:
    clients:
    - client.0
    objects: 10000
    op_weights:
      copy_from: 50
      delete: 50
      read: 100
      write: 100
    ops: 4000
    pools:
    - base
    size: 1024
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.11466
description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/mapgap.yaml workloads/cache-agent-big.yaml}
duration: 3437.8705339431763
failure_reason: failed to become clean before timeout expired
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

Related issues (2): 0 open, 2 closed

Has duplicate: Ceph - Bug #7894: osd: missing hitset object in cluster log (Duplicate, 03/28/2014)

Is duplicate of: Ceph - Bug #7983: osd: erroneously present object (Resolved, Sage Weil, 04/04/2014)

Actions #1

Updated by Samuel Just about 10 years ago

  • Assignee set to David Zafman
  • Priority changed from Normal to Urgent
Actions #2

Updated by Samuel Just about 10 years ago

  • Assignee deleted (David Zafman)

This appears to have been caused by a backfill racing with a hitset object remove -- probably easiest to block hitset creation/trimming while backfilling that part of the pg?
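The race described above can be sketched roughly as follows. This is a hypothetical illustration only; every name in it is invented and none of it reflects actual Ceph internals. The point is that backfill pushes a hit_set object to the backfill target, then hit_set trimming deletes it on the primary after backfill has already moved past that hash slot, so the delete never reaches the target:

```python
# Hypothetical sketch of the backfill vs. hit_set-trim race.
# All names here are invented for illustration; none correspond
# to real Ceph code or on-disk object names.

primary = {"hit_set_archive_0": b"bloom-bits", "user_obj": b"data"}
target = {}  # the OSD being backfilled

# 1. Backfill scans the primary's objects in hash order.
scan = sorted(primary)

# 2. Backfill pushes the archived hit_set object to the target...
target["hit_set_archive_0"] = primary["hit_set_archive_0"]

# 3. ...then hit_set trimming deletes it on the primary. Backfill has
#    already passed that hash slot, so the delete is not replayed to
#    the backfill target.
del primary["hit_set_archive_0"]

# 4. Backfill finishes pushing whatever the scan still finds.
for name in scan:
    if name in primary:
        target.setdefault(name, primary[name])

# The target now holds an object the primary deleted -- the
# "erroneously present object" of #7983.
extra = set(target) - set(primary)
print(extra)  # {'hit_set_archive_0'}
```

Blocking hit_set creation/trimming while backfill is still working through that part of the PG, as suggested above, closes the window in step 3.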

Actions #3

Updated by Samuel Just about 10 years ago

  • Status changed from New to Duplicate
Actions #4

Updated by Samuel Just about 10 years ago

  • Subject changed from "failed to become clean before timeout expired" in teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477 to backfill racing with a hitset object remove
Actions #5

Updated by Samuel Just almost 10 years ago

  • Status changed from Duplicate to 12

I don't think this bug is fixed; picking this one as the root bug (there was a duplicates loop).

Actions #6

Updated by Greg Farnum almost 10 years ago

ubuntu@teuthology:/a/gregf-2014-06-02_14:44:16-rados-wip-sharded-threadpool-testing-basic-plana/287095

Actions #7

Updated by Sage Weil almost 10 years ago

  • Assignee set to Sage Weil
Actions #8

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Duplicate

This looks like a dup of #7983, where we already fixed backfill vs hit_set issues by deferring any hit_set_persist or trim activity until backfill progresses past the initial pgid.ps() hash slot.
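The deferral described here can be reduced to a simple guard: hit_set persist/trim is skipped while backfill has not yet moved past the hash slot holding the hit_set objects. A minimal sketch follows; the function and constant names are invented for illustration and do not mirror the actual Ceph code paths:

```python
# Hypothetical sketch of deferring hit_set persist/trim during backfill.
# Names are invented; the real check lives inside Ceph's OSD code.

HIT_SET_SLOT = 0x2A  # stand-in for pgid.ps(), the hash slot of the
                     # pg's hit_set objects

def hit_set_ops_allowed(backfilling: bool, backfill_pos: int) -> bool:
    """Allow hit_set persist/trim only when no backfill is in progress,
    or when backfill has already advanced past the hit_set hash slot."""
    return (not backfilling) or backfill_pos > HIT_SET_SLOT

assert hit_set_ops_allowed(False, 0x00)        # no backfill: allowed
assert not hit_set_ops_allowed(True, 0x10)     # before the slot: deferred
assert hit_set_ops_allowed(True, 0x30)         # past the slot: allowed
```

Deferred operations would simply be retried once backfill advances, so no persist or trim is lost, only delayed.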
