Bug #7804 (Closed): backfill racing with a hitset object remove
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477/
2014-03-20T11:56:17.652 DEBUG:teuthology.orchestra.run:Running [10.214.132.7]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph status --format=json-pretty'
2014-03-20T11:56:17.882 INFO:teuthology.task.thrashosds.ceph_manager:no progress seen, keeping timeout for now
2014-03-20T11:56:17.882 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 84, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/thrashosds.py", line 172, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph_manager.py", line 153, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 331, in get
    raise self._exception
AssertionError: failed to become clean before timeout expired
archive_path: /var/lib/teuthworker/archive/teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477
description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/mapgap.yaml workloads/cache-agent-big.yaml}
email: null
job_id: '139477'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana
name: teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd map cache size: 1
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
  s3tests:
    branch: master
  workunit:
    sha1: cb744ca3825c42ddf8eb708abe5bc92f0f240287
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
targets:
  ubuntu@plana61.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCko6xlgb/mYgguPm38M7JukH/8ZcBuIGb+8RF9CInF6PmabxpsWMRcJxBw3HRgAY6hGm9JSzg4h53HcrzZX4ZdV8AoqiDPWHtFl+1qoaOuq7U7SPj6aL960vYVVr3JKfRFQz6u1SQHrKuYgL8RvToiBjI8BLdjgrZ7pdMnWQoaetpU6s9CWxDRb9R28qgBxzI84PcDY3TdoJ8IeiYNNIUP/5co9WMiQzbWGX4fXOwiclJUzPw4n9xGELbSznUJMwP/yhSanipSgeQ+5cDA+h8RtmBqq0BKqMCp44rPYFZyYwOZqUtCbnqSyw0OHF1AcSExAq2vulNn0dFD5xraNVP/
  ubuntu@plana71.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUQdYSnDc4vySGKfiSAnEJWhIvb94utTzo+KnlO0UXG1PZjNblJvn7jzdYEKhHC7H+zDCROddZHmfD8bkxJQnqUYySqQWmf1u7HPs9DOMtoWTIK/ZnP4/P3i3IGHBL+CFliZb0nvuZ++hCNJ7RUWQyNalaUKpUbttow7hKDg3h4DTNnuAweqMJmDVux1kaHabuYoGPdGs93MFUdkd3hxCL7UlT4hLbOCG5NG0S7JIbeWJNSn6X3XAaCr70Q7AbyhZN/ODrn9TGA+ys7YMSX1AxcbMYLwH33bq6VtyTpiTCw3BsVRL8qz2TtrQCiFtg/xxV6Jif0ymNEYfh3Kh2bYV1
tasks:
- internal.lock_machines:
  - 2
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1800
- exec:
    client.0:
    - ceph osd pool create base 4
    - ceph osd pool create cache 4
    - ceph osd tier add base cache
    - ceph osd tier cache-mode cache writeback
    - ceph osd tier set-overlay base cache
    - ceph osd pool set cache hit_set_type bloom
    - ceph osd pool set cache hit_set_count 8
    - ceph osd pool set cache hit_set_period 60
    - ceph osd pool set cache target_max_objects 5000
- rados:
    clients:
    - client.0
    objects: 10000
    op_weights:
      copy_from: 50
      delete: 50
      read: 100
      write: 100
    ops: 4000
    pools:
    - base
    size: 1024
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.11466
description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/mapgap.yaml workloads/cache-agent-big.yaml}
duration: 3437.8705339431763
failure_reason: failed to become clean before timeout expired
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Updated by Samuel Just about 10 years ago
- Assignee set to David Zafman
- Priority changed from Normal to Urgent
Updated by Samuel Just about 10 years ago
- Assignee deleted (David Zafman)
This appears to have been caused by a backfill racing with a hitset object remove -- probably easiest to block hitset creation/trimming while backfilling that part of the pg?
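A minimal sketch of the coarse form of this suggestion, assuming a hypothetical PG type and flag (none of these names are real Ceph symbols); the refinement in the comment -- blocking only while backfill covers the hit_set part of the pg -- is what the eventual fix does, see the sketch under the last comment below:

    // Hypothetical sketch, not Ceph code: defer hit_set create/trim while
    // backfill is in flight, so the backfill scan can never race with a
    // hit_set object remove.
    struct PG {
      bool backfill_in_progress = false;

      void hit_set_persist() {
        if (backfill_in_progress)
          return;  // deferred; the next persist/trim tick will retry
        // ... write the new hit_set object and trim expired ones ...
      }
    };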
Updated by Samuel Just about 10 years ago
- Subject changed from "failed to become clean before timeout expired" in teuthology-2014-03-20_02:30:02-rados-firefly-distro-basic-plana/139477 to backfill racing with a hitset object remove
Updated by Samuel Just almost 10 years ago
- Status changed from Duplicate to 12
I don't think this bug is fixed; picking this one as the root bug (there was a duplicates loop).
Updated by Greg Farnum almost 10 years ago
ubuntu@teuthology:/a/gregf-2014-06-02_14:44:16-rados-wip-sharded-threadpool-testing-basic-plana/287095
Updated by Sage Weil almost 10 years ago
- Status changed from 12 to Duplicate
This looks like a dup of #7983, where we already fixed the backfill vs hit_set issues by deferring any hit_set_persist or trim activity until backfill progresses past the initial pgid.ps() hash slot.
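For reference, a toy rendering of that deferral with illustrative names rather than the actual Ceph symbols (in Ceph the check sits in the hit_set persist path and consults each backfill target's last_backfill):

    #include <cstdint>
    #include <optional>
    #include <vector>

    // hit_set objects all carry hash == pgid.ps() and sort first in
    // backfill order, so persist/trim is safe once every backfill
    // target's last_backfill has moved past that hash slot.
    struct PeerInfo {
      // Hash slot of the last object backfilled to this target;
      // empty means backfill has not copied anything yet.
      std::optional<uint32_t> last_backfill_hash;
    };

    bool hit_set_persist_safe(const std::vector<PeerInfo>& backfill_targets,
                              uint32_t pgid_ps) {
      for (const auto& pi : backfill_targets) {
        if (!pi.last_backfill_hash ||            // nothing copied yet
            *pi.last_backfill_hash == pgid_ps)   // still in the hit_set slot
          return false;                          // defer persist/trim
      }
      return true;
    }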