Fix #6116
osd: incomplete pg from thrashing on next
% Done: 100%
Description
... u'overall_status': u'HEALTH_WARN', u'summary': [{u'severity': u'HEALTH_WARN', u'summary': u'1 pgs incomplete'}]} ...
ubuntu@teuthology:/a/teuthology-2013-08-24_14:13:07-rados-next-testing-basic-plana/4228$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: c2f29906882bd30794da6993e755a0dab2b7a665
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject internal delays: 0.002
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        osd map cache size: 1
    fs: ext4
    log-whitelist:
    - slow request
    sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
  s3tests:
    branch: next
  workunit:
    sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: next
Subtasks
Related issues
History
#1 Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-08-26_15:47:58-rados-next-testing-basic-plana/6694
cluster is still hung
#3 Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-08-28_01:00:04-rados-master-testing-basic-plana/10150
#4 Updated by Samuel Just over 9 years ago
time: 2717s
log: http://qa-proxy.ceph.com/teuthology/teuthology-2013-09-09_20:00:20-rados-dumpling-testing-basic-plana/27708/
failed to become clean before timeout expired
Hung
2013-09-09 22:33:32.641520 mon.0 10.214.131.15:6789/0 3025 : [INF] pgmap v1622: 172 pgs: 171 active+clean, 1 incomplete; 21590 bytes data, 892 MB used, 2174 GB / 2291 GB avail
#5 Updated by Samuel Just over 9 years ago
Hmm, the last osd log entry indicates that the pg in question may have gone clean?
2013-09-09 22:27:19.022997 7f1724a49700 5 osd.3 pg_epoch: 1049 pg[2.1e( empty local-les=1044 n=0 ec=1 les/c 1044/972 1043/1043/1043) [3,0] r=0 lpr=1043 pi=791-1042/8 bft=0 mlcod 0'0 active] enter Started/Primary/Active/Clean
#6 Updated by Samuel Just over 9 years ago
The task was in process of letting the cluster recover with osd.2 down.
#7 Updated by Samuel Just over 9 years ago
There appear to be no pgs in incomplete state according to the osd log. Issue notifying the mon?
#8 Updated by Samuel Just over 9 years ago
From the mon logs, the last report seems to be:
2013-09-09 22:31:28.047555 7f56db94d700 15 mon.a@0(leader).pg v1614 got 1.3f reported at 1348:307 state incomplete -> incomplete
#9 Updated by Samuel Just over 9 years ago
1.3f does appear to be incomplete in the osd log
#10 Updated by Samuel Just over 9 years ago
Ok, there are enough logs to confirm that this is the primary-thinks-it's-clean vs backfill-peer-thinks-it's-clean race.
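The race described above can be sketched with a minimal, hypothetical two-actor model (the class and attribute names here are illustrative, not actual Ceph code): the backfill peer records completion locally, but until its ack reaches the primary, the two sides disagree about whether the PG is clean.

```python
# Hypothetical model of the "primary thinks it's clean vs. backfill
# peer thinks it's clean" race; names are illustrative only.

MAX = "MAX"  # sentinel meaning "backfill is complete"

class Peer:
    def __init__(self):
        self.last_backfill = "obj_41"   # backfill still in progress

    def finish_backfill(self):
        # Peer persists completion, then would ack the primary.
        self.last_backfill = MAX

class Primary:
    def __init__(self, peer):
        self.peer = peer
        self.acked = False              # has the peer's ack arrived?

    def on_backfill_ack(self):
        self.acked = True

    def try_go_clean(self):
        # The primary may only go clean once the peer's completion is
        # acknowledged; deciding from either side alone is the race.
        return self.acked and self.peer.last_backfill == MAX

peer = Peer()
primary = Primary(peer)

peer.finish_backfill()          # peer is complete...
print(primary.try_go_clean())   # ...but the ack is in flight: False
primary.on_backfill_ack()       # ack delivered
print(primary.try_go_clean())   # both sides now agree: True
```

The window between `finish_backfill()` and `on_backfill_ack()` is where the two sides can report different PG states to the mon.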
#11 Updated by Samuel Just over 9 years ago
- Tracker changed from Bug to Fix
- Target version set to v0.70
#12 Updated by Samuel Just over 9 years ago
- Target version deleted (v0.70)
#13 Updated by Samuel Just over 9 years ago
The workaround I put into teuthology was inadequate. I'm going to put this in the backlog and downgrade it now that it should stop messing up the nightlies.
#14 Updated by Samuel Just over 9 years ago
- Target version set to v0.73
#15 Updated by Samuel Just over 9 years ago
- Story points set to 5.0
#16 Updated by Samuel Just over 9 years ago
- Status changed from New to Resolved
I was way off on this one. We do ack the backfill completion. I suspect the actual problem was fixed by the #6585 fixes (backfill_pos vs last_backfill confusion).
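The backfill_pos vs last_backfill distinction can be illustrated with a hedged sketch (the function and variable names are hypothetical, not Ceph's): last_backfill marks the last object known complete on the peer, while backfill_pos is only the scan cursor; objects between the two may still be in flight, so testing against the cursor overstates progress.

```python
# Hypothetical sketch of the backfill_pos / last_backfill confusion;
# object names and the helper are illustrative only.

last_backfill = "b"   # everything <= "b" is known complete on the peer
backfill_pos  = "d"   # scan has reached "d"; ("b", "d"] may be in flight

def peer_has(obj, marker):
    # An object is safely present on the peer only if obj <= marker.
    return obj <= marker

# Checking against the scan cursor instead of the completion marker
# makes "c" look backfilled while its push may still be in flight.
print(peer_has("c", last_backfill))   # False -- the correct answer
print(peer_has("c", backfill_pos))    # True  -- the confusion
```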
#17 Updated by Samuel Just over 9 years ago
Removed the teuthology workaround as well.