Fix #6116 (closed)
osd: incomplete pg from thrashing on next
Description
... u'overall_status': u'HEALTH_WARN', u'summary': [{u'severity': u'HEALTH_WARN', u'summary': u'1 pgs incomplete'}]} ...
ubuntu@teuthology:/a/teuthology-2013-08-24_14:13:07-rados-next-testing-basic-plana/4228$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: c2f29906882bd30794da6993e755a0dab2b7a665
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject internal delays: 0.002
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        osd map cache size: 1
    fs: ext4
    log-whitelist:
    - slow request
    sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
  s3tests:
    branch: next
  workunit:
    sha1: 4b529c8bceea98aaf69dceec3a4d1a239036d5d7
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: next
Updated by Sage Weil over 10 years ago
ubuntu@teuthology:/a/teuthology-2013-08-26_15:47:58-rados-next-testing-basic-plana/6694
cluster is still hung
Updated by Sage Weil over 10 years ago
ubuntu@teuthology:/a/teuthology-2013-08-28_01:00:04-rados-master-testing-basic-plana/10150
Updated by Samuel Just over 10 years ago
time: 2717s
log: http://qa-proxy.ceph.com/teuthology/teuthology-2013-09-09_20:00:20-rados-dumpling-testing-basic-plana/27708/
failed to become clean before timeout expired
Hung
2013-09-09 22:33:32.641520 mon.0 10.214.131.15:6789/0 3025 : [INF] pgmap v1622: 172 pgs: 171 active+clean, 1 incomplete; 21590 bytes data, 892 MB used, 2174 GB / 2291 GB avail
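For anyone triaging a similar "failed to become clean" hang, the pgmap summary line above can be checked mechanically. The following is only a rough sketch: `parse_pgmap` and `not_clean` are hypothetical helper names, not teuthology or Ceph functions, and the regular expression assumes the line layout shown in this one log entry.

```python
import re

# Hypothetical helper (not part of teuthology or Ceph): pull the per-state
# PG counts out of a mon "pgmap" summary line like the one quoted above.
PGMAP_RE = re.compile(r"pgmap v(\d+): (\d+) pgs: ([^;]+);")


def parse_pgmap(line):
    """Return (version, total_pgs, {state: count}), or None if no match."""
    m = PGMAP_RE.search(line)
    if m is None:
        return None
    states = {}
    for part in m.group(3).split(","):
        count, state = part.strip().split(" ", 1)
        states[state] = int(count)
    return int(m.group(1)), int(m.group(2)), states


def not_clean(states):
    """Return only the states that are not exactly active+clean."""
    return {s: n for s, n in states.items() if s != "active+clean"}


line = (
    "2013-09-09 22:33:32.641520 mon.0 10.214.131.15:6789/0 3025 : [INF] "
    "pgmap v1622: 172 pgs: 171 active+clean, 1 incomplete; 21590 bytes data, "
    "892 MB used, 2174 GB / 2291 GB avail"
)
version, total, states = parse_pgmap(line)
print(version, total, not_clean(states))
```

Run against successive pgmap lines, this shows whether the non-clean set ever shrinks before the timeout.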
Updated by Samuel Just over 10 years ago
Hmm, the last osd log entry indicates that the pg in question may have gone clean?
2013-09-09 22:27:19.022997 7f1724a49700 5 osd.3 pg_epoch: 1049 pg[2.1e( empty local-les=1044 n=0 ec=1 les/c 1044/972 1043/1043/1043) [3,0] r=0 lpr=1043 pi=791-1042/8 bft=0 mlcod 0'0 active] enter Started/Primary/Active/Clean
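When reading entries like this one, it helps to filter state-machine transitions by pg id so that a transition on one pg is not mistaken for another. A minimal sketch, assuming the `pg[<pgid>( ... ] enter <state>` layout seen in the line above; `transitions_for` is a hypothetical name and the pattern has only been checked against this single entry.

```python
import re

# Assumed layout, taken from the one OSD log line quoted above:
#   ... pg[<pgid>( ... ) ... ] enter <State/Path>
ENTER_RE = re.compile(r"pg\[(?P<pgid>\d+\.[0-9a-f]+)\(.*\benter (?P<state>\S+)")


def transitions_for(lines, pgid):
    """Return the states entered by the given pg, in log order."""
    out = []
    for line in lines:
        m = ENTER_RE.search(line)
        if m and m.group("pgid") == pgid:
            out.append(m.group("state"))
    return out


osd_line = (
    "2013-09-09 22:27:19.022997 7f1724a49700 5 osd.3 pg_epoch: 1049 "
    "pg[2.1e( empty local-les=1044 n=0 ec=1 les/c 1044/972 1043/1043/1043) "
    "[3,0] r=0 lpr=1043 pi=791-1042/8 bft=0 mlcod 0'0 active] "
    "enter Started/Primary/Active/Clean"
)
print(transitions_for([osd_line], "2.1e"))
print(transitions_for([osd_line], "1.3f"))
```

The second call returns an empty list, because the quoted entry concerns pg 2.1e only.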
Updated by Samuel Just over 10 years ago
The task was in the process of letting the cluster recover with osd.2 down.
Updated by Samuel Just over 10 years ago
There appear to be no pgs in incomplete state according to the osd log. Issue notifying the mon?
Updated by Samuel Just over 10 years ago
From the mon logs, the last reported state seems to be:
2013-09-09 22:31:28.047555 7f56db94d700 15 mon.a@0(leader).pg v1614 got 1.3f reported at 1348:307 state incomplete -> incomplete
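To find the last state the mon recorded for each pg, lines of this shape can be scanned in order. This is a sketch under assumptions: `last_reports` is a hypothetical helper, and the group names `epoch` and `seq` are my own labels for the two numbers in `1348:307`, not terms taken from this log.

```python
import re

# Assumed layout, taken from the one mon log line quoted above:
#   got <pgid> reported at <n>:<n> state <old> -> <new>
# The group names "epoch" and "seq" are illustrative labels only.
REPORT_RE = re.compile(
    r"got (?P<pgid>\d+\.[0-9a-f]+) reported at (?P<epoch>\d+):(?P<seq>\d+) "
    r"state (?P<old>\S+) -> (?P<new>\S+)"
)


def last_reports(lines):
    """Map each pgid to the fields of its last matching report line."""
    latest = {}
    for line in lines:
        m = REPORT_RE.search(line)
        if m:
            latest[m.group("pgid")] = m.groupdict()
    return latest


mon_line = (
    "2013-09-09 22:31:28.047555 7f56db94d700 15 mon.a@0(leader).pg v1614 "
    "got 1.3f reported at 1348:307 state incomplete -> incomplete"
)
reports = last_reports([mon_line])
print(reports["1.3f"]["old"], "->", reports["1.3f"]["new"])
```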
Updated by Samuel Just over 10 years ago
1.3f does appear to be incomplete in the osd log
Updated by Samuel Just over 10 years ago
Ok, there are enough logs to confirm that this is the primary-thinks-it's-clean vs backfill-peer-thinks-it's-clean race.
Updated by Samuel Just over 10 years ago
- Tracker changed from Bug to Fix
- Target version set to v0.70
Updated by Samuel Just over 10 years ago
The workaround I put into teuthology was inadequate. I'm going to put this in the backlog and downgrade it, now that it should stop messing up the nightlies.
Updated by Samuel Just over 10 years ago
- Story points set to 5.0
Updated by Samuel Just over 10 years ago
- Status changed from New to Resolved
I was way off on this one. We do ack the backfill completion. I suspect that the actual problem was fixed by the 6585 fixes (backfill_pos vs last_backfill confusion).
Updated by Samuel Just over 10 years ago
Removed the teuthology workaround as well.