Bug #12687
osd thrashing + pg import/export can cause maybe_went_rw intervals to be missed
Description
The symptom is:
2015-08-12T16:29:07.660 INFO:tasks.rados.rados.0.burnupi49.stderr:1396: oid 174 version is 266 and expected 1350
2015-08-12T16:29:07.660 INFO:tasks.rados.rados.0.burnupi49.stderr:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7f09737fe700 time 2015-08-12 16:29:07.681113
2015-08-12T16:29:07.661 INFO:tasks.rados.rados.0.burnupi49.stderr:./test/osd/RadosModel.h: 1146: FAILED assert(version == old_value.version)
pg maps to [0,5]
does updates to 100
pg maps to [0]
does updates to 150
osd.0 stopped
pg maps to [4,5] (or whatever)
pg 1.1 exported from osd.0
pg 1.1 imported to osd.1
pg primary queries osd.0, gets an empty info, and assumes osd.5's version 100 is the latest
pg goes active
osd.1 comes up with version 150, which is treated as divergent and ignored
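The sequence above can be replayed with a toy model. This is only a sketch of the reasoning, not Ceph's peering code: `probe`, `choose_latest`, `replay_bug`, and the `EMPTY` marker are hypothetical names, and "pick the highest version among the OSDs that answered" is a simplifying assumption standing in for the real log selection.

```python
from typing import Dict, Iterable, Optional, Set

EMPTY = 0  # assumed marker: version reported by an OSD holding no copy of the PG


def probe(osd: int, up: Set[int], store: Dict[int, int]) -> Optional[int]:
    """Return the PG version an OSD reports, or None if the OSD is down."""
    if osd not in up:
        return None
    return store.get(osd, EMPTY)


def choose_latest(prior_osds: Iterable[int], up: Set[int], store: Dict[int, int]) -> int:
    """Pick the highest version among the prior-interval OSDs that answered."""
    answered = [v for v in (probe(o, up, store) for o in prior_osds) if v is not None]
    return max(answered)


def replay_bug() -> tuple:
    # osd.5 stopped at version 100; osd.0 continued alone to version 150.
    store = {0: 150, 5: 100}
    up = {4, 5}                # osd.0 stopped, pg maps to [4,5]
    store[1] = store.pop(0)    # pg exported from osd.0, imported to osd.1 (still down)
    up.add(0)                  # osd.0 answers probes again, but with an empty info
    chosen = choose_latest({0, 5}, up, store)
    up.add(1)                  # osd.1 comes up too late, holding version 150
    return chosen, store[1]


if __name__ == "__main__":
    chosen, late = replay_bug()
    print(f"pg went active at version {chosen}; osd.1 later shows up with {late}")
```

In this model the empty reply from osd.0 is indistinguishable from "nothing was written while osd.0 was alone in the acting set", which is why the updates between 100 and 150 are lost.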
see pg 1.1 in this run: /a/sage-2015-08-12_12:21:53-rados:thrash-wip-newstore-sort-distro-basic-multi/1012192
Updated by Samuel Just over 8 years ago
ubuntu@teuthology:/a/samuelj-2015-08-19_00:45:42-rados-wip-sam-working-distro-basic-multi/1022696/remote
Similar pattern, results in osd/PG.cc: 290: FAILED assert(info.last_epoch_started >= info.history.last_epoch_started)
0.14 is moved from 0 to 3. 0 is queried and has a useless history.les, so we go with a master log from a prior interval.
Updated by Sage Weil over 8 years ago
Maybe/probably also 1/d6c871c1/plana798404-412/head in pg 1.1, run /a/sage-2015-08-26_09:07:57-rados-wip-sage-testing---basic-multi/1033682. It reverted to a prior version.
Updated by Sage Weil over 8 years ago
Here's one possible workaround: leave the exporting OSD down until the importing OSD is up and the PG has completely peered. This ensures that we never draw the wrong conclusion by probing the exporting OSD and getting a 'dne' response.
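A rough sketch of that ordering, as it might look in a test thrasher. Everything here is hypothetical: `move_pg_with_workaround` and the four callbacks (`export_pg`, `import_pg`, `revive_osd`, `pg_is_peered`) are placeholders the caller would supply, not existing teuthology or Ceph functions, and the source OSD is assumed to be stopped already.

```python
import time
from typing import Callable


def move_pg_with_workaround(
    pgid: str,
    src: int,
    dst: int,
    *,
    export_pg: Callable[[int, str], None],
    import_pg: Callable[[int, str], None],
    revive_osd: Callable[[int], None],
    pg_is_peered: Callable[[str], bool],
    timeout: float = 300.0,
    poll: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Move a PG from src to dst, keeping src down until the PG has peered."""
    export_pg(src, pgid)   # src is assumed to be stopped already
    import_pg(dst, pgid)
    revive_osd(dst)        # bring up the importing OSD first

    deadline = clock() + timeout
    while not pg_is_peered(pgid):
        if clock() >= deadline:
            raise TimeoutError(f"pg {pgid} did not peer within {timeout}s")
        sleep(poll)

    # Only now may the exporting OSD answer probes with an empty info.
    revive_osd(src)
```

The point of the design is the last line: the exporting OSD is revived only after peering completes, so its 'dne' reply can no longer be mistaken for the state of the interval it served alone.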
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Category set to Tests