Project

General

Profile

Bug #19023

ceph_test_rados invalid read caused apparently by lost intervals due to mons trimming past last_epoch_clean for pg 1.16

Added by Samuel Just over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
02/21/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor, OSD
Pull request ID:

Description

samuelj@teuthology:/a/samuelj-2017-02-20_18:45:04-rados-wip-18937---basic-smithi/839771/remote

If you look back in the history of the invalid object, you'll see a delete followed by a few writes. The delete happened in an interval with osds 0 and 3. When osd.4 comes back up, the map containing the mapping in which that delete took place has apparently been trimmed and osd.4 ends up with past_intervals which only mention itself and 3.


Related issues

Related to Ceph - Bug #20431: dump_stuck.py fails assert len(unclean) == num_unclean Resolved 06/27/2017

History

#1 Updated by Greg Farnum over 1 year ago

I assume from your description that this was a dirty interval the monitor shouldn't have trimmed? Or did osd.4 perhaps construct the intervals incorrectly in concluding it was current?

#2 Updated by Samuel Just over 1 year ago

At epoch 255, 1.16 is on [4,3] and is active+clean

2017-02-20 20:45:10.962790 7fd9b7cba700 10 osd.4 pg_epoch: 255 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=163 crt=254'369 lcod 251'367 mlcod 251'367 active+clean snaptrimq=[e4~1]] SnapTrimmer state<Trimming/AwaitAsyncWork>: AwaitAsyncWork: trimming snap e4

At epoch 256, it switches to [3]

2017-02-20 20:45:12.065898 7f75c93f8700 10 osd.3 pg_epoch: 256 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 256/256/256) [3] r=0 lpr=256 pi=141-255/2 crt=254'369 lcod 251'367 mlcod 0'0 peering] handle_peering_event: epoch_sent: 256 epoch_requested: 256 NullEvt

it ultimately goes peered in this interval:

2017-02-20 20:45:14.165201 7f75c7bf5700 10 osd.3 258 do_recovery started 0/1 on pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/257/0 256/256/256) [3] r=0 lpr=256 crt=254'369 lcod 251'367 mlcod 0'0 undersized+degraded+peered snaptrimq=[e5~2]]

At 276, it switches to [3,1]

2017-02-20 20:45:53.261831 7f75c93f8700 5 osd.3 pg_epoch: 276 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/257/0 276/276/256) [3,1] r=0 lpr=276 pi=256-275/1 crt=254'369 lcod 251'367 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo

and eventually goes clean

2017-02-20 20:45:53.695581 7f75c9bf9700 10 osd.3 pg_epoch: 277 pg[1.16( v 277'375 (0'0,277'375] local-les=277 n=3 ec=141 les/c/f 277/277/0 276/276/256) [3,1] r=0 lpr=276 pi=256-275/1 crt=277'375 lcod 277'374 mlcod 277'374 active+clean snaptrimq=[e5~2]] trim_past_intervals: trimming interval(256-275 up [3](3) acting [3](3))
...

2017-02-20 20:45:58.639763 7f75c93f8700 10 osd.3 pg_epoch: 282 pg[1.16( v 278'379 (0'0,278'379] local-les=277 n=1 ec=141 les/c/f 277/277/0 276/276/256) [3,1] r=0 lpr=276 crt=278'379 lcod 278'377 mlcod 278'377 active+clean] handle_advance_map [3]/[3] -- 3/3

at 283 (was active+clean in 282), it goes back to [3]

2017-02-20 20:45:58.640745 7f75c93f8700 5 osd.3 pg_epoch: 283 pg[1.16( v 278'379 (0'0,278'379] local-les=277 n=1 ec=141 les/c/f 277/277/0 283/283/256) [3] r=0 lpr=283 pi=276-282/1 crt=278'379 lcod 278'377 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo

At 289, the up and acting sets become empty.

2017-02-20 20:46:28.602761 7ffa3242c700 10 osd.4 pg_epoch: 289 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 289/289/289) [] r=-1 lpr=289 pi=163-288/2 crt=254'369 lcod 0'0 inactive NOTIFY] up [3] -> [], acting [3] -> [], acting_primary 3 -> -1, up_primary 3 -> -1, role -1 -> -1, features

at 300, it goes to [0]

2017-02-20 20:46:40.163307 7ffa31c2b700 10 osd.4 pg_epoch: 300 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 300/300/300) [0] r=-1 lpr=300 pi=163-299/3 crt=254'369 lcod 0'0 inactive NOTIFY] up [] -> [0], acting [] -> [0], acting_primary ? -> 0, up_primary ? -> 0, role -1 -> -1, features

in 301, it goes to [0]/[0,4]

2017-02-20 20:46:41.742639 7f9e277b2700 10 osd.0 pg_epoch: 301 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/301/300) [0]/[0,4] r=0 lpr=301 pi=163-300/4 crt=0'0 mlcod 0'0 remapped] up [0] -> [0], acting [0] -> [0,4], acting_primary 0 -> 0, up_primary 0 -> 0, role 0 -> 0, features acting 1152323339925389

and activates

2017-02-20 20:46:42.184250 7f9e277b2700 10 osd.0 pg_epoch: 302 pg[1.16( v 254'369 lc 251'364 (0'0,254'369] local-les=302 n=3 ec=141 les/c/f 164/164/0 300/301/300) [0]/[0,4] r=0 lpr=301 pi=163-300/4 crt=254'369 lcod 0'0 mlcod 0'0 activating+degraded+remapped m=3 snaptrimq=[da~1,e5~3]] state<Started/Primary/Active>: peer osd.4 activated and committed

#3 Updated by Samuel Just over 1 year ago

Notably, when it goes active at the end there, it's missing the 10 commits which happened during the [3,1] interval.

#4 Updated by Samuel Just over 1 year ago

2017-02-20 20:46:41.743173 7f9e277b2700 10 osd.0 pg_epoch: 301 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/301/300) [0]/[0,4] r=0 lpr=301 pi=163-300/4 crt=0'0 mlcod 0'0 remapped+peering] PriorSet: build_prior interval(163-283 up [4,3](4) acting [4,3](4) maybe_went_rw)
2017-02-20 20:46:41.743185 7f9e277b2700 10 osd.0 pg_epoch: 301 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/301/300) [0]/[0,4] r=0 lpr=301 pi=163-300/4 crt=0'0 mlcod 0'0 remapped+peering] PriorSet: build_prior prior osd.3 is down

Ok, so 0 has a bogus interval from 163-283 where it thinks the acting set was [4,3].

#5 Updated by Samuel Just over 1 year ago

2017-02-20 20:46:40.165108 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( DNE empty local-les=0 n=0 ec=0 les/c/f 0/0/0 0/0/0) [] r=-1 lpr=0 crt=0'0 inactive] init role 0 up [0] acting [0] history ec=141 les/c/f 164/164/0 300/300/300 3 past_intervals
...
2017-02-20 20:46:40.165312 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/300/300) [0] r=0 lpr=300 pi=163-299/3 crt=0'0 mlcod 0'0 peering] _calc_past_interval_range: already have past intervals back to 164
2017-02-20 20:46:40.165316 7ffa3242c700 10 osd.4 pg_epoch: 300 pg[0.b( empty local-les=216 n=0 ec=1 les/c/f 216/216/0 289/289/289) [] r=-1 lpr=289 pi=216-288/2 crt=0'0 inactive NOTIFY] state<Started>: Started advmap
2017-02-20 20:46:40.165323 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/300/300) [0] r=0 lpr=300 pi=163-299/3 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior interval(289-299 up [](-1) acting [](-1))
2017-02-20 20:46:40.165333 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/300/300) [0] r=0 lpr=300 pi=163-299/3 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior interval(284-288 up [3](3) acting [3](3))
2017-02-20 20:46:40.165342 7ffa3242c700 10 osd.4 pg_epoch: 300 pg[0.b( empty local-les=216 n=0 ec=1 les/c/f 216/216/0 289/289/289) [] r=-1 lpr=289 pi=216-288/2 crt=0'0 inactive NOTIFY] check_recovery_sources no source osds () went down
2017-02-20 20:46:40.165343 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/300/300) [0] r=0 lpr=300 pi=163-299/3 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior interval(163-283 up [4,3](4) acting [4,3](4) maybe_went_rw)
2017-02-20 20:46:40.165351 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/300/300) [0] r=0 lpr=300 pi=163-299/3 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior prior osd.3 is down
2017-02-20 20:46:40.165360 7f9e2ffc3700 10 osd.0 pg_epoch: 300 pg[1.16( empty local-les=0 n=0 ec=141 les/c/f 164/164/0 300/300/300) [0] r=0 lpr=300 pi=163-299/3 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior final: probe 0,4 down 3 blocked_by {}

osd.0 creates its copy of 1.16 in epoch 300 with an incorrect copy of past_intervals from osd.4.

#6 Updated by Samuel Just over 1 year ago

2017-02-20 20:46:28.567065 7ffa3242c700 10 osd.4 pg_epoch: 255 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=255 crt=254'369 lcod 0'0 mlcod 0'0 inactive] handle_advance_map [3]/[3] -- 3/3
2017-02-20 20:46:28.567066 7ffa31c2b700 20 osd.4 284 get_map 257 - loading and decoding 0x7ffa5c2c9600
2017-02-20 20:46:28.567079 7ffa3242c700 20 PGPool::update cached_removed_snaps [1~e7] newly_removed_snaps [da~1,e5~3] snapc e9=[] (updated)
2017-02-20 20:46:28.567084 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=255 crt=254'369 lcod 0'0 mlcod 0'0 inactive] state<Reset>: Reset advmap
2017-02-20 20:46:28.567085 7ffa31c2b700 15 filestore(/var/lib/ceph/osd/ceph-4) read meta/#-1:a6f36935:::osdmap.257:0# 0~0
2017-02-20 20:46:28.567097 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=255 crt=254'369 lcod 0'0 mlcod 0'0 inactive] _calc_past_interval_range start epoch 284 >= end epoch 163, nothing to do
2017-02-20 20:46:28.567104 7ffa3242c700 20 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=255 crt=254'369 lcod 0'0 mlcod 0'0 inactive] new interval newup [3] newacting [3]
2017-02-20 20:46:28.567111 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=255 crt=254'369 lcod 0'0 mlcod 0'0 inactive] state<Reset>: should restart peering, calling start_peering_interval again
2017-02-20 20:46:28.567119 7ffa3242c700 20 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=255 crt=254'369 lcod 0'0 mlcod 0'0 inactive] set_last_peering_reset 284
2017-02-20 20:46:28.567127 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=284 crt=254'369 lcod 0'0 mlcod 0'0 inactive] Clearing blocked outgoing recovery messages
2017-02-20 20:46:28.567135 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [4,3] r=0 lpr=284 crt=254'369 lcod 0'0 mlcod 0'0 inactive] Not blocking outgoing recovery messages
2017-02-20 20:46:28.567150 7ffa4237d700 10 filestore(/var/lib/ceph/osd/ceph-4) remove meta/#-1:a9f6e935:::osdmap.139:0# = 0
2017-02-20 20:46:28.567156 7ffa4237d700 15 filestore(/var/lib/ceph/osd/ceph-4) remove meta/#-1:431ebe63:::inc_osdmap.139:0#
2017-02-20 20:46:28.567152 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [3] r=-1 lpr=284 pi=163-283/1 crt=254'369 lcod 0'0 inactive] start_peering_interval: check_new_interval output: check_new_interval interval(163-283 up [4,3](4) acting [4,3](4)): not rw, up_thru 214 up_from 5 last_epoch_clean 164
2017-02-20 20:46:28.567158 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 163/163/163) [3] r=-1 lpr=284 pi=163-283/1 crt=254'369 lcod 0'0 inactive] noting past interval(163-283 up [4,3](4) acting [4,3](4) maybe_went_rw)
2017-02-20 20:46:28.567166 7ffa3242c700 10 osd.4 pg_epoch: 284 pg[1.16( v 254'369 (0'0,254'369] local-les=164 n=3 ec=141 les/c/f 164/164/0 284/284/284) [3] r=-1 lpr=284 pi=163-283/1 crt=254'369 lcod 0'0 inactive] up [4,3] -> [3], acting [4,3] -> [3], acting_primary 4 -> 3, up_primary 4 -> 3, role 0 -> -1, features acting 1152323339925389307 upacting 1152323339925389307

Here, osd.4 skips from 255 to 284 and ends up skipping over the [3,1] interval. The real question is how the map could have been trimmed to 284 if 282 was the last epoch in which 1.16 was active+clean.

#7 Updated by Samuel Just over 1 year ago

  • Status changed from New to Testing
  • Assignee set to Samuel Just

2017-02-20 20:45:59.104093 7f75c93f8700 10 osd.3 pg_epoch: 284 pg[1.16( v 278'379 (0'0,278'379] local-les=277 n=1 ec=141 les/c/f 277/284/0 283/283/256) [3] r=0 lpr=283 pi=276-282/1 crt=278'379 lcod 278'377 mlcod 0'0 undersized+degraded+peered] trim_past_intervals: trimming interval(276-282 up [3,1](3) acting [3,1](3)

Found it. osd.3 even though it wasn't active set last_epoch_clean to the current epoch when "recovery" "completed". Simple enough fix.

#8 Updated by Samuel Just over 1 year ago

Well, sort of. last_epoch_clean is really about when we can forget OSDMaps. Should we retain OSDMaps on the mon (and past_intervals locally) until the PG is clean, or just until we have enough "clean" replicas, or just until we have any clean acting set? Basically, which of
1)active+clean (up == acting, acting is large enough, all recovery is complete)
2)active+[remapped]+!degraded+"clean" (up may or may not be acting, acting is large enough, all recovery is complete though clean isn't set)
3)active+[remapped]+"clean" (up may or may not be acting, acting may not be large enough, all recovery is complete though clean isn't set)

mark_clean() seems to think 3 should be the case, but it's not clear that makes sense and pg_stat_t::get_effective_last_epoch_clean() doesn't actually behave right if clean isn't set.

For now, I'll update mark_clean() to match get_effective_last_epoch_clean(). However, this will tend to cause clusters with remapped pgs to not trim maps.

#10 Updated by Sage Weil over 1 year ago

  • Status changed from Testing to Need Review
  • Assignee changed from Samuel Just to Sage Weil

#11 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Component(RADOS) Monitor, OSD added

#12 Updated by Kefu Chai over 1 year ago

  • Status changed from Need Review to Resolved

#13 Updated by Sage Weil over 1 year ago

That fix (d24a8886658c2d8882275d69c6409717a62701be and 31d3ae8a878f7ede6357f602852d586e0621c73f) was not quite complete, breaks dump_stuck.py test, see #20431

#14 Updated by Sage Weil over 1 year ago

  • Related to Bug #20431: dump_stuck.py fails assert len(unclean) == num_unclean added

Also available in: Atom PDF