Actions
Bug #1062
closedpg_refactor: active primary is getting MPGLog
Added by Greg Farnum almost 13 years ago. Updated almost 13 years ago.
% Done:
0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Logs are in kai:~gregf/logs/pg_refactor/active_gets_pglog
Same steps to reproduce as #1058.
Causes a crash in the state machine since it shouldn't be getting logs there, obviously!
Updated by Sage Weil almost 13 years ago
I have a slightly different crash, but i think it's the same problem. see this:
2011-05-04 16:35:04.610534 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_info info 309 2011-05-04 16:35:04.610552 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_info bigbl 8 2011-05-04 16:35:04.610566 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_log 2011-05-04 16:35:04.610582 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_log to 0~0 2011-05-04 16:35:04.610595 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 inactive] clear_primary_state 2011-05-04 16:35:04.610622 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_info info 309 2011-05-04 16:35:04.610641 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_info bigbl 8 2011-05-04 16:35:04.610655 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_log 2011-05-04 16:35:04.610671 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_log to 0~0 2011-05-04 16:35:04.610681 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] is new 2011-05-04 16:35:04.610691 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] handle_create 2011-05-04 16:35:04.610705 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] exit Initial 0.000225 0 0.000000 2011-05-04 16:35:04.610720 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Started 2011-05-04 16:35:04.610732 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Start 2011-05-04 16:35:04.610743 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] state<Start>: transitioning to Primary 2011-05-04 16:35:04.610760 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] exit Start 0.000026 0 0.000000 2011-05-04 16:35:04.610774 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Started/Primary 2011-05-04 16:35:04.610792 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Started/Primary/Peering 2011-05-04 16:35:04.610807 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Peering/GetInfo 2011-05-04 16:35:04.610819 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] generate_past_intervals over epochs 2-2 2011-05-04 16:35:04.610863 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] generate_past_intervals interval(2-2 [0]/[0] maybe_went_rw) : primary up 2-0 2011-05-04 16:35:04.610883 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] PgPriorSet: build_prior 0 have started since joining this pg 2011-05-04 16:35:04.610895 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] PgPriorSet: build_prior interval(2-2 [0]/[0] maybe_went_rw) 2011-05-04 16:35:04.610907 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] 0x7fbd3f3cbbb8 2011-05-04 16:35:04.610920 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] build_prior: pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] 2011-05-04 16:35:04.610932 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetInfo>: PG::do_peer: peer prior_set is [[ cur=0, down=, lost= ]] 2011-05-04 16:35:04.610943 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] PgPriorSet: querying info from osd0 2011-05-04 16:35:04.610966 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] handle_notify 0.2( empty n=0 ec=2 les=2 3/3/3) from osd0 2011-05-04 16:35:04.610985 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary>: handle_pg_notify from 0 2011-05-04 16:35:04.611010 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] update_stats 3'1 2011-05-04 16:35:04.611028 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering/GetInfo 0.000220 1 0.000262 2011-05-04 16:35:04.611043 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Peering/GetLog 2011-05-04 16:35:04.611055 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetLog>: In GetLog, selecting log location 2011-05-04 16:35:04.611070 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] choose_acting osd0 can catch up with osd1 log 2011-05-04 16:35:04.611081 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] choose_acting want [1,0] (== acting) 2011-05-04 16:35:04.611092 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetLog>: GetLog: neither backlog nor master log needed, moving to GetMissing 2011-05-04 16:35:04.611109 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetLog>: leaving GetLog 2011-05-04 16:35:04.611122 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering/GetLog 0.000078 0 0.000000 2011-05-04 16:35:04.611136 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Peering/GetMissing 2011-05-04 16:35:04.611159 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering/GetMissing 0.000021 0 0.000000 2011-05-04 16:35:04.611173 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering>: Leaving Peering 2011-05-04 16:35:04.611191 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering 0.000398 0 0.000000 2011-05-04 16:35:04.611207 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Active 2011-05-04 16:35:04.611219 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Active>: In Active, about to call activate 2011-05-04 16:35:04.611230 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] cancel_generate_backlog 2011-05-04 16:35:04.611255 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info info 309 2011-05-04 16:35:04.611274 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info bigbl 38 2011-05-04 16:35:04.611289 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_log 2011-05-04 16:35:04.611305 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_log to 0~0 2011-05-04 16:35:04.611315 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] clean_up_local 2011-05-04 16:35:04.611327 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate - snap_trimq [] 2011-05-04 16:35:04.611349 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info info 309 2011-05-04 16:35:04.611367 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info bigbl 38 2011-05-04 16:35:04.611382 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate - complete 2011-05-04 16:35:04.611395 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 0.2( empty n=0 ec=2 les=2 3/3/3) 2011-05-04 16:35:04.611405 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 is up to date, but sending pg_log anyway 2011-05-04 16:35:04.611427 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 sending log(0'0,0'0] missing(0) 2011-05-04 16:35:04.611438 7fbd3f3cd700 -- 10.0.1.201:6805/28909 --> osd0 10.0.1.201:6801/28778 -- pg_log(0.2 e3) v1 -- ?+0 0x2e3ec00 2011-05-04 16:35:04.611457 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 already uptodate, 0.2( empty n=0 ec=2 les=2 3/3/3) 2011-05-04 16:35:04.611469 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] is_all_uptodate: everyone is uptodate 2011-05-04 16:35:04.611479 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] finish_recovery 2011-05-04 16:35:04.611489 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] share_pg_info 2011-05-04 16:35:04.611500 7fbd3f3cd700 -- 10.0.1.201:6805/28909 --> osd0 10.0.1.201:6801/28778 -- pg_info(1 pgs e3) v1 -- ?+0 0x2e2dc40 2011-05-04 16:35:04.611522 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] clear_recovery_state 2011-05-04 16:35:04.611548 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] write_info info 309 2011-05-04 16:35:04.611568 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] write_info bigbl 38 2011-05-04 16:35:04.611586 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] update_stats 3'2 2011-05-04 16:35:04.611598 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] state<Started/Primary/Active>: Activate Finished
basically we go through a few different states.. create sends out infos, but right after that we process the notify we just got (and now no longer need to wait for a reply)...do some other stuff.. and eventually activate.
I think the problem is that get_or_create_pg shouldn't actually do anything yet.. it should end up in an intermediate state, because we only call it go get/create a pg we're about to feed more information to.
Updated by Greg Farnum almost 13 years ago
- Status changed from In Progress to Resolved
Fixed my bug with 6ac5572946337e8e74b5014fc299e0ee10c70d9f and I think Sage already got his, too.
Actions