Project

General

Profile

Bug #1062

pg_refactor: active primary is getting MPGLog

Added by Greg Farnum almost 9 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSD
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Logs are in kai:~gregf/logs/pg_refactor/active_gets_pglog

Same steps to reproduce as #1058.

Causes a crash in the state machine since it shouldn't be getting logs there, obviously!

History

#1 Updated by Sage Weil almost 9 years ago

I have a slightly different crash, but i think it's the same problem. see this:


2011-05-04 16:35:04.610534 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_info info 309
2011-05-04 16:35:04.610552 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_info bigbl 8
2011-05-04 16:35:04.610566 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_log
2011-05-04 16:35:04.610582 7fbd3f3cd700 osd1 3 pg[0.2( DNE empty n=0 ec=0 les=0 0/0/0) [] r=0 mlcod 0'0 inactive] write_log to 0~0
2011-05-04 16:35:04.610595 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 inactive] clear_primary_state
2011-05-04 16:35:04.610622 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_info info 309
2011-05-04 16:35:04.610641 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_info bigbl 8
2011-05-04 16:35:04.610655 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_log
2011-05-04 16:35:04.610671 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] write_log to 0~0
2011-05-04 16:35:04.610681 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] is new
2011-05-04 16:35:04.610691 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] handle_create
2011-05-04 16:35:04.610705 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] exit Initial 0.000225 0 0.000000
2011-05-04 16:35:04.610720 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Started
2011-05-04 16:35:04.610732 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Start
2011-05-04 16:35:04.610743 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] state<Start>: transitioning to Primary
2011-05-04 16:35:04.610760 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] exit Start 0.000026 0 0.000000
2011-05-04 16:35:04.610774 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Started/Primary
2011-05-04 16:35:04.610792 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml inactive] enter Started/Primary/Peering
2011-05-04 16:35:04.610807 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Peering/GetInfo
2011-05-04 16:35:04.610819 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] generate_past_intervals over epochs 2-2
2011-05-04 16:35:04.610863 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] generate_past_intervals interval(2-2 [0]/[0] maybe_went_rw) : primary up 2-0
2011-05-04 16:35:04.610883 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] PgPriorSet: build_prior 0 have started since joining this pg
2011-05-04 16:35:04.610895 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] PgPriorSet: build_prior interval(2-2 [0]/[0] maybe_went_rw)
2011-05-04 16:35:04.610907 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] 0x7fbd3f3cbbb8
2011-05-04 16:35:04.610920 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] build_prior: pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] 
2011-05-04 16:35:04.610932 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetInfo>: PG::do_peer: peer prior_set is [[ cur=0, down=, lost= ]]
2011-05-04 16:35:04.610943 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] PgPriorSet:  querying info from osd0
2011-05-04 16:35:04.610966 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] handle_notify 0.2( empty n=0 ec=2 les=2 3/3/3) from osd0
2011-05-04 16:35:04.610985 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary>: handle_pg_notify from 0
2011-05-04 16:35:04.611010 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] update_stats 3'1
2011-05-04 16:35:04.611028 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering/GetInfo 0.000220 1 0.000262
2011-05-04 16:35:04.611043 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Peering/GetLog
2011-05-04 16:35:04.611055 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetLog>: In GetLog, selecting log location
2011-05-04 16:35:04.611070 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] choose_acting  osd0 can catch up with osd1 log
2011-05-04 16:35:04.611081 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] choose_acting want [1,0] (== acting)
2011-05-04 16:35:04.611092 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetLog>: GetLog: neither backlog nor master log needed, moving to GetMissing
2011-05-04 16:35:04.611109 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering/GetLog>: leaving GetLog
2011-05-04 16:35:04.611122 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering/GetLog 0.000078 0 0.000000
2011-05-04 16:35:04.611136 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Peering/GetMissing
2011-05-04 16:35:04.611159 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering/GetMissing 0.000021 0 0.000000
2011-05-04 16:35:04.611173 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Peering>: Leaving Peering
2011-05-04 16:35:04.611191 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] exit Started/Primary/Peering 0.000398 0 0.000000
2011-05-04 16:35:04.611207 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] enter Started/Primary/Active
2011-05-04 16:35:04.611219 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml peering] state<Started/Primary/Active>: In Active, about to call activate
2011-05-04 16:35:04.611230 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] cancel_generate_backlog
2011-05-04 16:35:04.611255 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info info 309
2011-05-04 16:35:04.611274 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info bigbl 38
2011-05-04 16:35:04.611289 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_log
2011-05-04 16:35:04.611305 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_log to 0~0
2011-05-04 16:35:04.611315 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] clean_up_local
2011-05-04 16:35:04.611327 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate - snap_trimq []
2011-05-04 16:35:04.611349 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info info 309
2011-05-04 16:35:04.611367 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] write_info bigbl 38
2011-05-04 16:35:04.611382 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate - complete
2011-05-04 16:35:04.611395 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 0.2( empty n=0 ec=2 les=2 3/3/3)
2011-05-04 16:35:04.611405 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 is up to date, but sending pg_log anyway
2011-05-04 16:35:04.611427 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 sending log(0'0,0'0] missing(0)
2011-05-04 16:35:04.611438 7fbd3f3cd700 -- 10.0.1.201:6805/28909 --> osd0 10.0.1.201:6801/28778 -- pg_log(0.2 e3) v1 -- ?+0 0x2e3ec00
2011-05-04 16:35:04.611457 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] activate peer osd0 already uptodate, 0.2( empty n=0 ec=2 les=2 3/3/3)
2011-05-04 16:35:04.611469 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] is_all_uptodate: everyone is uptodate
2011-05-04 16:35:04.611479 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active] finish_recovery
2011-05-04 16:35:04.611489 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] share_pg_info
2011-05-04 16:35:04.611500 7fbd3f3cd700 -- 10.0.1.201:6805/28909 --> osd0 10.0.1.201:6801/28778 -- pg_info(1 pgs e3) v1 -- ?+0 0x2e2dc40
2011-05-04 16:35:04.611522 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] clear_recovery_state
2011-05-04 16:35:04.611548 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] write_info info 309
2011-05-04 16:35:04.611568 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] write_info bigbl 38
2011-05-04 16:35:04.611586 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] update_stats 3'2
2011-05-04 16:35:04.611598 7fbd3f3cd700 osd1 3 pg[0.2( empty n=0 ec=2 les=2 3/3/3) [1,0] r=0 mlcod 0'0 !hml active+clean] state<Started/Primary/Active>: Activate Finished

basically we go through a few different states.. create sends out infos, but right after that we process the notify we just got (and now no longer need to wait for a reply)...do some other stuff.. and eventually activate.

I think the problem is that get_or_create_pg shouldn't actually do anything yet.. it should end up in an intermediate state, because we only call it go get/create a pg we're about to feed more information to.

#2 Updated by Sage Weil almost 9 years ago

  • Target version set to v0.28

#3 Updated by Greg Farnum almost 9 years ago

  • Status changed from In Progress to Resolved

Fixed my bug with 6ac5572946337e8e74b5014fc299e0ee10c70d9f and I think Sage already got his, too.

Also available in: Atom PDF