Bug #7922
closedosd: multi-backfill reservation does not release on reject
0%
Description
Several OSDs are crashing quite frequently with this error, at least on version 0.77 and 0.78 (possibly earlier as well.)
Log file for the entire life of osd.11, showing several such failures (111 MiB): http://www.aarontc.com/logs/ceph-osd.11.log
I've modified the log settings to include 'ms = 1, osd = 20, filestore = 20' and will continue logging in case that will help narrow down the problem.
JSON osdmap as requested by joshd:
{ "epoch":46935, "fsid":"34aad620-aacf-424b-9b52-27b8580bdef5", "created":"2013-09-28 01:15:40.854897", "modified":"2014-03-31 13:30:59.635115", "flags":"noout", "cluster_snapshot":"", "pool_max":3, "max_osd":23, "pools":[ { "pool":2, "pool_name":"rbd", "flags":0, "flags_names":"", "type":1, "size":2, "min_size":1, "crush_ruleset":2, "object_hash":2, "pg_num":1200, "pg_placement_num":1200, "crash_replay_interval":0, "last_change":"24332", "auid":0, "snap_mode":"selfmanaged", "snap_seq":5, "snap_epoch":24332, "pool_snaps":[ ], "removed_snaps":"[1~5]", "quota_max_bytes":0, "quota_max_objects":0, "tiers":[ ], "tier_of":-1, "read_tier":-1, "write_tier":-1, "cache_mode":"none", "target_max_bytes":0, "target_max_objects":0, "cache_target_dirty_ratio_micro":0, "cache_target_full_ratio_micro":0, "cache_min_flush_age":0, "cache_min_evict_age":0, "properties":{ }, "hit_set_params":{ "type":"none" }, "hit_set_period":0, "hit_set_count":0, "stripe_width":0 }, { "pool":3, "pool_name":"vm", "flags":0, "flags_names":"", "type":1, "size":4, "min_size":2, "crush_ruleset":0, "object_hash":2, "pg_num":1000, "pg_placement_num":1000, "crash_replay_interval":0, "last_change":"43647", "auid":0, "snap_mode":"selfmanaged", "snap_seq":3, "snap_epoch":40541, "pool_snaps":[ ], "removed_snaps":"[1~3]", "quota_max_bytes":0, "quota_max_objects":0, "tiers":[ ], "tier_of":-1, "read_tier":-1, "write_tier":-1, "cache_mode":"none", "target_max_bytes":0, "target_max_objects":0, "cache_target_dirty_ratio_micro":0, "cache_target_full_ratio_micro":0, "cache_min_flush_age":0, "cache_min_evict_age":0, "properties":{ }, "hit_set_params":{ "type":"none" }, "hit_set_period":0, "hit_set_count":0, "stripe_width":0 } ], "osds":[ { "osd":0, "uuid":"7c6eddc7-9e12-4f6c-a890-bf884d979d19", "up":0, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":46912, "last_clean_end":46919, "up_from":46923, "up_thru":46930, "down_at":46932, "lost_at":0, "public_addr":"10.42.6.29:6800\/906", "cluster_addr":"10.43.100.29:6800\/906", "heartbeat_back_addr":"10.43.100.29:6801\/906", "heartbeat_front_addr":"10.42.6.29:6801\/906", "state":[ "exists" ] }, { "osd":1, "uuid":"22cf6928-bb56-41a2-912a-885eb2745a63", "up":1, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":45968, "last_clean_end":45983, "up_from":46141, "up_thru":46933, "down_at":45985, "lost_at":0, "public_addr":"10.42.6.29:6804\/15608", "cluster_addr":"10.43.100.29:6803\/15608", "heartbeat_back_addr":"10.43.100.29:6804\/15608", "heartbeat_front_addr":"10.42.6.29:6805\/15608", "state":[ "exists", "up" ] }, { "osd":3, "uuid":"97a22758-77e7-41f7-94fc-773573b3cf82", "up":1, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":45948, "last_clean_end":45954, "up_from":46188, "up_thru":46934, "down_at":45956, "lost_at":0, "public_addr":"10.42.6.29:6809\/22543", "cluster_addr":"10.43.100.29:6805\/22543", "heartbeat_back_addr":"10.43.100.29:6807\/22543", "heartbeat_front_addr":"10.42.6.29:6810\/22543", "state":[ "exists", "up" ] }, { "osd":4, "uuid":"e1031378-64be-4f09-a990-4d816096aa36", "up":0, "in":1, "weight":"0.823349", "primary_affinity":"1.000000", "last_clean_begin":45558, "last_clean_end":46608, "up_from":46885, "up_thru":46887, "down_at":46893, "lost_at":0, "public_addr":"10.42.6.31:6800\/3080", "cluster_addr":"10.43.100.31:6800\/3080", "heartbeat_back_addr":"10.43.100.31:6801\/3080", "heartbeat_front_addr":"10.42.6.31:6801\/3080", "state":[ "exists" ] }, { "osd":5, "uuid":"9bd0b619-35bf-4797-b15b-5bec36f3b250", "up":1, "in":1, "weight":"0.779297", "primary_affinity":"1.000000", "last_clean_begin":43005, "last_clean_end":45521, "up_from":45545, "up_thru":46933, "down_at":45522, "lost_at":0, "public_addr":"10.42.6.31:6815\/31069", "cluster_addr":"10.43.100.31:6812\/31069", "heartbeat_back_addr":"10.43.100.31:6813\/31069", "heartbeat_front_addr":"10.42.6.31:6816\/31069", "state":[ "exists", "up" ] }, { "osd":6, "uuid":"161dff6b-ec99-49b0-a650-b4618b46ccd5", "up":1, "in":1, "weight":"0.500000", "primary_affinity":"1.000000", "last_clean_begin":42979, "last_clean_end":44836, "up_from":44837, "up_thru":46933, "down_at":44826, "lost_at":0, "public_addr":"10.42.6.32:6803\/6766", "cluster_addr":"10.43.100.32:6804\/7006766", "heartbeat_back_addr":"10.43.100.32:6805\/7006766", "heartbeat_front_addr":"10.42.6.32:6806\/7006766", "state":[ "exists", "up" ] }, { "osd":7, "uuid":"991d93e3-3b08-4ca4-a656-f9a850bbab19", "up":1, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":42961, "last_clean_end":45271, "up_from":45580, "up_thru":46932, "down_at":45278, "lost_at":0, "public_addr":"10.42.6.29:6812\/27694", "cluster_addr":"10.43.100.29:6808\/27694", "heartbeat_back_addr":"10.43.100.29:6809\/27694", "heartbeat_front_addr":"10.42.6.29:6813\/27694", "state":[ "exists", "up" ] }, { "osd":8, "uuid":"3d1d4058-c8cf-4c49-9f99-3d073a01ce0f", "up":1, "in":0, "weight":"0.000000", "primary_affinity":"1.000000", "last_clean_begin":42655, "last_clean_end":42934, "up_from":42952, "up_thru":36806, "down_at":42948, "lost_at":0, "public_addr":"10.42.6.29:6815\/7153", "cluster_addr":"10.43.100.29:6810\/7153", "heartbeat_back_addr":"10.43.100.29:6811\/7153", "heartbeat_front_addr":"10.42.6.29:6816\/7153", "state":[ "exists", "up" ] }, { "osd":9, "uuid":"af352212-e20a-46de-b6b8-138073fb049b", "up":1, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":42966, "last_clean_end":45058, "up_from":45579, "up_thru":46934, "down_at":45066, "lost_at":0, "public_addr":"10.42.6.29:6818\/27886", "cluster_addr":"10.43.100.29:6812\/27886", "heartbeat_back_addr":"10.43.100.29:6813\/27886", "heartbeat_front_addr":"10.42.6.29:6819\/27886", "state":[ "exists", "up" ] }, { "osd":10, "uuid":"de00daac-fd9a-480f-bdc1-1e82ebca988d", "up":1, "in":1, "weight":"0.714142", "primary_affinity":"1.000000", "last_clean_begin":42974, "last_clean_end":45525, "up_from":45532, "up_thru":46934, "down_at":45526, "lost_at":0, "public_addr":"10.42.6.31:6802\/30266", "cluster_addr":"10.43.100.31:6804\/30266", "heartbeat_back_addr":"10.43.100.31:6805\/30266", "heartbeat_front_addr":"10.42.6.31:6805\/30266", "state":[ "exists", "up" ] }, { "osd":11, "uuid":"d99ac1b4-c8e5-489f-aa19-a9be63d5c3c8", "up":0, "in":1, "weight":"0.858551", "primary_affinity":"1.000000", "last_clean_begin":46227, "last_clean_end":46606, "up_from":46843, "up_thru":46862, "down_at":46882, "lost_at":0, "public_addr":"10.42.6.33:6800\/10089", "cluster_addr":"10.43.100.33:6800\/10089", "heartbeat_back_addr":"10.43.100.33:6801\/10089", "heartbeat_front_addr":"10.42.6.33:6801\/10089", "state":[ "exists" ] }, { "osd":12, "uuid":"2c8b7c37-0bc6-4e41-a3b7-52b9a758453b", "up":1, "in":1, "weight":"0.731506", "primary_affinity":"1.000000", "last_clean_begin":42970, "last_clean_end":45523, "up_from":45530, "up_thru":46932, "down_at":45524, "lost_at":0, "public_addr":"10.42.6.31:6807\/30515", "cluster_addr":"10.43.100.31:6806\/30515", "heartbeat_back_addr":"10.43.100.31:6807\/30515", "heartbeat_front_addr":"10.42.6.31:6808\/30515", "state":[ "exists", "up" ] }, { "osd":13, "uuid":"83c6d0f6-0cf9-4948-8d1e-f782e91a68b1", "up":0, "in":1, "weight":"0.913635", "primary_affinity":"1.000000", "last_clean_begin":45512, "last_clean_end":46748, "up_from":46842, "up_thru":46893, "down_at":46905, "lost_at":0, "public_addr":"10.42.6.33:6803\/10338", "cluster_addr":"10.43.100.33:6802\/10338", "heartbeat_back_addr":"10.43.100.33:6803\/10338", "heartbeat_front_addr":"10.42.6.33:6804\/10338", "state":[ "exists" ] }, { "osd":14, "uuid":"3a160892-cff8-45eb-96ad-374aeac03d74", "up":0, "in":1, "weight":"0.909729", "primary_affinity":"1.000000", "last_clean_begin":45508, "last_clean_end":46521, "up_from":46837, "up_thru":46893, "down_at":46909, "lost_at":0, "public_addr":"10.42.6.33:6808\/10493", "cluster_addr":"10.43.100.33:6806\/10493", "heartbeat_back_addr":"10.43.100.33:6807\/10493", "heartbeat_front_addr":"10.42.6.33:6809\/10493", "state":[ "exists" ] }, { "osd":15, "uuid":"5e278740-fc0c-4dff-83c2-f33a181eb736", "up":1, "in":1, "weight":"0.916687", "primary_affinity":"1.000000", "last_clean_begin":43038, "last_clean_end":45500, "up_from":45510, "up_thru":46933, "down_at":45501, "lost_at":0, "public_addr":"10.42.6.33:6811\/418", "cluster_addr":"10.43.100.33:6808\/418", "heartbeat_back_addr":"10.43.100.33:6809\/418", "heartbeat_front_addr":"10.42.6.33:6812\/418", "state":[ "exists", "up" ] }, { "osd":16, "uuid":"585c98f9-de36-4106-a81f-dff4599fe531", "up":1, "in":1, "weight":"0.929016", "primary_affinity":"1.000000", "last_clean_begin":45934, "last_clean_end":45940, "up_from":46238, "up_thru":46934, "down_at":45947, "lost_at":0, "public_addr":"10.42.6.33:6806\/20402", "cluster_addr":"10.43.100.33:6804\/20402", "heartbeat_back_addr":"10.43.100.33:6805\/20402", "heartbeat_front_addr":"10.42.6.33:6807\/20402", "state":[ "exists", "up" ] }, { "osd":17, "uuid":"3777e1e4-17c4-4b28-9b2a-578a2e0a97c8", "up":1, "in":0, "weight":"0.000000", "primary_affinity":"1.000000", "last_clean_begin":42938, "last_clean_end":43515, "up_from":43529, "up_thru":46932, "down_at":43515, "lost_at":0, "public_addr":"10.42.6.34:6800\/28100", "cluster_addr":"10.43.100.34:6808\/1028100", "heartbeat_back_addr":"10.43.100.34:6809\/1028100", "heartbeat_front_addr":"10.42.6.34:6812\/1028100", "state":[ "exists", "up" ] }, { "osd":18, "uuid":"d34a89d8-062b-4fc5-adc7-f6753707da94", "up":1, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":42945, "last_clean_end":43627, "up_from":43628, "up_thru":46932, "down_at":43625, "lost_at":0, "public_addr":"10.42.6.34:6803\/28260", "cluster_addr":"10.43.100.34:6800\/2028260", "heartbeat_back_addr":"10.43.100.34:6802\/2028260", "heartbeat_front_addr":"10.42.6.34:6801\/2028260", "state":[ "exists", "up" ] }, { "osd":19, "uuid":"2692ad07-b964-4634-8dd7-5f2a9c0b9ec6", "up":1, "in":1, "weight":"0.843536", "primary_affinity":"1.000000", "last_clean_begin":42940, "last_clean_end":46758, "up_from":46759, "up_thru":46934, "down_at":46754, "lost_at":0, "public_addr":"10.42.6.34:6806\/28436", "cluster_addr":"10.43.100.34:6806\/4028436", "heartbeat_back_addr":"10.43.100.34:6807\/4028436", "heartbeat_front_addr":"10.42.6.34:6810\/4028436", "state":[ "exists", "up" ] }, { "osd":20, "uuid":"b9eb0cd1-9b9a-4a52-99d0-90432d07f342", "up":1, "in":1, "weight":"0.815689", "primary_affinity":"1.000000", "last_clean_begin":42942, "last_clean_end":43641, "up_from":43642, "up_thru":46933, "down_at":43639, "lost_at":0, "public_addr":"10.42.6.34:6809\/28608", "cluster_addr":"10.43.100.34:6801\/4028608", "heartbeat_back_addr":"10.43.100.34:6805\/4028608", "heartbeat_front_addr":"10.42.6.34:6807\/4028608", "state":[ "exists", "up" ] }, { "osd":21, "uuid":"95ec9d5f-3262-4318-8d5a-8b0ad298dce7", "up":1, "in":0, "weight":"0.000000", "primary_affinity":"1.000000", "last_clean_begin":42950, "last_clean_end":44865, "up_from":44866, "up_thru":37196, "down_at":44865, "lost_at":0, "public_addr":"10.42.6.32:6800\/6252", "cluster_addr":"10.43.100.32:6800\/4006252", "heartbeat_back_addr":"10.43.100.32:6801\/4006252", "heartbeat_front_addr":"10.42.6.32:6801\/4006252", "state":[ "exists", "up" ] }, { "osd":22, "uuid":"f8759d63-1af3-4222-bfc6-b0f9db9bebb4", "up":1, "in":1, "weight":"1.000000", "primary_affinity":"1.000000", "last_clean_begin":42957, "last_clean_end":44833, "up_from":44834, "up_thru":46934, "down_at":44826, "lost_at":0, "public_addr":"10.42.6.29:6806\/6306", "cluster_addr":"10.43.100.29:6802\/4006306", "heartbeat_back_addr":"10.43.100.29:6806\/4006306", "heartbeat_front_addr":"10.42.6.29:6802\/4006306", "state":[ "exists", "up" ] } ], "osd_xinfo":[ { "osd":0, "down_stamp":"2014-03-31 13:30:55.328970", "laggy_probability":"0.044324", "laggy_interval":0, "features":4398046511103 }, { "osd":1, "down_stamp":"2014-03-30 12:59:31.682860", "laggy_probability":"0.023634", "laggy_interval":0, "features":4398046511103 }, { "osd":3, "down_stamp":"2014-03-30 12:54:47.014294", "laggy_probability":"0.023891", "laggy_interval":0, "features":4398046511103 }, { "osd":4, "down_stamp":"2014-03-31 07:18:54.959367", "laggy_probability":"0.447987", "laggy_interval":4, "features":4398046511103 }, { "osd":5, "down_stamp":"2014-03-30 11:59:51.301429", "laggy_probability":"0.670597", "laggy_interval":4, "features":4398046511103 }, { "osd":6, "down_stamp":"2014-03-30 03:32:11.593850", "laggy_probability":"0.927334", "laggy_interval":16, "features":4398046511103 }, { "osd":7, "down_stamp":"2014-03-30 06:30:48.078926", "laggy_probability":"0.589133", "laggy_interval":11, "features":4398046511103 }, { "osd":8, "down_stamp":"2014-03-27 00:06:45.516936", "laggy_probability":"0.068547", "laggy_interval":0, "features":4398046511103 }, { "osd":9, "down_stamp":"2014-03-30 04:22:12.879276", "laggy_probability":"0.589133", "laggy_interval":9, "features":4398046511103 }, { "osd":10, "down_stamp":"2014-03-30 11:59:56.820541", "laggy_probability":"0.679418", "laggy_interval":3, "features":4398046511103 }, { "osd":11, "down_stamp":"2014-03-31 07:17:28.580618", "laggy_probability":"0.033294", "laggy_interval":0, "features":4398046511103 }, { "osd":12, "down_stamp":"2014-03-30 11:59:54.218787", "laggy_probability":"0.639987", "laggy_interval":10, "features":4398046511103 }, { "osd":13, "down_stamp":"2014-03-31 07:38:05.950992", "laggy_probability":"0.429996", "laggy_interval":2, "features":4398046511103 }, { "osd":14, "down_stamp":"2014-03-31 07:38:33.233753", "laggy_probability":"0.447957", "laggy_interval":0, "features":4398046511103 }, { "osd":15, "down_stamp":"2014-03-30 11:58:38.719460", "laggy_probability":"0.645891", "laggy_interval":9, "features":4398046511103 }, { "osd":16, "down_stamp":"2014-03-30 12:53:54.313320", "laggy_probability":"0.055840", "laggy_interval":2, "features":4398046511103 }, { "osd":17, "down_stamp":"2014-03-28 20:43:38.658859", "laggy_probability":"0.642679", "laggy_interval":46, "features":4398046511103 }, { "osd":18, "down_stamp":"2014-03-30 02:31:02.131297", "laggy_probability":"0.748638", "laggy_interval":28, "features":4398046511103 }, { "osd":19, "down_stamp":"2014-03-31 02:32:30.118261", "laggy_probability":"0.877520", "laggy_interval":40, "features":4398046511103 }, { "osd":20, "down_stamp":"2014-03-30 02:34:08.107965", "laggy_probability":"0.876955", "laggy_interval":66, "features":4398046511103 }, { "osd":21, "down_stamp":"2014-03-30 03:34:19.100347", "laggy_probability":"0.838935", "laggy_interval":2, "features":4398046511103 }, { "osd":22, "down_stamp":"2014-03-30 03:32:11.593850", "laggy_probability":"0.779668", "laggy_interval":19, "features":4398046511103 } ], "pg_temp":[ { "pgid":"2.22", "osds":[ 7, 17 ] }, { "pgid":"2.27", "osds":[ 4, 17 ] }, { "pgid":"2.28", "osds":[ 7, 17 ] }, { "pgid":"2.33", "osds":[ 3, 17 ] }, { "pgid":"2.39", "osds":[ 0, 17 ] }, { "pgid":"2.55", "osds":[ 22, 17 ] }, { "pgid":"2.66", "osds":[ 0, 1 ] }, { "pgid":"2.88", "osds":[ 22, 16 ] }, { "pgid":"2.90", "osds":[ 1, 17 ] }, { "pgid":"2.9b", "osds":[ 7, 17 ] }, { "pgid":"2.db", "osds":[ 0, 17 ] }, { "pgid":"2.f3", "osds":[ 19, 17 ] }, { "pgid":"2.16c", "osds":[ 14, 17 ] }, { "pgid":"2.18a", "osds":[ 19, 9 ] }, { "pgid":"2.19f", "osds":[ 20, 6 ] }, { "pgid":"2.1d7", "osds":[ 14, 17 ] }, { "pgid":"2.1db", "osds":[ 1, 17 ] }, { "pgid":"2.1ef", "osds":[ 14, 7 ] }, { "pgid":"2.1fa", "osds":[ 12, 17 ] }, { "pgid":"2.22e", "osds":[ 3, 17 ] }, { "pgid":"2.236", "osds":[ 0, 17 ] }, { "pgid":"2.245", "osds":[ 3, 17 ] }, { "pgid":"2.276", "osds":[ 3, 17 ] }, { "pgid":"2.278", "osds":[ 9, 17 ] }, { "pgid":"2.27f", "osds":[ 14, 17 ] }, { "pgid":"2.2b0", "osds":[ 5, 17 ] }, { "pgid":"2.2bc", "osds":[ 9, 17 ] }, { "pgid":"2.2c6", "osds":[ 9, 17 ] }, { "pgid":"2.2ce", "osds":[ 13, 17 ] }, { "pgid":"2.2d6", "osds":[ 9, 17 ] }, { "pgid":"2.2dc", "osds":[ 15, 1 ] }, { "pgid":"2.318", "osds":[ 6, 17 ] }, { "pgid":"2.33c", "osds":[ 4, 17 ] }, { "pgid":"2.358", "osds":[ 7, 17 ] }, { "pgid":"2.359", "osds":[ 5, 16 ] }, { "pgid":"2.35b", "osds":[ 6, 17 ] }, { "pgid":"2.373", "osds":[ 19, 17 ] }, { "pgid":"2.37c", "osds":[ 1, 17 ] }, { "pgid":"2.397", "osds":[ 1, 17 ] }, { "pgid":"2.39c", "osds":[ 7, 17 ] }, { "pgid":"2.3a2", "osds":[ 9, 17 ] }, { "pgid":"2.3a8", "osds":[ 3, 17 ] }, { "pgid":"2.3d6", "osds":[ 3, 17 ] }, { "pgid":"2.3df", "osds":[ 6, 17 ] }, { "pgid":"2.3f3", "osds":[ 9, 17 ] }, { "pgid":"2.3fd", "osds":[ 0, 17 ] }, { "pgid":"2.431", "osds":[ 16, 17 ] }, { "pgid":"2.434", "osds":[ 1, 17 ] }, { "pgid":"2.440", "osds":[ 22, 17 ] }, { "pgid":"2.441", "osds":[ 16, 17 ] }, { "pgid":"2.464", "osds":[ 17, 22 ] }, { "pgid":"2.468", "osds":[ 3, 17 ] }, { "pgid":"2.46a", "osds":[ 16, 17 ] }, { "pgid":"2.487", "osds":[ 13, 15 ] }, { "pgid":"2.4af", "osds":[ 1, 17 ] }, { "pgid":"3.0", "osds":[ 11, 7, 20 ] }, { "pgid":"3.6", "osds":[ 4, 16, 7, 17 ] }, { "pgid":"3.1a", "osds":[ 5, 11, 17, 22 ] }, { "pgid":"3.21", "osds":[ 7, 20, 4, 16 ] }, { "pgid":"3.26", "osds":[ 4, 1, 11, 17 ] }, { "pgid":"3.27", "osds":[ 7, 13, 17 ] }, { "pgid":"3.28", "osds":[ 10, 15, 22, 17 ] }, { "pgid":"3.2d", "osds":[ 15, 19, 1, 5 ] }, { "pgid":"3.35", "osds":[ 14, 1, 4, 17 ] }, { "pgid":"3.4a", "osds":[ 5, 15, 0, 17 ] }, { "pgid":"3.50", "osds":[ 22, 16, 5, 17 ] }, { "pgid":"3.54", "osds":[ 22, 4, 15, 17 ] }, { "pgid":"3.5d", "osds":[ 3, 15, 20, 17 ] }, { "pgid":"3.65", "osds":[ 20, 6, 1, 10 ] }, { "pgid":"3.7f", "osds":[ 3, 6, 11, 17 ] }, { "pgid":"3.81", "osds":[ 16, 1, 4, 17 ] }, { "pgid":"3.87", "osds":[ 9, 5, 16, 17 ] }, { "pgid":"3.8b", "osds":[ 22, 5, 14, 17 ] }, { "pgid":"3.9a", "osds":[ 7, 5, 14, 17 ] }, { "pgid":"3.a7", "osds":[ 1, 15, 10, 17 ] }, { "pgid":"3.ae", "osds":[ 11, 3, 4, 17 ] }, { "pgid":"3.b9", "osds":[ 12, 14, 1, 17 ] }, { "pgid":"3.c3", "osds":[ 0, 5, 6, 17 ] }, { "pgid":"3.cf", "osds":[ 16, 1, 5, 17 ] }, { "pgid":"3.da", "osds":[ 14, 17 ] }, { "pgid":"3.dc", "osds":[ 16, 7, 5, 17 ] }, { "pgid":"3.dd", "osds":[ 1, 5, 15, 17 ] }, { "pgid":"3.10b", "osds":[ 1, 16, 4, 11 ] }, { "pgid":"3.119", "osds":[ 10, 11, 9, 17 ] }, { "pgid":"3.11a", "osds":[ 7, 5, 11, 17 ] }, { "pgid":"3.11d", "osds":[ 4, 0, 15, 17 ] }, { "pgid":"3.123", "osds":[ 5, 22, 15, 17 ] }, { "pgid":"3.126", "osds":[ 5, 0, 11 ] }, { "pgid":"3.12b", "osds":[ 15, 19, 1, 3 ] }, { "pgid":"3.136", "osds":[ 7, 13, 6, 3 ] }, { "pgid":"3.13a", "osds":[ 7, 6, 5, 17 ] }, { "pgid":"3.13c", "osds":[ 9, 16, 5 ] }, { "pgid":"3.144", "osds":[ 0, 14, 5, 17 ] }, { "pgid":"3.149", "osds":[ 13, 20, 3, 0 ] }, { "pgid":"3.158", "osds":[ 3, 15, 5, 17 ] }, { "pgid":"3.15c", "osds":[ 0, 5, 13, 17 ] }, { "pgid":"3.15e", "osds":[ 13, 22, 6, 17 ] }, { "pgid":"3.172", "osds":[ 7, 6, 13, 17 ] }, { "pgid":"3.17a", "osds":[ 5, 1, 16, 14 ] }, { "pgid":"3.188", "osds":[ 16, 18, 10, 0 ] }, { "pgid":"3.1a1", "osds":[ 22, 15, 5, 17 ] }, { "pgid":"3.1a2", "osds":[ 22, 11, 5, 17 ] }, { "pgid":"3.1a9", "osds":[ 6, 7, 18, 17 ] }, { "pgid":"3.1ab", "osds":[ 7, 5, 13, 17 ] }, { "pgid":"3.1b1", "osds":[ 15, 4, 9, 17 ] }, { "pgid":"3.1b7", "osds":[ 15, 18, 1, 5 ] }, { "pgid":"3.1bc", "osds":[ 9, 10, 13, 17 ] }, { "pgid":"3.1bd", "osds":[ 7, 18, 11, 15 ] }, { "pgid":"3.1c5", "osds":[ 16, 3, 4, 17 ] }, { "pgid":"3.1da", "osds":[ 1, 20, 15, 10 ] }, { "pgid":"3.1e0", "osds":[ 15, 22, 4, 17 ] }, { "pgid":"3.1e4", "osds":[ 15, 20, 13, 17 ] }, { "pgid":"3.1ee", "osds":[ 13, 7, 19, 5 ] }, { "pgid":"3.1ef", "osds":[ 0, 15, 4, 17 ] }, { "pgid":"3.1f3", "osds":[ 11, 1, 4, 17 ] }, { "pgid":"3.1f7", "osds":[ 11, 5, 22, 17 ] }, { "pgid":"3.1f9", "osds":[ 0, 12, 20, 17 ] }, { "pgid":"3.200", "osds":[ 12, 0, 5, 16 ] }, { "pgid":"3.202", "osds":[ 11, 5, 17, 22 ] }, { "pgid":"3.214", "osds":[ 6, 13, 22, 17 ] }, { "pgid":"3.21b", "osds":[ 22, 5, 15, 17 ] }, { "pgid":"3.229", "osds":[ 18, 1, 11, 20 ] }, { "pgid":"3.22d", "osds":[ 3, 6, 14, 17 ] }, { "pgid":"3.235", "osds":[ 0, 13, 11, 17 ] }, { "pgid":"3.23b", "osds":[ 14, 5, 18, 0 ] }, { "pgid":"3.248", "osds":[ 14, 1, 5, 17 ] }, { "pgid":"3.250", "osds":[ 16, 1, 6, 17 ] }, { "pgid":"3.25d", "osds":[ 14, 22, 4, 17 ] }, { "pgid":"3.264", "osds":[ 13, 0, 5, 17 ] }, { "pgid":"3.268", "osds":[ 9, 16, 20, 5 ] }, { "pgid":"3.26a", "osds":[ 0, 14, 17 ] }, { "pgid":"3.28b", "osds":[ 16, 4, 20, 17 ] }, { "pgid":"3.297", "osds":[ 1, 4, 13, 17 ] }, { "pgid":"3.29e", "osds":[ 16, 3, 5, 17 ] }, { "pgid":"3.2a1", "osds":[ 3, 14, 5, 17 ] }, { "pgid":"3.2a2", "osds":[ 1, 13, 4, 17 ] }, { "pgid":"3.2a4", "osds":[ 5, 3, 16, 4 ] }, { "pgid":"3.2a6", "osds":[ 1, 15, 6, 17 ] }, { "pgid":"3.2af", "osds":[ 0, 6, 5 ] }, { "pgid":"3.2b2", "osds":[ 11, 10, 3, 17 ] }, { "pgid":"3.2bb", "osds":[ 5, 6, 9, 17 ] }, { "pgid":"3.2c5", "osds":[ 9, 20, 17 ] }, { "pgid":"3.2cd", "osds":[ 0, 13, 4, 9 ] }, { "pgid":"3.2cf", "osds":[ 7, 15, 5, 17 ] }, { "pgid":"3.2d3", "osds":[ 22, 16, 4, 17 ] }, { "pgid":"3.2d5", "osds":[ 9, 6, 17 ] }, { "pgid":"3.2db", "osds":[ 0, 15, 19, 1 ] }, { "pgid":"3.2df", "osds":[ 1, 11, 5, 17 ] }, { "pgid":"3.2ea", "osds":[ 13, 22, 5, 17 ] }, { "pgid":"3.2ee", "osds":[ 9, 12, 5 ] }, { "pgid":"3.2f2", "osds":[ 14, 0, 5, 17 ] }, { "pgid":"3.2f3", "osds":[ 10, 9, 11, 17 ] }, { "pgid":"3.2f8", "osds":[ 1, 16, 5, 17 ] }, { "pgid":"3.308", "osds":[ 13, 3, 17 ] }, { "pgid":"3.30a", "osds":[ 0, 16, 10, 17 ] }, { "pgid":"3.311", "osds":[ 15, 0, 4, 17 ] }, { "pgid":"3.313", "osds":[ 15, 1, 4, 17 ] }, { "pgid":"3.317", "osds":[ 6, 1, 11, 17 ] }, { "pgid":"3.320", "osds":[ 4, 22, 15, 17 ] }, { "pgid":"3.324", "osds":[ 5, 7, 16, 17 ] }, { "pgid":"3.340", "osds":[ 0, 4, 14, 15 ] }, { "pgid":"3.346", "osds":[ 9, 13, 4, 17 ] }, { "pgid":"3.351", "osds":[ 4, 11, 17, 22 ] }, { "pgid":"3.358", "osds":[ 5, 9, 20, 7 ] }, { "pgid":"3.37b", "osds":[ 1, 14, 19, 0 ] }, { "pgid":"3.380", "osds":[ 13, 1, 5, 0 ] }, { "pgid":"3.382", "osds":[ 9, 11, 4, 17 ] }, { "pgid":"3.391", "osds":[ 0, 6, 17 ] }, { "pgid":"3.399", "osds":[ 1, 11, 17 ] }, { "pgid":"3.39b", "osds":[ 7, 14, 5, 17 ] }, { "pgid":"3.3a1", "osds":[ 9, 13, 5, 17 ] }, { "pgid":"3.3a7", "osds":[ 19, 3 ] }, { "pgid":"3.3ab", "osds":[ 15, 6, 19, 17 ] }, { "pgid":"3.3b6", "osds":[ 4, 1, 14, 17 ] }, { "pgid":"3.3b7", "osds":[ 4, 22, 14, 17 ] }, { "pgid":"3.3b8", "osds":[ 4, 14, 9, 17 ] }, { "pgid":"3.3bf", "osds":[ 7, 13, 20, 4 ] }, { "pgid":"3.3cc", "osds":[ 3, 15, 6, 17 ] }, { "pgid":"3.3d5", "osds":[ 0, 13, 3, 15 ] }, { "pgid":"3.3e0", "osds":[ 4, 19, 17 ] }, { "pgid":"3.3e7", "osds":[ 5, 22, 14, 17 ] } ], "primary_temp":[ ], "blacklist":[ ] }
Updated by Aaron T about 10 years ago
After the additional logging directives were added and osd.0 was restarted, it quickly crashed again. Here is the logfile: http://www.aarontc.com/logs/ceph-osd.0.log.bz2
Updated by Sage Weil about 10 years ago
We got a duplicate backfill reservation request from osd.3:
-1500> 2014-03-31 07:17:00.482664 7f97d116d700 1 -- 10.43.100.33:6800/10089 <== osd.3 10.43.100.29:6805/22543 1001 ==== MBackfillReserve REQUEST pgid: 3.22d, query_epoch: 46880, prio: 0 v3 ==== 30+0+0 (1766824561 0 0) 0x7f9774001380 con 0x7f97b82e9060 ... -94> 2014-03-31 07:17:10.491106 7f97d116d700 1 -- 10.43.100.33:6800/10089 <== osd.3 10.43.100.29:6805/22543 1030 ==== MBackfillReserve REQUEST pgid: 3.22d, query_epoch: 46880, prio: 0 v3 ==== 30+0+0 (1766824561 0 0) 0x7f9774002a00 con 0x7f97b82e9060
Updated by Sage Weil about 10 years ago
- Category set to OSD
- Status changed from New to Need More Info
it would help to see the log from the primary sending the dup backfill request. it is not entirely trivial to determine, but if you search backward on the thread that crashes, you will see it processing the backfill request that triggers the bad state change, and just above that you will see the line where it is received from a peer osd. in this case it was osd.3.
Updated by Aaron T about 10 years ago
Here's a partial log from osd.3 around the problematic time: http://www.aarontc.com/logs/ceph-osd.3.log.bz2
Please let me know if I can do anything else to help :)
Updated by Aaron T about 10 years ago
All highly-verbose logfiles have been uploaded at Sage's request (4.3GiB): http://www.aarontc.com/logs/ceph-logs-chekov.tar.bz2
Updated by Sage Weil about 10 years ago
- Status changed from Need More Info to In Progress
Updated by Sage Weil about 10 years ago
- Subject changed from osd/PG.cc: 5255: FAILED assert(0 == "we got a bad state machine event") to osd: multi-backfill reservation does not release on reject
- Assignee changed from Sage Weil to David Zafman
logs on flab:/more2/t/7922
on osd.9, we see:
2014-04-02 00:02:46.461229 7f72ab8ed700 1 -- 10.43.100.29:6809/28350 <== osd.3 10.43.100.29:6807/27807 16 ==== MBackfillReserve REQUEST pgid: 2.47d, query_epoch: 49313, prio: 1 v3 ==== 30+0+0 (2028193633 0 0) 0x7f71c40073d0 con 0x7f721c0014e0 ... 2014-04-02 00:02:55.835619 7f72a60e2700 1 -- 10.43.100.29:6809/28350 --> osd.3 10.43.100.29:6807/27807 -- MBackfillReserve GRANT pgid: 2.47d, query_epoch: 49319 v3 -- ?+0 0x7f729021be30 ... 2014-04-02 00:03:40.363440 7f72ab8ed700 1 -- 10.43.100.29:6809/28350 <== osd.3 10.43.100.29:6807/27807 18 ==== MBackfillReserve REQUEST pgid: 2.47d, query_epoch: 49330, prio: 1 v3 ==== 30+0+0 (713200269 0 0) 0x7f71c40008c0 con 0x7f721c0014e0
this is within the same peering epoch. looking at hte primary on osd.3 we can see why:
2014-04-02 00:02:46.461130 7f9a5eb40700 1 -- 10.43.100.29:6807/27807 --> 10.43.100.29:6809/28350 -- MBackfillReserve REQUEST pgid: 2.47d, query_epoch: 49313, prio: 1 v3 -- ?+0 0x7f9a541d55c0 con 0x7f9a580026a0 ... 2014-04-02 00:02:55.836696 7f9a6434b700 1 -- 10.43.100.29:6807/27807 <== osd.9 10.43.100.29:6809/28350 39 ==== MBackfillReserve GRANT pgid: 2.47d, query_epoch: 49319 v3 ==== 30+0+0 (2897045009 0 0) 0x7f999c000d60 con 0x7f9a580026a0 (ok so far...) ... 2014-04-02 00:02:55.836825 7f9a5eb40700 1 -- 10.43.100.29:6807/27807 --> 10.43.100.34:6810/9885 -- MBackfillReserve REQUEST pgid: 2.47d, query_epoch: 49319, prio: 1 v3 -- ?+0 0x7f9a540b5dc0 con 0x7f9a5400cd10 (reserve secon backfill target, i guess...) ... 2014-04-02 00:02:57.937030 7f9a6434b700 1 -- 10.43.100.29:6807/27807 <== osd.18 10.43.100.34:6810/9885 843 ==== MBackfillReserve REJECT pgid: 2.47d, query_epoch: 49319 v3 ==== 30+0+0 (3036834360 0 0) 0x7f9994000ee0 con 0x7f9a5400cd10
and we go into PG::RecoveryState::WaitRemoteBackfillReserved::react(const RemoteReservationRejected &evt)
which cancels the local_reserver and schedules a retry later, but does not send a message to osd.9 canceling the reservation. and when it tries again later, it causes the crash.
it looks like we have REQUEST, GRANT and REJECT, but no CANCEL, which is what is needed here. I think we should be looping over the sorted_backfill_set from begin() up to but not including backfill_osd_it (what we've reserved so far) and sending those cancels from the Rejected react function?
Updated by Sage Weil about 10 years ago
i reproduced this on my first try with the patch in wip-7922.
MON=1 MDS=1 OSD=5 ./vstart.sh -d -x -n -l -o 'osd debug reject backfill probability = .2' ./rados -p rbd bench 30 write -b 123 ./init-ceph stop osd.0 osd.1 ./ceph osd out 0 1
and a minute later an osd crashed with the bad state machine event.
Updated by Sage Weil about 10 years ago
interactive-on-error: true overrides: admin_socket: branch: master ceph: conf: mon: debug mon: 20 debug ms: 1 debug paxos: 20 mon min osdmap epochs: 2 osd: debug filestore: 20 debug journal: 20 debug ms: 1 debug osd: 20 fs: xfs install: ceph: sha1: 31df91e091c01255cfea4c45d949d784198acdaa roles: - - mon.a - mon.c - osd.0 - osd.1 - osd.2 - client.0 - - mon.b - mds.a - osd.3 - osd.4 - osd.5 - client.1 tasks: - install: null - ceph: conf: osd: osd debug reject backfill probability: .3 osd min pg log entries: 25 osd max pg log entries: 100 - exec: client.0: - ceph osd pool create foo 64 - rados -p foo bench 60 write -b 1024 --no-cleanup - ceph osd pool set foo size 3 - ceph osd out 0 1 - sleep: duration: 60 - exec: client.0: - ceph osd in 0 1 - sleep: duration: 60
triggers the failure, when the wip-7922 patch that adds the rejection thing is there.
Updated by David Zafman about 10 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil about 10 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kenneth Waegeman about 10 years ago
Is this fixed in 0.79 ? Or can I patch this myself? I seem to have this problem too
Updated by Sage Weil about 10 years ago
Kenneth Waegeman wrote:
Is this fixed in 0.79 ? Or can I patch this myself? I seem to have this problem too
Yes, it's fixed in 0.79