Bug #5504 (closed): osd stuck on peering for a long time

Added by Dominik Mostowiec almost 11 years ago. Updated almost 11 years ago.

Status: Duplicate
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

PGs from one OSD sometimes get stuck in peering for a long time.

--
ceph health detail
HEALTH_WARN 3 pgs peering; 3 pgs stuck inactive; 5 pgs stuck unclean; recovery 64/38277874 degraded (0.000%)
pg 5.df9 is stuck inactive for 138669.746512, current state peering, last acting [87,2,151]
pg 5.a82 is stuck inactive for 138638.121867, current state peering, last acting [151,87,42]
pg 5.80d is stuck inactive for 138621.069523, current state peering, last acting [151,47,87]
pg 5.df9 is stuck unclean for 138669.746761, current state peering, last acting [87,2,151]
pg 5.ae2 is stuck unclean for 139479.810499, current state active, last acting [87,151,28]
pg 5.7b6 is stuck unclean for 139479.693271, current state active, last acting [87,105,2]
pg 5.a82 is stuck unclean for 139479.713859, current state peering, last acting [151,87,42]
pg 5.80d is stuck unclean for 139479.800820, current state peering, last acting [151,47,87]
pg 5.df9 is peering, acting [87,2,151]
pg 5.a82 is peering, acting [151,87,42]
pg 5.80d is peering, acting [151,47,87]
recovery 64/38277874 degraded (0.000%)
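
The per-PG detail that follows comes from Ceph's pg query command. A minimal sketch of how this kind of data is gathered (the pgid is the one reported above; substitute the stuck PG in question):

# list stuck PGs, then dump the full peering state of one of them
ceph health detail
ceph pg dump_stuck inactive
ceph pg 5.df9 query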

osd pg query for 5.df9: { "state": "peering",
"up": [
87,
2,
151],
"acting": [
87,
2,
151],
"info": { "pgid": "5.df9",
"last_update": "119454'58844953",
"last_complete": "119454'58844953",
"log_tail": "119454'58843952",
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": { "epoch_created": 365,
"last_epoch_started": 119456,
"last_epoch_clean": 119456,
"last_epoch_split": 117806,
"same_up_since": 119458,
"same_interval_since": 119458,
"same_primary_since": 119458,
"last_scrub": "119442'58732630",
"last_scrub_stamp": "2013-06-29 20:02:24.817352",
"last_deep_scrub": "119271'57224023",
"last_deep_scrub_stamp": "2013-06-23 02:04:49.654373",
"last_clean_scrub_stamp": "2013-06-29 20:02:24.817352"},
"stats": { "version": "119454'58844953",
"reported": "119458'42382189",
"state": "peering",
"last_fresh": "2013-06-30 20:35:29.489826",
"last_change": "2013-06-30 20:35:28.469854",
"last_active": "2013-06-30 20:33:24.126599",
"last_clean": "2013-06-30 20:33:24.126599",
"last_unstale": "2013-06-30 20:35:29.489826",
"mapping_epoch": 119455,
"log_start": "119454'58843952",
"ondisk_log_start": "119454'58843952",
"created": 365,
"last_epoch_clean": 365,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "119442'58732630",
"last_scrub_stamp": "2013-06-29 20:02:24.817352",
"last_deep_scrub": "119271'57224023",
"last_deep_scrub_stamp": "2013-06-23 02:04:49.654373",
"last_clean_scrub_stamp": "2013-06-29 20:02:24.817352",
"log_size": 135341,
"ondisk_log_size": 135341,
"stats_invalid": "0",
"stat_sum": { "num_bytes": 1010563373,
"num_objects": 3099,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 302,
"num_read_kb": 0,
"num_write": 32264,
"num_write_kb": 798650,
"num_scrub_errors": 0,
"num_objects_recovered": 8235,
"num_bytes_recovered": 2085653757,
"num_keys_recovered": 249061471},
"stat_cat_sum": {},
"up": [
87,
2,
151],
"acting": [
87,
2,
151]},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 119454},
"recovery_state": [ { "name": "Started\/Primary\/Peering\/GetLog",
"enter_time": "2013-06-30 20:35:28.545478",
"newest_update_osd": 2}, { "name": "Started\/Primary\/Peering",
"enter_time": "2013-06-30 20:35:28.469841",
"past_intervals": [ { "first": 119453,
"last": 119454,
"maybe_went_rw": 1,
"up": [
87,
2,
151],
"acting": [
87,
2,
151]}, { "first": 119455,
"last": 119457,
"maybe_went_rw": 1,
"up": [
2,
151],
"acting": [
2,
151]}],
"probing_osds": [
2,
87,
151],
"down_osds_we_would_probe": [],
"peering_blocked_by": []}, { "name": "Started",
"enter_time": "2013-06-30 20:35:28.469765"}]}

Other pg query reports:
https://www.dropbox.com/s/q5iv8lwzecioy3d/pg_query.tar.tz

Performance graphs for this osd:
https://www.dropbox.com/s/o07wae2041hu06l/osd_87_performance.PNG

Restarting the OSD helps.
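
A minimal sketch of that workaround, assuming the sysvinit scripts shipped with 0.56 and that osd.87 is the primary holding the stuck PGs (adjust the id as needed):

# restart the affected OSD daemon; its PGs re-peer when it comes back up
service ceph restart osd.87
# alternatively, marking it down in the osdmap also forces its PGs to re-peer
ceph osd down 87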

On 2 OSDs in this cluster, op_rw is much higher than on the others.
Graph: https://www.dropbox.com/s/wneqrmxzhpx8du2/op_rw_top_5.PNG
One of them is osd.87.
On it I found:
/data/osd.87/current# du -sh omap/
2.5G omap/
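
A rough sketch for comparing op_rw across OSDs and checking omap sizes on an OSD host; the admin socket path, the /data/osd.* data directory layout, and the exact counter names are assumptions based on the defaults shown in this report:

# per-OSD operation counters, including op_rw, via the admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.87.asok perf dump | grep op_rw
# size of the leveldb omap store under each OSD's data directory
du -sh /data/osd.*/current/omap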

Logs from another cluster (debug 10) where it got stuck peering on a PG from osd.57:
https://www.dropbox.com/s/vxvh8084b8ty19u/osd.57_20130628_13xx.log.tar.gz

--
Regards
Dominik

#1

Updated by Dominik Mostowiec almost 11 years ago

ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
6 mons on 6 servers
156 osds: 156 up, 156 in
6488 pgs
3990 GB data, 14747 GB used, 28673 GB / 43420 GB avail

#2

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Duplicate

see #5517

