Bug #39249

closed

Some PGs stuck in active+remapped state

Added by Марк Коренберг about 5 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sometimes my PGs get stuck in this state. When I stop the primary OSD containing such a PG, the PG becomes `active+undersized+degraded` and does not get remapped, even after I start that OSD again.

How can I debug this? I have plenty of free space on the other OSDs. Restarting all OSDs does not help.

```
$ ceph osd df tree
ID  CLASS    WEIGHT   REWEIGHT SIZE    USE      AVAIL   %USE  VAR  PGS TYPE NAME
 -1          14.15028        -  14 TiB  5.6 TiB 8.1 TiB 40.80 1.00   - root default
 -2           3.19478        - 3.2 TiB  1.3 TiB 1.9 TiB 41.32 1.01   -     host node1
  6 blue_ssd  0.45599  1.00000 467 GiB  202 GiB 265 GiB 43.26 1.06 256         osd.6
  1     prod  1.82419  0.93387 1.8 TiB  764 GiB 1.1 TiB 40.90 1.00 223         osd.1
  2     prod  0.91460  0.79158 937 GiB  386 GiB 551 GiB 41.19 1.01 107         osd.2
 -3           2.28519        - 2.3 TiB 1000 GiB 1.3 TiB 42.72 1.05   -     host node2
  0 blue_ssd  0.45599  1.00000 467 GiB  202 GiB 265 GiB 43.28 1.06 256         osd.0
  3     prod  0.91460  0.83400 937 GiB  396 GiB 541 GiB 42.29 1.04 104         osd.3
  4     prod  0.91460  0.72214 937 GiB  402 GiB 535 GiB 42.88 1.05 119         osd.4
 -4           2.28996        - 1.8 TiB  826 GiB 1.0 TiB 44.05 1.08   -     host node3
  7 blue_ssd  0.45599  1.00000 467 GiB  202 GiB 265 GiB 43.26 1.06 256         osd.7
 11     prod  0.45969        0     0 B      0 B     0 B     0    0   0         osd.11
 13     prod  0.45969  0.84837 471 GiB  216 GiB 255 GiB 45.86 1.12  57         osd.13
 14     prod  0.91460  0.65007 937 GiB  408 GiB 529 GiB 43.53 1.07  97         osd.14
 -9           3.63689        - 3.6 TiB  1.4 TiB 2.3 TiB 37.66 0.92   -     host node4
  5     prod  0.90919  1.00000 931 GiB  350 GiB 581 GiB 37.58 0.92  97         osd.5
  9     prod  1.81850  1.00000 1.8 TiB  745 GiB 1.1 TiB 40.00 0.98 207         osd.9
 10     prod  0.90919  1.00000 931 GiB  308 GiB 623 GiB 33.04 0.81  92         osd.10
-16           2.74347        - 2.7 TiB  1.1 TiB 1.6 TiB 40.57 0.99   -     host node5
  8     prod  0.91449  0.94768 936 GiB  387 GiB 549 GiB 41.36 1.01 120         osd.8
 12     prod  0.91449  0.84109 936 GiB  377 GiB 559 GiB 40.28 0.99  91         osd.12
 16     prod  0.91449  0.70984 936 GiB  375 GiB 561 GiB 40.07 0.98  93         osd.16
                         TOTAL  14 TiB  5.6 TiB 8.6 TiB 40.80
```

So my question is: how do I debug such cases? My CRUSH map contains nothing special (such as upmaps) apart from two device classes (prod and blue_ssd).

Actions #2

Updated by Марк Коренберг about 5 years ago

$ ceph pg dump | egrep 'PG|unders'
dumped all
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES      LOG  DISK_LOG STATE                      STATE_STAMP                VERSION         REPORTED        UP         UP_PRIMARY ACTING     ACTING_PRIMARY LAST_SCRUB      SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN 
19.21       676                  0      676         0       0 2530544738 3029     3029 active+undersized+degraded 2019-04-11 17:16:02.726121  63928'17602712  63928:17911658     [5,12]          5     [5,12]              5  63838'17595232 2019-04-11 01:04:54.802666  63702'17549735 2019-04-06 01:04:51.953306             0 
OSD_STAT USED    AVAIL   TOTAL   HB_PEERS                          PG_SUM PRIMARY_PG_SUM 
Actions #3

Updated by Марк Коренберг about 5 years ago

$ ceph -f json pg dump | jq '.pg_stats[] | select(.state | contains("unders"))'
dumped all
{
  "pgid": "19.21",
  "version": "63928'17602797",
  "reported_seq": "17911743",
  "reported_epoch": "63928",
  "state": "active+undersized+degraded",
  "last_fresh": "2019-04-11 17:34:03.450789",
  "last_change": "2019-04-11 17:16:02.726121",
  "last_active": "2019-04-11 17:34:03.450789",
  "last_peered": "2019-04-11 17:34:03.450789",
  "last_clean": "2019-04-11 17:15:08.576010",
  "last_became_active": "2019-04-11 17:16:02.726121",
  "last_became_peered": "2019-04-11 17:16:02.726121",
  "last_unstale": "2019-04-11 17:34:03.450789",
  "last_undegraded": "2019-04-11 17:16:02.724250",
  "last_fullsized": "2019-04-11 17:16:02.724138",
  "mapping_epoch": 63926,
  "log_start": "63838'17599783",
  "ondisk_log_start": "63838'17599783",
  "created": 1173,
  "last_epoch_clean": 63841,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "63838'17595232",
  "last_scrub_stamp": "2019-04-11 01:04:54.802666",
  "last_deep_scrub": "63702'17549735",
  "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
  "last_clean_scrub_stamp": "2019-04-11 01:04:54.802666",
  "log_size": 3014,
  "ondisk_log_size": 3014,
  "stats_invalid": false,
  "dirty_stats_invalid": false,
  "omap_stats_invalid": false,
  "hitset_stats_invalid": false,
  "hitset_bytes_stats_invalid": false,
  "pin_stats_invalid": false,
  "manifest_stats_invalid": true,
  "snaptrimq_len": 0,
  "stat_sum": {
     ...
  },
  "up": [
    5,
    12
  ],
  "acting": [
    5,
    12
  ],
  "blocked_by": [],
  "up_primary": 5,
  "acting_primary": 5,
  "purged_snaps": []
}

Actions #4

Updated by Марк Коренберг about 5 years ago

osd.11 previously took part in this PG; I no longer know whether it was the primary. The problem appeared after I ran `ceph osd out osd.11`.

Actions #6

Updated by Марк Коренберг about 5 years ago

$ ceph pg 19.21 query
{
    "state": "active+clean+remapped",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 65828,
    "up": [
        5,
        16
    ],
    "acting": [
        5,
        16,
        12
    ],
    "acting_recovery_backfill": [
        "5",
        "12",
        "16" 
    ],
    "info": {
        "pgid": "19.21",
        "last_update": "65828'17613229",
        "last_complete": "65828'17613229",
        "log_tail": "65765'17610184",
        "last_user_version": 17613229,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 1,
        "purged_snaps": [],
        "history": {
            "epoch_created": 1173,
            "epoch_pool_created": 1173,
            "last_epoch_started": 65728,
            "last_interval_started": 65727,
            "last_epoch_clean": 65728,
            "last_interval_clean": 65727,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 65726,
            "same_interval_since": 65727,
            "same_primary_since": 65705,
            "last_scrub": "65342'17606121",
            "last_scrub_stamp": "2019-04-12 02:18:09.539815",
            "last_deep_scrub": "63702'17549735",
            "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
            "last_clean_scrub_stamp": "2019-04-12 02:18:09.539815" 
        },
        "stats": {
            "version": "65828'17613229",
            "reported_seq": "17930530",
            "reported_epoch": "65828",
            "state": "active+clean+remapped",
            "last_fresh": "2019-04-12 18:04:12.211201",
            "last_change": "2019-04-12 14:23:34.812265",
            "last_active": "2019-04-12 18:04:12.211201",
            "last_peered": "2019-04-12 18:04:12.211201",
            "last_clean": "2019-04-12 18:04:12.211201",
            "last_became_active": "2019-04-12 12:57:16.825414",
            "last_became_peered": "2019-04-12 12:57:16.825414",
            "last_unstale": "2019-04-12 18:04:12.211201",
            "last_undegraded": "2019-04-12 18:04:12.211201",
            "last_fullsized": "2019-04-12 18:04:12.211201",
            "mapping_epoch": 65727,
            "log_start": "65765'17610184",
            "ondisk_log_start": "65765'17610184",
            "created": 1173,
            "last_epoch_clean": 65728,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "65342'17606121",
            "last_scrub_stamp": "2019-04-12 02:18:09.539815",
            "last_deep_scrub": "63702'17549735",
            "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
            "last_clean_scrub_stamp": "2019-04-12 02:18:09.539815",
            "log_size": 3045,
            "ondisk_log_size": 3045,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": true,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 2525160546,
                "num_objects": 653,
                "num_object_clones": 16,
                "num_object_copies": 1959,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 653,
                "num_objects_unfound": 0,
                "num_objects_dirty": 653,
                "num_whiteouts": 0,
                "num_read": 4669894,
                "num_read_kb": 184550843,
                "num_write": 17585691,
                "num_write_kb": 1432285508,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 62562,
                "num_bytes_recovered": 248121752746,
                "num_keys_recovered": 14,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0
            },
            "up": [
                5,
                16
            ],
            "acting": [
                5,
                16,
                12
            ],
            "blocked_by": [],
            "up_primary": 5,
            "acting_primary": 5,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 65728,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [
        {
            "peer": "12",
            "pgid": "19.21",
            "last_update": "65828'17613229",
            "last_complete": "65828'17613229",
            "log_tail": "65342'17606783",
            "last_user_version": 17609878,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 1,
            "purged_snaps": [],
            "history": {
                "epoch_created": 1173,
                "epoch_pool_created": 1173,
                "last_epoch_started": 65728,
                "last_interval_started": 65727,
                "last_epoch_clean": 65728,
                "last_interval_clean": 65727,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 65726,
                "same_interval_since": 65727,
                "same_primary_since": 65705,
                "last_scrub": "65342'17606121",
                "last_scrub_stamp": "2019-04-12 02:18:09.539815",
                "last_deep_scrub": "63702'17549735",
                "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
                "last_clean_scrub_stamp": "2019-04-12 02:18:09.539815" 
            },
            "stats": {
                "version": "65712'17609876",
                "reported_seq": "17923240",
                "reported_epoch": "65712",
                "state": "active+clean+remapped",
                "last_fresh": "2019-04-12 12:39:04.048144",
                "last_change": "2019-04-12 12:27:14.334773",
                "last_active": "2019-04-12 12:39:04.048144",
                "last_peered": "2019-04-12 12:39:04.048144",
                "last_clean": "2019-04-12 12:39:04.048144",
                "last_became_active": "2019-04-12 12:27:14.334462",
                "last_became_peered": "2019-04-12 12:27:14.334462",
                "last_unstale": "2019-04-12 12:39:04.048144",
                "last_undegraded": "2019-04-12 12:39:04.048144",
                "last_fullsized": "2019-04-12 12:39:04.048144",
                "mapping_epoch": 65727,
                "log_start": "65342'17606783",
                "ondisk_log_start": "65342'17606783",
                "created": 1173,
                "last_epoch_clean": 65706,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "65342'17606121",
                "last_scrub_stamp": "2019-04-12 02:18:09.539815",
                "last_deep_scrub": "63702'17549735",
                "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
                "last_clean_scrub_stamp": "2019-04-12 02:18:09.539815",
                "log_size": 3093,
                "ondisk_log_size": 3093,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "manifest_stats_invalid": true,
                "snaptrimq_len": 0,
                "stat_sum": {
                    "num_bytes": 2537348194,
                    "num_objects": 679,
                    "num_object_clones": 42,
                    "num_object_copies": 2037,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 679,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 679,
                    "num_whiteouts": 0,
                    "num_read": 4665795,
                    "num_read_kb": 184471580,
                    "num_write": 17582383,
                    "num_write_kb": 1431930472,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 62562,
                    "num_bytes_recovered": 248121752746,
                    "num_keys_recovered": 14,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0,
                    "num_large_omap_objects": 0,
                    "num_objects_manifest": 0
                },
                "up": [
                    5,
                    16
                ],
                "acting": [
                    5,
                    16,
                    12
                ],
                "blocked_by": [],
                "up_primary": 5,
                "acting_primary": 5,
                "purged_snaps": []
            },
            "empty": 0,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 65728,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        },
        {
            "peer": "16",
            "pgid": "19.21",
            "last_update": "65828'17613229",
            "last_complete": "65828'17613229",
            "log_tail": "65342'17606783",
            "last_user_version": 17609878,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 1,
            "purged_snaps": [],
            "history": {
                "epoch_created": 1173,
                "epoch_pool_created": 1173,
                "last_epoch_started": 65728,
                "last_interval_started": 65727,
                "last_epoch_clean": 65728,
                "last_interval_clean": 65727,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 65726,
                "same_interval_since": 65727,
                "same_primary_since": 65705,
                "last_scrub": "65342'17606121",
                "last_scrub_stamp": "2019-04-12 02:18:09.539815",
                "last_deep_scrub": "63702'17549735",
                "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
                "last_clean_scrub_stamp": "2019-04-12 02:18:09.539815" 
            },
            "stats": {
                "version": "65712'17609876",
                "reported_seq": "17923240",
                "reported_epoch": "65712",
                "state": "active+clean+remapped",
                "last_fresh": "2019-04-12 12:39:04.048144",
                "last_change": "2019-04-12 12:27:14.334773",
                "last_active": "2019-04-12 12:39:04.048144",
                "last_peered": "2019-04-12 12:39:04.048144",
                "last_clean": "2019-04-12 12:39:04.048144",
                "last_became_active": "2019-04-12 12:27:14.334462",
                "last_became_peered": "2019-04-12 12:27:14.334462",
                "last_unstale": "2019-04-12 12:39:04.048144",
                "last_undegraded": "2019-04-12 12:39:04.048144",
                "last_fullsized": "2019-04-12 12:39:04.048144",
                "mapping_epoch": 65727,
                "log_start": "65342'17606783",
                "ondisk_log_start": "65342'17606783",
                "created": 1173,
                "last_epoch_clean": 65706,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "65342'17606121",
                "last_scrub_stamp": "2019-04-12 02:18:09.539815",
                "last_deep_scrub": "63702'17549735",
                "last_deep_scrub_stamp": "2019-04-06 01:04:51.953306",
                "last_clean_scrub_stamp": "2019-04-12 02:18:09.539815",
                "log_size": 3093,
                "ondisk_log_size": 3093,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "manifest_stats_invalid": true,
                "snaptrimq_len": 0,
                "stat_sum": {
                    "num_bytes": 2537348194,
                    "num_objects": 679,
                    "num_object_clones": 42,
                    "num_object_copies": 2037,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 679,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 679,
                    "num_whiteouts": 0,
                    "num_read": 4665795,
                    "num_read_kb": 184471580,
                    "num_write": 17582383,
                    "num_write_kb": 1431930472,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 62562,
                    "num_bytes_recovered": 248121752746,
                    "num_keys_recovered": 14,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0,
                    "num_large_omap_objects": 0,
                    "num_objects_manifest": 0
                },
                "up": [
                    5,
                    16
                ],
                "acting": [
                    5,
                    16,
                    12
                ],
                "blocked_by": [],
                "up_primary": 5,
                "acting_primary": 5,
                "purged_snaps": []
            },
            "empty": 0,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 65728,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        }
    ],
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2019-04-12 12:57:16.816173",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "65107",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2019-04-12 12:57:15.795368" 
        }
    ],
    "agent_state": {}
}
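
One way to read the query output above: `up` lists only two OSDs while `acting` lists three, which suggests CRUSH did not return a full mapping and the PG is still served from its previous set (hence `remapped`). The following is a minimal sketch, not a Ceph tool; the function name `pgs_with_short_up_set` is hypothetical, and the replica count is an assumption inferred from `num_object_copies / num_objects` rather than read from the pool settings.

```python
def pgs_with_short_up_set(pg_stats, pool_size):
    """Return (pgid, up, acting) for PGs whose up set has fewer OSDs than pool_size."""
    short = []
    for pg in pg_stats:
        if len(pg["up"]) < pool_size:
            short.append((pg["pgid"], pg["up"], pg["acting"]))
    return short


# Values copied from the `ceph pg 19.21 query` output above.
pg_19_21 = {
    "pgid": "19.21",
    "up": [5, 16],
    "acting": [5, 16, 12],
    "num_objects": 653,
    "num_object_copies": 1959,
}

# Assumption: replica count inferred as copies per object (1959 / 653).
inferred_size = pg_19_21["num_object_copies"] // pg_19_21["num_objects"]

result = pgs_with_short_up_set([pg_19_21], inferred_size)
print(inferred_size, result)
```

The same check could be run over every entry of `pg_stats` from `ceph -f json pg dump`, given the real pool size.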

Actions #7

Updated by Nathan Cutler about 5 years ago

@Mark: Which version of Mimic are you running?

Actions #8

Updated by Jake Grimmett about 5 years ago

We have a Mimic 13.2.5 cluster with a similar-looking problem:

After replacing a failing OSD, the cluster mostly healed, then got stuck at 0.006% misplaced:

osd: 454 osds: 454 up, 454 in; 8 remapped pgs
pgs: 378896/6223874662 objects misplaced (0.006%)
8200 active+clean
16 active+clean+scrubbing+deep
8 active+clean+remapped

[root@ceph1 ~]# ceph health detail
HEALTH_WARN 378895/6223876069 objects misplaced (0.006%)
OBJECT_MISPLACED 378895/6223876069 objects misplaced (0.006%)
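
As a quick sanity check on the numbers above, the reported percentage is simply the misplaced count divided by the total. A minimal sketch; the helper name `misplaced_percent` is hypothetical.

```python
def misplaced_percent(misplaced, total):
    """Return the misplaced share as a percentage."""
    return 100.0 * misplaced / total


# Counts copied from the `ceph health detail` output above.
pct = misplaced_percent(378895, 6223876069)
print(f"{pct:.3f}%")
```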

If relevant, the OSD was replaced as follows:
ceph osd out 193
...then waiting until "ceph osd safe-to-destroy 193" came back positive,
physically replacing the drive, then
ceph osd crush remove osd.193 (perhaps "ceph osd purge" should have been used?)
"ceph-volume lvm create --osd-id 193 --data /dev/sds" failed with "RuntimeError: The osd ID 193 is already in use or does not exist."
so the new drive was added using:
ceph-volume lvm create --bluestore --data /dev/sds
This added the drive as osd.0 (osd.0 had been removed some time ago).
"ceph osd tree | grep 193" gives no output.

Actions #9

Updated by Марк Коренберг about 5 years ago

Exactly the same here. To heal it, I changed all my reweights to 1, which helped. But I still don't understand how to debug this; I need to understand why it happens.
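
The reweight overrides mentioned here are visible in the REWEIGHT column of the `ceph osd df tree` output in the description. Below is a minimal sketch for listing them, assuming that plain-text layout (ID, CLASS, WEIGHT, REWEIGHT, ...) and that every OSD row has a device class; the function name `reweighted_osds` is hypothetical, and the sample rows are copied from the node3 part of that listing.

```python
def reweighted_osds(df_tree_text):
    """Return {osd_name: reweight} for OSD rows whose REWEIGHT is not 1.0."""
    found = {}
    for line in df_tree_text.splitlines():
        fields = line.split()
        # OSD rows end with "osd.N"; bucket rows (root/host) and the header do not.
        if len(fields) < 4 or not fields[-1].startswith("osd."):
            continue
        # Assumed column order: ID CLASS WEIGHT REWEIGHT ...
        reweight = float(fields[3])
        if reweight != 1.0:
            found[fields[-1]] = reweight
    return found


sample = """\
-4 2.28996 - 1.8 TiB 826 GiB 1.0 TiB 44.05 1.08 - host node3
7 blue_ssd 0.45599 1.00000 467 GiB 202 GiB 265 GiB 43.26 1.06 256 osd.7
11 prod 0.45969 0 0 B 0 B 0 B 0 0 0 osd.11
13 prod 0.45969 0.84837 471 GiB 216 GiB 255 GiB 45.86 1.12 57 osd.13
14 prod 0.91460 0.65007 937 GiB 408 GiB 529 GiB 43.53 1.07 97 osd.14
"""

result = reweighted_osds(sample)
print(result)
```

A reweight of 0 (osd.11 here) corresponds to an OSD that has been marked out.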

Actions #10

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to RADOS
Actions #11

Updated by Jake Grimmett about 5 years ago

I've not tried changing reweights to 1, though last week I ran "ceph osd reweight-by-utilization 110"

Cluster is still showing:
[root@ceph1 ~]# ceph health
HEALTH_WARN 382063/6296785540 objects misplaced (0.006%)

Happy to pull any debug info or logs for the dev team...
:)

Actions #12

Updated by Sage Weil about 5 years ago

  • Status changed from New to Closed

This looks like CRUSH's fault. Can you check which tunables you are running? (ceph osd crush show-tunables)

Using newer tunables may help.

I think the better solution is to get away from using the old reweight-by-utilization and osd reweight values. Instead, use the balancer and crush-compat mode. The balancer will even do a smooth/gradual transition away from the old reweights. See http://docs.ceph.com/docs/master/rados/operations/balancer/?highlight=balancer
