Bug #16051


Restarting the ceph cluster causes PGs to get stuck

Added by Xin Zhao almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello everyone! We have run into a problem that is very difficult to handle. The ceph version we use is 10.2.0. Our cluster consists of 5 hosts, three of which run monitors. Because we do not have enough disks, we deployed each host's OSDs on a single disk (just to check whether the cluster stays stable), and we use an erasure-coded pool.
At first we chose the simple messenger to build connections, but our machines could not support that many threads (5 hosts, 360 OSDs in total, 36 OSDs per host). We then switched to the async messenger. The pool is created fine, but once we restart the cluster, several PGs very often get stuck in 'peering' or 'remapped+peering' state and the cluster reports HEALTH_ERR. Waiting does not resolve the problem; the stuck state is very stubborn.
We guessed this happened because there are too many OSDs, so we reduced the OSD count: 25 and 50 OSDs in total are fine, but once the OSD count reaches 70 the stuck PGs appear again. We then shrank the cluster to test different settings:

3 hosts, 10 osds for each, 1200 pgs, k=6 m=1 (ok)
3 hosts, 15 osds for each, 1800 pgs, k=6 m=1 (ok)
3 hosts, 25 osds for each, 1800 pgs, k=6 m=1 (ok)
3 hosts, 30 osds for each, 1800 pgs, k=6 m=1 (ok)
3 hosts, 35 osds for each, 1800 pgs, k=6 m=1 (fail)

It seems the problem has nothing to do with the PG count, but is related to the OSD count.

Lately we also tested the simple messenger with the EC pool, using ulimit to cut the thread stack size in half (4096k), but the outcome was worse: almost half of the PGs could not become active.
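
Roughly, the stack-size change looks like this (starting the daemon by hand here is only illustrative; the limit applies to whatever is launched from the same shell):

# halve the default per-thread stack size (value is in KiB)
ulimit -s 4096
# start an OSD from this shell so it inherits the limit
ceph-osd -i 0 --cluster ceph
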
The following log excerpt shows what we see. Every OSD in the acting set of a stuck PG records this error: the connection between two OSDs cannot be established because the remote OSD daemon's pid is different from the one recorded in the local OSD's copy of the osdmap.

2016-05-27 17:35:38.233397 7f4bb3385700  0 -- 192.168.18.88:0/14775 >> 192.168.18.89:7025/2571 conn(0x7f4bc7695800 sd=159 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=1)._process_connection connect claims to be 192.168.18.89:7025/5807 not 192.168.18.89:7025/2571 - wrong node!
2016-05-27 17:35:38.233512 7f4bb3b86700 0 -- 192.168.18.88:0/14775 >> 192.168.18.89:7024/2571 conn(0x7f4bc7694000 sd=161 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=1)._process_connection connect claims to be 192.168.18.89:7024/5807 not 192.168.18.89:7024/2571 - wrong node!
2016-05-27 17:35:38.233539 7f4bb3b86700 0 -- 192.168.18.88:0/14775 >> 192.168.18.89:7010/1906 conn(0x7f4bc76f1000 sd=167 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=1)._process_connection connect claims to be 192.168.18.89:7010/5591 not 192.168.18.89:7010/1906 - wrong node!
2016-05-27 17:35:38.235373 7f4bb2383700 0 -- 192.168.18.88:6834/14775 >> 192.168.18.89:7008/1906 conn(0x7f4bc7775800 sd=155 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=0)._process_connection connect claims to be 192.168.18.89:7008/5591 not 192.168.18.89:7008/1906 - wrong node!
2016-05-27 17:35:38.236408 7f4bb4387700 0 -- 192.168.18.88:0/14775 >> 192.168.18.89:7009/1906 conn(0x7f4bc76ef800 sd=165 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=1)._process_connection connect claims to be 192.168.18.89:7009/5591 not 192.168.18.89:7009/1906 - wrong node!
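
One way to cross-check the address and nonce the current osdmap records for an OSD against what the log claims (the OSD id here is illustrative):

# print the ip:port/nonce entries the osdmap holds for osd.31
ceph osd dump | grep '^osd.31 '
# report where the cluster believes osd.31 lives
ceph osd find 31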

Using the pg query command to inspect a stuck PG's state, we always find a "blocked_by" entry, and sometimes the OSD that is "blocked_by" is marked as CRUSH_ITEM_NONE (2147483647) in the acting set.
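
The fragment below comes from output along these lines (the pgid is illustrative):

# dump the full peering state of one PG, including up/acting sets and blocked_by
ceph pg 3.1a7 query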

"up": [
108,
67,
162,
31,
32,
140,
101
],
"acting": [
108,
2147483647,
162,
31,
2147483647,
2147483647,
2147483647
],
"blocked_by": [
67
],
"up_primary": 108,
"acting_primary": 108

By the way, we tried increasing the heartbeat response interval from the default 20s to 60s. Things improved a lot: stuck PGs happen far less often, but they still occur.
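
Assuming the option in question is osd_heartbeat_grace (default 20 seconds), the change looks roughly like this; note that injectargs only lasts until the daemons are restarted:

# persistent change in ceph.conf, [osd] section:
#   osd heartbeat grace = 60
# or inject at runtime on all OSDs:
ceph tell osd.* injectargs '--osd-heartbeat-grace 60'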

A replicated pool has the same problem. We rebuilt a new cluster with 5 hosts and 177 OSDs (35 or 36 OSDs per host) and created a replicated pool with 3 replicas and 8800 PGs. The relevant crush map excerpt (tunables, root bucket, and ruleset) is as follows:

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
root default {
id -1    # do not change unnecessarily
# weight 622.083
alg straw
hash 0 # rjenkins1
item NP-18-84 weight 130.903
item NP-18-86 weight 127.267
item NP-18-87 weight 109.379
item NP-18-88 weight 127.267
item NP-18-89 weight 127.267
}
rule erasure-code {
ruleset 1
type erasure
min_size 3
max_size 10
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type osd
step emit
}
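
For reference, a decompiled map like the one above is usually obtained and re-injected with the round trip below, and the pool itself could have been created with something like the last two commands (the pool name is illustrative):

# dump and decompile the current crush map
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# ... edit crush.txt, then recompile and inject it ...
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

# create the replicated pool with 8800 PGs and 3 replicas
ceph osd pool create testpool 8800 8800 replicated
ceph osd pool set testpool size 3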

The stuck PGs also occurred on this pool. ceph status output:

cluster fd9c77a9-badc-453f-9ac2-51007ab7afc0
health HEALTH_ERR
8 pgs are stuck inactive for more than 300 seconds
40 pgs peering
8 pgs stuck inactive
monmap e1: 1 mons at {NP-18-84=192.168.18.84:6789/0}
election epoch 35, quorum 0 NP-18-84
osdmap e4577: 177 osds: 177 up, 177 in; 21 remapped pgs
flags sortbitwise
pgmap v13775: 2500 pgs, 1 pools, 0 bytes data, 0 objects
2510 GB used, 619 TB / 622 TB avail
2460 active+clean
21 remapped+peering
19 peering
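
The stuck PGs themselves can be listed with commands along these lines:

# explain why the cluster is HEALTH_ERR, including the affected PG ids
ceph health detail
# list PGs that have been stuck inactive longer than the threshold
ceph pg dump_stuck inactive
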
#1

Updated by Josh Durgin over 7 years ago

Two things to try:

1) set crush tunables to optimal - older tunables are the default to stay compatible with older kernels, but they can cause issues like this (especially with few OSDs, or few per host); see http://docs.ceph.com/docs/master/rados/operations/crush-map/#tunables and the sketch after these suggestions

2) async messenger likely has some bugs in 10.2.0 - see if you have the same issues with simple messenger
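
For example, roughly (the messenger change goes in ceph.conf and needs a daemon restart):

# 1) switch to the optimal crush tunables profile
ceph osd crush tunables optimal

# 2) fall back to the simple messenger: in ceph.conf, [global] section
#      ms type = simple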

#2

Updated by Haomai Wang over 7 years ago

Yes, please use 10.2.2, which offers a good async messenger version.

#3

Updated by Sage Weil over 7 years ago

  • Status changed from New to Closed

Please reopen if you still see this on latest jewel.
