Bug #15133

pg stuck in down+peering state

Added by huang jun about 8 years ago. Updated about 8 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version: 0.94.5
kernel version: 3.18.25

The Ceph cluster includes 4 hosts:
server1: 192.168.10.1 (24 osds)
server2: 192.168.10.2 (24 osds)
server3: 192.168.10.3 (24 osds)
server4: 192.168.10.4 (1 mon, 1 mds, 24 osds)

Pool info (we have an EC pool with k:m=3:1):
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 393 flags hashpspool stripe_width 0
pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 389 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 399 flags hashpspool stripe_width 0
pool 3 'ecpool-1' erasure size 4 min_size 3 crush_ruleset 1 object_hash rjenkins pg_num 1152 pgp_num 1152 last_change 408 lfor 408 flags hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 196608
pool 4 'capool-1' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 288 pgp_num 288 last_change 414 flags hashpspool,incomplete_clones tier_of 3 cache_mode readproxy target_bytes 2000000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
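
For reference, this pool listing comes straight from the osdmap; one way to reproduce it, assuming admin access from any cluster node, is:

ceph osd dump | grep '^pool'

which prints one line per pool with its size, min_size, crush_ruleset and cache-tier settings.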

Our test steps:
1. Cut off server1's and server2's power at the same time.
2. Wait for 10 minutes; the status of the 48 osds turns from DOWN to OUT.
During the down-->out transition, there is no client IO.
3. Power on host server1.
4. The final cluster status is:
cluster 29613322-0857-4633-8f73-7f5ebe16f4b8
health HEALTH_WARN
119 pgs degraded
560 pgs down
560 pgs peering
119 pgs stuck degraded
560 pgs stuck inactive
1152 pgs stuck unclean
119 pgs stuck undersized
119 pgs undersized
monmap e1: 1 mons at {server4=192.168.10.4:6789/0}
election epoch 2, quorum 0 server4
mdsmap e6: 1/1/1 up {0=server4=up:active}
osdmap e1101: 96 osds: 72 up, 72 in; 473 remapped pgs
pgmap v6677: 2208 pgs, 5 pools, 45966 MB data, 33630 objects
891 GB used, 130 TB / 130 TB avail
1056 active+clean
560 down+peering
473 active+remapped
119 active+undersized+degraded
5. ceph pg 3.1ac query
"recovery_state": [ {
"name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2016-03-15 16:21:50.002705",
"requested_info_from": []
}, {
"name": "Started\/Primary\/Peering",
"enter_time": "2016-03-15 16:21:50.002696",
"past_intervals": [ {
"first": 906,
"last": 923,
"maybe_went_rw": 1,
"up": [
94,
60,
27,
13
],
"acting": [
94,
60,
27,
13
],
"primary": 94,
"up_primary": 94
}, {
"first": 924,
"last": 925,
"maybe_went_rw": 1,
"up": [
94,
60,
2147483647,
13
],
"acting": [
94,
60,
2147483647,
13
],
"primary": 94,
"up_primary": 94
}, {
"first": 926,
"last": 934,
"maybe_went_rw": 0,
"up": [
94,
2147483647,
2147483647,
13
],
"acting": [
94,
2147483647,
2147483647,
13
],
"primary": 94,
"up_primary": 94
}, {
"first": 935,
"last": 937,
"maybe_went_rw": 1,
"up": [
94,
2147483647,
24,
13
],
"acting": [
94,
2147483647,
24,
13
],
"primary": 94,
"up_primary": 94
}
],
"probing_osds": [
"13(3)",
"24(2)",
"27(2)",
"94(0)"
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
60
],
"peering_blocked_by": [ {
"osd": 60,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}
]
}, {
"name": "Started",
"enter_time": "2016-03-15 16:21:50.002649"
}
],

What we found:
During osdmap epochs [924~925], pg 3.1ac mapped to:
"up": [94, 60, 2147483647, 13]
"acting": [94, 60, 2147483647, 13]

But osd.60 (which is on server2) is now in the DOWN state, so this pg's peering procedure is blocked on osd.60, and the pg state is set to 'down+peering'.

osd.60 was not marked down in epochs 924~925, so there may have been writes/updates to this pg during that interval, and peering cannot continue until we manually mark osd.60 'lost'. But we don't have any client IO at all.
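
To double-check this, assuming the monitor has not yet trimmed those epochs, the historical osdmaps around the interval can be dumped and the status line for osd.60 compared, for example:

ceph osd dump 924 | grep 'osd.60 '
ceph osd dump 926 | grep 'osd.60 '

which should show osd.60 still marked up at epoch 924 but down from epoch 926 onward, matching the past_intervals in the pg query output above.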

Is there any way to handle this special case more intelligently? Does the fast-peering work plan to resolve this?

History

#1 Updated by Desmond Shih about 8 years ago

Is that OSD's disk totally broken or not?
In my opinion, the osd state (up/down) indicates whether the osd daemon is alive or not.
Did you confirm that the osd's disk (osd.60) is alive?
It seems that this pg is not in the "down+peering" state:
...
ceph pg 3.1ac query
"recovery_state": [ {
"name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2016-03-15 16:21:50.002705",
...
It just needs one of its osds (osd.60) to recover for this pg to become healthy again.

Maybe the "down+peering" pgs are other pgs, because the other pools in your cluster are set to two replicas, and you shut down two of your cluster's hosts. So the problem pgs might be in other pools whose size is set to the default of 2.
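One quick way to check that, assuming an admin keyring on one of the nodes, is to list the stuck-inactive pgs and look at the pool id prefix of each pgid:

ceph pg dump_stuck inactive

pgids have the form <pool_id>.<hash>, so entries starting with '3.' belong to ecpool-1, while entries starting with '0.', '1.', '2.' or '4.' belong to the replicated pools.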

#2 Updated by huang jun about 8 years ago

osd.60 is on server2, which I powered off already, so osd.60 is down in the osdmap.

osd.60 was already down, but because of the failure detection and reporting latency, it was still marked up during epochs [924~925].

#3 Updated by huang jun about 8 years ago

BTW, we set
mon_osd_adjust_heartbeat_grace = false
which reports an osd down immediately when it fails. Maybe this causes the osdmap to change frequently; in my case, the osdmap changed between epochs 924 and 925. I will try setting it back to true.
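
For reference, this option is normally set in the [mon] section of ceph.conf and picked up at monitor start, and its live value can be checked through the monitor admin socket. A sketch, assuming the single monitor is named mon.server4 as in the status output above:

[mon]
    mon osd adjust heartbeat grace = false

ceph daemon mon.server4 config get mon_osd_adjust_heartbeat_grace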

#4 Updated by huang jun about 8 years ago

I set mon_osd_adjust_heartbeat_grace = true and ran the same test as described; there is a smaller chance of getting the down+peering pg state. I think setting it to true means the mon marks osds down more slowly, which avoids frequent osdmap changes. In this particular test case the osds on the two hosts may be marked down in different epochs; if we delay the 'mark down' process, those osds may all be marked down in the same epoch.

#5 Updated by Samuel Just about 8 years ago

  • Status changed from New to Rejected

This is working as intended. For a few epochs, the acting set had those 3 osds in it, so the cluster can't prove that it didn't accept writes. You can recover from such a case by marking the lost osd lost.
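
For anyone hitting the same state: the workaround is the osd lost command, which should only be used when the data on that osd is known to be gone for good, since it tells the cluster to give up on any writes that osd alone may have held. A sketch, using osd.60 from this report:

ceph osd lost 60 --yes-i-really-mean-it

After that, peering on the blocked pgs should be able to proceed, at the cost of losing whatever unrecovered writes existed only on osd.60.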
