Bug #12580 (closed): Enclosure power failure pausing client IO

Added by Mallikarjun Biradar over 8 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have active client IO running on the cluster (random write profile with 4M block size & 64 queue depth).

One of the storage enclosures had a power loss, so all OSDs on the hosts connected to this enclosure went down, as expected.

But client IO got paused (size=2 & min_size=1). After some time the enclosure and the hosts connected to it came back up, and all OSDs on those hosts came up.

Until that time, the cluster was not serving IO. Once all hosts and OSDs pertaining to that enclosure came up, client IO resumed.

Setup Details:
Total number of hosts: 8
Number of storage enclosures/chassis: 2 (each connected to 4 hosts)
Failure domain: chassis
Replication size: 2
Min size: 1
All pools were created with the chassis ruleset.

This issue was seen on the Giant release, 0.87.2.
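For reference, a configuration along these lines would reproduce the setup described above. This is only a sketch for a Giant-era cluster; the rule name "chassis_rule", pool name "testpool", and PG count are placeholders, not taken from this cluster:

# Replicated rule that places each replica under a different chassis
$ ceph osd crush rule create-simple chassis_rule default chassis
# Pool using that rule, with size=2 / min_size=1 as in the report
$ ceph osd pool create testpool 1024 1024 replicated chassis_rule
$ ceph osd pool set testpool size 2
$ ceph osd pool set testpool min_size 1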

Actions #1

Updated by Varada Kari over 8 years ago

To add more context to the problem:

min_size was set to 1 and the replication size to 2.

There was a flaky power connection to one of the enclosures. With min_size 1 we were able to continue the IOs, and recovery became active once the power came back. But if there is another power failure while recovery is in progress, some of the PGs go to the down+peering state.

Extract from pg query.

$ ceph pg 1.143 query
{ "state": "down+peering",
  "snap_trimq": "[]",
  "epoch": 3918,
  "up": [ 17 ],
  "acting": [ 17 ],
  "info": { "pgid": "1.143",
      "last_update": "3166'40424",
      "last_complete": "3166'40424",
      "log_tail": "2577'36847",
      "last_user_version": 40424,
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      ......
  "recovery_state": [
      { "name": "Started\/Primary\/Peering\/GetInfo",
        "enter_time": "2015-07-15 12:48:51.372676",
        "requested_info_from": []},
      { "name": "Started\/Primary\/Peering",
        "enter_time": "2015-07-15 12:48:51.372675",
        "past_intervals": [
            { "first": 3147,
              "last": 3166,
              "maybe_went_rw": 1,
              "up": [ 17, 4 ],
              "acting": [ 17, 4 ],
              "primary": 17,
              "up_primary": 17},
            { "first": 3167,
              "last": 3167,
              "maybe_went_rw": 0,
              "up": [ 10, 20 ],
              "acting": [ 10, 20 ],
              "primary": 10,
              "up_primary": 10},
            { "first": 3168,
              "last": 3181,
              "maybe_went_rw": 1,
              "up": [ 10, 20 ],
              "acting": [ 10, 4 ],
              "primary": 10,
              "up_primary": 10},
            { "first": 3182,
              "last": 3184,
              "maybe_went_rw": 0,
              "up": [ 20 ],
              "acting": [ 4 ],
              "primary": 4,
              "up_primary": 20},
            { "first": 3185,
              "last": 3188,
              "maybe_went_rw": 1,
              "up": [ 20 ],
              "acting": [ 20 ],
              "primary": 20,
              "up_primary": 20}],
        "probing_osds": [ "17", "20" ],
        "blocked": "peering is blocked due to down osds",
        "down_osds_we_would_probe": [ 4, 10 ],
        "peering_blocked_by": [
            { "osd": 4,
              "current_lost_at": 0,
              "comment": "starting or marking this osd lost may let us proceed"},
            { "osd": 10,
              "current_lost_at": 0,
              "comment": "starting or marking this osd lost may let us proceed"}]},
      { "name": "Started",
        "enter_time": "2015-07-15 12:48:51.372671"}],
  "agent_state": {}}

And the PGs do not come back to active+clean until power is restored; during this period no IO is allowed to the cluster. I am not able to follow why the PGs end up in the peering state. Each PG has one copy in each of the two enclosures, so if one enclosure is down for some time we should still be able to serve IO from the second one. That holds when no recovery IO is involved; when recovery is in progress, some PGs end up in the down+peering state.
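As the peering_blocked_by comments in the query output suggest, the only ways to unblock such a PG before power returns are to bring the down OSDs back or to explicitly declare them lost. The latter is sketched below only to illustrate the trade-off: the maybe_went_rw=1 intervals mean the down OSDs may hold the only copy of some writes, and that is exactly the data-safety check keeping the PG down.

# Destructive: lets peering proceed, but any writes held only by these OSDs are discarded
$ ceph osd lost 4 --yes-i-really-mean-it
$ ceph osd lost 10 --yes-i-really-mean-it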

Actions #2

Updated by Varada Kari over 8 years ago

BTW, the CRUSH failure domain was set to the chassis level. Until the other chassis is powered up we cannot complete peering, as expected. What is not understood is why the PGs land in the peering state in the first place.
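For completeness, the chassis-level failure domain can be confirmed from the CRUSH map itself; for a rule like the one sketched earlier, the chooseleaf step should show type chassis:

# Check the chooseleaf step type of the rule used by the pool
$ ceph osd crush rule dump
# Verify the hierarchy actually groups hosts under chassis buckets
$ ceph osd tree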

Actions #3

Updated by Varada Kari over 8 years ago

Steps that led to the issue:

Step 1:
Populated some data.
Cluster is in a clean state & all OSDs up. Client IO is active.
Chassis 1 went down with a power failure.

Step 2:
Cluster in rebalancing state, all OSDs up.
Total data in the cluster is 14TB. Client IO is active.
Chassis 1 went down with a power failure. -> Client IO paused till all clients came up.

Step 3:
Cluster in a clean state.
All OSDs up. Client IO is active.
Chassis 1 went down with a power failure. -> Client IO is not impacted.

Step 4:
Cluster in rebalancing state. One of the hosts connected to Chassis 1 is down.
Client IO is active.
Chassis 1 went down with a power failure. -> Client IO is not impacted.

Step 5:
Cluster in rebalancing state. All OSDs up (all connected hosts up).
Client IO is active.
Chassis 1 went down with a power failure. -> Client IO paused.
Some PGs (37 out of 1024) went to the peering state (commands to spot these are sketched after these steps).
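A quick way to spot the affected PGs while repeating these steps, using standard Ceph CLI commands (nothing here is specific to this cluster), as referenced in Step 5:

# Overall health, including which PGs are down/peering
$ ceph health detail
# PGs stuck in an inactive state (covers down+peering)
$ ceph pg dump_stuck inactive
# Brief per-PG state listing, filtered for peering
$ ceph pg dump pgs_brief | grep peering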

Actions #4

Updated by Sage Weil over 8 years ago

  • Status changed from New to Can't reproduce