Bug #22848: PGs stuck peering after a cable is pulled and plugged back in 5 minutes later, until ceph-osd is restarted - RADOS - Ceph

Added by Yong Wang about 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee:
Category:
Target version: Ceph - v10.2.11
% Done:
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: Ceph - v10.2.9
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,
We have a 3-node Ceph cluster running version 10.2.10.
It is a fresh installation using the official RPMs from download.ceph.com.
During a test we found that some PGs stay in the peering state and remain stuck
until I restart the primary OSD that those PGs map to.
The OSD I restarted is osd.22, on the node whose cable was pulled and plugged back in.
I think this may be a serious bug, because the PGs stay in this state indefinitely.

2018-02-01 11:56:43.818399 7eff4d08e700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.117346 secs
2018-02-01 11:56:43.818405 7eff4d08e700  0 log_channel(cluster) log [WRN] : slow request 30.117346 seconds old, received at 2018-02-01 11:56:13.701020: osd_op(client.3478190.1:2007253 51.5f8eefc9 10000000115.00004605 [write 1933312~4096 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) currently waiting for peered
2018-02-01 11:56:44.818547 7eff4d08e700  0 log_channel(cluster) log [WRN] : 3 slow requests, 2 included below; oldest blocked for > 31.117467 secs
2018-02-01 11:56:44.818557 7eff4d08e700  0 log_channel(cluster) log [WRN] : slow request 30.825603 seconds old, received at 2018-02-01 11:56:13.992885: osd_op(client.3447345.1:1513575 51.c2291f3e 100000004b9.00002522 [write 0~2785280 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) currently waiting for peered
2018-02-01 11:56:44.818562 7eff4d08e700  0 log_channel(cluster) log [WRN] : slow request 30.405136 seconds old, received at 2018-02-01 11:56:14.413352: osd_op(client.3447345.1:1513752 51.c2291f3e 100000004b9.00002522 [write 2785280~4096 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) currently waiting for peered
2018-02-01 11:56:47.437123 7eff4d88f700  0 -- 10.0.20.92:6914/6011090 >> 10.0.20.92:6802/8022975 conn(0x7efee6b1b000 sd=-1 :-1 s=STATE_WAIT pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2018-02-01 11:56:47.437161 7eff4f893700  0 -- 10.0.20.92:6914/6011090 >> 10.0.20.92:6802/8022975 conn(0x7efed44c5000 sd=2209 :6914 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_STANDBY
2018-02-01 11:56:47.437223 7eff4f893700  0 -- 10.0.20.92:6914/6011090 >> 10.0.20.92:6802/8022975 conn(0x7efed44c5000 sd=2209 :6914 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=0 existing_state=STATE_STANDBY
2018-02-01 11:56:53.819381 7eff4d08e700  0 log_channel(cluster) log [WRN] : 4 slow requests, 1 included below; oldest blocked for > 40.118314 secs
2018-02-01 11:56:53.819388 7eff4d08e700  0 log_channel(cluster) log [WRN] : slow request 30.451847 seconds old, received at 2018-02-01 11:56:23.367488: osd_op(mds.0.485:38951 51.ae313465 (undecoded) ack+read+rwordered+known_if_redirected+full_force e18185) currently waiting for peered
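
For anyone triaging similar logs, here is a minimal Python sketch that pulls the request age, the raw PG field, and the wait state out of "slow request" warnings like the ones in this report. It assumes each log entry sits on a single line. The function name `parse_slow_request` and the returned field names are made up for illustration; this is not a Ceph tool or API.

```python
import re

# Pattern for the "slow request" warnings an OSD writes to the cluster log.
# Derived only from the log lines in this report; other Ceph versions may
# format these lines differently.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"osd_op\((?P<source>\S+) (?P<pg>\d+\.[0-9a-f]+) .*\) "
    r"currently (?P<state>.+)$"
)


def parse_slow_request(line):
    """Return a dict describing one slow-request warning, or None."""
    match = SLOW_REQUEST.search(line.rstrip())
    if match is None:
        return None
    return {
        "age_seconds": float(match.group("age")),
        "received": match.group("received"),
        "source": match.group("source"),  # e.g. client.<id>.<n>:<tid>
        "pg": match.group("pg"),          # pool id and raw hash as logged
        "state": match.group("state"),
    }


if __name__ == "__main__":
    sample = (
        "2018-02-01 11:56:43.818405 7eff4d08e700  0 log_channel(cluster) "
        "log [WRN] : slow request 30.117346 seconds old, received at "
        "2018-02-01 11:56:13.701020: osd_op(client.3478190.1:2007253 "
        "51.5f8eefc9 10000000115.00004605 [write 1933312~4096 [1@-1]] "
        "snapc 1=[] ondisk+write+known_if_redirected e18185) "
        "currently waiting for peered"
    )
    print(parse_slow_request(sample))
```

Every slow request in the excerpt reports the same state, "waiting for peered", which is consistent with the PGs never leaving peering.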

Steps to reproduce: pull the cable, wait 5 minutes, then plug the cable back in. The PGs stay stuck for a long time, until ceph-osd is restarted.

ceph -s
cluster 11fc0e41-cb88-4002-a115-571a3f616b7a
health HEALTH_WARN
7 pgs peering
7 pgs stuck inactive
7 pgs stuck unclean
46 requests are blocked > 32 sec
monmap e1: 3 mons at {node92=10.0.20.92:6789/0,node93=10.0.20.93:6789/0,node94=10.0.20.94:6789/0}
election epoch 928, quorum 0,1,2 node92,node93,node94
fsmap e489: 1/1/1 up {0=node93=up:active}, 1 up:standby
osdmap e18196: 75 osds: 75 up, 75 in
flags sortbitwise,require_jewel_osds
pgmap v904368: 4856 pgs, 10 pools, 22024 GB data, 5511 kobjects
44106 GB used, 87579 GB / 128 TB avail
4849 active+clean
7 peering

{
"name": "node94",
"rank": 2,
"state": "peon",
"election_epoch": 928,
"quorum": [
0,
1,
2
],
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 1,
"fsid": "11fc0e41-cb88-4002-a115-571a3f616b7a",
"modified": "2018-01-04 15:38:07.948152",
"created": "2018-01-04 15:38:07.948152",
"mons": [ {
"rank": 0,
"name": "node92",
"addr": "10.0.20.92:6789\/0"
}, {
"rank": 1,
"name": "node93",
"addr": "10.0.20.93:6789\/0"
}, {
"rank": 2,
"name": "node94",
"addr": "10.0.20.94:6789\/0"
}
]
}
}

ceph health detail
HEALTH_WARN 7 pgs peering; 7 pgs stuck inactive; 7 pgs stuck unclean; 46 requests are blocked > 32 sec; 2 osds have slow requests
pg 51.bb8 is stuck inactive for 4444.964483, current state peering, last acting [22,61]
pg 51.10a5 is stuck inactive for 4462.902957, current state peering, last acting [61,22]
pg 51.108e is stuck inactive for 4511.913559, current state peering, last acting [61,22]
pg 51.465 is stuck inactive for 4444.965214, current state peering, last acting [22,61]
pg 51.dc8 is stuck inactive for 4447.723889, current state peering, last acting [61,22]
pg 51.f3e is stuck inactive for 4444.964224, current state peering, last acting [22,61]
pg 51.fc9 is stuck inactive for 4444.965689, current state peering, last acting [22,61]
pg 51.bb8 is stuck unclean for 4447.129294, current state peering, last acting [22,61]
pg 51.10a5 is stuck unclean for 4462.903174, current state peering, last acting [61,22]
pg 51.108e is stuck unclean for 4767.972795, current state peering, last acting [61,22]
pg 51.465 is stuck unclean for 4461.220115, current state peering, last acting [22,61]
pg 51.dc8 is stuck unclean for 4767.973020, current state peering, last acting [61,22]
pg 51.f3e is stuck unclean for 4468.751189, current state peering, last acting [22,61]
pg 51.fc9 is stuck unclean for 4446.945741, current state peering, last acting [22,61]
pg 51.fc9 is peering, acting [22,61]
pg 51.f3e is peering, acting [22,61]
pg 51.dc8 is peering, acting [61,22]
pg 51.465 is peering, acting [22,61]
pg 51.108e is peering, acting [61,22]
pg 51.10a5 is peering, acting [61,22]
pg 51.bb8 is peering, acting [22,61]
16 ops are blocked > 8388.61 sec on osd.61
6 ops are blocked > 4194.3 sec on osd.61
17 ops are blocked > 8388.61 sec on osd.22
7 ops are blocked > 4194.3 sec on osd.22
2 osds have slow requests
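
To see which OSD is the likely restart candidate, the stuck PGs can be grouped by the first OSD in their acting set. The following is a minimal Python sketch over saved `ceph health detail` text. It assumes the line format shown in this report and treats the first entry of "last acting" as the primary, which is the usual case but not guaranteed. The function name `stuck_pgs_by_primary` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern for "pg ... is stuck ..." lines, derived from the output in this
# report; the format may differ in other Ceph releases.
STUCK_PG = re.compile(
    r"pg (?P<pgid>\S+) is stuck (?P<kind>\w+) for (?P<seconds>[\d.]+), "
    r"current state (?P<state>\S+), last acting \[(?P<acting>[\d,]+)\]"
)


def stuck_pgs_by_primary(health_detail):
    """Map each presumed primary OSD id to the sorted stuck PG ids it leads."""
    by_primary = defaultdict(set)
    for match in STUCK_PG.finditer(health_detail):
        acting = [int(osd) for osd in match.group("acting").split(",")]
        # Assumption: the first OSD in the acting set is the primary.
        by_primary[acting[0]].add(match.group("pgid"))
    return {osd: sorted(pgs) for osd, pgs in by_primary.items()}


if __name__ == "__main__":
    sample = """\
pg 51.bb8 is stuck inactive for 4444.964483, current state peering, last acting [22,61]
pg 51.10a5 is stuck inactive for 4462.902957, current state peering, last acting [61,22]
pg 51.bb8 is stuck unclean for 4447.129294, current state peering, last acting [22,61]
"""
    for osd, pgs in sorted(stuck_pgs_by_primary(sample).items()):
        print(f"osd.{osd}: {', '.join(pgs)}")
```

In the output above, all seven stuck PGs have osd.22 and osd.61 as their acting set, with each OSD listed first for some of them.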

ceph daemon mon.node94 mon_status (its output is the JSON block that follows the ceph -s summary above)

Updated by Yong Wang about 6 years ago

Why do the PGs stay in the peering state forever? I am sure that the monitors and OSDs are all fine.

The PG state machine should be working normally, so what is the peering state waiting for?

If there is no explanation for this, it must be a bug.

Updated by Greg Farnum about 6 years ago

Project changed from Ceph to RADOS
Category deleted (~~OSD~~)

Updated by Josh Durgin about 6 years ago

Which cable are you pulling? Do you have logs from the monitors and osds? The default failure detection timeouts can be up to 15 minutes for a small cluster.

Updated by Yong Wang about 6 years ago

Hi Josh Durgin,
1. Both are fibre-optic cables attached to our network card.
2. I cannot attach the log files yet because the attachment size limit is too small.
I have checked all the monitor and OSD logs. Apart from the entries already pasted above, there is very little in them and they look fine.
3. The OSD heartbeat checks should be fine, since the logs show no communication errors.
The PGs get stuck in peering only after we plug the cable back in. If we leave the cable unplugged, the PGs do not get stuck in peering.
Why is the default failure detection timeout close to 15 minutes? Are there configuration options to adjust it?
We did not wait a full 15 minutes, but that period is too long in any case.
A 15-minute delay would cause many I/Os to fail.
Thanks.
