Bug #22848

open

Pull the cable, plug it back in 5 minutes later: PGs stay stuck for a long time, until ceph-osd is restarted

Added by Yong Wang over 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,
We have a 3-node Ceph cluster, version 10.2.10.
It is a fresh installation using the official RPMs from download.ceph.com.
During a test we found that some PGs stay in the peering state and remain stuck
until I restart the primary OSD that the PG maps to.
The OSD I restarted is osd.22, on the node whose cable was pulled and plugged back in.
I think this may be a serious bug, because the PGs stay in this state indefinitely.

2018-02-01 11:56:43.818399 7eff4d08e700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.117346 secs
2018-02-01 11:56:43.818405 7eff4d08e700 0 log_channel(cluster) log [WRN] : slow request 30.117346 seconds old, received at 2018-02-01 11:56:13.701020: osd_op(client.3478190.1:2007253 51.5f8eefc9 10000000115.00004605 [write 1933312~4096 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) currently waiting for peered
2018-02-01 11:56:44.818547 7eff4d08e700 0 log_channel(cluster) log [WRN] : 3 slow requests, 2 included below; oldest blocked for > 31.117467 secs
2018-02-01 11:56:44.818557 7eff4d08e700 0 log_channel(cluster) log [WRN] : slow request 30.825603 seconds old, received at 2018-02-01 11:56:13.992885: osd_op(client.3447345.1:1513575 51.c2291f3e 100000004b9.00002522 [write 0~2785280 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) currently waiting for peered
2018-02-01 11:56:44.818562 7eff4d08e700 0 log_channel(cluster) log [WRN] : slow request 30.405136 seconds old, received at 2018-02-01 11:56:14.413352: osd_op(client.3447345.1:1513752 51.c2291f3e 100000004b9.00002522 [write 2785280~4096 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) currently waiting for peered
2018-02-01 11:56:47.437123 7eff4d88f700 0 -- 10.0.20.92:6914/6011090 >> 10.0.20.92:6802/8022975 conn(0x7efee6b1b000 sd=-1 :-1 s=STATE_WAIT pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2018-02-01 11:56:47.437161 7eff4f893700 0 -- 10.0.20.92:6914/6011090 >> 10.0.20.92:6802/8022975 conn(0x7efed44c5000 sd=2209 :6914 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_STANDBY
2018-02-01 11:56:47.437223 7eff4f893700 0 -- 10.0.20.92:6914/6011090 >> 10.0.20.92:6802/8022975 conn(0x7efed44c5000 sd=2209 :6914 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=0 existing_state=STATE_STANDBY
2018-02-01 11:56:53.819381 7eff4d08e700 0 log_channel(cluster) log [WRN] : 4 slow requests, 1 included below; oldest blocked for > 40.118314 secs
2018-02-01 11:56:53.819388 7eff4d08e700 0 log_channel(cluster) log [WRN] : slow request 30.451847 seconds old, received at 2018-02-01 11:56:23.367488: osd_op(mds.0.485:38951 51.ae313465 (undecoded) ack+read+rwordered+known_if_redirected+full_force e18185) currently waiting for peered
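
For anyone triaging similar logs, here is a minimal Python sketch that pulls the age, requester, and blocking state out of the "slow request" lines above. The regular expression and the field names (`requester`, `pool_hash`, `state`) are my own assumptions based only on the lines in this report, not an official Ceph log format definition. The second field of each `osd_op(...)` is kept as an opaque string. Its trailing hex digits in the log above (fc9, f3e, 465) line up with three of the stuck PG IDs in the health output, but the sketch does not rely on that.

```python
import re

# Assumed pattern, derived from the log lines in this report only.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"osd_op\((?P<requester>\S+) (?P<pool_hash>\S+) .*\) "
    r"currently (?P<state>.+)$"
)


def parse_slow_requests(lines):
    """Return one dict per slow-request line; all other lines are skipped."""
    results = []
    for line in lines:
        match = SLOW_REQUEST.search(line.rstrip())
        if match:
            entry = match.groupdict()
            entry["age"] = float(entry["age"])
            results.append(entry)
    return results


sample = [
    "2018-02-01 11:56:43.818405 7eff4d08e700 0 log_channel(cluster) log [WRN] : "
    "slow request 30.117346 seconds old, received at 2018-02-01 11:56:13.701020: "
    "osd_op(client.3478190.1:2007253 51.5f8eefc9 10000000115.00004605 "
    "[write 1933312~4096 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e18185) "
    "currently waiting for peered",
    "2018-02-01 11:56:47.437123 7eff4d88f700 0 -- 10.0.20.92:6914/6011090 >> "
    "10.0.20.92:6802/8022975 conn(0x7efee6b1b000 sd=-1 :-1 s=STATE_WAIT pgs=0 cs=0 "
    "l=0).fault with nothing to send, going to standby",
]

slow = parse_slow_requests(sample)
for entry in slow:
    print(entry["requester"], entry["pool_hash"], entry["age"], entry["state"])
```

Grouping the parsed entries by `state` makes it easy to confirm that every blocked op is in "waiting for peered" rather than waiting on disk or on a replica.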

Steps to reproduce: pull the cable, wait 5 minutes, then plug it back in. The PGs stay stuck for a long time, until ceph-osd is restarted.

ceph -s
    cluster 11fc0e41-cb88-4002-a115-571a3f616b7a
     health HEALTH_WARN
            7 pgs peering
            7 pgs stuck inactive
            7 pgs stuck unclean
            46 requests are blocked > 32 sec
     monmap e1: 3 mons at {node92=10.0.20.92:6789/0,node93=10.0.20.93:6789/0,node94=10.0.20.94:6789/0}
            election epoch 928, quorum 0,1,2 node92,node93,node94
      fsmap e489: 1/1/1 up {0=node93=up:active}, 1 up:standby
     osdmap e18196: 75 osds: 75 up, 75 in
            flags sortbitwise,require_jewel_osds
      pgmap v904368: 4856 pgs, 10 pools, 22024 GB data, 5511 kobjects
            44106 GB used, 87579 GB / 128 TB avail
                4849 active+clean
                   7 peering

ceph daemon mon.node94 mon_status
{
    "name": "node94",
    "rank": 2,
    "state": "peon",
    "election_epoch": 928,
    "quorum": [
        0,
        1,
        2
    ],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 1,
        "fsid": "11fc0e41-cb88-4002-a115-571a3f616b7a",
        "modified": "2018-01-04 15:38:07.948152",
        "created": "2018-01-04 15:38:07.948152",
        "mons": [
            {
                "rank": 0,
                "name": "node92",
                "addr": "10.0.20.92:6789\/0"
            },
            {
                "rank": 1,
                "name": "node93",
                "addr": "10.0.20.93:6789\/0"
            },
            {
                "rank": 2,
                "name": "node94",
                "addr": "10.0.20.94:6789\/0"
            }
        ]
    }
}
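
To rule out the monitors as the cause, the mon_status JSON above can be checked mechanically. This is a small Python sketch under the assumption that the JSON has the same keys as the output pasted in this report (`quorum`, `monmap.mons[].rank`). The function name `quorum_summary` and the trimmed sample are illustrative only.

```python
import json


def quorum_summary(mon_status_text):
    """Report which monitor ranks listed in the monmap are absent from quorum."""
    status = json.loads(mon_status_text)
    mon_ranks = {mon["rank"] for mon in status["monmap"]["mons"]}
    missing = sorted(mon_ranks - set(status["quorum"]))
    return {
        "name": status["name"],
        "state": status["state"],
        "missing_from_quorum": missing,
    }


# Trimmed copy of the mon_status output from this report.
# A raw string keeps the "\/" escapes exactly as Ceph printed them.
sample = r'''
{
    "name": "node94",
    "rank": 2,
    "state": "peon",
    "election_epoch": 928,
    "quorum": [0, 1, 2],
    "monmap": {
        "epoch": 1,
        "mons": [
            {"rank": 0, "name": "node92", "addr": "10.0.20.92:6789\/0"},
            {"rank": 1, "name": "node93", "addr": "10.0.20.93:6789\/0"},
            {"rank": 2, "name": "node94", "addr": "10.0.20.94:6789\/0"}
        ]
    }
}
'''

summary = quorum_summary(sample)
print(summary)
```

For the output in this report, no monitor is missing from quorum, which suggests the stuck state is on the OSD side rather than in the monitor cluster.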

ceph health detail
HEALTH_WARN 7 pgs peering; 7 pgs stuck inactive; 7 pgs stuck unclean; 46 requests are blocked > 32 sec; 2 osds have slow requests
pg 51.bb8 is stuck inactive for 4444.964483, current state peering, last acting [22,61]
pg 51.10a5 is stuck inactive for 4462.902957, current state peering, last acting [61,22]
pg 51.108e is stuck inactive for 4511.913559, current state peering, last acting [61,22]
pg 51.465 is stuck inactive for 4444.965214, current state peering, last acting [22,61]
pg 51.dc8 is stuck inactive for 4447.723889, current state peering, last acting [61,22]
pg 51.f3e is stuck inactive for 4444.964224, current state peering, last acting [22,61]
pg 51.fc9 is stuck inactive for 4444.965689, current state peering, last acting [22,61]
pg 51.bb8 is stuck unclean for 4447.129294, current state peering, last acting [22,61]
pg 51.10a5 is stuck unclean for 4462.903174, current state peering, last acting [61,22]
pg 51.108e is stuck unclean for 4767.972795, current state peering, last acting [61,22]
pg 51.465 is stuck unclean for 4461.220115, current state peering, last acting [22,61]
pg 51.dc8 is stuck unclean for 4767.973020, current state peering, last acting [61,22]
pg 51.f3e is stuck unclean for 4468.751189, current state peering, last acting [22,61]
pg 51.fc9 is stuck unclean for 4446.945741, current state peering, last acting [22,61]
pg 51.fc9 is peering, acting [22,61]
pg 51.f3e is peering, acting [22,61]
pg 51.dc8 is peering, acting [61,22]
pg 51.465 is peering, acting [22,61]
pg 51.108e is peering, acting [61,22]
pg 51.10a5 is peering, acting [61,22]
pg 51.bb8 is peering, acting [22,61]
16 ops are blocked > 8388.61 sec on osd.61
6 ops are blocked > 4194.3 sec on osd.61
17 ops are blocked > 8388.61 sec on osd.22
7 ops are blocked > 4194.3 sec on osd.22
2 osds have slow requests
ceph daemon mon.node94 mon_status
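
To see at a glance which OSDs are involved, the "stuck" lines of the ceph health detail output above can be tallied per OSD. The Python sketch below assumes the line layout shown in this report. The regular expression and the helper name `osds_in_stuck_pgs` are my own and are not part of any Ceph tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the "ceph health detail" lines in this report.
STUCK_PG = re.compile(
    r"pg (?P<pgid>\S+) is stuck (?P<kind>\w+) for (?P<secs>[\d.]+), "
    r"current state (?P<state>\S+), last acting \[(?P<acting>[\d,]+)\]"
)


def osds_in_stuck_pgs(health_detail_text, kind="inactive"):
    """Count how many stuck PGs of the given kind each OSD appears in."""
    counts = Counter()
    for match in STUCK_PG.finditer(health_detail_text):
        if match.group("kind") != kind:
            continue
        for osd in match.group("acting").split(","):
            counts[int(osd)] += 1
    return counts


sample = """\
pg 51.bb8 is stuck inactive for 4444.964483, current state peering, last acting [22,61]
pg 51.10a5 is stuck inactive for 4462.902957, current state peering, last acting [61,22]
pg 51.bb8 is stuck unclean for 4447.129294, current state peering, last acting [22,61]
"""

counts = osds_in_stuck_pgs(sample)
for osd, n in sorted(counts.items()):
    print(f"osd.{osd}: {n} stuck inactive PGs")
```

In the full output above, every stuck PG has osd.22 and osd.61 as its acting set, which matches the observation that restarting osd.22 clears the condition.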