Bug #12581 (closed): osd: pg stuck at peering

Added by Guang Yang over 8 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version: v0.87

The PG is stuck in peering even though there are no writes to it. It looks like it is stuck at GetInfo, and even though all (live) peers respond, the state machine does not move forward.

PG id: 3.8c7

Log will be attached.
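For reference, a minimal sketch of how to inspect a PG stuck in peering (assuming admin access to the cluster; output format varies by release):

# list PGs stuck in an inactive state (includes peering)
ceph pg dump_stuck inactive

# query the PG directly; the recovery_state section shows the peering
# state (e.g. Started/Primary/Peering/GetInfo) and which peers it is
# still waiting on
ceph pg 3.8c7 query

# cluster-wide summary of problem PGs
ceph health detail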


Actions #1

Updated by Guang Yang over 8 years ago

Log file - https://www.dropbox.com/sh/g8d0hc0kqwt9yig/AABy9UqOnD7wWDAlpGvJuAC0a?dl=0

(I had a problem attaching the log directly to this bug.)

Actions #2

Updated by Guang Yang over 8 years ago

Digging further into the peering log, I found the problem came from OSD.43, which did not respond to the GetInfo query in a timely manner.
The bigger problem is that on OSD.43, although the daemon is running, on the filesystem it looks like a bunch of metadata (including the journal link) was erased for an unclear reason.

-bash-4.1$ hostname
osd005c201.cos.gq1.yahoo.com
-bash-4.1$ pwd
/home/y/var/lib/ceph/osd/ceph-43
-bash-4.1$ ls
current  superblock
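
For comparison, an illustrative listing of what a FileStore OSD data directory of this era (v0.87) normally contains; this is not taken from the affected host, and the exact set of files varies by deployment, but it shows how much is missing when only current and superblock remain:

# typical FileStore OSD data directory (illustrative, not from osd.43)
$ ls /var/lib/ceph/osd/ceph-<id>
ceph_fsid  current  fsid  journal  keyring  magic  ready  store_version  superblock  whoami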

Actions #3

Updated by Guang Yang over 8 years ago

Got some logs from OSD.43:

2015-08-04 18:34:21.036670 7f42a89af700 10 osd.43 pg_epoch: 51340 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_advance_map [247,486,146,65,42,43,210,193,408,509,350]/[247,486,146,65,42,43,210,193,408,509,350] -- 247/247
2015-08-04 18:34:21.036716 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] state<Started>: Started advmap
2015-08-04 18:34:21.036745 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] check_recovery_sources no source osds () went down
2015-08-04 18:34:21.036769 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map
2015-08-04 18:34:21.036787 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] take_waiters
2015-08-04 18:34:21.036804 7f42a89af700 20 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map: Not dirtying info: last_persisted is 51340 while current is 51341
2015-08-04 18:34:21.036824 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_peering_event: epoch_sent: 51341 epoch_requested: 51341 NullEvt
2015-08-04 18:34:21.040673 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_peering_event: epoch_sent: 51341 epoch_requested: 51341 MQuery from 247(0) query_epoch 51341 query: query(info 0'0)
2015-08-04 18:34:22.010150 7f42ae9bb700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] null
2015-08-04 18:34:22.013384 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_advance_map [247,486,146,65,42,43,210,193,408,509,350]/[247,486,146,65,42,43,210,193,408,509,350] -- 247/247
2015-08-04 18:34:22.013431 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] state<Started>: Started advmap
2015-08-04 18:34:22.013453 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] check_recovery_sources no source osds () went down
2015-08-04 18:34:22.013468 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map
2015-08-04 18:34:22.013479 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] take_waiters
2015-08-04 18:34:22.013498 7f42a89af700 20 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map: Not dirtying info: last_persisted is 51340 while current is 51342
2015-08-04 18:34:22.013510 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_peering_event: epoch_sent: 51342 epoch_requested: 51342 NullEvt
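
A rough key to the per-PG prefix in these lines, based on how the OSD formats PG state (only the fields relevant here; a reading aid, not authoritative):

3.8c7s5              pgid, shard 5 of the erasure-coded PG
v 50351'184193       last_update (epoch'version)
local-les=51340      last_epoch_started recorded locally
les/c 51340/51340    history: last_epoch_started / last_epoch_clean
51339/51339/51335    history: same_up_since / same_interval_since / same_primary_since
r=5                  this OSD's rank (shard) in the acting set
lpr=51339            epoch of the last peering reset
pi=3779-51338/279    past intervals covered (range/count)
crt=50351'184191     log position the PG can roll back to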

Logs from osd.247:

2015-08-04 18:34:21.024225 7fa0c06de700 10 osd.247 pg_epoch: 51341 pg[3.8c7s0( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51341/51341/51335) [247,486,146,65,42,43,210,193,408,509,2147483647] r=0 lpr=51341 pi=51339-51340/1 crt=50351'184191 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  querying info from osd.43(5)

It looks like at epoch 51341, osd.247 and osd.43 have different acting/up sets for the PG?
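
One way to check that (essentially what the next comment does): pull the map for that epoch and ask osdmaptool how it places the PG under each copy of the map. Note that 2147483647 (0x7fffffff, CRUSH_ITEM_NONE) in an up/acting set just means no OSD is mapped to that erasure-coded shard.

# fetch the osdmap at epoch 51341 from the monitors
ceph osd getmap 51341 -o osdmap.51341

# show raw/up/acting placement of the PG under that map
osdmaptool --test-map-pg 3.8c7 osdmap.51341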

Actions #6

Updated by Samuel Just over 8 years ago

/home/sam/git-checkouts/ceph2/src/osdmaptool: osdmap file 'osdmap.51341__0_4F1413C9__none__osd247'
parsed '3.8c7' -> 3.8c7
3.8c7 raw ([247,486,146,65,42,43,210,193,408,509,512], p247) up ([247,486,146,65,42,43,210,193,408,509,2147483647], p247) acting ([247,486,146,65,42,43,210,193,408,509,2147483647], p247)
~/Downloads [deepthought?] » ~/git-checkouts/ceph2/src/osdmaptool --test-map-pg 3.8c7 osdmap.51341__0_4F1413C9__none_osd43
/home/sam/git-checkouts/ceph2/src/osdmaptool: osdmap file 'osdmap.51341__0_4F1413C9__none_osd43'
parsed '3.8c7' -> 3.8c7
3.8c7 raw ([247,486,146,65,42,43,210,193,408,509,512], p247) up ([247,486,146,65,42,43,210,193,408,509,2147483647], p247) acting ([247,486,146,65,42,43,210,193,408,509,2147483647], p247)

Actions #7

Updated by Samuel Just over 8 years ago

This may just be related to whatever messed up the osd directory (which seems unlikely to be the osd process).

Actions #8

Updated by Guang Yang over 8 years ago

  • Status changed from New to Can't reproduce

We haven't seen this since the last time we talked with Sam; closing this issue.
