Bug #12581
osd: pg stuck at peering
Status: Closed
Description
Ceph version: v0.87
The PG is stuck at peering even though there are no writes to it. It looks stuck at GetInfo: even though all (live) peers respond, the state machine does not move forward.
PG id: 3.8c7
Log will be attached.
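For a PG stuck like this, `ceph pg 3.8c7 query` shows which peers the primary is still waiting on in its `recovery_state` section. A minimal sketch of pulling that out, assuming the JSON excerpt below (field names mirror typical `pg query` output and are an assumption for this version):

```python
import json

# Hypothetical excerpt of `ceph pg 3.8c7 query` output; the real output has
# many more fields. The GetInfo state lists peers it has queried for info.
pg_query_output = """
{
  "state": "peering",
  "recovery_state": [
    {
      "name": "Started/Primary/Peering/GetInfo",
      "requested_info_from": [{"osd": "43(5)"}]
    }
  ]
}
"""

def stuck_peers(raw):
    """Return the OSDs the primary is still waiting on during GetInfo."""
    doc = json.loads(raw)
    peers = []
    for state in doc.get("recovery_state", []):
        if state.get("name", "").endswith("GetInfo"):
            peers.extend(p["osd"] for p in state.get("requested_info_from", []))
    return peers

print(stuck_peers(pg_query_output))  # → ['43(5)']
```

In this incident the blocking peer was osd.43, as the later updates show.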
Updated by Guang Yang over 8 years ago
Log file - https://www.dropbox.com/sh/g8d0hc0kqwt9yig/AABy9UqOnD7wWDAlpGvJuAC0a?dl=0
(I had problems attaching the log directly to this bug.)
Updated by Guang Yang over 8 years ago
Digging further into the peering log, I found the problem came from osd.43, which did not respond to a GetInfo request in a timely manner.
The bigger problem is that on osd.43, although the daemon is running, a bunch of metadata on the filesystem (including the journal link) appears to have been erased for an unclear reason.
-bash-4.1$ hostname
osd005c201.cos.gq1.yahoo.com
-bash-4.1$ pwd
/home/y/var/lib/ceph/osd/ceph-43
-bash-4.1$ ls
current  superblock
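A filestore OSD data directory of this era normally contains more than just `current` and `superblock` (e.g. `whoami`, `fsid`, and the `journal` symlink). A quick sanity check, sketched here with an assumed list of expected entries (the exact set varies by deployment):

```python
import os

# Entries one would typically expect in a filestore OSD data dir of this era;
# treat this set as an assumption, not an authoritative list.
EXPECTED = {"current", "superblock", "whoami", "fsid", "journal"}

def missing_entries(osd_dir):
    """Return the expected entries that are absent from the OSD data dir."""
    present = set(os.listdir(osd_dir))
    return sorted(EXPECTED - present)

# On osd.43 only `current` and `superblock` remained, so a check like
#   missing_entries("/home/y/var/lib/ceph/osd/ceph-43")
# would report ['fsid', 'journal', 'whoami']
```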
Updated by Guang Yang over 8 years ago
Got some logs from OSD.43:
2015-08-04 18:34:21.036670 7f42a89af700 10 osd.43 pg_epoch: 51340 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_advance_map [247,486,146,65,42,43,210,193,408,509,350]/[247,486,146,65,42,43,210,193,408,509,350] -- 247/247
2015-08-04 18:34:21.036716 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] state<Started>: Started advmap
2015-08-04 18:34:21.036745 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] check_recovery_sources no source osds () went down
2015-08-04 18:34:21.036769 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map
2015-08-04 18:34:21.036787 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] take_waiters
2015-08-04 18:34:21.036804 7f42a89af700 20 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map: Not dirtying info: last_persisted is 51340 while current is 51341
2015-08-04 18:34:21.036824 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_peering_event: epoch_sent: 51341 epoch_requested: 51341 NullEvt
2015-08-04 18:34:21.040673 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_peering_event: epoch_sent: 51341 epoch_requested: 51341 MQuery from 247(0) query_epoch 51341 query: query(info 0'0)
2015-08-04 18:34:22.010150 7f42ae9bb700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] null
2015-08-04 18:34:22.013384 7f42a89af700 10 osd.43 pg_epoch: 51341 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_advance_map [247,486,146,65,42,43,210,193,408,509,350]/[247,486,146,65,42,43,210,193,408,509,350] -- 247/247
2015-08-04 18:34:22.013431 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] state<Started>: Started advmap
2015-08-04 18:34:22.013453 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] check_recovery_sources no source osds () went down
2015-08-04 18:34:22.013468 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map
2015-08-04 18:34:22.013479 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] take_waiters
2015-08-04 18:34:22.013498 7f42a89af700 20 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_activate_map: Not dirtying info: last_persisted is 51340 while current is 51342
2015-08-04 18:34:22.013510 7f42a89af700 10 osd.43 pg_epoch: 51342 pg[3.8c7s5( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51339/51339/51335) [247,486,146,65,42,43,210,193,408,509,350] r=5 lpr=51339 pi=3779-51338/279 luod=0'0 crt=50351'184191 active] handle_peering_event: epoch_sent: 51342 epoch_requested: 51342 NullEvt
Logs from osd.247:
2015-08-04 18:34:21.024225 7fa0c06de700 10 osd.247 pg_epoch: 51341 pg[3.8c7s0( v 50351'184193 (50344'181192,50351'184193] local-les=51340 n=157666 ec=1152 les/c 51340/51340 51341/51341/51335) [247,486,146,65,42,43,210,193,408,509,2147483647] r=0 lpr=51341 pi=51339-51340/1 crt=50351'184191 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>: querying info from osd.43(5)
Looks like at epoch 51341, osd.247 and osd.43 have different acting/up sets for the PG?
Updated by Samuel Just over 8 years ago
/home/sam/git-checkouts/ceph2/src/osdmaptool: osdmap file 'osdmap.51341__0_4F1413C9__none__osd247'
parsed '3.8c7' -> 3.8c7
3.8c7 raw ([247,486,146,65,42,43,210,193,408,509,512], p247) up ([247,486,146,65,42,43,210,193,408,509,2147483647], p247) acting ([247,486,146,65,42,43,210,193,408,509,2147483647], p247)
~/Downloads [deepthought?] » ~/git-checkouts/ceph2/src/osdmaptool --test-map-pg 3.8c7 osdmap.51341__0_4F1413C9__none_osd43
/home/sam/git-checkouts/ceph2/src/osdmaptool: osdmap file 'osdmap.51341__0_4F1413C9__none_osd43'
parsed '3.8c7' -> 3.8c7
3.8c7 raw ([247,486,146,65,42,43,210,193,408,509,512], p247) up ([247,486,146,65,42,43,210,193,408,509,2147483647], p247) acting ([247,486,146,65,42,43,210,193,408,509,2147483647], p247)
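So the two OSDs' copies of epoch 51341 actually agree. The 2147483647 in the up/acting sets is the CRUSH_ITEM_NONE placeholder (0x7fffffff), marking an EC shard that CRUSH could not fill (here shard 10, which the raw mapping assigned to osd.512). A small sketch of how the raw vs. up sets above relate:

```python
# "No OSD" placeholder used in up/acting sets (crush.h: CRUSH_ITEM_NONE)
CRUSH_ITEM_NONE = 0x7fffffff  # 2147483647

# Raw CRUSH result vs. the up set from the osdmaptool output above
raw = [247, 486, 146, 65, 42, 43, 210, 193, 408, 509, 512]
up  = [247, 486, 146, 65, 42, 43, 210, 193, 408, 509, CRUSH_ITEM_NONE]

# EC shards whose OSD is absent (osd.512 was dropped from the up set):
absent_shards = [i for i, osd in enumerate(up) if osd == CRUSH_ITEM_NONE]
print(absent_shards)  # → [10]
```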
Updated by Samuel Just over 8 years ago
This may just be related to whatever messed up the osd directory (which seems unlikely to be the osd process).
Updated by Guang Yang over 8 years ago
- Status changed from New to Can't reproduce
We haven't seen this since last talking with Sam; closing this issue.