Bug #14028
PG stats are not refresh within osd_pg_stat_report_interval_max
0%
Description
OSDs do not seem to refresh their PG stats at least every osd_pg_stat_report_interval_max seconds.
On a test cluster which is idle, no I/O at all:
root@bravo:~# cat /var/log/ceph/ceph-osd.1.log|grep publish_stats_to_osd|tail -n 10 && date 2015-12-09 16:58:27.347286 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.20( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,0,2] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:100 2015-12-09 16:58:27.352735 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.21( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since 2015-12-09 16:58:27.370986 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.23( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since 2015-12-09 16:58:27.380638 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.25( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,0,2] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since 2015-12-09 16:58:27.380816 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.1c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since 2015-12-09 16:58:27.382065 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.20( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,0,2] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since 2015-12-09 16:58:30.896598 7f82469b8700 15 osd.1 pg_epoch: 77 pg[0.34( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean+scrubbing+deep] publish_stats_to_osd 77:101 2015-12-09 16:58:30.904211 7f82491bd700 15 osd.1 pg_epoch: 77 pg[0.34( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:102 2015-12-09 16:58:33.907928 7f82481bb700 15 osd.1 pg_epoch: 77 pg[0.13( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean+scrubbing+deep] publish_stats_to_osd 77:103 2015-12-09 16:58:33.913326 7f82481bb700 15 osd.1 pg_epoch: 77 pg[0.13( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:104 Wed Dec 9 17:10:40 CET 2015 root@bravo:~#
After 12 minutes there have been no PG stat updates.
pg query shows:
"stats": { "version": "0'0", "reported_seq": "97", "reported_epoch": "75", "state": "active+undersized+degraded", "last_fresh": "2015-12-09 16:58:15.469734", "last_change": "2015-12-09 16:58:15.468556", "last_active": "2015-12-09 16:58:15.469734",
Ceph version in this case is running from master:
root@bravo:~# ceph -v ceph version 10.0.0-723-g8ce9302 (8ce9302a975d683c2e6fd21214c96e974b4a71fc) root@bravo:~#
Even after waiting 15 minutes, the PG was not refreshed.
History
#1 Updated by Wido den Hollander over 8 years ago
I just checked the logs again:
2015-12-09 16:58:26.564670 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=75 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 76: no change since 2015-12-09 16:58:26.564855 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 activating] publish_stats_to_osd 77:98 2015-12-09 16:58:26.814815 7f824b1c1700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 activating] publish_stats_to_osd 77: no change since 2015-12-09 16:58:26.938889 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 activating] publish_stats_to_osd 77: no change since 2015-12-09 16:58:27.013986 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 active] publish_stats_to_osd 77:99 2015-12-09 16:58:27.014052 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 active] publish_stats_to_osd 77: no change since 2015-12-09 16:58:27.079788 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:100 2015-12-09 16:58:27.112058 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since 2015-12-09 21:06:00.588905 7f82491bd700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean+scrubbing] publish_stats_to_osd 77:101 2015-12-09 21:06:00.590728 7f82491bd700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:102
This was PG 0.c and not 0.13, but it's on the same OSD.
So there was about 3 hours between the two updates and it seems this only happened when the PG was touched by scrubbing.
The timestamps have also been updated:
"last_fresh": "2015-12-09 21:06:00.590726", "last_change": "2015-12-09 21:06:00.590726", "last_active": "2015-12-09 21:06:00.590726", "last_peered": "2015-12-09 21:06:00.590726", "last_clean": "2015-12-09 21:06:00.590726", "last_became_active": "2015-12-09 16:58:27.013980", "last_became_peered": "2015-12-09 16:58:27.013980", "last_unstale": "2015-12-09 21:06:00.590726", "last_undegraded": "2015-12-09 21:06:00.590726", "last_fullsized": "2015-12-09 21:06:00.590726",
Looking at PG.cc I see that publish_stats_to_osd() is called a couple of times, but only after a scrub or a recovery action. In OSD.cc and ReplicatedPG.cc it is also called, but only during exceptional situations. There doesn't seem to be a routinely base where publish_stats_to_osd() is called.
#2 Updated by Samuel Just over 8 years ago
I'm not sure this is a bug. The OSDs need to send a stat message so the mons don't assume they died, but the PGs don't necessarily update the published stats. Which field are you worried about?
#3 Updated by Wido den Hollander over 8 years ago
Samuel Just wrote:
I'm not sure this is a bug. The OSDs need to send a stat message so the mons don't assume they died, but the PGs don't necessarily update the published stats. Which field are you worried about?
A brief discussion with Sage on the devel list made me think it is a bug. I would expect that at least "last_active" is updated every X minutes.
See: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/28598
#4 Updated by Samuel Just about 8 years ago
- Priority changed from Normal to Urgent
#5 Updated by Sage Weil about 8 years ago
- Status changed from New to Won't Fix
I think I was wrong in that email. As long as the OSD reports in as healthy I don't think its necessary for the OSD to update every PG's timestamp. It does mean that if the pg then goes inactive the time it's been stale might be misleading, but I don't think that's terribly important?
To "fix" this we probably need a bunch of complicated OSD code to resend PGs even though they don't have updated stats.. which doesn't seem worth it to me.