Bug #14028

PG stats are not refreshed within osd_pg_stat_report_interval_max

Added by Wido den Hollander over 8 years ago. Updated about 8 years ago.

Status:
Won't Fix
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OSDs do not seem to refresh their PG stats at least every osd_pg_stat_report_interval_max seconds.

On an idle test cluster, with no I/O at all:

root@bravo:~# cat /var/log/ceph/ceph-osd.1.log|grep publish_stats_to_osd|tail -n 10 && date
2015-12-09 16:58:27.347286 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.20( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,0,2] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:100
2015-12-09 16:58:27.352735 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.21( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since
2015-12-09 16:58:27.370986 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.23( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since
2015-12-09 16:58:27.380638 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.25( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,0,2] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since
2015-12-09 16:58:27.380816 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.1c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since
2015-12-09 16:58:27.382065 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.20( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,0,2] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since
2015-12-09 16:58:30.896598 7f82469b8700 15 osd.1 pg_epoch: 77 pg[0.34( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean+scrubbing+deep] publish_stats_to_osd 77:101
2015-12-09 16:58:30.904211 7f82491bd700 15 osd.1 pg_epoch: 77 pg[0.34( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:102
2015-12-09 16:58:33.907928 7f82481bb700 15 osd.1 pg_epoch: 77 pg[0.13( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean+scrubbing+deep] publish_stats_to_osd 77:103
2015-12-09 16:58:33.913326 7f82481bb700 15 osd.1 pg_epoch: 77 pg[0.13( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:104
Wed Dec  9 17:10:40 CET 2015
root@bravo:~#

After 12 minutes there have been no PG stat updates.
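For reference, the gap can be computed from the timestamps in the log excerpt above and compared against the option's default (assumed here to be 500 seconds; worth double-checking against config_opts.h for this build):

```python
from datetime import datetime

# Timestamps taken from the grep output above.
last_publish = datetime(2015, 12, 9, 16, 58, 33)  # last publish_stats_to_osd line
checked_at = datetime(2015, 12, 9, 17, 10, 40)    # `date` at the end of the command

gap = (checked_at - last_publish).total_seconds()
print(gap)        # 727.0 seconds since the last publish
print(gap > 500)  # True: well past the (assumed) 500 s default of
                  # osd_pg_stat_report_interval_max
```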

pg query shows:

            "stats": {
                "version": "0'0",
                "reported_seq": "97",
                "reported_epoch": "75",
                "state": "active+undersized+degraded",
                "last_fresh": "2015-12-09 16:58:15.469734",
                "last_change": "2015-12-09 16:58:15.468556",
                "last_active": "2015-12-09 16:58:15.469734",

The Ceph version in this case is a build from master:

root@bravo:~# ceph -v
ceph version 10.0.0-723-g8ce9302 (8ce9302a975d683c2e6fd21214c96e974b4a71fc)
root@bravo:~#

Even after waiting 15 minutes, the PG was not refreshed.

History

#1 Updated by Wido den Hollander over 8 years ago

I just checked the logs again:

2015-12-09 16:58:26.564670 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=75 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 76: no change since
2015-12-09 16:58:26.564855 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 activating] publish_stats_to_osd 77:98
2015-12-09 16:58:26.814815 7f824b1c1700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 activating] publish_stats_to_osd 77: no change since
2015-12-09 16:58:26.938889 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 75/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 activating] publish_stats_to_osd 77: no change since
2015-12-09 16:58:27.013986 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 active] publish_stats_to_osd 77:99
2015-12-09 16:58:27.014052 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/75/0 76/76/76) [1,2,0] r=0 lpr=76 pi=72-75/2 crt=0'0 mlcod 0'0 active] publish_stats_to_osd 77: no change since
2015-12-09 16:58:27.079788 7f824a9c0700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:100
2015-12-09 16:58:27.112058 7f8257b4d700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77: no change since
2015-12-09 21:06:00.588905 7f82491bd700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean+scrubbing] publish_stats_to_osd 77:101
2015-12-09 21:06:00.590728 7f82491bd700 15 osd.1 pg_epoch: 77 pg[0.c( empty local-les=77 n=0 ec=1 les/c/f 77/77/0 76/76/76) [1,2,0] r=0 lpr=76 crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 77:102

This was PG 0.c and not 0.13, but it's on the same OSD.

So there were about four hours between the two updates, and it appears the stats were only republished when the PG was touched by scrubbing.

The timestamps have also been updated:

            "last_fresh": "2015-12-09 21:06:00.590726",
            "last_change": "2015-12-09 21:06:00.590726",
            "last_active": "2015-12-09 21:06:00.590726",
            "last_peered": "2015-12-09 21:06:00.590726",
            "last_clean": "2015-12-09 21:06:00.590726",
            "last_became_active": "2015-12-09 16:58:27.013980",
            "last_became_peered": "2015-12-09 16:58:27.013980",
            "last_unstale": "2015-12-09 21:06:00.590726",
            "last_undegraded": "2015-12-09 21:06:00.590726",
            "last_fullsized": "2015-12-09 21:06:00.590726",

Looking at PG.cc I see that publish_stats_to_osd() is called in several places, but only after a scrub or a recovery action. It is also called in OSD.cc and ReplicatedPG.cc, but only in exceptional situations. There does not seem to be any path where publish_stats_to_osd() is called on a routine, periodic basis.
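The behaviour seen in these logs can be modelled with a short sketch (plain Python, purely illustrative, not Ceph code): stats are republished only when the PG actually changes, so an idle PG's last_fresh never advances until something like a scrub touches it.

```python
from datetime import datetime, timedelta

class PGStatModel:
    """Toy model of publish_stats_to_osd(): stats are republished
    only when something about the PG actually changed."""
    def __init__(self):
        self.reported_seq = 0
        self.last_fresh = None

    def publish_stats_to_osd(self, now, changed):
        if not changed:
            # Matches the "no change since" log lines: nothing is sent.
            return False
        self.reported_seq += 1
        self.last_fresh = now
        return True

pg = PGStatModel()
t0 = datetime(2015, 12, 9, 16, 58, 27)
pg.publish_stats_to_osd(t0, changed=True)  # peering -> active+clean

# Periodic callers keep arriving, but with no I/O nothing changed,
# so last_fresh stays stuck at 16:58:27...
for minutes in range(1, 13):
    pg.publish_stats_to_osd(t0 + timedelta(minutes=minutes), changed=False)
assert pg.last_fresh == t0  # 12 minutes later, still not refreshed

# ...until a scrub touches the PG hours later (cf. the 21:06 log lines).
t_scrub = datetime(2015, 12, 9, 21, 6, 0)
pg.publish_stats_to_osd(t_scrub, changed=True)
assert pg.last_fresh == t_scrub
```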

#2 Updated by Samuel Just over 8 years ago

I'm not sure this is a bug. The OSDs need to send a stat message so the mons don't assume they died, but the PGs don't necessarily update the published stats. Which field are you worried about?

#3 Updated by Wido den Hollander over 8 years ago

Samuel Just wrote:

I'm not sure this is a bug. The OSDs need to send a stat message so the mons don't assume they died, but the PGs don't necessarily update the published stats. Which field are you worried about?

A brief discussion with Sage on the devel list made me think it is a bug. I would expect that at least "last_active" is updated every X minutes.

See: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/28598

#4 Updated by Samuel Just about 8 years ago

  • Priority changed from Normal to Urgent

#5 Updated by Sage Weil about 8 years ago

  • Status changed from New to Won't Fix

I think I was wrong in that email. As long as the OSD reports in as healthy, I don't think it's necessary for the OSD to update every PG's timestamp. It does mean that if the PG then goes inactive, the time it has been stale might be misleading, but I don't think that's terribly important?

To "fix" this we would probably need a bunch of complicated OSD code to resend PG stats even when they haven't been updated, which doesn't seem worth it to me.