Bug #37783
scrub health warning
Status: Closed
Description
This message popped up this afternoon: HEALTH_WARN 1280 pgs not scrubbed for 1.61061e+10
It seems rather unlikely. Further details:
[root@ceph1 ceph]# ceph status
cluster:
id: e474ffde-40fd-44c2-aa4d-753dd4dd0761
health: HEALTH_WARN
1280 pgs not scrubbed for 1.61061e+10
services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3
mgr: ceph1(active), standbys: ceph2, ceph3
mds: cephfs-nan-1/1/1 up {0=ceph3=up:active}, 2 up:standby
osd: 36 osds: 36 up, 36 in
data:
pools: 2 pools, 1280 pgs
objects: 43.00 M objects, 28 TiB
usage: 49 TiB used, 279 TiB / 327 TiB avail
pgs: 1280 active+clean
io:
client: 234 MiB/s rd, 113 MiB/s wr, 915 op/s rd, 1.14 kop/s wr
[root@ceph1 ceph]#
[root@ceph1 ~]# date
Thu 3 Jan 15:08:14 GMT 2019
[root@ceph1 ~]# ceph health detail
HEALTH_WARN 1280 pgs not scrubbed for 1.61061e+10
PG_NOT_SCRUBBED 1280 pgs not scrubbed for 1.61061e+10
pg 5.3ff not scrubbed since 2019-01-02 17:09:41.041659
pg 5.3fe not scrubbed since 2019-01-03 13:55:12.059461
pg 5.3fd not scrubbed since 2019-01-02 14:55:46.924332
pg 5.3fc not scrubbed since 2019-01-03 02:17:17.074052
pg 5.3fb not scrubbed since 2019-01-02 10:47:21.080051
pg 5.3fa not scrubbed since 2019-01-03 03:26:22.262786
pg 5.3f9 not scrubbed since 2019-01-03 06:47:43.489866
pg 5.3f8 not scrubbed since 2019-01-02 06:23:15.001033
...
As you can see from the above messages, scrubs are in fact ongoing!
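A quick sanity check on those numbers (a sketch using GNU date and awk, with values copied from the output above): the reported age works out to roughly five centuries, while the oldest listed scrub stamp is barely a day old.

```shell
# Sanity-check the numbers from the health output above.
# "1.61061e+10 seconds" converted to years:
awk 'BEGIN { printf "reported age: %.0f years\n", 1.61061e10 / (86400 * 365.25) }'

# Actual age of the oldest listed scrub stamp (pg 5.3f8) relative to the
# `date` output shown above (both GMT):
t_now=$(date -u -d "2019-01-03 15:08:14" +%s)
t_pg=$(date -u -d "2019-01-02 06:23:15" +%s)
awk -v d=$(( t_now - t_pg )) 'BEGIN { printf "oldest pg: %.1f hours\n", d / 3600 }'
```

So the warning value is off by roughly four orders of magnitude from anything plausible, which points at a bogus timestamp or interval somewhere rather than at real scrub starvation.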
Our cluster includes 3 servers, all running :
[root@ceph1 ~]# uname -a
Linux ceph1.iop.kcl.ac.uk 3.10.0-862.14.4.el7.x86_64 #1 SMP Tue Sep 25 14:32:52 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@ceph1 ~]# cat /etc/redhat-release
Scientific Linux release 7.6 (Nitrogen)
[root@ceph1 ~]# ceph --version
ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
[root@ceph1 ~]#
Rather annoyingly, the yum-cron process updated ceph this morning; this feature has now been disabled. I noticed that all processes were restarted by the update, except for ceph-mon, which was still running. I have restarted those processes, but the error has not cleared.
Chris
Updated by Mykola Golub over 5 years ago
- Status changed from New to Need More Info
What are your settings for mon_warn_not_scrubbed and mon_scrub_interval?
Updated by Chris Webb over 5 years ago
We've not explicitly set these two options, they are set to the default values:
[root@ceph1 ceph]# ceph daemon osd.11 config show | grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.000000",
"mon_scrub_inject_missing_keys": "0.000000",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_not_deep_scrubbed": "0",
"mon_warn_not_scrubbed": "0",
"osd_debug_deep_scrub_sleep": "0.000000",
"osd_deep_scrub_interval": "604800.000000",
"osd_deep_scrub_keys": "1024",
"osd_deep_scrub_large_omap_object_key_threshold": "2000000",
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
"osd_deep_scrub_randomize_ratio": "0.150000",
"osd_deep_scrub_stride": "524288",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_op_queue_mclock_scrub_lim": "0.001000",
"osd_op_queue_mclock_scrub_res": "0.000000",
"osd_op_queue_mclock_scrub_wgt": "1.000000",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.660000",
"osd_scrub_begin_hour": "0",
"osd_scrub_begin_week_day": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "24",
"osd_scrub_end_week_day": "7",
"osd_scrub_interval_randomize_ratio": "0.500000",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.500000",
"osd_scrub_max_interval": "604800.000000",
"osd_scrub_max_preemptions": "5",
"osd_scrub_min_interval": "86400.000000",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.000000",
[root@ceph1 ceph]#
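For readability, the second-based scrub interval defaults in that dump expressed in days (plain arithmetic, not a Ceph command):

```shell
# Convert the scrub interval defaults shown above from seconds to days.
awk 'BEGIN {
    printf "osd_scrub_min_interval:  %g day(s)\n",  86400 / 86400
    printf "osd_scrub_max_interval:  %g day(s)\n", 604800 / 86400
    printf "osd_deep_scrub_interval: %g day(s)\n", 604800 / 86400
}'
```

So with defaults each PG should be scrubbed at least weekly, which matches the per-PG "not scrubbed since" stamps listed earlier, all well under a day or two old.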
Mykola Golub wrote:
What are your settings for mon_warn_not_scrubbed and mon_scrub_interval?
Updated by Mykola Golub over 5 years ago
Chris Webb wrote:
[root@ceph1 ceph]# ceph daemon osd.11 config show | grep scrub
Could you please run this for mon.ceph1?
Updated by Chris Webb over 5 years ago
As requested:
[root@ceph1 ceph]# ceph daemon mon.ceph1 config show | grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.000000",
"mon_scrub_inject_missing_keys": "0.000000",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_not_deep_scrubbed": "0",
"mon_warn_not_scrubbed": "0",
"osd_debug_deep_scrub_sleep": "0.000000",
"osd_deep_scrub_interval": "604800.000000",
"osd_deep_scrub_keys": "1024",
"osd_deep_scrub_large_omap_object_key_threshold": "2000000",
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
"osd_deep_scrub_randomize_ratio": "0.150000",
"osd_deep_scrub_stride": "524288",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_op_queue_mclock_scrub_lim": "0.001000",
"osd_op_queue_mclock_scrub_res": "0.000000",
"osd_op_queue_mclock_scrub_wgt": "1.000000",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.660000",
"osd_scrub_begin_hour": "0",
"osd_scrub_begin_week_day": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "24",
"osd_scrub_end_week_day": "7",
"osd_scrub_interval_randomize_ratio": "0.500000",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.500000",
"osd_scrub_max_interval": "604800.000000",
"osd_scrub_max_preemptions": "5",
"osd_scrub_min_interval": "86400.000000",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.000000",
[root@ceph1 ceph]#
Mykola Golub wrote:
Chris Webb wrote:
[root@ceph1 ceph]# ceph daemon osd.11 config show | grep scrub
Could you please run this for mon.ceph1?
Updated by Mykola Golub over 5 years ago
Chris Webb wrote:
"mon_warn_not_scrubbed": "0",
Interesting, with this set to 0 the health check should not even try to check the PGs' scrub age. As I understand it, this was observed after the cluster upgrade? What was the previous version? What is the output for `ceph tell mon.\* version`?
Updated by Chris Webb over 5 years ago
Mykola Golub wrote:
Chris Webb wrote:
"mon_warn_not_scrubbed": "0",
Interesting, with this set to 0 the health check should not even try to check the PGs' scrub age.
That is also my understanding from the documentation.
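For context, my reading of the option descriptions (a hedged sketch of the documented semantics, not the actual mgr code) is that a PG should only be flagged when mon_warn_not_scrubbed is non-zero and its last scrub is older than osd_scrub_max_interval plus mon_warn_not_scrubbed seconds:

```shell
# Hedged sketch of the documented PG_NOT_SCRUBBED logic (not the mgr source).
mon_warn_not_scrubbed=0        # default, as shown in the dumps above
osd_scrub_max_interval=604800  # default, 7 days
pg_age=117899                  # example: seconds since a PG's last scrub

if [ "$mon_warn_not_scrubbed" -gt 0 ] &&
   [ "$pg_age" -gt $(( osd_scrub_max_interval + mon_warn_not_scrubbed )) ]; then
    echo "PG_NOT_SCRUBBED"
else
    echo "no warning (check disabled or PG scrubbed recently)"
fi
```

Under that reading, with our default of 0 the check should never fire at all, which is why the warning looked doubly wrong.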
As I understand it, this was observed after the cluster upgrade?
Between 8:00 and 8:15 the ceph software was auto-upgraded. All the services were restarted by the upgrade process; however, the mds services (one on each server) failed to restart.
Jan 3 08:04:49 ceph1 ceph-mds: in thread 7f1ed4c9f480 thread_name:ceph-mds
Jan 3 08:04:49 ceph1 ceph-mds: ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
Jan 3 08:04:49 ceph1 ceph-mds: 1: (()+0x3ed0a0) [0x55ea096da0a0]
Jan 3 08:04:49 ceph1 ceph-mds: 2: (()+0xf5d0) [0x7f1ec9b305d0]
Jan 3 08:04:49 ceph1 ceph-mds: 3: (std::string::assign(std::string const&)+0x10) [0x55ea093e05c0]
Jan 3 08:04:49 ceph1 ceph-mds: 4: (Beacon::init(MDSMap const*)+0x53) [0x55ea09424e73]
Jan 3 08:04:49 ceph1 ceph-mds: 5: (MDSDaemon::init()+0x1872) [0x55ea093ea672]
Jan 3 08:04:49 ceph1 ceph-mds: 6: (main()+0xa5f) [0x55ea093d9b4f]
Jan 3 08:04:49 ceph1 ceph-mds: 7: (__libc_start_main()+0xf5) [0x7f1ec8b2a3d5]
Jan 3 08:04:49 ceph1 ceph-mds: 8: (()+0xf10b0) [0x55ea093de0b0]
Jan 3 08:04:49 ceph1 ceph-mds: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 3 08:04:49 ceph1 systemd: ceph-mds@ceph1.service: main process exited, code=killed, status=11/SEGV
Jan 3 08:04:49 ceph1 systemd: Unit ceph-mds@ceph1.service entered failed state.
Jan 3 08:04:49 ceph1 systemd: ceph-mds@ceph1.service failed.
Jan 3 08:04:49 ceph1 systemd: ceph-mds@ceph1.service holdoff time over, scheduling restart.
Jan 3 08:04:49 ceph1 systemd: start request repeated too quickly for ceph-mds@ceph1.service
Jan 3 08:04:49 ceph1 systemd: Unit ceph-mds@ceph1.service entered failed state.
Jan 3 08:04:49 ceph1 systemd: ceph-mds@ceph1.service failed.
Jan 3 08:04:54 ceph1 yum228896: Updated: ceph-mds.x86_64 2:13.2.3-0.el7
We logged into each server and ran a systemctl restart for the process; it restarted without any issues. Everything appeared fine; however, the messages about scrubs appear in the logs from 8:59 onward.
What was the previous version?
We have the http://download.ceph.com/rpm-mimic/el7/x86_64/ repository configured. As we inadvertently had auto-updates on, I cannot be 100% sure, but it would make sense that it was the previous version (13.2.2). The two latest ceph rpms in the repo:
ceph-13.2.2-0.el7.x86_64.rpm 15-Oct-2018 17:17 3024
ceph-13.2.3-0.el7.x86_64.rpm 02-Jan-2019 21:01 3024
What is the output for `ceph tell mon.\* version`?
[root@ceph1 ceph]# ceph tell mon.\* version
mon.ceph1: ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
mon.ceph2: ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
mon.ceph3: ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
[root@ceph1 ceph]#
Chris
Updated by Mykola Golub over 5 years ago
Ah, now it is the mgr that runs the pgmap health check. So could you please verify the ceph-mgr in-memory mon_warn_not_scrubbed and mon_scrub_interval params?
ceph daemon mgr.ceph1 config show | grep scrub
Also it would be good to ensure the mgr daemons were restarted after the upgrade.
Updated by Chris Webb over 5 years ago
The in-memory options do not appear to differ:
[root@ceph1 ceph]# ceph daemon mgr.ceph1 config show | grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.000000",
"mon_scrub_inject_missing_keys": "0.000000",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_not_deep_scrubbed": "0",
"mon_warn_not_scrubbed": "0",
"osd_debug_deep_scrub_sleep": "0.000000",
"osd_deep_scrub_interval": "604800.000000",
"osd_deep_scrub_keys": "1024",
"osd_deep_scrub_large_omap_object_key_threshold": "2000000",
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
"osd_deep_scrub_randomize_ratio": "0.150000",
"osd_deep_scrub_stride": "524288",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_op_queue_mclock_scrub_lim": "0.001000",
"osd_op_queue_mclock_scrub_res": "0.000000",
"osd_op_queue_mclock_scrub_wgt": "1.000000",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.660000",
"osd_scrub_begin_hour": "0",
"osd_scrub_begin_week_day": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "24",
"osd_scrub_end_week_day": "7",
"osd_scrub_interval_randomize_ratio": "0.500000",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.500000",
"osd_scrub_max_interval": "604800.000000",
"osd_scrub_max_preemptions": "5",
"osd_scrub_min_interval": "86400.000000",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.000000",
[root@ceph1 ceph]#
I've checked the restart time of the mgr; it looks like it was restarted early on during the upgrade process.
[root@ceph1 ceph]# systemctl status ceph-mgr@ceph1 -l
● ceph-mgr@ceph1.service - Ceph cluster manager daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2019-01-03 08:04:42 GMT; 1 day 6h ago
Main PID: 230099 (ceph-mgr)
CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@ceph1.service
└─230099 /usr/bin/ceph-mgr -f --cluster ceph --id ceph1 --setuser ceph --setgroup ceph
Jan 03 08:04:42 ceph1.iop.kcl.ac.uk systemd[1]: Started Ceph cluster manager daemon.
Jan 03 08:04:42 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: 2019-01-03 08:04:42.442 7f7c2adb9380 -1 deliberately leaking some memory
Jan 03 08:04:59 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: 2019-01-03 08:04:59.293 7f7c1718c700 -1 received signal: Hangup from (PID: 230676) UID: 0
Jan 03 08:04:59 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: 2019-01-03 08:04:59.314 7f7c1718c700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 230677) UID: 0
Jan 03 11:34:30 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: ignoring --setuser ceph since I am not root
Jan 03 11:34:30 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: ignoring --setgroup ceph since I am not root
Jan 03 11:34:30 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: 2019-01-03 11:34:30.367 7f59613d3380 -1 deliberately leaking some memory
Jan 04 03:31:01 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: 2019-01-04 03:31:01.936 7f594da2f700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 242857) UID: 0
Jan 04 03:31:01 ceph1.iop.kcl.ac.uk ceph-mgr[230099]: 2019-01-04 03:31:01.963 7f594da2f700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 242858) UID: 0
[root@ceph1 ceph]#
I've just restarted it and the warning message has disappeared.
[root@ceph3 log]# ceph status
cluster:
id: e474ffde-40fd-44c2-aa4d-753dd4dd0761
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3
mgr: ceph1(active), standbys: ceph2, ceph3
mds: cephfs-nan-1/1/1 up {0=ceph3=up:active}, 2 up:standby
osd: 36 osds: 36 up, 36 in
data:
pools: 2 pools, 1280 pgs
objects: 45.42 M objects, 29 TiB
usage: 51 TiB used, 276 TiB / 327 TiB avail
pgs: 1279 active+clean
1 active+clean+scrubbing+deep
io:
client: 416 KiB/s rd, 17 MiB/s wr, 26 op/s rd, 109 op/s wr
[root@ceph3 log]#
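For anyone hitting the same thing: only the active mgr needs restarting. A small helper (a sketch; here it parses the pasted status line rather than live `ceph -s` output) to find which daemon that is:

```shell
# Pull the active mgr's name out of a `ceph status` mgr line.
# In real use, feed it `ceph -s` output; here we reuse the line pasted above.
status_line="mgr: ceph1(active), standbys: ceph2, ceph3"
active_mgr=$(printf '%s\n' "$status_line" | sed -n 's/.*mgr: \([^(]*\)(active).*/\1/p')
echo "active mgr: $active_mgr"
# Then, on that host: systemctl restart ceph-mgr@"$active_mgr"
```

On restart, one of the standbys takes over, so the warning clears without client-visible impact.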
Here is the order in which the patches were applied:
Jan 03 08:13:08 Updated: librados2.x86_64 2:13.2.3-0.el7
Jan 03 08:13:08 Updated: python-rados.x86_64 2:13.2.3-0.el7
Jan 03 08:13:08 Updated: librbd1.x86_64 2:13.2.3-0.el7
Jan 03 08:13:08 Updated: libcephfs2.x86_64 2:13.2.3-0.el7
Jan 03 08:13:09 Updated: librgw2.x86_64 2:13.2.3-0.el7
Jan 03 08:13:09 Updated: python-rgw.x86_64 2:13.2.3-0.el7
Jan 03 08:13:09 Updated: python-cephfs.x86_64 2:13.2.3-0.el7
Jan 03 08:13:09 Updated: python-rbd.x86_64 2:13.2.3-0.el7
Jan 03 08:13:09 Updated: libradosstriper1.x86_64 2:13.2.3-0.el7
Jan 03 08:13:11 Updated: ceph-common.x86_64 2:13.2.3-0.el7
Jan 03 08:13:12 Updated: ceph-base.x86_64 2:13.2.3-0.el7
Jan 03 08:13:35 Updated: ceph-selinux.x86_64 2:13.2.3-0.el7
Jan 03 08:13:36 Updated: ceph-mgr.x86_64 2:13.2.3-0.el7
Jan 03 08:13:36 Updated: ceph-mds.x86_64 2:13.2.3-0.el7
Jan 03 08:13:37 Updated: ceph-mon.x86_64 2:13.2.3-0.el7
Jan 03 08:13:38 Updated: ceph-osd.x86_64 2:13.2.3-0.el7
Jan 03 08:13:38 Updated: ceph.x86_64 2:13.2.3-0.el7
Jan 03 08:13:39 Updated: ceph-radosgw.x86_64 2:13.2.3-0.el7
Jan 03 08:57:30 Erased: yum-cron-3.4.3-161.sl7.noarch
Jan 03 15:03:06 Installed: zabbix-release-3.2-1.el7.noarch
Jan 03 15:04:09 Installed: zabbix-agent-3.2.11-1.el7.x86_64
Jan 03 15:04:35 Installed: zabbix-sender-3.2.11-1.el7.x86_64
I guess this wouldn't normally be an issue; I would expect to run an update followed by a server reboot.
Chris
Updated by Sage Weil almost 3 years ago
- Status changed from Need More Info to Closed