Bug #23263
Journaling feature causes cluster to have slow requests and inconsistent PG
Status: closed
Description
I first noticed this problem in our ESXi/iSCSI cluster, but I can now reproduce it in the lab with just Ubuntu:
1. Create an image with the journaling feature (and the exclusive-lock feature it requires) enabled.
2. Map the image, create an XFS filesystem on it, mount it, and write a large file:
rbd-nbd map matte/scuttle2
/dev/nbd0
mkfs.xfs /dev/nbd0
mount -t xfs /dev/nbd0 /srv/exports/sclun69
xfs_io -c "extsize 256M" /srv/exports/sclun69
root@lumd1:/var/log# dd if=/dev/zero of=/srv/exports/sclun69/junk bs=1M count=2800000
2800000+0 records in
2800000+0 records out
2936012800000 bytes (2.9 TB, 2.7 TiB) copied, 35199.2 s, 83.4 MB/s
3. At some point, slow requests begin, and roughly an hour later a deep scrub reports an inconsistent PG:
2018-03-06 22:00:00.000175 mon.lumc1 [INF] overall HEALTH_OK
2018-03-06 22:27:27.945814 mon.lumc1 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-06 22:27:34.406352 mon.lumc1 [WRN] Health check update: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-06 22:27:38.496184 mon.lumc1 [INF] Health check cleared: REQUEST_SLOW (was: 10 slow requests are blocked > 32 sec)
2018-03-06 22:27:38.496215 mon.lumc1 [INF] Cluster is now healthy
2018-03-06 23:00:00.000196 mon.lumc1 [INF] overall HEALTH_OK
2018-03-06 23:29:45.538387 osd.4 [ERR] 12.308 shard 17: soid 12:10dbc229:::rbd_data.39e1022ae8944a.00000000000cd96d:head candidate had a read error
2018-03-06 23:29:56.937346 mon.lumc1 [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-03-06 23:29:56.937415 mon.lumc1 [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-03-06 23:29:54.835693 osd.4 [ERR] 12.308 deep-scrub 0 missing, 1 inconsistent objects
2018-03-06 23:29:54.835703 osd.4 [ERR] 12.308 deep-scrub 1 errors
2018-03-07 00:00:00.000155 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-03-07 01:00:00.000201 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-03-07 02:00:00.000179 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-03-07 03:00:00.000235 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
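To pull the two relevant events out of a longer cluster log, a small filter along the following lines may help. This is only a sketch: the function names `inconsistent_pgs` and `first_slow_request` are made up for illustration, and the field positions assume log lines shaped exactly like the ones quoted above.

```shell
# Sample lines copied from the cluster log in this report.
sample_log='2018-03-06 22:27:27.945814 mon.lumc1 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-06 23:29:45.538387 osd.4 [ERR] 12.308 shard 17: soid 12:10dbc229:::rbd_data.39e1022ae8944a.00000000000cd96d:head candidate had a read error
2018-03-06 23:29:54.835693 osd.4 [ERR] 12.308 deep-scrub 0 missing, 1 inconsistent objects
2018-03-06 23:29:54.835703 osd.4 [ERR] 12.308 deep-scrub 1 errors'

# Hypothetical helper: print the IDs of PGs for which a deep scrub
# reported inconsistent objects. Assumes the PG ID is field 5, as in
# the OSD messages above.
inconsistent_pgs() {
  awk '$4 == "[ERR]" && /deep-scrub/ && /inconsistent objects/ { print $5 }' | sort -u
}

# Hypothetical helper: print the date and time of the first
# slow-request health warning.
first_slow_request() {
  awk '/REQUEST_SLOW/ && /Health check failed/ { print $1, $2; exit }'
}

pgs=$(printf '%s\n' "$sample_log" | inconsistent_pgs)
first_slow=$(printf '%s\n' "$sample_log" | first_slow_request)
echo "inconsistent PGs: $pgs"
echo "first slow request: $first_slow"
```

On a live cluster the same functions could be fed from the cluster log on a monitor host (often /var/log/ceph/ceph.log, though the path depends on the deployment). Running `rados list-inconsistent-obj 12.308` should then show which copy of the object failed the scrub.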
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
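The command used for step 1 is not shown above. A minimal sketch of how such an image could be created follows. The pool and image names come from the `rbd-nbd map` line, but the size is an assumption (the report does not state it; it only has to exceed the 2.9 TB written by dd), and the `run` wrapper is a hypothetical helper that prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

POOL=matte        # pool name as used in the rbd-nbd map command
IMAGE=scuttle2    # image name as used in the rbd-nbd map command
SIZE=4T           # assumption: the actual image size is not given in the report

# journaling depends on exclusive-lock, so both features are requested.
run rbd create --size "$SIZE" \
    --image-feature exclusive-lock \
    --image-feature journaling \
    "$POOL/$IMAGE"

# Confirm which features ended up enabled on the image.
run rbd info "$POOL/$IMAGE"
```

For an image that already exists, `rbd feature enable matte/scuttle2 journaling` should have the same effect, provided exclusive-lock is already enabled.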