Bug #23263 (closed)

Journaling feature causes cluster to have slow requests and inconsistent PG

Added by Alex Gorbachev about 6 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

First noticed this problem in our ESXi/iSCSI cluster, but now I can replicate it in the lab with just Ubuntu:

1. Create an image with the journaling (and the required exclusive-lock) feature
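
The exact create command is not shown above; something along these lines produces such an image (the size is an assumption, chosen to fit the ~2.7 TiB write below):

rbd create matte/scuttle2 --size 4T --image-feature layering,exclusive-lock,journaling
# or, for an existing image:
rbd feature enable matte/scuttle2 exclusive-lock
rbd feature enable matte/scuttle2 journaling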

2. Map and mount the image, make a filesystem, and write a large file to it:

rbd-nbd map matte/scuttle2
/dev/nbd0

mkfs.xfs /dev/nbd0
mount -t xfs /dev/nbd0 /srv/exports/sclun69
xfs_io -c "extsize 256M" /srv/exports/sclun69

root@lumd1:/var/log# dd if=/dev/zero of=/srv/exports/sclun69/junk bs=1M count=2800000
2800000+0 records in
2800000+0 records out
2936012800000 bytes (2.9 TB, 2.7 TiB) copied, 35199.2 s, 83.4 MB/s

3. At some point, slow requests begin; later a deep scrub reports a read error and the PG is flagged inconsistent:

2018-03-06 22:00:00.000175 mon.lumc1 [INF] overall HEALTH_OK
2018-03-06 22:27:27.945814 mon.lumc1 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-06 22:27:34.406352 mon.lumc1 [WRN] Health check update: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-06 22:27:38.496184 mon.lumc1 [INF] Health check cleared: REQUEST_SLOW (was: 10 slow requests are blocked > 32 sec)
2018-03-06 22:27:38.496215 mon.lumc1 [INF] Cluster is now healthy
2018-03-06 23:00:00.000196 mon.lumc1 [INF] overall HEALTH_OK
2018-03-06 23:29:45.538387 osd.4 [ERR] 12.308 shard 17: soid 12:10dbc229:::rbd_data.39e1022ae8944a.00000000000cd96d:head candidate had a read error
2018-03-06 23:29:56.937346 mon.lumc1 [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-03-06 23:29:56.937415 mon.lumc1 [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-03-06 23:29:54.835693 osd.4 [ERR] 12.308 deep-scrub 0 missing, 1 inconsistent objects
2018-03-06 23:29:54.835703 osd.4 [ERR] 12.308 deep-scrub 1 errors
2018-03-07 00:00:00.000155 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-03-07 01:00:00.000201 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-03-07 02:00:00.000179 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-03-07 03:00:00.000235 mon.lumc1 [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
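
For reference, the inconsistency reported above can usually be examined and repaired with the standard commands (a sketch; the PG id 12.308 comes from the log):

ceph health detail
rados list-inconsistent-obj 12.308 --format=json-pretty
ceph pg repair 12.308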


Files

ceph-osd.0.log.2.gz (321 KB) ceph-osd.0.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.1.log.2.gz (322 KB) ceph-osd.1.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.2.log.2.gz (278 KB) ceph-osd.2.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.3.log.2.gz (307 KB) ceph-osd.3.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.5.log.2.gz (168 KB) ceph-osd.5.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.6.log.2.gz (323 KB) ceph-osd.6.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.7.log.2.gz (320 KB) ceph-osd.7.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.8.log.2.gz (293 KB) ceph-osd.8.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.9.log.2.gz (148 KB) ceph-osd.9.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.10.log.2.gz (269 KB) ceph-osd.10.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.11.log.2.gz (163 KB) ceph-osd.11.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.12.log.2.gz (306 KB) ceph-osd.12.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.13.log.2.gz (294 KB) ceph-osd.13.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.15.log.2.gz (257 KB) ceph-osd.15.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.14.log.2.gz (336 KB) ceph-osd.14.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.16.log.2.gz (322 KB) ceph-osd.16.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.17.log.2.gz (289 KB) ceph-osd.17.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.19.log.2.gz (164 KB) ceph-osd.19.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.18.log.2.gz (316 KB) ceph-osd.18.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.20.log.2.gz (262 KB) ceph-osd.20.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.21.log.2.gz (153 KB) ceph-osd.21.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.22.log.2.gz (322 KB) ceph-osd.22.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
ceph-osd.23.log.2.gz (153 KB) ceph-osd.23.log.2.gz Alex Gorbachev, 03/12/2018 06:23 PM
Actions #1

Updated by Patrick Donnelly about 6 years ago

  • Project changed from Ceph to rbd
Actions #2

Updated by Jason Dillaman about 6 years ago

  • Status changed from New to Need More Info

@Alexandre Dupouy: can you please dump out the slow requests from the OSDs to see what object is causing the issue?
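
For reference, the usual way to pull these from a given OSD is through its admin socket, run on the host that carries the OSD, e.g.:

ceph daemon osd.4 dump_ops_in_flight     # ops currently being processed
ceph daemon osd.4 dump_historic_ops      # recently completed ops, including slow ones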

Updated by Alex Gorbachev about 6 years ago

Hi Jason, the investigation is impeded by https://tracker.ceph.com/issues/23205#change-108877: the OSDs are not logging anything that I recognize as a slow request. I have attached all the OSD logs from the lab cluster for the time period shown in the log above.

Actions #6

Updated by Jason Dillaman about 6 years ago

@Alexandre Dupouy: can you please run "ceph daemon osd.<X> dump_blocked_ops"?

Actions #7

Updated by Jason Dillaman about 6 years ago

... and the log from osd.4, the only OSD mentioned in your problem description, is missing.

Actions #8

Updated by Alex Gorbachev about 6 years ago

Hi Jason, all the dump_blocked_ops results are zero. I ran the command in a script against all OSDs; maybe too much time has passed?

"ops": [],
"complaint_time": 30.000000,
"num_blocked_ops": 0

I can't find the osd.4 log, sorry. I'll rerun this test and see if I can capture it.
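
A per-host loop along these lines queries every local OSD for blocked ops (a sketch, assuming the default /var/run/ceph admin socket path):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok | sed 's/^ceph-osd\.//')
    echo "=== osd.$id ==="
    ceph daemon "osd.$id" dump_blocked_ops
done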

Actions #9

Updated by Jason Dillaman about 6 years ago

@Alexandre Dupouy: it might not have been osd.4 that had the blocked ops. "ceph health" should hopefully tell you which specific OSDs are currently encountering slow ops.

Actions #10

Updated by Alex Gorbachev about 6 years ago

Jason, is there a way to trigger a "ceph health" check on detection of a slow operation? I realize this could be a logwatch-type event, but is there anything built in?

So far I have not been able to replicate the problem; rerunning this scenario does not cause the error to occur again.
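
One way the logwatch-type approach mentioned above could look, run periodically from cron on an OSD host (the output path is hypothetical):

if ceph health detail | grep -q REQUEST_SLOW; then
    ts=$(date +%Y%m%d-%H%M%S)
    ceph health detail > /var/log/ceph/slow-requests-$ts.txt
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=$(basename "$sock" .asok | sed 's/^ceph-osd\.//')
        echo "=== osd.$id ===" >> /var/log/ceph/slow-requests-$ts.txt
        ceph daemon "osd.$id" dump_blocked_ops >> /var/log/ceph/slow-requests-$ts.txt
    done
fi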

Actions #11

Updated by Jason Dillaman almost 6 years ago

@Alexandre Dupouy: any update on this issue? Any chance you had a large RBD writeback cache configured (see #23526)?
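
For reference, the client-side cache settings in question live in the [client] section of ceph.conf on the rbd-nbd host; the stock values are shown here for comparison, so anything much larger would count as a "large" writeback cache:

[client]
rbd cache = true
rbd cache size = 33554432                     # 32 MiB default
rbd cache max dirty = 25165824                # 24 MiB default
rbd cache writethrough until flush = true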

Actions #12

Updated by Jason Dillaman about 5 years ago

  • Status changed from Need More Info to Closed

Closing due to a lack of activity
