Bug #20612 (closed): radosgw ceases responding to list requests

Added by Bob Bobington almost 7 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I have 4 OSDs set up with --dmcrypt and --bluestore. Occasionally they crash, likely due to http://tracker.ceph.com/issues/20545.

I'm running radosgw with CivetWeb (the el7 package doesn't seem to have fastcgi enabled) and I've configured a pool with erasure coding:

ceph osd erasure-code-profile set myprofile k=2 m=1 ruleset-failure-domain=osd
ceph osd pool create default.rgw.buckets.data 256 256 erasure myprofile
systemctl start ceph-radosgw@rgw.radosgw.service
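
For reference, the erasure-code profile and the pool's association with it can be double-checked with something like:

ceph osd erasure-code-profile get myprofile
ceph osd pool get default.rgw.buckets.data erasure_code_profile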

I'm running a copy with the open source rclone tool:

rclone copy -v --transfers 4 --stats 10s /place/on/disk/ ceph:bucket
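
The ceph: remote is an S3 remote pointing at radosgw. A minimal rclone.conf sketch would look roughly like this (the remote name, keys, and endpoint are placeholders rather than the actual values in use; 7480 is just the civetweb default port):

[ceph]
type = s3
access_key_id = <rgw access key>
secret_access_key = <rgw secret key>
endpoint = http://<rgw-host>:7480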

After a while, an OSD will always stop responding to queries, resulting in entries like the following turning up in the logs:

2017-07-12 16:23:21.121534 osd.2 osd.2 192.168.122.132:6808/75831 27539 : cluster [WRN] 5022 slow requests, 5 included below; oldest blocked for > 4871.736939 secs
2017-07-12 16:23:21.121544 osd.2 osd.2 192.168.122.132:6808/75831 27540 : cluster [WRN] slow request 960.787066 seconds old, received at 2017-07-12 16:07:20.333473: osd_op(client.14171.0:714555 10.7 10.4322fa9f (undecoded) ondisk+write+known_if_redirected e124) currently queued_for_pg
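
When an OSD gets into this state, the ops it has blocked can be inspected through its admin socket on the OSD host (assuming the socket is available), e.g.:

ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_historic_ops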

After this, I generally end up manually restarting the OSD and then restarting radosgw. When I make subsequent requests, I see the following repeated in the radosgw logs until rclone gives up:

2017-07-12 18:08:44.918031 7f0cc3acb700  1 ====== starting new request req=0x7f0cc3ac55d0 =====
2017-07-12 18:08:45.011691 7f0cc3acb700  0 WARNING: set_req_state_err err_no=36 resorting to 500
2017-07-12 18:08:45.011862 7f0cc3acb700  1 ====== req done req=0x7f0cc3ac55d0 op status=-36 http_status=500 ======
2017-07-12 18:08:45.011941 7f0cc3acb700  1 civetweb: 0x7f0cf84ba000: 192.168.122.1 - - [12/Jul/2017:18:07:57 -0700] "GET /bucket?delimiter=%2F&max-keys=1024&prefix= HTTP/1.1" 1 0 - rclone/v1.36
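
To get more detail than the err_no=36/500 lines above, rgw logging can be turned up in ceph.conf before reproducing. This is only a sketch; the section name is assumed from the rgw.radosgw instance in the systemd unit above:

[client.rgw.radosgw]
debug rgw = 20
debug ms = 1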

The data pool has ~172k objects and the bucket ~13k, so I expect a list request to be somewhat expensive, but that doesn't seem to be the problem here.

I'm able to retrieve objects fine; it seems to be only list requests that have issues.
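
One way to separate the civetweb/S3 frontend from the bucket index itself is to list the bucket directly with radosgw-admin, which goes straight to RADOS:

radosgw-admin bucket stats --bucket=bucket
radosgw-admin bucket list --bucket=bucket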
