Bug #20612 (closed): radosgw ceases responding to list requests

Added by Bob Bobington almost 7 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I have 4 OSDs set up with --dmcrypt and --bluestore. Occasionally they crash, likely due to http://tracker.ceph.com/issues/20545.

I'm running radosgw with CivetWeb (the el7 package doesn't seem to have fastcgi enabled) and I've configured a pool with erasure coding:

ceph osd erasure-code-profile set myprofile k=2 m=1 ruleset-failure-domain=osd
ceph osd pool create default.rgw.buckets.data 256 256 erasure myprofile
systemctl start ceph-radosgw@rgw.radosgw.service
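
For reference, the erasure-code profile and the pool's association with it can be double-checked with something like:

ceph osd erasure-code-profile get myprofile
ceph osd pool get default.rgw.buckets.data erasure_code_profile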

I'm running a copy with the open source rclone tool:

rclone copy -v --transfers 4 --stats 10s /place/on/disk/ ceph:bucket
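
The ceph: remote is an S3 remote pointing at radosgw. A minimal rclone.conf sketch would look roughly like this (the remote name, keys, and endpoint are placeholders rather than the actual values in use; 7480 is just the civetweb default port):

[ceph]
type = s3
access_key_id = <rgw access key>
secret_access_key = <rgw secret key>
endpoint = http://<rgw-host>:7480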

After a while, an OSD will always stop responding to queries, resulting in entries like the following turning up in the logs:

2017-07-12 16:23:21.121534 osd.2 osd.2 192.168.122.132:6808/75831 27539 : cluster [WRN] 5022 slow requests, 5 included below; oldest blocked for > 4871.736939 secs
2017-07-12 16:23:21.121544 osd.2 osd.2 192.168.122.132:6808/75831 27540 : cluster [WRN] slow request 960.787066 seconds old, received at 2017-07-12 16:07:20.333473: osd_op(client.14171.0:714555 10.7 10.4322fa9f (undecoded) ondisk+write+known_if_redirected e124) currently queued_for_pg
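
When an OSD gets into this state, the ops it has blocked can be inspected through its admin socket on the OSD host (assuming the socket is available), e.g.:

ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_historic_ops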

After this, I generally end up manually restarting the OSD and then restarting radosgw. When I make subsequent requests, I see the following repeated in the radosgw logs until rclone gives up:

2017-07-12 18:08:44.918031 7f0cc3acb700  1 ====== starting new request req=0x7f0cc3ac55d0 =====
2017-07-12 18:08:45.011691 7f0cc3acb700  0 WARNING: set_req_state_err err_no=36 resorting to 500
2017-07-12 18:08:45.011862 7f0cc3acb700  1 ====== req done req=0x7f0cc3ac55d0 op status=-36 http_status=500 ======
2017-07-12 18:08:45.011941 7f0cc3acb700  1 civetweb: 0x7f0cf84ba000: 192.168.122.1 - - [12/Jul/2017:18:07:57 -0700] "GET /bucket?delimiter=%2F&max-keys=1024&prefix= HTTP/1.1" 1 0 - rclone/v1.36
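
To get more detail than the err_no=36/500 lines above, rgw logging can be turned up in ceph.conf before reproducing. This is only a sketch; the section name is assumed from the rgw.radosgw instance in the systemd unit above:

[client.rgw.radosgw]
debug rgw = 20
debug ms = 1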

The data pool has ~172k objects and the bucket ~13k, so I expect a list request to be somewhat expensive, but that doesn't seem to be the problem here.

I'm able to retrieve objects fine; it seems to be only list requests that have issues.
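
One way to separate the civetweb/S3 frontend from the bucket index itself is to list the bucket directly with radosgw-admin, which goes straight to RADOS:

radosgw-admin bucket stats --bucket=bucket
radosgw-admin bucket list --bucket=bucket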
