Bug #24645

open

Upload to radosgw fails when there are degraded objects

Added by Michal Cila almost 6 years ago. Updated almost 6 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We use Ceph RadosGW to store and serve millions of small images. Everything works well until recovery is running and there is at least one degraded object in the cluster. After that, uploads (PUT requests) stop working and we get: Client.Timeout exceeded while awaiting headers
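For illustration, this is roughly the kind of request that times out. Our real client uses aws-sdk-nodejs (see the log excerpt below); the sketch here is an equivalent in Python/boto3, and the endpoint and credentials are placeholders:

import boto3

# Placeholder endpoint and credentials; bucket/key match the request seen in the rgw log.
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# While the cluster has a degraded object, this PUT never gets a response and the
# client eventually reports "Client.Timeout exceeded while awaiting headers".
with open('image.jpg', 'rb') as f:
    s3.put_object(Bucket='prod', Key='monitoringprod1', Body=f)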

Our cluster runs on Ubuntu 16.04. We're using Luminous v12.2.5 with BlueStore: 3x OSD+MON nodes (each with 6x 1 TB Samsung SSDs plus block.db on a Kingston V300 480 GB) + 1 RGW node.

Cluster state when upload is not working:

  cluster:
    id:     e2518c9d-0552-4cee-9dad-d176c2f79a8f
    health: HEALTH_WARN
            13250795/68251224 objects misplaced (19.415%)
            Degraded data redundancy: 1/68251224 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum ceph-backup-ssd1,ceph-backup-ssd2,ceph-backup-ssd3
    mgr: ceph-backup-ssd2(active), standbys: ceph-backup-ssd3, ceph-backup-ssd1
    osd: 18 osds: 18 up, 18 in; 368 remapped pgs
    rgw: 4 daemons active

  data:
    pools:   6 pools, 1184 pgs
    objects: 33325k objects, 1025 GB
    usage:   2689 GB used, 14075 GB / 16765 GB avail
    pgs:     1/68251224 objects degraded (0.000%)
             13250795/68251224 objects misplaced (19.415%)
             815 active+clean
             365 active+remapped+backfill_wait
             3   active+remapped+backfilling
             1   active+recovering+degraded

  io:
    client:   638 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 34251 kB/s, 1095 objects/s

health detail:

HEALTH_WARN 13160078/68251224 objects misplaced (19.282%); Degraded data redundancy: 1/68251224 objects degraded (0.000%), 1 pg degraded
OBJECT_MISPLACED 13160078/68251224 objects misplaced (19.282%)
PG_DEGRADED Degraded data redundancy: 1/68251224 objects degraded (0.000%), 1 pg degraded
    pg 14.1 is active+recovering+degraded, acting [14,5]

The affected PG's query output is attached in pg_query.txt.
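For completeness, this is roughly how we check for that condition programmatically; a minimal sketch using the python-rados bindings, assuming /etc/ceph/ceph.conf and a readable client keyring, and the Luminous JSON health format:

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # Ask the monitors for the health report in JSON form.
    ret, outbuf, outs = cluster.mon_command(
        json.dumps({'prefix': 'health', 'format': 'json'}), b'')
    health = json.loads(outbuf)
    # On Luminous the report lists active health checks, e.g. the PG_DEGRADED seen above.
    print('PG_DEGRADED active:', 'PG_DEGRADED' in health.get('checks', {}))
finally:
    cluster.shutdown()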

I checked the radosgw logs with debug enabled and compared a working upload against a non-working one; the only difference is that the following lines appear only when the upload works (a small filtering sketch follows the excerpt):

2018-06-25 09:07:05.599764 7fa8ed156700 20 get_obj_state: rctx=0x7fa8ed14ee70 obj=prod:monitoringprod1 state=0x55bbe07ce208 s->prefetch_data=0
2018-06-25 09:07:05.601909 7fa8ed156700 10 manifest: total_size = 140768
2018-06-25 09:07:05.602082 7fa8ed156700 20 get_obj_state: setting s->obj_tag to d1891381-8dc4-4f9b-bddc-ad2fc3baf791.843179.890853
2018-06-25 09:07:05.602305 7fa8ed156700 20 get_obj_state: rctx=0x7fa8ed14ee70 obj=prod:monitoringprod1 state=0x55bbe07ce208 s->prefetch_data=0
2018-06-25 09:07:05.602916 7fa8ed156700 10 setting object write_tag=d1891381-8dc4-4f9b-bddc-ad2fc3baf791.843179.890854
2018-06-25 09:07:05.610445 7fa8ed156700  2 req 890854:0.055985:s3:PUT /prod/monitoringprod1:put_obj:completing
2018-06-25 09:07:05.610779 7fa8ed156700  2 req 890854:0.056353:s3:PUT /prod/monitoringprod1:put_obj:op status=0
2018-06-25 09:07:05.610895 7fa8ed156700  2 req 890854:0.056468:s3:PUT /prod/monitoringprod1:put_obj:http status=200
2018-06-25 09:07:05.611043 7fa8ed156700  1 ====== req done req=0x7fa8ed1502c0 op status=0 http_status=200 ======
2018-06-25 09:07:05.611302 7fa8ed156700  1 civetweb: 0x55bbd7f41000: 62.240.183.203 - - [25/Jun/2018:09:07:05 +0200] "PUT /prod/monitoringprod1 HTTP/1.1" 200 0 - aws-sdk-nodejs/2.133.0 linux/v4.2.6 callback
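The comparison was done by hand; a small hypothetical helper like the one below (the function name is made up) is enough to list the PUT request ids that never reach these completion lines:

import re
import sys

def incomplete_puts(path):
    """Scan an rgw debug log and return PUT request ids that never completed."""
    seen, completed = set(), set()
    req_re = re.compile(r'req (\d+):[\d.]+:s3:PUT ')
    for line in open(path):
        m = req_re.search(line)
        if not m:
            continue
        req_id = m.group(1)
        seen.add(req_id)
        if 'put_obj:completing' in line or 'http status=200' in line:
            completed.add(req_id)
    return sorted(seen - completed)

if __name__ == '__main__':
    print('\n'.join(incomplete_puts(sys.argv[1])))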

The ceph.conf for the MON+OSD nodes is attached as ceph_conf.txt, and the radosgw ceph.conf is attached as radosgw_ceph_conf.txt.

Thanks for any ideas on what could be wrong!


Files

pg_query.txt (6.32 KB) pg_query.txt Michal Cila, 06/25/2018 07:44 AM
radosgw_ceph_conf.txt (1.9 KB) radosgw_ceph_conf.txt Michal Cila, 06/25/2018 07:50 AM
ceph_conf.txt (1.89 KB) ceph_conf.txt Michal Cila, 06/25/2018 07:50 AM

Related issues 1 (1 open, 0 closed)

Related to rgw - Bug #22072: one object degraded cause all ceph rgw request hang (New, 11/08/2017)

Actions #1

Updated by Abhishek Lekshmanan almost 6 years ago

  • Project changed from rgw to RADOS

When the cluster is in recovery, this is expected: we're waiting for the OSDs to respond.
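If the client needs to ride out that wait instead of timing out, one possible client-side mitigation (a sketch only, assuming a boto3 client; the values are arbitrary) is to raise the read timeout and allow retries:

import boto3
from botocore.config import Config

# Sketch: arbitrary values; a longer read timeout plus retries lets PUTs survive
# the period while requests are blocked behind recovering objects.
cfg = Config(connect_timeout=5, read_timeout=300, retries={'max_attempts': 3})
s3 = boto3.client('s3', endpoint_url='http://rgw.example.com:7480', config=cfg)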

Actions #2

Updated by Nathan Cutler almost 6 years ago

  • Related to Bug #22072: one object degraded cause all ceph rgw request hang added