Bug #24645
Upload to radosgw fails when there are degraded objects
Description
Hi,
we use Ceph RadosGW to store and serve millions of small images. Everything works fine until recovery kicks in and there is at least one degraded object in the cluster; from that point on, uploads (PUT requests) stop working and we get: Client.Timeout exceeded while awaiting headers
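To illustrate the failing request path, here is a rough sketch of the kind of upload our clients do, using the aws-sdk (v2) Node client that shows up in the civetweb log line further down. The endpoint, credentials, and body are placeholders; bucket "prod" and key "monitoringprod1" are taken from our request logs.

import * as AWS from "aws-sdk";

const s3 = new AWS.S3({
  endpoint: "http://our-rgw-node:7480",  // placeholder RGW endpoint
  accessKeyId: "ACCESS_KEY",             // placeholder credentials
  secretAccessKey: "SECRET_KEY",
  s3ForcePathStyle: true,                // address RGW buckets path-style
  httpOptions: { timeout: 10000 },       // response timeout so the hang is visible
});

// One of the many small-image PUTs; with at least one degraded object
// in the cluster, the request never receives response headers and
// fails with a client-side timeout.
s3.putObject(
  { Bucket: "prod", Key: "monitoringprod1", Body: Buffer.from("...") },
  (err, data) => {
    if (err) console.error("PUT failed:", err.code, err.message);
    else console.log("PUT ok, ETag:", data.ETag);
  }
);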
Our cluster runs on Ubuntu 16.04. We're using Luminous v12.2.5 with BlueStore: 3x OSD+MON nodes (each with 6x 1TB Samsung SSDs plus block.db on a Kingston V300 480GB) + 1 RGW node.
Cluster state while uploads are failing:
  cluster:
    id:     e2518c9d-0552-4cee-9dad-d176c2f79a8f
    health: HEALTH_WARN
            13250795/68251224 objects misplaced (19.415%)
            Degraded data redundancy: 1/68251224 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum ceph-backup-ssd1,ceph-backup-ssd2,ceph-backup-ssd3
    mgr: ceph-backup-ssd2(active), standbys: ceph-backup-ssd3, ceph-backup-ssd1
    osd: 18 osds: 18 up, 18 in; 368 remapped pgs
    rgw: 4 daemons active

  data:
    pools:   6 pools, 1184 pgs
    objects: 33325k objects, 1025 GB
    usage:   2689 GB used, 14075 GB / 16765 GB avail
    pgs:     1/68251224 objects degraded (0.000%)
             13250795/68251224 objects misplaced (19.415%)
             815 active+clean
             365 active+remapped+backfill_wait
             3   active+remapped+backfilling
             1   active+recovering+degraded

  io:
    client:   638 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 34251 kB/s, 1095 objects/s
health detail:
HEALTH_WARN 13160078/68251224 objects misplaced (19.282%); Degraded data redundancy: 1/68251224 objects degraded (0.000%), 1 pg degraded
OBJECT_MISPLACED 13160078/68251224 objects misplaced (19.282%)
PG_DEGRADED Degraded data redundancy: 1/68251224 objects degraded (0.000%), 1 pg degraded
    pg 14.1 is active+recovering+degraded, acting [14,5]
The query output for the affected PG (14.1) is attached as pg_query.txt.
I enabled debug logging on radosgw and compared the logs of a successful upload with those of a failing one; the only difference is that the following lines appear only when the upload works:
2018-06-25 09:07:05.599764 7fa8ed156700 20 get_obj_state: rctx=0x7fa8ed14ee70 obj=prod:monitoringprod1 state=0x55bbe07ce208 s->prefetch_data=0
2018-06-25 09:07:05.601909 7fa8ed156700 10 manifest: total_size = 140768
2018-06-25 09:07:05.602082 7fa8ed156700 20 get_obj_state: setting s->obj_tag to d1891381-8dc4-4f9b-bddc-ad2fc3baf791.843179.890853
2018-06-25 09:07:05.602305 7fa8ed156700 20 get_obj_state: rctx=0x7fa8ed14ee70 obj=prod:monitoringprod1 state=0x55bbe07ce208 s->prefetch_data=0
2018-06-25 09:07:05.602916 7fa8ed156700 10 setting object write_tag=d1891381-8dc4-4f9b-bddc-ad2fc3baf791.843179.890854
2018-06-25 09:07:05.610445 7fa8ed156700 2 req 890854:0.055985:s3:PUT /prod/monitoringprod1:put_obj:completing
2018-06-25 09:07:05.610779 7fa8ed156700 2 req 890854:0.056353:s3:PUT /prod/monitoringprod1:put_obj:op status=0
2018-06-25 09:07:05.610895 7fa8ed156700 2 req 890854:0.056468:s3:PUT /prod/monitoringprod1:put_obj:http status=200
2018-06-25 09:07:05.611043 7fa8ed156700 1 ====== req done req=0x7fa8ed1502c0 op status=0 http_status=200 ======
2018-06-25 09:07:05.611302 7fa8ed156700 1 civetweb: 0x55bbd7f41000: 62.240.183.203 - - [25/Jun/2018:09:07:05 +0200] "PUT /prod/monitoringprod1 HTTP/1.1" 200 0 - aws-sdk-nodejs/2.133.0 linux/v4.2.6 callback
The ceph.conf for the MON+OSD nodes is attached as ceph_conf.txt, and radosgw's ceph.conf is attached as radosgw_ceph_conf.txt.
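For completeness, the debug logging mentioned above was turned up with a setting along these lines in radosgw's ceph.conf (the instance name here is just illustrative; the exact values are in the attached configs):

[client.rgw.gateway]   # illustrative RGW instance name
debug rgw = 20         # the level-20 get_obj_state lines above come from this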
Thanks for any ideas about what could be wrong!
Files
pg_query.txt
ceph_conf.txt
radosgw_ceph_conf.txt