Bug #5823
CPU load on a cluster node is very high; client can't get data on PGs from the primary node (CPU high) ...
Status: Closed
Description
ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
env: 3 cluster nodes (10 OSDs per node), with a dedicated private network (IP over InfiniBand)
I hit one big issue on a cluster node: when a single node's CPU load goes to 100% on all 24 cores,
-> all attached block devices become unreachable.
The log files show slow requests and about 100 PGs stuck unclean. I waited a long time, but the system did not recover automatically -> the whole system hangs. I have to take down the cluster node with the high CPU load.
See my logs for more detail:
2013-08-01 08:48:14.269903 osd.5 xx.xx.xx:6805/11563 95 : [WRN] 2 slow requests, 2 included below; oldest blocked for > 85.363643 secs
2013-08-01 08:48:14.269909 osd.5 xx.xx.xx:6805/11563 96 : [WRN] slow request 85.363643 seconds old, received at 2013-08-01 08:46:48.906210: osd_op(client.50425.0:128 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:14.269915 osd.5 xx.xx.xx:6805/11563 97 : [WRN] slow request 85.363507 seconds old, received at 2013-08-01 08:46:48.906346: osd_op(client.50425.0:129 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:16.554671 osd.5 xx.xx.xx:6805/11563 98 : [WRN] 2 slow requests, 2 included below; oldest blocked for > 87.648414 secs
2013-08-01 08:48:16.554679 osd.5 xx.xx.xx:6805/11563 99 : [WRN] slow request 87.648414 seconds old, received at 2013-08-01 08:46:48.906210: osd_op(client.50425.0:128 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:16.554684 osd.5 xx.xx.xx:6805/11563 100 : [WRN] slow request 87.648278 seconds old, received at 2013-08-01 08:46:48.906346: osd_op(client.50425.0:129 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:25.400571 mon.0 xx.xx.xx:6789/0 13039 : [INF] pgmap v781157: 12096 pgs: 4037 active+clean, 3544 peering, 4515 active+degraded; 3380 GB data, 6095 GB used, 8956 GB / 15051 GB avail; 5900B/s wr, 1op/s; 313884/1681762 degraded (18.664%)
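For anyone hitting similar symptoms, a rough diagnostic sketch with standard Ceph CLI commands (these require a live cluster; the OSD id used below is only an example, not taken from this report):

```shell
# Show cluster health and list the PGs stuck in an unclean state
# (run from a node with admin access to the monitors)
ceph health detail
ceph pg dump_stuck unclean

# On the overloaded node, see which ceph-osd daemons are consuming CPU
top -b -n 1 | grep ceph-osd

# If one OSD daemon is the culprit, mark it out so data can
# rebalance to the remaining OSDs (osd id "5" is illustrative)
ceph osd out 5
```

Marking the OSD out rather than killing the whole node keeps the other nine OSDs on that node serving data while recovery proceeds.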