Bug #5823 (closed)

cpu load on cluster node is very high, client can't get data on pg from primary node (cpu high) ...

Added by Khanh Nguyen Dang Quoc over 10 years ago. Updated about 10 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)

env: 3 cluster nodes (10 OSDs/node), using a dedicated network (private network: InfiniBand over IP)

I am hitting one big issue: a single cluster node has very high CPU load (100% on all 24 cores).

-> All attached block devices become unreachable.

Seen in the log files: there are slow requests and about 100 PGs stuck unclean. I waited for a long time, but the system did not repair itself automatically, so the whole system hangs. I had to take down the cluster node with the high CPU load.

See my logs for more detail:

2013-08-01 08:48:14.269903 osd.5 xx.xx.xx:6805/11563 95 : [WRN] 2 slow requests, 2 included below; oldest blocked for > 85.363643 secs
2013-08-01 08:48:14.269909 osd.5 xx.xx.xx:6805/11563 96 : [WRN] slow request 85.363643 seconds old, received at 2013-08-01 08:46:48.906210: osd_op(client.50425.0:128 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:14.269915 osd.5 xx.xx.xx:6805/11563 97 : [WRN] slow request 85.363507 seconds old, received at 2013-08-01 08:46:48.906346: osd_op(client.50425.0:129 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:16.554671 osd.5 xx.xx.xx:6805/11563 98 : [WRN] 2 slow requests, 2 included below; oldest blocked for > 87.648414 secs
2013-08-01 08:48:16.554679 osd.5 xx.xx.xx:6805/11563 99 : [WRN] slow request 87.648414 seconds old, received at 2013-08-01 08:46:48.906210: osd_op(client.50425.0:128 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:16.554684 osd.5 xx.xx.xx:6805/11563 100 : [WRN] slow request 87.648278 seconds old, received at 2013-08-01 08:46:48.906346: osd_op(client.50425.0:129 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:25.400571 mon.0 xx.xx.xx:6789/0 13039 : [INF] pgmap v781157: 12096 pgs: 4037 active+clean, 3544 peering, 4515 active+degraded; 3380 GB data, 6095 GB used, 8956 GB / 15051 GB avail; 5900B/s wr, 1op/s; 313884/1681762 degraded (18.664%)
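
For reference, warnings in the format above can be tallied per OSD with a short script. The following is a minimal sketch (not a Ceph tool; the script and regular expression are illustrative only and simply match the cluster-log lines shown above):

import re
import sys
from collections import Counter

# Matches lines like:
# ... osd.5 xx.xx.xx:6805/11563 96 : [WRN] slow request 85.363643 seconds old, ...
SLOW_RE = re.compile(r"(osd\.\d+) \S+ \d+ : \[WRN\] slow request ([\d.]+) seconds old")

def summarize(path):
    counts = Counter()
    oldest = {}
    with open(path) as log:
        for line in log:
            m = SLOW_RE.search(line)
            if not m:
                continue
            osd, age = m.group(1), float(m.group(2))
            counts[osd] += 1
            oldest[osd] = max(oldest.get(osd, 0.0), age)
    for osd, n in counts.most_common():
        print(f"{osd}: {n} slow-request warnings, oldest blocked for {oldest[osd]:.1f}s")

if __name__ == "__main__":
    summarize(sys.argv[1])

Running this against the cluster log shows which OSDs accumulate the most blocked requests.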


Files

highload_ceph_osd.png (109 KB) highload_ceph_osd.png Khanh Nguyen Dang Quoc, 08/08/2013 07:23 PM
ceph-osd.15.log.zip (234 KB) ceph-osd.15.log.zip Khanh Nguyen Dang Quoc, 09/10/2013 12:49 AM
Actions #1

Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
  • Priority changed from Urgent to High
Actions #2

Updated by Samuel Just over 10 years ago

Which process is causing the load?

Actions #3

Updated by Khanh Nguyen Dang Quoc over 10 years ago

The ceph-osd process is causing the high load.

When I used htop to monitor CPU load, I saw that the %CPU spent in system (kernel) time was very high, and the process state was shown as sleeping.

Because the CPU load on this node was so high, OSDs were sometimes wrongly marked down.

See the attached image for more detail on the workload.
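
As a side note, the per-process view from htop can also be sampled from a script. A minimal sketch, assuming the third-party psutil package is installed (not a Ceph tool, just an illustration of the same measurement):

import time
import psutil

# Find the running ceph-osd processes and sample their CPU usage over one second,
# roughly what htop's per-process %CPU column shows.
procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "ceph-osd"]
for p in procs:
    p.cpu_percent(None)              # prime the per-process counters
time.sleep(1.0)                      # sampling interval
usage = {p.pid: p.cpu_percent(None) for p in procs}
for pid, pct in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"ceph-osd pid {pid}: {pct:.1f}% CPU")

Note that this reports total CPU; separating user time from system time (the "system" portion mentioned above) would need p.cpu_times() instead.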

Actions #4

Updated by Samuel Just over 10 years ago

What kernel are you running?

Actions #5

Updated by Khanh Nguyen Dang Quoc over 10 years ago

I'm running Ubuntu 12.10 (GNU/Linux 3.5.0-25-generic x86_64) on all cluster nodes.

Sometimes I see an OSD log a message like "osd.18 [WRN] map e13601 wrongly marked me down". I don't know why this happens.
Could you please explain that in more detail?

Actions #6

Updated by Khanh Nguyen Dang Quoc over 10 years ago

The slow requests increase frequently...
See the attached file for more detail.

Actions #7

Updated by Samuel Just over 10 years ago

  • Status changed from New to Need More Info

This is only happening on a particular node? Is this still a problem?

Actions #8

Updated by Khanh Nguyen Dang Quoc over 10 years ago

It occurred when there were multiple write requests to the cluster...
I deployed 10 OSDs/node with the default configuration.

Each physical server has 64 GB RAM, an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (24 cores), and SSDs.

Here are the details of my servers:

Number of servers: 3
Host CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Host RAM: 64 GB
Host disks: 20 SAS disks of 300 GB per node, 2 SSDs as journals serving the 10 OSDs per node
Host NIC: 10Gbps (primary network)
Dedicated network: InfiniBand card

With the hardware configuration above, I configured 10 OSDs/node. Is that OK?

And do you have any recommendations for my cluster storage?

Actions #9

Updated by Samuel Just over 10 years ago

I mean, was it always OSDs on a particular node which got marked down?

Actions #10

Updated by Khanh Nguyen Dang Quoc over 10 years ago

I saw "osd is wrongly marked" in the log file, but when I checked, its process was still running...

Actions #11

Updated by Samuel Just about 10 years ago

  • Status changed from Need More Info to Can't reproduce