Bug #5823 (closed)

cpu load on cluster node is very high, client can't get data on pg from primary node (cpu high) ...

Added by Khanh Nguyen Dang Quoc over 10 years ago. Updated about 10 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)

env: 3 cluster nodes (10 OSDs/node), using a dedicated network (private network: InfiniBand over IP)

I am hitting one big issue: a single cluster node has very high CPU load (100% on all 24 cores).

-> All attached block devices become unreachable.

Seen in the log files: there are slow requests and about 100 PGs stuck unclean. I waited for a long time, but the system did not repair itself automatically, so the whole system hangs. I had to take down the cluster node with the high CPU load.

See my logs for more detail:

2013-08-01 08:48:14.269903 osd.5 xx.xx.xx:6805/11563 95 : [WRN] 2 slow requests, 2 included below; oldest blocked for > 85.363643 secs
2013-08-01 08:48:14.269909 osd.5 xx.xx.xx:6805/11563 96 : [WRN] slow request 85.363643 seconds old, received at 2013-08-01 08:46:48.906210: osd_op(client.50425.0:128 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:14.269915 osd.5 xx.xx.xx:6805/11563 97 : [WRN] slow request 85.363507 seconds old, received at 2013-08-01 08:46:48.906346: osd_op(client.50425.0:129 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:16.554671 osd.5 xx.xx.xx:6805/11563 98 : [WRN] 2 slow requests, 2 included below; oldest blocked for > 87.648414 secs
2013-08-01 08:48:16.554679 osd.5 xx.xx.xx:6805/11563 99 : [WRN] slow request 87.648414 seconds old, received at 2013-08-01 08:46:48.906210: osd_op(client.50425.0:128 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:16.554684 osd.5 xx.xx.xx:6805/11563 100 : [WRN] slow request 87.648278 seconds old, received at 2013-08-01 08:46:48.906346: osd_op(client.50425.0:129 rbd_header.c4f62ae8944a [watch add cookie 1 ver 0] 3.1662d8c3 e4048) v4 currently reached pg
2013-08-01 08:48:25.400571 mon.0 xx.xx.xx:6789/0 13039 : [INF] pgmap v781157: 12096 pgs: 4037 active+clean, 3544 peering, 4515 active+degraded; 3380 GB data, 6095 GB used, 8956 GB / 15051 GB avail; 5900B/s wr, 1op/s; 313884/1681762 degraded (18.664%)
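
For reference, warnings in the format above can be tallied per OSD with a short script. The following is a minimal sketch (not a Ceph tool; the script and regular expression are illustrative only and simply match the cluster-log lines shown above):

import re
import sys
from collections import Counter

# Matches lines like:
# ... osd.5 xx.xx.xx:6805/11563 96 : [WRN] slow request 85.363643 seconds old, ...
SLOW_RE = re.compile(r"(osd\.\d+) \S+ \d+ : \[WRN\] slow request ([\d.]+) seconds old")

def summarize(path):
    counts = Counter()
    oldest = {}
    with open(path) as log:
        for line in log:
            m = SLOW_RE.search(line)
            if not m:
                continue
            osd, age = m.group(1), float(m.group(2))
            counts[osd] += 1
            oldest[osd] = max(oldest.get(osd, 0.0), age)
    for osd, n in counts.most_common():
        print(f"{osd}: {n} slow-request warnings, oldest blocked for {oldest[osd]:.1f}s")

if __name__ == "__main__":
    summarize(sys.argv[1])

Running this against the cluster log shows which OSDs accumulate the most blocked requests.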


Files

highload_ceph_osd.png (109 KB) highload_ceph_osd.png Khanh Nguyen Dang Quoc, 08/08/2013 07:23 PM
ceph-osd.15.log.zip (234 KB) ceph-osd.15.log.zip Khanh Nguyen Dang Quoc, 09/10/2013 12:49 AM
Actions #1

Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
  • Priority changed from Urgent to High
Actions #2

Updated by Samuel Just over 10 years ago

Which process is causing the load?

Actions #3

Updated by Khanh Nguyen Dang Quoc over 10 years ago

The ceph-osd process is causing the high load.

When I used htop to monitor CPU load, I saw that the %CPU spent in system (kernel) time was very high, and the process state was shown as sleeping.

Because the CPU load on this node was so high, OSDs were sometimes wrongly marked down.

See the attached image for more detail on the workload.
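
As a side note, the per-process view from htop can also be sampled from a script. A minimal sketch, assuming the third-party psutil package is installed (not a Ceph tool, just an illustration of the same measurement):

import time
import psutil

# Find the running ceph-osd processes and sample their CPU usage over one second,
# roughly what htop's per-process %CPU column shows.
procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "ceph-osd"]
for p in procs:
    p.cpu_percent(None)              # prime the per-process counters
time.sleep(1.0)                      # sampling interval
usage = {p.pid: p.cpu_percent(None) for p in procs}
for pid, pct in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"ceph-osd pid {pid}: {pct:.1f}% CPU")

Note that this reports total CPU; separating user time from system time (the "system" portion mentioned above) would need p.cpu_times() instead.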

Actions #4

Updated by Samuel Just over 10 years ago

What kernel are you running?

Actions #5

Updated by Khanh Nguyen Dang Quoc over 10 years ago

I'm running Ubuntu 12.10 (GNU/Linux 3.5.0-25-generic x86_64) on all cluster nodes.

Sometimes I see an OSD log a message like "osd.18 [WRN] map e13601 wrongly marked me down". I don't know why this happens.
Could you please explain that in more detail?

Actions #6

Updated by Khanh Nguyen Dang Quoc over 10 years ago

The slow requests increase frequently...
See the attached file for more detail.

Actions #7

Updated by Samuel Just over 10 years ago

  • Status changed from New to Need More Info

This is only happening on a particular node? Is this still a problem?

Actions #8

Updated by Khanh Nguyen Dang Quoc over 10 years ago

It occurred when there were multiple write requests to the cluster...
I deployed 10 OSDs/node with the default configuration.

Each physical server has 64 GB RAM, an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (24 cores), and SSDs.

Here are the details of my servers:

Number of servers: 3
Host CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Host RAM: 64 GB
Host disks: 20 SAS disks of 300 GB per node, 2 SSDs as journals serving the 10 OSDs per node
Host NIC: 10Gbps (primary network)
Dedicated network: InfiniBand card

With the hardware configuration above, I configured 10 OSDs/node. Is that OK?

And do you have any recommendations for my cluster storage?

Actions #9

Updated by Samuel Just over 10 years ago

I mean, was it always OSDs on a particular node which got marked down?

Actions #10

Updated by Khanh Nguyen Dang Quoc over 10 years ago

I saw "osd is wrongly marked" in the log file, but when I checked, its process was still running...

Actions #11

Updated by Samuel Just about 10 years ago

  • Status changed from Need More Info to Can't reproduce