Bug #23447

closed

collocated storage: losing one node leads to a 24-second client write freeze

Added by Suvendu Mitra about 6 years ago. Updated about 6 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running Ceph version 12.2.2 with BlueStore, on 3 controller and 3 compute nodes. The Ceph OSDs and Ceph monitors run collocated on the controllers.

A. Test details
1. Launch a Fedora 24 VM and start writing with dd to an attached Cinder volume; in another window, monitor the resulting file size (a minimal command sketch follows this list).
2. Restart one controller (Fri Mar 23 10:41:37 EET 2018) by issuing "sudo reboot -f" from a shell.
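
A minimal sketch of the test commands inside the guest, assuming the attached Cinder volume appears as /dev/vdb (the device name, mount point and dd sizes are my assumptions, not taken from the report):

# inside the Fedora 24 guest; /dev/vdb is the attached Cinder volume (assumed name)
sudo mkfs.xfs /dev/vdb && sudo mkdir -p /mnt/vol && sudo mount /dev/vdb /mnt/vol
# write continuously with direct I/O so stalls are not masked by the page cache
sudo dd if=/dev/zero of=/mnt/vol/testfile bs=1M count=100000 oflag=direct &
# in a second shell, sample the file size once per second
while true; do echo "$(date +%T) $(stat -c %s /mnt/vol/testfile)"; sleep 1; done | tee size_samples.log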

B. Observation
1. The file size stays constant for 14 seconds; after that it grows again.

I have tuned several Ceph parameters, e.g.:
osd heartbeat grace = 3, osd heartbeat interval = 2, mon lease = 1.0, mon election timeout = 2, osd mon heartbeat interval = 10
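
For reference, this is how those values would look in ceph.conf (putting them all under [global] is my own layout; the reporter did not say which sections were used):

[global]
osd heartbeat grace = 3
osd heartbeat interval = 2
mon lease = 1.0
mon election timeout = 2
osd mon heartbeat interval = 10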

My observation is that sometimes the client freezes for less than 3 seconds, but sometimes for more than 20 seconds (a small sketch for measuring the stall length from the size samples follows the list below):

a. 1st try, rebooting controller-2 -> 4-second write freeze
b. 2nd try, rebooting controller-2 -> 2-second write freeze
c. 3rd try, rebooting controller-3 -> 24-second write freeze
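
To quantify the freeze from the per-second size samples (size_samples.log is the hypothetical file produced by the sketch in section A, not one of the attached logs):

# longest run of seconds during which the file size did not change
awk '{ if ($2 == prev) run++; else run = 0; if (run > max) max = run; prev = $2 }
     END { print "longest stall:", max, "seconds" }' size_samples.log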

I have attached the "ceph -w" output and the Ceph logs for all nodes.

[root@controller-1 ~]# ceph -s
cluster:
id: 1c90ccdc-b322-4eb4-80af-7d8bca09206d
health: HEALTH_OK

services:
mon: 3 daemons, quorum controller-1,controller-2,controller-3
mgr: controller-1(active), standbys: controller-2, controller-3
osd: 6 osds: 6 up, 6 in
data:
pools: 4 pools, 736 pgs
objects: 25744 objects, 100 GB
usage: 209 GB used, 2025 GB / 2235 GB avail
pgs: 736 active+clean
io:
client: 5797 B/s wr, 0 op/s rd, 0 op/s wr

[root@controller-1 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.18271 root default
-3 0.72757 host controller-1
0 ssd 0.36378 osd.0 up 1.00000 1.00000
4 ssd 0.36378 osd.4 up 1.00000 1.00000
-5 0.72757 host controller-2
1 ssd 0.36378 osd.1 up 1.00000 1.00000
3 ssd 0.36378 osd.3 up 1.00000 1.00000
-7 0.72757 host controller-3
2 ssd 0.36378 osd.2 up 1.00000 1.00000
5 ssd 0.36378 osd.5 up 1.00000 1.00000
[root@controller-1 ~]# ceph mon stat
e1: 3 mons at {controller-1=192.168.1.23:6789/0,controller-2=192.168.1.24:6789/0,controller-3=192.168.1.25:6789/0}, election epoch 38, leader 0 controller-1, quorum 0,1,2 controller-1,controller-2,controller-3
[root@controller-1 ~]#
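
To correlate the freeze window with cluster events, the time at which the rebooted node's OSDs were reported failed and marked down can be read from the cluster log on a surviving monitor (default log path assumed; the exact message wording varies by release, so the pattern may need adjusting):

[root@controller-1 ~]# grep -E "osd\.[0-9]+ .*(failed|boot|marked)" /var/log/ceph/ceph.log

The attached ceph_w_2.log ("ceph -w" during the test) shows the same cluster log stream live.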


Files

ceph_w_2.log (26.6 KB) - Ceph -w output while the test runs - Suvendu Mitra, 03/23/2018 09:05 AM
ceph_log_controller-1_latest.tar.gz (593 KB) - Controller-1 logs - Suvendu Mitra, 03/23/2018 09:09 AM
ceph_log_controller-2_latest.tar.gz (362 KB) - Controller-2 logs - Suvendu Mitra, 03/23/2018 09:09 AM
ceph_log_controller-3_latest.tar.gz (405 KB) - Controller-3 logs - Suvendu Mitra, 03/23/2018 09:10 AM
#1

Updated by Greg Farnum about 6 years ago

  • Status changed from New to Rejected

Please don't open duplicate tickets because you found a response unsatisfactory.

