Bug #16910 (closed)

cluster takes 30 mins to detect node down and meanwhile rbd is unresponsive

Added by Martin Hoffmann over 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Regression:
No
Severity:
2 - major

Description

We have a 3-node cluster on Ubuntu 14.04 with Jewel 10.2.2:
  • node1 = 7 OSDs, Mon
  • node2 = 7 OSDs, Mon
  • node3 = Mon (no OSDs)

If I shut down one cluster node (e.g. node1), this is immediately reflected in 'ceph status' or 'ceph health'.
However, it takes 30 minutes until 'ceph osd tree' reflects this change and marks the corresponding OSDs as 'down'.
Meanwhile, commands like 'rbd --pool XYZ list' hang while they try to contact the already-shut-down node1:

rbd --pool MyPool list
2016-08-03 16:01:21.792451 7f74908a8700 0 -- 192.168.250.64:0/3690406302 >> 192.168.250.60:6806/2451 pipe(0x7f74b95adb20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f74b95ac810).fault

(192.168.250.60 is the IP of node1, which I intentionally shut down)

So why does it still try to contact the node that has been shut down? 'ceph health' already knows that it is down.

The command hangs until the 30 minutes have passed and then reports the correct pool info.

In 'ceph -w' you can see what is going on:

[...]
2016-08-03 15:57:39.154696 mon.2 [INF] mon.node3 calling new monitor election
2016-08-03 15:57:39.159369 mon.1 [INF] mon.node2 calling new monitor election
2016-08-03 15:57:44.161381 mon.1 [INF] mon.node2@1 won leader election with quorum 1,2
2016-08-03 15:57:44.168341 mon.1 [INF] HEALTH_WARN; 1 mons down, quorum 1,2 node2,node3
2016-08-03 15:57:44.565779 mon.1 [INF] monmap e2: 3 mons at {node1=192.168.250.60:6789/0,node2=192.168.250.62:6789/0,node3=192.168.250.65:6789/0}
2016-08-03 15:57:44.565927 mon.1 [INF] pgmap v318474: 676 pgs: 676 active+clean; 15648 MB data, 32002 MB used, 4246 GB / 4277 GB avail
2016-08-03 15:57:44.566278 mon.1 [INF] fsmap e1:
2016-08-03 15:57:44.566870 mon.1 [INF] osdmap e223: 14 osds: 14 up, 14 in
[...]
2016-08-03 15:58:15.310769 osd.12 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.195356 secs
2016-08-03 15:58:15.310789 osd.12 [WRN] slow request 30.195356 seconds old, received at 2016-08-03 15:57:45.098517: osd_op(client.334483.0:6472 1.5d99f7bd rbd_data.51df12ae8944a.00000000000001f4 [write 1937408~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e223) currently waiting for subops from 11
[...]
2016-08-03 16:24:43.437667 osd.1 [WRN] slow request 480.091726 seconds old, received at 2016-08-03 16:16:43.345860: osd_op(client.334467.0:9645 1.6d1f93fa rbd_data.519a9238e1f29.0000000000000535 [write 548864~8192] snapc 22=[] ack+ondisk+write+known_if_redirected e223) currently waiting for subops from 0
2016-08-03 16:25:24.961764 mon.1 [INF] pgmap v318510: 676 pgs: 676 active+clean; 15648 MB data, 31991 MB used, 4246 GB / 4277 GB avail
2016-08-03 16:25:34.955124 mon.1 [INF] pgmap v318511: 676 pgs: 676 active+clean; 15648 MB data, 31991 MB used, 4246 GB / 4277 GB avail
2016-08-03 16:26:05.328142 mon.1 [INF] pgmap v318512: 676 pgs: 676 active+clean; 15648 MB data, 31991 MB used, 4246 GB / 4277 GB avail
2016-08-03 16:27:48.191196 mon.1 [INF] osd.0 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.191472 mon.1 [INF] osd.2 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.191554 mon.1 [INF] osd.5 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.191625 mon.1 [INF] osd.6 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.191723 mon.1 [INF] osd.7 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.191956 mon.1 [INF] osd.10 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.192174 mon.1 [INF] osd.11 marked down after no pg stats for 900.056937seconds
2016-08-03 16:27:48.268658 mon.1 [INF] osdmap e224: 14 osds: 7 up, 14 in
2016-08-03 16:27:48.273828 mon.1 [INF] pgmap v318513: 676 pgs: 676 active+clean; 15648 MB data, 31991 MB used, 4246 GB / 4277 GB avail
2016-08-03 16:27:49.258680 mon.1 [INF] osdmap e225: 14 osds: 7 up, 14 in
2016-08-03 16:27:49.269408 mon.1 [INF] pgmap v318514: 676 pgs: 318 stale+active+clean, 358 active+clean; 15648 MB data, 31991 MB used, 4246 GB / 4277 GB avail
2016-08-03 16:27:50.307523 mon.1 [INF] pgmap v318515: 676 pgs: 250 active+undersized+degraded, 193 stale+active+clean, 233 active+clean; 15648 MB data, 31993 MB used, 4246 GB / 4277 GB avail; 0 B/s rd, 24175 B/s wr, 2 op/s; 11806/65988 objects degraded (17.891%); 47554 kB/s, 439 objects/s recovering
[...]

As you can see, at 16:27 the monitor marks the OSDs down after 900 s without PG stats (I guess this is mon osd report timeout = 900, but then why does it wait roughly 1800 s?), and after that everything clears up: the OSDs are 'down' and the 'rbd' tool starts working again.
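
To check whether that 900-second value is actually what a surviving monitor is using, the admin socket can be queried on the monitor host, and the value can be overridden at runtime; a sketch, assuming the option is named mon_osd_report_timeout and using the mon name 'node2' from the log above (substitute your own):

ceph daemon mon.node2 config get mon_osd_report_timeout
ceph tell mon.* injectargs '--mon_osd_report_timeout 300'   # runtime only, not persisted across restarts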

Settings from ceph.conf:

[global]
fsid = zzzz
mon_initial_members = node3, node1, node2
mon_host = 192.168.250.65,192.168.250.60,192.168.250.62
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 2
osd_pool_default_min_size = 1
public_network = 192.168.250.0/24
cluster_network = yyy
mon_osd_adjust_heartbeat_grace = false

So from my understanding it should detect a lot earlier that there IS a problem, since 'ceph health' already knows about it a few seconds after the shutdown. And shouldn't the 'rbd' tool try the mons that are UP instead of the first one?

What am I missing here?

Actions #1

Updated by Achim Ledermüller over 7 years ago

Hi,

we have seen the same behavior twice since the upgrade from Hammer to Jewel (10.2.3). After a disk failure, 14 OSDs reported after 56 seconds that osd.32 was down, but the monitors still waited 900 seconds before osd.32 was marked down. The heartbeat settings etc. should be fine.

19:59:14.182205 mon.0 10.1.0.168:6789/0 4359131 : cluster [INF] osd.32 10.1.0.170:6809/10591 failed (14 reporters from different host after 56.711391 >= grace 53.016111)
[...]
20:13:19.335828 mon.0 10.1.0.168:6789/0 4360018 : cluster [INF] osd.32 marked down after no pg stats for 904.832193seconds

OS: ubuntu 14.04
Kernel: 4.4 Ubuntu LTS
Tunables: Firefly (with straw_calc_version=0)
mon_osd_min_down_reporters: 2
osd_heartbeat_interval: 6
osd_heartbeat_grace: 20
mon osd report timeout: 900
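
To confirm that these values are what the running daemons actually use (rather than only what ceph.conf says), the admin sockets can be queried on the respective hosts; a sketch, with daemon names taken from the log above (substitute your actual mon name for '0' if it differs):

ceph daemon osd.32 config get osd_heartbeat_grace
ceph daemon osd.32 config get osd_heartbeat_interval
ceph daemon mon.0 config get mon_osd_min_down_reporters
ceph daemon mon.0 config get mon_osd_report_timeout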

Issue #18104 mentions that the docs at http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#osds-report-down-osds are out of date. Has the OSD down detection changed?

How can I avoid or fix this behavior? Do you need more detailed information?

Regards, Achim

Actions #2

Updated by Alessandro Garbelli almost 7 years ago

Dear Martin,
we were in the same situation (30 minutes to detect a node down, and meanwhile rbd is unresponsive).
We have a 3-node cluster on Debian 8 with Jewel 10.2.2:
node1 = 4 OSDs, Mon
node2 = 4 OSDs, Mon
node3 = Mon (no OSDs).

We noted that after adding an OSD (even a single small 100 GB SSD OSD) to node3, shutting down node1 or node2 makes the system immediately mark the corresponding OSDs as 'down', and rbd stays responsive.
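
A possible reason (my assumption, not confirmed in this ticket): with OSDs on only two hosts, a whole-host outage leaves just one host able to report failures, and if the monitor requires failure reports from at least two distinct hosts it never marks the OSDs down via reports and instead falls back to the 'mon osd report timeout'. Adding an OSD to node3 provides a second reporting host. The ceph.conf options involved, with what I believe are the Jewel defaults (shown for reference, not as a recommendation):

[mon]
mon osd min down reporters = 2
mon osd reporter subtree level = host
mon osd report timeout = 900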

Best regards

Actions #3

Updated by Josh Durgin almost 7 years ago

  • Status changed from New to Closed

The 900-second timeout after a disk failure comes from your 'mon osd report timeout' setting - perhaps the docs for that should be clearer. In Jewel and later, OSDs make a best effort to mark themselves down when they are shut down cleanly, which avoids that timeout.
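
If a host dies abruptly (so its OSDs never get the chance to mark themselves down), the OSDs can also be marked down by hand to unblock clients instead of waiting for the timeout; a sketch using the OSD ids from the log in the description (adjust to your cluster):

ceph osd down 0 2 5 6 7 10 11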
