Bug #10947

closed

ceph osd setcrushmap loops when crushtool lags

Added by Loïc Dachary about 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If crushtool takes longer than the mon lease and blocks the mon, an election is triggered and the command is run again, indefinitely.

mon lease renew interval < mon lease < mon lease ack timeout
zhuzha:~/ceph/ceph/src% PATH=$(pwd):$PATH ./vstart.sh
zhuzha:~/ceph/ceph/src% cat > crushtool
#!/bin/sh
sleep 10
exit 0 # success
^D
zhuzha:~/ceph/ceph/src% ceph osd getcrushmap -o /tmp/map
got crush map from osdmap epoch 18
zhuzha:~/ceph/ceph/src% ceph osd setcrushmap -i /tmp/map

The command hangs forever. The mon log at that time shows the same command being processed again roughly every 10 seconds:

zhuzha:~/ceph/ceph/src% fgrep 'preprocess_query mon_command({"prefix": "osd setcrushmap"}' out/mon.a.log |tail -5
2015-02-25 12:07:17.734768 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:28.158111 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:38.504739 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:49.209307 7f9154604700 10 mon.a@0(leader).osd e24 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:59.577932 7f9154604700 10 mon.a@0(leader).osd e24 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
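
To make the retry period visible, the timestamps of the matching log lines can be turned into intervals. The following is a minimal sketch, not part of the original report: `retry_intervals` is a hypothetical helper name, it assumes the time of day is the second whitespace-separated field (as in the lines above), and it does not handle a rollover past midnight.

```shell
#!/bin/sh
# retry_intervals: read mon log lines on stdin and print the number of
# seconds elapsed between each line and the previous one.
# Hypothetical helper; assumes field 2 looks like HH:MM:SS.micro.
retry_intervals() {
    awk '{
        split($2, t, ":")                      # t[1]=HH, t[2]=MM, t[3]=SS.micro
        now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
        if (NR > 1) printf "%.1f\n", now - prev
        prev = now
    }'
}
```

It can be fed from the same filter used above, for example `fgrep 'preprocess_query mon_command({"prefix": "osd setcrushmap"}' out/mon.a.log | retry_intervals`. Intervals close to the delay in the fake crushtool script would be consistent with the command being rerun after each election.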

Raising the mon lease settings in src/vstart.sh above the crushtool delay fixes the problem:
diff --git a/src/vstart.sh b/src/vstart.sh
index bf863dc..e1440e6 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -358,6 +358,10 @@ if [ "$start_mon" -eq 1 ]; then
         mon osd full ratio = .99
         mon data avail warn = 10
         mon data avail crit = 1
+        mon lease = 20
+        mon lease renew interval = 18
+        mon lease ack timeout = 40
         osd pool default erasure code directory = $EC_PATH
         osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 ruleset-failure-domain=osd
         rgw frontends = fastcgi, civetweb port=$CEPH_RGW_PORT
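
The values in the diff can be sanity-checked against the ordering stated in the description and against the expected crushtool run time. The following is a minimal sketch under stated assumptions: `check_mon_lease` is a hypothetical helper, not a Ceph command, and it only accepts whole seconds because it relies on the shell's integer comparison.

```shell
#!/bin/sh
# check_mon_lease RENEW LEASE ACK_TIMEOUT [TOOL_SECONDS]
# Hypothetical helper (not part of Ceph). All arguments are whole seconds.
# Checks renew interval < lease < ack timeout, and optionally that the
# expected crushtool run time stays below the lease.
check_mon_lease() {
    renew=$1
    lease=$2
    ack=$3
    tool=${4:-0}    # assumed crushtool duration, defaults to 0
    if [ "$renew" -ge "$lease" ] || [ "$lease" -ge "$ack" ]; then
        echo "bad ordering: need renew interval < lease < ack timeout"
        return 1
    fi
    if [ "$tool" -ge "$lease" ]; then
        echo "crushtool delay ${tool}s is not below mon lease ${lease}s"
        return 1
    fi
    echo "ok"
}
```

With the values from the diff and the 10 second sleep from the fake crushtool script, `check_mon_lease 18 20 40 10` prints `ok`. A 25 second delay would be reported as not below the lease.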