Bug #10947
Updated by Loïc Dachary about 9 years ago
If crushtool takes too long and blocks the mon, an election happens and the command is run again, indefinitely.
<pre>
zhuzha:~/ceph/ceph/src% PATH=$(pwd):$PATH ./vstart.sh
zhuzha:~/ceph/ceph/src% cat > crushtool
#!/bin/sh
sleep 10
exit 0 # success
^D
zhuzha:~/ceph/ceph/src% ceph osd getcrushmap -o /tmp/map
got crush map from osdmap epoch 18
zhuzha:~/ceph/ceph/src% ceph osd setcrushmap -i /tmp/map
</pre>
Hangs forever. In the logs at that time:
<pre>
zhuzha:~/ceph/ceph/src% fgrep 'preprocess_query mon_command({"prefix": "osd setcrushmap"}' out/mon.a.log |tail -5
2015-02-25 12:07:17.734768 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:28.158111 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:38.504739 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:49.209307 7f9154604700 10 mon.a@0(leader).osd e24 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
2015-02-25 12:07:59.577932 7f9154604700 10 mon.a@0(leader).osd e24 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924
</pre>
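The timestamps in the log excerpt above are roughly 10 seconds apart, which matches the `sleep 10` in the stub and is consistent with the command being re-run each time the stub returns. A minimal sketch for measuring those gaps follows. `retry_gaps` is a hypothetical helper, not part of Ceph, and it assumes every line starts with a `YYYY-MM-DD HH:MM:SS.ffffff` timestamp and that all lines fall on the same day.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph.
# Reads log lines on stdin and prints the gap in seconds between each
# line and the previous one. Assumes field 2 is HH:MM:SS.ffffff and that
# the lines do not cross midnight.
retry_gaps() {
    LC_ALL=C awk '{
        split($2, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) printf "%.1f\n", now - prev
        prev = now
    }'
}

# Timestamps copied from the log excerpt above; the rest of each line is
# omitted because only the first two fields are used.
retry_gaps <<'EOF'
2015-02-25 12:07:17.734768 preprocess_query
2015-02-25 12:07:28.158111 preprocess_query
2015-02-25 12:07:38.504739 preprocess_query
2015-02-25 12:07:49.209307 preprocess_query
2015-02-25 12:07:59.577932 preprocess_query
EOF
# prints 10.4, 10.3, 10.7, 10.4 (one value per line)
```

The same function can be fed from the `fgrep ... out/mon.a.log` pipeline shown in the report instead of the here-document.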
The following change to vstart.sh fixes the problem:
<pre>
Modified src/vstart.sh
diff --git a/src/vstart.sh b/src/vstart.sh
index bf863dc..e1440e6 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -358,6 +358,10 @@ if [ "$start_mon" -eq 1 ]; then
 mon osd full ratio = .99
 mon data avail warn = 10
 mon data avail crit = 1
+ mon tick interval = 20
+ mon lease = 20
+ mon lease renew interval = 18
+ mon lease ack timeout = 40
 osd pool default erasure code directory = $EC_PATH
 osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 ruleset-failure-domain=osd
 rgw frontends = fastcgi, civetweb port=$CEPH_RGW_PORT
</pre>
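The values in the diff can be checked against how long crushtool actually runs. The sketch below rests on two assumptions the report does not state outright: that the change helps because the 20-second lease now outlasts the 10-second stub, and that the renew interval should stay below the lease. `run_seconds` and `lease_covers` are hypothetical helpers, not part of Ceph or vstart.sh.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Ceph or vstart.sh.

# Run a command and print its wall-clock duration in whole seconds.
run_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage: lease_covers LEASE RENEW RUNTIME  (whole seconds)
# Succeeds when RENEW < LEASE and RUNTIME < LEASE.
# Both conditions are assumptions made for this sketch.
lease_covers() {
    lease=$1 renew=$2 runtime=$3
    [ "$renew" -lt "$lease" ] && [ "$runtime" -lt "$lease" ]
}

# "sleep 1" stands in for the crushtool stub to keep the demo short;
# 20 and 18 are the mon lease and mon lease renew interval from the diff.
runtime=$(run_seconds sleep 1)
if lease_covers 20 18 "$runtime"; then
    echo "lease outlasts the command"
else
    echo "lease too short"
fi
```

With the stub from the reproduction, `run_seconds ./crushtool` would report about 10 seconds, which the check rejects for a lease of 10 seconds or less and accepts for the 20-second lease in the diff.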