Bug #10947

Updated by Loïc Dachary about 9 years ago

If crushtool takes longer than *mon lease* and blocks the mon, an election will happen and the command will run again, indefinitely. The lease settings must satisfy:

     mon lease renew interval < mon lease < mon lease ack timeout 
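
The ordering constraint above can be checked before editing a configuration. The following is only a minimal sketch: the helper name @check_lease_order@ is hypothetical (it is not part of Ceph), it assumes whole-second integer values, and the sample values are the ones from the vstart.sh change in this report.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph: checks that
#   mon lease renew interval < mon lease < mon lease ack timeout
# Arguments are integer seconds: renew interval, lease, ack timeout.
check_lease_order() {
    renew=$1
    lease=$2
    ack=$3
    if [ "$renew" -lt "$lease" ] && [ "$lease" -lt "$ack" ]; then
        echo "ok: $renew < $lease < $ack"
        return 0
    fi
    echo "bad ordering: need renew ($renew) < lease ($lease) < ack timeout ($ack)" >&2
    return 1
}

# Values taken from the vstart.sh change in this report.
check_lease_order 18 20 40
```

The function returns non-zero when the ordering is violated, so it could be used as a guard in a setup script. It does not check that *mon lease* is longer than the time crushtool needs, which is the condition that actually triggers this bug.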

 <pre> 
 zhuzha:~/ceph/ceph/src% PATH=$(pwd):$PATH ./vstart.sh 
 zhuzha:~/ceph/ceph/src% cat > crushtool 
 #!/bin/sh 
 sleep 10 
 exit 0 # success 
 ^D 
 zhuzha:~/ceph/ceph/src% ceph osd getcrushmap -o /tmp/map 
 got crush map from osdmap epoch 18 
 zhuzha:~/ceph/ceph/src% ceph osd setcrushmap -i /tmp/map 
 </pre>

 The command hangs forever. The mon log at that time shows the same command being reprocessed roughly every 10 seconds:

 <pre>
 zhuzha:~/ceph/ceph/src% fgrep 'preprocess_query mon_command({"prefix": "osd setcrushmap"}' out/mon.a.log |tail -5 
 2015-02-25 12:07:17.734768 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924 
 2015-02-25 12:07:28.158111 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924 
 2015-02-25 12:07:38.504739 7f9154604700 10 mon.a@0(leader).osd e23 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924 
 2015-02-25 12:07:49.209307 7f9154604700 10 mon.a@0(leader).osd e24 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924 
 2015-02-25 12:07:59.577932 7f9154604700 10 mon.a@0(leader).osd e24 preprocess_query mon_command({"prefix": "osd setcrushmap"} v 0) v1 from client.14102 172.18.128.29:0/1020924 
 </pre> 
 Raising the lease settings in vstart.sh so that *mon lease* is longer than the time crushtool takes (10 seconds here) fixes the problem:
 <pre> 
 diff --git a/src/vstart.sh b/src/vstart.sh 
 index bf863dc..e1440e6 100755 
 --- a/src/vstart.sh 
 +++ b/src/vstart.sh 
 @@ -358,6 +358,10 @@ if [ "$start_mon" -eq 1 ]; then 
          mon osd full ratio = .99 
          mon data avail warn = 10 
          mon data avail crit = 1 
 +          mon lease = 20 
 +          mon lease renew interval = 18 
 +          mon lease ack timeout = 40 
          osd pool default erasure code directory = $EC_PATH 
          osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 ruleset-failure-domain=osd 
          rgw frontends = fastcgi, civetweb port=$CEPH_RGW_PORT 
 </pre> 
