Deb Barba wrote:
Sage,
With all due respect, I disagree.
I can see how you do not want a single typo in a config file to mess up the entire cluster, but now you are asking someone who might make that typo to repeat it by hand on each node, over and over, across however many nodes there are?
I had just 3 nodes, but one was down for hours because the changes I thought I had made did not get picked up. If I had 100 nodes and had to migrate them, the cluster would be down for weeks.
This would never happen on a 100-node cluster. You wouldn't run the monitors on machines picking up dynamic IPs via DHCP in any production deployment.
In any case, though, this is a manual process for each monitor (i.e., 3-5 nodes).
If you do not think we should have the maps slurp from the config, then what is the config for? It should be the master source for all configuration.
It's for feeding config to daemons at startup. There is a separation between 'configuration' (conf file, daemon behavior) and 'cluster state'. Cluster state is zealously protected by Paxos on the monitors, because if it is disrupted then all consistency bets are off and everything breaks spectacularly (as opposed to daemons just being down).
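To make the separation concrete, here is a minimal sketch of a check that compares monitor addresses written in a conf file against the addresses recorded in cluster state. Everything in it is an assumption for illustration: the conf excerpt, the node names, the addresses, and the `MONMAP` dictionary, which stands in for values copied by hand from the monitor map. It does not talk to a real cluster, and the helper names `conf_mon_addrs` and `find_mismatches` are hypothetical.

```python
import configparser

# Hypothetical conf excerpt; section names, hosts, and addresses are made up.
CONF_TEXT = """
[mon.a]
host = node1
mon addr = 192.168.0.10:6789

[mon.b]
host = node2
mon addr = 192.168.0.11:6789
"""

# Stand-in for the monitor map (cluster state), copied by hand for this sketch.
MONMAP = {"a": "192.168.0.10:6789", "b": "192.168.0.99:6789"}


def conf_mon_addrs(conf_text):
    """Return {monitor name: address} for every [mon.X] section in the conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    addrs = {}
    for section in parser.sections():
        if section.startswith("mon.") and parser.has_option(section, "mon addr"):
            addrs[section[len("mon."):]] = parser.get(section, "mon addr")
    return addrs


def find_mismatches(conf_addrs, monmap_addrs):
    """List (name, conf address, map address) wherever the two sources disagree."""
    names = sorted(set(conf_addrs) | set(monmap_addrs))
    return [
        (name, conf_addrs.get(name), monmap_addrs.get(name))
        for name in names
        if conf_addrs.get(name) != monmap_addrs.get(name)
    ]


mismatches = find_mismatches(conf_mon_addrs(CONF_TEXT), MONMAP)
for name, conf_addr, map_addr in mismatches:
    print(f"mon.{name}: conf says {conf_addr}, monitor map says {map_addr}")
```

The sketch only reports disagreements. It deliberately does not rewrite either side, since editing the conf file does not by itself change what the monitors have agreed on.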
Second, if you are not willing to implement this, then we should add a command that takes the proper new settings and updates the config file and any maps or other locations that need to change. Currently, it is so convoluted to find out what needs changing and why that a customer would become very frustrated and walk away. At this time, too much is left to chance in knowing what needs to be done, and there are too many manual changes that can go very wrong if not done properly. Did you realize that all the OSDs disappeared when the IPs went away? Did you know that the only instructions are for making new ones, not recovering old ones? These are the mistakes I would like us to keep the customer from making.
Agreed. The customer error is in using a dynamic IP for the monitors, though. I think that is the first/most important thing to help them avoid.
Once they are operating vaguely within the range of what will actually work/can be supported/makes sense, we want to build tools and docs to help them with common procedures/problems. If they are outside of what makes sense or what we would even vaguely want to support (dynamic IPs for monitors), then it's not a good use of our time...