Derivative-free Optimization¶
My children, the only true technology is nature. All the other forms of manmade technology are perversions.
—Ralph Bakshi, Wizards
Random Search¶
In this chapter, I would like to pinpoint in mathematical terms the ideas behind the numerical experiments with which this thesis is concerned. The following is far from a rigorous treatment; it may seem pretentious or naive, and probably is. Nonetheless, I think it is helpful to have a mental model of the computations that are going to be performed.
As more thoroughly explained in the previous chapters, we are concerned with the calibration of the GEOtop hydrological model. Calibration means finding the values of the input parameters that yield the best possible overlap between the outputs of a simulation and the experimental data.
“Overlap” is not a precise term, but for the moment let’s suppose that we have a pure function, the cost function, which takes the values of the parameters as arguments, runs a simulation, and returns a real number representing the discrepancy with observations: higher values mean worse overlap, and 0 means perfection. We will consider only continuous parameters, and suppose that one can obtain arbitrarily small differences by evaluating the cost function with “close enough” arguments, which we can imagine as points of the parameter/search space.
Therefore, we can model our cost function as a lower-bounded, continuous function f:D⊆Rn↦[0,+∞), where the search space D is compact, so that (by the extreme value theorem) we know there exist one or more global minima. We should consider an extension of [0,+∞) that includes failing computations, for when the model crashes or fails to converge, something like bottom (⊥) in Haskell. Let’s forget about it, though, and consider this information encoded in the domain D, within which our computation acts like a real mathematical function.
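The setup above can be sketched in code. The following is a minimal Python illustration, not the actual GEOtop calibration machinery: `run_simulation` is a hypothetical stand-in for a model run, and the root-mean-square error is just one possible choice of discrepancy. A failing run is mapped to an infinite loss, playing the role of Haskell’s bottom.

```python
import numpy as np

def make_cost_function(run_simulation, observations):
    """Wrap a simulator into a cost function f: D -> [0, +inf].

    `run_simulation` maps a parameter vector to a series of model
    outputs; `observations` is the measured series.  The discrepancy
    here is the root-mean-square error, which is 0 only for a
    perfect overlap.
    """
    observations = np.asarray(observations, dtype=float)

    def cost(params):
        try:
            simulated = np.asarray(run_simulation(params), dtype=float)
        except RuntimeError:
            # A crashed or non-converging run: report an infinite
            # loss instead of a real number (our stand-in for bottom).
            return float("inf")
        return float(np.sqrt(np.mean((simulated - observations) ** 2)))

    return cost
```

For example, wrapping a toy “simulator” that merely shifts the observations by a constant offset gives a cost equal to that offset.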
A black box optimizer is an iterator that at each step returns an approximate location of the minimum (referred to as a recommendation or candidate), with increasing precision as the iterations go by. In this innocent statement lingers the assumption that there exists only one global minimum, an assumption we will ignore for the moment. In particular, we will consider only randomized algorithms, which return a different sequence of points at each execution.
Therefore, we can define an optimizer as a random process Xi:Ω↦D that gives better and better recommendations, that is, ∀ω∈Ω. f(Xi(ω))≤f(Xi−1(ω)).
The first optimizer we consider is random search, in which we sample the search space uniformly at each step and recommend the best result found so far. Let Ui:Ω↦D be a sequence of iid random variables uniformly distributed on D (which is compact, hence of finite Lebesgue measure); then we can define the random search Xi as follows:

Xi = argminx∈{U1,…,Ui} f(x).
In this way, the sequence of losses Yi=f(Xi) (which are random variables themselves) is point-wise monotonically non-increasing, so it converges point-wise to a random variable Y=limi→∞Yi. It is easy to show that Y=minx∈Df(x) almost surely.
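A minimal sketch of random search on a box-shaped search space, assuming D = [lower, upper]^n. Written as a generator, it yields the recommendation Xi after each sample, so the corresponding losses are non-increasing by construction.

```python
import numpy as np

def random_search(f, lower, upper, budget, rng=None):
    """Minimize f on the box D = [lower, upper]^n by uniform sampling.

    Yields the best candidate found so far (the recommendation X_i),
    so the sequence of losses f(X_i) is monotonically non-increasing.
    """
    rng = np.random.default_rng(rng)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    best_x, best_y = None, float("inf")
    for _ in range(budget):
        u = rng.uniform(lower, upper)   # U_i ~ Uniform(D), iid
        y = f(u)
        if y < best_y:                  # keep the argmin so far
            best_x, best_y = u, y
        yield best_x
```

For instance, `*_, x = random_search(lambda v: float(np.sum(v**2)), [-1, -1], [1, 1], 1000)` exhausts the budget and leaves in `x` a point close to the origin.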
Evolutionary Algorithms¶
Random search does not perform poorly at all in the long run, and, if we could wait forever and sample infinitely many points of the search space, it would be a good option. Hence, when we ask another optimizer X′i to “outperform random search”, what we mean is that, on average, we want smaller losses Y′i after a finite number of steps: E[Y′i]≤E[Yi].
To this end, other optimizers use a more refined strategy. At each step, they sample from random variables U′i whose distribution is inferred from the previous steps, and which usually converge to some a posteriori distribution U′. In this way, these algorithms oversample the region of the search space where, based on their assumptions, it is most probable to find a global minimum, and thus accelerate the convergence of the best candidate X′i. It is crucial to note that while they focus on a particular minimum (global or local), they undersample the rest of the search space.
In other words, if a heuristic algorithm converges, it usually does so rapidly, settling on a minimum and then sitting there, with minimal chances of discovering different minima. Therefore, these algorithms typically have a parameter that can be tweaked to explore the search space more, at the cost of slower convergence: the exploration vs. exploitation trade-off.
Evolutionary algorithm is a broad term applicable to a large collection of heuristic optimization algorithms inspired by biological evolution or other biological processes, as in Particle Swarm Optimization. An evolutionary algorithm starts by drawing a set of random points of the search space, the so-called individuals, with some a priori distribution, and then repeats the process using a modified distribution, which takes into account the losses of the individuals. Each iteration is called a generation, and the individuals of a generation are the population.
In general, the population size behaves as the parameter described above, allowing more or less exploration of the search space. The main difference between one evolutionary algorithm and another is how the random variables from which the individuals of a generation are drawn depend on those of the previous generation. The crucial point is that individuals of the same generation are independent of one another.
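The scheme above can be made concrete with one simple representative of the family, a Gaussian estimation-of-distribution algorithm (this particular variant is chosen here for illustration, not any specific algorithm used in the thesis). Each generation is drawn independently from a Gaussian, and the next generation’s distribution is refitted to the best individuals, oversampling the promising region; the shrinking standard deviation is the exploitation at work.

```python
import numpy as np

def gaussian_eda(f, mean, sigma, budget, population=20, parents=5, rng=None):
    """A minimal Gaussian estimation-of-distribution algorithm.

    Each generation draws `population` independent individuals from a
    Gaussian, then refits the Gaussian to the `parents` best ones, so
    later generations oversample the region with the smallest losses.
    """
    rng = np.random.default_rng(rng)
    mean = np.asarray(mean, dtype=float)
    best_x, best_y = mean, f(mean)
    for _ in range(budget // population):
        # Individuals of the same generation are iid given (mean, sigma).
        pop = rng.normal(mean, sigma, size=(population, mean.size))
        losses = np.array([f(x) for x in pop])
        elite = pop[np.argsort(losses)[:parents]]
        # The next generation's distribution depends on this one only
        # through the elite individuals (their losses decide survival).
        mean = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-12   # shrinking sigma = exploitation
        if losses.min() < best_y:
            best_y = float(losses.min())
            best_x = pop[np.argmin(losses)]
    return best_x, best_y
```

Increasing `population` relative to `parents` shifts the balance toward exploration; shrinking it accelerates convergence to whichever minimum the distribution has locked onto, exactly the trade-off described above.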